# METL-sim 

This notebook deploys Rosetta on the Open Science Grid (OSG) to run the FastRelax protocol. It only supports random creation of variants for the simple FastRelax protocol. If you want a more complex protocol, you will have to follow the steps for setting up `metl-sim` as specified on the [GitHub](https://github.com/gitter-lab/metl-sim) page.

**Sections**
1. Setup (environment setup and Rosetta software download)
2. Generate variants to simulate
3. Upload PDB
4. Submit computational jobs to compute Rosetta scores
5. Extract and postprocess output

### 1) Setup
To run the `metl-sim` framework, run each of the cells below. To run a cell, either press Shift+Enter or press the play button at the top of the screen.

Practice by running the cell below.

![Alt Text](img/jupyter.png)

In [None]:
from utils import *

Once the cell has run, it should show \[1] on the left side (shown below), indicating it ran to completion. While a cell is running it shows \[*]. If a cell hasn't been run yet it shows \[ ]. We recommend running the cells in the order they appear in this notebook. The number in the brackets records the order in which the cells were actually run: this was the first cell run, so it shows \[1].

![Alt Text](img/cell.png) 

If at any time you need to start over because of an error, we recommend clicking `Kernel` in the top panel of the screen and then selecting `Restart Kernel and Clear Outputs of all Cells`.

![Alt Text](img/restart.png) 

The script below checks that the IPython notebook `demo.ipynb` is open in the correct directory, `metl-sim/notebooks`.


In [None]:

expected_folder1 = "metl-sim"
expected_folder2 = "notebooks"

check_last_two_folders(expected_folder1, expected_folder2)
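
For readers curious what such a directory check involves, here is a minimal sketch using only the standard library. The function name `last_two_folders_match` is hypothetical, and the real `check_last_two_folders` in `utils` may behave differently.

```python
from pathlib import Path


def last_two_folders_match(path, parent_name, folder_name):
    """Return True if `path` ends with parent_name/folder_name."""
    parts = Path(path).absolute().parts
    return len(parts) >= 2 and parts[-2:] == (parent_name, folder_name)


# Check the directory the notebook is currently running in.
print(last_two_folders_match(Path.cwd(), "metl-sim", "notebooks"))
```

If this prints `False`, the notebook was most likely opened from a different folder than `metl-sim/notebooks`.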

#### 1a) Setup Conda Environment 

This will download the binaries necessary for running the Python scripts associated with this environment. Run the cell below to download the environment.

**__Note: THIS ONLY NEEDS TO BE RUN ONCE__**

In [None]:
# Example usage:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz"
output_path = "downloads/rosettafy_env_v0.7.11.tar.gz"

download_file(url, output_path)
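
Because the download only needs to happen once, it can help to check whether the archive is already on disk before running the cell again. The sketch below is one way to do that. `needs_download` is a hypothetical helper, not part of `utils`, and `download_file` may or may not perform a similar check itself.

```python
from pathlib import Path


def needs_download(output_path):
    """Return True if the file is missing or empty, so it should be downloaded."""
    path = Path(output_path)
    return not path.is_file() or path.stat().st_size == 0


output_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
if needs_download(output_path):
    print("Archive not found; run download_file(url, output_path).")
else:
    print("Archive already present; the download can be skipped.")
```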

Now we will untar the binary file. 

In [None]:
# Example usage:
file_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
extract_dir = "rosetta_env"

untar_file_with_progress(file_path, extract_dir)
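
As a rough illustration of what untarring involves, the sketch below extracts a `.tar.gz` archive with Python's standard `tarfile` module. The `untar` function is a hypothetical stand-in: it shows no progress bar, and the actual `untar_file_with_progress` in `utils` may work differently.

```python
import tarfile
from pathlib import Path


def untar(file_path, extract_dir):
    """Extract a .tar.gz archive into extract_dir and return the member count."""
    Path(extract_dir).mkdir(parents=True, exist_ok=True)
    with tarfile.open(file_path, "r:gz") as archive:
        members = archive.getmembers()
        # Only extract archives from a source you trust.
        archive.extractall(path=extract_dir)
    return len(members)


file_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
if Path(file_path).is_file():
    print(untar(file_path, "rosetta_env"), "entries extracted")
```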

#### 1b) Setup Rosetta Software

We will need to download and untar the most recent Rosetta version.

To do this, go to the [Rosetta Commons download page](https://downloads.rosettacommons.org/software/academic/) and download Rosetta under the license that applies to you (this link points to the academic downloads).

Then click the download link for Rosetta and select the current version number.


![Alt text](img/rosetta_homepage.png)


On the next page, right-click the Linux binary link and click `Copy Link Address`.


![Alt Text](img/download_homepage.png)


Paste the copied link into the cell that defines `Rosetta_Link_URL`.


**Note**: If you have not already downloaded Rosetta, running the download code should take approximately 10 minutes.

#### Helloworld in Open Science Grid (OSG)

To submit jobs to the Open Science Grid, you only need three functions, all of which are explained in more detail below.

The three functions are:

- `submit_condor_job('job_name', 'job_type')`: submits a job with a unique `job_name`, where `job_type` is either `'helloworld'` or `'relax'`.
- `job_status()`: reports the status of all jobs you have run, and removes failed jobs that are still on Open Science Grid.
- `remove_all_condor_jobs()`: only needs to be run once you are done using this notebook.



First, let's submit a job with the 'helloworld' job_type. This will submit three jobs under one job_name to Open Science Grid. Each job prints `Helloworld` to the console and then completes.

**<span style="color:red">Important</span>**: The parameter `job_name` is unique to each job and must be specified by you. You cannot submit two jobs which have the same name. 
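
One simple way to avoid name collisions is to append a timestamp whenever a name has already been used. The sketch below assumes you keep your own set of previously used names. `unique_job_name` is a hypothetical helper and is not part of `utils`.

```python
from datetime import datetime


def unique_job_name(base, existing_names, now=None):
    """Append a timestamp to `base` if that name has already been used."""
    if base not in existing_names:
        return base
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"{base}_{stamp}"


used = {"hello_world_test"}
print(unique_job_name("hello_world_test", used))
```

The returned string can then be passed as `job_name` when submitting.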

In [1]:
from utils import * 
submit_condor_job(job_name='hello_world_test',job_type='helloworld')

❌ A job named 'hello_world_test' already exists. Please specify a new job name.


To see the output of all currently running jobs, **and** remove all failed jobs, simply check the job status with `job_status()`. You can ignore the `💡 Notice:` output unless you are curious what is happening in the background when removing failed jobs. 

The `helloworld` job will not complete for about 5 minutes. You can close this page during that time. However, when you come back you must always run `from utils import *` before running any of the three functions.

In [2]:
from utils import *
job_status()

💡 Notice: Found failed jobs in log files, checking OSG if jobs exist
💡 Notice: No failed jobs currently on OSG
Status of all submitted jobs
                  Running⌛  Completed✅  Failed❌
hello_world_1            0           1        2
hello_world_test         0           3        0
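
To make the table easier to interpret, the sketch below shows one way to read it: a job name has finished cleanly when nothing is running and nothing has failed. The counts are copied by hand from the table above. This does not assume anything about what `job_status()` returns, and both helper functions are hypothetical.

```python
# Counts copied by hand from the status table shown above.
status = {
    "hello_world_1": {"running": 0, "completed": 1, "failed": 2},
    "hello_world_test": {"running": 0, "completed": 3, "failed": 0},
}


def finished_cleanly(status_table):
    """Return the job names with nothing running and nothing failed."""
    return [
        name
        for name, counts in status_table.items()
        if counts["running"] == 0 and counts["failed"] == 0
    ]


def needs_attention(status_table):
    """Return the job names that have at least one failed job."""
    return [name for name, counts in status_table.items() if counts["failed"] > 0]


print(finished_cleanly(status))  # ['hello_world_test']
print(needs_attention(status))   # ['hello_world_1']
```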


The final function, `remove_all_condor_jobs()`, only needs to be run once you are done running jobs on Open Science Grid. It removes all of your jobs that are currently on Open Science Grid, regardless of whether they are `Running⌛` or `Failed❌`.

**<span style="color:red">Important</span>**:  If you run this command after submitting the above jobs, they will effectively be removed and moved to the `Failed❌` column. 

In [None]:
from utils import * 

# Example usage
remove_all_condor_jobs()

### Running Rosetta Relax on OSG

**Note**: If you have any questions up to this point, or at any later point, we recommend opening a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues); our team is happy to help.

First, upload your PDB file to the folder `metl-sim/pdb_files/raw_pdb_files`.


Then write your PDB file name in the cell below. Here I have specified the example structure `2qmt.pdb`, which is the B1 binding domain of Protein G (GB1).

![alt text](img/pdb.png)


In [4]:
pdb_file_name ='2qmt.pdb'
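
Before submitting anything, it can be worth confirming that the upload actually landed in the expected folder. The sketch below is one way to check. `find_pdb` is a hypothetical helper, and the relative path assumes the notebook is running from `metl-sim/notebooks`.

```python
from pathlib import Path


def find_pdb(pdb_file_name, raw_dir):
    """Return the path to the uploaded PDB file, or None if it is missing."""
    candidate = Path(raw_dir) / pdb_file_name
    if candidate.suffix.lower() == ".pdb" and candidate.is_file():
        return candidate
    return None


# Assumed location of the upload folder relative to metl-sim/notebooks.
raw_dir = Path("..") / "pdb_files" / "raw_pdb_files"
print(find_pdb("2qmt.pdb", raw_dir))
```

A result of `None` means the file is either missing from that folder or does not have a `.pdb` extension.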

In [None]:
## need to have a catch all condor_rm function
## to remove anything that my error cases don't catch 


## what is left? 



#1. rosetta (this just uses rosetta minimal. but requires a tar download command, and a 
#                    untar command. then a call to rosetta minimal. 
#                   I can just leave where the files will be store constant actually so that is easy
###                 haha! wait until this afternoon. 



#2. prepare (this uses clean_pdb but requires the codebase to run) 


# do this first!!! 
#3. generate variants (simplest to implement)
#4. prepare a job. (simple) 
#4. submit a job. (find the file which the same name, and then you are done. 
#5. post process a job. (simple) 


## 7. place all the files into osdf as opposed to on my local machine
#  I'm nervous about this, so let's get help with this...

# just untar the code directory, and place the pdb file and variant list inside
# then retar it up.
# it's excessive and can be optimized, but it's the easiest solution nonetheless...


### Issues? 

If you have issues, please post a GitHub issue.


### Share your results! 



In [None]:
# paste Rosetta link url here: 
Rosetta_Link_URL ="https://downloads.rosettacommons.org/downloads/academic/3.14/rosetta_bin_linux_3.14_bundle.tar.bz2"

run_name = 'Rosetta_Download'


## then if they want to run this job again I say you already have a job of the 
### same name already running, either remove that job, or wait until it finishes
## submit another job... 

# make_rosetta_run
# submit the rosetta run with my version of rosetta specified. 

# check in on the run in question. 


# shell script that downloads rosetta 
# then untar's it 
# activates a conda environment 
# then reduces it using the rosetta minimal script 
# then places those files in an output directory and transfers them back to 
# this location... 

#### so how are you going to do the submission of condor scripts. 
## what do you need? 

## all the code, tar balled up... that is easy ... 
## the conda environment 
## 


## then I need some functions to check condor queue. probably have the
# user specify this run so that they know what they are running. 


# output_path = "downloads/rosetta_bin_linux_3.14_bundle.tar.bz2"

# download_file(Rosetta_Link_URL, output_path)

Now we will untar the file. **Note**: this will take up around 50 GB of space, which is also the base amount allowed by Open Science Grid. You may need to request more space if using future versions of Rosetta.

Untarring this file will take approximately 30 minutes. 

**Do not close the computer during this period!**
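
Since the extracted files need roughly 50 GB, a quick free-space check before starting can save a failed 30-minute run. The sketch below uses the standard library's `shutil.disk_usage`. `has_free_space` is a hypothetical helper, and it reports free space on the filesystem, which may not match the quota enforced on your Open Science Grid account.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# 50 GB is the approximate size quoted above for the extracted Rosetta files.
print(has_free_space(".", 50))
```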

In [None]:
!condor_submit condor/rosetta/
# have 1 script that just does condor submit
# have 1 script check for completion (output of condor_q basically but in a pleasing 
# format, and it looks to see if the downloaded files exist. 

In [None]:
## after I get that working

## then I want to generate variants to simulate in sam's environment of interest,
## but how do I do that when not in the top directory?
## not being in the top directory was a mistake.
# I should change that.

### run scripts of interest as shell scripts. 
## then have them upload pdb and verify that it is uploaded. 



In [None]:
# 1. environment stuff with packing metl-sim
# 2. instead of using github need a different system to tar up the files. 
# 3. figure out the rosetta problem either just ask for more space 
#     I'm honestly not sure, but whatever, just do whatever is easiest. 
#     - just download it yourself and minimalize  
            # check if the prepared version 
            # make sure rosetta exists , don't the curl and untar again 
            # check if the conda check if the environment to see if you can activate 
            # the environment... 
            # tony has a script for activating the conda environment. 
#     - download through conda  
            # 
#     - group file system 
#     - redistribute rosetta on a public server and deal with the legal headache of that. 


# metl-sim github issues --- tell them how to get help ...

# I'm really starting to like just download yourself and minimalize... 
# 4. UC 



# I think it will be better just to do an scp of the
#  finished environments to this directory... 

# plan 0 is checking to see if you should submit 
# plan A needs to be release... 
# bank that into the system because we know expect a certain failure rate
# all singles and doubles, not the the best simulation workflow. 
# send drafts for jon to look at... 

# whether the instructions make sense and show him the notebook. first round feedback next 
# week. 

# then you're basically done, add in some cosmetic stuff that I don't 
# even think is that big of a deal, but then you're home free. 

# 1. should look to see if the conda binaries are already there; if they are then you're good, if not then download the files again. 
# 2. then download rosetta making sure to download linux 
      #  etc.  
# 3. friendly suggestion to share simulations , those are valuable , here is how we 
#     considering uploading (zenodo) and open a git issue if you ever do that. 
#     not requirement; you made something useful... at least for us... maybe for other people
#     files you should upload and here is how you should do that.  
#     cell and then it goes to bryce's OSG storage. 

In [None]:
('http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.aa',
 'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.ab',
 'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.ac',
  'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz')