# METL-Sim in Open Science Grid

This notebook deploys the Rosetta framework on the Open Science Grid (OSG) to run the FastRelax protocol from the Rosetta software. It supports random generation of millions of variants to run through FastRelax on OSG. The interactive functions in this notebook do not represent the full power of the `metl-sim` repository; if you wish to customize your simulation or variants, please refer to the metl-sim [GitHub](https://github.com/gitter-lab/metl-sim) README.

**Sections**
1. Introduction to Jupyter notebooks and OSG functions (do not skip!)
2. Environment setup and Rosetta software download
3. Running Rosetta FastRelax on OSG

**Note**: If you have any questions at any point, we recommend opening a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues); our team is happy to help.


### 1) Introduction to Jupyter notebooks and OSG functions 


#### Introduction to Jupyter notebooks

To run the `metl-sim` framework, run each of the cells below in turn. To run a cell, either press Shift+Enter or press the play button at the top of the screen.

Practice by running the cell below.

![Alt Text](img/jupyter.png)


In [None]:
from utils import *

Once the cell has run, it should show \[1] on the left side (shown below), indicating it ran to completion. While a cell is running it shows \[*]; a cell that has not been run yet shows \[ ]. We recommend running the cells in the order they appear in this notebook (apart from the cells in the `Setup` portion, which only need to be run once). The number in the brackets records the order in which cells were actually run: the cell above was the first cell run, so it shows \[1].

![Alt Text](img/cell.png) 

If at any time you need to start over because of an error, we recommend clicking `Kernel` in the top panel of the screen and selecting `Restart Kernel and Clear Outputs of all Cells`.

![Alt Text](img/restart.png) 

The cell below checks that the IPython notebook `demo.ipynb` is open in the correct directory, `metl-sim/notebooks`.


In [16]:
expected_folder1 = "metl-sim"
expected_folder2 = "notebooks"

check_last_two_folders(expected_folder1, expected_folder2)

✅ Success: You are in 'metl-sim/notebooks'.
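
For readers curious what a directory check like this involves, here is a minimal sketch using `pathlib`. It is not the `check_last_two_folders` implementation from `utils`; the function name `ends_with_folders` and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def ends_with_folders(path, parent, child):
    """Return True if `path` ends with the two folders `parent/child`."""
    parts = PurePosixPath(path).parts
    return len(parts) >= 2 and parts[-2:] == (parent, child)


# Illustrative path only; in the notebook you would pass os.getcwd() instead.
print(ends_with_folders("/home/user/metl-sim/notebooks", "metl-sim", "notebooks"))
```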


#### Helloworld in Open Science Grid (OSG)

To submit jobs to the Open Science Grid, you only need three functions, each of which is explained in more detail below.

The three functions are: 

- `submit_condor_job('job_name', 'job_type')`: submits a job with a unique `'job_name'` and a `'job_type'`, which is one of `'helloworld'`, `'rosetta_download'`, or `'relax'`.
- `job_status()`: reports the status of all jobs you have run, and also removes failed jobs if they are currently on the Open Science Grid.
- `remove_all_condor_jobs()`: removes all running and failed jobs. Only run this after you are done using this notebook.



First, let's submit a job with the `'helloworld'` job type. This submits three jobs under one `job_name` to the Open Science Grid. Each job prints `Helloworld` to the console and then completes.

**<span style="color:red">Important</span>**: The parameter `job_name` must be unique to each submission and must be specified by you. You cannot submit two jobs with the same name.


In [None]:
from utils import * 
submit_condor_job(job_name='hello_world_test5',job_type='helloworld')
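
Because every `job_name` must be unique, it can help to derive the next free name from the names you have already used. The sketch below shows one way to do this; the helper `next_free_job_name` is hypothetical and not part of `utils`, and it assumes you keep track of the used names yourself (for example, from the folder names under `condor/`, which is where the outputs in this notebook suggest job directories are created).

```python
def next_free_job_name(prefix, used_names):
    """Return the first name of the form `<prefix><n>` that is not in `used_names`."""
    n = 1
    while f"{prefix}{n}" in used_names:
        n += 1
    return f"{prefix}{n}"


# Hypothetical set of names that were already submitted.
used = {"hello_world_test1", "hello_world_test2"}
print(next_free_job_name("hello_world_test", used))
```

The returned name can then be passed as `job_name` to `submit_condor_job`.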

To see the output of all currently running jobs, **and** remove all failed jobs, simply check the job status with `job_status()`. You can ignore the `💡 Notice:` output unless you are curious what is happening in the background when removing failed jobs. 

The `helloworld` job will not complete for about 5 minutes. You can close this page during that time. However, when you come back, you must always run `from utils import *` before running any of the three functions.

In [None]:
from utils import *
job_status()

The final function, `remove_all_condor_jobs()`, only needs to be run once you are done running jobs on the Open Science Grid. It removes all of your jobs that are currently on the Open Science Grid, regardless of whether they are `Running⌛` or `Failed❌`.

**<span style="color:red">Important</span>**: If you run this command after submitting the above jobs, they will be removed and moved to the `Failed❌` column.

In [None]:
from utils import * 
# Example usage (uncomment if you wish to run)
# remove_all_condor_jobs()

### 2) Environment Setup and Rosetta Software Download 

#### Environment Setup 

This step downloads the binaries needed to run the Python scripts associated with this environment. Run the cell below to download the environment.

**Note: Run the four cells below only once!**

In [None]:
# Example usage:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz"
output_path = "downloads/rosettafy_env_v0.7.11.tar.gz"

download_file(url, output_path,'curl')

Now we will untar the downloaded archive.

In [None]:
file_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
extract_dir = "rosetta_env"

untar_file_with_progress(file_path, extract_dir)
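
For reference, extracting a tar archive can be done with the Python standard library alone. The sketch below is not the `untar_file_with_progress` implementation; `untar` is a hypothetical helper that extracts each member in turn and returns how many members it processed, which is the kind of count a progress bar would be driven by.

```python
import tarfile
from pathlib import Path


def untar(archive_path, extract_dir):
    """Extract every member of a tar archive and return the member count."""
    Path(extract_dir).mkdir(parents=True, exist_ok=True)
    count = 0
    # "r:*" lets tarfile detect the compression (gzip, bzip2, ...) automatically.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            archive.extract(member, path=extract_dir)
            count += 1
    return count
```

With the paths used in the cell above, this would be called as `untar("downloads/rosettafy_env_v0.7.11.tar.gz", "rosetta_env")`. Only extract archives from sources you trust.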

In [None]:
# Download the clean_pdb archive
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/clean_pdb.tar.gz"
output_path = "downloads/clean_pdb.tar.gz"

download_file(url, output_path,'curl')

In [None]:
# Example usage:
file_path = "downloads/clean_pdb.tar.gz"
extract_dir = "clean_pdb"

untar_file_with_progress(file_path, extract_dir)

#### Downloading Rosetta

Downloading Rosetta is non-trivial because of the sheer size of the Rosetta software (40GB as a tar file). The submit node on OSG only allows 50GB of space, and the archive must be untarred, which brings the space needed to about 80GB, so downloading Rosetta directly on the submit node is very difficult. To work around this, we submit a job to OSG that has enough disk space. This job downloads the full version of Rosetta 3.14 and then transfers the functions and data that are relevant for FastRelax. (If you wish to download a different version of Rosetta, edit `condor/rosetta/run.sh` and change the Rosetta `tar.bz2` file to download.) Finally, it encrypts Rosetta and splits it into multiple files under 1GB each so that the software is compliant to run on the Open Science Grid.

**Note**: This job may take a few hours to run, as downloading, untarring, and encrypting Rosetta can take some time. After submitting the job you are free to close this window and come back later; use the `job_status()` function to check whether the Rosetta job is still running. You cannot continue on to step 3 without downloading Rosetta properly.
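
To illustrate the splitting step, here is a minimal sketch of breaking a file into fixed-size pieces and joining them again. It is not the code in `condor/rosetta/run.sh` and it omits encryption entirely; `split_file` and `combine_files` are hypothetical helpers, and the `.aa`, `.ab`, ... suffixes follow the default naming of the Unix `split` tool.

```python
import string
from pathlib import Path


def split_file(path, chunk_size):
    """Split `path` into pieces `<path>.aa`, `<path>.ab`, ... of at most `chunk_size` bytes."""
    # Two-letter suffixes allow up to 26 * 26 pieces, which is plenty for a sketch.
    suffixes = (a + b for a in string.ascii_lowercase for b in string.ascii_lowercase)
    pieces = []
    with open(path, "rb") as source:
        while True:
            chunk = source.read(chunk_size)  # holds one whole piece in memory
            if not chunk:
                break
            piece = Path(f"{path}.{next(suffixes)}")
            piece.write_bytes(chunk)
            pieces.append(piece)
    return pieces


def combine_files(pieces, out_path):
    """Concatenate the pieces, in sorted name order, back into one file."""
    with open(out_path, "wb") as target:
        for piece in sorted(pieces):
            target.write(Path(piece).read_bytes())
```

Sorting by name is enough to restore the original order because the suffixes are assigned alphabetically.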


**<span style="color:red">NOTE</span>**: By downloading Rosetta you are subject to the Rosetta licensing agreement: [link](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md). The most important point is that the free version of Rosetta can only be used for **non-commercial** purposes. If you wish to use Rosetta for commercial purposes, please consult the licensing agreement.

**Note:** Run the two cells below only once to completion!



In [None]:
from utils import * 
rosetta_job_name='rosetta_download6'
submit_condor_job(job_name=rosetta_job_name,job_type='rosetta_download')

In [None]:
from utils import * 
job_status()

After `job_status()` shows the `job_name` above in the `completed` column, run the cell below to post-process the output from the job.

In [19]:
from utils import * 
rosetta_job_name='rosetta_download6'
post_process_rosetta_download(rosetta_job_name)

🔔 Notice: Directory 'condor/rosetta_download6/output/rosetta_download' already exists and is not empty. Skipping extraction.
✅ Found Rosetta File: condor/rosetta_download6/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.aa
✅ File transferred to downloads!
✅ Found Rosetta File: condor/rosetta_download6/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.ab
✅ File transferred to downloads!
✅ Found Rosetta File: condor/rosetta_download6/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.ac
✅ File transferred to downloads!
Combining split files for rosetta_min_enc.tar.gz
Decrypting Rosetta
OpenSSL 3.1.1 30 May 2023 (Library: OpenSSL 3.1.1 30 May 2023)
✅ Successfully  decoded rosetta.

Extracting: 100%|██████████| 24858/24858 [02:38<00:00, 157.23file/s]

✅ Success: File untarred to 'rosetta'.





#### Set up shell scripts
We also need to set the permissions of all bash scripts so that the functions in the following sections can run them.

In [None]:
from utils import *
# Set permissions for the bash_scripts directory
set_permissions('bash_scripts', '777')
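
The mode string passed above is an octal Unix permission mode; `'777'` grants read, write, and execute to every user. As a rough illustration of what a recursive permission change involves, here is a sketch built on `os.chmod`. It is not the `set_permissions` implementation from `utils`; `chmod_recursive` is a hypothetical helper.

```python
import os


def chmod_recursive(directory, mode_string):
    """Apply an octal mode such as '755' to `directory` and everything inside it."""
    mode = int(mode_string, 8)  # '755' -> 0o755
    os.chmod(directory, mode)
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            os.chmod(os.path.join(root, name), mode)
```

If only your own account needs to modify the scripts, a narrower mode such as `'755'` may be sufficient, though this notebook has only been described with `'777'`.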

### 3) Running Rosetta FastRelax on OSG

**Note**: If you have had any questions up to this point, or have any later on, we recommend opening a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues); our team is happy to help.

First, upload your PDB file to the folder `metl-sim/pdb_files/raw_pdb_files`.


Then replace `2qmt.pdb` with your PDB file name in the cell below. The example structure specified here, `2qmt.pdb`, is the binding domain of Protein G, also known as GB1.

![alt text](img/pdb.png)


In [33]:
pdb_file_name ='2qmt.pdb'

#### Prepare pdb for Rosetta Relax

To run Rosetta, the developers recommend some preprocessing steps to resolve the atom clashes that commonly occur in structures taken from the PDB database.

Run the cell below to perform this preprocessing step. For higher accuracy you can increase the parameter `relax_nstruct`; however, compute time increases with the number of structures. If you want to run with a higher `relax_nstruct`, we recommend you check out the GitHub page to run the code without the Jupyter notebook.

**Note**: The structure produced by whatever configuration you pass to `run_prepare_script` is saved and used for the large-scale Rosetta Relax runs.



In [None]:
from utils import *
run_prepare_script(
        rosetta_main_dir='notebooks/rosetta/rosetta_minimal',
        pdb_fn=f'pdb_files/raw_pdb_files/{pdb_file_name}',
        relax_nstruct=2,
        out_dir_base='output/prepare_outputs'
    )

#### Generate variants for Rosetta Relax

Now we will generate variants to run on OSG. Below you specify the number of variants to generate with the variable `variants_to_generate`. We recommend the default of 100,000 variants to get good results (but this can vary by protein length). The `max_subs` and `min_subs` parameters set the maximum and minimum number of substitutions allowed per variant; the default is a maximum of 5 and a minimum of 2.
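
To make the idea of random variant generation concrete, here is a minimal sketch. It is not the algorithm used by `run_variant_script`; the function names are hypothetical, the substitution format (wild-type residue, position, mutant residue) is modeled on the example output later in this notebook, and the 1-based position numbering is an assumption.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_variant(sequence, min_subs, max_subs, rng):
    """Build one variant string such as 'E15T,K28W' with min_subs..max_subs substitutions."""
    num_subs = rng.randint(min_subs, max_subs)
    positions = sorted(rng.sample(range(len(sequence)), num_subs))
    subs = []
    for pos in positions:
        wild_type = sequence[pos]
        mutant = rng.choice([aa for aa in AMINO_ACIDS if aa != wild_type])
        subs.append(f"{wild_type}{pos + 1}{mutant}")  # assumed 1-based numbering
    return ",".join(subs)


def generate_variants(sequence, n, min_subs, max_subs, seed=0):
    """Return `n` distinct random variants (assumes n is far below the number possible)."""
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n:
        variants.add(random_variant(sequence, min_subs, max_subs, rng))
    return sorted(variants)


# Short illustrative sequence; a real run would use the sequence of your PDB structure.
print(generate_variants("MTYKLILNGK", n=5, min_subs=1, max_subs=2))
```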

**Note**: When deciding how many variants to generate, expect roughly 25\% of jobs to fail. So if you hope to end up with 100,000 completed variants, we suggest you generate 100,000/0.75, or about 133,000, variants.
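
The same arithmetic can be written as a small helper. This is only a sketch of the calculation in the note above; the function name `variants_to_submit` is hypothetical, and the 25% failure rate is the rough expectation stated in the note, not a guarantee.

```python
import math


def variants_to_submit(target_successes, expected_failure_rate):
    """Number of variants to generate so that roughly `target_successes` finish."""
    return math.ceil(target_successes / (1 - expected_failure_rate))


print(variants_to_submit(100_000, 0.25))  # 133334
```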

In [38]:
from utils import * 

variants_to_generate=2
run_variant_script(
    pdb_fn=f'pdb_files/prepared_pdb_files/{pdb_file_name.split(".")[0]}_p.pdb',
    variants_to_generate=variants_to_generate,
    max_subs=1,
    min_subs=1,
)

✅ A variant file with these parameters doesn't exist: 
	 variants to generate 2
	 maximum substitutions 1 
	 minimum substitutions 1 
--> Generating variants now ⌛
✅ Successfully generate variants!
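
The cell above derives the prepared structure's path from `pdb_file_name` by taking everything before the first dot and appending `_p.pdb`. The sketch below spells that out with `pathlib`; `prepared_pdb_path` is a hypothetical helper, not part of `utils`. For a simple name like `2qmt.pdb` both approaches agree, but for a name containing extra dots they differ, and which one matches the file written by the prepare step is not confirmed here, so it is safest to avoid extra dots in your PDB file name.

```python
from pathlib import Path


def prepared_pdb_path(pdb_file_name):
    """Map e.g. '2qmt.pdb' to 'pdb_files/prepared_pdb_files/2qmt_p.pdb'."""
    return f"pdb_files/prepared_pdb_files/{Path(pdb_file_name).stem}_p.pdb"


print(prepared_pdb_path("2qmt.pdb"))
# For names with extra dots the two approaches disagree:
print("my.protein.pdb".split(".")[0], "vs", Path("my.protein.pdb").stem)
```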


#### Prepare Rosetta Relax job for OSG

We now need to prepare a Rosetta Relax job for the Open Science Grid.

In [40]:
pdb_file_name ='2qmt.pdb'
from utils import * 
job_name='FastRelax_test_54'

# must be the same as the parameters used to generate variants above 
variants_to_generate=2
max_subs=1
min_subs=1

prepare_rosetta_run(job_name,pdb_file_name,variants_to_generate,max_subs,min_subs)

✅ Job name FastRelax_test_54 is availabe, preparing rosetta job
✅ Variant file exists:
 Variants to Generate : 2 
 Max Subs: 1 
 Min Subs :1
✅ Successfully prepared OSG run!


#### Submit Rosetta Relax Job

This function submits the Rosetta Relax job to the Open Science Grid. After running it, you should be able to run `job_status()` and see that your job is likely `Running`. Longer proteins with many variants (>1M) will typically take the longest to complete (>24 hours). This is only a rough estimate meant to give you an idea of how long a run will take; run times can and will differ substantially between proteins. Our recommendation is to start small, with 100 variants, to get a feel for how long it takes to run. Then go back and start a new Rosetta Relax run with the number of variants you wish to pretrain your METL model with.

In [41]:
from utils import *
submit_condor_job(job_name='FastRelax_test_54',job_type='relax')

✅ No job named 'FastRelax_test_54' exists'. You can use this job name.
✅ Setting up job type `relax`

Extracting: 100%|██████████| 2/2 [00:00<00:00, 443.16file/s]

✅ Success: File untarred to 'condor/FastRelax_test_54'.

✅ Job name: 'FastRelax_test_54' submitted 


Running `job_status()`, as mentioned before, lists which of the jobs you have submitted are running, completed, or failed. You should see your job in the running column.

In [25]:
job_status()

💡 Notice: No log files found for job: relax4, skipping job 
💡 Notice: No log files found for job: FastRelax_test_41, skipping job 
💡 Notice: No log files found for job: FastRelax_test_34, skipping job 
💡 Notice: No log files found for job: FastRelax_test_40, skipping job 
💡 Notice: No log files found for job: relax1, skipping job 
💡 Notice: No log files found for job: relax3, skipping job 
💡 Notice: No log files found for job: FastRelax_test_36, skipping job 
💡 Notice: No log files found for job: relax2, skipping job 
💡 Notice: No log files found for job: FastRelax_test_38, skipping job 
💡 Notice: No log files found for job: FastRelax_test_35, skipping job 
💡 Notice: No log files found for job: relax6, skipping job 
💡 Notice: No log files found for job: FastRelax_test_37, skipping job 
💡 Notice: No log files found for job: relax5, skipping job 
💡 Notice: Found faile

#### Post process Rosetta Relax job

In [3]:
from utils import *
df=run_post_process_script(job_name='FastRelax_test_48')

💡 Notice:FastRelax_test_48 all ready post processed, loading pandas dataframe


Here the variable `df` is a pandas dataframe that contains the Rosetta scores for each variant. Note that not all of the variants you submitted may have completed. With millions of variants, it is expected behavior that not all jobs will complete successfully.

The cell below prints only the first 100 variants that were processed; to get all the variants, remove `.head(100)`. To view all the statistics for each variant, including Rosetta energy terms and run information, simply run `df` in a cell.

In [4]:
df[['variant','total_score']].head(100)

| | variant | total_score |
|---|---|---|
| 0 | E15T,K28W,Y45N | -160.529 |
| 1 | T18P,Y45N | -148.881 |
| 2 | F30A,Q32I,T51A | -165.410 |
| 3 | Q32I,N35V | -172.314 |
| 4 | I6G,K10M,V39K | -160.650 |
| ... | ... | ... |
| 95 | T16R,A20S,D46P | -156.998 |
| 96 | T16R,T53N | -166.944 |
| 97 | L12D,T18M,D40I,G41M | -160.459 |
| 98 | L7V,T18M | -179.328 |


To download the CSV file to your local machine, download this file:

`metl-sim/notebooks/condor/<job_name>/energies_df.csv`

by right-clicking it in the file explorer on the left panel (shown in the image below for the job name `FastRelax_test_48`).

![alt text](img/download.png)
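
Once the file is on your machine, it can be inspected without pandas. The sketch below uses the standard `csv` module to list the lowest-energy variants. It assumes the CSV has `variant` and `total_score` columns, mirroring the dataframe columns shown above; `lowest_energy_variants` is a hypothetical helper.

```python
import csv


def lowest_energy_variants(csv_path, top_n=5):
    """Return the `top_n` (variant, total_score) pairs with the lowest total_score."""
    with open(csv_path, newline="") as handle:
        rows = [
            (row["variant"], float(row["total_score"]))
            for row in csv.DictReader(handle)
        ]
    # Lower (more negative) Rosetta total_score values sort first.
    return sorted(rows, key=lambda pair: pair[1])[:top_n]
```

For example, `lowest_energy_variants("energies_df.csv", top_n=10)` would return the ten variants with the most negative `total_score`.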

### Issues? 

If you have issues, please post a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues).


### Share your results! 

Make your Rosetta results open access by sharing them **here**. 

