# METL-Sim in Open Science Grid

This notebook deploys the Rosetta framework on the Open Science Grid (OSG) to run the FastRelax protocol from the Rosetta software. It supports the random creation of millions of variants to run through FastRelax on OSG. The interactive functions in this notebook do not represent the full power of the `metl-sim` repository; if you wish to customize your simulation or variants, please refer to the metl-sim [GitHub](https://github.com/gitter-lab/metl-sim) README. 

**Sections**
1. Introduction to Jupyter notebooks and OSG functions (do not skip!) 
2. Environment setup and Rosetta software download
3. Running Rosetta FastRelax on OSG
4. Extract and postprocess output.

**Note**: If you have questions at any point, we recommend opening a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues); our team is happy to help.


### 1) Introduction to Jupyter notebooks and OSG functions 


#### Introduction to Jupyter notebooks

To run the `metl-sim` framework, run each of the cells below in turn. To run a cell, either press Shift+Enter or press the play button at the top of the screen. 

Practice by running the cell below.

![Alt Text](img/jupyter.png)


In [None]:
from utils import *

Once the cell has run, it should have a \[1] on its left side (shown below), indicating that it ran to completion. While a cell is running it shows \[*], and a cell that has not been run yet shows \[ ]. We recommend running the cells in the order they appear in this notebook (except for the cells in the `Setup` portion, which only need to be run once). The number in the brackets records the order in which cells were actually run; the cell above was the first to run, so it shows \[1]. 

![Alt Text](img/cell.png) 

If at any time you need to start over because of an error, we recommend pressing `Kernel` in the top panel of the screen and pressing `Restart Kernel and Clear Outputs of all Cells`

![Alt Text](img/restart.png) 

The script below checks that the IPython notebook `demo.ipynb` is open in the correct directory, `metl-sim/notebooks`.


In [None]:

expected_folder1 = "metl-sim"
expected_folder2 = "notebooks"

check_last_two_folders(expected_folder1, expected_folder2)
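
For readers curious what such a directory check involves, here is a minimal sketch using only the standard library. It is not the implementation in `utils`; the function name `in_expected_directory` and its `cwd` parameter are hypothetical and chosen for illustration.

```python
from pathlib import Path


def in_expected_directory(parent_name, folder_name, cwd=None):
    """Return True if the working directory ends in parent_name/folder_name.

    Hypothetical helper for illustration; not part of the metl-sim utils.
    """
    current = Path(cwd) if cwd is not None else Path.cwd()
    return current.name == folder_name and current.parent.name == parent_name


# Check the directory the notebook kernel was started in.
print(in_expected_directory("metl-sim", "notebooks"))
```

If this prints `False`, the notebook was most likely opened from a different folder than `metl-sim/notebooks`.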

#### Helloworld in Open Science Grid (OSG)

To submit jobs to the Open Science Grid, you only need three functions, all of which are explained in more detail below. 

The three functions are: 

- `submit_condor_job('job_name','job_type')` submits a job with a unique `job_name` and a `job_type` that is one of `'helloworld'`, `'rosetta_download'`, or `'relax'`. 
- `job_status()` shows the status of all jobs you have run, and also removes failed jobs that are still on the Open Science Grid. 
- `remove_all_condor_jobs()` removes all running and failed jobs. Run it only after you are done using this notebook.



First, let's submit a job with the 'helloworld' job_type. This submits three jobs under one job_name to the Open Science Grid. Each job prints `Helloworld` to the console and then completes. 

**<span style="color:red">Important</span>**: The parameter `job_name` is unique to each job and must be specified by you. You cannot submit two jobs which have the same name. 


In [None]:
from utils import * 
submit_condor_job(job_name='hello_world_test5',job_type='helloworld')
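
Because every `job_name` must be unique, one convenient pattern is to append a timestamp to a readable prefix. The sketch below is only an illustration: `make_unique_job_name` is a hypothetical helper, and it assumes that underscores, hyphens, and digits are acceptable in a job name.

```python
from datetime import datetime


def make_unique_job_name(prefix, now=None):
    """Build a job name such as 'hello_world_test_2024-11-18_15-37-36'.

    Hypothetical helper; `now` can be passed explicitly to make the result
    reproducible, otherwise the current local time is used.
    """
    stamp = (now or datetime.now()).strftime("%Y-%m-%d_%H-%M-%S")
    return f"{prefix}_{stamp}"


job_name = make_unique_job_name("hello_world_test")
print(job_name)
```

The resulting string could then be passed as `job_name` to `submit_condor_job`. Names built this way only collide if two are created with the same prefix within the same second.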

To see the output of all currently running jobs, **and** remove all failed jobs, simply check the job status with `job_status()`. You can ignore the `💡 Notice:` output unless you are curious what is happening in the background when removing failed jobs. 

The `helloworld` job will not complete for 5 minutes. You can close this page during that time. However, when you come back you must always run `from utils import *` before running any of the three functions. 

In [None]:
from utils import *
job_status()

The final function, `remove_all_condor_jobs()`, only needs to be run once you are done running jobs on the Open Science Grid. It removes every job of yours that is still on the Open Science Grid, whether it is `Running⌛` or `Failed❌`. 

**<span style="color:red">Important</span>**:  If you run this command after submitting the jobs above, they will be removed and moved to the `Failed❌` column. 

In [None]:
from utils import * 
# Example usage (uncomment if you wish to run)
# remove_all_condor_jobs()

### 2) Environment Setup and Rosetta Software Download 

#### Environment Setup 

This step downloads the binaries needed to run the Python scripts associated with this environment. Run the cell below to download the environment. 

**Note: Run the four cells below only once!**

In [None]:
# Example usage:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz"
output_path = "downloads/rosettafy_env_v0.7.11.tar.gz"

# NOTE you need to retar this environment and make it correct. 
# also you need to replace the osdf_python_env_file 

download_file(url, output_path,'curl')
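
Since these downloads should only happen once, a guard that skips files already on disk can be handy. The following is a minimal sketch, not the notebook's `download_file`: the name `download_if_missing` and the `fetch` parameter are hypothetical, and the default uses `urllib.request.urlretrieve` rather than `curl`.

```python
import urllib.request
from pathlib import Path


def download_if_missing(url, output_path, fetch=urllib.request.urlretrieve):
    """Download url to output_path unless a non-empty file is already there.

    Returns True if a download was performed and False if it was skipped.
    Hypothetical helper; `fetch` is any callable taking (url, path).
    """
    target = Path(output_path)
    if target.exists() and target.stat().st_size > 0:
        return False
    # Create the downloads/ folder (or any other parent) if needed.
    target.parent.mkdir(parents=True, exist_ok=True)
    fetch(url, str(target))
    return True
```

Calling it twice with the same `output_path` performs the download only the first time. It does not verify that an existing file is complete, so a partially downloaded archive would need to be deleted by hand.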

Now we will untar the binary file. 

In [None]:
file_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
extract_dir = "rosetta_env"

untar_file_with_progress(file_path, extract_dir)
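
As a rough picture of what untarring with progress reporting involves, here is a sketch built on the standard `tarfile` module. It is not the `untar_file_with_progress` function from `utils`; the name `untar_with_progress` and the `report_every` parameter are assumptions made for this example.

```python
import tarfile
from pathlib import Path


def untar_with_progress(file_path, extract_dir, report_every=1000):
    """Extract an archive member by member and report progress.

    Hypothetical helper; returns the number of members extracted.
    """
    Path(extract_dir).mkdir(parents=True, exist_ok=True)
    count = 0
    # "r:*" lets tarfile detect the compression (gz, bz2, ...) automatically.
    with tarfile.open(file_path, "r:*") as archive:
        for member in archive:
            archive.extract(member, path=extract_dir)
            count += 1
            if count % report_every == 0:
                print(f"extracted {count} members so far")
    print(f"done: extracted {count} members into {extract_dir}")
    return count
```

Only extract archives from sources you trust, since archive members can carry unexpected paths.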

In [None]:
# Download the clean_pdb archive
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/clean_pdb.tar.gz"
output_path = "downloads/clean_pdb.tar.gz"

download_file(url, output_path,'curl')

In [None]:
# Example usage:
file_path = "downloads/clean_pdb.tar.gz"
extract_dir = "clean_pdb"

untar_file_with_progress(file_path, extract_dir)

#### Downloading Rosetta

Downloading Rosetta is a non-trivial task due to the sheer size of the Rosetta software (40GB as a tar file). The submit node on OSG only allows 50GB of space, and because the archive must also be untarred, about 80GB is needed in total, which makes downloading Rosetta on the submit node very difficult. To work around this, we submit a job to OSG that requests enough disk space. This job downloads the full version of Rosetta 3.14 and then keeps only the functions and data that are relevant for FastRelax. (To download a different version of Rosetta, open `condor/rosetta/run.sh` and change the Rosetta `tar.bz2` file that it downloads.) Finally, the job encrypts Rosetta and breaks it up into multiple files under 1GB in size so that the software complies with Open Science Grid requirements.

**Note**: This job may take a few hours to run, as downloading, untarring, and encrypting Rosetta can take some time. After submitting the job you are free to close this window and come back later; use the `job_status()` function to check on the Rosetta job. You cannot continue on to step 3 until Rosetta has been downloaded successfully. 
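
To illustrate the "break it up into multiple files" step, the sketch below splits a file into fixed-size parts with `.aa`, `.ab`, ... suffixes and joins them again. It is a simplified stand-in, not what `condor/rosetta/run.sh` does: the helper names `split_file` and `join_files` are hypothetical, encryption is omitted, and each part is read fully into memory, which is fine for a demonstration but not for parts close to 1GB.

```python
from itertools import product
from pathlib import Path
from string import ascii_lowercase


def split_file(path, chunk_bytes):
    """Split path into parts named path.aa, path.ab, ... of at most chunk_bytes."""
    suffixes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    parts = []
    with open(path, "rb") as src:
        for suffix in suffixes:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            part = Path(f"{path}.{suffix}")
            part.write_bytes(chunk)
            parts.append(part)
        # Two-letter suffixes allow 676 parts; refuse to silently drop data.
        if src.read(1):
            raise ValueError("file needs more than 676 parts; use a larger chunk size")
    return parts


def join_files(parts, output_path):
    """Concatenate the parts in suffix order to rebuild the original file."""
    with open(output_path, "wb") as dst:
        for part in sorted(parts):
            dst.write(Path(part).read_bytes())
```

Because the suffixes sort alphabetically, joining the parts in sorted order restores the original byte sequence.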


**<span style="color:red">NOTE</span>**: By downloading Rosetta you are subject to the Rosetta licensing agreement: [link](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md). The most important point is that the free version of Rosetta can only be used for **non-commercial** purposes. If you wish to use Rosetta for commercial purposes, please consult the licensing agreement.

**Note:** Run the two cells below only once, and let them run to completion!



In [None]:
from utils import * 
rosetta_job_name='rosetta_download6'
submit_condor_job(job_name=rosetta_job_name,job_type='rosetta_download')

In [None]:
from utils import * 
job_status()

After checking the job status and confirming that the `job_name` above is in the `completed` column, run the function below to post-process the output from the job. 

In [None]:
from utils import * 
# Must match the job name used when submitting the download job above
rosetta_job_name='rosetta_download6'
post_process_rosetta_download(rosetta_job_name)

#### Set up shell scripts
We also need to set the permissions of all bash scripts so we can run all the following functions. 

In [None]:
from utils import *
# Set permissions for the bash_scripts directory
set_permissions('bash_scripts', '777')
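
For reference, setting permissions from Python can be done with `os.chmod`. The sketch below is not the `set_permissions` function from `utils`: the name `make_scripts_executable` is hypothetical, it assumes the scripts end in `.sh`, and it uses the narrower mode `0o755` as an illustrative alternative to `777`. Whether that narrower mode is sufficient for this workflow is an assumption that has not been checked here.

```python
import os
import stat
from pathlib import Path


def make_scripts_executable(directory, mode=0o755):
    """Apply `mode` to every *.sh file directly inside `directory`.

    Hypothetical helper; returns the names of the files that were changed.
    """
    changed = []
    for script in sorted(Path(directory).glob("*.sh")):
        os.chmod(script, mode)
        changed.append(script.name)
    return changed
```

Mode `0o755` lets everyone read and execute the scripts while only the owner can modify them.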

### 3) Running Rosetta FastRelax on OSG

**Note**: If you have any questions about the steps so far, or about any later step, we recommend opening a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues); our team is happy to help.

First, upload your PDB file to the folder `metl-sim/pdb_files/raw_pdb_files`.


Then replace `2qmt.pdb` with your PDB file name in the cell below. Here I have specified the example structure `2qmt.pdb`, which is the binding domain of Protein G, also known as GB1. 

![alt text](img/pdb.png)


In [1]:
pdb_file_name ='2qmt.pdb'
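
Before moving on, it can help to confirm that the file really is where the later cells expect it. Here is a small sketch; `locate_pdb` is a hypothetical helper, and the `raw_dir` argument is an assumption that should point at wherever `pdb_files/raw_pdb_files` lives relative to your working directory.

```python
from pathlib import Path


def locate_pdb(pdb_file_name, raw_dir):
    """Return the path to the uploaded PDB file, or raise a descriptive error.

    Hypothetical helper; not part of the metl-sim utils.
    """
    path = Path(raw_dir) / pdb_file_name
    if path.suffix.lower() != ".pdb":
        raise ValueError(f"expected a .pdb file, got {pdb_file_name!r}")
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; upload the file first")
    return path
```

For example, `locate_pdb(pdb_file_name, "pdb_files/raw_pdb_files")` would raise `FileNotFoundError` if the upload step was skipped or the file name was mistyped.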

#### Prepare pdb for Rosetta Relax

Before running Rosetta, the developers recommend some preprocessing steps to resolve the atom clashes that are common in structures taken from the PDB database. 

Run the cell below to perform this preprocessing step. For higher accuracy you can increase the parameter `relax_nstruct`; however, compute time grows as the number of structures increases. If you want to run with a higher `relax_nstruct`, we recommend checking the GitHub page for how to run the code without the Jupyter notebook.

**Note**: The structure saved by `run_prepare_script`, with whatever configuration you choose here, is the one used for the large-scale Rosetta relax runs. 



In [None]:
from utils import *
run_prepare_script(
        rosetta_main_dir='notebooks/rosetta/rosetta_minimal',
        pdb_fn=f'pdb_files/raw_pdb_files/{pdb_file_name}',
        relax_nstruct=2,
        out_dir_base='output/prepare_outputs'
    )

#### Generate variants for Rosetta Relax

Now we will generate variants to run on OSG. Specify the number of variants to generate with the variable `variants_to_generate`. We recommend the default of 100,000 variants to get good results (but this can vary by protein length). The `max_subs` and `min_subs` parameters set the maximum and minimum number of substitutions allowed per variant; the default is a maximum of 5 and a minimum of 2.  

**Note**: When deciding how many variants to generate, expect roughly 25\% of jobs to fail. So if you hope to obtain 100,000 completed variants, we suggest generating 100,000/0.75, or about 133,000, variants. 
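
The arithmetic in the note above can be wrapped in a small function so it is easy to redo for other targets. `variants_needed` is a hypothetical helper written for this illustration, with the 25% failure rate from the note as its default.

```python
import math


def variants_needed(target_successes, expected_failure_rate=0.25):
    """Number of variants to generate so that about target_successes succeed."""
    if not 0 <= expected_failure_rate < 1:
        raise ValueError("expected_failure_rate must be in [0, 1)")
    # Divide by the expected success rate and round up to a whole variant.
    return math.ceil(target_successes / (1 - expected_failure_rate))


print(variants_needed(100_000))  # 133334
```

The result could be assigned to `variants_to_generate`. The actual failure rate may differ from 25%, so treat the output as a planning estimate.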

In [None]:
from utils import * 

variants_to_generate=5000

run_variant_script(pdb_fn=f'pdb_files/prepared_pdb_files/{pdb_file_name.split(".")[0]}_p.pdb',
                   variants_to_generate=variants_to_generate,
                   max_subs=5,
                   min_subs=2)
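
To make the roles of `min_subs` and `max_subs` concrete, here is a toy sketch of random variant generation. It is not the algorithm used by `run_variant_script`: the function names, the 1-based `A5G`-style notation, the uniform sampling, and the example sequence are all assumptions made for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_variant(sequence, min_subs, max_subs, rng):
    """Return one variant as a comma-separated list such as 'A1G,E4K'."""
    num_subs = rng.randint(min_subs, max_subs)
    positions = sorted(rng.sample(range(len(sequence)), num_subs))
    subs = []
    for pos in positions:
        wild_type = sequence[pos]
        # Pick any amino acid other than the wild-type residue.
        mutant = rng.choice([aa for aa in AMINO_ACIDS if aa != wild_type])
        subs.append(f"{wild_type}{pos + 1}{mutant}")
    return ",".join(subs)


def generate_variants(sequence, n, min_subs=2, max_subs=5, seed=0):
    """Generate n distinct random variants (toy sketch, small n only)."""
    if max_subs > len(sequence):
        raise ValueError("max_subs cannot exceed the sequence length")
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n:
        variants.add(random_variant(sequence, min_subs, max_subs, rng))
    return sorted(variants)


# Arbitrary toy sequence, not a real protein.
print(generate_variants("ACDEFGHIKLMNPQRSTVWY", 3))
```

Each generated variant carries between `min_subs` and `max_subs` substitutions at distinct positions. The loop assumes `n` is far smaller than the number of possible variants, which holds easily for real proteins.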

#### Prepare Rosetta Relax job for OSG

We now need to prepare a Rosetta Relax job for the Open Science Grid. 

In [5]:
from utils import * 
job_name='FastRelax_test_32'

# must be the same as the parameters used to generate variants above 
variants_to_generate=5000
max_subs=5
min_subs=2

prepare_rosetta_run(job_name,pdb_file_name,variants_to_generate,max_subs,min_subs)

✅ Job name FastRelax_test_32 is availabe, preparing rosetta job
✅ Variant file exists:
 Variants to Generate : 5000 
 Max Subs: 5 
 Min Subs :2
Standard Output:
 
Added .gitignore to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under root metl-sim-source_local
Added code to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under root metl-sim-source_local
Added environment to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under root metl-sim-source_local
Added .ipynb_checkpoints to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under root metl-sim-source_local
Added energize_args to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under root metl-sim-source_local
Added rosetta_env to output/htcondor_runs/condor_energize_2024-11-18_15-37-36_FastRelax_test_32/code.tar.gz under

#### Submit Rosetta Relax Job


In [None]:
from utils import *
# Copy over the contents and then submit the job. 
# Export the parameters that are local to the job (this needs a bash environment). 

submit_condor_job(job_name='relax1',job_type='relax')

#### Post process Rosetta Relax job

In [None]:
from utils import *
# Post-process the job in the same way as the previous functions. 
run_post_process_script(job_name='relax1',
                        pdb_fn=f'pdb_files/prepared_pdb_files/{pdb_file_name.split(".")[0]}_p.pdb',
                        variants_to_generate=variants_to_generate,
                        max_subs=5,
                        min_subs=2)

### Issues? 

If you run into issues, please open a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues). 


### Share your results! 

Make your Rosetta results open access by sharing them **here**. 



### Notes

In [None]:
## need to have a catch all condor_rm function
## to remove anything that my error cases don't catch 


## what is left? 



# do this first!!! 
#3. generate variants (simplest to implement)
#4. prepare a job. (simple)  
        # transfer both the osdf files, along with the 
        # along with the variant file and 
#4. submit a job. (find the file which the same name, and then you are done. 
#5. post process a job. (simple) 


## 7. place all the files into osdf as opposed to on my local machine
#  I'm nervous about this, so lets get help with this... 

# just untar the code directory , and place the pdb file and variant list inside
# then retar it up.
# it's excessive and can be optimized, but it's the easiest solution nonetheless... 


In [None]:


## then if they want to run this job again I say you already have a job of the 
### same name already running, either remove that job, or wait until it finishes
## submit another job... 

# make_rosetta_run
# submit the rosetta run with my version of rosetta specified. 

# check in on the run in question. 


# shell script that downloads rosetta 
# then untar's it 
# activates a conda environment 
# then reduces it using the rosetta minimal script 
# then places those files in an output directory and transfers them back to 
# this location... 

#### so how are you going to do the submission of condor scripts. 
## what do you need? 

## all the code, tar balled up... that is easy ... 
## the conda environment 
## 


## then I need some functions to check condor queue. probably have the
# user specify this run so that they know what they are running. 


# output_path = "downloads/rosetta_bin_linux_3.14_bundle.tar.bz2"

# download_file(Rosetta_Link_URL, output_path)

#### Setup 1b - Rosetta Software

We will need to download and untar the most recent Rosetta version.

To do this, go to the [Rosetta Commons download page](https://downloads.rosettacommons.org/software/academic/) and download under the academic (non-commercial) license.

Then click the download link for Rosetta and the current version number. 


![Alt text](img/rosetta_homepage.png)


On the next page right click on the Linux binary link and click `Copy Link Address` 


![Alt Text](img/download_homepage.png)


Paste the link below in the following cell. 


**Note**: Running the code below if you have not already downloaded Rosetta should take approximately 10 minutes.  

Now we will untar the file. **Note**, this will take up around 50GB of space. The base allowed amount given by Open Science Grid is 50GB. You may need to request more space if using future versions of Rosetta.  

Untarring this file will take approximately 30 minutes. 

**Do not close the computer during this period!**

In [None]:
!condor_submit condor/rosetta/
# have 1 script that just does condor submit
# have 1 script check for completion (output of condor_q basically but in a pleasing 
# format, and it looks to see if the downloaded files exist. 

In [None]:
## after I get that working

## then I basically want to generate variants to 
## simulate in sam's environment of interest, but how do I do that when not in the top 
## directory? okay, not being in the top directory was a mistake. 
# i should change that. 

### run scripts of interest as shell scripts. 
## then have them upload pdb and verify that it is uploaded. 



In [None]:
# 1. environment stuff with packing metl-sim
# 2. instead of using github need a different system to tar up the files. 
# 3. figure out the rosetta problem either just ask for more space 
#     I'm honestly not sure, but whatever, just do whatever is easiest. 
#     - just download it yourself and minimalize  
            # check if the prepared version 
            # make sure rosetta exists , don't the curl and untar again 
            # check if the conda check if the environment to see if you can activate 
            # the environment... 
            # tony has a script for activating the conda environment. 
#     - download through conda  
            # 
#     - group file system 
#     - redistribute rosetta on a public server and deal with the legal headache of that. 


# metl-sim github issues --- tell them how to get help ...

# I'm really starting to like just download yourself and minimalize... 
# 4. UC 



# I think it will be better just to do an scp of the
#  finished environments to this directory... 

# plan 0 is checking to see if you should submit 
# plan A needs to be release... 
# bank that into the system because we know expect a certain failure rate
# all singles and doubles, not the the best simulation workflow. 
# send drafts for jon to look at... 

# whether the instructions make sense and show him the notebook. first round feedback next 
# week. 

# then you're basically done, add in some cosmetic stuff that I don't 
# even think is that big of a deal, but then you're home free. 

# 1. should look to see if the conda binaries are already there; if they are then you're good, if not then download the files again. 
# 2. then download rosetta making sure to download linux 
      #  etc.  
# 3. friendly suggestion to share simulations , those are valuable , here is how we 
#     considering uploading (zenodo) and open a git issue if you ever do that. 
#     not requirement; you made something useful... at least for us... maybe for other people
#     files you should upload and here is how you should do that.  
#     cell and then it goes to bryce's OSG storage. 

In [None]:
('http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.aa',
 'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.ab',
 'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosetta_min_enc_v2.tar.gz.ac',
  'http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz')