# Running Rosetta simulations on the OSG OSPool

This notebook provides an interactive environment to deploy [metl-sim](https://github.com/gitter-lab/metl-sim) on the [OSG OSPool](https://portal.osg-htc.org).

**Sections**
1. Hello world
2. Environment setup and Rosetta software download
3. Running Rosetta FastRelax on OSG

For any questions, please open a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues). Our team is happy to help.


# Setup
Run these cells to set up the environment.

In [30]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
import os
from utils import *

In [3]:
# expected_folder1 = "metl-sim"
# expected_folder2 = "notebooks"

# check_last_two_folders(expected_folder1, expected_folder2)

# Hello world

There are three main functions used in this notebook. This section shows how they work with a simple "hello world" example.

| Function                         | Description |
|:---------------------------------|:------------|
| `submit_condor_job(job_name, job_type)` | Submits a job with a unique `job_name` and a given `job_type`. Available job types: `'helloworld'`, `'rosetta_download'`, and `'relax'`. |
| `job_status()`                  | Checks the status of all jobs you have run. Also removes failed jobs if they are still on the OSPool. |
| `remove_all_condor_jobs()`       | Removes all running and failed jobs. Should only be run after you are done using this notebook. |


## Submit a hello world job

Let's submit a job with the `'helloworld'` job type. This submits three jobs under one `job_name`, and each job prints "Hello world!" to the console.

**<span style="color:red">Important</span>**: The parameter `job_name` must be unique to each job. You cannot submit two jobs which have the same name. 
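
Because `job_name` must be unique, one convenient pattern is to append a timestamp to a base name. The sketch below is only an illustration: `make_unique_job_name` is a hypothetical helper and is not part of the notebook's `utils` module.

```python
from datetime import datetime


def make_unique_job_name(base, existing_names=(), now=None):
    """Return `base` plus a timestamp suffix, avoiding names already in use.

    Hypothetical helper for illustration; not part of metl-sim.
    """
    now = now or datetime.now()
    candidate = f"{base}_{now:%Y%m%d_%H%M%S}"
    # If the timestamped name is already taken, add a numeric suffix.
    name, counter = candidate, 1
    while name in existing_names:
        name = f"{candidate}_{counter}"
        counter += 1
    return name


job_name = make_unique_job_name("hello_world")
print(job_name)
```

The resulting string could then be passed as `job_name` to `submit_condor_job`.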


In [27]:
submit_condor_job(job_name='hello_world_test9', job_type='helloworld')

✅ No job named 'hello_world_test9' exists'. You can use this job name.
✅ Setting up job type `helloworld`
✅ Job name: 'hello_world_test9' submitted


## Check the status of your jobs

To see the status of all currently running jobs **and** remove all failed jobs, call `job_status()`. You can ignore the `💡 Notice:` output unless you are curious about what happens in the background when failed jobs are removed.

The `helloworld` job takes about 5 minutes to complete. You can close this page during that time. When you come back, remember to run the cells in the `Setup` section before running any of the three functions.

In [28]:
job_status()

💡 Notice: Found failed jobs in log files, checking OSG if jobs exist
💡 Notice: Jobs currently running on OSG, job name: hello_world_test9, number jobs:3
💡 Notice: No failed jobs currently on OSG
 Status of all submitted jobs
                   Running⌛  Completed✅  Failed❌
hello_world_test9         3           0        0
hello_world_test5         0           0        0
hello_world_test6         0           0        0
hello_world_test8         0           2        1
hello_world_test7         0           2        1


## Remove all jobs

The final function, `remove_all_condor_jobs()`, removes all jobs regardless of whether they are `Running⌛` or `Failed❌`.

**<span style="color:red">Important</span>**: If you run this command after submitting the jobs above, they are removed and reported in the `Failed❌` column.

In [42]:
remove_all_condor_jobs()

💡 Notice: Jobs currently running on OSG, job name: rosetta_download7, number jobs:1
✅ Successfully removed 1 jobs.


# Environment Setup and Rosetta Software Download 

## Download the Python environments

Running these cells will download the Python environments necessary for running metl-sim. You only need to do this once.

In [35]:
# Example usage:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz"
output_path = "downloads/metl-sim_2025-02-13.tar.gz"

download_file(url, output_path, 'curl')

🔔 Notice: File 'downloads/metl-sim_2025-02-13.tar.gz' already exists. No download needed.


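Now extract the downloaded archive. The `file_path` below must match the `output_path` used in the download cell above; if no file exists at that path, extraction fails with a `No such file or directory` error like the one in the output below.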
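
Before extracting, it can help to confirm that the archive actually exists at `file_path`. The following is a minimal sketch using only the standard library; `safe_untar` is a hypothetical helper, and the notebook's own `untar_file_with_progress` may behave differently.

```python
import os
import tarfile


def safe_untar(file_path, extract_dir):
    """Extract a .tar.gz archive after confirming that it exists.

    Hypothetical helper for illustration; returns the archive member names.
    """
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"Archive not found: {file_path}")
    os.makedirs(extract_dir, exist_ok=True)
    with tarfile.open(file_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(extract_dir)
    return names
```

Passing the same string that was used as `output_path` in the download cell avoids the path mismatch that leads to a missing-file error.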

In [32]:
file_path = "downloads/rosettafy_env_v0.7.11.tar.gz"
extract_dir = "rosetta_env"

untar_file_with_progress(file_path, extract_dir)

❌ Failure: Extraction failed. Directory 'rosetta_env' has been removed.
❌ Failure: [Errno 2] No such file or directory: 'downloads/rosettafy_env_v0.7.11.tar.gz'


In [33]:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/clean_pdb.tar.gz"
output_path = "downloads/clean_pdb_2025-02-13.tar.gz"

download_file(url, output_path, 'curl')

❌ Failure: HTTP 404 - Could not download the file.


In [34]:
file_path = "downloads/clean_pdb.tar.gz"
extract_dir = "clean_pdb"

untar_file_with_progress(file_path, extract_dir)

❌ Failure: Extraction failed. Directory 'clean_pdb' has been removed.
❌ Failure: [Errno 2] No such file or directory: 'downloads/clean_pdb.tar.gz'


## Download Rosetta

We cannot download Rosetta directly to the OSG submit node because we would need ~80GB of free space, which exceeds the 50GB disk quota on the submit node. Instead, we will submit a job to download the full version of Rosetta and package a minimal distribution with just the files needed for metl-sim. 

**Note**: This job may take a few hours to run. After submitting it, you can close this window and come back later; use `job_status()` to confirm that the Rosetta job is still running or has completed. You cannot continue to the next step until Rosetta has been downloaded.


**<span style="color:red">NOTE</span>**: By downloading Rosetta, you are subject to the Rosetta licensing agreement: [link](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md). The most important point is that the free version of Rosetta can only be used for **non-commercial** purposes. If you wish to use Rosetta for commercial purposes, please consult the licensing agreement.

**Note:** You only need to run these cells once.

In [51]:
rosetta_job_name='rosetta_download9'
submit_condor_job(job_name=rosetta_job_name, job_type='rosetta_download')

✅ No job named 'rosetta_download9' exists'. You can use this job name.
✅ Setting up job type `rosetta_download`
✅ Job name: 'rosetta_download9' submitted


In [52]:
job_status()

💡 Notice: Found failed jobs in log files, checking OSG if jobs exist
💡 Notice: Jobs currently running on OSG, job name: rosetta_download9, number jobs:1
💡 Notice: No failed jobs currently on OSG
 Status of all submitted jobs
                   Running⌛  Completed✅  Failed❌
hello_world_test9         0           3        0
rosetta_download6         0           0        1
rosetta_download9         1           0        0
hello_world_test5         0           0        0
hello_world_test6         0           0        0
hello_world_test8         0           2        1
rosetta_download7         0           0        1
rosetta_download8         0           1        0
hello_world_test7         0           2        1


Once `job_status()` shows the `job_name` above in the `Completed` column, run the function below to post-process the job's output.

In [50]:
rosetta_job_name = 'rosetta_download8'
post_process_rosetta_download(rosetta_job_name)

Extracting: 100%|██████████| 2/2 [00:24<00:00, 12.25s/file]

✅ Success: File untarred to 'condor/rosetta_download8/output/rosetta_download'.
❌ Could not find all rosetta file:condor/rosetta_download8/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.aa for rosetta_download8
❌ Please redo the Downloading Rosetta section above and wait until the download of rosetta job has completed
💡 Notice: This function could lead to errors if using a different version of Rosetta than the default version
💡 Notice: Feel free to post an issue on github if any problems with downloading rosetta





## Set up shell scripts
We also need to set the permissions of all bash scripts so that the functions in the following sections can execute them.

In [None]:
# Set permissions for the bash_scripts directory
set_permissions('bash_scripts', '777')
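
As a rough illustration of what a recursive permission change involves, the sketch below applies an octal mode string to every file and subdirectory in a tree. `chmod_recursive` is a hypothetical name; how the notebook's `set_permissions` is actually implemented is not shown here, so treat this as an assumption.

```python
import os


def chmod_recursive(directory, mode_str):
    """Apply an octal mode string such as '755' to a directory tree.

    Hypothetical illustration; the notebook's `set_permissions` may differ.
    """
    mode = int(mode_str, 8)  # '755' -> 0o755
    os.chmod(directory, mode)
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            os.chmod(os.path.join(root, name), mode)
    return mode
```

Note that `'777'` grants read, write, and execute access to every user, while `'755'` keeps write access with the owner and still leaves the scripts executable.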

# Running Rosetta FastRelax on OSG

First upload your PDB file to the folder `metl-sim/pdb_files/raw_pdb_files`.

Then replace `2qmt.pdb` with your PDB file name in the cell below. The example structure `2qmt.pdb` is the binding domain of Protein G, also known as GB1.

In [33]:
pdb_file_name = '2qmt.pdb'

## Prepare pdb for Rosetta Relax

Before running Rosetta, its developers recommend preprocessing the structure to resolve atom clashes, which are common in structures taken from the PDB.

Run the cell below to perform this preprocessing step. For higher accuracy you can increase `relax_nstruct`, but compute time grows with the number of structures. If you want to run with a higher `relax_nstruct`, we recommend checking the GitHub page for how to run the code without the Jupyter notebook.

**Note**: The structure saved by `run_prepare_script`, with whatever configuration you choose here, is the one used for the large-scale Rosetta Relax runs.



In [None]:
run_prepare_script(
    rosetta_main_dir='notebooks/rosetta/rosetta_minimal',
    pdb_fn=f'pdb_files/raw_pdb_files/{pdb_file_name}',
    relax_nstruct=2,
    out_dir_base='output/prepare_outputs',
)

## Generate variants for Rosetta Relax

Now we will generate variants to run on OSG. Set the number of variants with the variable `variants_to_generate`. We recommend the default of 100,000 variants for good results, although this can vary with protein length. The `max_subs` and `min_subs` parameters set the maximum and minimum number of substitutions allowed per variant; the defaults are a maximum of 5 and a minimum of 2.
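
To make the roles of `min_subs` and `max_subs` concrete, here is a small sketch of how one random variant could be sampled. This is not metl-sim's actual sampler: `random_variant`, the 1-based position convention, and the short example sequence are all assumptions chosen for illustration.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_variant(sequence, min_subs, max_subs, rng):
    """Return one variant string such as 'T2A,K4W' for `sequence`.

    Positions are 1-based here, and each substitution replaces the wild-type
    residue with a different amino acid. Hypothetical sketch only.
    """
    num_subs = rng.randint(min_subs, max_subs)
    positions = sorted(rng.sample(range(len(sequence)), num_subs))
    subs = []
    for pos in positions:
        wild_type = sequence[pos]
        mutant = rng.choice([aa for aa in AMINO_ACIDS if aa != wild_type])
        subs.append(f"{wild_type}{pos + 1}{mutant}")
    return ",".join(subs)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
sequence = "MTYKLILNGK"  # short made-up sequence, not a real protein
print(random_variant(sequence, min_subs=2, max_subs=5, rng=rng))
```

Each call yields between `min_subs` and `max_subs` substitutions at distinct positions, joined by commas.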

**Note**: When deciding how many variants to generate, expect roughly 25% of jobs to fail. If you want 100,000 completed variants, we suggest generating 100,000/0.75, or about 133,000, variants.
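
The oversampling arithmetic in the note can be written as a short function. `variants_to_request` is a hypothetical helper, and the 25% default simply mirrors the rough failure rate quoted above.

```python
import math


def variants_to_request(target, expected_failure_rate=0.25):
    """Number of variants to generate so that roughly `target` complete."""
    if not 0 <= expected_failure_rate < 1:
        raise ValueError("expected_failure_rate must be in [0, 1)")
    return math.ceil(target / (1 - expected_failure_rate))


print(variants_to_request(100_000))  # 133334
```

Rounding up gives 133,334, which agrees with the approximate figure of 133,000 in the note.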

In [38]:
variants_to_generate = 2
run_variant_script(
    pdb_fn=f'pdb_files/prepared_pdb_files/{pdb_file_name.split(".")[0]}_p.pdb',
    variants_to_generate=variants_to_generate,
    max_subs=1,
    min_subs=1,
)

✅ A variant file with these parameters doesn't exist: 
 	 variants to generate 2
	 maximum substitutions 1 
	 minimum substitutions 1 
--> Generating variants now ⌛
✅ Successfully generate variants!


## Prepare Rosetta Relax job for OSG

We now need to prepare a Rosetta Relax job for the Open Science Grid.

In [40]:
pdb_file_name ='2qmt.pdb'
job_name='FastRelax_test_54'

# must be the same as the parameters used to generate variants above 
variants_to_generate=2
max_subs=1
min_subs=1

prepare_rosetta_run(job_name, pdb_file_name, variants_to_generate, max_subs, min_subs)

✅ Job name FastRelax_test_54 is availabe, preparing rosetta job
✅ Variant file exists:
 Variants to Generate : 2 
 Max Subs: 1 
 Min Subs :1
✅ Successfully prepared OSG run!


## Submit Rosetta Relax Job

This function submits the Rosetta Relax job to the Open Science Grid. After running it, `job_status()` should show your job as `Running`. Longer proteins with many variants (>1M) typically take the longest to complete (>24 hours). This is only a rough guide to help you estimate run time; actual run times can differ substantially between proteins. We recommend starting small, with 100 variants, to get a feel for how long a run takes. Then start a new Rosetta Relax run with the number of variants you want to use to pretrain your METL model.

In [41]:
submit_condor_job(job_name='FastRelax_test_54', job_type='relax')

✅ No job named 'FastRelax_test_54' exists'. You can use this job name.
✅ Setting up job type `relax`

Extracting: 100%|██████████| 2/2 [00:00<00:00, 443.16file/s]

✅ Success: File untarred to 'condor/FastRelax_test_54'.

✅ Job name: 'FastRelax_test_54' submitted


As mentioned before, `job_status()` reports which of your submitted jobs are running, completed, or failed. You should see your job in the running column.

In [25]:
job_status()

💡 Notice: No log files found for job: relax4, skipping job
💡 Notice: No log files found for job: FastRelax_test_41, skipping job
💡 Notice: No log files found for job: FastRelax_test_34, skipping job
💡 Notice: No log files found for job: FastRelax_test_40, skipping job
💡 Notice: No log files found for job: relax1, skipping job
💡 Notice: No log files found for job: relax3, skipping job
💡 Notice: No log files found for job: FastRelax_test_36, skipping job
💡 Notice: No log files found for job: relax2, skipping job
💡 Notice: No log files found for job: FastRelax_test_38, skipping job
💡 Notice: No log files found for job: FastRelax_test_35, skipping job
💡 Notice: No log files found for job: relax6, skipping job
💡 Notice: No log files found for job: FastRelax_test_37, skipping job
💡 Notice: No log files found for job: relax5, skipp

## Post process Rosetta Relax job

In [3]:
df = run_post_process_script(job_name='FastRelax_test_48')

💡 Notice:FastRelax_test_48 all ready post processed, loading pandas dataframe


Here `df` is a pandas DataFrame that contains the Rosetta scores for each variant. Note that not all of the variants you submitted may have completed. With millions of variants, it is expected that some jobs will not complete successfully.

The cell below prints only the first 100 processed variants; to see all of them, remove `.head(100)`. To view all the statistics for each variant, including Rosetta energy terms and run information, simply run `df` in a cell.

In [4]:
df[['variant','total_score']].head(100)

|     | variant             | total_score |
|----:|:--------------------|------------:|
| 0   | E15T,K28W,Y45N      | -160.529    |
| 1   | T18P,Y45N           | -148.881    |
| 2   | F30A,Q32I,T51A      | -165.410    |
| 3   | Q32I,N35V           | -172.314    |
| 4   | I6G,K10M,V39K       | -160.650    |
| ... | ...                 | ...         |
| 95  | T16R,A20S,D46P      | -156.998    |
| 96  | T16R,T53N           | -166.944    |
| 97  | L12D,T18M,D40I,G41M | -160.459    |
| 98  | L7V,T18M            | -179.328    |
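
Variant strings in the `variant` column follow the pattern wild-type residue, position, mutant residue, separated by commas. If you want to work with individual substitutions, a small parser such as the one sketched below may help; `parse_variant` is a hypothetical helper and is not provided by metl-sim.

```python
import re

SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_variant(variant):
    """Split 'E15T,K28W' into [('E', 15, 'T'), ('K', 28, 'W')]."""
    parsed = []
    for token in variant.split(","):
        match = SUBSTITUTION.match(token.strip())
        if match is None:
            raise ValueError(f"Unrecognized substitution: {token!r}")
        wild_type, position, mutant = match.groups()
        parsed.append((wild_type, int(position), mutant))
    return parsed


print(parse_variant("E15T,K28W,Y45N"))
# [('E', 15, 'T'), ('K', 28, 'W'), ('Y', 45, 'N')]
```

For example, `df['variant'].apply(lambda v: len(parse_variant(v)))` would give the number of substitutions in each variant.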


To download the CSV file to your local machine, right-click this file in the file explorer in the left panel and download it:

`metl-sim/notebooks/condor/<job_name>/energies_df.csv`

The image below shows this for the job name `FastRelax_test_48`.

![alt text](img/download.png)

# Issues

Please submit issues on the [metl-sim GitHub](https://github.com/gitter-lab/metl-sim/issues).

# Share your results! 

Make your Rosetta results open access by sharing them **here**. 

