# Running metl-sim on the OSG OSPool

This notebook provides an interactive environment to deploy [metl-sim](https://github.com/gitter-lab/metl-sim) on the [OSG OSPool](https://portal.osg-htc.org).

**Sections**
1. Hello world
2. Environment setup and Rosetta software download
3. Running metl-sim on OSG

For any questions, please open a GitHub issue [here](https://github.com/gitter-lab/metl-sim/issues). Our team is happy to help.


# Setup
Run these cells to set up the environment.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
from utils import *

# Hello world

There are three main functions used in this notebook. This section shows how they work with a simple "hello world" example.

| Function                         | Description |
|:---------------------------------|:------------|
| `submit_condor_job(job_name,job_type)` | Submits a job with a unique `'job_name'` and `'job_type'`. Available job types: `'helloworld'`, `'rosetta_download'`, and `'relax'`. |
| `job_status()`                  | Checks the status of all jobs you have run. Also removes failed jobs if they are currently on OSPool. |
| `remove_all_condor_jobs()`       | Removes all running and failed jobs. Should only be run after you are done using this notebook. |


## Submit a hello world job

Let's submit a job with the `'helloworld'` job type. This submits three jobs under one `job_name`. Each job prints "Hello world!" to the console.

**<span style="color:red">Important</span>**: The parameter `job_name` must be unique to each job. You cannot submit two jobs which have the same name. 
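
Because every `job_name` must be unique, one option is to build names programmatically. The sketch below is a hypothetical helper (`make_unique_job_name` is not part of the notebook's `utils` module) that appends a timestamp and a short random suffix to a prefix. It assumes `submit_condor_job` accepts names containing letters, digits, and underscores, which the names used in this notebook suggest but do not guarantee.

```python
import uuid
from datetime import datetime


def make_unique_job_name(prefix: str) -> str:
    """Return `prefix` plus a timestamp and a short random suffix.

    Hypothetical helper, not part of the notebook's `utils` module.
    """
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    suffix = uuid.uuid4().hex[:6]  # guards against two calls in the same second
    return f"{prefix}_{timestamp}_{suffix}"


job_name = make_unique_job_name("hello_world")
print(job_name)
```

The resulting string could then be passed as `job_name` to `submit_condor_job`.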


In [31]:
submit_condor_job(job_name='hello_world_1', job_type='helloworld')

✅ No job named 'hello_world_1' exists'. You can use this job name.
✅ Setting up job type `helloworld`
✅ Job name: 'hello_world_1' submitted


## Check the status of your jobs

To see the status of all your jobs **and** remove any failed jobs, run `job_status()`. You can ignore the `💡 Notice:` output unless you are curious about what happens in the background when failed jobs are removed.

The `helloworld` job takes about 5 minutes to complete. You can close this page in the meantime. When you come back, remember to rerun the cells in the `Setup` section before calling any of the three functions.
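
If you would rather wait inside the notebook than check back manually, a polling loop is one option. The sketch below is generic: `count_running` is a hypothetical callable that returns the number of jobs still running. This notebook does not show whether `job_status()` returns such a count (it prints a table), so you would need to supply your own function.

```python
import time
from typing import Callable


def wait_until_done(count_running: Callable[[], int],
                    poll_seconds: float = 60.0,
                    timeout_seconds: float = 3600.0) -> bool:
    """Poll `count_running` until it returns 0 or the timeout expires.

    Returns True if all jobs finished, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if count_running() == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


# Stand-in for a real status check: pretends the jobs finish on the third poll.
remaining = iter([2, 1, 0])
print(wait_until_done(lambda: next(remaining), poll_seconds=0.01))
```

A long `poll_seconds` value keeps the load on the submit node low.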

In [33]:
job_status()

Status of all submitted jobs
                  Running⌛  Completed✅  Failed❌
hello_world_1            0           3        0
metl_sim_job_1           0           1        0
rosetta_download         0           1        0


## Remove all jobs

The final function, `remove_all_condor_jobs()`, removes all jobs regardless of whether they are `Running⌛` or `Failed❌`.

**<span style="color:red">Important</span>**: If you run this command after submitting the jobs above, they will be removed and will then appear in the `Failed❌` column.

In [34]:
remove_all_condor_jobs()

No jobs active on Open Science Grid.


# Environment Setup and Rosetta Software Download 

## Download the Python environments

Running these cells will download the Python environments necessary for running metl-sim. You only need to do this once.

In [35]:
# Example usage:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/rosettafy_env_v0.7.11.tar.gz"
output_path = "downloads/metl-sim_2025-02-13.tar.gz"

download_file(url, output_path, 'curl')

🔔 Notice: File 'downloads/metl-sim_2025-02-13.tar.gz' already exists. No download needed.
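
The notice above shows that the download is skipped when the target file already exists. A minimal sketch of that pattern follows. It is not the notebook's `download_file` (which is called with a `'curl'` argument and whose implementation is not shown here); it uses the standard library's `urllib` instead, and `download_if_missing` is a hypothetical name.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, output_path: str) -> bool:
    """Download `url` to `output_path` unless the file already exists.

    Returns True if a download happened, False if it was skipped.
    HTTP errors such as 404 surface as urllib.error.HTTPError.
    """
    target = Path(output_path)
    if target.exists():
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)  # rename only after the download completed
    return True
```

Writing to a `.part` file first means an interrupted download is never mistaken for a finished one on the next run.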


Next, extract the downloaded archive.

In [7]:
file_path = "downloads/metl-sim_2025-02-13.tar.gz"
extract_dir = "env"

untar_file_with_progress(file_path, extract_dir)

Extracting: 100%|██████████| 51616/51616 [02:47<00:00, 307.90file/s]

✅ Success: File untarred to 'env'.
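
For readers curious how member-by-member extraction with progress reporting can work, here is a minimal sketch using the standard library's `tarfile`. It is an illustration only, not the implementation of `untar_file_with_progress` from `utils`, and it prints periodic counts rather than drawing a progress bar. The name `untar_with_progress` and the `report_every` parameter are hypothetical.

```python
import tarfile


def untar_with_progress(file_path: str, extract_dir: str, report_every: int = 1000) -> int:
    """Extract a tar archive member by member, printing periodic progress.

    Returns the number of members extracted. Only use on archives you trust.
    """
    with tarfile.open(file_path, "r:*") as tar:  # "r:*" auto-detects compression
        members = tar.getmembers()
        total = len(members)
        for i, member in enumerate(members, start=1):
            tar.extract(member, path=extract_dir)
            if i % report_every == 0 or i == total:
                print(f"Extracted {i}/{total}")
    return total
```

Calling `untar_with_progress("downloads/metl-sim_2025-02-13.tar.gz", "env")` would mirror the cell above.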





In [33]:
url = "http://proxy.chtc.wisc.edu/SQUID/bcjohnson7/clean_pdb.tar.gz"
output_path = "downloads/clean_pdb_2025-02-13.tar.gz"

download_file(url, output_path, 'curl')

❌ Failure: HTTP 404 - Could not download the file.


In [8]:
file_path = "downloads/clean_pdb_2025-02-13.tar.gz"
extract_dir = "clean_pdb_env"

untar_file_with_progress(file_path, extract_dir)

Extracting: 100%|██████████| 6513/6513 [00:17<00:00, 365.90file/s]

✅ Success: File untarred to 'clean_pdb_env'.





## Download Rosetta

We cannot download Rosetta directly to the OSG submit node because we would need ~80GB of free space, which exceeds the 50GB disk quota on the submit node. Instead, we will submit a job to download the full version of Rosetta and package a minimal distribution with just the files needed for metl-sim. 

**Note**: This job may take a few hours to run. After submitting it, you can close this window and come back later; use the `job_status()` function to check whether the Rosetta job is still running. You cannot continue to the next step until Rosetta has been downloaded.


**<span style="color:red">NOTE</span>**: By downloading Rosetta, you are subject to the Rosetta licensing agreement: [link](https://github.com/RosettaCommons/rosetta/blob/main/LICENSE.md). The most important point is that the free version of Rosetta can only be used for **non-commercial** purposes. If you wish to use Rosetta for commercial purposes, please consult the licensing agreement.

**Note:** You only need to run these cells once.

In [3]:
rosetta_job_name = 'rosetta_download'
submit_condor_job(job_name=rosetta_job_name, job_type='rosetta_download')

✅ No job named 'rosetta_download' exists'. You can use this job name.
✅ Setting up job type `rosetta_download`
✅ Job name: 'rosetta_download' submitted


In [3]:
job_status()

Status of all submitted jobs
                  Running⌛  Completed✅  Failed❌
rosetta_download         0           1        0


Once `job_status()` shows the job above in the `Completed` column, run the function below to post-process the job's output.

In [4]:
rosetta_job_name = 'rosetta_download'
post_process_rosetta_download(rosetta_job_name)

Extracting: 100%|██████████| 6/6 [01:11<00:00, 11.84s/file]

✅ Success: File untarred to 'condor/rosetta_download/output/rosetta_download'.
✅ Found Rosetta File: condor/rosetta_download/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.aa
✅ File transferred to downloads!
✅ Found Rosetta File: condor/rosetta_download/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.ab
✅ File transferred to downloads!
✅ Found Rosetta File: condor/rosetta_download/output/rosetta_download/output/squid_rosetta/rosetta_min_enc.tar.gz.ac
✅ File transferred to downloads!
Combining split files for rosetta_min_enc.tar.gz
Decrypting Rosetta
OpenSSL 3.1.1 30 May 2023 (Library: OpenSSL 3.1.1 30 May 2023)
✅ Successfully decoded rosetta.

Extracting: 100%|██████████| 24653/24653 [17:14<00:00, 23.84file/s]

✅ Success: File untarred to 'rosetta'.
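
The log above reports three parts (`.aa`, `.ab`, `.ac`) being combined into one archive before decryption. The sketch below shows one way such parts can be joined in Python, equivalent in spirit to `cat file.aa file.ab file.ac > file`. It is an assumption about the general technique, not the code inside `post_process_rosetta_download`, and `combine_split_files` is a hypothetical name. The decryption step is not covered.

```python
import glob
import shutil


def combine_split_files(prefix: str, output_path: str) -> list:
    """Concatenate `prefix.aa`, `prefix.ab`, ... into `output_path`.

    Parts are joined in sorted order, which matches the suffix order
    produced by the Unix `split` command. Returns the list of parts used.
    """
    parts = sorted(glob.glob(glob.escape(prefix) + ".[a-z][a-z]"))
    if not parts:
        raise FileNotFoundError(f"no split parts found for {prefix}")
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, so large parts are fine
    return parts
```

Sorting matters here: joining the parts in any other order would produce a corrupt archive.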





## Set up shell scripts
We also need to set the permissions of all bash scripts so we can run all the following functions. 

In [5]:
# Set permissions for the bash_scripts directory
set_permissions('bash_scripts', '777')
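
For reference, the sketch below shows how an octal mode string such as `'777'` can be applied to every file in a directory with the standard library. It is an illustration under assumptions, not the `set_permissions` function from `utils`; the name `set_permissions_recursive` is hypothetical, and whether the real function recurses into subdirectories is not shown in this notebook.

```python
import os


def set_permissions_recursive(directory: str, mode: str) -> int:
    """Apply an octal mode string such as '755' to every file under `directory`.

    Returns the number of files changed.
    """
    mode_bits = int(mode, 8)  # '755' -> 0o755
    changed = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            os.chmod(os.path.join(root, name), mode_bits)
            changed += 1
    return changed
```

Mode `'777'` makes the scripts writable by every user on the system. Mode `'755'` still lets everyone execute them while restricting writes to the owner; whether the workflow here works with the narrower mode has not been verified.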

# Running metl-sim on OSPool

First, upload your pdb file to the folder `metl-sim/pdb_files/raw_pdb_files`.

Then replace `2qmt.pdb` with your pdb file name in the cell below. The example structure `2qmt.pdb` is the binding domain of Protein G (GB1). 

In [6]:
pdb_file_name = '2qmt.pdb'

## Prepare the PDB file for Rosetta

Before running Rosetta, its developers recommend preprocessing the structure to resolve atom clashes that commonly occur in structures taken from the Protein Data Bank.

Run the cell below to perform this preprocessing. For higher accuracy you can increase the `relax_nstruct` parameter, but compute time grows with the number of structures.

In [7]:
run_prepare_script(
    rosetta_main_dir='notebooks/osg/rosetta/rosetta_minimal',
    pdb_fn=f'pdb_files/raw_pdb_files/{pdb_file_name}',
    relax_nstruct=1,
    out_dir_base='output/prepare_outputs',
    conda_pack_env="notebooks/osg/clean_pdb_env"
)

✅ Prepare.py executed successfully!
✅ File transferred to pdb_files/prepared_pdb_files!


## Generate variants

We need to generate the protein variants that we want to model with Rosetta. Below, you can specify the number of variants to generate with the parameter `variants_to_generate`. We recommend generating at least 100,000 variants, but you may need more for best results, especially for larger proteins. The `max_subs` and `min_subs` parameters set the maximum and minimum number of mutations per variant. We recommend `max_subs=5` and `min_subs=2`. The cell below uses much smaller values so that the example runs quickly.

In [8]:
variants_to_generate = 2
run_variant_script(
    pdb_fn=f'pdb_files/prepared_pdb_files/{pdb_file_name.split(".")[0]}_p.pdb',
    variants_to_generate=variants_to_generate,
    max_subs=1,
    min_subs=1,
    seed=2
)

✅ A variant file with these parameters doesn't exist:
	 variants to generate: 2
	 maximum substitutions: 1
	 minimum substitutions: 1
	 random seed: 2
	 filename: 2qmt_p_subvariants_TN-2_MAXS-1_MINS-1_filtered-DB-0-2qmt_p_RS-2.txt
--> Generating variants now ⌛
✅ Successfully generate variants!
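
To make the roles of `min_subs`, `max_subs`, and `seed` concrete, here is a simplified sketch of random variant sampling. It is not metl-sim's generator: the function names are hypothetical, the input is a toy sequence rather than a PDB file, and two conventions are assumptions (1-based positions, and commas separating multiple substitutions within one variant).

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_variant(sequence: str, min_subs: int, max_subs: int, rng: random.Random) -> str:
    """Return one variant such as 'Y3G,L5A' with min_subs..max_subs substitutions."""
    num_subs = rng.randint(min_subs, max_subs)
    positions = sorted(rng.sample(range(len(sequence)), num_subs))
    subs = []
    for pos in positions:
        wild_type = sequence[pos]
        mutant = rng.choice([aa for aa in AMINO_ACIDS if aa != wild_type])
        subs.append(f"{wild_type}{pos + 1}{mutant}")  # 1-based position (assumed)
    return ",".join(subs)


def generate_variants(sequence: str, n: int, min_subs: int, max_subs: int, seed: int) -> list:
    """Generate `n` distinct random variants, reproducibly for a given seed.

    Assumes `n` is far below the number of possible variants.
    """
    rng = random.Random(seed)
    variants = set()
    while len(variants) < n:
        variants.add(random_variant(sequence, min_subs, max_subs, rng))
    return sorted(variants)


# Toy 10-residue sequence, for illustration only.
print(generate_variants("MQYKLILNGK", n=3, min_subs=2, max_subs=5, seed=2))
```

Fixing the seed makes the sampled set reproducible, which is presumably why the seed appears in the variant filename.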


## Set up the metl-sim job

In [9]:
pdb_file_name = "2qmt.pdb"
job_name = "metl_sim_job_1"
variant_fns = ["2qmt_p_subvariants_TN-2_MAXS-1_MINS-1_filtered-DB-0-2qmt_p_RS-2.txt"]

prepare_rosetta_run(job_name, pdb_file_name, variant_fns)

✅ Job name metl_sim_job_1 is available, preparing rosetta job
✅ Variant file exists: 2qmt_p_subvariants_TN-2_MAXS-1_MINS-1_filtered-DB-0-2qmt_p_RS-2.txt
✅ Total number of variants: 2
✅ Successfully prepared OSG run!
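
The variant filename appears to encode the generation parameters. Comparing the one filename shown here with the parameters printed when it was created suggests that `TN` is the number of variants, `MAXS` and `MINS` are the substitution limits, and `RS` is the random seed. That mapping is inferred from this single example, not from metl-sim documentation. Under that assumption, a small hypothetical parser can help confirm you are passing the intended file:

```python
import re

# Matches fields such as "_TN-2" or "_RS-2"; the field meanings are inferred.
FIELD_PATTERN = re.compile(r"_(TN|MAXS|MINS|RS)-(\d+)")


def parse_variant_filename(filename: str) -> dict:
    """Extract the integer fields TN, MAXS, MINS and RS from a variant filename."""
    return {key: int(value) for key, value in FIELD_PATTERN.findall(filename)}


info = parse_variant_filename(
    "2qmt_p_subvariants_TN-2_MAXS-1_MINS-1_filtered-DB-0-2qmt_p_RS-2.txt"
)
print(info)
```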


## Submit the metl-sim job

This function submits the metl-sim job to the OSG OSPool. After running the following cell, you should be able to run `job_status()` and see that your job is likely `Running`. If running many variants, the job may take a long time to complete.

In [10]:
submit_condor_job(job_name='metl_sim_job_1', job_type='relax')

✅ No job named 'metl_sim_job_1' exists'. You can use this job name.
✅ Setting up job type `relax`

Extracting: 100%|██████████| 2/2 [00:00<00:00, 449.82file/s]

✅ Success: File untarred to 'condor/metl_sim_job_1'.

✅ Job name: 'metl_sim_job_1' submitted


As before, `job_status()` lists the running, completed, and failed jobs among everything you have submitted. Your job should appear in the running column until it finishes; the output below was captured after the job had completed.

In [12]:
job_status()

Status of all submitted jobs
                  Running⌛  Completed✅  Failed❌
metl_sim_job_1           0           1        0
rosetta_download         0           1        0


## Post process the metl-sim job

In [19]:
df = run_post_process_script(job_name='metl_sim_job_1')

✅ Successfully post process job name metl_sim_job_1


The variable `df` is a pandas DataFrame containing the Rosetta scores for each variant. The cell below shows the Rosetta `total_score` for the first 100 variants that were processed (this example run has only two).

In [22]:
df[['variant','total_score']].head(100)

|   | variant | total_score |
|--:|:--------|------------:|
| 0 | D47G    | -164.827    |
| 1 | L7G     | -170.524    |


The full dataframe contains all the computed energies.

In [23]:
df

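|   | pdb_fn | variant | job_uuid | start_time | run_time | mutate_run_time | relax_run_time | filter_run_time | centroid_run_time | total_score | ... | env | hs_pair | linear_chainbreak | overlap_chainbreak | pair | rg | rsigma | sheet | ss_pair | vdw |
|--:|:-------|:--------|:---------|:-----------|---------:|----------------:|---------------:|----------------:|------------------:|------------:|:---:|----:|--------:|------------------:|-------------------:|-----:|---:|-------:|------:|--------:|----:|
| 0 | 2qmt_p.pdb | D47G | oSamHFNbmHG6 | 2025-02-18 16:38:54 | 39 | 12 | 12 | 13 | 0 | -164.827 | ... | -19.905 | -4.264 | 0.0 | 0.0 | 0.094 | 33.156 | -21.048 | 0.343 | -31.367 | 0.0 |
| 1 | 2qmt_p.pdb | L7G | oSamHFNbmHG6 | 2025-02-18 16:39:34 | 39 | 12 | 13 | 13 | 0 | -170.524 | ... | -21.988 | -4.059 | 0.0 | 0.0 | -1.545 | 32.87 | -22.924 | 0.343 | -33.863 | 0.0 |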


# Issues or questions

Please reach out to us on the [metl-sim GitHub](https://github.com/gitter-lab/metl-sim/issues).