# Rosetta Protein-folding workflow

This is a Pegasus workflow for running Rosetta's de novo structure prediction on the OSG. The workflow predicts the three-dimensional structure of a protein from its amino acid sequence, using the [Abinitio Relax](https://new.rosettacommons.org/docs/latest/application_documentation/structure_prediction/abinitio-relax#algorithm) algorithm. It draws on ideas from this [tutorial](https://www.rosettacommons.org/demos/latest/tutorials/denovo_structure_prediction/Denovo_structure_prediction).

> Please run the workflow from your [OSG Connect](https://www.osgconnect.net) account. Anyone with a U.S. research affiliation can get access.

## Configure Input files
You will need to have a license to download Rosetta. See the [Rosetta documentation](https://www.rosettacommons.org/demos/latest/tutorials/install_build/install_build) for details on how to obtain the license. Once you have the license, you can download the Rosetta software suite from https://www.rosettacommons.org/software/license-and-download.

Untar the downloaded file by running this command in your terminal:

```tar -xvzf rosetta[releasenumber].tar.gz```

### Binaries

The ab initio executable can be found in ```rosetta*/main/source/bin```. Navigate to this directory and copy the AbinitioRelax file to the ```bin``` directory of the rosetta_workflow. Make sure the file name in the last line of proteinfold.sh matches the one you copied. 

### Database
The Pegasus workflow takes the Rosetta database as a tarball. Create a tar file of the ```database``` folder found in ```rosetta*/main``` and place it in the ```database``` directory of the workflow:

```cd [path to rosetta*]/main/ && tar -czf [path to rosetta workflow]/database/database.tar.gz database```
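
Because the jobs later pass `-database ./database` to Rosetta, the tarball presumably needs to unpack into a single top-level `database` folder. The sketch below is one way to confirm that with Python's standard `tarfile` module; `top_level_entries` is a hypothetical helper name, not part of Rosetta or Pegasus.

```python
import tarfile


def top_level_entries(tarball_path):
    """Return the set of top-level names stored in a tar archive."""
    with tarfile.open(tarball_path, "r:*") as tar:  # "r:*" auto-detects compression
        names = tar.getnames()
    entries = set()
    for name in names:
        # Some archives store members as "./database/..."; drop that prefix
        if name.startswith("./"):
            name = name[2:]
        if name and name != ".":
            entries.add(name.split("/", 1)[0])
    return entries
```

For a tarball created with the command above, `top_level_entries("database/database.tar.gz")` should contain only `database`.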

### Data inputs
Each job in the Rosetta workflow requires the following input files for one amino acid sequence:

* Fasta file - Example: 1elwA.fasta

* Fragments files - Example: aa1elwA03_05.200_v1_3 and aa1elwA09_05.200_v1_3

* PDB file - Example: 1elw.pdb

* PSIPRED secondary structure prediction (psipred_ss2) file - Example: 1elwA.psipred_ss2

> **Note**: Rename the input files so that they share the same base name, and give the folder that contains them the same name.
>
> Example: the folder data-1 contains data-1.fasta, data-1.pdb, data-1.psipred_ss2, data-1-09_05.200_v1_3 and data-1-03_05.200_v1_3.

Create a gzip-compressed tarball from each folder ```data-<i>``` that contains the input files for one sequence, and place it in the ```inputs``` directory of the workflow:

```tar -czf data-<i>.tar.gz data-<i>```
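
When there are many sequences, the naming check and the packaging can be scripted. The following is a minimal sketch using only the standard library; `package_input_dir` and `REQUIRED_SUFFIXES` are hypothetical names, and the suffix list is taken from the naming example above, so adjust it if your fragment files are named differently.

```python
import tarfile
from pathlib import Path

# Suffixes from the naming example above (assumption: yours follow the same pattern)
REQUIRED_SUFFIXES = (
    ".fasta",
    ".pdb",
    ".psipred_ss2",
    "-03_05.200_v1_3",
    "-09_05.200_v1_3",
)


def package_input_dir(data_dir, dest_dir):
    """Check that data_dir holds all five renamed inputs, then tar and gzip it."""
    data_dir = Path(data_dir)
    base = data_dir.name  # e.g. "data-1"
    missing = [
        base + suffix
        for suffix in REQUIRED_SUFFIXES
        if not (data_dir / (base + suffix)).is_file()
    ]
    if missing:
        raise FileNotFoundError(f"{data_dir} is missing: {', '.join(missing)}")

    tarball = Path(dest_dir) / f"{base}.tar.gz"
    # "w:gz" writes a gzip-compressed archive, like `tar -czf`
    with tarfile.open(tarball, "w:gz") as tar:
        tar.add(data_dir, arcname=base)
    return tarball
```

For example, `package_input_dir("data-1", "inputs")` would write `inputs/data-1.tar.gz`, or raise an error that lists the files that still need renaming.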


### 1. Creating the workflow
We use the Pegasus Workflow API to create the workflow for Rosetta protein folding and structure prediction.

In [None]:
import logging
import glob
import os
import getpass
from pathlib import Path

from Pegasus.api import *

logging.basicConfig(level=logging.INFO)

# --- Working Directory Setup --------------------------------------------------
# A good working directory for workflow runs and output files
WORK_DIR = Path.home() / "workflows"
WORK_DIR.mkdir(exist_ok=True)

TOP_DIR = Path().resolve()

# --- Properties ---------------------------------------------------------------
props = Properties()
props["pegasus.data.configuration"] = "condorio"  

# Provide a full kickstart record, including the environment, even for successful jobs
props["pegasus.gridstart.arguments"] = "-f"

# Limit the number of idle jobs for large workflows
props["dagman.maxidle"] = "1600"

# Help Pegasus developers by sharing performance data (optional)
props["pegasus.monitord.encoding"] = "json"
props["pegasus.catalog.workflow.amqp.url"] = "amqp://friend:donatedata@msgs.pegasus.isi.edu:5672/prod/workflows"

# write properties file to ./pegasus.properties
props.write()


### 2. Site Catalog

In [None]:
# --- Sites --------------------------------------------------------------------
sc = SiteCatalog()

# local site (submit machine)
local_site = Site(name="local", arch=Arch.X86_64)

local_shared_scratch = Directory(directory_type=Directory.SHARED_SCRATCH, path=WORK_DIR / "scratch")
local_shared_scratch.add_file_servers(FileServer(url="file://" + str(WORK_DIR / "scratch"), operation_type=Operation.ALL))
local_site.add_directories(local_shared_scratch)

local_storage = Directory(directory_type=Directory.LOCAL_STORAGE, path=TOP_DIR / "outputs")
local_storage.add_file_servers(FileServer(url="file://" + str(TOP_DIR / "outputs"), operation_type=Operation.ALL))
local_site.add_directories(local_storage)

local_site.add_env(PATH=os.environ["PATH"])
sc.add_sites(local_site)

# condorpool (execution site)
condorpool_site = Site(name="condorpool", arch=Arch.X86_64, os_type=OS.LINUX)
condorpool_site.add_pegasus_profile(style="condor")
condorpool_site.add_condor_profile(
    universe="vanilla",
    request_cpus=3,
    request_memory="3 GB",
    request_disk="10000000",
)

sc.add_sites(condorpool_site)

# write SiteCatalog to ./sites.yml
sc.write()

### 3. Transformation Catalog

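Note that clustering is enabled in the Transformation Catalog section of the workflow. The ```clusters_size=10``` profile tells Pegasus to group up to 10 ```proteinfold``` jobs into a single clustered job, which reduces scheduling overhead when individual jobs are short: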

        proteinfold = Transformation(
            name="proteinfold",
            site="local",
            pfn=TOP_DIR / "bin/proteinfold.sh",
            is_stageable=True,
            arch=Arch.X86_64,
        ).add_pegasus_profile(clusters_size=10)

To disable clustering, set ```clusters_size``` to 1. Experiment with different values for ```clusters_size``` and observe how it affects the time required for the jobs to finish.
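
As a rough guide when choosing a value, the number of clustered jobs can be estimated from the number of input tarballs. This is only an estimate, assuming all jobs sit at the same workflow level and run on one site, as the independent `proteinfold` jobs here do; `clustered_job_count` is a hypothetical helper, not a Pegasus function.

```python
import math


def clustered_job_count(num_jobs, clusters_size):
    """Estimate how many clustered jobs result when up to clusters_size jobs are grouped."""
    if clusters_size < 1:
        raise ValueError("clusters_size must be at least 1")
    return math.ceil(num_jobs / clusters_size)
```

With 25 input tarballs, `clustered_job_count(25, 10)` gives 3 clustered jobs, while `clustered_job_count(25, 1)` gives 25, which is the unclustered case.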


In [None]:
# --- Transformations ----------------------------------------------------------
proteinfold = Transformation(
    name="proteinfold",
    site="local",
    pfn=TOP_DIR / "bin/proteinfold.sh",
    is_stageable=True,
    arch=Arch.X86_64).add_pegasus_profile(clusters_size=10)

tc = TransformationCatalog()
tc.add_transformations(proteinfold)

# write TransformationCatalog to ./transformations.yml
tc.write()

### 4. Replica Catalog

In [None]:
# --- Replicas -----------------------------------------------------------------
exec_file = [File(f.name) for f in (TOP_DIR / "bin").iterdir() if f.name.startswith("AbinitioRelax")]

input_files = [File(f.name) for f in (TOP_DIR / "inputs").iterdir()]

db_files = [File(f.name) for f in (TOP_DIR / "database").iterdir()]

rc = ReplicaCatalog()

for f in input_files:
    rc.add_replica(site="local", lfn=f, pfn=TOP_DIR / "inputs" / f.lfn)

for f in exec_file:
    rc.add_replica(site="local", lfn=f, pfn=TOP_DIR / "bin" / f.lfn)

for f in db_files:
    rc.add_replica(site="local", lfn=f, pfn=TOP_DIR / "database" / f.lfn)

# write ReplicaCatalog to replicas.yml
rc.write()
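
The job definitions in the next step use only the first entry found in `bin` and `database`, and derive each base name by stripping `.tar.gz` from the input file name. A quick check of the directory layout before planning can catch a missing or stray file early. This is a sketch under those assumptions; `check_workflow_layout` is a hypothetical helper that uses only `pathlib`.

```python
from pathlib import Path


def check_workflow_layout(top_dir):
    """Return a list of problems found in the bin, database, and inputs directories."""
    top_dir = Path(top_dir)

    def files_in(subdir):
        directory = top_dir / subdir
        if not directory.is_dir():
            return []
        return sorted(p.name for p in directory.iterdir() if p.is_file())

    problems = []

    binaries = [n for n in files_in("bin") if n.startswith("AbinitioRelax")]
    if len(binaries) != 1:
        problems.append(f"expected one AbinitioRelax* file in bin/, found {len(binaries)}")

    db_files = files_in("database")
    if len(db_files) != 1:
        problems.append(f"expected one tarball in database/, found {len(db_files)}")

    inputs = files_in("inputs")
    if not inputs:
        problems.append("inputs/ contains no files")
    for name in inputs:
        if not name.endswith(".tar.gz"):
            problems.append(f"inputs/{name} does not end in .tar.gz")

    return problems
```

Calling `check_workflow_layout(TOP_DIR)` returns an empty list when the layout matches what the later cells expect.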

### 5. Adding jobs to the workflow

In [None]:
# --- Workflow -----------------------------------------------------------------
wf = Workflow(name="protein-folding-workflow")

for f in input_files:
    filename = f.lfn.replace(".tar.gz","")
    out_file = File(filename + "_silent.out")

    proteinfold_job = (
        Job(proteinfold)
        .add_args(
            filename, "-database ./database",
            "-in:file:fasta", f"./{filename}.fasta",
            "-in:file:frag3", f"./{filename}-03_05.200_v1_3",
            "-in:file:frag9", f"./{filename}-09_05.200_v1_3",
            "-in:file:native", f"./{filename}.pdb",
            "-abinitio:relax", "-nstruct", "1",
            "-out:file:silent", out_file,
            "-use_filters", "true", "-psipred_ss2", f"./{filename}.psipred_ss2",
            "-abinitio::increase_cycles", "10",
            "-abinitio::rg_reweight", "0.5",
            "-abinitio::rsd_wt_helix", "0.5", "-abinitio::rsd_wt_loop", "0.5",
            "-relax::fast",
        )
        .add_inputs(exec_file[0], db_files[0], f)
        .add_outputs(out_file)
    )
    wf.add_jobs(proteinfold_job)
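
One way to keep the long flag list readable and easy to inspect is to build it in a small helper. The sketch below copies the flags from the job above, listing each once; `abinitio_args` is a hypothetical name, not part of Pegasus or Rosetta, and it takes the output name as a plain string rather than a `File` object.

```python
def abinitio_args(base, silent_out):
    """Return the AbinitioRelax flags for one input set as a flat list of strings."""
    return [
        "-database", "./database",
        # Input files, all named after the shared base name
        "-in:file:fasta", f"./{base}.fasta",
        "-in:file:frag3", f"./{base}-03_05.200_v1_3",
        "-in:file:frag9", f"./{base}-09_05.200_v1_3",
        "-in:file:native", f"./{base}.pdb",
        "-psipred_ss2", f"./{base}.psipred_ss2",
        # One structure per job, written to a silent file
        "-abinitio:relax",
        "-nstruct", "1",
        "-out:file:silent", silent_out,
        "-use_filters", "true",
        # Sampling and scoring settings
        "-abinitio::increase_cycles", "10",
        "-abinitio::rg_reweight", "0.5",
        "-abinitio::rsd_wt_helix", "0.5",
        "-abinitio::rsd_wt_loop", "0.5",
        "-relax::fast",
    ]
```

Printing `" ".join(abinitio_args("data-1", "data-1_silent.out"))` shows the command-line tail for one sequence. Inside the loop, the list could be passed as `Job(proteinfold).add_args(filename, *abinitio_args(filename, out_file.lfn))`, which should yield the same arguments as the explicit call.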

### 6. Submit the workflow and launch pilot jobs on ACCESS resources

In [None]:
# plan and run the workflow
wf.plan(
    dir=WORK_DIR / "runs",
    sites=["condorpool"],
    staging_sites={"condorpool":"local"},
    output_sites=["local"],
    cluster=["horizontal"],
    submit=True
)


Note that we are running jobs on the site condorpool, i.e., the selected ACCESS resource. After the workflow has been successfully planned and submitted, you can use ```wf.status()``` to monitor it. The output shows the number of jobs in each state, including whether jobs are idle or running.

At this point you should have some idle jobs in the queue. They are idle because there are no resources yet to execute on. Resources can be brought in with the HTCondor Annex tool, by sending pilot jobs (also called glideins) to the ACCESS resource providers. These pilots have the following properties:

The process of starting pilots is described in the [ACCESS Pegasus Documentation](https://xsedetoaccess.ccs.uky.edu/confluence/redirect/ACCESS+Pegasus.html).



### 7. Statistics
What to do next depends on whether the workflow finished successfully. If the workflow failed, you can use ```wf.analyze()``` to get help finding out what went wrong. If it finished successfully, you can pull out some statistics using ```wf.statistics()```.