# Alphafold workflow
This notebook implements an Alphafold workflow to demonstrate [Parsl Python parallel scripting](https://parsl-project.org/) in a Jupyter notebook orchestrating a batch of jobs on a SLURM cluster. The goal is to demonstrate Alphafold running multiple proteins at the same time.

## Step 1: Define workflow inputs
This PW workflow can be launched either from its form in the `Compute` tab or directly from this notebook. When running directly from the notebook, the user must take the extra step of defining the workflow inputs in the notebook. Currently, the workflow has only one input: a list of `.fasta` files.
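As a minimal sketch of how such a list file might be read, the helper below strips surrounding whitespace and skips blank lines, so that each entry can later be used to build a path. The function name `read_fasta_list` and the demo file contents are illustrative assumptions, not part of the workflow.

```python
import tempfile
from pathlib import Path


def read_fasta_list(list_path):
    """Return the non-empty, whitespace-stripped lines of a list file."""
    names = []
    for line in Path(list_path).read_text().splitlines():
        name = line.strip()
        if name:  # skip blank lines
            names.append(name)
    return names


# Demo with a throwaway file; the protein file names are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "fasta_list.txt"
    demo_file.write_text("protein_a.fasta\n\n  protein_b.fasta  \n")
    demo_names = read_fasta_list(demo_file)

print(demo_names)
```

Stripping at read time matters because `readlines()` keeps the trailing newline on every entry, which would otherwise end up inside the command line built in Step 3.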

In [1]:
# Experimenting here with automatically loading the 
# Parsl configuration from a file generated by the
# resource.
#import utils.parslpw

In [2]:
import os
from os.path import exists
import argparse

print('Define workflow inputs...')

# Start assuming workflow is launched from the notebook.
run_in_notebook=True

if (run_in_notebook):
    print("Running from a notebook. Use the file specified by the user below.")
    
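    # Each non-empty line in this file is a protein to run with Alphafold.
    with open("./example/fasta_list.txt","r") as f:
        # Strip trailing newlines so each entry can be used to build a path.
        fasta_list = [line.strip() for line in f if line.strip()]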
        
else:
    print("Running from a form. Use the file specified by the user on the form.")
    # Get command line args from the PW form.
    # (run_in_notebook is already False in this branch.)
    parser=argparse.ArgumentParser()
    parsed, unknown = parser.parse_known_args()
    for arg in unknown:
        if arg.startswith(("-", "--")):
            parser.add_argument(arg)
    pwargs=parser.parse_args()
    print("pwargs:",pwargs)

Define workflow inputs...
Running from a notebook. Use the file specified by the user below.
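The form-launch branch above accepts arbitrary `--flag value` pairs by parsing twice: a first pass collects the unknown tokens, each flag is then registered, and a second pass parses them normally. The sketch below isolates that pattern so it can be tried without a form. The function name `parse_form_args` and the flags `--fasta_list` and `--partition` are hypothetical; the flags the PW form actually passes are not shown in this notebook.

```python
import argparse


def parse_form_args(argv):
    """Accept whatever --flags appear in argv, as the form-launch branch does."""
    parser = argparse.ArgumentParser()
    # First pass: nothing is registered yet, so every token lands in `unknown`.
    _, unknown = parser.parse_known_args(argv)
    for arg in unknown:
        if arg.startswith("-"):
            # Register the flag name only, so "--key=value" also works.
            parser.add_argument(arg.split("=")[0])
    # Second pass: the flags are now known and parse normally.
    return parser.parse_args(argv)


# Hypothetical form arguments, passed explicitly instead of via sys.argv.
demo_args = parse_form_args(
    ["--fasta_list", "./example/fasta_list.txt", "--partition=ht"]
)
print(demo_args.fasta_list, demo_args.partition)
```

This sketch does not handle a flag that appears twice; registering the same flag a second time would make `argparse` raise a conflict error.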


## Step 2: Configure Parsl
The Alphafold application itself is in a Singularity container.  Instructions for building and testing this container are in `./container/README.md`.

The configuration below tells Parsl, the Python parallel scripting library, what kind of compute resources we are using. It is very similar to the `#SBATCH` commands at the top of an `sbatch` script (e.g. `./container/sbatch_example.sh`).
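To make the comparison with `#SBATCH` lines concrete, the sketch below renders a few provider settings as the directives they roughly correspond to. This is an illustration under stated assumptions, not the script Parsl actually generates: the function name `sbatch_header` is hypothetical, and the mapping (memory in GB, `cores_per_node` as `--cpus-per-task`) follows the comments in this notebook and the usual SLURM option names.

```python
def sbatch_header(partition, nodes, cpus_per_task, mem_gb, walltime, exclusive=False):
    """Render provider-style settings as the #SBATCH lines they resemble."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",          # partition
        f"#SBATCH --nodes={nodes}",                  # nodes_per_block
        f"#SBATCH --cpus-per-task={cpus_per_task}",  # cores_per_node
        f"#SBATCH --mem={mem_gb}g",                  # mem_per_node (assumed GB)
        f"#SBATCH --time={walltime}",                # walltime
    ]
    if exclusive:
        lines.append("#SBATCH --exclusive")          # exclusive=True
    return "\n".join(lines)


# Values mirror the provider settings used in this step.
demo_header = sbatch_header("ht", 1, 16, 120, "12:00:00", exclusive=False)
print(demo_header)
```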

In [3]:
# Parsl essentials
import parsl
from parsl.app.app import python_app, bash_app
from parsl.data_provider.files import File
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider
from parsl.launchers import SrunLauncher
from parsl.channels import SSHChannel,SSHInteractiveLoginChannel
from parsl.data_provider.file_noop import NoOpFileStaging
import logging # Needed for parsl.set_file_logger

# For embedding Design Explorer results in notebook
from IPython.display import display, HTML

# Checking inputs from the WORKFLOW FORM
if (not run_in_notebook):
    print(pwargs)

# Start logging
parsl.set_stream_logger(level=logging.INFO)

#==============================
# Key setup parameters
#==============================
slurm_user = 'gstefan'
home_dir = '/gs/gsfs0/users/'+slurm_user+'/'
workflow_dir = home_dir+'pw/workflows/alphafold-notebook-demo/'

# Other customizations, not critical.
work_dir = 'parsl_work'
log_dir = 'parsl_log'
script_dir = 'parsl_script'

print("Configuring Parsl...")
config = Config(
    run_dir='./local_parsl_logs',          # Defaults to ./runinfo
    executors=[
        HighThroughputExecutor(
            label=slurm_user+'_slurm',         # This value names the pools in the log dirs
            address='einsteinmed-submit.parallel.works',  # IP or hostname visible to the workers/remote resource for connecting back to interchange!!
            worker_debug=False,             # Default False for shorter logs
            max_workers=int(100),
            mem_per_worker=int(1),         # Used to find number of slots per node
            cores_per_worker=int(1),       # Per the Parsl docs this does NOT correspond to --cpus-per-task. It bypasses the SLURM options and is passed to process_worker_pool.py as -c; it sets the number of workers, not the cores requested from SLURM.
            working_dir = home_dir+work_dir,
            worker_logdir_root = home_dir+log_dir,
            provider = SlurmProvider(
                partition = 'ht',          # Cluster specific! Needs to match GPU availability and the RAM-per-CPU limits specified for the partition.
                channel=SSHChannel(
                    hostname='einsteinmed-submit.parallel.works', #slurm_user+'.hpc.einsteinmed.edu',
                    username=slurm_user,
                    key_filename='/gs/gsfs0/users/'+slurm_user+'/.ssh/pw_id_rsa',
                    script_dir=home_dir+script_dir
                ),
                worker_init='source /gs/gsfs0/users/'+slurm_user+'/pworks/bootstrap.sh; source /gs/gsfs0/hpc01/rhel8/apps/conda3/etc/profile.d/conda.sh; conda activate /gs/gsfs0/users/'+slurm_user+'/pw/parsl-1.2',
                init_blocks=1,
                mem_per_node = int(120),
                nodes_per_block = int(1),
                cores_per_node = int(16),   # Corresponds to --cpus-per-task                                                                                                                         
                min_blocks = int(1),
                max_blocks = int(10),
                parallelism = 1,           # Was 0.80, 1 is "use everything you can NOW"                                                                                                            
                exclusive = False,         # Default is T, hard to get workers on shared cluster                                                                                                    
                walltime='12:00:00',       # Will limit job to this run time, 10 min default Parsl                                                                                                  
                launcher=SrunLauncher()    # defaults to SingleNodeLauncher() which seems to work
            ),
            storage_access=[NoOpFileStaging()]
        )
    ]
)
parsl.load(config)
print("Parsl configuration loaded")
2022-08-13 04:15:48 parsl.dataflow.dflow:86 [INFO]  Parsl version: 1.2.0
2022-08-13 04:15:48 parsl.dataflow.dflow:114 [INFO]  Run id is: 2373a088-5f4e-47bc-b959-19c8e543810e
2022-08-13 04:15:49 parsl.dataflow.memoization:164 [INFO]  App caching initialized


Configuring Parsl...


2022-08-13 04:15:51 parsl.providers.slurm.slurm:236 [ERROR]  Retcode:1 STDOUT: STDERR:sbatch: error: If munged is up, restart with --num-threads=10
sbatch: error: Munge encode failed: Failed to access "/var/run/munge/munge.socket.2": No such file or directory
sbatch: error: slurm_send_node_msg: g_slurm_auth_create: REQUEST_SUBMIT_BATCH_JOB has authentication error: Invalid authentication credential
sbatch: error: Batch job submission failed: Protocol authentication error


Submission of command to scale_out failed
Parsl configuration loaded


2022-08-13 04:15:55 parsl.executors.status_handling:111 [ERROR]  Setting bad state due to exception
Exception: 1. Failed to start block 0: Executor slurm failed due to: Attempts to provision nodes via provider has failed

2022-08-13 04:16:00 parsl.executors.status_handling:111 [ERROR]  Setting bad state due to exception
Exception: 1. Failed to start block 0: Executor slurm failed due to: Attempts to provision nodes via provider has failed

2022-08-13 04:16:05 parsl.executors.status_handling:111 [ERROR]  Setting bad state due to exception
Exception: 1. Failed to start block 0: Executor slurm failed due to: Attempts to provision nodes via provider has failed



## Step 3: Define Parsl workflow apps
These apps are decorated with Parsl's `@bash_app` and are therefore executed in parallel on the compute resources defined in the Parsl configuration loaded above. The contents of a `@bash_app` are equivalent to the contents of an `sbatch` script. The difference is that, on a SLURM cluster, Parsl builds the `sbatch` script for you and submits the job(s). This job submission happens during Step 4, below.

In [None]:
print("Defining Parsl workflow apps...")

#===================================
# Alphafold app
#===================================
@bash_app
def run_alphafold(stdout='run.af.stdout', stderr='run.af.stderr', inputs=[], outputs=[]):
    return '''
    python run_singularity_container.py \
     --data_dir=/public/apps/alphafold/databases \
     --fasta_paths=/gs/gsfs0/users/gstefan/work/alphafold/input/%s \
     --max_template_date=2022-07-22 \
     --output_dir=/gs/gsfs0/users/gstefan/work/alphafold/output
    ''' % (inputs[0])
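The command string that such an app returns can be inspected without a cluster. The sketch below rebuilds the same template as a plain function, with the input and output directories as parameters. The function name `build_alphafold_command` and the `/tmp/...` paths are illustrative assumptions. Note that inside a regular (non-raw) triple-quoted string, a backslash at the end of a line is a Python line continuation, so the shell receives a single line.

```python
def build_alphafold_command(fasta_name, input_dir, output_dir):
    """Return the shell command for one protein, mirroring the app's template."""
    # Each trailing backslash joins the next line into the same string line,
    # so the result contains no embedded newlines between the options.
    return '''
    python run_singularity_container.py \
     --data_dir=/public/apps/alphafold/databases \
     --fasta_paths=%s/%s \
     --max_template_date=2022-07-22 \
     --output_dir=%s
    ''' % (input_dir, fasta_name, output_dir)


# Hypothetical directories and file name, for illustration only.
demo_command = build_alphafold_command("protein_a.fasta", "/tmp/in", "/tmp/out")
demo_tokens = demo_command.split()
print(demo_tokens)
```

Printing the tokens this way is a quick check that the file name contains no stray newline or leading directory before any jobs are submitted.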

## Step 4: Workflow
This cell executes the workflow itself.

In [None]:
#============================================================================
print("Running Alphafold workflow...")
#============================================================================
# For each entry in fasta_list, launch one Alphafold run.
# This list stores the futures returned by the Parsl app.
futures = []
for fasta in fasta_list:
    # Launch Alphafold. Pass only the file name because run_alphafold
    # prepends the input directory; strip() removes any trailing newline.
    futures.append(run_alphafold(inputs=[fasta.strip()]))

# Call result() on every future so that execution
# waits for all Alphafold runs to complete.
for instance in futures:
    instance.result()

print('All proteins done running.')
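The submit-then-wait pattern used in this step can be tried locally. The sketch below uses the standard library's `ThreadPoolExecutor` as a stand-in for Parsl (it is not Parsl itself), and `fake_alphafold` is a made-up placeholder for the real app. It also shows one way to record a failure per protein instead of stopping at the first exception raised by `result()`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_alphafold(fasta_name):
    """Placeholder task; the real workflow would launch the container."""
    if not fasta_name.endswith(".fasta"):
        raise ValueError("not a .fasta file: " + fasta_name)
    return fasta_name + ": done"


def run_all(fasta_names):
    """Submit every task first, then wait on each future in turn."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Submitting returns immediately; tasks run concurrently.
        futures = {name: pool.submit(fake_alphafold, name) for name in fasta_names}
        for name, future in futures.items():
            try:
                results[name] = future.result()  # blocks until this task ends
            except Exception as exc:
                failures[name] = str(exc)
    return results, failures


demo_results, demo_failures = run_all(["a.fasta", "b.fasta", "notes.txt"])
print(demo_results)
print(demo_failures)
```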

## Step 5: View results
3D interactive visualization of proteins will be added later.

## Step 6: Clean up
This step is only necessary when running directly in a notebook. Intermediate and log files are removed to keep the workflow file structure clean in case this workflow is pushed to the PW Market Place. Feel free to comment out these lines to inspect intermediate files as needed.

In [4]:
if (run_in_notebook):
    # Delete intermediate files/logs that are NOT core code or results
    !rm -rf runinfo
    !rm -rf __pycache__
    !rm -rf parsl-task.*
    !rm -rf *.pid
    !rm -rf *.started
    !rm -rf *.ended
    !rm -rf *.cancelled
    !rm -rf *.cogout
    !rm -rf lastid*
    !rm -rf launchcmd.*
    !rm -rf parsl-htex-worker.sh
    # Retain pw.conf if re-running this notebook on the 
    # same resource and there is no resource Off/On cycling.
    # (See README.md for more information.)
    !rm -rf pw.conf
    # Delete outputs
    #!rm -rf ./results
    #!rm -f mdlite_dex.*
    
    # shut down the parsl executor
    parsl.dfk().cleanup()

2022-08-13 04:16:10 parsl.executors.status_handling:111 [ERROR]  Setting bad state due to exception
Exception: 1. Failed to start block 0: Executor slurm failed due to: Attempts to provision nodes via provider has failed

2022-08-13 04:16:10 parsl.dataflow.dflow:1057 [INFO]  DFK cleanup initiated
2022-08-13 04:16:10 parsl.dataflow.dflow:947 [INFO]  Summary of tasks in DFK:
2022-08-13 04:16:10 parsl.dataflow.dflow:967 [INFO]  End of summary
2022-08-13 04:16:10 parsl.dataflow.dflow:1081 [INFO]  Terminating flow_control and strategy threads
2022-08-13 04:16:10 parsl.dataflow.dflow:1111 [INFO]  DFK cleanup complete
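The `!rm -rf` lines in the clean-up cell rely on the notebook's shell escape. As a hedged alternative, the sketch below does the same kind of pattern-based removal in plain Python, which also works when the code runs as a script. The function name `clean_workdir` is hypothetical, and the pattern list is copied from the cell above (omitting `pw.conf`, which the cell's comments suggest keeping in some cases).

```python
import shutil
import tempfile
from pathlib import Path

# Patterns taken from the clean-up cell in this notebook.
PATTERNS = [
    "runinfo", "__pycache__", "parsl-task.*", "*.pid", "*.started",
    "*.ended", "*.cancelled", "*.cogout", "lastid*", "launchcmd.*",
    "parsl-htex-worker.sh",
]


def clean_workdir(root, patterns=PATTERNS):
    """Remove files and directories under root that match the patterns."""
    removed = []
    for pattern in patterns:
        # Materialize the matches before deleting anything.
        for path in list(Path(root).glob(pattern)):
            if path.is_dir():
                shutil.rmtree(path)
            else:
                path.unlink()
            removed.append(path.name)
    return sorted(removed)


# Demo in a throwaway directory so nothing real is deleted.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "runinfo").mkdir()
    (Path(tmp) / "job.pid").write_text("123")
    (Path(tmp) / "keep.txt").write_text("results")
    demo_removed = clean_workdir(tmp)
    demo_kept = sorted(p.name for p in Path(tmp).iterdir())

print(demo_removed, demo_kept)
```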
