# Simcat workflow with slurm

The simcat workflow consists of two steps:  

1) Fill a database with simulation parameters. This happens a single time and doesn't require parallel computing.  

2) Send out many jobs, each of which takes a piece of that database and runs the simulations.  

Both steps use simcat's python code, but the second step runs much faster when it is split into many small jobs on a cluster. Here, I use **slurm scheduling** to run the second step efficiently by submitting 2000 jobs, each using four cores.

### Imports

In [1]:
import simcat
import toytree

## Building the simulation database

Building the simulation database requires an input species tree, plus parameters that set the size of the database and the range of variation we wish to see in the species tree parameters.

### Define the tree:

In [2]:
tre = toytree.rtree.imbtree(8, treeheight = 20e6)

In [3]:
tre.draw(ts='p');

### Build the database:

In [4]:
db = simcat.Database(
    name='imb_8tip_20mil_2admixedges',
    workdir="../",
    tree=tre,
    nrows=60000,
    nsnps=20000,
    exclude_sisters=True,
    existing_admix_edges=[(1,3)],
    admix_edge_min=.3,
    admix_edge_max=.7,
    admix_prop_min=0.05,
    admix_prop_max=.5,
)

60000 labels to be stored in: ../imb_8tip_20mil_2admixedges.labels.h5


Building the database for 60000 simulations took about ten minutes for me, though the time will vary with the size of the tree.

## Running in parallel on slurm

My goal is to run thousands of jobs on the cluster, each of which will run simulations (taking several hours apiece). I will use a bash script to automate submitting these jobs, each of which calls a central python script that runs simcat.

### Writing the python script:

The python script tells simcat to open the database file and pick 30 unfinished simulations. Each slurm job submitted by the bash script will run this same python code.

Notice that we have to define the same name and working directory as in the Database section so that it can find the database and counts file we have written in the previous sections. Think about this when deciding where to save the file and whether to use relative paths.
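
To make the relative-path point concrete, here is a minimal sketch of how `'../'` would resolve for a running job. It assumes that each job runs from the logs directory given as the slurm working directory later in this notebook, that simcat resolves a relative working directory against the process's current directory in the usual way, and that the labels file follows the naming shown in the database output above. The variable names are only for illustration.

```python
import posixpath

# Directory each slurm job is assumed to run in (the logs directory used later on).
job_cwd = "/rigel/dsi/users/pfm2119/projects/simcat_power/training/testing/logs"

# Name and relative working directory passed to the Simulator in the python script.
name = "imb_8tip_20mil_2admixedges"
workdir = "../"

# A relative path is resolved against the current directory of the running process.
resolved_workdir = posixpath.normpath(posixpath.join(job_cwd, workdir))
labels_file = posixpath.join(resolved_workdir, name + ".labels.h5")

print(resolved_workdir)
print(labels_file)
```

If the second path is not where the database was actually written, either the working directory in the python script or the directory the jobs run from needs to change.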

In [5]:
python_script = """import sys
import ipyparallel as ipp
import simcat

# cluster ID passed in by the slurm job script
clust_id = sys.argv[1]

# connect to the ipcluster that was started with this ID
ipyclient = ipp.Client(cluster_id=str(clust_id))

print("num of engines: " + str(len(ipyclient)))

# open the existing database by name and working directory
tst = simcat.Simulator.Simulator('imb_8tip_20mil_2admixedges', '../')

# run 30 unfinished simulations on the connected engines
tst.run(30, ipyclient=ipyclient)
"""

Now write the script to your desired location. Again, remember that the location matters if you use relative paths to point the `Simulator` object at the working directory.

In [6]:
# define path
python_script_path = "/rigel/dsi/users/pfm2119/projects/simcat_power/training/testing/dat/run_queue.py"

# write the file
with open(python_script_path,'w') as f:
    f.write(python_script)
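
Since the same script will be run by thousands of jobs, it can be worth checking that it at least parses before submitting anything. The following is a minimal sketch using the standard library's `ast` module. The helper `check_script_syntax` and the two example strings are hypothetical, and the check only catches syntax errors, not problems such as wrong paths.

```python
import ast


def check_script_syntax(source, filename="<script>"):
    """Return True if `source` parses as Python; otherwise report the error and return False."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as err:
        print("syntax error in {}, line {}: {}".format(filename, err.lineno, err.msg))
        return False
    return True


# Small stand-in scripts so the sketch runs on its own.
example_script = "import sys\nclust_id = sys.argv[1]\nprint(clust_id)\n"
script_ok = check_script_syntax(example_script, "run_queue.py")
broken_ok = check_script_syntax("print('unterminated", "broken.py")
```

In this notebook, the same helper could be called on the `python_script` string instead of the stand-in.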

### Writing the bash script:

The bash script will point thousands of jobs to the python script that we have written, running each job with a separate slurm script. Each of these slurm scripts will also define the computing resources we're requesting. 

Notice also that each slurm script starts its own ipcluster with a unique ID and then sleeps to give the engines time to start, so that all engines are active when the python script connects to them.
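
A fixed sleep is the simplest way to wait for the engines. As an alternative, here is a minimal sketch of a polling loop that stops as soon as the expected number of engines is visible. The helper `wait_for_engines` and its parameters are hypothetical and not part of simcat or ipyparallel. In the python script the counting function could be `lambda: len(ipyclient)`, the same quantity the script already prints. A stand-in counter is used below so the sketch runs without a cluster.

```python
from time import sleep


def wait_for_engines(count_engines, expected, timeout=330.0, interval=5.0, sleep_fn=sleep):
    """Poll count_engines() until it reports `expected` engines or `timeout` seconds pass.

    Returns the last observed engine count.
    """
    waited = 0.0
    seen = count_engines()
    while seen < expected and waited < timeout:
        sleep_fn(interval)
        waited += interval
        seen = count_engines()
    return seen


# Stand-in for `lambda: len(ipyclient)`: engines appear gradually over successive polls.
counts = iter([0, 1, 2, 4])
naps = []  # records each requested pause instead of actually sleeping
seen = wait_for_engines(lambda: next(counts), expected=4, sleep_fn=naps.append)
print(seen, len(naps))
```

The caller can compare the returned count with the number of cores requested and decide whether to proceed or exit early.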

In [7]:
job_directory = "/rigel/dsi/users/pfm2119/projects/simcat_power/training/testing/logs"
num_jobs = 2000 # 2000 jobs x 30 simulations each = 60000, enough to fill the whole database
account_name = 'dsi'
num_cores = 4
time = "11:59:00"
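
As a quick check on these settings, the sketch below recomputes how many jobs are needed to cover the database. The numbers come from the cells above (60000 rows, 30 simulations per job); the variable names are only for illustration.

```python
import math

nrows = 60000        # rows in the simulation database built above
sims_per_job = 30    # simulations each job runs in the python script
cores_per_job = 4    # cores requested for each job

# Round up so that a partial batch at the end still gets its own job.
jobs_needed = math.ceil(nrows / sims_per_job)

# Cores in use if every job were running at the same time.
peak_cores = jobs_needed * cores_per_job

print(jobs_needed, peak_cores)
```

If `nrows` or the number of simulations per job changes, the job count should be recomputed the same way.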

In [8]:
bash_script = """#!/bin/bash

for jobname in $(seq 1 {0}); do

    job_directory={1}

    job_file="${{job_directory}}/${{jobname}}.job"

    clust_id_d=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 8 | head -n 1)
    clust_id="${{clust_id_d}}"
    echo "#!/bin/bash
#SBATCH --workdir=$job_directory
#SBATCH --account={2}
#SBATCH --job-name=sc${{jobname}}
#SBATCH --cores={3}
#SBATCH --nodes=1
#SBATCH --time={4}

ipcluster start --n {3} --daemonize --debug --cluster-id=${{clust_id}} --delay=5.0
sleep 330
date +%Y-%m-%d-%H:%M:%S
which python
date +%Y-%m-%d-%H:%M:%S
python {5} ${{clust_id}}
date +%Y-%m-%d-%H:%M:%S" > $job_file

    sbatch $job_file

    rm $job_file

done
""".format(num_jobs,job_directory,account_name, num_cores, time, python_script_path)

In [9]:
print(bash_script)


#!/bin/bash

for jobname in $(seq 1 2000); do

    job_directory=/rigel/dsi/users/pfm2119/projects/simcat_power/training/testing/logs

    job_file="${job_directory}/${jobname}.job"

    clust_id_d=$(cat /dev/urandom | tr -dc 'a-zA-Z0-9' | fold -w 8 | head -n 1)
    clust_id="${clust_id_d}"
    echo "#!/bin/bash
#SBATCH --workdir=$job_directory
#SBATCH --account=dsi
#SBATCH --job-name=sc${jobname}
#SBATCH --cores=4
#SBATCH --nodes=1
#SBATCH --time=11:59:00

ipcluster start --n 4 --daemonize --debug --cluster-id=${clust_id} --delay=5.0
sleep 330
date +%Y-%m-%d-%H:%M:%S
which python
date +%Y-%m-%d-%H:%M:%S
python /rigel/dsi/users/pfm2119/projects/simcat_power/training/testing/dat/run_queue.py ${clust_id}
date +%Y-%m-%d-%H:%M:%S" > $job_file

    sbatch $job_file

    rm $job_file

done
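
The doubled braces in the template come out as single braces in the printed script because `str.format` treats `{{` and `}}` as escaped literal braces, while numbered fields such as `{0}` are filled in. A minimal sketch on a single line of the template (the trailing comment inside the string is added only to show a filled field):

```python
# One line in the style of the template, plus a numbered field.
template = 'job_file="${{job_directory}}/${{jobname}}.job"  # jobs: {0}'

# Doubled braces become literal braces; {0} is replaced by the first argument.
rendered = template.format(2000)

print(rendered)
```

This is what lets the bash variables such as `${jobname}` survive formatting untouched.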



### Run the jobs:

Now I just have to write and run the bash script, and it will submit all of the jobs to slurm.

Write the bash file:

In [10]:
with open("run_sims.sh","w") as f:
    f.write(bash_script)

Run the bash file:

In [None]:
%%bash
bash run_sims.sh

Now we wait. If all jobs start immediately, all 60000 simulations should be finished within about 12 hours, the time limit requested for each job.
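
If the scheduler does not start every job at once, the total wait grows with the number of "waves" of jobs. Here is a rough sketch of that arithmetic. The limit of 500 concurrent jobs is a made-up number for illustration, and 12 hours per job is an upper bound taken from the requested time limit, not a measured runtime.

```python
import math

num_jobs = 2000
num_cores = 4
hours_per_job = 12  # upper bound from the 11:59:00 time limit

# Hypothetical: how many of our jobs the scheduler runs at the same time.
max_concurrent_jobs = 500

# Jobs run in successive waves when they cannot all start together.
waves = math.ceil(num_jobs / max_concurrent_jobs)
worst_case_hours = waves * hours_per_job

# Upper bound on the core-hours requested across all jobs.
core_hours_requested = num_jobs * num_cores * hours_per_job

print(waves, worst_case_hours, core_hours_requested)
```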