
4.1. Add new bioinformatics tool to ProkEvo

Natasha Pavlovikj edited this page Feb 1, 2021 · 5 revisions

To demonstrate how to add a completely new bioinformatics tool to ProkEvo, we will use snp-dists as an example. snp-dists is a tool that creates a pairwise SNP distance matrix from a FASTA sequence alignment.

  1. The first step in adding a new tool to ProkEvo is to add the conda package for that tool, pinned to a specific version, to the prokevo.yml file. ProkEvo uses this file to create a conda environment with all needed dependencies and databases. At the time of writing, the current version of snp-dists is 0.7.0, so snp-dists=0.7.0 should be added at the end of "prokevo.yml":
[centos@npavlovikj-prokevo ProkEvo]$ cat prokevo.yml
...
  - zlib=1.2.11=h516909a_1009
  - zstd=1.4.5=h6597ccf_2
  - snp-dists=0.7.0
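
Before rebuilding the environment, it can help to confirm that the pin was actually added. The following is a minimal sketch, not part of ProkEvo: the helper name `pinned_version` is hypothetical, and it assumes each dependency sits on its own `- name=version[=build]` line, as in the excerpt above.

```python
def pinned_version(env_text, package):
    """Return the version pinned for `package` in a conda environment file, or None."""
    for line in env_text.splitlines():
        entry = line.strip()
        if not entry.startswith("- "):
            continue  # not a dependency entry
        # Conda specs look like name=version or name=version=build.
        parts = entry[2:].strip().split("=")
        if parts[0] == package and len(parts) > 1:
            return parts[1]
    return None


# Excerpt of the dependency lines shown above (the rest of the file is omitted).
example = """\
  - zlib=1.2.11=h516909a_1009
  - zstd=1.4.5=h6597ccf_2
  - snp-dists=0.7.0
"""

print(pinned_version(example, "snp-dists"))  # 0.7.0
```

For the real file, the text could be read with `open("prokevo.yml").read()` and passed to the same function.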
  1. The second step is to create a Bash script for snp-dists in the scripts/ directory that runs snp-dists with the desired options:
[centos@npavlovikj-prokevo ProkEvo]$ vim scripts/snp-dists.sh 
#!/bin/bash

source /opt/conda/etc/profile.d/conda.sh
conda activate /home/centos/ProkEvo/prokevo

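# Alternative: forward all arguments unchanged
# snp-dists "$@"
# -b: blank top-left corner cell; -c: CSV output instead of TSV
snp-dists -b -c "$1"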

conda deactivate
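
To sanity-check what this script writes, the CSV can be loaded and inspected. The sketch below is an illustration only: it assumes the layout that the `-b -c` options are expected to produce (a header row with a blank first cell followed by sample names, then one row per sample), and the sample names and distances are made up. The function names are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Parse a CSV distance matrix into (names, nested dict of distances)."""
    reader = csv.reader(handle)
    header = next(reader)
    names = header[1:]  # the first cell is blank because of -b
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        matrix[row[0]] = dict(zip(names, map(int, row[1:])))
    return names, matrix


def is_symmetric(names, matrix):
    """A pairwise distance matrix should satisfy d(a, b) == d(b, a)."""
    return all(matrix[a][b] == matrix[b][a] for a in names for b in names)


# Made-up data in the assumed layout; not real snp-dists output.
sample = io.StringIO(",s1,s2,s3\ns1,0,3,5\ns2,3,0,2\ns3,5,2,0\n")
names, matrix = read_distance_matrix(sample)
print(names, matrix["s1"]["s3"], is_symmetric(names, matrix))
```

With a real run, the file handle would come from opening the CSV that the workflow stages out instead of the in-memory string.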
  1. The next step is to map the job name used in "sub-dax.py" to the location of the snp-dists script. This can be done by adding:
tr ex_snp_dists_run {
    site local-hcc {
        pfn "file:////home/centos/ProkEvo/scripts/snp-dists.sh"
    }
}

at the end of the "tc.txt" file. "tc.txt" is the Transformation Catalog used by Pegasus; more information about its structure and content can be found in the Pegasus documentation.
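
Because the job name in "sub-dax.py" and the `tr` name in "tc.txt" must match exactly, a quick consistency check can catch typos before the workflow is submitted. This is a rough sketch under stated assumptions: it relies on simple regular expressions over the two files' text rather than on any Pegasus API, the helper names are hypothetical, and `ex_roary_run` in the example is an assumed name used only for illustration.

```python
import re


def job_names(dax_source):
    """Collect the names passed to Job("...") in the workflow script text."""
    return set(re.findall(r'\bJob\(\s*["\']([^"\']+)["\']', dax_source))


def catalog_names(tc_source):
    """Collect the names declared as `tr NAME {` in the catalog text."""
    return set(re.findall(r'^\s*tr\s+(\S+)\s*\{', tc_source, flags=re.MULTILINE))


def missing_transformations(dax_source, tc_source):
    """Job names that have no matching entry in the transformation catalog."""
    return sorted(job_names(dax_source) - catalog_names(tc_source))


dax_text = (
    'snp_dists_run = Job("ex_snp_dists_run")\n'
    'roary_run = Job("ex_roary_run")\n'  # hypothetical name, for illustration
)
tc_text = (
    "tr ex_snp_dists_run {\n"
    "    site local-hcc {\n"
    '        pfn "file:////home/centos/ProkEvo/scripts/snp-dists.sh"\n'
    "    }\n"
    "}\n"
)

print(missing_transformations(dax_text, tc_text))  # ['ex_roary_run']
```

An empty list means every job found by the regular expression has a catalog entry; job names built dynamically in the script would not be detected by this approach.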

  1. The last step is to add the job and its dependencies to the workflow itself. In this example, snp-dists is part of the second sub-workflow, so this information is added to "sub-dax.py". Here, snp-dists uses the alignment file produced by Roary to generate the distance matrix. The output of snp-dists is not used as input to any other tool, so no other program runs after snp-dists.

To add the job for snp-dists, the following snippet should be added to "sub-dax.py":

[centos@npavlovikj-prokevo ProkEvo]$ more sub-dax.py
...
# add job for snp-dists
snp_dists_run = Job("ex_snp_dists_run")
snp_dists_run.addArguments("roary_output/core_gene_alignment.aln")
snp_dists_run.uses("roary_output/core_gene_alignment.aln", link=Link.INPUT)
o = File("distance_matrix_output.csv")
snp_dists_run.setStdout(o)
snp_dists_run.uses(o, link=Link.OUTPUT, transfer=True)
snp_dists_run.addProfile(Profile("condor", "request_memory", "70000"))
snp_dists_run.addProfile(Profile("globus", "maxmemory", "70000"))
snp_dists_run.addProfile(Profile("condor", "memory", "70000"))
# snp_dists_run.addProfile(Profile("pegasus", "label", str(srr_id)))
dax.addJob(snp_dists_run)
...

More information about the meaning of the individual lines can be found in the Pegasus documentation.

After the job is added, its dependencies need to be defined at the end of "sub-dax.py", in the section "Add control-flow dependencies". snp-dists runs after Roary and no program runs after it, so the following line should be added to "sub-dax.py":

...
dax.addDependency(Dependency(parent=roary_run, child=snp_dists_run))
...

With this, the section "Add control-flow dependencies" in "sub-dax.py" should look like:

[centos@npavlovikj-prokevo ProkEvo]$ cat sub-dax.py
...
for i in range(0,length):
    # Add control-flow dependencies
    dax.addDependency(Dependency(parent=plasmidfinder_run[i], child=ls_run))
    dax.addDependency(Dependency(parent=prokka_run[i], child=roary_run))
    # COMMENT OUT THE LINE BELOW TO SKIP SISTR IF NON SALMONELLA ORGANISM IS USED!!!
    dax.addDependency(Dependency(parent=sistr_run[i], child=cat))
dax.addDependency(Dependency(parent=roary_run, child=snp_dists_run))
dax.addDependency(Dependency(parent=mlst_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_argannot_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_card_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_ncbi_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_plasmidfinder_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_resfinder_run, child=ls_run))
dax.addDependency(Dependency(parent=abricate_vfdb_run, child=ls_run))
dax.addDependency(Dependency(parent=roary_run, child=fastbaps_run))
dax.addDependency(Dependency(parent=cat, child=merge_sistr_run))
...
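
The effect of the control-flow dependencies above can be illustrated without Pegasus by treating each dependency as a (parent, child) edge and sorting the resulting graph. This is a simplified sketch, not how Pegasus schedules jobs: it uses Python's standard `graphlib` module (Python 3.9 or newer), includes only a few of the edges from the listing, and collapses the per-sample `prokka_run[i]` jobs into a single node.

```python
from graphlib import TopologicalSorter

# (parent, child) pairs taken from a subset of the listing above.
edges = [
    ("prokka_run", "roary_run"),  # simplification: one node for all prokka_run[i]
    ("roary_run", "snp_dists_run"),
    ("roary_run", "fastbaps_run"),
]


def execution_order(edges):
    """Return one valid order in which the jobs could run."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # the child depends on the parent
    return list(sorter.static_order())


def children_of(job, edges):
    """Jobs that wait directly on `job`."""
    return sorted(child for parent, child in edges if parent == job)


order = execution_order(edges)
print(order.index("roary_run") < order.index("snp_dists_run"))  # True
print(children_of("snp_dists_run", edges))  # []
```

The empty list for `snp_dists_run` reflects that no other job waits on its output, which is why only one dependency line is needed for it.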

With these changes, the snp-dists step is added to ProkEvo, and the workflow can be submitted. More information about the dependency section and its Python syntax can be found in the Pegasus documentation.