IPSSA (Integrated Positively Selected Sites Analyses) is a Compi pipeline to automatically identify positively selected amino acid sites using three different methods, namely CodeML, omegaMap, and FUBAR. Moreover, it looks for evidence of recombination in the data. A Docker image is available for this pipeline in this Docker Hub repository.
IPSSA applies the same steps to each input FASTA file separately. This process comprises:
- Check whether the input FASTA file contains ambiguous nucleotide positions or sequences whose length is not a multiple of three. If so, the pipeline stops at this point and the files must be fixed.
- Extract a random subset of sequences according to the sequence limit specified to create the master set of sequences.
- Translate and align the master set of sequences.
- The master protein alignment is then backtranslated to produce a master DNA alignment, used to:
  - Run phipack.
  - Create the PSS subsets for CodeML, omegaMap, and FUBAR, according to the number of sequences and replicas specified for each method.
- The master protein alignment is also filtered to remove low confidence positions, according to the value specified. These filtered files are then converted into DNA files, which are split into the same subsets used by CodeML, omegaMap, and FUBAR. These are the files used by MrBayes to produce a Bayesian phylogenetic tree that is used by FUBAR and CodeML.
- Run phipack for each one of the PSS subsets.
- Run MrBayes for each one of the PSS subsets (using the filtered DNA files described above).
- Run CodeML, omegaMap, and FUBAR, using their corresponding PSS subsets.
- Finally, gather the results of all PSS methods into a tabular format.
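Since the pipeline stops when the first check fails, it can be convenient to screen the input files beforehand. The following is a minimal sketch of such a pre-check, assuming plain, unaligned nucleotide FASTA files. The function name check_fasta is hypothetical and not part of IPSSA, and the pipeline's own check may apply different rules.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check (not part of IPSSA): reports sequences whose length is
# not a multiple of three or that contain characters other than A, C, G, T.
# Returns a non-zero status if any problem is found.
check_fasta() {
  awk '
    function report() {
      if (name == "") return
      if (length(seq) % 3 != 0) { print name ": length not a multiple of three"; bad = 1 }
      if (seq ~ /[^ACGTacgt]/)  { print name ": ambiguous nucleotide found"; bad = 1 }
    }
    /^>/ { report(); name = substr($0, 2); seq = ""; next }   # new record starts
         { gsub(/[ \t\r]/, ""); seq = seq $0 }                # accumulate sequence lines
    END  { report(); exit bad }
  ' "$1"
}
```

For example, check_fasta input/1.fasta prints one line per problem found and returns a non-zero status if the file needs fixing.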
In order to use the IPSSA image, first create a directory in your local file system (ipssa_project in the example) with the following structure:
ipssa_project/
├── input
│ ├── 1.fasta
│ ├── 2.fasta
│ ├── .
│ ├── .
│ ├── .
│ └── n.fasta
└── ipssa-project.params
Where:
- The input FASTA files to be analyzed must be placed in the ipssa_project/input directory.
- The Compi parameters file, if one is used, must be located at ipssa_project/ipssa-project.params.
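The layout above can be prepared with a few shell commands. This is only a sketch: the helper name init_ipssa_project is hypothetical, and it simply creates the input directory and copies the given FASTA files into it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of IPSSA): creates the project layout described
# above and copies the given FASTA files into its input directory.
init_ipssa_project() {
  local project_dir="$1" fasta
  shift
  mkdir -p "${project_dir}/input"
  for fasta in "$@"; do
    cp "${fasta}" "${project_dir}/input/"
  done
}
```

For example, init_ipssa_project /path/to/ipssa_project sequences/*.fasta leaves the FASTA files under /path/to/ipssa_project/input. A parameters file, if needed, still has to be placed in the project directory separately.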
Once this structure and files are ready, adapt and run the following commands to run the entire pipeline using the default parameters (i.e. without a Compi parameters file). Here, you only need to set PROJECT_DIR to the right path in your local file system and COMPI_NUM_TASKS to the maximum number of parallel tasks that can be run. Pipeline parameters can be added at the end of the command (e.g. --sequence_limit 10). Note that the --host_working_dir parameter is mandatory and must point to the pipeline working directory in the host machine.
PROJECT_DIR=/path/to/ipssa_project
COMPI_NUM_TASKS=6
PIPELINE_WORKING_DIR=${PROJECT_DIR}/pipeline_working_dir
INPUT_DIR=${PROJECT_DIR}/input
# Run with default parameter values
docker run -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock -v ${PIPELINE_WORKING_DIR}:/working_dir -v ${INPUT_DIR}:/input --rm pegi3s/ipssa -o --logs /working_dir/logs --num-tasks ${COMPI_NUM_TASKS} -- --host_working_dir ${PIPELINE_WORKING_DIR}
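To illustrate where extra pipeline parameters go, the sketch below wraps the command above in a hypothetical function, ipssa_cmd, that only prints the resulting docker command. Anything passed after the first three arguments is appended after --host_working_dir, which is where parameters such as --sequence_limit 10 belong. The function itself is an illustration and not part of IPSSA.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of IPSSA): prints the docker command for a
# default-parameter run, with any extra pipeline parameters appended after
# the mandatory --host_working_dir parameter.
ipssa_cmd() {
  local working_dir="$1" input_dir="$2" num_tasks="$3"
  shift 3
  echo docker run -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock \
    -v "${working_dir}:/working_dir" -v "${input_dir}:/input" --rm pegi3s/ipssa \
    -o --logs /working_dir/logs --num-tasks "${num_tasks}" \
    -- --host_working_dir "${working_dir}" "$@"
}
```

For example, ipssa_cmd "${PIPELINE_WORKING_DIR}" "${INPUT_DIR}" "${COMPI_NUM_TASKS}" --sequence_limit 10 prints the full command so that it can be inspected before being run. Paths containing spaces would need additional quoting before the printed command is executed.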
If you want to specify the pipeline parameters using a Compi parameters file, adapt and run the following commands. These are the same commands as above but with the addition of the PARAMS_DIR variable.
An example of a Compi parameters file can be obtained by running the following command:
docker run --rm --entrypoint cat pegi3s/ipssa /resources/ipssa-project.params
This parameters file contains the default values recommended for running IPSSA. Please note that you must update the value of the host_working_dir parameter in this file before using it.
PROJECT_DIR=/path/to/ipssa_project
COMPI_NUM_TASKS=6
PIPELINE_WORKING_DIR=${PROJECT_DIR}/pipeline_working_dir
INPUT_DIR=${PROJECT_DIR}/input
PARAMS_DIR=${PROJECT_DIR}
docker run -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock -v ${PIPELINE_WORKING_DIR}:/working_dir -v ${INPUT_DIR}:/input -v ${PARAMS_DIR}:/params --rm pegi3s/ipssa -o --logs /working_dir/logs --num-tasks ${COMPI_NUM_TASKS} -pa /params/ipssa-project.params
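Because host_working_dir must be updated in the parameters file before it is used, a small helper can make that edit repeatable. The sketch below assumes the parameters file stores one key=value pair per line; this is an assumption, so check the example file obtained from the image first. The function name set_param is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of IPSSA): sets KEY=VALUE in a parameters file
# by dropping any existing KEY line and appending the new one at the end.
# Assumes a key=value line format, which should be verified against the
# example file shipped in the image.
set_param() {
  local file="$1" key="$2" value="$3" tmp
  tmp=$(mktemp)
  grep -v "^${key}=" "${file}" > "${tmp}" || true   # keep every other line
  printf '%s=%s\n' "${key}" "${value}" >> "${tmp}"
  cat "${tmp}" > "${file}" && rm -f "${tmp}"        # overwrite in place, keep permissions
}
```

For example, set_param "${PARAMS_DIR}/ipssa-project.params" host_working_dir "${PIPELINE_WORKING_DIR}" updates the file before the docker command is run.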
Some tasks may produce errors that do not cause the pipeline to fail, but they can be important. Such errors are reported in the log files produced in the logs directory of the pipeline working directory. The find-error-tasks.sh script of the pegi3s/ipssa Docker image displays the errored tasks (i.e. those containing the word error in their log files) along with the names of the corresponding input files. Run the following command to find them (assuming the environment variables with the project and working directory paths have been declared):
docker run --entrypoint /opt/scripts/find-error-tasks.sh -v ${PIPELINE_WORKING_DIR}:/working_dir -v ${INPUT_DIR}:/input --rm pegi3s/ipssa /working_dir/logs /input /working_dir/run_lists
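For a quick look without starting a container, the log files that mention the word error can also be listed directly with grep. This is only a rough approximation of what the script reports: it does not map tasks back to their input files, the case-insensitive match is a choice made here rather than something taken from find-error-tasks.sh, and it assumes the logs are readable by the current user. The function name list_error_logs is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of IPSSA): lists, in sorted order, the files
# under the given logs directory whose contents mention "error"
# (case-insensitive). Unlike find-error-tasks.sh, it does not report the
# corresponding input files.
list_error_logs() {
  grep -rli "error" "$1" | sort
}
```

For example, list_error_logs "${PIPELINE_WORKING_DIR}/logs" prints one log file path per line, or nothing if no log mentions an error.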
To re-run the pipeline in the same project directory, run the following command first in order to clean the pipeline working directory:
sudo rm -rf ${PIPELINE_WORKING_DIR}
These are the pipeline parameters:
- sequence_limit: The maximum number of sequences to use for the master file. The default value is 90.
- random_seed: The random seed.
- align_method: The alignment method to use, one of: clustalw, muscle, kalign, t_coffee, or amap. The default value is muscle.
- tcoffee_min_score: The minimum support value for alignment positions. The default value is 3.
- mrbayes_generations: The number of iterations in MrBayes. The default value is 1000000.
- mrbayes_burnin: The MrBayes burnin. The default value is 2500.
- fubar_sequence_limit: The maximum number of sequences to be used by FUBAR. The default value is 90.
- fubar_runs: The number of independent replicas for FUBAR. The default value is 1.
- codeml_sequence_limit: The maximum number of sequences to be used by CodeML. The default value is 30.
- codeml_runs: The number of independent replicas for CodeML. The default value is 1.
- codeml_models: The CodeML models to be run, one or more of: 1, 2, 7, and/or 8. To declare more than one model use a blank space between models. The default value is 1 2 7 8.
- omegamap_sequence_limit: The maximum number of sequences to be used by omegaMap. The default value is 90.
- omegamap_iterations: The number of omegaMap iterations. The default value is 1.
- omegamap_runs: The number of independent replicas for omegaMap. The default value is 2500.
- omegamap_recomb: A flag indicating if omegaMap must be executed only if recombination is detected in the master file. By default, the flag is not present and thus omegaMap is executed (if omegamap_iterations > 0).
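Since codeml_models accepts only a space-separated combination of 1, 2, 7, and 8, a value can be sanity-checked before launching a long run. The following sketch uses a hypothetical function, valid_codeml_models, that is not part of IPSSA and only mirrors the rule stated in the list above.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of IPSSA): succeeds only if the given value is a
# non-empty, whitespace-separated list in which every model is 1, 2, 7 or 8.
valid_codeml_models() {
  local model
  set -- $1                      # split the value on whitespace
  [ "$#" -gt 0 ] || return 1     # an empty list is rejected
  for model in "$@"; do
    case "${model}" in
      1|2|7|8) ;;
      *) return 1 ;;
    esac
  done
  return 0
}
```

For example, valid_codeml_models "1 2 7 8" succeeds, while valid_codeml_models "1 3" fails because 3 is not one of the listed models.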
The sample data is available here. Download and uncompress it, and move to the directory named ipssa-m-leprae, where you will find:
- A directory called ipssa-m-leprae-project, which contains the structure described previously.
- A file called run.sh, which contains the following commands (where you should adapt the PROJECT_DIR path) to test the pipeline:
PROJECT_DIR=/path/to/ipssa-m-leprae-project
COMPI_NUM_TASKS=8
PIPELINE_WORKING_DIR=${PROJECT_DIR}/pipeline_working_dir
INPUT_DIR=${PROJECT_DIR}/input
PARAMS_DIR=${PROJECT_DIR}
docker run -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock -v ${PIPELINE_WORKING_DIR}:/working_dir -v ${INPUT_DIR}:/input -v ${PARAMS_DIR}:/params --rm pegi3s/ipssa -o --logs /working_dir/logs --num-tasks ${COMPI_NUM_TASKS} -pa /params/ipssa-project.params
Running times for the sample data:
- ≈ 207 minutes - 50 parallel tasks - Ubuntu 18.04.2 LTS, 96 CPUs (AMD EPYC™ 7401 @ 2GHz), 1TB of RAM and SSD disk.
- ≈ 345 minutes - 16 parallel tasks - Ubuntu 18.04.3 LTS, 12 CPUs (AMD Ryzen 5 2600 @ 3.40GHz), 16GB of RAM and SSD disk.
To build the Docker image, compi-dk is required. Once you have it installed, simply run compi-dk build from the project directory to build the Docker image. The image will be created with the name specified in the compi.project file (i.e. pegi3s/ipssa:latest). This file also specifies the version of Compi that goes into the Docker image.
- H. López-Fernández; C. P. Vieira; P. Ferreira; P. Gouveia; F. Fdez-Riverola; M. Reboiro-Jato; J. Vieira (2021) On the identification of clinically relevant bacterial amino acid changes at the whole genome level using Auto-PSS-Genome. Interdisciplinary Sciences: Computational Life Sciences. Volume 13, pp. 334–343.