The ebola_i2o pipeline processes the output from the EBOV chip, designed by the Duncan Lab at the Center for Biologics Evaluation and Research (CBER), FDA. The EBOV chip is designed to type strains of the Ebola virus using a Multi-Locus Subtyping (MLST) strategy. Briefly, the chip generates sequence from 13 regions of the Ebola virus genome. The generated sequences are aligned with the corresponding regions from 149 reference Ebola strains of known strain type. A phylogenetic tree derived from the concatenated alignment across the 13 segments establishes the evolutionary relationships between the test strain and the reference strains.
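To illustrate what "concatenated alignment" means here, the following is a minimal Python sketch, not the pipeline's actual code. It assumes each segment alignment is available as a mapping from strain name to aligned sequence; the function name `concatenate_alignments` and the toy strain names are hypothetical.

```python
from typing import Dict, List


def concatenate_alignments(segments: List[Dict[str, str]], gap: str = "-") -> Dict[str, str]:
    """Join per-segment alignments (strain name -> aligned sequence) end to end.

    A strain that is missing from a segment is padded with gap characters,
    so every concatenated row ends up with the same length.
    """
    strains = sorted({name for seg in segments for name in seg})
    combined = {name: [] for name in strains}
    for seg in segments:
        lengths = {len(seq) for seq in seg.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one segment must have equal length")
        width = lengths.pop()
        for name in strains:
            combined[name].append(seg.get(name, gap * width))
    return {name: "".join(parts) for name, parts in combined.items()}


# Toy data: two short segments stand in for the 13 regions on the chip.
segment_1 = {"test_strain": "ACGT", "ref_A": "ACGA", "ref_B": "AC-A"}
segment_2 = {"test_strain": "GG", "ref_A": "GC"}  # ref_B is absent from this segment
supermatrix = concatenate_alignments([segment_1, segment_2])
```

The resulting rows all have the same length, which is the form a tree-building program would typically expect as input.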
To run this pipeline you will need to install:
- Git and Docker
- the following Perl modules:
  - File::Path
  - File::SortedSeek
  - hint: sudo perl -MCPAN -e 'install <module_name>'
- the following data:
  - a Blast database. You can use the nt database from NCBI. Alternatively, you can create your own database of Filovirus nucleotide sequences using code included in this repo (see below).
  - the nucl_gb.accession2taxid.gz file. Download it from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz and unzip it before use.
  - the names.dmp file from the taxdump.tar.gz file-set. Download the file-set from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz and unzip it before use.
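If you prefer to unpack the two taxonomy downloads from Python rather than from the shell, a sketch along these lines should work. It uses only the standard library; the helper names `gunzip_file` and `extract_member` are hypothetical, and the assumption that names.dmp sits at the top level of the archive should be checked against your download.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def gunzip_file(gz_path: Path, out_path: Path) -> Path:
    """Stream-decompress a .gz file so the large accession table never sits in memory."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def extract_member(tar_gz_path: Path, member_name: str, out_dir: Path) -> Path:
    """Copy a single file (for example names.dmp) out of a .tar.gz archive."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / Path(member_name).name
    with tarfile.open(tar_gz_path, "r:gz") as archive:
        src = archive.extractfile(member_name)  # raises KeyError if the member is absent
        if src is None:
            raise FileNotFoundError(f"{member_name} is not a regular file in {tar_gz_path}")
        with src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target


# Example calls (paths are placeholders):
# gunzip_file(Path("nucl_gb.accession2taxid.gz"), Path("nucl_gb.accession2taxid"))
# extract_member(Path("taxdump.tar.gz"), "names.dmp", Path("taxonomy"))
```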
- Clone this repo to your local machine:
  git clone https://github.com/openbox-bio/ebola_i2o.git
- The ebola_i2o directory has the following subdirectories:
  ebola_i2o
  |-->code (stores the ebola_i2o code and the ebola_i2o settings file; also stores the program create_blast_nucl_database.py and its associated environment file, filovirus_env, used to create and maintain a Filovirus database)
  |-->data (stores all reference strain multiple sequence alignments)
- Configure the path variables in the ebola_i2o_settings file. This settings file stores the paths to a set of key files and directories for ebola_i2o:
  - REF_DATA_PATH: Path to the data sub-directory in the ebola_i2o directory.
  - BLAST_DB_PATH: Path to the directory storing the Blast database.
  - BLAST_DB_NAME: Name of the Blast database.
  - ACC2TAXID_PATH: Path to the nucl_gb.accession2taxid file.
  - NAMES_DMP_PATH: Path to the names.dmp file.
  - RESULTS_PATH: Path to the folder that stores output from ebola_i2o. For each sample that is run, the pipeline creates a subfolder here (name format = <Sample_Name>_YYYY_MM_DD_HH_MM_SS) to store all the output files.
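Before the first run it can help to sanity-check the settings values. The sketch below is not part of the pipeline. It assumes, purely for illustration, that the settings file holds plain KEY=VALUE lines (this README does not specify the syntax), and the helper names `parse_settings` and `check_settings` are hypothetical. It also applies the no-spaces rule for path names given in this README.

```python
from pathlib import Path
from typing import Dict, List

# The six variables described in the settings list above.
REQUIRED_KEYS = [
    "REF_DATA_PATH", "BLAST_DB_PATH", "BLAST_DB_NAME",
    "ACC2TAXID_PATH", "NAMES_DMP_PATH", "RESULTS_PATH",
]


def parse_settings(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines (assumed format); blank lines and '#' comments are skipped."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def check_settings(settings: Dict[str, str]) -> List[str]:
    """Return a list of readable problems; an empty list means every check passed."""
    problems = []
    for key in REQUIRED_KEYS:
        value = settings.get(key, "")
        if not value:
            problems.append(f"{key} is missing or empty")
        elif " " in value:
            problems.append(f"{key} contains a space: {value!r}")
        elif key != "BLAST_DB_NAME" and not Path(value).exists():
            # BLAST_DB_NAME is a name, not a path, so it is not looked up on disk.
            problems.append(f"{key} does not exist on disk: {value}")
    return problems


# Example (the file name is a placeholder):
# problems = check_settings(parse_settings(Path("ebola_i2o_settings").read_text()))
```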
- Pull the following Docker image: openboxbio/ebola_i2o_tools:latest
  - hint: docker pull openboxbio/ebola_i2o_tools:latest
- Set the environment variable CODE_PATH to the absolute path to the ebola_i2o_settings file. In Ubuntu this can be done by adding export CODE_PATH=<path_to_ebola_i2o_settings_file> to the .profile file.
- Run the pipeline:
  perl ebola_i2o <full_path_name_to_input_file>
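A quick pre-flight check can catch a missing tool or an unset CODE_PATH before the pipeline starts. This is a hedged sketch using only the Python standard library; the function name `preflight` is hypothetical, and it does not check the Perl modules or the Docker image.

```python
import os
import shutil
from pathlib import Path
from typing import List, Mapping, Sequence


def preflight(env: Mapping[str, str] = os.environ,
              tools: Sequence[str] = ("git", "docker", "perl")) -> List[str]:
    """Return a list of setup problems; an empty list means the basics are in place."""
    problems = []
    for tool in tools:
        # shutil.which searches the current PATH for an executable of this name.
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    code_path = env.get("CODE_PATH")
    if not code_path:
        problems.append("CODE_PATH is not set")
    elif not Path(code_path).is_absolute():
        problems.append(f"CODE_PATH is not an absolute path: {code_path}")
    elif not Path(code_path).is_file():
        problems.append(f"CODE_PATH does not point to a file: {code_path}")
    return problems


# Example: print any problems found in the current shell environment.
# for problem in preflight():
#     print(problem)
```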
Important:
- Path names should have no spaces in them. Good folder name: data_folder. Bad folder name: data folder.
- Please ensure that the name of the input file has the following format: <test_strain_name>.txt, where test_strain_name is no longer than ten characters.
- Disclaimer: This pipeline has been tested on Ubuntu. It should, in principle, work on all flavors of Unix. Please mail me at anjan.purkayastha@gmail.com to report problems or bugs.
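The naming rules above can be checked programmatically before a run. The following sketch is illustrative only: the helper names are hypothetical, and it assumes that the <Sample_Name> part of the results subfolder is the test strain name taken from the input file name, which this README does not state explicitly.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional

MAX_NAME_LENGTH = 10  # limit on test_strain_name stated in the notes above


def strain_name_from_input(input_file: str) -> str:
    """Validate an input path against the README rules and return the strain name."""
    if " " in input_file:
        raise ValueError(f"path contains a space: {input_file!r}")
    path = Path(input_file)
    if path.suffix != ".txt":
        raise ValueError("input file name must have the form <test_strain_name>.txt")
    name = path.stem
    if not name or len(name) > MAX_NAME_LENGTH:
        raise ValueError(f"strain name must be 1 to {MAX_NAME_LENGTH} characters: {name!r}")
    return name


def results_folder_name(sample_name: str, when: Optional[datetime] = None) -> str:
    """Build a name in the <Sample_Name>_YYYY_MM_DD_HH_MM_SS format."""
    stamp = (when or datetime.now()).strftime("%Y_%m_%d_%H_%M_%S")
    return f"{sample_name}_{stamp}"


# Example with a fixed timestamp so the result is reproducible:
name = strain_name_from_input("/home/user/data_folder/EBOV01.txt")
folder = results_folder_name(name, datetime(2020, 1, 2, 3, 4, 5))
```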
To create your own Filovirus database:
- Ensure that the following Python packages are installed in your environment: biopython, python-dotenv, pathlib
- In your BLAST database folder, create a file that will serve as the log file for the program.
  - hint: touch filoviridae.db.log
- Open the environment file filovirus_env in a text editor. Specify the absolute path of the Filovirus database using the DB environment variable, and the path of the log file created in the previous step using the LOG environment variable. Email me at anjan.purkayastha@gmail.com for the API KEY and APPLICATION PASSWORD, and enter these values. Do not alter any of the other variables.
- To create a Filovirus database run:
  python <path_to_create_blast_nucl_database.py> -e <path_to_filovirus_env> -t create