ROAGUE: Reconstruction of Ancestral Gene Blocks Using Events
ROAGUE is a tool to reconstruct ancestors of gene blocks in prokaryotic genomes. Gene blocks are genes co-located on the chromosome. In many cases, gene blocks are conserved between bacterial species, sometimes as operons, when genes are co-transcribed. The conservation is rarely absolute: gene loss, gain, duplication, block splitting and block fusion are frequently observed.
ROAGUE accepts a set of species and a gene block in a reference species. It then finds all gene blocks orthologous to the reference gene block and reconstructs their ancestral states.
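To make the idea of events concrete, here is a toy Python sketch that compares a reference gene block with the homologs found in one other genome and reports gene losses and duplications. It is an illustration only and is not ROAGUE's implementation: the function name summarize_events and the observed gene list are hypothetical, and block splitting and fusion are not modeled.

```python
from collections import Counter


def summarize_events(reference_genes, observed_genes):
    """Toy comparison of a reference gene block with homologs seen in one genome.

    Hypothetical helper for illustration; ROAGUE's own event model may differ.
    """
    counts = Counter(observed_genes)
    # Reference genes with no homolog in the other genome.
    lost = sorted(g for g in reference_genes if counts[g] == 0)
    # Reference genes with more than one homolog in the other genome.
    duplicated = sorted(g for g in reference_genes if counts[g] > 1)
    return {"lost": lost, "duplicated": duplicated}


reference = ["astA", "astB", "astC", "astD", "astE"]
observed = ["astC", "astA", "astA", "astD"]  # made-up genome, for illustration
events = summarize_events(reference, observed)
print(events)
```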
- Conda (package manager so we don't have to use sudo)
- Python 3+
- Biopython 1.63+
- Muscle Alignment
- ETE3 (Python framework for trees)
- PDA (optional, if you want to debias your tree based on phylogenetic diversity)
Users can either use the Download button in the GitHub interface or run the following command:
git clone https://github.com/nguyenngochuy91/Ancestral-Blocks-Reconstruction
Install Miniconda. You can either export the path every time you use ROAGUE or add the export line to your .bashrc file. The following commands require Wget, so install it first.
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/anaconda_ete/
export PATH=~/anaconda_ete/bin:$PATH
Install Biopython and ete3 using conda (installing Biopython with conda is highly recommended)
conda install -c bioconda biopython ete3
Install ete_toolchain for visualization
conda install -c etetoolkit ete_toolchain
Install BLAST, ClustalW, MUSCLE
conda install -c bioconda blast clustalw muscle
For PDA, check installation instructions on this website: PDA
The easiest way to run the project is to execute the script roague.py, which is inside the directory Ancestral-Blocks-Reconstruction.
Run on example datasets
Users can run this script on the example data sets provided in the directories E_Coli and B_Sub. The following two commands run ROAGUE on these two directories. The final results (PDF files of the ancestral reconstructions) are stored in the result/B_Sub/visualization directory by default.
./roague.py -g E_Coli/genomes/ -b E_Coli/gene_block_names_and_genes.txt -r NC_000913 -f E_Coli/phylo_order.txt -m global
./roague.py -g B_Sub/genomes/ -b B_Sub/gene_block_names_and_genes.txt -r NC_000964 -f B_Sub/phylo_order.txt -m global
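After a run finishes, a short Python sketch like the following can list the PDF files that were produced. The helper name list_result_pdfs is hypothetical, and the path is the default visualization directory mentioned above; adjust it if you used a different output directory.

```python
from pathlib import Path


def list_result_pdfs(visualization_dir):
    """Return the PDF files found directly inside visualization_dir, sorted by path."""
    directory = Path(visualization_dir)
    if not directory.is_dir():
        # Nothing to report if the run has not created the directory yet.
        return []
    return sorted(p for p in directory.iterdir() if p.suffix.lower() == ".pdf")


for pdf in list_result_pdfs("result/B_Sub/visualization"):
    print(pdf.name)
```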
Run on users' specific datasets
If users want to run the program on their own datasets, they have to provide the following inputs:
- A directory that stores all the genome files to study, in GenBank format
- A gene block text file that stores gene blocks in a reference species (this reference has to be in the genomes directory). The gene block format is tab delimited. The first column is the gene block name, followed by the names of its genes. For example, here is the gene_block_names_and_genes.txt file from Escherichia coli K-12 MG1655.
astCADBE	astA	astB	astC	astD	astE
atpIBEFHAGDC	atpI	atpH	atpC	atpB	atpA	atpG	atpF	atpE	atpD
caiTABCDE	caiA	caiE	caiD	caiC	caiB	caiT
casABCDE12	casE	casD	casA	casC	casB	cas1	cas2
chbBCARFG	chbG	chbF	chbC	chbB	chbA	chbR
- Run ROAGUE; the output is stored in the directory given with -o (result in this example)
./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -m global -o result
usage: roague.py [-h] [--genomes_directory GENOMES_DIRECTORY]
                 [--gene_blocks GENE_BLOCKS] [--reference REFERENCE]
                 [--filter FILTER] [--method METHOD] [--output OUTPUT]

optional arguments:
  -h, --help            show this help message and exit
  --genomes_directory GENOMES_DIRECTORY, -g GENOMES_DIRECTORY
                        The directory that store all the genomes file
                        (E_Coli/genomes)
  --gene_blocks GENE_BLOCKS, -b GENE_BLOCKS
                        The gene_block_names_and_genes.txt file, this file
                        stores the operon name and its set of genes
  --reference REFERENCE, -r REFERENCE
                        The ncbi accession number for the reference genome
                        (NC_000913 for E_Coli and NC_000964 for B_Sub)
  --filter FILTER, -f FILTER
                        The filter file for creating the tree
                        (E_Coli/phylo_order.txt for E_Coli or
                        B_Sub/phylo_order.txt for B-Sub)
  --method METHOD, -m METHOD
                        The method to reconstruc ancestral gene block, we
                        support either global or local
  --output OUTPUT, -o OUTPUT
                        Output directory to store the result
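Before running ROAGUE on your own data, it can help to check that the gene block file parses as expected. The following is a minimal Python sketch of a reader for the tab-delimited format described above (block name in the first column, gene names after it). The function name parse_gene_blocks is hypothetical, and this is not necessarily how ROAGUE reads the file internally.

```python
import csv
import io


def parse_gene_blocks(handle):
    """Map each gene block name to its list of gene names.

    Expects one block per line: block name, then gene names, tab delimited.
    """
    blocks = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Drop empty fields caused by trailing tabs or blank lines.
        fields = [field.strip() for field in row if field.strip()]
        if not fields:
            continue
        blocks[fields[0]] = fields[1:]
    return blocks


# Two rows taken from the example file shown above.
example = io.StringIO(
    "astCADBE\tastA\tastB\tastC\tastD\tastE\n"
    "caiTABCDE\tcaiA\tcaiE\tcaiD\tcaiC\tcaiB\tcaiT\n"
)
blocks = parse_gene_blocks(example)
print(blocks["astCADBE"])
```

To read a real file, open it with open("gene_block_names_and_genes.txt", newline="") and pass the file handle to the function.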
In addition, users can provide a filter text file that specifies the species to include in the reconstruction analysis. This is useful because some families of species may be overrepresented in the genomes directory, which reduces phylogenetic diversity and biases the ancestral reconstruction. Hence, it is recommended to run PDA on the generated tree before proceeding with the analysis. To do so, follow these steps:
- Generate a phylogenetic tree from the genomes directory
./create_newick_tree.py -G genomes_directory -o tree_directory -f NONE -r ref_accession
usage: create_newick_tree.py [-h] [-G DIRECTORY] [-o DIRECTORY] [-f FILE]
                             [-m STRING] [-t FILE] [-r REF] [-q]

optional arguments:
  -h, --help            show this help message and exit
  -G DIRECTORY, --genbank_directory DIRECTORY
                        Folder containing all genbank files for use by the
                        program.
  -o DIRECTORY, --outfolder DIRECTORY
                        Directory where the results of this program will be
                        stored.
  -f FILE, --filter FILE
                        File restrictiong which accession numbers this script
                        will process. If no file is provided, filtering is not
                        performed.
  -r REF, --ref REF     The reference genome number, such as NC_000913 for
                        E_Coli
  -q, --quiet           Suppresses most program text outputs.
- Download and install PDA. Debias the phylogenetic tree using debias.py
./debias.py -i tree_directory/out_tree.nwk -o pda_result.txt -s num -r ref_accession
usage: debias.py [-h] [-i INPUT_TREE] [-o PDA_OUT] [-s TREE_SIZE] [-r REF]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_TREE, --input_tree INPUT_TREE
                        Input tree that we want to debias
  -o PDA_OUT, --pda_out PDA_OUT
                        Output of pda to be store.
  -s TREE_SIZE, --tree_size TREE_SIZE
                        Reduce the size of the tree to this size
  -r REF, --ref REF     Force to include the following species, here I force
                        to include the reference species
- Run ROAGUE; the output is stored in the directory given with -o (result in this example)
./roague.py -g genomes_directory -b gene_block_names_and_genes.txt -r ref_accession -f phylo_order.txt -m global -o result
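The three steps above can be scripted. The following Python sketch only assembles the command lines, using the flags shown in the usage messages, and prints them. The function name build_pipeline_commands and the tree size of 20 are hypothetical. This README does not state how the PDA output relates to the filter file, so the filter file is passed in as a separate argument rather than derived from pda_result.txt.

```python
import shlex


def build_pipeline_commands(genomes_dir, gene_blocks, ref_accession, tree_dir,
                            pda_out, tree_size, filter_file, out_dir,
                            method="global"):
    """Return the argument lists for the tree, debias, and ROAGUE steps, in order."""
    return [
        # Step 1: generate a phylogenetic tree from the genomes directory.
        ["./create_newick_tree.py", "-G", genomes_dir, "-o", tree_dir,
         "-f", "NONE", "-r", ref_accession],
        # Step 2: debias the tree with PDA, keeping the reference species.
        ["./debias.py", "-i", f"{tree_dir}/out_tree.nwk", "-o", pda_out,
         "-s", str(tree_size), "-r", ref_accession],
        # Step 3: run ROAGUE with the filter file.
        ["./roague.py", "-g", genomes_dir, "-b", gene_blocks,
         "-r", ref_accession, "-f", filter_file, "-m", method, "-o", out_dir],
    ]


commands = build_pipeline_commands(
    genomes_dir="genomes_directory",
    gene_blocks="gene_block_names_and_genes.txt",
    ref_accession="NC_000913",
    tree_dir="tree_directory",
    pda_out="pda_result.txt",
    tree_size=20,  # hypothetical target size, choose one that suits your data
    filter_file="phylo_order.txt",
    out_dir="result",
)
for command in commands:
    print(shlex.join(command))
```

To execute the steps rather than print them, each list can be passed to subprocess.run(command, check=True) from the repository directory, which stops at the first step that fails.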
Here are two gene blocks that were generated through our program.
- Gene block paaABCDEFGHIJK:
This gene block contains genes involved in the catabolism of phenylacetate, and it is not conserved across the group of studied bacteria.
The second gene block encodes the enzyme that catalyzes the synthesis of ATP from ADP and inorganic phosphate, and it is highly conserved across the group of studied bacteria.