Lyndon Coghill edited this page Jul 15, 2015 · 7 revisions

Collect Raw Sequences

  • Download all DNA sequences in GenBank for eukaryotes between 500 and 7,500 base pairs.
  • Click Here for a prebuilt search.
  • Click Send To
  • Select File
  • Format Genbank (full)
  • Click Create File

At the time of writing, the search should return roughly 5.1 million records, so the download will take a while depending on your connection.

Parse and Format the Data

Once the sequences are downloaded, the first step is to convert them to FASTA format. One reason not to download them as FASTA in the first place is that converting from the full GenBank records lets you control exactly what information gets included in each record. The pipeline expects a FASTA file with a label formatted in a specific way.

>giXXXX_tiXXXX_accXXXX Description  
TCTAGAAATGAGTGATAATAAAACTGCTGATAAACAGTACGAGTCAGTATGATGGTGTTC  
  1. giXXXX: the GI value for the record.
  2. tiXXXX: the TI (taxon ID) value for the record.
  3. accXXXX: the accession number for the record.
  4. Description: the GenBank description for the record.
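For reference, a label in this format can be split back into its fields with a short helper. This is an illustrative sketch, not part of the pipeline; the function name and regular expression are assumptions that simply mirror the format described above:

```python
import re

# Matches ">giXXXX_tiXXXX_accXXXX Description" labels as described above.
# GI and TI are numeric; the accession may contain a version suffix like ".1".
LABEL_RE = re.compile(
    r"^>?gi(?P<gi>\d+)_ti(?P<ti>\d+)_acc(?P<acc>[A-Za-z0-9.]+)\s*(?P<description>.*)$"
)

def parse_label(label):
    """Return a dict with gi, ti, acc, and description for one FASTA label."""
    match = LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError("label does not match giX_tiX_accX format: %r" % label)
    return match.groupdict()
```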

We have provided a script to help with this. It parses the GenBank file, extracts the necessary information, and generates the properly formatted FASTA file. You can find the script, genbank-to-fasta.py, in the tools repository. This script assumes you have already created the gi_to_ti.db SQLite database as mentioned in the database requirements section. If you haven't, you'll need to take care of that first.
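The conversion step hinges on looking each GI up in gi_to_ti.db to find its taxon ID. The table name and columns below (gi_ti with gi and ti) are illustrative assumptions; check the database requirements section for the real schema. A minimal sketch of the lookup using an in-memory database:

```python
import sqlite3

def lookup_ti(conn, gi):
    """Return the taxon ID (TI) for a GI, or None if the GI is unknown."""
    row = conn.execute("SELECT ti FROM gi_ti WHERE gi = ?", (gi,)).fetchone()
    return row[0] if row else None

# Throwaway in-memory database to demonstrate the lookup; the real pipeline
# would open gi_to_ti.db on disk instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gi_ti (gi INTEGER PRIMARY KEY, ti INTEGER)")
conn.execute("INSERT INTO gi_ti VALUES (158358, 9606)")
```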

Once you have the FASTA file generated, there is one additional, optional step you can perform. For our initial work, we wanted to avoid the inclusion of taxa that are redundantly researched (i.e., model organisms). You can filter out our current list of model organisms by running the filter-models.py script if you want. This step is optional, though; the rest of the pipeline will run without it.
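The idea behind this filter can be sketched in a few lines: drop any record whose TI appears in a blocklist of model-organism taxon IDs. The two TI values below are examples only (9606 = human, 10090 = mouse); the real list ships with filter-models.py:

```python
# Example blocklist of model-organism taxon IDs (assumption; the real list
# lives in filter-models.py).
MODEL_TIS = {9606, 10090}

def filter_models(records, model_tis=MODEL_TIS):
    """records: iterable of (label, sequence) pairs with gi_ti_acc labels."""
    kept = []
    for label, seq in records:
        # Pull the TI field out of a ">giX_tiX_accX Description" label.
        ti = int(label.split("_ti")[1].split("_acc")[0])
        if ti not in model_tis:
            kept.append((label, seq))
    return kept
```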

Build Clusters

Once your data is properly formatted, you can build the clusters. Clusters are built using the program VSEARCH. If you installed the requirements listed previously in the wiki, you should be set. You can tweak the parameters as necessary, but we find that a 70% similarity threshold works pretty well. An example execution might be:

$ vsearch --cluster_fast eukaryotic_genbank.fas --id 0.70 --clusters clust_

This will produce the cluster files in the local directory; each file will have the name clust_NN, where NN is a sequential number generated by VSEARCH. A script is in development to automate this with the pipeline. We will update the wiki when it's ready.

Filter Clusters

There are a number of filters that are necessary in order to isolate the phylogenetically informative (PI) data from the raw clusters. Each step is covered below, and automated to some degree.

Filter for only PI clusters

In order for a cluster to be informative phylogenetically, it needs to contain at least 4 unique taxa so that a gene tree can be built from it later in the process to represent relationships. You can filter your clusters however you choose, but we do have a simple script, pull-pic-clusters.py, that will filter based on the number of taxa in each cluster.
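The PI test itself is simple: count the distinct TI values in a cluster and keep the cluster only if there are at least four. A minimal sketch of the rule (the function name is illustrative; pull-pic-clusters.py is the supported tool):

```python
def is_informative(labels, min_taxa=4):
    """labels: FASTA labels in one cluster, formatted giX_tiX_accX.

    A cluster is phylogenetically informative only if it contains at least
    min_taxa distinct TI values, so a gene tree can be built from it.
    """
    taxa = {label.split("_ti")[1].split("_acc")[0] for label in labels}
    return len(taxa) >= min_taxa
```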

$ pull-pic-clusters.py

Filter transposable elements (optional)

Transposable elements (transposons) jumping around the genome can obviously obscure phylogenetic signal. To minimize their effect, we filter out as many as we can with a simple tool we developed called Transeeker. Transeeker accepts a directory of FASTA files (in this case, our PI cluster files) and BLASTs them against a BLAST database built from the latest version of Repbase. Any hits above a given threshold are filtered out. While this step is optional, it's fast and probably worth the time for a slightly higher level of confidence in the data. You can find all the steps needed in the Transeeker repository.
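The filtering decision boils down to scanning the BLAST tabular report (`-outfmt 6`) from the Repbase search and flagging queries whose hits pass a threshold. The identity and e-value cutoffs below are assumptions for illustration; see the Transeeker repository for the thresholds it actually uses:

```python
def flag_transposons(blast_lines, min_identity=80.0, max_evalue=1e-10):
    """Collect query IDs with a Repbase hit passing the thresholds.

    blast_lines: lines of BLAST tabular output (-outfmt 6), whose columns are
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore. Thresholds here are illustrative assumptions.
    """
    flagged = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, identity, evalue = fields[0], float(fields[2]), float(fields[10])
        if identity >= min_identity and evalue <= max_evalue:
            flagged.add(query)
    return flagged
```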

Filter mis-labeled sequences (optional, but highly recommended)

At this point the data is in pretty good shape, and you'll probably notice that a fair bit of it has been filtered out. The final filtering stage attempts to clean up mis-identified sequences: sequences that cluster with one taxonomic group based on sequence similarity, but are taxonomically identified in GenBank as belonging to a very distantly related group. A simple example: a sequence clusters with a group of fungi based on similarity, but is labeled with the TI value of an insect. Since we have no way of knowing whether the TI or the sequence itself is the error, we need to exclude these sequences.

Our simple method for doing this relies on absolute deviations from the median. The tool find-outliers.py calculates the median number of steps along the taxonomy between any two taxa in a cluster, and any sequence that is more than a given threshold of deviations away is flagged in an output file. The flagged sequences can be reviewed, and a second script, purge-outliers.py, can then be used to remove them. If you don't care about the details, you can run them as follows:

$ find-outliers.py
$ purge-outliers.py
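The median-absolute-deviation rule can be sketched in a few lines. Here `distances` maps each sequence to its median taxonomy distance to the other cluster members (a precomputed input in this sketch), and the function name and 3.0 default threshold are assumptions for illustration:

```python
from statistics import median

def flag_outliers(distances, threshold=3.0):
    """Flag sequences whose taxonomy distance deviates too far from the median.

    distances: dict of sequence ID -> median number of taxonomy steps to the
    other members of the cluster. Anything more than `threshold` absolute
    deviations from the cluster median is flagged.
    """
    values = list(distances.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)  # median absolute deviation
    if mad == 0:
        return []  # no spread, nothing to flag
    return [name for name, v in distances.items() if abs(v - med) / mad > threshold]
```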

Filter a second time for PI clusters

Once all of this filtering is done, you will need to filter a second time for PI clusters. Some of the clusters will likely have had sequences removed, which may drop them below the threshold of 4 or more unique taxa required to be considered phylogenetically informative. Just run the pull-pic-clusters.py script on your newly filtered clusters and your data is ready.

Build Alignments

The next step is to build alignments for each of the filtered clusters. This is a relatively standard procedure; we have provided a script, build-alignments.py, that automates building an alignment for every cluster in a directory. Each alignment is built with MUSCLE, and the alignment parameters can be tweaked inside the script as necessary; the defaults were chosen to balance speed and accuracy. The output is a directory of alignments in FASTA format, one per cluster. There will also be a second directory of FASTA files, reduced-alignments/, that contains the same alignments filtered for column density: for each alignment, every column is examined, and if a column contains missing data in more than a given percentage of its rows, it is removed. Either dataset can be used from here on out.
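The column-density filter that produces reduced-alignments/ can be sketched as follows. The function name, the 0.5 default cutoff, and the set of characters treated as missing are all assumptions for illustration; the script's internals may differ:

```python
def reduce_alignment(rows, max_missing=0.5, missing="-?Nn"):
    """Drop alignment columns whose fraction of missing data exceeds max_missing.

    rows: equal-length aligned sequences. Columns are examined one at a time,
    and a column is kept only if its share of gap/missing characters is at or
    below the cutoff.
    """
    kept_columns = []
    for column in zip(*rows):
        gaps = sum(1 for ch in column if ch in missing)
        if gaps / len(column) <= max_missing:
            kept_columns.append(column)
    return ["".join(chars) for chars in zip(*kept_columns)]
```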

Build Trees

In order to understand the relationships between taxa in a given cluster we need to build gene trees for each cluster. We have tools that give you two options to do this.

  1. build-trees-raxml.py - Trees are built using RAxML. Standard maximum likelihood method. Works well, but can take a very long time with bootstrap replicates.
  2. build-trees-fasttree.py - Trees are built using FastTree. Infers an approximately-maximum-likelihood tree with support values for each node. Much faster than RAxML but the trees are usually a little less accurate.

Both of these scripts accept a directory of alignment files in FASTA format and will build a tree with the given parameters for each alignment. At the end, all of the best trees are combined into a single trees.out file in newick format. For our normal run, we use the RAxML method with 100 bootstrap replicates. Be warned, though: even on a relatively fast system using up to 20 cores, this can take weeks to run.

Collapse poorly supported nodes (optional)

In order to be sure that the relationships represented in our unrooted gene trees are accurate, we normally choose to collapse any nodes in each tree that are under 70% support. If you would like to perform this step on your trees, you can use the script collapse-nodes.py. This script accepts the trees.out file, traverses each tree, and collapses any nodes with less than a given support value.
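The collapse rule itself is easy to state: if an internal node's support is below the threshold, its children are spliced into the parent, turning the weak split into a polytomy. The sketch below shows that rule on a simple nested-dict tree (a deliberate simplification; collapse-nodes.py operates on the newick trees in trees.out):

```python
def collapse(node, min_support=70.0):
    """Collapse poorly supported internal nodes into polytomies.

    An internal node is {"support": float, "children": [...]}; a leaf is
    {"name": str}. Children of a node with support below min_support are
    spliced into its parent. This representation is an illustrative assumption.
    """
    if "children" not in node:
        return node  # leaf: nothing to collapse
    new_children = []
    for child in node["children"]:
        child = collapse(child, min_support)
        if "children" in child and child.get("support", 100.0) < min_support:
            new_children.extend(child["children"])  # collapse into a polytomy
        else:
            new_children.append(child)
    node["children"] = new_children
    return node
```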

Extract rooted convex subtrees

In order to understand the phylogenetic relationships of the taxa in a given cluster, we need a rooted tree. Unfortunately, without an outgroup or additional information, we can't root our unrooted gene trees by traditional methods. We have developed a tool, extract-convex-subtrees.py, to attempt to overcome this limitation. The script maps each tree onto a taxonomy graph and finds the most recent common ancestor of all taxa in the unrooted gene tree. It then traverses the graph node by node, from the most inclusive node to the least inclusive. If, at a given node, a group of taxa can be pruned into a subtree with a single cut along the unrooted gene tree, that subtree is extracted and rooted at the current node. This produces a set of "convex rooted subtrees" that contain real taxa from the unrooted gene trees, but which are rooted, giving a direction and representation of the relationships of those taxa.

Create Taxonomy-Phylogeny-Alignment Graph

In order to quantify phylogenetic signal across a large dataset, the data (in our case, rooted trees) needs to be merged into a single object. We accomplish that in Phyloboost by merging all of the rooted subtrees into a common graph structure using the build-taxonomy-phylogeny-graph.py script. Once all of the trees are merged, you can explore the various components of the graph to see how coalesced or disconnected your original input data is relative to the NCBI taxonomy.
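The "how coalesced is it" question amounts to counting connected components once all taxonomy and subtree edges live in one graph. A minimal sketch of that idea (the edge-list representation and function name are assumptions; the real script builds a richer graph):

```python
from collections import defaultdict

def components(edges):
    """Count connected components in an undirected graph given as edge pairs.

    edges: iterable of (node, node) pairs, e.g. taxonomy parent/child edges
    plus edges contributed by the rooted subtrees. One component means the
    data is fully coalesced; more means disconnected islands.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, count = set(), 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        stack = [start]  # iterative depth-first search over the component
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(graph[node] - seen)
    return count
```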

Coming Soon: A simple tutorial on exploring the various graph components.