Skip to content

Classifying sequences with treesapp assign

Connor Morgan-Lang edited this page Dec 10, 2021 · 9 revisions

Overview

TreeSAPP's gene-centric classification workflow assigns both a gene name and taxonomic label to amino acid sequences. It is capable of classifying either genomic or proteomic sequences in this manner, and they can be derived from genomes (isolate, single-cell amplified or metagenome-assembled) or metagenomes. The TreeSAPP subcommand used for classifying sequences is treesapp assign. Nucleotide reference packages are currently not supported.

Supported versions >=0.11.0


Table of Contents

Workflow description

The classification workflow begins with conceptually translating open reading frames (ORFs) using Prodigal 1 if the input FASTA or FASTQ file contains nucleotide sequences. This step is not necessary otherwise. The protein ORFs are then queried against a set of profile hidden Markov models (HMMs) using hmmsearch from the HMMER suite 2. Proteins that are deemed homologous are then divided into groups based on the regions they aligned to in the profile HMM and these groups are used in a profile alignment with hmmalign to generate a multiple sequence alignment with reference and query sequences. The evolutionary placement algorithm (EPA-NG) is used to insert the query sequences into the reference phylogeny 3. Each placement, called a PQuery, is assigned a likelihood weight ratio, and distal and pendant length distances 4. These are used for further filtering out non-homologous sequences.

PQueries are assigned a taxonomic label based on the lowest common ancestor (LCA) of their descendants. Furthermore, a taxonomic rank is recommended by a linear model that models the correlation between taxonomic rank and phylogenetic distance. This is useful in situations where a PQuery is placed on an edge with few children such that the LCA is highly resolved but the phylogenetic distance between the query and ancestral reference state is large.

Finally, classified sequences are written to a classification table and PQueries are concatenated into a single JPlace file for each reference package's classifications. The latter can be used for visualization with iTOL 5.

Here is a diagram of the workflow:

Usage

For a list of all available arguments use treesapp assign -h.

Basic usage

treesapp assign only requires a single FASTA or FASTQ file containing either nucleotide or amino acid sequences. It is capable of detecting the molecule type automatically, but this can also be indicated explicitly through the -m/--molecule argument with the choices 'dna' and 'prot'. By default the output directory './output' would be written to unless a directory path is provided using the argument -o/--output.

treesapp assign -i my.fasta -o treesapp_output/

Recommended arguments

In addition to the basics, we tend to also extract phylogenetically informative regions from multiple sequence alignments using BMGE with the flag --trim_align 6. This speeds up the phylogenetic placement stage without significantly harming classification performance (recall and precision).

Another well-used option is --num_procs to determine the number of threads available to Prodigal (FASTA is split into chunks and Prodigal is run in parallel), HMMER, and EPA-NG.

If you're not interested in using the reference packages provided in the GitHub version (maybe you've happened upon our RefPkgs repository) and you want to classify sequences using different ones, you can direct TreeSAPP to these with the argument --refpkg_dir.

Targeted classifications

treesapp assign identifies homologs using all reference packages by default, although in many circumstances this is not necessary or desired. The argument for selecting the subset of reference packages to query is -t. This requires one or more comma-separated reference package names (typically the short-form gene or protein name). The selection can be found by running treesapp info -v. The bottom chunk of standard output lists the gene/protein names mapped to reference package codes, descriptions and other information. Here is an example:

Name	Molecule	Tree builder	RefPkg-type	Leaf nodes	Description	Created	Last-updated
DsrAB	prot	FastTree	functional	560	Dissimilatory sulfite reductase, alpha and beta subunits	2021-04-19	
HydA	prot	FastTree	functional	93	[FeFe] hydrogenase, group A, B1, B3	2021-04-21	
McrA	prot	FastTree	functional	204	Methyl coenzyme M reductase alpha subunit	2021-04-23	
McrB	prot	FastTree	functional	122	Methyl-coenzyme M reductase beta subunit	2021-04-19	
McrG	prot	FastTree	functional	123	Methyl-coenzyme M reductase gamma subunit	2021-04-19	
NapA	prot	FastTree	functional	748	Periplasmic nitrate reductase, large subunit	2021-04-21	
NifD	prot	FastTree	functional	844	nitrogenase molybdenum-iron protein alpha chain	2021-04-22	
NirK	prot	FastTree	functional	325	nitrite reductase (NO-forming)	2021-04-21	
NirS	prot	FastTree	functional	523	nitrite reductase (NO-forming) / hydroxylamine reductase	2021-04-21	
NorB	prot	FastTree	functional	742	nitric oxide reductase large subunit B [COG3256]	2021-04-22	2021-04-22
NosZ	prot	FastTree	functional	326	Nitrous-oxide reductase (TAT-dependent|Sec-dependent)	2021-04-22	2021-04-23
NxrA	prot	FastTree	functional	866	nitrate reductase / nitrite oxidoreductase, alpha subunit	2021-04-23	
NxrB	prot	FastTree	functional	519	nitrate reductase / nitrite oxidoreductase, beta subunit	2021-04-22	

To analyze just the McrABG subunits (last three rows), you could include -t McrA,McrB,McrG in your treesapp assign command.

Abundance estimates

treesapp assign is also capable of assigning FPKM or TPM abundance values to classified sequences when the --rel_abund flag is invoked. At least one FASTQ file is also required (using the -r argument), and a second if the reads are paired-end and not interleaved (path provided to -2). The --pairing option is used to specify whether the library is paired-end ('pe') or single-end ('se').

BWA is used to both index the classified nucleotide ORF sequences, and align reads to this index 7. samsum, a Python package dependency, parses the SAM file written and calculates normalized relative abundance values of each ORF.

Note: This routine is only available if a genome was provided and Prodigal was run - we cannot map DNA reads to protein sequences.

Checkpointing

treesapp assign supports restarting from incomplete output (i.e. checkpoints), as described in the general notes. Additionally, the --stage argument can be used to control what stages are completed, where all stages up to and including the one specified are ran.

Outputs

There are a variety of output files generated by treesapp assign in three subdirectories: final_outputs, intermediates and iTOL_output.

The most useful files are the classification table and FASTA file containing all of the classified query sequences, already truncated to include just the homologous region, in the final_ouputs/ directory.

The intermediates/ directory contains many other files such as intermediate JPlace files, MSAs, and more. If the --delete flag was invoked, however, this directory was removed.

The inputs for visualization with iTOL are located in iTOL_output/.

Classification table

The table containing functional and taxonomic classifications for the queries is final_outputs/classifications.tsv under the output directory. Each row represents the classification for a single query sequence with the gene/protein name and taxonomic label it was classified as. Chances are, you are just interested in the first six columns.

Field Description
Sample Name of the sample the ORF originated from. This is taken from the name of the FASTA file provided to -i/--fasta_input except when the --fpkm flag is used, in which case the name is taken from the FASTQ file(s).
Query Name of the query protein sequence or conceptually translated ORF sequence.
Marker Name of the reference package (i.e. gene or protein name) the query sequence was classified as. If you are unfamiliar with this name, treesapp info -v can give you a brief description.
Start_pos Left-most alignment position on the query sequence to the reference profile HMM. *
End_pos Length of the query sequence that was aligned to the reference profile HMM and subsequently used for phylogenetic placement. This could be, and usually is, a subsequence of the full-length ORF or protein sequence.
Taxonomy Taxonomic label assigned to the query sequence based on LCA alone and the rank recommended by the reference package's linear model. **
Abundance The abundance of the query sequence in the dataset. If --fpkm was not used it is set to 1, otherwise it is a fragments per kilobase per million mappable reads (FPKM) value.
E-value The Expectation value for the aligned query sequence calculated by hmmsearch.
iNode The internal node number representing where on the reference phylogeny the query sequence was placed.
LWR Likelihood weight ratio assigned to the query by EPA-NG.
EvoDist Sum of all phylogenetic distances calculated by EPA-NG as well as the average node-to-tip distances.
Distances A comma-separated list of the distances that summed to equal EvoDist

Notes:
* The total homologous sequence length was substituted for the Start_pos and End_pos fields in version 0.8.7
** The LCA-only taxonomy was removed in version 0.8.7

iTOL inputs

Within the root output directory, there is a directory called iTOL_output/. Within there is a directory for each of the reference packages that query sequences were assigned to and within those are files you can use to visualize your placements in iTOL.

If you have an iTOL account already you should login first and once you are in your workspace click-and-drag the JPlace file from one of the iTOL output refpkg directories into a workspace. If you do not have an iTOL account (and don't want to get one) you can navigate to the "Upload" button at the top of the page and upload the JPlace file through that.

Once you are viewing the phylogeny you should toggle on the "Phylogenetic Placements" dataset. Then, to finish uploading the annotation files, click-and-drag the remaining files in that directory (e.g. simplebar.txt, labels.txt) onto the webpage et voila, you have a b-e-a-utiful phylogeny showing you where sequences in your dataset were placed.

To paint the reference package's phylogenetic tree by taxonomy in iTOL you will need to create the colour_strip.txt and colours_style.txt files using treesapp colour. These files are also imported by clicking-and-dragging onto the browser window.

References

  1. Hyatt, D., Chen, G.-L., Locascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11, 119. https://doi.org/10.1186/1471-2105-11-119

  2. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics (Oxford, England), 14(9), 755–763. https://doi.org/btb114

  3. Barbera, P., Kozlov, A. M., Czech, L., Morel, B., & Stamatakis, A. (2018). EPA-ng: Massively Parallel Evolutionary Placement of Genetic Sequences. Systematic Biology, 0(0), 291658. https://doi.org/10.1101/291658

  4. Matsen, F. A., Hoffman, N. G., Gallagher, A., & Stamatakis, A. (2012). A format for phylogenetic placements. PLoS ONE, 7(2), 1–4. https://doi.org/10.1371/journal.pone.0031009

  5. Letunic, I., & Bork, P. (2019). Interactive Tree Of Life (iTOL) v4: recent updates and new developments. Nucleic Acids Research, 47(W1), W256–W259. https://doi.org/10.1093/nar/gkz239

  6. Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): A new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10(1). https://doi.org/10.1186/1471-2148-10-210

  7. Li, H. (2013). Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. ArXiv Preprint ArXiv, 00(00), 3. https://doi.org/arXiv:1303.3997