Skip to content

Evaluating taxonomic classification accuracy

Connor Morgan-Lang edited this page Jan 21, 2021 · 5 revisions

Overview

Clade exclusion analysis is a method for evaluating a software's taxonomic classification performance 1,2. The goal of it is to simulate scenarios where the reference database does not adequately reflect the full diversity of a sample, or a set of query sequences. By starting with a database that at first represents the query sequences, it proceeds by iteratively removing cohesive taxonomic groups (i.e. a clade) from the reference database, classifying query sequences belonging to these clades, then replacing them before carrying on to the next one. These clades often represent not just different taxa of the same rank, but also taxa of different rank. Therefore, clade exclusion analysis allows one to approximate expected classification performances with real community data, where queries vary in relation to the database, in a controlled experiment.

The TreeSAPP command treesapp evaluate uses clade exclusion analysis to estimate a reference package's taxonomic classification performance for any taxonomic rank. It is useful to know this once you're created a new reference package and before you use treesapp assign.

Supported versions >=0.9.0


Table of Contents

Usage

Here is the command-line usage and options, printed to the console by treesapp evaluate -h:

usage: treesapp evaluate [-v] [-h] -i INPUT [-o OUTPUT] [--delete]
                         [--trim_align] [-w MIN_SEQ_LENGTH]
                         [-m {prot,dna,rrna}] -r PKG_PATH [PKG_PATH ...]
                         [--accession2taxid ACC_TO_TAXID] [-a ACC_TO_LIN]
                         [--overwrite] [-n NUM_THREADS]
                         [--taxonomic_ranks {domain,phylum,class,order,family,genus,species} [{domain,phylum,class,order,family,genus,species} ...]]
                         [--fresh] [--tool {treesapp,graftm,diamond}]
                         [-l LENGTH]
                         [--stage {continue,lineages,classify,calculate}]

Evaluate classification performance using clade-exclusion analysis.

Required parameters:
  -i INPUT, --fastx_input INPUT
                        An input file containing DNA or protein sequences in
                        either FASTA or FASTQ format
  -r PKG_PATH [PKG_PATH ...], --refpkg_path PKG_PATH [PKG_PATH ...]
                        Path to the reference package pickle (.pkl) file.

Optional options:
  -o OUTPUT, --output OUTPUT
                        Path to an output directory [DEFAULT = ./output/]
  --trim_align          Flag to turn on position masking of the multiple
                        sequence alignment [DEFAULT = False]
  -w MIN_SEQ_LENGTH, --min_seq_length MIN_SEQ_LENGTH
                        minimal sequence length after alignment trimming
                        [DEFAULT = 30]
  -m {prot,dna,rrna}, --molecule {prot,dna,rrna}
                        Type of input sequences (prot = protein; dna =
                        nucleotide [DEFAULT]; rrna = rRNA)
  --taxonomic_ranks {domain,phylum,class,order,family,genus,species} [{domain,phylum,class,order,family,genus,species} ...]
                        A list of the taxonomic ranks (space-separated) to
                        test. [ DEFAULT = class species ]
  --fresh               Recalculate a fresh phylogenetic tree with the target
                        clades removed instead of removing the leaves
                        corresponding to targets from the reference tree.
  --tool {treesapp,graftm,diamond}
                        Classify using one of the tools: treesapp [DEFAULT],
                        graftm, or diamond.
  -l LENGTH, --length LENGTH
                        Arbitrarily slice the input sequences to this length.
                        Useful for testing classification accuracy for
                        fragments.
  --stage {continue,lineages,classify,calculate}
                        The stage(s) for TreeSAPP to execute [DEFAULT =
                        continue]

Taxonomic-lineage arguments:
  --accession2taxid ACC_TO_TAXID
                        Path to an NCBI accession2taxid file for more rapid
                        accession-to-lineage mapping.
  -a ACC_TO_LIN, --accession2lin ACC_TO_LIN
                        Path to a file that maps sequence accessions to
                        taxonomic lineages, possibly made by `treesapp
                        create`...

Miscellaneous options:
  -v, --verbose         Prints a more verbose runtime log
  -h, --help            Show this help message and exit
  --delete              Delete all intermediate files to save disk space.
  --overwrite           overwrites previously processed output folders
  -n NUM_THREADS, --num_procs NUM_THREADS
                        The number of CPU threads or parallel processes to use
                        in various pipeline steps [DEFAULT = 2]

Summary of important arguments:

  • -i/--fastx_input is a fasta file containing query sequences. It's best to use a set of queries with a broad taxonomic range, unless you're scope is small and focused on a particular taxon. Using the same fasta file provided to treesapp create is ideal.
  • -r/--refpkg_path is the path to the reference package .pkl file.
  • --taxonomic_ranks dictates what ranks are evaluated. By default this is just 'class' and 'species', to approximate performance at both ends of the spectrum.
  • -l/--length can be used to estimate the classification performance of sequences that are randomly trimmed to a specific length. For instance, if you plan on classifying ORFs predicted from unassembled or merged Illumina reads, you should probably estimate the performance of query sequences truncated to 50 to ~200 characters (150 bp and 300 bp merged paired-end reads, respectively).
  • --fresh will trigger TreeSAPP to recreate the phylogeny for every round of clade exclusion analysis, instead of pruning the leaf nodes from the tree representing excluded clades. This will require significantly more time to complete and is generally not recommended.
  • -a/--accession2lin refers to a tab-separated value file with a one-to-one mapping between a query sequence's accession and its taxonomic lineage. This file is automatically created by treesapp create, but may be deleted if the --delete flag was invoked.

treesapp evaluate can be ran iteratively, such that if you first ran it with --taxonomic_ranks class you would be able to later re-run with --taxonomic_ranks genus species and the results would be retabulated for class, genus and species - no need to start over again! That is, unless you included either the --overwrite or --delete flags...

Workflow description

treesapp evaluate begins by collecting the taxonomic lineage information for all query sequences. This can be effectively skipped by providing the accession2lin table.

Once the taxonomic lineages are known, the next step determines the set of query sequences that are to be used to evaluate the reference package at each taxonomic rank specified. Not every taxon can be used as often times a taxon will only be represented by one child taxon (e.g. Methylosarcina barkeri is the only species representing the genus Methylosarcina). In these cases, if the child taxon is removed, the optimal taxonomic classification will not represent a genus-level relationship, but potentially family or even more distant. These taxa are skipped to avoid misrepresenting the relationship between accuracy and taxonomic rank.

Something to note: taxonomic rank is used here as a proxy for evolutionary divergance, or an approximation of how related the query sequences are to the reference database. If you're interested in knowing how treesapp assign performs when the query sequences are distantly related to the reference package, as would be the case when classifying query sequences from an understudied environment, then use 'class' or 'order'. If you're just interested in knowing the performance when the queries are well represented by the reference package, use 'species'.

With the sequences that are to be used for evaluating the reference package clade-exclusion analysis begins. Briefly, a new reference package is created without the reference sequences belonging to a taxon. Query sequences that belong to this taxon are then classified with the new reference package and the taxonomic labels assigned to the query sequences are saved before moving on to the next taxon.

After exhaustively performing clade exclusion analysis for all taxa of all ranks the classifications are analysed. Here, the distance from the optimal taxonomic assignment are calculated by comparing the assigned label to the taxa that were in the reference package used for classification. For example, if a taxon ('d__Bacteria; p__Nitrospirae; c__Nitrospira; o__Nitrospirales; f__Nitrospiraceae; g__Nitrospira') was assigned the label 'd__Bacteria; p__Nitrospirae; c__Nitrospira' and the closest relative was 'd__Bacteria; p__Nitrospirae; c__Nitrospira; o__Nitrospirales; f__Nitrospiraceae' the distance would be 2.

Outputs

Three files summarizing the clade-exclusion analysis results can be found in the final_ouputs/ directory:

  • accuracy.tsv: Honestly, not super useful. Provides the percentage of sequences that were correctly classified at each taxonomic rank. For instance, if the percentage at Phylum was 99.8, that would mean that if the taxonomic labels were truncated to the rank of phylum, 99.8% of these labels would be correct.
  • clade_exclusion_performance.tsv: Summarises the number of sequences that were correctly assigned within a taxonomic distance, as well as the sequences that were over- (towards species) and under-classified (towards Root). The taxonomic distance controls the acceptable error. A distance of seven includes all query sequences where as a distance of two includes only the sequences that were classified within two ranks of the perfect taxonomic classification.
  • taxonomic_recall.tsv: Provides the recall for each taxon. Columns are 'Rank', 'Taxon' and 'Recall'.

References

  1. Brady, A., & Salzberg, S. L. (2009). Phymm and PhymmBL: Metagenomic phylogenetic classification with interpolated Markov models. Nature Methods, 6(9), 673–676. https://doi.org/10.1038/nmeth.1358
  2. Peabody, M. A., Van Rossum, T., Lo, R., & Brinkman, F. S. L. (2015). Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities. BMC Bioinformatics, 16(1), 363. https://doi.org/10.1186/s12859-015-0788-5