Analysis run through

Rutger Vos edited this page Jan 24, 2017 · 40 revisions

A typical SUPERSMART run consists at least of the steps described on this page, in the order in which they are given here. The examples, such as for the primates and the piperaceae, follow these same steps:

  1. smrt taxize
  2. ... align
  3. ... orthologize
  4. ... bbmerge
  5. ... bbinfer
  6. ... bbreroot
  7. ... bbcalibrate
  8. ... consense
  9. ... bbdecompose
  10. ... clademerge
  11. ... cladeinfer
  12. ... cladegraft

The steps in a typical run are thus the execution of subcommands of the smrt program. For example, the first step is the smrt taxize subcommand. As the pipeline runs inside a UNIX-like environment, the subcommands can be chained together using a variety of different approaches. In the examples we provide simple shell scripts (though you can make these as advanced as you want), but many other approaches are possible, for example by calling these commands from within a programming language such as Python or R, from within a Makefile (nice to keep track of the dependencies between steps), etc. For a first run-through, though, it makes sense to compose the commands one by one in a terminal window.

To find out how to use the subcommands when working in the terminal they provide a usage help message that is displayed when issuing either smrt help <subcommand>, or smrt <subcommand> --help. In the descriptions below, in addition to all required arguments all optional arguments are also described. For the most part these are names for input and output files. Unless you have good reasons it is probably best to go with these defaults because other subcommands downstream also assume them so you'd have to change their invocation as well.

The behavior of some commands is also influenced by certain variables (in ALL CAPS), which are described where they apply for the relevant subcommands below. These variables can be specified either for your entire SUPERSMART installation, or just for a specific run. To update a variable for your entire installation, run the smrt-config command, which opens an editor to update the configuration file supersmart.ini. It is a good idea to peruse this file to get an idea of what configuration options are available. To update or overwrite a variable for a specific run, you can set it as an environment variable with the SUPERSMART_ prefix. For example, to change the number of parallel nodes that a certain command can use to 2 (by default this is NODES=4, but you are well advised to run SUPERSMART on an architecture that provides for more cores/nodes), do this:

export SUPERSMART_NODES=2

Lastly, when running, each subcommand will issue a large stream of logging messages, which are shown in different colors in the terminal window. Green and blue messages (informational and debugging, respectively) simply inform of the progress of the analysis. Yellow warning messages indicate recoverable issues, though make sure that you understand what the messages mean. Red error messages identify serious, non-recoverable issues, which will only occur if something is seriously wrong. The output of various programs that are invoked by the pipeline is shown in black.

smrt taxize

This step takes either a text file with a list of species names (example), or a higher taxon name provided as a command line argument, and performs taxonomic name resolution and expansion on it. The result is a tab separated spreadsheet where each row represents a species that has been resolved, and the columns give the higher taxa (genus, family, etc.) for this species (example). To run taxize on a file, do as follows:

smrt taxize -i <filename>

where <filename> is the name of a text file that has on each line a species name to resolve. To run taxize on a higher taxon, do as follows:

smrt taxize -r <taxon>

where <taxon> is a higher taxon name, such as Primates.

Optional arguments

-o <outfile>

Specifies a different name for the output spreadsheet. The default is species.tsv

-e <rank>

Specifies to which taxonomic rank to expand the higher taxon. Typically the default, i.e. species, works best. In any case it has to be a taxonomic rank that is actually used by NCBI to annotate individual sequences with, so this could be below species level (e.g. a subspecies) but not above it.

#####-b Removes non-binomial taxon names in the output file. Can be used to filter out unclassified taxa as e.g. Canis sp.

-a <all ranks>

Also includes taxa of ranks higher than species to the output spreadsheet.

smrt align

This step selects and aligns all phylota clusters that are relevant for the taxa in the species spreadsheet. During this step, many FASTA files are produced and placed in a directory called alignments. The FASTA files are named after the phylota cluster they are based on, so the name consists of <seed gi>-<MRCA taxon ID>-<cluster ID>-<cluster type>.fa.

Optional arguments

-i <infile>

Specifies a different name for the species spreadsheet. Default is species.tsv

-o <outdir>

Specifies a different name for the output directory containing the alignments. Default is alignments/

-z <zip>

Specifies whether the output directory should be compressed in a zip archive.

smrt orthologize

The alignments from the previous step consist of sets of sequences that have been clustered, by the PhyLoTA pipeline, by all-versus-all BLAST searches of the sequences that belong to an algorithmically selected higher taxonomic level (e.g. a genus or a family). PhyLoTA cannot cluster all of GenBank simply because the number of pairwise comparisons would be too large. It is of course possible that the same marker has been sequenced more broadly across such clusters (e.g. among multiple genera or families). This step therefore assesses which alignments from the previous step are orthologous with one another, and merges them. The result consists of many FASTA files, each named clusterXXX.fa (where XXX is a unique integer), and an index file that lists all the created FASTA files, called merged.txt.

Merging of orthologous clusters is influenced by the following variable:

  • BACKBONE_MAX_DISTANCE defines the threshold for the average pairwise uncorrected distance of the candidate alignment. If a candidate is more divergent than that it is omitted. Reasonable values are probably in the range of 0.05 to 0.1.

Optional arguments

-i <dirname>

To specify the location of the alignments directory from the previous step. Default is alignments/

-o <outdir>

Specifies a different name for the output directory containing the alignments. Default is clusters/

-z <zip>

Specifies whether the output directory should be compressed in a zip archive.

smrt bbmerge

This step merges the orthologous clusters into a supermatrix in PHYLIP format. Candidate clusters to merge into the supermatrix are filtered according to the following variable:

  • BACKBONE_MIN_COVERAGE defines the minimum number of alignments that every species in the supermatrix must participate in. If this value is too high, too many alignments will be concatenated and the supermatrix will become too big to handle in the next steps. Conversely, if this is too small, there will be too little transitive overlap between taxa. Reasonable values are in the range of 3 to 5.

These values can be specified in two ways:

  1. by editing the configuration file using the smrt-config command. This means that the newly updated value will be used by all subsequent runs.
  2. by setting environment variables, prefixed with SUPERSMART_ (example: SUPERSMART_BACKBONE_MAX_DISTANCE).

Optional arguments

-i <dirname>

Specifies the directory or zip file containing merged alignments. Default is clusters/

-t <filename>

Specifies the species spreadsheet. Default is species.tsv

-o <outfile>

Specifies the supermatrix output file name. Default is supermatrix.phy

-i <taxon1,taxon2,...>

Specifies names of taxa present in the species spreadsheet that must be included in the supermatrix, regardless their marker coverage. By default, exemplar taxa are selected algorithmically.

-m <filename>

Specifies the name for an output spreadsheet summarizing the selected exemplar taxa and markers. Default is markers-backbone.tsv

-e <number>

Specifies the number of exemplar species per genus. Defaults to 2. Set to -1 to include all species in genera.

smrt bbinfer

Constructs backbone phylogenies using the supermatrix from the previous step. Three inference tools are currently available: ExaML, ExaBayes and RAxML. For the maximum-likelihood tools ExaML and RAxML, a bootstrap analysis can be specified. The following variables are important:

  • For ExaML and RAxML, the substitution model can be specified using the variables EXAML_MODEL or RAXML_MODEL, respectively. These can be specified using smrt-config or by setting environment variables with the prefix SUPERSMART_ (e.g. SUPERSMART_EXAML_MODEL).
  • For ExaBayes, the number of runs (EXABAYES_NUMRUNS), the number of chains (EXABAYES_NUMCHAINS) and the number of generations (EXABAYES_NUMGENS) can be specified.

We have found that the defaults for these variables work reasonably well in most cases.

Optional arguments

-s <infile>

Specifies the name of the supermatrix. Default is supermatrix.phy

-i <inference tool>

Specifies the name of the inference tool to use. Can be one of examl, raxml or exabayes

-t <taxafile>

Specifies the name of the taxa file which will be used to generate a starting tree if the inference tool is ExaML.

-k <starting tree>

Specifies a starting tree if the inference tool is ExaML.

-m <random start>

Lets the ML search start from a random tree if the inference tool is ExaML.

-b <bootstrap replicates>

Using an ML tool, specifies the number of bootstrap replicates.

-o <outfile>

Specifies the name of the output tree file. Default is backbone.dnd

-x

Specifies that all intermediate files are to be cleaned up.

smrt bbreroot

The tree(s) from the previous step are unrooted. As we are constructing time-calibrated species trees the trees must be rooted. This step does this using one of two possible approaches:

  1. Using an outgroup. This outgroup can be a single species, a comma-separated list of species, or of higher taxa. For example, in the phylogeny of the Primates, the Strepsirrhini (lemurs, bush babies, and such) are the outgroup of the monkeys, apes and tarsiers. Therefore, this higher taxon can be used to root the tree. This approach of course requires some prior knowledge about what would be a good outgroup.
  2. Using taxonomy. With this approach, the species spreadsheet is used to identify the rooting that minimizes the number of non-monophyletic higher taxa.

Using either approach, the output is a Newick tree file with the rerooted tree(s) called backbone-rerooted.dnd.

Optional arguments

-t <filename>

Specifies the species spreadsheet. Default is species.tsv

-b <filename>

Specifies the backbone tree file. Default is backbone.dnd

-g <taxon1,taxon2,...>

Specifies outgroup taxa.

-o <outfile>

Specifies the output file. Default is backbone-rerooted.dnd

-s

Smooths the placement of the root along the basal branch. The approach is to select a placement that approximates the midpoint, such that the average root-to-tip path length is approximately equal for the taxa on both sides of the root.

smrt bbcalibrate

Time-calibrates the rooted backbone tree(s) using treePL. Calibration points for this step are specified using a tab-separated spreadsheet such as this. You can make such a file for example in Excel or in a text editor, but make sure that:

  1. The separators really are tab characters, not something else.
  2. The line breaks are UNIX line breaks, not Windows or MacOS.

The following columns are required, but may be used in any order:

  • NFos - a unique integer for every fossil
  • FossilName - a human readable name for the fossil
  • CrownvsStem - whether the fossil is a stem or a crown fossil
  • CalibratedTaxon - which higher taxon node the fossil applies to
  • MinAge - minimum age
  • MaxAge - maximum age
  • BestPracticeScore - currently not used, but refers to the fossil vetting ranking used by FossilCalibrations.org

The location of your fossil spreadsheet must be provided on the command line, so at a minimum the command must be invoked like this:

smrt bbcalibrate -f <fossils>

where <fossils> is the name of the spreadsheet file.

Note: Once the FossilCalibrations.org project makes available either a web service API or a data dump, this step will change accordingly, presumably to a version that provides (semi-)automated fossil harvesting functionality.

Optional arguments

The following arguments may optionally be provided, in addition to the mandatory -f <fossils> argument.

-t <treefile>

Specifies the rerooted backbone tree file. Default is backbone-rerooted.dnd

-s <supermatrix>

Specifies the supermatrix file, which is needed to compute the number of sites for the penalized likelihood algorithm. Default is supermatrix.phy

-o <outfile>

Specifies the output tree file. Default is chronogram.dnd

smrt consense

This step calculates a consensus tree over a set of Newick trees. This is typically done using the output from the previous bbcalibrate command, but it can equally be used with the output from the preceding commands that produce sets of Newick trees, i.e. bbreroot and bbinfer. The output of this command is a NEXUS file that uses the additional syntax for node annotations that is generated by TreeAnnotator and understood by the highly recommended tree viewer FigTree.

Optional arguments

-i <infile>

Specifies the input tree file. Default is chronogram.dnd

-o <outfile>

Specifies the output tree file (NEXUS format). Default is consensus.nex.

-b <burnin>

Specifies the fraction of burnin to omit. This option necessarily only applies to sets of trees that were generated in Bayesian analyses, by default this value is 0.0, as that would apply to sets of bootstrapped trees.

-e <heights>

Specifies how to summarize node heights, compatible with syntax for TreeAnnotator, i.e. one of keep, median, mean or ca. Default is ca.

-l <limit>

Support threshold below which a node is collapsed in the consensus. Default is 0.0, i.e. all bipartitions are included in the consensus irrespective of their support.

-s <fossils>

Specifies the fossil table file. If provided, the FossilNames are attached to their respective nodes, where possible.

-p

Flag to indicate that the support values are posterior probabilities (and should be embedded in the node annotations as such). If not provided, the support values are assumed to be bootstrap values.

smrt bbdecompose

Decomposes the backbone topology into monophyletic clades to be refined in clade-level analyses. For each of these clades, a directory named cladeXXX (where XXX is a unique integer) is created, in which candidate alignments are written. In addition, a tab-separated spreadsheet file - called markers-clades.tsv - is created that summarizes the selected markers. The selection process for these alignments is influenced by two variables:

  1. CLADE_MAX_DISTANCE specifies a threshold for the average pairwise uncorrected distance below which alignments are still selected (above this value they are assumed to be too saturated - or possibly misaligned - to be useful). Reasonable values appear to be around 0.1, higher values would select more alignments.
  2. CLADE_MIN_DENSITY specifies the minimum fraction of the total species diversity of the clade above which alignments are selected. For example, a value of 0.5 selects all alignments that have sequence data for at least half of the taxa in the clade. Higher values would result in more complete clades, but select fewer alignments.

These variables can be set using smrt-config or as environment variables with the SUPERSMART_ prefix.

Optional arguments

-b <treefile>

Specifies the backbone tree file to decompose. Default is consensus.nex

-g

Attempt to automatically add an outgroup from a sister genus for each clade, if there is sufficient marker overlap between clades.

-a <index file>

Specifies the index file with all candidate alignments. Default is aligned.txt.

-o <outfile>

Specifies the name for the output tab-separated spreadsheet that summarizes the selected markers. Default is markers-clades.tsv

smrt clademerge

Clade-level analyses are performed using *BEAST. To this end, the candidate alignments from the previous step must be merged in such a way that it is clear to which species each selected sequence belongs (as there are multiple alignments, and multiple sequences per species in each alignment). As an intermediate format, this merged data is written as richly-annotated NeXML. The maximum number of alignments that is selected is specified with the CLADE_MAX_MARKERS variable: all alignments are sorted by decreasing taxon coverage, and then the first CLADE_MAX_MARKERS are included in the merged data set.

Optional arguments

-e

Specifies that the candidate alignments must be enriched with additional haplotypes. The number of haplotypes that is added is specified with the CLADE_MAX_HAPLOTYPES variable. Assuming you intend to do a true multi-species, multi-locus coalescent, you would always provide this argument as otherwise only a single sequence would be selected for each species, resulting in a glorified gene tree.

smrt cladeinfer

Runs a *BEAST analysis for every clade identified in the preceding steps. The input files for *BEAST are generated by inserting values in a template file located in $SUPERSMART_HOME/data/beast/starbeast.xml. This template consists for the most part of BEAST XML statements, with some additional template syntax (shown inside [% square brackets %]) for inserting the data. If you are familiar with editing BEAST XML files you should be able to make sense out of the syntax. The defaults that are in the template appear to be reasonable for many cases, but you are encouraged to modify the template as you see fit (please tell us about generally applicable improvements that can be made).

The cladeinfer step takes the NeXML files generated in the previous step (named cladeXXX/cladeXXX.xml), reads in the data and inserts it in the template file to produce a file in *BEAST syntax (named cladeXXX/cladeXXX-beast-in.xml). Once it starts running the analysis on the *BEAST input file, beast will start logging parameters to the file cladeXXX/cladeXXX.log. It is strongly recommended that you track whether the parameters are converging to "hairy caterpillars" and you achieve sufficient ESS by using the Tracer program. If for some clade it looks like an additional run would improve ESS you can append to your existing results by running:

smrt cladeinfer -n <generations> -a -x -b <burnin> -f <cladeXXX/cladeXXX.xml>

Optional arguments

-n <ngens>

Number of generations. The default, 100_000 is intended for testing and is far too low. Reasonable values are at least above 10_000_000

-s <sfreq>

Sampling frequency. Default is 1000.

-l <lfreq>

Logging frequency. Default is 1000.

-f <infile>

If provided, only the specified input NeXML file is analyzed, not all clades. This is useful for example if you want to rerun / expand the analysis for one clade, but not others.

-x

Specifies that previously generated *BEAST input files need to be rebuild. If this is not set, any already existing *BEAST files are re-used. When appending new results to previous results, this flag always must be set, as the input files must be rebuilt to specify new log and tree file names that don't overwrite existing ones (these new names are generated automatically).

-a

Specifies that results from the current analysis are to be appended to previously existing results. This is useful if you have not quite accomplished high enough ESS for all parameters and so you want to extend the chain.

-b <fraction>

Specifies the fraction of burnin to omit from newly generated results before appending them to existing results.

smrt cladegraft

This, final, step grafts the clade-level analysis results onto the backbone topology. This results in a single, time-calibrated, ultrametric tree in NEXUS format with additional annotations as are understood by, and viewable in, FigTree.

Optional arguments

-b <infile>

Specifies the name of the backbone tree file on which to graft the clades. Default is consensus.nex

-o <outfile>

Specifies the output tree file name. Default is final.nex

-c <clade tree>

Specifies the clade tree file if only grafting a single clade on the backbone.

-e <heights>

Specifies how to summarize the node heights in the clade trees. Values can be one of keep, ca, median or mean. Default is ca.