Skip to content

Hidden state prediction

Gavin Douglas edited this page Apr 2, 2021 · 18 revisions

PICRUSt2 wraps the castor R package to run hidden-state prediction (hsp) to predict gene family abundances. See below for how the --chunk_size and -p options can be used to maximize memory and run-time efficiency.

Hidden-state prediction for 16S copy, E.C. numbers, and KO abundances per-genome can be run with these commands:

hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1

hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1

The input arguments and options are:

  • -t TREEFILE: Newick tree with study sequences placed amongst reference sequences.
  • -i TRAIT_OPTION: Which default pre-calculated count table to use (one of '16S', 'COG', 'EC', 'KO', 'PFAM', 'TIGRFAM', 'PHENO')
  • -o OUTPUT: Named of output file containing predicted counts.
  • --observed_trait_table TRAIT_COUNTFILE: Trait file to use if a non-default file is needed (most users should use one of the default trait options above).
  • -m METHOD: Hidden-state prediction method to use, which needs to be one of: maximum parsimony (mp), empirical probabilities (emp_prob), subtree averaging (subtree_average), phylogenetic independent contrast (pic), or squared-change parsimony (scp).
  • -p INT: Number of processes to run in parallel.
  • -e float - Setting for maximum parisomony hidden-state prediction. Specifies weighting transition costs by the inverse length of edge lengths. If 0, then edge lengths do not influence predictions. Must be a non-negative real-valued number (default: 0.5).
  • --chunk_size INT: Number of gene families to read in for a given processor. Note that increasing this value will not always speed up execution! The trait table is split into subsets of size chunk_size before running hidden state prediction to reduce memory usage. Each subset can be run through on a different processor, but the number of simultaneous processors is equal to the number of data subsets! For example, if there are 1500 total gene families and the chunk_size is 500, the maximum number of processes that can be run is parallel is 3 (which will be the case even if you set something higher!).
  • -n: Indicates that Nearest-sequenced taxon index (NSTI) values should be calculated. This metric can be used to identify study sequences that are highly distant from all reference sequences (the predictions for these sequences are less reliable!). By default the maximum NSTI cut-off for subsequent commands is 2. Any study sequences with a NSTI value higher than 2 are typically either from uncharacterized phyla or off-target sequences. Users can visualize the NSTI distribution for their placed reads to help determine the best cut-off for their dataset for the next steps.