Skip to content

Full pipeline script

Gavin Douglas edited this page Apr 2, 2021 · 15 revisions

The entire PICRUSt2 pipeline can be run using a single script, called picrust2_pipeline.py. This script will run each of the 4 key steps outlined on this wiki: (1) sequence placement, (2) hidden-state prediction of genomes, (3) metagenome prediction, (4) pathway-level predictions.

The option of this program are the same as for each individual scripts overall.

The standard pipeline will generate metagenome predictions for 16S rRNA gene data. The input files should be a FASTA of amplicon sequences variants (ASVs; i.e. your representative sequences, not your raw reads, which is study_seqs.fna below) and a BIOM table of the abundance of each ASV across each sample (study_seqs.biom below). Note that a tab-delimited table with ASV ids as the first column and sample abundances as all subsequent columns will also work.

The below command will run the full default pipeline on the two input files. EC number and KO metagenomes are predicted as well as MetaCyc pathway abundances and coverages predicted based on the predicted EC number abundances. The nearest-sequenced taxon index (NSTI) will be calculated for each input ASV and by default any ASVs with NSTI > 2 will be excluded from the output by default. Stratified output will only be calculated when the --stratified option is set, which can greatly increase run-time.

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

All of the output files produced by the pipeline (including intermediate files, which can be useful for troubleshooting issues), will be output in picrust2_out_pipeline. Note that these are the default outputs, but if you specify different functional databases or custom reference databases (e.g. non-16S amplicon reference data) then the output will differ.

The key output files are:

  • EC_metagenome_out - Folder containing unstratified EC number metagenome predictions (pred_metagenome_unstrat.tsv.gz), sequence table normalized by predicted 16S copy number abundances (seqtab_norm.tsv.gz), and the per-sample NSTI values weighted by the abundance of each ASV (weighted_nsti.tsv.gz).
  • KO_metagenome_out - As EC_metagenome_out above, but for KO metagenomes.
  • pathways_out - Folder containing predicted pathway abundances and coverages per-sample, based on predicted EC number abundances.

Additional output files, which can be useful for advanced users are:

  • EC_predicted.tsv.gz - Predicted EC number copy numbers per ASV.
  • intermediate - Folder containing intermediate MinPath files and files used for sequence placement pipeline (including JPLACE file: intermediate/place_seqs/epa_out/epa_result.jplace).
  • KO_predicted.tsv.gz - As EC_predicted.tsv.gz above, but for KO predictions.
  • marker_nsti_predicted.tsv.gz - Predicted 16S copy numbers and NSTI per ASV.
  • out.tre - Tree of reference and study 16S sequences.

Options:

  • -s PATH - FASTA of unaligned study sequences

  • -i PATH - Input table of sequence abundances (BIOM, TSV, or mothur shared file format)

  • -o PATH - Output folder

  • -p INT: Number of processes to run in parallel.

  • -t epa-ng|sepp: Placement tool to use when placing sequences into reference tree. One of "epa-ng" or "sepp" must be input (default: epa-ng)

  • --ref_dir DIRECTORY: Argument specifying non-default reference files to use for sequence placement. There are four expected files in this directory (see below).

  • --in_traits IN_TRAITS - Comma-delimited list (with no spaces) of which gene families to predict from this set: COG, EC, KO, PFAM, TIGRFAM. Note that EC numbers will always be predicted unless --no_pathways is set (default: EC,KO).

  • --custom_trait_tables PATH - Optional path to custom trait tables with gene families as columns and genomes as rows (overrides --in_traits setting) to be used for hidden-state prediction. Multiple tables can be specified by delimiting filenames by commas. Importantly, the first custom table specified will be used for inferring pathway abundances. Typically this command would be used with a custom marker gene table (--marker_gene_table) as well

  • --marker_gene_table PATH - Path to marker gene copy number table (16S copy numbers by default).

  • --pathway_map MAP - MinPath mapfile. The default mapfile maps MetaCyc reactions to prokaryotic pathways (default: picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_pro.txt).

  • --reaction_func STR - Functional database to use as reactions for inferring pathway abundances (default: "EC", which corresponds to the prokaryotic EC number table). This should be either the short-form of the database as specified in --in_traits, or the path to the file as would be specified for --custom_trait_tables. Note that when functions besides the default prokaryotic EC numbers are used typically the --no_regroup option would also be set.

  • --no_pathways - Flag to indicate that pathways should NOT be inferred (otherwise they will be inferred by default). Predicted E.C. number abundances are used to infer pathways when default reference files are used.

  • --regroup_map ID_MAP - Mapfile of ids to regroup gene families to before running MinPath. The default mapfile is for regrouping E. C. numbers to MetaCyc reactions (default: picrust2/default_files/pathway_mapfiles/ec_level4_to_metacyc_rxn.tsv).

  • --no_regroup - Do not regroup input gene families to reactions as specified in the regrouping mapfile. This option should only be used if you are using custom reference and/or mapping files.

  • --stratified - Flag to indicate that stratified tables should be generated at all steps (will increase run-time).

  • -a {hmmalign,papara} - Which program to use for aligning query sequences to reference MSA prior to EPA-NG step (default: hmmalign).

  • --max_nsti INT - Sequences with NSTI values above this value will be excluded (default: 2).

  • --min_reads INT - Minimum number of reads across all samples for each input ASV. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).

  • --min_samples INT - Minimum number of samples that an ASV needs to be identfied within. ASVs below this cut-off will be counted as part of the "RARE" category in the stratified output (default: 1).

  • -m {mp,emp_prob,pic,scp,subtree_average} - HSP method to use."mp": predict discrete traits using max parsimony. "emp_prob": predict discrete traits based on empirical state probabilities across tips. "subtree_average": predict continuous traits using subtree averaging. "pic": predict continuous traits with phylogentic independent contrast. "scp": reconstruct continuous traits using squared-change parsimony (default: mp).

  • -e float - Setting for maximum parisomony hidden-state prediction. Specifies weighting transition costs by the inverse length of edge lengths. If 0, then edge lengths do not influence predictions. Must be a non-negative real-valued number (default: 0.5).

  • --skip_nsti - Do not calculate nearest-sequenced taxon index (NSTI), which is added as a column in the predicted marker-gene copy-number table by default (marker_nsti_predicted.tsv.gz).

  • --no_gap_fill - Do not perform gap filling before predicting pathway abundances (Gap filling is on otherwise by default).

  • --coverage - Calculate pathway coverages as well as abundances, which are an alternative way to identify which pathway are present. Note that these values are experimental and only useful for advanced users. Coverage is also calculated using the same approach as HUMAnN2.

  • --skip_minpath -Do not run MinPath to identify which pathways are present as a first pass (MinPath is run by default).

  • --per_sequence_contrib - Option to specify that stratified abundances should be reported in terms of the contribution by each predicted genome rather than how much each genome is contributing to the overall community abundance. In other words, pathway abundances will be calculated for each individual predicted genome. Stratified coverages will only be reported when this option is used (and --coverage is set). As of v2.2.0-b, unstratified pathway abundances based on the community-wide pathway abundances and also based on the per-sequence pathway abundances will be output when this option is used.

  • --wide_table -Output wide-format stratified table of metagenome predictions when --strat_out is set. This is the deprecated method of generating stratified tables since it is extremely memory intensive. The stratified outfile is named pred_metagenome_strat.tsv.gz when this option is set (added in v2.2.0-b).

  • --skip_norm - Skip normalizing sequence abundances by predicted marker gene copy numbers (typically 16S rRNA genes). This step will be performed automatically unless this option is specified (added in v2.2.0-b).

  • --remove_intermediate - Remove the intermediate outfiles of the sequence placement and pathway inference steps.

  • --verbose - If specified, print out wrapped commands to screen.