Skip to content

PICRUSt2 Streamlined Tutorial (v2.1.4 beta)

Gavin Douglas edited this page Jun 11, 2021 · 6 revisions

This is the streamlined version of the PICRUSt2 tutorial, which focuses on running commands. You can find the full tutorial here. You can see descriptions of the arguments and options used by each script by typing the script name followed by --help. All past and present official PICRUSt2 tutorials are listed here. Make sure you are looking at the correct version of the tutorial.

Minimum requirements to run full tutorial

PICRUSt2 is only available for Mac OS X and Linux systems (not for Windows unless you are running a Linux emulator). Before running the installation commands below you will need to install Miniconda.

To run this tutorial you will need at least 16 GB of RAM, however you can still run through most of this tutorial with less RAM. The key memory-intensive step is the read placement, which you can avoid by simply downloading the expected output file and proceeding with the tutorial. When running your data you can run the QIIME2 plugin of PICRUSt2 and use SEPP for read placement. Using this approach is much less memory intensive, but currently there are fewer options available for the QIIME2 plugin.

Install with bioconda

The easiest way to install PICRUSt2 is with bioconda, which will install a new environment which contains all the software needed to run PICRUSt2.

conda create -n picrust2 -c bioconda -c conda-forge picrust2=2.1.4_b

When using conda you need to specifically activate environments to use them:

conda activate picrust2

Download tutorial data

First, download, unzip, and enter the folder of tutorial files:

wget http://kronos.pharmacology.dal.ca/public_files/picrust/picrust2_tutorial_files/chemerin_16S.zip
unzip chemerin_16S.zip
cd chemerin_16S

You should see the three files below when you type ls -1:

metadata.tsv
seqs.fna
table.biom

These files are processed datafiles rather than raw reads. We can take a look at the input FASTA file with less (note that with less you can go back to the terminal by typing q).

less seqs.fna

You should see:

>86681c8e9c64b6683071cf185f9e3419
CGGGCTCAAACGGGGAGTGACTCAGGCAGAGACGCCTGTTTCCCACGGGACACTCCCCGAGGTGCTGCATGGTTGTCGTC
AGCTCGTGCCGTGAGGTGTCGGCTTAAGTGCCATAACGAGCGCAACCCCCGCGTGCAGTTGCTAACAGATAACGCTGAGG
ACTCTGCACGGACTGCCGGCGCAAGCCGCGAGGAAGGCGGGGATGACGTCAAATCAGCACGGCCCTTACGTCCGGGGCGA
CACACGTGTTACAATGGGTGGTACAGCGGGAAGCCAGGCGGCGACGCCGAGCGGAACCCGAAATCCACTCTCAGTTCGGA
TCGGAGTCTGCAACCCGACTCCGTGAAGCTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCG
>b7b02ffee9c9862f822c7fa87f055090
AGGTCTTGACATCCAGTGCAAACCTAAGAGATTAGGTGTTCCCTTCGGGGACGCTGAGACAGGTGGTGCATGGCTGTCGT
CAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTGTCATTAGTTGCCATCATTAAGTTGGGCA
CTCTAATGAGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGCCCCTTATGACCTGGGCTAC
ACACGTGCTACAATGGACGGTACAACGAGAAGCGAACCTGCGAAGGCAAGCGGATCTCTTAAAGCCGTTCTCAGTTCGGA
CTGTAGGCTGCAACTCGCCTACACGAAGCTGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGG
...

This file contains the Amplicon Sequence Variants (ASVs) of 16S rRNA gene sequences found across the mouse stool samples. The id of each ASV is written on the lines starting with ">", so the id of the first sequence is "86681c8e9c64b6683071cf185f9e3419".

These are the ids produced by the deblur denoising pipeline so the ids of your dataset may look very different depending on what pipeline you ran. For instance, this file could alternatively contain Operational Taxonomic Unit (OTU) sequences; PICRUSt2 is agnostic to which processing pipeline is used for 16S rRNA gene data.

The next file is a bit tricky to view since it is a BIOM file which is binary encoded (not easily viewed with a plain-text viewer). We can get a sense of what the first few columns and rows look like by using the following command:

biom head -i table.biom

You should see:

# Constructed from biom file
#OTU ID	100CHE6KO	101CHE6WT	102CHE6WT	103CHE6KO	104CHE6KO
86681c8e9c64b6683071cf185f9e3419	716.0	699.0	408.0	580.0	958.0
b7b02ffee9c9862f822c7fa87f055090	51.0	98.0	96.0	91.0	223.0
663baa27cddeac43b08f428e96650bf5	112.0	29.0	45.0	35.0	75.0
abd40160e03332710641ee74b1b42816	0.0	0.0	0.0	0.0	0.0
bac85f511e578b92b4a16264719bdb2d	56.0	4.0	63.0	131.0	90.0

Notice, that the first column ("#OTU ID") contains the ASV ids in the FASTA file above. The first column of this table needs to match the ids in the FASTA file. The additional columns represent different samples with the counts representing the number of reads within each of those samples. Importantly, the input table should contain read counts rather than relative abundance. You can input a rarefied table, but that is not required.

We can get more information about the rest of the table using this command:

biom summarize-table -i table.biom

Data pre-processing

The input tutorial datafiles have already been pre-processed, but there are a few things you should keep in mind for your own data:

  • Depending on the pipeline you used to process your raw 16S data there may be many extremely rare ASVs found across your samples. Such rare ASVs typically only add noise to analyses (especially singletons - ASVs found only by 1 read in 1 sample) and should be removed. Reducing the number of input ASVs will also make PICRUSt2 faster and less memory intensive.

  • Similarly, it is important to check for any low-depth samples that should be removed. These can be identified based on the output of biom summarize-table above.

  • The best minimum cut-offs for excluding ASVs and samples varies by dataset since these cut-offs will differ depending on the overall read depth of your dataset.

  • There are many ways to filter a BIOM table, but one easy approach is described in a QIIME2 tutorial.

PICRUSt2 pipeline

Now that we have and understand the data we can run PICRUSt2!

The easiest way to run PICRUSt2 is with the picrust2_pipeline.py script, which you can see here. This script automatically will run all of the steps that are described below. However, for the purposes of the tutorial we will run each step individually so that it is easier to follow.

First make and enter a new folder for writing the PICRUSt2 output files.

mkdir picrust2_out_pipeline
cd picrust2_out_pipeline

When running the below commands you can swap in the number of available cores for the 1 in -p 1, which will allow many steps to be run in parallel and speed up the pipeline.

Place reads into reference tree

The first step of PICRUSt2 is to insert your study ASVs into a reference tree (details).

You can run this step with this command (1 minute and 13.8 GB of RAM on 1 processor):

place_seqs.py -s ../seqs.fna -o out.tre -p 1 \
              --intermediate intermediate/place_seqs

The required input is the FASTA of the ASV sequences. The key output file is out.tre, which is a tree in newick format of your study ASVs and reference 16S sequences. If you do not have enough RAM to run this command locally you can simply download the expected out.tre output file here.

Hidden-state prediction of gene families

Run hsp.py, which will take 4 min and 1 GB of RAM on 1 processor with this dataset.

hsp.py -i 16S -t out.tre -o marker_predicted_and_nsti.tsv.gz -p 1 -n

hsp.py -i EC -t out.tre -o EC_predicted.tsv.gz -p 1

The output files of these commands are marker_predicted_and_nsti.tsv.gz (download) and EC_predicted.tsv.gz (download).

Note that these files are gzip compressed to save space. You can take a look at them using zless -S (zless is like less, but for gzipped files and the -S option turns off line wrapping so it is easier to see):

Generate metagenome predictions

The metagenome_pipeline.py script will output tables of gene family predictions per sample (details), which should run in ~5 seconds:

metagenome_pipeline.py -i ../table.biom -m marker_predicted_and_nsti.tsv.gz -f EC_predicted.tsv.gz \
                       -o EC_metagenome_out --strat_out --metagenome_contrib

In the above command --metagenome_contrib indicates that the long-form table of how each ASV is contributing EC numbers in each sample is output. The --strat_out option indicates that the stratified output file, which is a wide table with many 0s indicating how ASVs contribute ECs, will also be output. These tables are two different ways of seeing the breakdown of how ASVs contribute gene families. The full stratified output table can use a lot of memory and disk space as the number of ASVs and samples increases so for most purposes you will likely just want to specify --metagenome_contrib. There are several output files created within the EC_metagenome_out directory:

  • EC_metagenome_out/pred_metagenome_unstrat.tsv.gz (download) - overall EC number abundances per sample.
  • EC_metagenome_out/pred_metagenome_strat.tsv.gz (download) - EC number abundances per sample stratified by the contributing ASVs.
  • EC_metagenome_out/metagenome_contrib.tsv.gz (download) - A table breaking down how the ASVs contribute to gene family abundances in each sample. Note that this table is a more memory efficient and is an alternative format to the stratified table above.
  • EC_metagenome_out/seqtab_norm.tsv.gz (download) - the ASV abundance table normalized by predicted 16S copy number.
  • EC_metagenome_out/weighted_nsti.tsv.gz (download) - the mean NSTI value per sample (when taking into account the relative abundance of the ASVs). This file can be useful for identifying outlier samples in your dataset. In PICRUSt1 weighted NSTI values < 0.06 and > 0.15 were suggested as good and high, respectively. The cut-offs can be useful for getting a ball-park of how your samples compare to other datasets, but a weighted NSTI score > 0.15 does not necessarily mean that the predictions are meaningless.

Pathway-level inference

The last major step of the PICRUSt2 pipeline is to infer pathway-level abundances with pathway_pipeline.py (details), which you can run with this command (< 2 min):

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_strat.tsv.gz \
                    -o pathways_out -p 1

The output of this script are in the pathways_out folder. These include both unstratified (download) and stratified (download) MetaCyc pathway abundances, similar to the EC number tables above. Note that this stratified output was output since a stratified gene family table was input. This default stratified pathway abundance table represents how much each ASV is contributing to the community-wide pathway abundance and not what the pathway abundance is predicted to be within the predicted genome of that ASV alone. For that output you would need to use the --per_sequence_contrib option, which is more computationally intensive to run since pathways need to be inferred for each individual ASV. A further discussion of the difference between these stratified pathway outputs is found here.

Add functional descriptions

Finally, it can be useful to have a description of each functional id in the output abundance tables. The below commands will add these descriptions as new column in gene family and pathway abundance tables (details):

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz

Optionally visualize the PICRUSt2 output in STAMP

As mentioned above, there are many possible ways to analyze the PICRUSt2 output. STAMP is one tool that can be used that requires no background in scripting languages. This tool is very useful for visualizing microbiome data, but note that there are better alternatives for conducting statistical tests with compositional data (such as the ALDEx2 R package).

1: Install STAMP (this has to be installed on a local machine, not a server since it has a graphical interface)

  • Windows: Download and install.
  • Mac or Linux:
conda deactivate
conda install -c bioconda stamp

Then simply type the following to start:

stamp

2: If you ran PICRUSt2 on a remote server you will need to download and uncompress path_abun_unstrat_descrip.tsv.gz and metadata.tsv to your own computer. You can use WinSCP (for Windows) or CyberDuck (for Mac).

3: Start up STAMP and load in the path_abun_unstrat_descrip.tsv as the Profile File, and the metadata.tsv file as the Group metadata file. You can now explore the data overall with heatmaps and ordination tools like PCA or alternatively compare boxplots of the relative abundance of individual predicted pathways.

Clone this wiki locally