Skip to content

PICRUSt2 Tutorial (San Juan, April 2019)

Gavin Douglas edited this page May 15, 2019 · 2 revisions

This tutorial will guide you through using PICRUSt2 on a practice dataset. This is an old workflow of PICRUSt2 meant for v2.1.2-b. You can see all the official PICRUSt2 tutorials here.

You will need to install PICRUSt2 first, before proceeding.

Ensure that the PICRUSt2 environment is active:

conda activate picrust2

Now copy the tutorial files that we will be working with to your directory:

wget http://kronos.pharmacology.dal.ca/public_files/tutorial_datasets/picrust2_tutorial_files/seqs.fna
wget http://kronos.pharmacology.dal.ca/public_files/tutorial_datasets/picrust2_tutorial_files/table.biom
wget http://kronos.pharmacology.dal.ca/public_files/tutorial_datasets/picrust2_tutorial_files/metadata.tsv

Lets see those files listed now:

ls

You will see three files listed:

  • metadata.tsv
  • seqs.fna
  • table.biom

Lets look at each of these. First lets browse the metadata file using the simple text viewer less:

less metadata.tsv

With less you can go back to the terminal by typing q. This file contains the names of each of your samples within the first column and then other metadata about those samples in the other columns. Since these samples were collected from different mammals that is the metadata shown for each sample. How many samples are there?

Ok, lets quickly look at the sequences. Note you can browse each by using up and down arrow keys or by typing the space bar. Take note of the odd sequence ids after the ">" symbol. (Fun fact these are md5sums of the actual sequences!).

less seqs.fna

The next file is a bit tricky to view since it is a BIOM file which is binary encoded (not easily viewed with a plain text viewer). We can get a sense of what the first few columns and rows look like by using the following command:

biom head -i table.biom

Notice, that the first column contains those same funky sequence ids. The additional columns represent different samples with the counts representing the number of reads within each of those samples.

We can get more information about the rest of the table using this command:

biom summarize-table -i table.biom

Which sample has the highest sequencing depth?

Ok, lets actually start using PICRUSt!

STEP 1: Place amplicon sequence variants (or OTUs) into reference phylogeny (details)

place_seqs.py -s seqs.fna -o placed_seqs.tre -p 2 --print_cmds

The input into this script is the seqs.fna file and the main output is placed_seqs.tre which is simply PICRUSt's reference tree with inserted reads from the input file. There are extra options (that we could have left out): -p 2 indicates how many processors to use, and --print_cmds indicates that it should show us the commands that it is running in the background. You can checkout the details link above for more information about this command.

Step 2: Run hidden-state prediction to get 16S copy numbers, and E.C. number abundances per predicted genome (details).

Note that NSTI values will be added to the 16S prediction table (since the -n option was given).

hsp.py -i 16S -t placed_seqs.tre -o 16S_predicted.tsv -p 2 -n

hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv -p 2 -m pic

Note in this case the pic the phylogenetic independent contrast hidden-state prediction method is indicated since it is fastest and cuts our waiting time in half. However, we recommend that in practice users use the mp method.

While waiting, you can check out the details link above for this command.

Step 3: Predict E.C. in sequencing samples and also adjust gene family abundances by 16S sequence abundance (details)

metagenome_pipeline.py -i table.biom -m 16S_predicted.tsv -f EC_predicted.tsv -o EC_metagenome_out --strat_out

This command takes in our abundance BIOM table, corrects the abundances by the predicted 16S table predictions, and then outputs the predictions for all of the ECs for each sample. The --strat_out option indicates that we also want the stratified output file (taxa contributing to each function). All output files are within the EC_metagenome_out directory.

Take a look at the EC predictions with less (the -S option just turns off line wrapping so it looks better):

less -S EC_metagenome_out/pred_metagenome_unstrat.tsv

Now look at the stratified output (notice the extra second column indicating which sequence contributes to each EC):

less -S EC_metagenome_out/pred_metagenome_strat.tsv

Step 4: Infer MetaCyc pathway abundances and coverages based on predicted E.C. number abundances (details)

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_strat.tsv -o pathways_out -p 2

This obtains MetaCyc pathways from the EC predictions using MinPath. Lots of information on the details link.

Step 5: Add descriptions as new column in gene family and pathway abundance tables (details)

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv -m EC -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv -m METACYC -o pathways_out/path_abun_unstrat_descrip.tsv

This adds function descriptions for EC and pathways directly into the table for easier investigation.

That is it! You have completed the PICRUSt tutorial!

Optional: Try looking at your PICRUSt output in STAMP.

1: Install STAMP (this has to be installed on a local machine, not a server since it has a graphical interface)

  • Windows: Download and install.
  • Mac: Install MiniConda first. Then from command line: conda deactivate conda install -c bioconda stamp

Then simply type the following to start:

stamp

2: If you ran PICRUSt on a remote server you will need to download path_abun_unstrat_descrip.tsv and metadata.tsv to your own computer. You can use WinSCP (for Windows) or CyberDuck (for Mac).

3: Start up STAMP and load in the path_abun_unstrat_descrip.tsv as the Profile File, and the metadata.tsv file as the Group metadata file.

Clone this wiki locally