Skip to content

Frequently Asked Questions

Gavin Douglas edited this page Mar 28, 2023 · 88 revisions

Questions

General

Errors and Warnings

Pathway-level inference

QIIME 2

Answers

General

How can I get a breakdown of which ASVs are contributing to specific functions?

In PICRUSt1 this was done with the metagenome_contributions.py script. In PICRUSt2 you will want to generate the stratified output to see these contributions. This will create a table with columns for both functions (e.g. KOs, ECs, pathways) and sequences (i.e. ASVs) along with the abundance within each sample. This stratified output will be produced when the --stratified option is set with picrust2_pipeline.py.

A different approach would be to look at the predicted genomes of ASVs generated by the hidden-state prediction step (which represents the gene family abundances per ASV rather than per sample).

How can I access the reference files?

The default files will be downloaded no matter how you install PICRUSt2. However, you can also access these files by downloading the GitHub repository and then moving to picrust2/default_files to see the various reference file folders.

How can I restrict a FASTA file to be the first field per line?

Sometimes you can run into issues if your FASTA file has multiple fields per header-line. In other words if your FASTA file looks like this:

>seq1 X 18 Bacteria
AGAGTGAACCA
>seq2 X 10 Proteobacteria
AGGGACAGAAT

The input FASTA file should look like this instead:

>seq1
AGAGTGAACCA
>seq2
AGGGACAGAAT

You can use awk to parse FASTA files this way:

awk '{print $1 }' FASTA_IN > FASTA_OUT

Where FASTA_IN and FASTA_OUT are the names of the input and output FASTA files.

Should I rarefy or transform the predictions after running PICRUSt2?

The output predictions are very similar to the input format of read counts in that they are compositional data. This means that you definitely need to control for the compositionally somehow and you cannot just perform standard tests (like a t-test) on the raw abundances output.

You could use a tool explicitly meant for compositional data (see these articles for the motivation and possible approaches: https://www.nature.com/articles/s41467-019-10656-5 and https://www.frontiersin.org/articles/10.3389/fmicb.2017.02224/full). However, It is currently controversial in the field which approach is best to use.

Rarefying each sample’s predicted function abundances to the same count would in practice be very similar to transforming the data to % (unless there is high variance in read depth across your samples). I typically do not rarefy sequencing data and instead just ensure that all samples have passed a minimum depth threshold of ~4000 reads (before running PICRUSt2 - visualizing rarefaction curves can help you choose a threshold). Either way, you will want to use a tool that treats relative abundances appropriately instead of using an approach like Wilcoxon or t-tests on the relative abundances.

How can I run a custom or non-default database (such as the fungi 18S and ITS databases)?

PICRUSt2 is intended to be able to easily incorporate custom databases for predictions. For example, fungal predictions are available for EC numbers based on 18S and ITS amplicons. In our pre-print we showed that these predictions perform slightly better than random. It is more challenging to evaluate fungi predictions with metagenomes with high proportions of prokaryotic DNA, but nonetheless it is important to realize that these predictions appear to be substantially lower accuracy than predictions for 16S data. Advanced users may still find these predictions useful, but we emphasize that the results must be taken with a grain of salt and we do not think this database will be useful for the vast majority of users.

Nonetheless, we will explain how to point to non-default database files and give example options by pointing to the fungal database as an example. Importantly, these options could be changed in the picrust2_pipeline.py script (which wraps all of the key scripts), but the individual scripts will mainly be referred to below.

First, at the place_seqs.py step you need to specify a non-default reference database with the --ref_dir option. So this would be --ref_dir picrust2/picrust2/default_files/fungi/fungi_ITS to point to the fungal ITS files. The format of the required files in this directory is described here.

Secondly, you need to specify the reference trait tables when running the hidden-state-prediction step (hsp.py). You can specify these tables one at a time when running this command with --observed_trait_table. For example, you would specify --observed_trait_table picrust2/picrust2/default_files/fungi/ec_ITS_counts.txt.gz to specify the EC numbers and --observed_trait_table picrust2/picrust2/default_files/fungi/ITS_counts.txt.gz to specify the ITS copy number for the fungi ITS database (with two separate hsp.py commands).

Note that if you are running picrust2_pipeline.py you would need to specify these two tables with the --custom_trait_tables and --marker_gene_table options. When running picrust2_pipeline.py you would also need to specify the EC number table with the --reaction_func option to specify that this table should be used for reconstructing pathway abundances in the next step.

Finally, when inferring pathway abundances (pathway_pipeline.py) you would need to change the pathway mapfile to be for pathways found in fungi only (the default is for MetaCyc pathways found in prokaryotes only). You can do this by setting --map /home/gavin/github_repos/picrust_repos/picrust2/picrust2/default_files/pathway_mapfiles/metacyc_path2rxn_struc_filt_fungi.txt.

Depending on what gene family to pathway mapping you would like to use you may also need to either change the mapfile for regrouping gene families to reactions (--regroup_map, which maps from EC numbers to MetaCyc reactions by default) or turn off the regrouping step with --no_regroup.

How should I visualize PICRUSt2 data?

Users typically plot the relative abundances of gene families between sample groupings, much like would be done for a gene profile from actual shotgun metagenomics data. You can get some ideas for plots you can make with this kind of data by looking at the HUMAnN2 tutorial and BURRITO.

The output is a compositional datatype so you can also make similar plots that you would make for a table of taxonomic abundances, like a PCoA representing inter-sample distances and boxplots of individual features’ relative abundances between groups.

In terms of exploring the data in general, I personally prefer making custom plots in R with ggplot2. The ggpicrust2 (software, preprint) is an R package that uses the ggplot2 paradigm for analyzing PICRUSt2 data. Note that we (the PICRUSt2 authors/developers) are not involved in the development or maintenance of this R package.

One possible alternative is to use STAMP, which is a user-friendly tool for visualizing microbiome data (although sometimes this tool can be difficult to install on Macs).

How should I analyze the PICRUSt2 output?

There aren't any easy answers for this question unfortunately - how to best analyze microbiome sequencing data in general is still a very open question and there are many differing opinions. This question becomes even more challenging when you are dealing with functional predictions which have higher levels of noise than functional profiles based on actual metagenomics sequencing. You can see in Figure 2 of the PICRUSt2 manuscript that we compare how the interpretation of significant functions differs between metagenome inference pipeline and it can sometimes differ a lot. Different differential abundance tools also result in very different sets of significant functions as well, which you can see in the supplementary materials and described in the main-text.

That is all to say that we have not been able to figure out a single best practice differential abundance workflow for analyzing this data that clearly provides more reliable results compared to other workflows. Also, despite higher concordance between predicted data and matching metagenomics data than random unfortunately there can be a lot of variation in results between the chosen differential abundance workflow. This is crucial to keep in mind whenever you are performing differential abundance testing on this kind of data.

However, the ggpicrust2 (software, preprint) is an R package for analyzing PICRUSt2 data, which you can try. Note that we (the PICRUSt2 authors/developers) are not involved in the development or maintenance of this R package.

How can I limit the large stratified files to only functions of interest?

You could use grep to do this.

For example:

# To get the headerline
zcat pred_metagenome_contrib.tsv.gz | head -n 1 > contrib_subset.tsv

# Get all rows corresponding to either K00003 or K19353.
zcat pred_metagenome_contrib.tsv.gz | grep 'K00003\|K19353' >> contrib_subset.tsv

Errors and Warnings

How do I troubleshoot this error: Error running this command: hsp.py -i 16S -t /tmp/placed_seqs.tre -p 2 -n -o /tmp/picrust2_out/16S_predicted -m mp

If no further error messages are provided then you can try re-running your command with 1 processor. Currently there is an issue where parallelized steps of the PICRUSt2 pipeline are not properly reporting clear error messages. Running these steps with the setting "--threads 1" does enable useful error messages to be returned.

Is this message something to worry about: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra

This is a numpy warning and is nothing to worry about. This will be resolved in a future release that uses updated versions of numpy and scipy. The current versions used that produce this warning and being used to be compatible with qiime2-2019.01.

The full warning looks like this:

$HOME/local/prg/anaconda2/envs/q2_picrust2_test/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py:68: PendingDeprecationWarning: the matrix subclass is not the recommended way to represent matrices or deal with linear algebra (see https://docs.scipy.org/doc/numpy/user/numpy-for-matlab-users.html). Please adjust your code to use regular ndarray.
    return matrix(data, dtype=dtype, copy=False)

Why am I getting a "Permission denied" error when trying to clone the GitHub repository?

This is the example error output referred to:

Warning: Permanently added the RSA host key for IP address 'XXX.XXX.XXX.XXX' to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

This error is likely due to issues with your ssh configuration. Fortunately GitHub repos can also be cloned with HTTPS. So instead of this command:

git clone git@github.com:picrust/picrust2.git

Try this:

git clone https://github.com/picrust/picrust2.git

How do I troubleshoot "ValueError: No sequence ids overlap between all three of the input files."?

If your sequence ids are numbers the most likely reason for this error is that your sequence ids are being interpreted differently depending on the input file. In particular, it's likely that in one object they are being interpreted as strings and as numbers in a different object. If so, this is a bug that will be addressed soon. For now you can best address it by simply prepending a string (like "seq") to the beginning of all of your sequence ids.

If your sequence ids are already strings then this means that the tables are either malformed, the ids contain hidden characters or spaces in certain files, or that truly no ids overlap between the files, which could be due to inputting the wrong abundance table and FASTA at the first step of the pipeline.

Why am I getting the error "SyntaxError: invalid syntax"?

The most likely reason you're getting this error is that you're running Python 2 rather than Python 3, which is required for PICRUSt2. One major difference in Python 3 compared to Python 2 is that print is now a function, which means that often syntax errors related to this incompatibility will refer to the print function. This would certainly be the first thing to check if you run into this error.

Why am I getting this error: "Stopping - no ASV ids overlap between input FASTA and sequence abundance table"?

When you get this error it means that the sequence ids in your input FASTA file of ASVs do not match the ASV ids in the input BIOM table.

This can be difficult to troubleshoot if you don't know that BIOM files can be converted to a simple text format, which can be done with this command:

biom convert -i BIOM -o BIOM.tsv --to-tsv

You can then compare the ids in the first column of this table to the headerlines in the input FASTA file to see how they differ. One common reason for them to differ is that there may be additional info besides the sequence id on each FASTA headerline. See this section of the FAQ for how to parse just the beginning of each line in a FASTA, which can usually fix this problem.

Why am I getting the error "Stopping - all X input sequences aligned poorly to reference sequences"?

This occurs when your inputs sequences align really poorly to the reference sequences. The most common issue is that there are multiple fields in the FASTA header line, which causes problem with the programs that PICRUSt2 runs for this alignment step. See this quick fix for this problem.

If that's not the problem then you should make sure your input sequences are on the positive strand and correspond to the appropriate amplicon region.

Pathway-level inference

How are the MetaCyc pathway abundances calculated?

Pathway abundances are calculated as the harmonic mean of the key reaction abundances in a sample (the distinction of "key" reactions is for structured pathways like MetaCyc, which have both optional and key reactions). There are some other subtleties as well, such as gap filling by default (setting the reaction/gene with the lowest abundance to be that of the second-lowest abundance), and including optional reactions to some degree as well. Specifically, optional reactions with abundance > harmonic mean based on key reactions alone are included in the calculation. Last, there can be nested pathways/sub-pathways in the structures, which are computed recursively (i.e., the abundance of a sub-pathway is computed and then its abundance is treated just like that of any other reaction). This overall approach was used to keep the output as similar as possible to HUMAnN2.

The MetaCyc reaction abundances are based on the predicted EC number abundances (the mapping file ec_level4_to_metacyc_rxn.tsv, which is in picrust2/default_files/pathway_mapfiles, shows how they are linked). Note that mainly the EC numbers and MetaCyc reactions are 1:1 links, but not all.

Because the MetaCyc reactions are based on regrouping the predicted EC numbers the resulting pathway predictions also partially correspond to the original read counts. To understand this, it's important understand how the EC gene family abundances are calculated per ASV: by default, the read depth per ASV is divided by the predicted number of 16S rRNA genes and then multiplied by the predicted copy number of each EC number.

How can I determine KEGG pathway abundances from the predicted KO abundances?

KEGG is a closed source database of gene families and pathways, which means that accessing the latest mappings of gene families to pathways is not possible for many users. The Integrated Microbial Genomes database has distributed KEGG ortholog information for genomes in their database, but not mappings from gene families to pathways. Since researchers are not consistently using the most recent mappings of gene families to pathways this can cause a lot of confusion, which is why the focus of pathway-level predictions has shifted to MetaCyc pathways in PICRUSt2

Nonetheless, you can infer KEGG pathways levels by using the last publicly available mapping of KEGG orthologs to pathways (which is from around 2011) with this command:

pathway_pipeline.py -i KO_metagenome_out/pred_metagenome_contrib.tsv.gz -o KEGG_pathways_out --no_regroup --map picrust2/picrust2/default_files/pathway_mapfiles/KEGG_pathways_to_KO.tsv

This command determines the presence and abundance of pathways using the same procedure as for the default MetaCyc pathways (except that the mapping of gene families to pathways in unstructured in contrast to the structured mapping of MetaCyc RXNs to pathways). Note that the --no_regroup option is necessary, because by default gene families are regrouped to the features specified in the regroup mapfile (which regroups EC numbers to MetaCyc RXNs by default).

How are the abundances of unstructured pathways (e.g., the KEGG Pathways) computed?

Unstructured (e.g., lacking a distinction between key and optional reactions) pathway abundances are calculated based on sorting the abundance of all reactions/genes in that pathway, and then taking the arithmetic mean of the abundances of reactions/genes in the upper second half of that sorted list. This approach was also taken in order to keep the output as close as possible to HUMAnN2.

Note: Although KEGG Pathways do have structured definitions, the open-source version provided with PICRUSt2 is unstructured.

How can I run categorize_by_function.py like in PICRUSt1?

The categorize_by_function.py script collapsed all gene families to any pathways that they could potentially be involved in. In other words, all pathways with at least one required gene family present were assumed to be present and gene families contributed equally to all pathways they are within. There are many ways to infer pathway levels from gene family abundances, but I personally dislike this approach since I believe it results in a large proportion of false positive pathways being reported. For this reason, a more stringent pathway prediction pipeline has been implemented in PICRUSt2 (identical to the approach used in HUMAnN2). In addition, categorize_by_function.py script will not work with PICRUSt2 outputs directly. See How can I determine KEGG pathway abundances from the predicted KO abundances? for a different method for inferring KEGG pathways in PICRUSt2.

However, if you are determined to get this output and are comfortable working in R you can get the identical output to what categorize_by_function.py would output by using an R function shared below. Please note that this code is not part of the official PICRUSt2 project and is simply shared here in the hopes that it will be helpful. You can download the R code here: https://www.dropbox.com/s/91pohevw4ayxtn2/picrust1_categorize_by_func.R?dl=1 and the legacy table of mappings from KOs to BRITE hierarchy here (which is a required input file): https://www.dropbox.com/s/a5o4li0irsqupt3/picrust1_KO_BRITE_map.tsv?dl=1.

Note that if you use this function you could run a sanity check with a file you run through PICRUSt1 to make sure you are getting the expected output (an example of how to do this is at the bottom of the Rscript).

Why do the stratified pathway abundances not sum to the community-wide abundance when the --per_sequence_contrib option is set?

When the "--per_sequence_contrib" option is used that means that pathway abundances and coverages are calculated for each predicted genome individually. If this option is not set then the contribution of each predicted genome to the community-wide pathway abundances is reported.

It is a little confusing to distinguish these two abundances, so here is an example.

Let’s say pathway X is made up of 3 reactions (A, B, and C).

If genome 1 has 40 copies of reaction A, but no copies of B and C then the pathway won't be called as present within that particular genome. However, if genome 2 has 30 copies of B and C each then pathway X might be called as present in that genome and at an abundance of 30.

To calculate the “community-wide” pathway abundances the individual genomes aren’t considered, only the total gene family abundances overall. If there are only genomes 1 and 2 in the community then the abundances would be 40, 30, and 30 for reactions A, B, and C respectively. The harmonic mean of these 3 values is higher than 30, which is what you would get by summing the pathway abundances within each genome alone. This simple example demonstrates how the community-wide and summed per-genome pathway abundance can differ.

What is pathway coverage?

Pathway coverage is an alternative approach to assess how likely a pathway is present (ranging from 0 to 1). It is included in the output to be consistent with how HUMAnN2 outputs pathway abundances and coverages. However, this metric is not commonly used and will typically only be used by advanced users. The coverage is based on the harmonic mean of confidence scores for each reaction in a pathway. The scores will differ depending on what the median reaction abundance is for the group being assessed (i.e. different reaction scores will be inferred if the reaction is within one predicted genome compared to across an entire sample). Since the median reaction abundance will be higher across an entire sample it’s harder to be confident in rare reactions within a sample based on this approach.

Why aren't all pathways in the abundance output file in the coverage output file?

Not all pathways identified as present based on abundance will have coverage > 0. This is due to the different approaches used to calculate these two measures. Please see the What is pathway coverage? question.

Why are certain MetaCyc pathways missing from the reference file?

Many MetaCyc pathways have only been identified in eukaryotes and so are not present in the default PICRUSt2 pathway database. In addition, pathways based on only three or fewer reactions was excluded (in line with the same files used by HUMAnN2).

What does genome_function_count mean for the stratified pathway output table?

In the default usage, the column "genome_function_count" is simply the contribution of the taxon to the community-wide pathway abundance (which is in "taxon_function_abun") divided by that taxon's abundance (which is in taxon_abun).

For the default usage the values in "taxon_function_abun" are calculated by determining what proportion of the overall pathway abundance is contributed by each individual taxon (which is based on the total number of genes encoded by that taxon that are involved in that pathway).

What is the difference between the default and --per_sequence_contrib pathway output?

The default stratified pathway output gives the contribution of the ASVs to the overall community pathway abundance, which essentially assumes that all genes across the ASVs are shared (as though there were universal cross-feeding). This is done because it's the standard way to analyze pathways from shotgun metagenomics data.

This means that under the default usage an amplicon sequence variant (ASV) could be a large contributor to the community-wide pathway abundance because it encodes multiple copies of genes in the pathway and is at high relative abundance. However, this does not mean that the ASV encodes a sufficient number of genes to infer that pathway present in that ASV alone.

The --per_sequence_contrib option is needed to determine that, because it instead predicts the pathway abundance for each ASV indivdiually based solely on the genes found within each predicted genome (i.e. assuming no cross-feeding at all).

This means you might get different sets of ASVs called as positive for different pathways based on default stratified output and the --per_sequence_contrib approach.

Why does the MetaCyc pathway mapfile to reactions look so complicated?

Rather than just listing all the reactions in each pathway, the mapfile looks more complicated. Here are two example lines:

PWY-6572           ( -CHONDRO-4-SULFATASE-RXN , ( RXN-11572 ( RXN-12177 , RXN-11615 , -CHONDRO-6-SULFATASE-RXN ) ) )  RXN-12178 4.2.2.5-RXN 4.2.2.21-RXN
PWY-7373           (  TRANS-PENTAPRENYLTRANSFERASE-RXN  +  RXN-12345 ( RXN-15264 , RXN-12346 ) RXN-10620 -RXN-10622  )  -RXN-14909 

This file was taken from the HUMAnN2 project, which prepared these files to allow for structure in pathway reactions. As they describe in the manual:

The structure follows the same definition as that for Kegg modules. Each structure is a list (space-delimited) of items with a comma to indicate alternatives, a plus to indicate a complex, and a minus sign at the beginning of an item to indicate this is not a key item in the pathway.

QIIME 2

Is PICRUSt2 developed as part of QIIME 2?

QIIME 2 is a framework for running many different bioinformatics tools. PICRUSt2 is an independent tool that is not developed by the QIIME 2 developers. However, independent developers can create "plugins" so that their tools can be used in the QIIME 2 framework. A plugin for PICRUSt2 is available if you would like to work within the QIIME 2 framework.

Does the q2-picrust2 plugin have all the standalone functionalities?

Not all functionality is present in the PICRUSt2 plugin for QIIME 2. Currently the standalone version needs to be run for more advanced analyses such as custom input tables and stratified output.

How do I install and run the PICRUSt2 QIIME 2 plugin?

Follow the instructions here to install q2-picrust2. A tutorial on this QIIME 2 plugin can be found here.

Clone this wiki locally