Key Limitations

Gavin Douglas edited this page Mar 23, 2020 · 8 revisions

There are several limitations to keep in mind when analyzing PICRUSt2 output, which are mainly related to predictions being limited to the gene contents of existing reference genomes. The best way to assess how these limitations affect interpretations of your dataset is to compare the predictions with functional profiles from metagenomes on a subset of samples.

  • Differential abundance results will likely differ substantially from what would be found based on shotgun metagenomics data. Although amplicon-based predictions may be highly correlated with functional profiles based on shotgun metagenomics sequencing data, the actual functions identified as significantly different can substantially differ. Check out this post for more details.

  • Predictions are limited by how well your study sequences can be placed in the reference tree. By default EPA-NG is used to place study sequences, which comes with several caveats. Most importantly, the placements may not be entirely reproducible, because they can depend on which other sequences are placed at the same time. In practice the resulting predictions per sample tend to be highly similar, but there can be important differences, especially when interpretations are based on a single ASV or function. Making the placement step fully reproducible with different sets of input ASVs is something we are currently trying to address in the beta version of PICRUSt2.

  • The accuracy on any given sample type will depend heavily on the availability of appropriate reference genomes. You can partially assess this problem by computing the per-ASV and sample-weighted nearest-sequenced taxon index (NSTI) values, which will give you a rough idea of how well-represented your ASVs are by the reference database (see tutorial). However, 16S rRNA gene sequences do not typically enable resolution of strain variation within a species. Strains of prokaryotic species can vary in gene content to remarkable degrees and horizontal gene transfer can frequently occur between distantly related taxa, so the predictions should always be taken with a grain of salt.

  • A related issue is that certain environments are better represented by reference genomes than others. For instance, PICRUSt2 is expected to perform better on 16S sequences from the human gut than on those from the cow rumen, even if the actual 16S sequences themselves are very similar. The reason for this is that many important rumen-specific enzymes will be missing from the default reference genomes. One potential solution to this problem is to create a custom reference database of genomes specific to your environment of interest for making predictions. It is worth noting that our validations on non-human-associated environments indicate that the overall predictions perform better than random, but nonetheless we expect that many niche-specific functions will be poorly represented.

  • By default, input sequences with NSTI values above 2 will be excluded from the analysis. This could affect some samples much more than others, so it should be evaluated: for example, you can determine what proportion of the community relative abundance was excluded per sample (typically this proportion is very small).

  • PICRUSt2 can only predict genes that are in the input function tables (which correspond to KEGG orthologs and Enzyme Commission numbers by default). Although these gene families are useful, they typically represent a small proportion of the total genetic variation within metagenomes and can be mis-annotated.

  • The mapping of PICRUSt2 predictions to high-level functions like pathways is entirely dependent on the mapping file used. Therefore, any gaps or inaccuracies in pathway annotation or assignments of gene function will still be present. As an example, many KEGG orthologs are listed as participating in pathways related to "Human Diseases". In many cases this is simply due to bacteria containing (distant) homologs of enzymes with important roles in, for example, mammalian pathways. Therefore, it is worth carefully checking KEGG pathway annotations to ensure that they are reasonable for your system and to filter out any irrelevant pathways.

  • Predictions are limited to the portion of the full metagenome contributed by the organisms targeted by your primers. If the primers are biased away from certain taxa, then naturally the functions contributed by these taxa will be less represented in the predicted metagenome.
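
The per-sample check on the NSTI cut-off described above can be sketched in a few lines of Python. This is a minimal illustration, not PICRUSt2 code: the function names, the nested-dictionary input layout, and the example counts are all hypothetical, and the abundance-weighted mean is only our assumption of how a sample-weighted NSTI is defined. Only the default cut-off of 2 comes from the text above. In practice you would first load your ASV table and the per-ASV NSTI values from the PICRUSt2 output into these structures.

```python
def excluded_fraction_per_sample(abundance, nsti, max_nsti=2.0):
    """Fraction of each sample's total abundance carried by ASVs whose
    NSTI exceeds max_nsti (and which would therefore be excluded).

    abundance: {sample_id: {asv_id: count}}  (hypothetical layout)
    nsti:      {asv_id: nsti_value}; a missing ASV raises KeyError
    """
    fractions = {}
    for sample, counts in abundance.items():
        total = sum(counts.values())
        if total == 0:
            # Empty sample: nothing can be excluded.
            fractions[sample] = 0.0
            continue
        excluded = sum(c for asv, c in counts.items() if nsti[asv] > max_nsti)
        fractions[sample] = excluded / total
    return fractions


def weighted_nsti_per_sample(abundance, nsti):
    """Abundance-weighted mean NSTI per sample (assumed definition)."""
    weighted = {}
    for sample, counts in abundance.items():
        total = sum(counts.values())
        if total == 0:
            weighted[sample] = float("nan")
            continue
        weighted[sample] = sum(c * nsti[asv] for asv, c in counts.items()) / total
    return weighted


# Made-up example data, for illustration only.
example_abundance = {
    "sampleA": {"ASV1": 90, "ASV2": 10},
    "sampleB": {"ASV1": 50, "ASV2": 0, "ASV3": 50},
}
example_nsti = {"ASV1": 0.05, "ASV2": 2.5, "ASV3": 0.4}

if __name__ == "__main__":
    print(excluded_fraction_per_sample(example_abundance, example_nsti))
    print(weighted_nsti_per_sample(example_abundance, example_nsti))
```

Samples where the excluded fraction is much larger than in the rest of the dataset are the ones whose predictions deserve the most caution.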