RDPutils was originally written to provide means to construct phyloseq objects from the output of RDP's web-based tools for clustering and classifying DNA sequences from high-throughput amplicon sequencing projects. It has since been expanded to handle output from RDP's command line tools as well as USEARCH and iTagger (JGI) output.
Phyloseq is an R/Bioconductor package that includes a variety of wrappers for quick exploratory analysis of sequencing data, but perhaps its most convenient feature is that it enables rapid and flexible subsetting of data from a large experiment. A phyloseq object has slots for an OTU table, a classification table, a sample data table, a tree of the sequences representing each OTU, and the representative sequences themselves. I recommend phyloseq because it organizes all data for an experiment, and R because it provides a flexible means of analyzing those data. The functions in this package reformat RDP, USEARCH, and iTagger outputs so that they can be used to fill phyloseq slots.
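As a rough illustration of the slots described above, the sketch below builds a small phyloseq object by hand and then subsets it. It assumes the phyloseq package is installed; the OTU names, sample names, counts, and the `Treatment` column are all made up for the example and do not come from RDPutils.

```r
library(phyloseq)

# Toy OTU table: 3 OTUs (rows) x 4 samples (columns). All counts are invented.
otu_mat <- matrix(
  c(10, 0, 3, 7,
     2, 5, 0, 1,
     0, 8, 4, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OTU_1", "OTU_2", "OTU_3"),
                  c("S1", "S2", "S3", "S4"))
)

# Toy classification table: one row per OTU, one column per rank.
tax_mat <- matrix(
  c("Bacteria", "Proteobacteria",
    "Bacteria", "Firmicutes",
    "Archaea",  "Euryarchaeota"),
  nrow = 3, byrow = TRUE,
  dimnames = list(rownames(otu_mat), c("Domain", "Phylum"))
)

# Toy sample data table: row names must match the sample names above.
sam_df <- data.frame(
  Treatment = c("control", "control", "treated", "treated"),
  row.names = colnames(otu_mat)
)

# Assemble the three tables into one phyloseq object.
ps <- phyloseq(
  otu_table(otu_mat, taxa_are_rows = TRUE),
  tax_table(tax_mat),
  sample_data(sam_df)
)

# Subset by sample metadata, then drop OTUs with no reads in the subset.
treated <- subset_samples(ps, Treatment == "treated")
treated <- prune_taxa(taxa_sums(treated) > 0, treated)

# Subset by classification.
bacteria <- subset_taxa(ps, Domain == "Bacteria")
```

The tree and representative-sequence slots are left empty here; they can be supplied as additional arguments to `phyloseq()` when available.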
The Ribosomal Database Project (RDP) provides both web-based and command line tools (RDPTools) for processing rRNA gene sequences from Bacteria, Archaea, and Fungi, as well as functional genes. The web-based tools and tutorials for using them are available at http://rdp.cme.msu.edu/index.jsp, and the command line tools at https://github.com/rdpstaff/RDPTools. Processing can take either of two approaches. In the "supervised" approach, sequences from multiple samples are binned by classifying them against a database for Bacteria, Archaea, or Fungi, or against a user's own database. Further processing has traditionally been done in a spreadsheet program, but that is no longer necessary. The function hier2phyloseq in this package imports RDP classifier results in hierarchy format into a phyloseq object with OTU and taxonomy tables. In the "unsupervised" approach, sequences are clustered into OTUs based on their degree of similarity. RDP provides additional tools to parse the cluster files into OTU tables that can be imported into R, and to retrieve representative sequences for each cluster. OTU tables can also be parsed from the cluster file with a function in this package.
The key to filling the classification table and tree slots is to first rename the representative sequences to correspond to the OTU names; this package includes functions to do this. Once renamed, the representative sequences can be classified and used to build a tree, and the results used to fill the phyloseq classification table and tree slots. For either approach, supervised or unsupervised, a sample data table is most easily constructed in a spreadsheet program. A vignette in the package gives example workflows for constructing phyloseq objects for both supervised and unsupervised methods.
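The renaming step can be pictured with a small base-R sketch. This is not the package's implementation: the helper `rename_fasta_headers`, the `OTU_1`-style names, and the header layout are all hypothetical, chosen only to show why the sequence IDs must end up matching the OTU names used in the OTU table.

```r
# Hypothetical helper: replace each FASTA header with its OTU name.
# `name_map` is a named character vector: names are the original sequence IDs,
# values are the OTU names used as row names in the OTU table.
rename_fasta_headers <- function(fasta_lines, name_map) {
  is_header <- startsWith(fasta_lines, ">")
  # The sequence ID is taken to be the first whitespace-delimited token.
  old_ids <- sub("^>([^[:space:]]+).*$", "\\1", fasta_lines[is_header])
  missing <- setdiff(old_ids, names(name_map))
  if (length(missing) > 0) {
    stop("No OTU name for: ", paste(missing, collapse = ", "))
  }
  fasta_lines[is_header] <- paste0(">", unname(name_map[old_ids]))
  fasta_lines
}

# Invented representative sequences and an invented ID-to-OTU mapping.
fasta <- c(">seqA cluster=0 size=12", "ACGTACGT",
           ">seqB cluster=1 size=5",  "TTGACCA")
name_map <- c(seqA = "OTU_1", seqB = "OTU_2")

renamed <- rename_fasta_headers(fasta, name_map)
```

Once the headers match the OTU names, classifier output and tree tip labels generated from these sequences line up with the OTU table rows without any further bookkeeping.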
As of May 2014, the RDP command line tools include options to output a biom file containing the OTU table, classification, and sample data, and to rename the representative sequences to correspond to the OTU names. The biom file can be imported directly into phyloseq, making the unsupervised workflow described in the RDPutils vignette no longer necessary. However, if the renamed representative sequences themselves are to be included in the phyloseq object, the functions trim_fasta_names and unalign_fasta should be applied before importing them into phyloseq.
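To give a sense of what this clean-up involves, here is a minimal base-R sketch. It assumes that "trimming names" means keeping only the first token of each header and that "unaligning" means removing alignment gap characters (`-` and `.`); the helper `strip_alignment` is hypothetical and is not the code behind trim_fasta_names or unalign_fasta, whose actual arguments should be checked in the package documentation.

```r
# Hypothetical helper: trim FASTA headers to their first token and remove
# alignment gap characters from the sequence lines.
strip_alignment <- function(fasta_lines) {
  is_header <- startsWith(fasta_lines, ">")
  # Keep ">" plus the first whitespace-delimited token of each header.
  fasta_lines[is_header] <- sub("^(>[^[:space:]]+).*$", "\\1",
                                fasta_lines[is_header])
  # Drop "-" and "." gap characters from sequence lines only.
  fasta_lines[!is_header] <- gsub("[-.]", "", fasta_lines[!is_header])
  fasta_lines
}

# Invented aligned representative sequences.
aligned <- c(">OTU_1 size=12", "AC--GT..AC",
             ">OTU_2 size=5",  "..TTG-ACCA")

cleaned <- strip_alignment(aligned)
```

Sequences cleaned this way could then be written to a file and supplied through the `refseqfilename` argument of phyloseq's `import_biom`, alongside the biom file.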
RDPutils version 1.3.0 includes functions to import USEARCH-generated biom files, OTU tables with or without taxonomy, and taxonomy files generated with utax and sintax. The import_itagger_otutab_taxa function converts the tab-delimited iTagger otu.tax.tsv file into a phyloseq object with otu_table and tax_table. A second vignette in the package demonstrates all of these capabilities. Version 1.3.1 fixed a bug related to OTU name format. Version 1.3.2 added the function make_framebot_tax_table. Version 1.4.0 rewrote the vignettes using rmarkdown.
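For orientation, the sketch below reads a small tab-delimited OTU table (without taxonomy) into a count matrix using base R only. The layout, with a `#OTU ID` header column followed by one column per sample, is assumed to resemble a USEARCH-style OTU table; the data are invented, and the package's own import functions should be preferred for real files.

```r
# Invented tab-delimited OTU table: first column holds OTU names,
# remaining columns hold per-sample read counts.
txt <- paste(
  "#OTU ID\tS1\tS2\tS3",
  "Otu1\t12\t0\t4",
  "Otu2\t0\t7\t1",
  sep = "\n"
)

# comment.char = "" keeps the header line that starts with "#";
# check.names = FALSE leaves the sample names unmodified.
otu_df <- read.delim(text = txt, row.names = 1,
                     check.names = FALSE, comment.char = "")

# A numeric matrix with OTUs as rows is the shape that phyloseq's
# otu_table(..., taxa_are_rows = TRUE) expects.
otu_mat <- as.matrix(otu_df)
```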