
phyloseq class structure (developer)

joey711 edited this page Apr 24, 2012 · 4 revisions

Design Decisions:

Current Design

The main data class used by functions in the phyloseq package is a custom class, also called "phyloseq" (see below). The phyloseq-class is intended to represent all the useful features of a phylogenetic sequencing project as one coherent object. The same class works equally well for subsets of the experimental data and for trimmed, smoothed, de-noised, or otherwise transformed versions. The phyloseq package also provides a number of commonly needed methods for subsetting and manipulating instances of the phyloseq-class, which we hope makes the tedious (but important!) process of trimming, checking, and normalizing this type of data easier, less error-prone, and more reproducible.

End users should also be able to expect that any user-accessible function operating on phylogenetic sequencing data takes an instance of the phyloseq-class as its main data argument. Since some data components are sometimes not needed or not available, the user gets an error if a needed component is missing/empty in the provided data object. The tools for checking, and optionally throwing an error, are built into the accessors, so developers of additional tools for phyloseq need not (re)build this into their functions.

Why approach the problem this way? Processing raw phylogenetic sequence data usually results in multiple related summary data types that share indices in some way. The phyloseq package enforces that the set and order of these indices (say, species names / sample names) match across components. This removes several potential sources of error during data manipulation and data sharing. It also consolidates (much of) the burden of validity-checking component data in the constructor method for the phyloseq-class, so most analysis functions should not need to perform "sanity checks" on their input data: it can be assumed that this was done at instantiation.
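As an illustration of what matching the set and order of shared indices can look like, here is a minimal base-R sketch. The helper name `sketch_harmonize` and the choice to prune to the shared sample names (rather than stop with an error) are assumptions made for this example, not the package's actual constructor logic.

```r
# Hypothetical helper: keep only the sample names shared by both components,
# and put both components in the same sample order.
sketch_harmonize <- function(counts, sample_info) {
  shared <- intersect(colnames(counts), rownames(sample_info))
  if (length(shared) == 0) {
    stop("No sample names in common between components.")
  }
  list(
    counts = counts[, shared, drop = FALSE],
    sample_info = sample_info[shared, , drop = FALSE]
  )
}

# Toy data: taxa as rows, samples as columns; sample table in a different order.
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("taxonA", "taxonB"), c("s1", "s2", "s3")))
sample_info <- data.frame(site = c("gut", "skin", "soil"),
                          row.names = c("s3", "s1", "s4"))

result <- sketch_harmonize(counts, sample_info)
```

After this step, code downstream can index both components by position or by name without re-checking that they agree.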

Implementation:

phyloseq Class Diagram

The class structure in the phyloseq package follows the inheritance diagram shown above. Core data classes are shown with grey fill and rounded corners. The class name and its slots are shown with red- or blue-shaded text, respectively. Inheritance is indicated graphically by arrows. Lines without arrows indicate that a class contains a slot with the associated data class as one of its components, in this case the phyloseq-class.

The phyloseq package defines a single experiment-level class intended to describe all the useful features of a phylogenetic sequencing project as one coherent object. Currently, this "master" class (?"phyloseq-class") stores four core data classes, which are themselves extended from other R classes: the taxonomic abundance/contingency table (otuTable), a table of sample data (sampleData), a table of taxonomic descriptors (taxonomyTable), and a phylogenetic tree (phylo, from the ape package). In practice, the slots of the phyloseq-class are typed as virtual classes, each the trivial union of a component class with the NULL class, so that a slot can be empty (NULL) if that data type is not available for an experiment. The main accessor functions throw an error by default if the data type they attempt to access from a phyloseq object turns out to be NULL. Functions that operate on a phyloseq object can therefore simply assume that the requisite data type is present and use the accessors, which will in turn throw a meaningful error if the expected data is missing (NULL). In normal practice, phyloseq objects (that is, instances of the phyloseq-class) are built using the provided importers, or "manually" using the constructor, phyloseq(), with component data as arguments.
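The NULL-union pattern described here can be sketched with the base `methods` package alone. All class names below (`SketchOtuTable`, `SketchExperiment`, and so on) and the accessor `sketch_access` are hypothetical stand-ins chosen for this example; they are not the phyloseq definitions.

```r
library(methods)

# Hypothetical stand-ins for two component classes.
setClass("SketchOtuTable", contains = "matrix")
setClass("SketchSampleData", contains = "data.frame")

# Virtual unions with NULL, so that a slot may legitimately be empty.
setClassUnion("SketchOtuTableOrNULL", c("SketchOtuTable", "NULL"))
setClassUnion("SketchSampleDataOrNULL", c("SketchSampleData", "NULL"))

# Experiment-level class whose slots are typed by the unions.
setClass("SketchExperiment",
  representation(otu = "SketchOtuTableOrNULL", sam = "SketchSampleDataOrNULL"),
  prototype(otu = NULL, sam = NULL))

# Accessor that, by default, stops with an error if the component is missing.
sketch_access <- function(x, slot_name, errorIfNULL = TRUE) {
  component <- slot(x, slot_name)
  if (is.null(component) && errorIfNULL) {
    stop(slot_name, " slot is empty.")
  }
  component
}

counts <- new("SketchOtuTable",
              matrix(1:6, nrow = 3,
                     dimnames = list(paste0("taxon", 1:3), c("s1", "s2"))))
experiment <- new("SketchExperiment", otu = counts)

# The abundance table is present; the sample data slot is NULL.
otu_part <- sketch_access(experiment, "otu")
sam_part <- sketch_access(experiment, "sam", errorIfNULL = FALSE)
```

Calling `sketch_access(experiment, "sam")` with the default `errorIfNULL = TRUE` stops with an error, which is the behavior an analysis function would rely on instead of checking the slot itself.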

Component data classes

The otuTable class can be considered the central data type, as it directly represents the number and type of sequences observed in each sample. otuTable extends the numeric matrix class in base R, and has an additional slot, speciesAreRows, that keeps track of the orientation of the matrix, as there is disagreement among R packages as to whether taxa (or genes) and samples should index the rows or the columns of a matrix (e.g. taxa as rows in the genefilter package in Bioconductor; taxa as columns in the vegan and picante packages). In phyloseq methods, as well as its extensions of methods in other packages, the speciesAreRows value is checked to ensure proper orientation of the otuTable. A phyloseq user is only required to specify the otuTable orientation during initialization, after which all handling is internal.
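A matrix subclass that carries an orientation flag might look roughly like the following. The slot name `speciesAreRows` comes from the text above; the class name `SketchOrientedTable` and the helper `sketch_taxa_as_rows` are assumptions for this sketch only.

```r
library(methods)

# Hypothetical matrix subclass with an orientation flag.
setClass("SketchOrientedTable",
  contains = "matrix",
  representation(speciesAreRows = "logical"))

# Return the counts with taxa as rows, whatever the stored orientation.
sketch_taxa_as_rows <- function(x) {
  m <- as(x, "matrix")
  if (x@speciesAreRows) m else t(m)
}

# The same toy data stored in both orientations.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("s1", "s2"), c("taxonA", "taxonB", "taxonC")))
by_cols <- new("SketchOrientedTable", m, speciesAreRows = FALSE)
by_rows <- new("SketchOrientedTable", t(m), speciesAreRows = TRUE)

from_cols <- sketch_taxa_as_rows(by_cols)
from_rows <- sketch_taxa_as_rows(by_rows)
```

Both calls yield a matrix with taxa as rows, so a method that wraps, say, a taxa-as-rows function from another package only needs to go through one helper of this kind.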

The sampleData class directly inherits R’s data.frame class, and thus effectively stores both categorical and numerical data about each sample. The orientation of a data.frame in this context requires that samples/trials are rows, and variables are columns (consistent with vegan and other packages). The taxonomyTable class directly inherits the matrix class, and is oriented such that rows are taxa (e.g. species) and columns are taxonomic levels (e.g. Phylum).

The phyloseq-class can be considered an “experiment-level class” and should contain two or more of the previously-described core data classes. We assume that phyloseq users will be interested in analyses that utilize their abundance counts derived from the phylogenetic sequencing data, and so the phyloseq() constructor will stop with an error if the arguments do not include an otuTable. There are a number of common methods that require either an otuTable and sampleData combination, or an otuTable and phylogenetic tree combination. These methods can operate on instances of the phyloseq-class, and will stop with an error if the required component data is missing.
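The requirement that the constructor receive an abundance table can be illustrated with a small base-R sketch. The function `sketch_build` and the classes `SketchCounts` and `SketchSamples` are hypothetical; the real phyloseq() constructor does more than this (for example, building an actual phyloseq object), and only the "stop if no abundance table" behavior is taken from the text.

```r
library(methods)

# Hypothetical component classes for this example.
setClass("SketchCounts", contains = "matrix")
setClass("SketchSamples", contains = "data.frame")

# Accept components in any order; stop unless one is an abundance table.
sketch_build <- function(...) {
  parts <- list(...)
  is_counts <- vapply(parts, function(p) is(p, "SketchCounts"), logical(1))
  if (!any(is_counts)) {
    stop("Arguments must include an abundance table (SketchCounts).")
  }
  # Key each component by its class so later code can look it up by type.
  names(parts) <- vapply(parts, function(p) class(p)[1], character(1))
  parts
}

counts <- new("SketchCounts", matrix(1:4, nrow = 2))
samples <- new("SketchSamples", data.frame(site = c("a", "b")))

built <- sketch_build(samples, counts)
```

Calling `sketch_build(samples)` alone stops with an error, mirroring the rule that abundance data must always be part of the experiment-level object.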

Previous Design (no longer implemented)

We originally devised a more complex class structure in the earliest incarnations of phyloseq. Although proper under various definitions, it required more complicated infrastructure and a whole nomenclature of intermediate class names, one for each combination of data types available from a typical phylogenetic sequencing project. The costs of maintaining such a structure, especially as we consider additional data types for integration, seem to far outweigh the benefits of a marginally cleaner, DRYer, S4 OOP design. If you're interested, some of these very early considerations were described in an article for the Pacific Symposium on Biocomputing 2012 meeting (McMurdie and Holmes, PSB2012). However, I don't recommend reading it if you want to understand how phyloseq is designed now, as much (or most?) has changed since then.