# 1 Introducing 16S Microbiome Primary Analysis
Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)

This document introduces a Standard Operating Procedure (SOP) that covers primary analysis of single-end, three-read, Golay-barcoded microbiome 16S sequencing data. 

<a name = "table-of-contents"></a>

## Table of Contents

* [Background](#background)
* [Basic Approach](#basic-approach)
* [Required Inputs](#required-inputs)
* [Standard Outputs](#standard-outputs)
* [Required Time and Resources](#required-time-and-resources)

Related Notebooks:
* 2 Setting Up Starcluster for QIIME
* 3 Validation, Demultiplexing, and Quality Control
* 4 OTU Picking and Rarefaction Depth Selection
* 5 Analyzing Core Diversity

<a name = "background"></a>

## Background

16S rRNA is the RNA component of the small subunit (SSU) of the prokaryotic ribosome.  The gene that encodes it is largely conserved across bacteria and archaea, which provides conserved primer sites for amplification by polymerase chain reaction (PCR), but it also contains hypervariable regions that can be used to separate different microbial “species” (more properly known as Operational Taxonomic Units, or OTUs) and to build phylogenetic trees.  Usually, samples are taken from mixed microbial communities and a portion of the 16S gene is amplified by PCR.  Note that eukaryotic DNA is not widely amplified by primers to this gene, because the eukaryotic SSU rRNA is 18S rather than 16S.  

Amplicons are labeled with sample-specific indexes and sequenced in a highly multiplexed next-generation sequencing run.  The Earth Microbiome Project standard 16S rRNA Amplification Protocol (see http://www.earthmicrobiome.org/emp-standard-protocols/16s/ ) recommends the use of a standard set of error-correcting Golay barcodes as the indexes.  In this case, a common sequencing approach produces three reads (read 1, read 2, and a barcode read, similar to the Illumina TruSeq approach) for each sequenced amplicon, with read lengths in the 100-150 base range, as shown in this image from http://tucf-genomics.tufts.edu:  
<img src = "images/faq03_pic01.png" />

The resulting sequence data, once analyzed, provides information on the identity and abundance of the microbes in each sample.
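
To make the role of the barcode read concrete, the sketch below assigns a read to a sample by comparing its barcode read against the known sample barcodes. It is a simplified illustration only: it uses a plain Hamming-distance lookup, not true Golay decoding, and the barcode sequences, sample names, and the `max_mismatches` parameter are all hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(barcode_read, barcode_to_sample, max_mismatches=1):
    """Return the sample whose barcode is uniquely closest to the read.

    Returns None when there are no barcodes, when no barcode is within
    max_mismatches, or when two barcodes tie for closest (ambiguous read).
    """
    scored = sorted(
        (hamming(barcode_read, barcode), sample)
        for barcode, sample in barcode_to_sample.items()
    )
    if not scored:
        return None
    best_distance, best_sample = scored[0]
    if best_distance > max_mismatches:
        return None
    if len(scored) > 1 and scored[1][0] == best_distance:
        return None
    return best_sample


# Hypothetical 12-base barcodes and sample names, for illustration only.
barcodes = {
    "AGCTGACTAGTC": "sample_A",
    "TCGATCGGACCA": "sample_B",
}

print(assign_sample("AGCTGACTAGTC", barcodes))  # exact match to sample_A
print(assign_sample("AGCTGACTAGTA", barcodes))  # one mismatch from sample_A
print(assign_sample("GGGGGGGGGGGG", barcodes))  # too distant from both
```

An error-correcting code such as Golay goes further than this lookup, because the barcode set is designed so that a limited number of sequencing errors can be corrected unambiguously.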

[Table of Contents](#table-of-contents)

<a name = "basic-approach"></a>

## Basic Approach

This SOP employs the QIIME (pronounced "chime") software developed by the Knight lab at UCSD to perform "primary analysis", the first stage of analysis of all 16S-based microbiome data.  The major steps of this work are (i) read demultiplexing (if necessary) and basic quality control (QC), (ii) OTU picking, and (iii) core diversity analysis. Step iii comprises taxa summarization, alpha diversity calculations, and beta diversity calculations.  

Alpha diversity (i.e., within-sample) calculations are performed for three commonly used metrics: number of observed OTUs, whole-tree phylogenetic diversity, and the chao1 metric.  The distributions of these metrics are visualized in boxplots for each selected category in the input data. Additionally, rarefaction curves are produced for each of these metrics for each category in the input data.
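
As a rough illustration of two of these metrics, the sketch below computes the number of observed OTUs and one common (bias-corrected) form of the chao1 estimate from a single sample's OTU counts, and rarefies the sample to an even depth by subsampling reads without replacement. The counts are invented, the function names are hypothetical, and this is not QIIME's implementation; whole-tree phylogenetic diversity is omitted because it also requires a phylogenetic tree.

```python
import random


def observed_otus(counts):
    """Number of OTUs with at least one read in the sample."""
    return sum(1 for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 richness estimate.

    S_chao1 = S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singleton and doubleton OTUs.
    """
    s_obs = observed_otus(counts)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return s_obs + singletons * (singletons - 1) / (2.0 * (doubletons + 1))


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement; return new OTU counts."""
    reads = [otu for otu, c in enumerate(counts) for _ in range(c)]
    if depth > len(reads):
        raise ValueError("depth exceeds the number of reads in the sample")
    picked = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for otu in picked:
        rarefied[otu] += 1
    return rarefied


# Invented OTU counts for one sample (one entry per OTU).
sample_counts = [10, 5, 2, 2, 1, 1, 1, 0]

print(observed_otus(sample_counts))  # 7 OTUs have at least one read
print(chao1(sample_counts))          # 7 + 3*2 / (2*3) = 8.0
print(observed_otus(rarefy(sample_counts, 10)))
```

Repeating the last line over a range of depths and averaging over several seeds yields the points of a rarefaction curve for the observed-OTUs metric.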

Beta diversity (i.e., between-sample) calculations are performed for two commonly used metrics: unweighted and weighted UniFrac.  The distributions of these two metrics are visualized in boxplots for each selected category in the input data.  Additionally, for each metric, a principal coordinates analysis (PCoA) is performed and the top three dimensions of the result are visualized as an interactive 3-D graph.
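
The idea behind unweighted UniFrac can be sketched on a toy tree: the distance between two samples is the fraction of observed branch length that leads to OTUs found in only one of the two samples. The tree, OTU names, and the flat "list of branches" representation below are assumptions made for illustration; real implementations work from a full phylogenetic tree and an OTU table.

```python
def unweighted_unifrac(branches, otus_a, otus_b):
    """Unweighted UniFrac distance between two samples.

    branches: list of (branch_length, set_of_tip_otus_below_branch)
    otus_a, otus_b: sets of OTUs observed in each sample
    """
    unique = 0.0  # branch length leading to only one of the samples
    total = 0.0   # branch length leading to either sample
    for length, tips in branches:
        in_a = bool(tips & otus_a)
        in_b = bool(tips & otus_b)
        if in_a or in_b:
            total += length
            if in_a != in_b:
                unique += length
    if total == 0:
        raise ValueError("neither sample contains any OTU in the tree")
    return unique / total


# Toy tree ((otu1:1, otu2:1):0.5, (otu3:1, otu4:1):0.5), one entry per branch.
toy_branches = [
    (1.0, {"otu1"}),
    (1.0, {"otu2"}),
    (0.5, {"otu1", "otu2"}),
    (1.0, {"otu3"}),
    (1.0, {"otu4"}),
    (0.5, {"otu3", "otu4"}),
]

sample_a = {"otu1", "otu2"}
sample_b = {"otu2", "otu3"}

print(unweighted_unifrac(toy_branches, sample_a, sample_b))  # 2.5 / 4.0 = 0.625
print(unweighted_unifrac(toy_branches, sample_a, sample_a))  # identical: 0.0
```

Weighted UniFrac additionally weights each branch by the difference in relative abundance between the two samples, so it is sensitive to changes in abundance as well as to presence or absence.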

Primary analysis excludes tasks such as investigating the effect of alternate sequence depths on outcomes, recalculating diversity metrics on researcher-selected subsets of the full data set, and investigation of hypotheses generated from the primary analysis deliverables.

The pipeline is executed by the validate_mapping_file.py, split_libraries_fastq.py, pick_open_reference_otus.py, and core_diversity_analyses.py workflow scripts of the QIIME open-source microbial ecology package. This software is built in Python 2.7.3 and deployed on Ubuntu Linux.  3-D graphs are visualized with Emperor, a browser-based open-source visualization tool. Analysis currently runs on Amazon Web Services using the Python-based StarCluster 0.95.6, an open-source cluster-computing toolkit, although future versions of the cluster are expected to transition to cfncluster for greater flexibility in dynamic cluster sizing.
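
The order in which these four scripts are chained can be sketched as follows. This is an outline under stated assumptions, not the SOP's actual invocation: all input file names and directory names are hypothetical, the option flags and intermediate output file names follow QIIME 1.9 conventions and should be confirmed against each script's `-h` output, and the commands are only assembled and printed, not run.

```python
def build_commands(reads_fp, barcodes_fp, mapping_fp, depth, categories,
                   out_dir="analysis"):
    """Return the four workflow commands as argument lists, in run order."""
    split_dir = out_dir + "/slout"  # hypothetical directory names
    otu_dir = out_dir + "/otus"
    return [
        # (0) Check the mapping file before any computation.
        ["validate_mapping_file.py", "-m", mapping_fp,
         "-o", out_dir + "/validated_map"],
        # (i) Demultiplex and quality-filter reads using the barcode read.
        ["split_libraries_fastq.py", "-i", reads_fp, "-b", barcodes_fp,
         "-m", mapping_fp, "-o", split_dir],
        # (ii) Pick OTUs from the demultiplexed sequences.
        ["pick_open_reference_otus.py", "-i", split_dir + "/seqs.fna",
         "-o", otu_dir],
        # (iii) Core diversity analysis at the chosen rarefaction depth.
        # Output names below are assumed QIIME 1.9 defaults; verify locally.
        ["core_diversity_analyses.py",
         "-i", otu_dir + "/otu_table_mc2_w_tax_no_pynast_failures.biom",
         "-m", mapping_fp,
         "-t", otu_dir + "/rep_set.tre",
         "-e", str(depth),
         "-c", ",".join(categories),
         "-o", out_dir + "/cdout"],
    ]


# Hypothetical inputs, for illustration only.
commands = build_commands(
    reads_fp="reads.fastq",
    barcodes_fp="barcodes.fastq",
    mapping_fp="map.txt",
    depth=1000,
    categories=["Treatment", "Site"],
)

for command in commands:
    print(" ".join(command))
```

Building the commands as argument lists keeps them easy to inspect and log, and each list could later be handed to `subprocess.check_call` once the paths and options have been verified.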

[Table of Contents](#table-of-contents)

<a name = "required-inputs"></a>

## Required Inputs

Researchers requesting primary analysis should provide the following:  

1.	fastq file(s) of the 16S reads
2.	A tab-separated mapping file (the format QIIME expects) with one row for each sample, containing the following (see http://qiime.org/documentation/file_formats.html , quoted below, for more details):
    * Sample name
    * Barcode sequence used for the sample
    * Linker/primer sequence used to amplify the sample
    * Sample description
    * “Any metadata that relates to the samples (for instance, health status or sampling site)”
    * “Any additional information relating to specific samples that may be useful to have at hand when considering outliers (for example, what medications a patient was taking at time of sampling)”
3.	Experimental design information including
    * Sequencing instrument used 
    * Expected read length
    * Expected read depth
    * Read type (single- or paired-end)
        * **Integrated analysis of paired-end reads is currently beyond the scope of this SOP**, but analysis of one of the two reads from a paired-end experiment is usually adequate.
    * Whether data have already been demultiplexed
        * If not 
            * fastq file(s) of their barcodes for experiments with a separate index read
            * **Analysis of experiments not using a separate index read is currently beyond the scope of this SOP**
            * **Analysis of experiments using indexes other than those from the standard protocol is currently beyond the scope of this SOP**
    * Whether data contains positive or negative controls
        * If so, how to identify them in the mapping file
    * As many as four categories from the mapping file that should be used to group data during primary analysis
    * (Optional) Preferences for 
        * Minimum sequencing depth to use during rarefaction (essentially, preference for whether analyst should favor broader sample coverage or better-quality data per sample).
            * Default: Analyst’s choice based on data
        * OTU picking method (de novo, open reference, or closed reference)
            * Default: open reference
            * If open or closed, source of OTU reference sequences
                * Default: current Greengenes 97% collection
        * Three or fewer alpha diversity metrics to employ
            * Default: number of observed OTUs, whole-tree phylogenetic diversity, and chao1
        * Two or fewer beta diversity metrics to employ
            * Default: unweighted UniFrac, weighted UniFrac 
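
Before submitting a mapping file, its header row can be given a quick sanity check along the lines below. This sketch assumes the column conventions described in the QIIME file-format documentation linked above (`#SampleID`, `BarcodeSequence`, and `LinkerPrimerSequence` first, `Description` last, tab-delimited); the function name, the example metadata column, and the sample rows are hypothetical, and the check is far less thorough than validate_mapping_file.py.

```python
import csv
import io

REQUIRED_LEADING = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]
REQUIRED_LAST = "Description"


def check_mapping_header(handle, delimiter="\t"):
    """Return a list of problems found in the header row (empty if none)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, [])
    problems = []
    if header[:3] != REQUIRED_LEADING:
        problems.append("first three columns must be: "
                        + ", ".join(REQUIRED_LEADING))
    if not header or header[-1] != REQUIRED_LAST:
        problems.append("last column must be: " + REQUIRED_LAST)
    if len(set(header)) != len(header):
        problems.append("column names must be unique")
    return problems


# Hypothetical mapping file content; TreatmentGroup is an invented category.
example = io.StringIO(
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tTreatmentGroup\tDescription\n"
    "S1\tAGCTGACTAGTC\tGTGCCAGCMGCCGCGGTAA\tcontrol\tfirst sample\n"
    "S2\tTCGATCGGACCA\tGTGCCAGCMGCCGCGGTAA\ttreated\tsecond sample\n"
)

print(check_mapping_header(example))  # an empty list means no problems found
```

Metadata columns such as the invented `TreatmentGroup` above are the ones from which the grouping categories for primary analysis are chosen.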

[Table of Contents](#table-of-contents)

<a name = "standard-outputs"></a>

## Standard Outputs

1.	If applicable, demultiplexed file(s)
2.	If applicable, corrected mapping file that passes QIIME validation
3.	Folder containing OTU picking outputs, including 
    * OTU table file(s) in .biom format
    * Representative sequence set file in fasta format
    * Representative set phylogenetic tree file in .tre format
    * Summary text file of counts per sample
4.	Folder containing alpha diversity analysis outputs, including
    * Alpha rarefaction curves in .png format
    * Categorized alpha diversity metric box plots in .pdf format
5.	Folder containing beta diversity outputs, including
    * Categorized beta diversity metric box plots in .pdf format
    * Interactive 3-D Emperor PCoA graphic, in .html format
6.	Brief methods text suitable for publication, describing key method steps and choices such as that of sample depth, in .docx format
7.	Summary report presentation describing findings, in .pptx format

In the case of successful primary analysis, all of the above will be delivered.  In the case of unsuccessful analysis due to low sequence quality, unsuccessful OTU picking, etc., deliverable 7 (summary report) will be delivered, as well as any other deliverables deemed meaningful by the analyst.

[Table of Contents](#table-of-contents)

<a name = "required-time-and-resources"></a>

## Required Time and Resources

Typical microbiome datasets comprise a portion of a MiSeq run (which can contain up to ~15 million reads) split over a few hundred multiplexed samples, using ~1-5 gigabytes (GB) of storage. Although precise run times depend on dataset size, sequence quality, the categories chosen for analysis, and so forth, computation for primary analysis of a typical microbiome dataset can usually be completed in less than one working day.  Only an hour or two of analyst time is required during this computation day, while another 2 to 8 hours of analyst effort are needed to assess the computed results, draw conclusions, and prepare deliverables; more time may be required for datasets that cause errors during primary analysis. For example, primary analysis computation for a dataset of ~6.5 million 100-base reads plus barcodes across 467 samples (~4 GB of sequence) took ~4 hours of elapsed time on a 3-node StarCluster composed of m3.2xlarge instances (each of which has 8 CPUs and 30 GB RAM) attached to a shared 30 GB EBS volume.

[Table of Contents](#table-of-contents)