## Illumina Overview Tutorial: Moving Pictures of the Human Microbiome

This tutorial covers a full QIIME workflow using Illumina sequencing data. It is intended to be quick to run, and as such uses only a subset of a full Illumina Genome Analyzer II (GAIIx) run. We'll make use of the [Greengenes](http://www.ncbi.nlm.nih.gov/pubmed/22134646) reference OTUs, the default reference database used by QIIME. You can determine which version of Greengenes is being used by running ``print_qiime_config.py``. The reference will be Greengenes unless you've configured QIIME to use a different reference database by default.

The data used in this tutorial are derived from the [Moving Pictures of the Human Microbiome](http://www.ncbi.nlm.nih.gov/pubmed/21624126) study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swabbing used toilet paper). These data were sequenced using the barcoded amplicon sequencing protocol described in [Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample](http://www.ncbi.nlm.nih.gov/pubmed/20534432). A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found [here](http://www.ncbi.nlm.nih.gov/pubmed/22402401).
    
This tutorial is presented as an IPython Notebook. For more information on using QIIME with IPython, see [Ragan-Kelley et al. (2013)](http://www.ncbi.nlm.nih.gov/pubmed/23096404). You can find more information on the IPython Notebook [here](http://ipython.org/notebook.html).


## Getting started

We'll begin by downloading the tutorial data.


You can use the ``FileLink`` and ``FileLinks`` features of the IPython notebook to view or download data. ``FileLinks`` is used for viewing or downloading directories, while ``FileLink`` is used for viewing or downloading single files.

We'll change to the ``moving_pictures_tutorial-1.9.0/illumina`` directory for the remaining steps. We also need to prepare the ``FileLink`` and ``FileLinks`` functions to work from this new location.

## Check our mapping file for errors

The QIIME mapping file contains all of the per-sample metadata, including technical information such as the primers and barcodes that were used for each sample, and information about the samples themselves, such as the body site they were taken from. In this data set we're looking at human microbiome samples from four sites on the bodies of two individuals at multiple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review the ``map.tsv`` file at the link in the previous cell to see an example of the data (or view the [published Google Spreadsheet version](https://docs.google.com/spreadsheets/d/1FXHtTmvw1gM4oUMbRdwQIEOZJlhFGeMNUvZmuEFqpps/pubhtml?gid=0&single=true), which is more nicely formatted).

In this step, we run ``validate_mapping_file.py`` to ensure that our mapping file is compatible with QIIME.

> The file `map.tsv` that was just referenced is our metadata file, just like we saw earlier. Let's take a quick look at it.

In this case there were no errors. If there had been, we would review the resulting HTML summary to find out what they were. You could then fix them in a spreadsheet program or text editor and rerun ``validate_mapping_file.py`` on the updated mapping file.

For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (``map-bad.tsv``). We'll next call ``validate_mapping_file.py`` on the file ``map-bad.tsv``. Review the resulting HTML report. What are the issues with this mapping file? 
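To give a feel for the kinds of problems a mapping file validator looks for, here is a minimal Python sketch. It is not the implementation of ``validate_mapping_file.py``; it only checks a few well-known format rules (the required leading columns, a final ``Description`` column, sample IDs made of letters, digits and periods, and unique sample IDs and barcodes). The function name ``check_mapping`` and the sample rows are hypothetical.

```python
import csv
import io
import re

REQUIRED_LEADING = ["#SampleID", "BarcodeSequence", "LinkerPrimerSequence"]


def check_mapping(text):
    """Return a list of problems found in tab-separated mapping-file text."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    if not rows:
        return ["file is empty"]
    header, data = rows[0], rows[1:]
    problems = []
    if header[:3] != REQUIRED_LEADING:
        problems.append("first three columns must be " + ", ".join(REQUIRED_LEADING))
    if header[-1] != "Description":
        problems.append("last column must be Description")
    seen_ids, seen_barcodes = set(), set()
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, found {len(row)}")
            continue
        sample_id, barcode = row[0], row[1]
        # Sample IDs may contain only letters, digits and periods.
        if not re.fullmatch(r"[A-Za-z0-9.]+", sample_id):
            problems.append(f"line {line_no}: invalid characters in sample ID {sample_id!r}")
        if sample_id in seen_ids:
            problems.append(f"line {line_no}: duplicate sample ID {sample_id!r}")
        if barcode in seen_barcodes:
            problems.append(f"line {line_no}: duplicate barcode {barcode!r}")
        seen_ids.add(sample_id)
        seen_barcodes.add(barcode)
    return problems


# Hypothetical two-sample mapping files, one clean and one with two problems.
header = "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tSubject\tDescription\n"
good = header + "S1\tAAAA\tGTGC\t1\tgut\nS2\tCCCC\tGTGC\t2\ttongue\n"
bad = header + "S1\tAAAA\tGTGC\t1\tgut\nS_2\tAAAA\tGTGC\t2\ttongue\n"

good_problems = check_mapping(good)
bad_problems = check_mapping(bad)
for problem in bad_problems:
    print(problem)
```

The real script reports considerably more than this (and distinguishes errors from warnings), so treat the sketch only as an aid for reading its HTML report.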

## Demultiplexing and quality filtering sequences

We next need to demultiplex our sequences (i.e., assign barcoded reads to the samples they are derived from) and quality filter them. In general, you should get separate fastq files for your sequence and barcode reads. Note that we pass these files while still gzipped: ``split_libraries_fastq.py`` can handle gzipped or unzipped fastq files. The default strategy in QIIME for quality filtering of Illumina data is described in [Bokulich et al. (2013)](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/).
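Conceptually, demultiplexing is a lookup from each read's barcode to a sample ID. The following Python sketch illustrates the idea on in-memory data. It is not how ``split_libraries_fastq.py`` is implemented, and its quality check (discard a read if any base falls below a threshold) is a deliberately simplistic stand-in for the strategy described by Bokulich et al. The function name, the ``min_quality`` value, the sample IDs and the reads are all hypothetical.

```python
def demultiplex(reads, barcodes, barcode_to_sample, min_quality=20):
    """Group read sequences by sample ID.

    reads: iterable of (sequence, list_of_quality_scores) tuples
    barcodes: iterable of barcode strings, in the same order as reads
    barcode_to_sample: dict mapping barcode -> sample ID (from the mapping file)
    """
    per_sample = {}
    for (seq, quals), barcode in zip(reads, barcodes):
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            continue  # barcode is not listed in the mapping file
        if min(quals) < min_quality:
            continue  # toy quality filter, an assumption for illustration
        per_sample.setdefault(sample, []).append(seq)
    return per_sample


# Hypothetical data: four reads, two known barcodes.
reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("GGCC", [30, 5, 30, 30]),   # fails the toy quality filter
    ("TTAA", [35, 35, 35, 35]),
    ("CCGG", [30, 30, 30, 30]),  # barcode GGGG is unknown
]
barcodes = ["AAAA", "AAAA", "CCCC", "GGGG"]
barcode_to_sample = {"AAAA": "gut.1", "CCCC": "tongue.1"}

assigned = demultiplex(reads, barcodes, barcode_to_sample)
print(assigned)
```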

Let's see what files are output by `split_libraries_fastq.py`.

We can use the following command to look at the first few lines of each of the output files. `seqs.fna` is a big file, so we don't want to try to open the whole thing in our web browser. It won't be happy!

We can see how many sequences we ended up with using ``count_seqs.py``.
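If you want per-sample counts as well as the total, the sequence labels are enough. This Python sketch assumes the labels in ``seqs.fna`` have the form ``SampleID_N`` followed by a space and the original read description. It is an illustration, not the implementation of ``count_seqs.py``, and the example records are made up.

```python
import io
from collections import Counter


def count_seqs(fasta_lines):
    """Return (total, per_sample_counts) for an iterable of FASTA lines."""
    per_sample = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a record header
        fields = line[1:].split()
        if not fields:
            continue  # header with no label
        # Assumed label format: "<SampleID>_<running number>".
        sample_id = fields[0].rsplit("_", 1)[0]
        per_sample[sample_id] += 1
    return sum(per_sample.values()), per_sample


# Hypothetical three-record file; pass an open file handle for real data.
fasta = (
    ">S1_0 read-a orig_bc=AAAA\nACGT\n"
    ">S1_1 read-b orig_bc=AAAA\nACGG\n"
    ">S2_2 read-c orig_bc=CCCC\nTTTT\n"
)
total, per_sample = count_seqs(io.StringIO(fasta))
print(total, dict(per_sample))
```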

## OTU picking: using an open-reference OTU picking protocol by searching reads against the Greengenes database

Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. There are three high-level ways to do this in QIIME: *de novo*, *closed-reference*, or *open-reference OTU picking*. Open-reference OTU picking is currently our preferred method. Discussion of these methods can be found in [Rideout et al. (2014)](https://peerj.com/articles/545/).

Here we apply open-reference OTU picking. Note that this command takes the ``seqs.fna`` file that was generated in the previous step. We're also specifying some parameters to the ``pick_otus.py`` command, which is internal to this workflow. Specifically, we set ``enable_rev_strand_match`` to ``True``, which allows sequences to match the reference database if either their forward or reverse orientation matches to a reference sequence. This parameter is specified in the *parameters file* which is passed as ``-p``. You can find information on defining parameters files [here](http://www.qiime.org/documentation/file_formats.html#qiime-parameters).
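A parameters file is plain text with one setting per line, written as ``script_name:parameter_name value``. The Python sketch below shows what the single line needed here looks like and how such a file can be read into a nested dictionary. The parser is a hypothetical helper written for illustration, not QIIME's own, and it assumes that lines beginning with ``#`` are comments.

```python
def parse_parameters(text):
    """Parse parameters-file text into {script: {parameter: value}}."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 1)           # "script:param" then the value
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        script, _, option = key.partition(":")
        params.setdefault(script, {})[option] = value
    return params


# The one setting described above, as it would appear in the file.
params_text = """\
# allow matches on either strand during OTU picking
pick_otus:enable_rev_strand_match True
"""

params = parse_parameters(params_text)
print(params)
```

Values stay as strings in this sketch; converting ``True`` to a boolean is left to whatever consumes the dictionary.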

**This step can take about 10 minutes to complete.**

The primary output that we get from this command is the *OTU table*, or the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomic Standards Consortium Biological Observation Matrix standard (BIOM) format for representing OTU tables. You can find additional information on the BIOM format [here](http://www.biom-format.org), and information on converting these files to tab-separated text that can be viewed in spreadsheet programs [here](http://biom-format.org/documentation/biom_conversion.html). Several OTU tables are generated by this command. The one we typically want to work with is ``otus/otu_table_mc2_w_tax_no_pynast_failures.biom``. This has singleton OTUs (or OTUs with a total count of 1) removed, as well as OTUs whose representative (i.e., centroid) sequence couldn't be aligned with [PyNAST](http://bioinformatics.oxfordjournals.org/content/26/2/266.long). It also contains taxonomic assignments for each OTU as *observation metadata*.
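To make the idea of an OTU table and of singleton removal concrete, here is a small Python sketch that stores a table as a plain dictionary of per-sample counts and drops every OTU whose total count is below 2. It does not use the BIOM library or file format; the function name ``filter_min_count`` and the OTU and sample names are hypothetical.

```python
def filter_min_count(otu_table, min_count=2):
    """Keep only OTUs whose total count across all samples is >= min_count."""
    return {
        otu: counts
        for otu, counts in otu_table.items()
        if sum(counts.values()) >= min_count
    }


# Hypothetical table: rows are OTUs, columns are samples.
otu_table = {
    "OTU1": {"S1": 5, "S2": 0},
    "OTU2": {"S1": 1, "S2": 0},  # singleton: observed once in total
    "OTU3": {"S1": 1, "S2": 1},  # total count of 2, so it is kept
}

filtered = filter_min_count(otu_table)
print(sorted(filtered))
```

Note that the filter is on the total across samples, so an OTU seen once in each of two samples is not a singleton.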

The open-reference OTU picking command also produces a phylogenetic tree where the tips are the OTUs. The file containing the tree is ``otus/rep_set.tre``, and is the file that should be used with ``otus/otu_table_mc2_w_tax_no_pynast_failures.biom`` in downstream phylogenetic diversity calculations. The tree is stored in the widely used [Newick format](http://scikit-bio.org/docs/latest/generated/skbio.io.newick.html).

To view the output of this command, call ``FileLink`` on the ``index.html`` file in the output directory.

To compute some summary statistics of the OTU table we can run the following command.

The key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the *Counts/sample detail* and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away. For some perspective on this, see [Kuczynski 2010](http://www.ncbi.nlm.nih.gov/pubmed/20441597).
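The trade-off can be tabulated directly from the per-sample counts. The Python sketch below, using made-up counts and a hypothetical helper name, reports for a candidate depth how many samples survive and how many sequences are discarded, counting both the reads of dropped samples and the reads above the chosen depth in retained samples.

```python
def depth_tradeoff(counts_per_sample, depth):
    """Summarize what is kept and lost when sampling every sample to `depth`."""
    kept = {s: n for s, n in counts_per_sample.items() if n >= depth}
    seqs_total = sum(counts_per_sample.values())
    seqs_used = depth * len(kept)  # each retained sample contributes `depth` reads
    return {
        "samples_kept": len(kept),
        "samples_dropped": len(counts_per_sample) - len(kept),
        "seqs_discarded": seqs_total - seqs_used,
    }


# Hypothetical per-sample sequence counts.
counts = {"A": 1000, "B": 1500, "C": 4000, "D": 4200}

shallow = depth_tradeoff(counts, 1000)
deep = depth_tradeoff(counts, 4000)
print(shallow)
print(deep)
```

With these numbers, a depth of 1000 keeps all four samples but discards 6700 sequences, while a depth of 4000 discards only 2700 sequences at the cost of two samples.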