# Illumina Overview Tutorial: Moving Pictures of the Human Microbiome

# 0. Notes:



* ``pick_open_reference_otus.py`` calls internal QIIME functions that aren't exposed to a regular Python shell (they can't be imported). Specifically, ``pick_open_reference_otus.py`` calls the workflow function ``pick_subsampled_open_reference_otus()``.

* /usr/local/lib/python2.7/dist-packages/qiime/workflow/pick_open_reference_otus.py

import qiime
from qiime.filter import filter_otus_from_otu_map
qiime.filter.filter_otus_from_otu_map('otus/final_otu_map.txt', 'otus/final_otu_map_mc2.txt', 2)


* The log file below gives some clues:
file:///home/qiime/Desktop/Qiime_Notebook/qiime_illumina_notebook/moving_pictures_tutorial-1.9.0/illumina/otus/log_20160711181547.txt

* Two steps are unaccounted for (no commands given):
https://groups.google.com/forum/#!topic/qiime-forum/q5pO-xhCONU

* OTU picking produces these results:
final_otu_map_mc2.txt
new_refseqs.fna
final_otu_map.txt
otu_table_mc2.biom
rep_set.fna
index.html
otu_table_mc2_w_tax.biom
rep_set.tre
log_20160711181547.txt
otu_table_mc2_w_tax_no_pynast_failures.biom

otus/pynast_aligned_seqs:
rep_set_aligned.fasta
rep_set_aligned_pfiltered.fasta
rep_set_failures.fasta
rep_set_log.txt

otus/step1_otus:
failures.fasta
seqs_clusters.uc
seqs_failures.txt
seqs_otus.log
seqs_otus.txt
step1_rep_set.fna

otus/step4_otus:
failures_clusters.uc
failures_otus.log
failures_otus.txt
step4_rep_set.fna

otus/uclust_assigned_taxonomy:
rep_set_tax_assignments.log
rep_set_tax_assignments.txt





# 1. About
This notebook was taken from http://nbviewer.jupyter.org/github/biocore/qiime/blob/1.9.1/examples/ipynb/illumina_overview_tutorial.ipynb

It has been edited/annotated for content relevance.

This tutorial covers a full QIIME workflow using Illumina sequencing data. This tutorial is intended to be quick to run, and as such, uses only a subset of a full Illumina Genome Analyzer II (GAIIx) run. We'll make use of the [Greengenes](http://www.ncbi.nlm.nih.gov/pubmed/22134646) reference OTUs, which is the default reference database used by QIIME. You can determine which version of Greengenes is being used by running ``print_qiime_config.py``. This will be [Greengenes](http://www.ncbi.nlm.nih.gov/pubmed/22134646), unless you've configured QIIME to use a different reference database by default.

The data used in this tutorial are derived from the [Moving Pictures of the Human Microbiome](http://www.ncbi.nlm.nih.gov/pubmed/21624126) study, where two human subjects collected daily samples from four body sites: the tongue, the palm of the left hand, the palm of the right hand, and the gut (via fecal samples obtained by swabbing used toilet paper). These data were sequenced using the barcoded amplicon sequencing protocol described in [Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample](http://www.ncbi.nlm.nih.gov/pubmed/20534432). A more recent version of this protocol that can be used with the Illumina HiSeq 2000 and MiSeq can be found [here](http://www.ncbi.nlm.nih.gov/pubmed/22402401).
    
This tutorial is presented as an IPython Notebook. For more information on using QIIME with IPython, see [Ragan-Kelley et al. (2013)](http://www.ncbi.nlm.nih.gov/pubmed/23096404). You can find more information on the IPython Notebook [here](http://ipython.org/notebook.html).


## 2. Install QIIME libraries

In [None]:
!pip install qiime

## 3. Necessary Files

We'll begin by downloading the tutorial data.


In [1]:
#####################################################
##### Downloads and unzips all tutorial files #######
#####################################################
import urllib2
import tarfile
import os

# make a directory for our tutorial, and jump into it
root="qiime_illumina_notebook"
if os.getcwd().split('/')[-1] != root:
    if not os.path.isdir(root):
        os.mkdir(root)
    os.chdir(root)
print "Current working directory is: " + os.getcwd()

# Files to grab
files = [('ftp://ftp.microbio.me/qiime/tutorial_files/moving_pictures_tutorial-1.9.0.tgz','')]

# Grab and unzip the files
for url, path in files:
    target = url.split('/')[-1]
    if not os.path.isfile(target):
        resource = urllib2.urlopen(url)
        print "Downloading " + target + "...\n"
        with open(target, 'wb') as out:
            out.write(resource.read())
        
        if url.split('.')[-1] == 'tgz':
            print "Extracting " + target + "...\n"
            tarfile.open(target, "r:gz").extractall()

# Change into the extracted tutorial directory
tutorial_dir = 'moving_pictures_tutorial-1.9.0/illumina'
os.chdir(tutorial_dir)
print "Current working directory is: " + os.getcwd()
print "\nDone!"


Current working directory is: /home/qiime/Desktop/Qiime_Notebook/qiime_illumina_notebook
Current working directory is: /home/qiime/Desktop/Qiime_Notebook/qiime_illumina_notebook/moving_pictures_tutorial-1.9.0/illumina

Done!


# 4. Commands

## 1. Check our mapping file for errors
The QIIME mapping file contains all of the per-sample metadata, including technical information such as primers and barcodes that were used for each sample, and information about the samples, including what body site they were taken from. In this data set we're looking at human microbiome samples from four sites on the bodies of two individuals at multiple time points. The metadata in this case therefore includes a subject identifier, a timepoint, and a body site for each sample. You can review the ``map.tsv`` file at the link in the previous cell to see an example of the data (or view the [published Google Spreadsheet version](https://docs.google.com/spreadsheets/d/1FXHtTmvw1gM4oUMbRdwQIEOZJlhFGeMNUvZmuEFqpps/pubhtml?gid=0&single=true), which is more nicely formatted).
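To make the structure of a mapping file concrete, here is a minimal sketch of a few of the kinds of checks a validator can perform. It is not the implementation of ``validate_mapping_file.py``; the function names (``check_mapping``, ``duplicates``) and the three-row example file are made up for illustration, and the sketch only assumes the standard QIIME header layout (``#SampleID`` first, ``Description`` last, with ``BarcodeSequence`` and ``LinkerPrimerSequence`` columns).

```python
import csv

# Made-up mapping file for illustration; the real map.tsv has more columns.
EXAMPLE_MAP = (
    "#SampleID\tBarcodeSequence\tLinkerPrimerSequence\tSampleType\tDescription\n"
    "S1\tAAAA\tGTGC\tgut\tfirst sample\n"
    "S2\tCCCC\tGTGC\ttongue\tsecond sample\n"
    "S2\tCCCC\tGTGC\ttongue\trepeated row\n"
)


def duplicates(values):
    """Return the sorted values that occur more than once."""
    return sorted(set(v for v in values if values.count(v) > 1))


def check_mapping(text):
    """Return a list of human-readable problems found in a mapping file."""
    rows = list(csv.reader(text.splitlines(), delimiter="\t"))
    header, data = rows[0], rows[1:]
    problems = []
    if header[0] != "#SampleID":
        problems.append("first column must be #SampleID")
    if header[-1] != "Description":
        problems.append("last column must be Description")
    for required in ("BarcodeSequence", "LinkerPrimerSequence"):
        if required not in header:
            problems.append("missing column: " + required)
    repeated_ids = duplicates([row[0] for row in data])
    if repeated_ids:
        problems.append("duplicate sample IDs: " + ", ".join(repeated_ids))
    if "BarcodeSequence" in header:
        column = header.index("BarcodeSequence")
        repeated_barcodes = duplicates([row[column] for row in data])
        if repeated_barcodes:
            problems.append("duplicate barcodes: " + ", ".join(repeated_barcodes))
    return problems


for problem in check_mapping(EXAMPLE_MAP):
    print(problem)
```

The example file deliberately repeats a sample ID and a barcode, so both problems are reported.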

In this step, we run ``validate_mapping_file.py`` to ensure that our mapping file is compatible with QIIME.

In [2]:
!validate_mapping_file.py -o vmf-map/ -m map.tsv



In this case there were no errors, but if there were we would review the resulting HTML summary to find out what errors are present. You could then fix those in a spreadsheet program or text editor and rerun ``validate_mapping_file.py`` on the updated mapping file.

For the sake of illustrating what errors in a mapping file might look like, we've created a bad mapping file (``map-bad.tsv``). We'll next call ``validate_mapping_file.py`` on the file ``map-bad.tsv``. Review the resulting HTML report. What are the issues with this mapping file? 

### 1.1 Documentation:
http://qiime.org/scripts/validate_mapping_file.html

### 1.2 Output Files:
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>\*.html</td><td>HTML version of the mapping file.  Highlights errors/warnings and provides error messages.</td></tr>
<tr><td>\*.log</td><td>Error log.</td></tr>
<tr><td>\*_corrected.txt</td><td>Original mapping file annotated to comply with MIENS.  Any invalid characters in the SampleID field will be replaced with ‘.’ characters.  Invalid text in other data fields will be replaced with the character given by the -c parameter (“_” by default).</td></tr>
<tr><td>overlib.js</td><td>JavaScript to run the .html page.  Contains some configuration options (color, spacing, etc.)</td></tr>
<tr><td>assembledReads.report</td><td>Assembly report for each sequence</td></tr>
</table>



## 2. Demultiplexing and quality filtering sequences

We next need to demultiplex our sequences (i.e., assign barcoded reads to the samples they are derived from) and quality filter them. In general, you should get separate fastq files for your sequence and barcode reads. Note that we pass these files while still gzipped: ``split_libraries_fastq.py`` can handle gzipped or unzipped fastq files. The default strategy in QIIME for quality filtering of Illumina data is described in [Bokulich et al. (2013)](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3531572/).
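The idea behind demultiplexing can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: barcodes are matched exactly, and the quality filtering and barcode error handling that ``split_libraries_fastq.py`` performs are omitted. The function name ``demultiplex`` and the toy reads and barcodes are hypothetical.

```python
def demultiplex(reads, barcodes, barcode_to_sample):
    """Group reads by sample, using the barcode read at the same position.

    reads and barcodes are parallel lists: barcodes[i] belongs to reads[i].
    Returns (reads grouped by sample, number of reads with unknown barcodes).
    """
    by_sample = {}
    unassigned = 0
    for read, barcode in zip(reads, barcodes):
        sample = barcode_to_sample.get(barcode)
        if sample is None:
            # Barcode not listed in the mapping file: drop the read.
            unassigned += 1
            continue
        by_sample.setdefault(sample, []).append(read)
    return by_sample, unassigned


# Toy data (made up): two samples and one read with an unknown barcode.
barcode_to_sample = {"AAAA": "S1", "CCCC": "S2"}
reads = ["ACGTACGT", "TTGGCCAA", "GGGGAAAA", "ACACACAC"]
barcodes = ["AAAA", "CCCC", "AAAA", "GGGG"]

by_sample, unassigned = demultiplex(reads, barcodes, barcode_to_sample)
print(by_sample)
print("unassigned reads: %d" % unassigned)
```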

In [3]:
!split_libraries_fastq.py -o slout/ -i forward_reads.fastq.gz -b barcodes.fastq.gz -m map.tsv

Next, we can see how many sequences we ended up with using ``count_seqs.py``.

In [4]:
!count_seqs.py -i slout/seqs.fna


186333  : slout/seqs.fna (Sequence lengths (mean +/- std): 132.2422 +/- 9.8806)
186333  : Total
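The summary above (a sequence count plus the mean and standard deviation of the sequence lengths) can be sketched in plain Python. This is an illustration, not the code of ``count_seqs.py``; the helper names are hypothetical, the fasta records are made up, and the use of the population standard deviation is an assumption.

```python
import math


def fasta_lengths(lines):
    """Return the length of every sequence in an iterable of fasta lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the length of the previous one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths):
    """Return (count, mean, population standard deviation)."""
    n = len(lengths)
    mean = sum(lengths) / float(n)
    variance = sum((x - mean) ** 2 for x in lengths) / float(n)
    return n, mean, math.sqrt(variance)


# Made-up records; the last sequence is wrapped over two lines.
toy_fasta = ">S1_0\nACGT\n>S1_1\nACGTAC\n>S2_2\nACGT\nACGT\n".splitlines()
count, mean, std = summarize(fasta_lengths(toy_fasta))
print("%d sequences, length %.4f +/- %.4f" % (count, mean, std))
```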


### 2.1 Documentation:
http://qiime.org/scripts/split_libraries_fastq.html

### 2.2 Output Files:
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>histograms.txt</td><td>Histogram of sequence lengths.</td></tr>
<tr><td>seqs.fna</td><td>Cleaned sequences, grouped and tagged with their sample name.</td></tr>
<tr><td>split_library_log.txt</td><td>Summary + log file.  Lists input files, number of rejected sequences (due to length, barcode errors, too many N, etc).  Lists the number of sequences in each sample.</td></tr>
</table>

## 3. OTU picking: using an open-reference OTU picking protocol by searching reads against the Greengenes database.

Now that we have demultiplexed sequences, we're ready to cluster these sequences into OTUs. There are three high-level ways to do this in QIIME. We can use *de novo*, *closed-reference*, or *open-reference OTU picking*. Open-reference OTU picking is currently our preferred method. Discussion of these methods can be found in [Rideout et al. (2014)](https://peerj.com/articles/545/).

Here we apply open-reference OTU picking. Note that this command takes the ``seqs.fna`` file that was generated in the previous step. We're also specifying some parameters to the ``pick_otus.py`` command, which is internal to this workflow. Specifically, we set ``enable_rev_strand_match`` to ``True``, which allows sequences to match the reference database if either their forward or reverse orientation matches to a reference sequence. This parameter is specified in the *parameters file* which is passed as ``-p``. You can find information on defining parameters files [here](http://www.qiime.org/documentation/file_formats.html#qiime-parameters).

**This step can take about 10 minutes to complete.**

In [5]:
!pick_open_reference_otus.py -o otus/ -i slout/seqs.fna -p ../uc_fast_params.txt

Error in pick_open_reference_otus.py: Output directory already exists. Please choose a different directory, or force overwrite with -f.

If you need help with QIIME, see:
http://help.qiime.org


The primary output that we get from this command is the *OTU table*, or the number of times each operational taxonomic unit (OTU) is observed in each sample. QIIME uses the Genomic Standards Consortium Biological Observation Matrix standard (BIOM) format for representing OTU tables. You can find additional information on the BIOM format [here](http://www.biom-format.org), and information on converting these files to tab-separated text that can be viewed in spreadsheet programs [here](http://biom-format.org/documentation/biom_conversion.html). Several OTU tables are generated by this command. The one we typically want to work with is ``otus/otu_table_mc2_w_tax_no_pynast_failures.biom``. This has singleton OTUs (i.e., OTUs with a total count of 1) removed, as well as OTUs whose representative (i.e., centroid) sequence couldn't be aligned with [PyNAST](http://bioinformatics.oxfordjournals.org/content/26/2/266.long). It also contains taxonomic assignments for each OTU as *observation metadata*.

The open-reference OTU picking command also produces a phylogenetic tree where the tips are the OTUs. The file containing the tree is ``otus/rep_set.tre``, and is the file that should be used with ``otus/otu_table_mc2_w_tax_no_pynast_failures.biom`` in downstream phylogenetic diversity calculations. The tree is stored in the widely used [newick format](http://scikit-bio.org/docs/latest/generated/skbio.io.newick.html).

To view the output of this command, open the ``index.html`` file in the output directory.

### 3.1 Documentation:
http://qiime.org/scripts/pick_open_reference_otus.html

### 3.2 Output Files:
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>index.html</td><td>HTML table with links and descriptions of some of the output files relating to OTU identification.</td></tr>
<tr><td>log_[timestamp].txt</td><td>Log file.  Dumps parameters and variables.  Contains the sub commands run by pick_open_reference_otus.py and input/output/errors. </td></tr>
<tr><td>rep_set_tax_assignments.txt</td><td>   ?    </td></tr>
<tr><td>otu_table_mc2.biom</td><td>Binary file. OTU table excluding OTUs with fewer than 2 sequences.</td></tr>
<tr><td>otu_table_mc2_w_tax.biom</td><td>Binary file. OTU table excluding OTUs with fewer than 2 sequences and including OTU taxonomy assignments.</td></tr>
<tr><td>otu_table_mc2_w_tax_no_pynast_failures.biom</td><td>Binary file. OTU table excluding OTUs with fewer than 2 sequences and OTUs whose representative sequences fail to align with PyNAST, and including OTU taxonomy assignments.</td></tr>
<tr><td>rep_set.tre</td><td>OTU phylogenetic tree</td></tr>

<tr><td>new_refseqs.fna</td><td>New reference sequences (i.e., OTU representative sequences plus input reference sequences)</td></tr>
<tr><td>final_otu_map.txt</td><td>List of OTU numbers and all corresponding samples.  Includes OTUs from closed and denovo clustering.</td></tr>
<tr><td>final_otu_map_mc2.txt</td><td>List of OTU numbers and all corresponding samples.  Includes OTUs from closed and denovo clustering, minus any singleton clusters.</td></tr>



<tr><td>uclust_assigned_taxonomy/rep_set.fna</td><td>OTU representative sequences</td></tr>
<tr><td>uclust_assigned_taxonomy/rep_set_tax_assignments.log</td><td>? uClust file. A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>

<tr><td>step1_otus/failures.fasta</td><td>The failures from seqs_failures.txt, but with BPs to make a fasta file.</td></tr>
<tr><td>step1_otus/seqs_clusters.uc</td><td>? uClust file of successful closed ref clusters.  A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>
<tr><td>step1_otus/seqs_failures.txt</td><td>List of sample ids (no BPs) that failed to match.</td></tr>
<tr><td>step1_otus/seqs_otus.log</td><td>Dump of parameters used in closed-ref clustering.</td></tr>
<tr><td>step1_otus/seqs_otus.txt</td><td>List of clusters (from closed ref) and the samples belonging to them.</td></tr>
<tr><td>step1_otus/step1_rep_set.fna</td><td>Fasta file containing representative sequences from each successfully ID'd OTU.</td></tr>

<tr><td>step4_otus/failures_clusters.uc</td><td>?? uClust file of denovo clustering failures.  A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>
<tr><td>step4_otus/failures_otus.log</td><td>Dump of parameters used in denovo clustering.</td></tr>
<tr><td>step4_otus/failures_otus.txt</td><td>Lists clusters (obtained from failed closed-ref assignment) and the samples belonging to them.</td></tr>
<tr><td>step4_otus/step4_rep_set.fna</td><td> ???? Representative subset from failures.  For ?</td></tr>

<tr><td>pynast_aligned_seqs/rep_set_aligned.fasta</td><td>Fasta file containing all aligned sequences.</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_aligned_pfiltered.fasta</td><td>Fasta file containing only conserved positions.  i.e. Columns with only gaps are removed.</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_failures.fasta</td><td>Fasta file containing all sequences not meeting criteria specified.</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_log.txt</td><td>Table detailing final sequence identification (sequence ID, Species#, BLAST %, etc).</td></tr>

</table>

## Closed Reference OTU Picking
The first step in open reference OTU picking is to match as many sequences as possible to a reference database.

### Documentation
http://qiime.org/scripts/pick_otus.html

### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>step1_otus/seqs_otus.log</td><td>Dump of parameters used in closed ref picking.</td></tr>
<tr><td>step1_otus/seqs_otus.txt</td><td>List of clusters (from closed ref) and the samples belonging to them.</td></tr>
<tr><td>step1_otus/seqs_failures.txt</td><td>List of sample ids (no BPs) that failed closed ref OTU picking.</td></tr>
<tr><td>?unsure where this goes? step1_otus/seqs_clusters.uc</td><td>? uClust file of successful closed ref clusters.  A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>
</table>
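As a rough illustration of closed-reference picking with reverse strand matching, the sketch below assigns each read to the first reference sequence it matches in either orientation and records the rest as failures. It is a toy under stated assumptions: ``identity`` is a position-by-position comparison standing in for the real uclust alignment, and all names, sequences, and the threshold handling are hypothetical rather than taken from ``pick_otus.py``.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def identity(a, b):
    """Fraction of matching positions; a stand-in for an alignment score."""
    if len(a) != len(b):
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(len(a))


def closed_reference(reads, reference, threshold=0.97, rev_strand=True):
    """Return (OTU map keyed by reference ID, list of failed read IDs)."""
    otu_map = {}
    failures = []
    for read_id, seq in reads:
        candidates = [seq]
        if rev_strand:
            # Also try the read in the opposite orientation.
            candidates.append(reverse_complement(seq))
        hit = None
        for ref_id, ref_seq in reference:
            if any(identity(c, ref_seq) >= threshold for c in candidates):
                hit = ref_id
                break
        if hit is None:
            # No new cluster is created; the read is reported as a failure.
            failures.append(read_id)
        else:
            otu_map.setdefault(hit, []).append(read_id)
    return otu_map, failures


# Made-up reference and reads; S1_1 only matches ref1 on the reverse strand.
REFERENCE = [("ref1", "AAAACCCC"), ("ref2", "ACACACAC")]
READS = [("S1_0", "AAAACCCC"), ("S1_1", "GGGGTTTT"), ("S2_2", "TTTTTTTT")]

otu_map, failures = closed_reference(READS, REFERENCE)
print(otu_map)
print(failures)
```

With ``rev_strand=False`` the read ``S1_1`` would end up among the failures, which is why reverse strand matching is enabled in this workflow.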

In [None]:
!pick_otus.py -i slout/seqs.fna -o otus/step1_otus -r /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta -m uclust_ref --enable_rev_strand_match --suppress_new_clusters

## Generate full failures fasta file
We need a fasta file containing the sequences that failed closed ref OTU picking.  However, the last step produced a file (seqs_failures.txt) that lists the failures by sequence ID only (without the base-pair sequences, BPs), so we need to reconstruct a fasta file by looking up each failure's sequence in the original fasta file (seqs.fna).
### Documentation
http://qiime.org/scripts/filter_fasta.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>step1_otus/failures.fasta</td><td>The failures from seqs_failures.txt, but with BPs to make a fasta file.</td></tr>
</table>
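The lookup described above can be sketched as a simple filter over fasta lines. This is an illustrative stand-in for what ``filter_fasta.py`` does with a list of sequence IDs, not its actual code; the function name and the toy records are hypothetical.

```python
def filter_fasta(fasta_lines, keep_ids):
    """Keep only the fasta records whose ID (first word after '>') is listed."""
    keep = set(keep_ids)
    kept_lines = []
    writing = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            # Decide once per record whether its lines are kept.
            writing = seq_id in keep
        if writing:
            kept_lines.append(line)
    return kept_lines


# Made-up records; only S1_1 and S2_2 are listed as failures.
toy_seqs = [
    ">S1_0 some comment",
    "ACGTACGT",
    ">S1_1 some comment",
    "GGGGTTTT",
    ">S2_2 some comment",
    "TTTT",
    "TTTT",
]
failed_ids = ["S1_1", "S2_2"]

for line in filter_fasta(toy_seqs, failed_ids):
    print(line)
```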

In [None]:
!filter_fasta.py -f slout/seqs.fna -s otus/step1_otus/seqs_failures.txt -o otus/step1_otus/failures.fasta

## Pick a representative from each closed ref cluster
Picks a representative set from the sequences that were identified in closed ref OTU picking.
### Documentation
http://qiime.org/scripts/pick_rep_set.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>step1_otus/step1_rep_set.fna</td><td>Fasta file containing representative sequences from each successfully ID'd OTU.</td></tr>
</table>

In [None]:
!pick_rep_set.py -i otus/step1_otus/seqs_otus.txt -o otus/step1_otus/step1_rep_set.fna -f slout/seqs.fna

## Pick de novo OTUs on closed ref failures
From the closed ref failures, perform denovo clustering to identify novel OTUs.
### Documentation
http://qiime.org/scripts/pick_otus.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>step4_otus/failures_otus.log</td><td>Dump of parameters used in denovo clustering.</td></tr>
<tr><td>step4_otus/failures_otus.txt</td><td>Lists clusters (obtained from failed closed-ref assignment) and the samples belonging to them.</td></tr>
<tr><td>?unsure if this belongs here? step4_otus/failures_clusters.uc</td><td>?? uClust file of denovo clustering failures.  A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>
</table>
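De novo clustering can be pictured as a greedy, centroid-based procedure: each sequence joins the first existing cluster whose centroid it resembles closely enough, and otherwise starts a new cluster. The sketch below is a toy version of that idea, not uclust itself. The ``identity`` function is a position-by-position stand-in for a real alignment, the sequences are made up, and numbering the new OTUs from 0 is an assumption; only the ``New.CleanUp.ReferenceOTU`` prefix comes from the command in this section.

```python
def identity(a, b):
    """Fraction of matching positions; a stand-in for an alignment score."""
    if len(a) != len(b):
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / float(len(a))


def greedy_denovo(seqs, threshold=0.97, prefix="New.CleanUp.ReferenceOTU"):
    """Cluster (seq_id, sequence) pairs greedily around centroid sequences."""
    centroids = []  # list of (otu_id, centroid sequence), in creation order
    otu_map = {}
    for seq_id, seq in seqs:
        for otu_id, centroid in centroids:
            if identity(seq, centroid) >= threshold:
                otu_map[otu_id].append(seq_id)
                break
        else:
            # No centroid was close enough: this sequence seeds a new OTU.
            otu_id = "%s%d" % (prefix, len(centroids))
            centroids.append((otu_id, seq))
            otu_map[otu_id] = [seq_id]
    return otu_map


# Made-up failures: f2 differs from f1 at one of ten positions.
toy_failures = [
    ("f1", "AAAAAAAAAA"),
    ("f2", "AAAAAAAAAT"),
    ("f3", "CCCCCCCCCC"),
]
clusters = greedy_denovo(toy_failures, threshold=0.9)
for otu_id in sorted(clusters):
    print(otu_id + "\t" + "\t".join(clusters[otu_id]))
```

Note that the result of a greedy procedure like this depends on the order in which sequences are processed.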

In [None]:
!pick_otus.py -i otus/step1_otus/failures.fasta -o otus/step4_otus/ -m uclust  --denovo_otu_id_prefix New.CleanUp.ReferenceOTU --enable_rev_strand_match

## Merge OTU maps
Merge the closed ref and denovo OTU maps into one OTU map.
### Documentation
http://man7.org/linux/man-pages/man1/cat.1.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>final_otu_map.txt</td><td>List of OTU numbers and all corresponding samples.  Includes OTUs from closed and denovo clustering.</td></tr>
</table>

In [None]:
!cat otus/step1_otus/seqs_otus.txt  otus/step4_otus/failures_otus.txt > otus/final_otu_map.txt

## Pick representative for each denovo cluster
Picks a representative sequence from each cluster identified by denovo clustering.
### Documentation
http://qiime.org/scripts/pick_rep_set.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>step4_otus/step4_rep_set.fna</td><td>Representative subset of sequences from each denovo cluster.</td></tr>
</table>

In [None]:
!pick_rep_set.py -i otus/step4_otus/failures_otus.txt -o otus/step4_otus/step4_rep_set.fna -f otus/step1_otus/failures.fasta

## Filter singletons (fewer than 2 members) from the OTU map using the QIIME API
Singletons in the OTU map are removed.
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>final_otu_map_mc2.txt</td><td>List of OTU numbers and all corresponding samples.  Includes OTUs from closed and denovo clustering, minus any singleton clusters.</td></tr>
</table>
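To show what removing singletons from an OTU map means, here is a minimal sketch that works on the tab-separated OTU map format (OTU ID followed by the IDs of its member sequences). It is a hypothetical helper for illustration, not the implementation of ``filter_otus_from_otu_map``, and the example lines are made up.

```python
def filter_otu_map(lines, min_count=2):
    """Keep only OTU map lines whose OTU has at least min_count sequences."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        seq_ids = fields[1:]  # fields[0] is the OTU ID
        if len(seq_ids) >= min_count:
            kept.append(line)
    return kept


# Made-up OTU map: the second OTU is a singleton.
toy_otu_map = [
    "1027904\tS1_0\tS1_5\tS2_9",
    "New.CleanUp.ReferenceOTU7\tS2_3",
    "816470\tS1_2\tS2_4",
]

for line in filter_otu_map(toy_otu_map, min_count=2):
    print(line)
```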

In [12]:
import qiime
from qiime.filter import filter_otus_from_otu_map
a=qiime.filter.filter_otus_from_otu_map('otus/final_otu_map.txt', 'otus/final_otu_map_mc2.txt', 2)
print a

set(['1027904', '1050608', '2595164', '816470', 'New.CleanUp.ReferenceOTU839', '410908', 'New.CleanUp.ReferenceOTU348', 'New.CleanUp.ReferenceOTU1629', '4375688', 'New.CleanUp.ReferenceOTU1627', '4128584', '748537', '410905', '330294', '4307790', '4400735', '829373', '253429', '1108960', '1119668', '4454356', '866280', 'New.CleanUp.ReferenceOTU2410', '82194', '587041', '241454', 'New.CleanUp.ReferenceOTU892', 'New.CleanUp.ReferenceOTU1139', 'New.CleanUp.ReferenceOTU909', 'New.CleanUp.ReferenceOTU908', '580916', '844940', '959160', 'New.CleanUp.ReferenceOTU903', '3482976', '359105', '244248', '4357712', '366716', '4343580', '1093417', '4339144', '4371046', '230421', 'New.CleanUp.ReferenceOTU349', '198866', '153978', '463861', '807329', '592925', '505587', '697990', '165827', '519353', '4373910', '250288', '2949328', '514041', 'New.CleanUp.ReferenceOTU2399', '766563', '505565', 'New.CleanUp.ReferenceOTU1626', '366794', '2163609', '215231', '392111', '985339', '533787', '563240', 'New.Cle

## Add reference data to denovo 
### Documentation
http://qiime.org/scripts/assign_taxonomy.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>new_refseqs.fna</td><td>New reference sequences (i.e., OTU representative sequences plus input reference sequences)</td></tr>
</table>



In [9]:
!cp /usr/local/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta otus/new_refseqs.fna

## Make the OTU table 
Build the OTU table from the previous step's OTU map.
### Documentation
http://qiime.org/scripts/make_otu_table.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>otu_table_mc2.biom</td><td>Binary file. OTU table excluding OTUs with fewer than 2 sequences.</td></tr>
</table>
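Conceptually, the OTU table is a count of how many sequences from each sample fall into each OTU. The sketch below derives such counts from OTU map lines, assuming that sequence IDs have the form ``SampleID_number`` (the form written during demultiplexing). The function name and data are hypothetical, and the result is a plain dictionary rather than a BIOM file.

```python
from collections import defaultdict


def otu_table_from_map(lines):
    """Return {otu_id: {sample_id: count}} from tab-separated OTU map lines."""
    table = defaultdict(lambda: defaultdict(int))
    for line in lines:
        fields = line.strip().split("\t")
        otu_id = fields[0]
        for seq_id in fields[1:]:
            # Assumption: everything before the last underscore is the sample ID.
            sample_id = seq_id.rsplit("_", 1)[0]
            table[otu_id][sample_id] += 1
    return {otu_id: dict(counts) for otu_id, counts in table.items()}


# Made-up OTU map lines.
toy_map = [
    "1027904\tS1_0\tS1_5\tS2_9",
    "816470\tS1_2\tS2_4",
]
toy_table = otu_table_from_map(toy_map)
for otu_id in sorted(toy_table):
    print(otu_id + " " + str(sorted(toy_table[otu_id].items())))
```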

In [10]:
!make_otu_table.py -i otus/final_otu_map_mc2.txt -o otus/otu_table_mc2.biom

Traceback (most recent call last):
  File "/home/qiime/anaconda2/bin/make_otu_table.py", line 119, in <module>
    main()
  File "/home/qiime/anaconda2/bin/make_otu_table.py", line 115, in main
    write_biom_table(biom_otu_table, opts.output_biom_fp)
  File "/home/qiime/anaconda2/lib/python2.7/site-packages/qiime/util.py", line 569, in write_biom_table
    "Attempting to write an empty BIOM table to disk. "
qiime.util.EmptyBIOMTableError: Attempting to write an empty BIOM table to disk. QIIME doesn't support writing empty BIOM output files.


## Assign Taxonomy
Assign taxonomy to each OTU representative sequence.
### Documentation
http://qiime.org/scripts/assign_taxonomy.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>uclust_assigned_taxonomy/rep_set_tax_assignments.txt</td><td>   ?    </td></tr>
<tr><td>uclust_assigned_taxonomy/rep_set.fna</td><td>OTU representative sequences</td></tr>
<tr><td>uclust_assigned_taxonomy/rep_set_tax_assignments.log</td><td>? uClust file. A table with data about the clusters: Type, ClusterID, SeqLength or ClusterSize, %ID, etc.</td></tr>
</table>

In [None]:
!assign_taxonomy.py -o otus/uclust_assigned_taxonomy -i otus/rep_set.fna

## Add taxa to OTU table
Add taxa metadata to the OTU table.
### Documentation
http://biom-format.org/documentation/adding_metadata.html
### Output Files
<table><tr><td>Output File</td><td>Description</td></tr>
<tr><td>otu_table_mc2_w_tax.biom</td><td>Binary file. OTU table excluding OTUs with fewer than 2 sequences and including OTU taxonomy assignments.</td></tr>

</table>

In [None]:
!biom add-metadata -i otus/otu_table_mc2.biom --observation-metadata-fp otus/uclust_assigned_taxonomy/rep_set_tax_assignments.txt -o otus/otu_table_mc2_w_tax.biom --sc-separated taxonomy --observation-header OTUID,taxonomy

## Align sequences command 
Align the sequences with the Greengenes dataset.
### Documentation
http://qiime.org/scripts/align_seqs.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_aligned.fasta</td><td>Fasta file containing all aligned sequences.</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_failures.fasta</td><td>Fasta file containing all sequences not meeting criteria specified.</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_log.txt</td><td>Table detailing final sequence identification (sequence ID, Species#, BLAST %, etc).</td></tr>
</table>

In [None]:
!align_seqs.py -i otus/rep_set.fna -o otus/pynast_aligned_seqs 

## Filter alignment
Removes columns with gaps in all sequences.
### Documentation
http://qiime.org/scripts/filter_alignment.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>pynast_aligned_seqs/rep_set_aligned_pfiltered.fasta</td><td>Fasta file containing only conserved positions.  i.e. Columns with only gaps are removed.</td></tr></table>
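The core idea of this step, dropping alignment columns that contain only gaps, can be sketched as follows. This is a simplified illustration: ``filter_alignment.py`` has further options (such as a lane mask and gap-fraction thresholds) that are not modeled here, and the function name, gap characters, and toy alignment are assumptions.

```python
def drop_all_gap_columns(aligned, gap_chars="-."):
    """Remove the columns in which every aligned sequence has a gap."""
    names = list(aligned)
    if not names:
        return {}
    length = len(aligned[names[0]])
    keep = [
        i for i in range(length)
        if any(aligned[name][i] not in gap_chars for name in names)
    ]
    return {name: "".join(aligned[name][i] for i in keep) for name in names}


# Made-up alignment: columns 3 and 5 (1-based) contain only gaps.
toy_alignment = {
    "otu_a": "AC-G-",
    "otu_b": "A--G-",
    "otu_c": "-C-T-",
}
filtered = drop_all_gap_columns(toy_alignment)
for name in sorted(filtered):
    print(name + " " + filtered[name])
```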

In [None]:
!filter_alignment.py -o otus/pynast_aligned_seqs -i otus/pynast_aligned_seqs/rep_set_aligned.fasta 

## Build phylogenetic tree
Builds a phylogenetic tree from the filtered alignment.
### Documentation
http://qiime.org/scripts/make_phylogeny.html
### Output Files
<table>
<tr><td>Output File</td><td>Description</td></tr>
<tr><td>rep_set.tre</td><td>OTU phylogenetic tree	</td></tr>
</table>

In [None]:
!make_phylogeny.py -i otus/pynast_aligned_seqs/rep_set_aligned_pfiltered.fasta -o otus/rep_set.tre

To compute some summary statistics of the OTU table we can run the following command.

In [None]:
!biom summarize-table -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom

The key piece of information you need to pull from this output is the depth of sequencing that should be used in diversity analyses. Many of the analyses that follow require that there are an equal number of sequences in each sample, so you need to review the *Counts/sample detail* and decide what depth you'd like. Any samples that don't have at least that many sequences will not be included in the analyses, so this is always a trade-off between the number of sequences you throw away and the number of samples you throw away. For some perspective on this, see [Kuczynski 2010](http://www.ncbi.nlm.nih.gov/pubmed/20441597).
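The trade-off can be made concrete with a small sketch that, for a candidate depth, reports how many samples survive and how many sequences are used or discarded. The per-sample counts below are made up for illustration (they are not the counts of this data set), and ``depth_tradeoff`` is a hypothetical helper.

```python
def depth_tradeoff(counts_per_sample, depth):
    """Return (samples kept, sequences used, sequences discarded) at a depth."""
    kept = [s for s, n in counts_per_sample.items() if n >= depth]
    used = depth * len(kept)  # every kept sample contributes exactly `depth`
    total = sum(counts_per_sample.values())
    return len(kept), used, total - used


# Made-up sequence counts per sample.
toy_counts = {"A": 500, "B": 1200, "C": 3000, "D": 9000}

for depth in (500, 1000, 3000):
    kept, used, discarded = depth_tradeoff(toy_counts, depth)
    print("depth %d: %d samples kept, %d sequences used, %d discarded"
          % (depth, kept, used, discarded))
```

In this toy example a low depth keeps every sample but discards most sequences, while a high depth uses more sequences per sample but drops samples.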

## Run diversity analyses

Here we're running the ``core_diversity_analyses.py`` script which applies many of the "first-pass" diversity analyses that users are generally interested in. The main output that users will interact with is the ``index.html`` file, which provides links into the different analysis results.

Note that in this step we're passing ``-e`` which is the sampling depth that should be used for diversity analyses. I chose 1114 here, based on reviewing the above output from ``biom summarize-table``. This value will be study-specific, so don't just use this value on your own data (though it's fine to use that value for this tutorial).
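Sampling every sample down to the same depth can be sketched as drawing sequences at random without replacement. This is a toy illustration with made-up counts; the function name is hypothetical, drawing without replacement is an assumption, and QIIME's own implementation may differ in its details.

```python
import random


def rarefy(otu_counts, depth, seed=0):
    """Randomly subsample one sample's OTU counts down to `depth` sequences.

    Returns None when the sample has fewer than `depth` sequences, mirroring
    the fact that such samples are left out of even-depth analyses.
    """
    # Expand the counts into one entry per observed sequence.
    pool = [otu for otu, n in sorted(otu_counts.items()) for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    subsample = {}
    for otu in rng.sample(pool, depth):
        subsample[otu] = subsample.get(otu, 0) + 1
    return subsample


# Made-up OTU counts for a single sample (10 sequences in total).
sample_counts = {"otu1": 5, "otu2": 3, "otu3": 2}
rarefied = rarefy(sample_counts, 4)
print(rarefied)
print(rarefy(sample_counts, 11))
```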

**The commands in this section (combined) can take about 15 minutes to complete.**

**You may see a RuntimeWarning generated by this command.** As the warning indicates, it's not something that you should be concerned about in this case. QIIME (and [scikit-bio](http://www.scikit-bio.org), which implements a lot of QIIME's core functionality) will *sometimes* provide these types of warnings to help you figure out if your analyses are valid, but you should always be thinking about whether a particular test or analysis is relevant for your data. Just because something can be passed as input to a QIIME script, doesn't necessarily mean that the analysis it performs is appropriate.

In [None]:
!core_diversity_analyses.py -o cdout/ -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m map.tsv -t otus/rep_set.tre -e 1114

Next open the ``index.html`` file in the resulting directory. This will link you into the different results.

The results above treat all samples independently, but sometimes (for example, in the taxonomic summaries) it's useful to categorize samples by their metadata. We can do this by passing categories (i.e., headers from our mapping file) to ``core_diversity_analyses.py`` with the ``-c`` parameter. Because ``core_diversity_analyses.py`` can take a long time to run, it has a ``--recover_from_failure`` option, which can allow it to be rerun from a point where it previously failed in some cases (for example, if you accidentally turned your computer off while it was running). This option can also be used to add categorical analyses if you didn't include them in your initial run. Next we'll rerun ``core_diversity_analyses.py`` with two sets of categorical analyses: one for the ``SampleType`` category, and one for the ``DaysSinceExperimentStart`` category. Remember the ``--recover_from_failure`` option: it can save you a lot of time.

In [None]:
!core_diversity_analyses.py -o cdout/ --recover_from_failure -c "SampleType,DaysSinceExperimentStart" -i otus/otu_table_mc2_w_tax_no_pynast_failures.biom -m map.tsv -t otus/rep_set.tre -e 1114

One thing you may notice in the PCoA plots generated by ``core_diversity_analyses.py`` is that the samples don't cluster perfectly by ``SampleType``. This is unexpected, based on what we know about the human microbiome. Since this is a time series, let's explore this in a little more detail by integrating a time axis into our PCoA plots. We can do this by re-running Emperor directly, replacing our previously generated PCoA plots. ([Emperor](http://biocore.github.io/emperor/) is a tool for the visualization of PCoA plots with many advanced features that you can explore in the [Emperor tutorial](http://biocore.github.io/emperor/tutorial_index.html). If you use Emperor in your research you should be sure to [cite it](http://www.ncbi.nlm.nih.gov/pubmed/24280061) directly, as with the other tools that QIIME wraps, such as [uclust](http://www.ncbi.nlm.nih.gov/pubmed/20709691) and [RDPClassifier](http://www.ncbi.nlm.nih.gov/pubmed/17586664).)

After this runs, you can reload the Emperor plots that you accessed from the above ``cdout/index.html`` links. Try making the samples taken during ``AntibioticUsage`` invisible. 

In [None]:
!make_emperor.py -i cdout/bdiv_even1114/weighted_unifrac_pc.txt -o cdout/bdiv_even1114/weighted_unifrac_emperor_pcoa_plot -m map.tsv --custom_axes DaysSinceExperimentStart 
!make_emperor.py -i cdout/bdiv_even1114/unweighted_unifrac_pc.txt -o cdout/bdiv_even1114/unweighted_unifrac_emperor_pcoa_plot -m map.tsv --custom_axes DaysSinceExperimentStart 

**IMPORTANT**: Removing points from a PCoA plot, as is suggested above for data exploration purposes, is not the same as computing PCoA without those points. If after running this, you'd like to remove the samples taken during ``AntibioticUsage`` from the analysis, you can do this with ``filter_samples_from_otu_table.py``, which is discussed [here](http://qiime.org/tutorials/metadata_description.html). As an exercise, try removing the samples taken during ``AntibioticUsage`` from the OTU table and re-running ``core_diversity_analyses.py``. You should output the results to a different directory than you created above (e.g., ``cdout_no_abx``).