# 07/04/2020

Source: https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#taxonomic-analysis

1. Taxonomic analysis   

Source: https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#generate-a-tree-for-phylogenetic-diversity-analyses

2. Phylogenetic analysis  

Source: https://docs.qiime2.org/2020.2/tutorials/moving-pictures/#alpha-rarefaction-plotting

3. Alpha rarefaction 

In [8]:
cd /xdisk/tfaily/mig2020/extra/nathaliagg/sulfate_experiment/microbial_16S/qiime2

### Load module and activate conda environment

Disregard warnings.

In [9]:
module load anaconda/2020/2020.02

In [10]:
source activate qiime2-2020.2


### 1. Taxonomic assignment

This analysis is performed on `rep-seqs.qza`, the `FeatureData[Sequence]` QIIME 2 artifact.

Here, a pre-trained Naïve Bayes classifier and the `q2-feature-classifier` plugin are used. The classifier was trained on the Greengenes 13_8 99% OTUs, with the sequences trimmed to the 250 bases of the 16S region sequenced in this analysis (the V4 region, bounded by the 515F/806R primer pair). Taxonomic classifiers perform best when they are trained on the specific sample preparation and sequencing parameters, including the primers used for amplification and the length of the sequencing reads.

#### Download the classifier

In [11]:
wget \
  -O "gg-13-8-99-515-806-nb-classifier.qza" \
  "https://data.qiime2.org/2020.2/common/gg-13-8-99-515-806-nb-classifier.qza"

--2020-07-04 12:09:24--  https://data.qiime2.org/2020.2/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving data.qiime2.org... 52.35.38.247
Connecting to data.qiime2.org|52.35.38.247|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://s3-us-west-2.amazonaws.com/qiime2-data/2020.2/common/gg-13-8-99-515-806-nb-classifier.qza [following]
--2020-07-04 12:09:25--  https://s3-us-west-2.amazonaws.com/qiime2-data/2020.2/common/gg-13-8-99-515-806-nb-classifier.qza
Resolving s3-us-west-2.amazonaws.com... 52.218.180.120
Connecting to s3-us-west-2.amazonaws.com|52.218.180.120|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28373581 (27M) [application/x-www-form-urlencoded]
Saving to: “gg-13-8-99-515-806-nb-classifier.qza”


2020-07-04 12:09:36 (2.51 MB/s) - “gg-13-8-99-515-806-nb-classifier.qza” saved [28373581/28373581]


#### Run the taxonomic analysis

In [12]:
# qiime feature-classifier --help


In [13]:
mkdir -p taxonomic_phylogenetic  # -p makes this safe to re-run

In [14]:
qiime feature-classifier classify-sklearn \
  --i-classifier gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads denoise_dada2/rep-seqs.qza \
  --o-classification taxonomic_phylogenetic/taxonomy.qza

Saved FeatureData[Taxonomy] to: taxonomic_phylogenetic/taxonomy.qza

Output artifact: `taxonomy.qza`.

Next, generate a visualization.

In [15]:
qiime metadata tabulate \
  --m-input-file taxonomic_phylogenetic/taxonomy.qza \
  --o-visualization taxonomic_phylogenetic/taxonomy.qzv

Saved Visualization to: taxonomic_phylogenetic/taxonomy.qzv

Output visualization: `taxonomy.qzv` --> `view.qiime2.org`

Next, generate another visualization for the taxonomic composition of the samples:

In [16]:
qiime taxa barplot \
  --i-table denoise_dada2/table.qza \
  --i-taxonomy taxonomic_phylogenetic/taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxonomic_phylogenetic/taxa-bar-plots.qzv

Saved Visualization to: taxonomic_phylogenetic/taxa-bar-plots.qzv

#### Export taxonomic assignment results

Export the `table.qza` and `taxonomy.qza` so more plots can be generated locally.

This is a two-step process. 

First, export these data using `qiime tools export`, which produces `taxonomy.tsv` for the taxonomy and `feature-table.biom` for the feature table.

In [17]:
qiime tools export \
  --input-path taxonomic_phylogenetic/taxonomy.qza \
  --output-path taxonomic_phylogenetic

Exported taxonomic_phylogenetic/taxonomy.qza as TSVTaxonomyDirectoryFormat to directory taxonomic_phylogenetic
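As a quick local check of the exported taxonomy, the features can be tallied per phylum. The block below is a minimal sketch, assuming the exported file is a tab-separated table with `Feature ID`, `Taxon`, and `Confidence` columns and Greengenes-style `; `-separated ranks. The rows are made-up stand-ins written to a temporary file, not results from this experiment.

```shell
# Hypothetical stand-in for taxonomic_phylogenetic/taxonomy.tsv (made-up rows).
tax_file="$(mktemp)"
printf 'Feature ID\tTaxon\tConfidence\n' > "$tax_file"
printf 'f1\tk__Bacteria; p__Proteobacteria; c__Gammaproteobacteria\t0.99\n' >> "$tax_file"
printf 'f2\tk__Bacteria; p__Firmicutes; c__Bacilli\t0.95\n' >> "$tax_file"
printf 'f3\tk__Bacteria; p__Proteobacteria\t0.80\n' >> "$tax_file"
printf 'f4\tk__Bacteria\t0.70\n' >> "$tax_file"

# Skip the header, take the second "; "-separated rank of the Taxon column,
# and label features that were not resolved to a phylum as "Unassigned".
phylum_counts="$(awk -F'\t' 'NR > 1 {
    n = split($2, ranks, "; ")
    phylum = (n >= 2) ? ranks[2] : "Unassigned"
    count[phylum]++
} END { for (p in count) printf "%s\t%d\n", p, count[p] }' "$tax_file" | LC_ALL=C sort)"

printf '%s\n' "$phylum_counts"
rm -f "$tax_file"
```

To try this on the real export, point `tax_file` at `taxonomic_phylogenetic/taxonomy.tsv` and drop the `printf` and `rm` lines.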

In [18]:
qiime tools export \
  --input-path denoise_dada2/table.qza \
  --output-path taxonomic_phylogenetic

Exported denoise_dada2/table.qza as BIOMV210DirFmt to directory taxonomic_phylogenetic

Second, convert the `.biom` file to `.tsv`. This makes use of `biom convert`, which is available in the QIIME 2 environment.

In [19]:
biom convert -i taxonomic_phylogenetic/feature-table.biom -o taxonomic_phylogenetic/feature-table.tsv --to-tsv

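The TSV written by `biom convert --to-tsv` starts with a comment line (`# Constructed from biom file`) followed by a header row beginning with `#OTU ID`, which some plotting tools treat as a comment. The sketch below shows one way to normalize that header before loading the table locally. The table contents are made-up and written to a temporary file.

```shell
# Hypothetical stand-in for taxonomic_phylogenetic/feature-table.tsv.
biom_tsv="$(mktemp)"
printf '%s\n' '# Constructed from biom file' > "$biom_tsv"
printf '#OTU ID\tS1\tS2\tS3\n' >> "$biom_tsv"
printf 'f1\t12\t0\t3\nf2\t1\t2\t0\nf3\t40\t25\t31\n' >> "$biom_tsv"

# Drop the leading comment line and strip the "#" from the header row.
clean_tsv="$(mktemp)"
tail -n +2 "$biom_tsv" | sed '1s/^#//' > "$clean_tsv"

header="$(head -n 1 "$clean_tsv")"
n_features="$(( $(wc -l < "$clean_tsv") - 1 ))"
printf 'header: %s\nfeatures: %d\n' "$header" "$n_features"

rm -f "$biom_tsv" "$clean_tsv"
```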

Filter along the feature axis to remove low-abundance features from the table. For example, the command below removes all features with a total abundance (summed across all samples) of less than 10.

Then, export and `biom convert` as above.

In [20]:
qiime feature-table filter-features \
  --i-table denoise_dada2/table.qza \
  --p-min-frequency 10 \
  --o-filtered-table taxonomic_phylogenetic/feature-frequency-filtered-table.qza

# Note: this export replaces the unfiltered feature-table.biom exported above
qiime tools export \
  --input-path taxonomic_phylogenetic/feature-frequency-filtered-table.qza \
  --output-path taxonomic_phylogenetic

# Write the filtered table with a .tsv extension
biom convert -i taxonomic_phylogenetic/feature-table.biom -o taxonomic_phylogenetic/feature-frequency-filtered-table.tsv --to-tsv

Saved FeatureTable[Frequency] to: taxonomic_phylogenetic/feature-frequency-filtered-table.qza
Exported taxonomic_phylogenetic/feature-frequency-filtered-table.qza as BIOMV210DirFmt to directory taxonomic_phylogenetic
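To make the filtering rule concrete, the sketch below applies the same criterion (keep a feature only if its count summed across all samples is at least 10) to a small made-up table with `awk`. It illustrates the rule only and is not how `filter-features` is implemented.

```shell
# Made-up feature table: one row per feature, one column per sample.
table="$(mktemp)"
printf 'feature\tS1\tS2\tS3\n' > "$table"
printf 'f1\t12\t0\t3\nf2\t1\t2\t0\nf3\t40\t25\t31\n' >> "$table"

min_frequency=10  # same threshold as --p-min-frequency above

# Sum each feature across samples and keep those at or above the threshold.
kept="$(awk -F'\t' -v min="$min_frequency" 'NR == 1 { next }
{
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    if (total >= min) print $1 "\t" total
}' "$table")"

printf '%s\n' "$kept"
rm -f "$table"
```

Here `f2` has a total of 3 and is dropped, while `f1` (15) and `f3` (96) are kept.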

The per-level files (one per taxonomic level) are stored inside the bar plot `.qzv` generated below; run `unzip -K` on it to reveal them.

In [21]:
qiime taxa barplot \
  --i-table taxonomic_phylogenetic/feature-frequency-filtered-table.qza \
  --i-taxonomy taxonomic_phylogenetic/taxonomy.qza \
  --m-metadata-file metadata.tsv \
  --o-visualization taxonomic_phylogenetic/taxa-bar-plots-filtered.qzv

Saved Visualization to: taxonomic_phylogenetic/taxa-bar-plots-filtered.qzv

### 2. Phylogenetic diversity analyses

QIIME 2 supports several phylogenetic diversity metrics, including Faith’s Phylogenetic Diversity and weighted and unweighted UniFrac. These metrics require a rooted phylogenetic tree relating the features to one another. This information is stored in a `Phylogeny[Rooted]` QIIME 2 artifact. To generate a phylogenetic tree, use the `align-to-tree-mafft-fasttree` pipeline from the `q2-phylogeny` plugin.

Summary of steps:  
1. `mafft` performs a multiple sequence alignment (MSA) of the sequences in `FeatureData[Sequence]` to create a `FeatureData[AlignedSequence]` QIIME 2 artifact.  
2. The alignment is masked (filtered) to remove highly variable positions, which are considered to add noise to a phylogenetic tree.  
3. `FastTree` generates the phylogenetic tree from the masked alignment.  

`FastTree` produces an unrooted tree; the pipeline then applies midpoint rooting to produce the rooted tree required by the diversity metrics.

In [22]:
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences denoise_dada2/rep-seqs.qza \
  --o-alignment taxonomic_phylogenetic/aligned-rep-seqs.qza \
  --o-masked-alignment taxonomic_phylogenetic/masked-aligned-rep-seqs.qza \
  --o-tree taxonomic_phylogenetic/unrooted-tree.qza \
  --o-rooted-tree taxonomic_phylogenetic/rooted-tree.qza

Saved FeatureData[AlignedSequence] to: taxonomic_phylogenetic/aligned-rep-seqs.qza
Saved FeatureData[AlignedSequence] to: taxonomic_phylogenetic/masked-aligned-rep-seqs.qza
Saved Phylogeny[Unrooted] to: taxonomic_phylogenetic/unrooted-tree.qza
Saved Phylogeny[Rooted] to: taxonomic_phylogenetic/rooted-tree.qza

Output artifacts: `aligned-rep-seqs.qza`, `masked-aligned-rep-seqs.qza`, `rooted-tree.qza`, and `unrooted-tree.qza`
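
For reference, the pipeline's steps can also be run one at a time, which is handy when only one stage needs to be repeated. The function below is a sketch, not something that was run in this notebook: the function name `build_tree_stepwise` is hypothetical, and the output file names simply mirror the ones used above. Defining the function runs nothing; the QIIME 2 commands execute only when it is called inside the activated environment.

```shell
# Sketch only: stepwise equivalent of align-to-tree-mafft-fasttree.
# Arguments: $1 = FeatureData[Sequence] artifact, $2 = output directory.
build_tree_stepwise() {
    local seqs="$1" outdir="$2"

    # 1. Multiple sequence alignment with mafft
    qiime alignment mafft \
      --i-sequences "$seqs" \
      --o-alignment "$outdir/aligned-rep-seqs.qza" &&
    # 2. Mask highly variable positions in the alignment
    qiime alignment mask \
      --i-alignment "$outdir/aligned-rep-seqs.qza" \
      --o-masked-alignment "$outdir/masked-aligned-rep-seqs.qza" &&
    # 3. Build the (unrooted) tree with FastTree
    qiime phylogeny fasttree \
      --i-alignment "$outdir/masked-aligned-rep-seqs.qza" \
      --o-tree "$outdir/unrooted-tree.qza" &&
    # 4. Root the tree at the midpoint
    qiime phylogeny midpoint-root \
      --i-tree "$outdir/unrooted-tree.qza" \
      --o-rooted-tree "$outdir/rooted-tree.qza"
}
```

A call such as `build_tree_stepwise denoise_dada2/rep-seqs.qza some_output_dir` (with a hypothetical, already existing `some_output_dir`) would write the same four artifacts as the pipeline.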

### 3. Alpha rarefaction

*Alpha diversity* is within-sample diversity. Rarefaction curves estimate how much of the total diversity is captured as a function of sequencing depth.

This visualizer computes one or more alpha diversity metrics at multiple sampling depths, in steps between 1 (optionally controlled with `--p-min-depth`) and the value provided as `--p-max-depth`. At each sampling depth step, 10 rarefied tables will be generated, and the diversity metrics will be computed for all samples in the tables. The number of iterations (rarefied tables computed at each sampling depth) can be controlled with `--p-iterations`. 
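
To get a feel for what this means for the run below, the following sketch lists the sampling depths and the number of rarefied tables involved. It assumes 10 evenly spaced integer depths between the minimum and maximum; the step count and the even spacing are assumptions, since these notes only state the bounds and the 10 iterations per depth.

```shell
min_depth=1       # lower bound mentioned above (--p-min-depth)
max_depth=10000   # value passed as --p-max-depth in this notebook
steps=10          # ASSUMED number of depth steps (not stated in these notes)
iterations=10     # rarefied tables per depth (--p-iterations)

# Evenly spaced integer depths from min_depth to max_depth, inclusive.
depths=""
for ((i = 0; i < steps; i++)); do
    depth=$(( min_depth + i * (max_depth - min_depth) / (steps - 1) ))
    depths+="${depths:+ }${depth}"
done

total_tables=$(( steps * iterations ))
echo "depths: $depths"
echo "rarefied tables: $total_tables"
```

Under these assumptions the depths are 1, 1112, 2223, and so on up to 10000, and 100 rarefied tables are generated in total.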

Average diversity values are plotted for each sample at each even sampling depth, and samples can be grouped based on metadata in the resulting visualization if sample metadata is provided with the `--m-metadata-file` parameter.

The value for `--p-max-depth` should be determined by reviewing the "Frequency per sample" information presented in the `table.qzv`. In general, choosing a value that is somewhere around the median frequency is recommended. The value can be increased if the lines in the resulting rarefaction plot don’t appear to be leveling out, or decreased if many of the samples are lost due to low total frequencies closer to the minimum sampling depth than the maximum sampling depth.
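
The median frequency per sample can also be estimated from an exported feature table rather than read off the visualization. The sketch below sums each sample column of a small made-up table and takes the median of those totals. The table layout (features in rows, samples in columns, a single header row) is an assumption about the cleaned TSV, and the numbers are illustrative only.

```shell
# Made-up feature table: features in rows, samples in columns.
table="$(mktemp)"
printf 'feature\tS1\tS2\tS3\tS4\tS5\n' > "$table"
printf 'f1\t1200\t300\t5000\t800\t40\n' >> "$table"
printf 'f2\t800\t200\t3000\t700\t10\n' >> "$table"
printf 'f3\t500\t100\t4000\t500\t0\n' >> "$table"

# Total frequency per sample (column sums), sorted numerically.
totals="$(awk -F'\t' 'NR > 1 { nf = NF; for (i = 2; i <= NF; i++) sum[i] += $i }
    END { for (i = 2; i <= nf; i++) print sum[i] }' "$table" | sort -n)"

# Median of the sorted totals (mean of the two middle values if the count is even).
median="$(printf '%s\n' "$totals" | awk '{ v[NR] = $1 }
    END { if (NR % 2) print v[(NR + 1) / 2]; else print (v[NR / 2] + v[NR / 2 + 1]) / 2 }')"

echo "median frequency per sample: $median"
rm -f "$table"
```

With these made-up counts the per-sample totals are 50, 600, 2000, 2500, and 12000, so the median is 2000.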

In [23]:
qiime diversity alpha-rarefaction \
  --i-table denoise_dada2/table.qza \
  --i-phylogeny taxonomic_phylogenetic/rooted-tree.qza \
  --p-max-depth 10000 \
  --m-metadata-file metadata.tsv \
  --o-visualization taxonomic_phylogenetic/alpha-rarefaction.qzv

Saved Visualization to: taxonomic_phylogenetic/alpha-rarefaction.qzv

Output visualization: `alpha-rarefaction.qzv` --> `view.qiime2.org`

The alpha rarefaction visualization contains two plots, which are worth explaining in more detail:

The top plot is an alpha rarefaction plot, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the lines in the plot appear to “level out” (i.e., approach a slope of zero, plateau) at some sampling depth along the x-axis, that suggests that collecting additional sequences beyond that sampling depth would not be likely to result in the observation of additional features. If the lines in a plot don’t level out, this may be because the richness of the samples hasn’t been fully observed/sequenced yet (because too few sequences were collected), or it could be an indicator that a lot of sequencing error remains in the data (which is being mistaken for novel diversity).

The bottom plot in this visualization is important when grouping samples by metadata (use the drop down menu to evaluate different metadata). It illustrates the number of samples that remain in each group when the feature table is rarefied to each sampling depth. If a given sampling depth `d` is larger than the total frequency of a sample `s` (i.e., the number of sequences that were obtained for sample `s`), it is not possible to compute the diversity metric for sample `s` at sampling depth `d`. If many of the samples in a group have lower total frequencies than `d`, the average diversity presented for that group at `d` in the top plot will be unreliable because it will have been computed on relatively few samples. When grouping samples by metadata, it is therefore essential to look at the bottom plot to ensure that the data presented in the top plot is reliable.
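
The quantity shown in the bottom plot can be reproduced by hand: for a given depth `d`, count the samples in each group whose total frequency is at least `d`. The sketch below does this for a made-up list of samples; the group names (`control`, `sulfate`) and totals are hypothetical and are not taken from `metadata.tsv`.

```shell
# Made-up per-sample data: sample ID, metadata group, total frequency.
samples="$(mktemp)"
printf 'S1\tcontrol\t2500\nS2\tcontrol\t600\n' > "$samples"
printf 'S3\tsulfate\t12000\nS4\tsulfate\t2000\nS5\tsulfate\t50\n' >> "$samples"

depth=1000  # sampling depth d to evaluate

# Count, per group, the samples that can still be rarefied to depth d.
# "n[$2] += 0" ensures groups that lose every sample are reported as 0.
retained="$(awk -F'\t' -v d="$depth" '{ n[$2] += 0 } $3 >= d { n[$2]++ }
    END { for (g in n) printf "%s\t%d\n", g, n[g] }' "$samples" | LC_ALL=C sort)"

printf '%s\n' "$retained"
rm -f "$samples"
```

At a depth of 1000, one of the two `control` samples and two of the three `sulfate` samples remain, so the `control` average at that depth would rest on a single sample.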

Done! Next notebook!

In [24]:
source deactivate qiime2-2020.2

