# Perform Alpha Diversity Calculations

This file is rather long but details the methodology and ideas behind some of the measurements performed.

## Alpha and Beta Diversity Review

`core-metrics-phylogenetic` rarefies a `FeatureTable[Frequency]` to a user-specified depth, computes several alpha and beta diversity metrics, and generates PCoA plots using Emperor for each of the beta diversity metrics.

### What is a diversity index?

A diversity index is a mathematical measure of species diversity within a community (or between communities). Diversity indices provide more information about community composition than species richness alone (i.e. the number of species present); they also take into account the relative abundances of the different species.

### Alpha Diversity

Alpha Diversity refers to the average species diversity in a habitat or specific area. Alpha Diversity is a local measure.

#### Shannon's Diversity Index

Shannon's diversity index accounts for both the abundance and the evenness of the species present. The proportion of individuals belonging to species $i$ relative to the total number of individuals ($p_i$) is calculated and multiplied by the natural logarithm of this proportion ($\ln p_i$). The resulting products are summed across all $S$ species in the community, and the sum is multiplied by $-1$:

$$H = - \sum\limits_{i=1}^S p_i \ln p_i $$
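
The formula above can be checked by hand on a small example. The following is a minimal sketch in plain Python, not the code QIIME 2 runs; the function name `shannon_index` and the two example communities are made up for illustration. It uses the natural logarithm as in the formula, and the log base used by a given QIIME 2 release may differ, so values need not match `shannon_vector.qza` exactly.

```python
import math


def shannon_index(counts):
    """Return Shannon's H (natural log) for a list of per-species counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    h = 0.0
    for count in counts:
        if count > 0:  # species with zero counts contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h


# Hypothetical communities: four species, 100 individuals each.
even_community = shannon_index([25, 25, 25, 25])
skewed_community = shannon_index([97, 1, 1, 1])
print(f"even:   {even_community:.3f}")
print(f"skewed: {skewed_community:.3f}")
```

Both communities have the same richness (four species), but the evenly distributed one scores higher, which is the behaviour described above.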

#### Observed OTUs

Operational taxonomic units (OTUs) are groups of sequences that cluster together, and the number of observed OTUs is a count of how many such groups are detected in a sample. Clustering is usually performed at 97% sequence similarity, on the assumption that OTUs defined at this threshold correspond roughly to species. This clustering may fail because:

1. Some species share more than 97% similarity at a given locus.
2. A single species may have paralogs that are < 97% similar, causing the species to be split across OTUs.
3. Some clusters may be spurious due to artifacts.
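
To make the idea of clustering at a similarity threshold concrete, here is a toy sketch of greedy centroid clustering. It is an assumption-laden illustration and not the algorithm of any particular tool: identity is computed position by position on equal-length strings (real tools align sequences first), and the helper names `identity`, `greedy_cluster` and `mutate` are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("this toy identity needs non-empty, equal-length sequences")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


def greedy_cluster(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches at >= threshold."""
    clusters = []  # list of (centroid, members) pairs
    for seq in sequences:
        for centroid, members in clusters:
            if identity(seq, centroid) >= threshold:
                members.append(seq)
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            clusters.append((seq, [seq]))
    return clusters


def mutate(seq, k):
    """Replace the first k bases with 'N' to create exactly k mismatches."""
    return "N" * k + seq[k:]


base = "ACGT" * 25  # hypothetical 100-base read
reads = [base, mutate(base, 2), mutate(base, 10), mutate(base, 11)]
otus = greedy_cluster(reads)
observed_otus = len(otus)
print(observed_otus)
```

The read with 2 mismatches (98% identity) joins the first OTU, while the reads with 10 and 11 mismatches fall below 97% relative to the first centroid and form a second OTU together.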

#### Faith's Phylogenetic Diversity

Faith's Phylogenetic Diversity (PD) incorporates the phylogenetic differences between species. It is calculated as the sum of the lengths of all branches that are members of the minimum spanning path connecting the species in a sample, where a 'branch' is a segment of a cladogram and the minimum spanning path is the minimum distance along the tree between two nodes.
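
The branch-length sum can be sketched on a toy tree. This is a minimal illustration under stated assumptions: the tree `TREE`, its node names and branch lengths are invented, and the sum includes the branches leading back to the root, which is one common convention for rooted trees and may not match every implementation.

```python
# Hypothetical rooted tree: each node maps to (parent, branch length to parent).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}


def faith_pd(tree, observed_tips):
    """Sum the lengths of all branches on the paths from observed tips to the root."""
    visited = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk towards the root, counting each branch only once.
        while node in tree and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            total += length
            node = parent
    return total


pd_ab = faith_pd(TREE, ["A", "B"])
pd_all = faith_pd(TREE, ["A", "B", "C"])
print(pd_ab, pd_all)
```

Adding the distantly related tip `C` increases PD by its full branch length, whereas `A` and `B` share the internal branch above `n1`, which is counted only once.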

#### Evenness

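Evenness describes how similar the abundances of the species in an environment are. A community in which every species is equally abundant is perfectly even.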
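
One common way to put a number on evenness is Pielou's index, $J = H / \ln S$, which divides Shannon's $H$ by its maximum possible value for $S$ observed species. The sketch below assumes this definition; the function name `pielou_evenness` and the example counts are hypothetical, and raising an error for fewer than two species is a choice made here, not a library convention.

```python
import math


def pielou_evenness(counts):
    """Return Pielou's J = H / ln(S) for a list of per-species counts."""
    observed = [c for c in counts if c > 0]
    if len(observed) < 2:
        # ln(1) == 0, so J is undefined for a single species.
        raise ValueError("evenness needs at least two observed species")
    total = sum(observed)
    h = -sum((c / total) * math.log(c / total) for c in observed)
    return h / math.log(len(observed))


even = pielou_evenness([10, 10, 10])
uneven = pielou_evenness([98, 1, 1])
print(f"even:   {even:.3f}")
print(f"uneven: {uneven:.3f}")
```

Values close to 1 indicate that species are present in similar numbers; values close to 0 indicate that one or a few species dominate.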

### Beta Diversity

Beta Diversity refers to the ratio between regional diversity and local (alpha) diversity. It captures how species diversity differs between two habitats or regions.

#### Jaccard Distance

The Jaccard distance uses only the presence or absence of each species and ignores abundance measures.

$$S_j = \frac{A}{A + B + C}$$

where $S_j$ is the Jaccard similarity coefficient, $A$ is the number of shared species, $B$ is the number of species unique to the first sample, and $C$ is the number of species unique to the second sample. The Jaccard distance is then $1 - S_j$.
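
The quantities $A$, $B$ and $C$ map directly onto set operations. The following sketch assumes each sample is a dictionary of species counts; the function name `jaccard_similarity` and the two samples are hypothetical.

```python
def jaccard_similarity(sample_1, sample_2):
    """Return A / (A + B + C) using presence/absence only."""
    present_1 = {sp for sp, count in sample_1.items() if count > 0}
    present_2 = {sp for sp, count in sample_2.items() if count > 0}
    shared = len(present_1 & present_2)     # A
    only_first = len(present_1 - present_2)   # B
    only_second = len(present_2 - present_1)  # C
    denominator = shared + only_first + only_second
    if denominator == 0:
        raise ValueError("both samples are empty")
    return shared / denominator


# Hypothetical samples; zero counts mean the species is absent.
sample_1 = {"sp1": 10, "sp2": 0, "sp3": 5, "sp4": 1}
sample_2 = {"sp1": 1, "sp2": 7, "sp3": 0, "sp5": 2}
similarity = jaccard_similarity(sample_1, sample_2)
distance = 1 - similarity
print(similarity, distance)
```

Here only `sp1` is shared, and each sample has two species of its own, so the similarity is 1/5. The tenfold difference in the abundance of `sp1` has no effect on the result.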

#### Bray-Curtis dissimilarity

The Bray-Curtis dissimilarity quantifies the compositional dissimilarity between two different sites, based on the counts at each site.

$$BC_{ij} = 1 - \frac{2C_{ij}}{S_i + S_j}$$

where $C_{ij}$ is the sum of the lesser counts for only those species found at both sites, and $S_i$ and $S_j$ are the total numbers of specimens counted at sites $i$ and $j$, respectively.
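
A small worked example may help in reading the formula. This sketch assumes each site is a dictionary of species counts; the function name `bray_curtis` and the counts are hypothetical.

```python
def bray_curtis(site_i, site_j):
    """Return 1 - 2*C_ij / (S_i + S_j) for two dictionaries of species counts."""
    species = set(site_i) | set(site_j)
    # C_ij: sum of the lesser count for each species (zero if absent from a site).
    c_ij = sum(min(site_i.get(sp, 0), site_j.get(sp, 0)) for sp in species)
    s_i = sum(site_i.values())
    s_j = sum(site_j.values())
    if s_i + s_j == 0:
        raise ValueError("both sites are empty")
    return 1 - (2 * c_ij) / (s_i + s_j)


# Hypothetical counts for three species at two sites.
site_i = {"a": 6, "b": 7, "c": 4}
site_j = {"a": 10, "b": 0, "c": 6}
bc = bray_curtis(site_i, site_j)
print(f"{bc:.4f}")
```

For these counts $C_{ij} = 6 + 0 + 4 = 10$, $S_i = 17$ and $S_j = 16$, so $BC_{ij} = 1 - 20/33$. Identical sites give 0 and sites with no species in common give 1.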

#### Unweighted UniFrac distance

UniFrac is a distance metric that incorporates information on the relative relatedness of community members by including phylogenetic distances between observed organisms in the computation. Both weighted (quantitative) and unweighted (qualitative) variants are used. The unweighted variant considers only the presence or absence of observed organisms in a sample.
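
Unweighted UniFrac is commonly described as the fraction of the total branch length that is unique to one of the two samples. The sketch below illustrates that definition on a toy tree and is not how production tools compute it; the tree `TREE`, its branch lengths and the helper names are invented for illustration.

```python
# Hypothetical rooted tree: each node maps to (parent, branch length to parent).
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}


def covered_branches(tree, tips):
    """Return the set of branches (named by child node) on tip-to-root paths."""
    covered = set()
    for tip in tips:
        node = tip
        while node in tree and node not in covered:
            covered.add(node)
            node = tree[node][0]
    return covered


def unweighted_unifrac(tree, tips_1, tips_2):
    """Unique branch length divided by total branch length for two samples."""
    branches_1 = covered_branches(tree, tips_1)
    branches_2 = covered_branches(tree, tips_2)
    unique = sum(tree[n][1] for n in branches_1 ^ branches_2)
    total = sum(tree[n][1] for n in branches_1 | branches_2)
    if total == 0:
        raise ValueError("neither sample contains any tip in the tree")
    return unique / total


d = unweighted_unifrac(TREE, ["A", "B"], ["B", "C"])
print(f"{d:.4f}")
```

In this example the branches leading to `A` and `C` are each seen in only one sample (length 4.0 in total) out of 6.5 units of covered branch length, giving a distance of about 0.615.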

#### Weighted UniFrac distance

The weighted UniFrac distance also accounts for the abundance of observed organisms in a sample.

### Sampling Depth

`sampling_depth` (i.e. the rarefaction depth) is an important parameter that must be provided for these scripts. Because most diversity metrics are sensitive to differences in sampling depth across samples, this script randomly subsamples the counts for each sample to the value provided for this parameter. The QIIME 2 documentation recommends making this choice by reviewing `table.qzv` and choosing a value that is as high as possible (so that you retain more sequences per sample) while excluding as few samples as possible.
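
The trade-off described above follows from how rarefaction works: every retained sample is randomly subsampled to exactly the chosen depth, and samples with fewer reads than that depth are excluded. The following is a minimal sketch of that idea, assuming subsampling without replacement; the function name `rarefy`, the `seed` argument and the example sample are hypothetical.

```python
import random


def rarefy(counts, depth, seed=None):
    """Subsample a dict of feature counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads, mirroring the
    exclusion of shallow samples.
    """
    total = sum(counts.values())
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand counts into one entry per read, then draw `depth` of them.
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    rarefied = {}
    for feature in rng.sample(pool, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied


# Hypothetical sample with 30 reads across three features.
sample = {"f1": 20, "f2": 8, "f3": 2}
subsample = rarefy(sample, depth=10, seed=0)
too_shallow = rarefy({"f1": 3}, depth=10)
print(subsample, too_shallow)
```

Rare features such as `f3` can disappear entirely at low depths, which is one reason to choose the depth as high as the shallowest samples you want to keep will allow.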

---

## Split into only AR table

Since we are most interested in the pipe samples (Cast Iron and Cement) we can filter the feature table down to include only those samples. Additionally, we can perform the same filtering to include the pipe-biofilm samples as well.

In [None]:
%%bash
qiime feature-table filter-samples \
    --i-table decontam-taxa-filtered-table.qza \
    --m-metadata-file sample-metadata.tsv \
    --p-where "Sample_Type = 'AR'" \
    --o-filtered-table AR-table.qza

# Create table including ARs and Pipe Biofilm
qiime feature-table filter-samples \
    --i-table decontam-taxa-filtered-table.qza \
    --m-metadata-file sample-metadata.tsv \
    --p-where "Sample_Type = 'AR' OR Sample_Type = 'Pipe Biofilm'" \
    --o-filtered-table biofilm-table.qza

## Filter features only present in one sample

This step is important to perform again so that we can remove possibly spurious features.

In [None]:
%%bash
qiime feature-table filter-features \
    --i-table AR-table.qza \
    --p-min-samples 2 \
    --o-filtered-table AR-filtered-table.qza
    
# and for the biofilm included table
qiime feature-table filter-features \
    --i-table biofilm-table.qza \
    --p-min-samples 2 \
    --o-filtered-table biofilm-filtered-table.qza

### Visualize the new tables

In [None]:
%%bash
qiime feature-table summarize \
    --i-table AR-filtered-table.qza \
    --o-visualization AR-filtered-table.qzv \
    --m-sample-metadata-file sample-metadata.tsv

qiime feature-table summarize \
    --i-table biofilm-filtered-table.qza \
    --o-visualization biofilm-filtered-table.qzv \
    --m-sample-metadata-file sample-metadata.tsv

## Compute Core Diversity Metrics

Run this for both the AR-only table and the AR-plus-biofilm table.

In [None]:
%%bash
qiime diversity core-metrics-phylogenetic \
    --i-table AR-filtered-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-sampling-depth 34585 \
    --m-metadata-file sample-metadata.tsv \
    --output-dir AR-core-metrics-results


qiime diversity core-metrics-phylogenetic \
    --i-table biofilm-filtered-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-sampling-depth 10204 \
    --m-metadata-file sample-metadata.tsv \
    --output-dir biofilm-core-metrics-results

## Create rarefied table for metrics not included in core-metrics

Since Simpson diversity is not in the `core-metrics` pipeline, we must create a separate rarefied table on which to compute the Simpson diversity metric.

In [None]:
%%bash
qiime feature-table rarefy \
    --i-table AR-filtered-table.qza \
    --p-sampling-depth 34585 \
    --o-rarefied-table AR-rarefied-filtered-table.qza
    
# biofilm included
qiime feature-table rarefy \
    --i-table biofilm-filtered-table.qza \
    --p-sampling-depth 10204 \
    --o-rarefied-table biofilm-rarefied-filtered-table.qza

## Calculate Simpson Diversity

Save the results back into the corresponding `core-metrics-results` directories.

In [None]:
%%bash
qiime diversity alpha \
    --i-table AR-rarefied-filtered-table.qza \
    --p-metric simpson \
    --o-alpha-diversity AR-core-metrics-results/simpson_vector.qza

# including biofilm
qiime diversity alpha \
    --i-table biofilm-rarefied-filtered-table.qza \
    --p-metric simpson \
    --o-alpha-diversity biofilm-core-metrics-results/simpson_vector.qza
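
As a point of reference for the metric computed above, Simpson's index is commonly written as $1 - \sum p_i^2$, the probability that two reads drawn at random (with replacement) belong to different features. The sketch below assumes that form; the function name `simpson_index` and the example counts are hypothetical, and the exact variant reported by `--p-metric simpson` should be confirmed against the QIIME 2 documentation.

```python
def simpson_index(counts):
    """Return 1 - sum(p_i^2) for a list of per-feature counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one observation")
    return 1 - sum((count / total) ** 2 for count in counts)


# Hypothetical communities: four features, 100 reads each.
even_simpson = simpson_index([25, 25, 25, 25])
skewed_simpson = simpson_index([97, 1, 1, 1])
print(f"even:   {even_simpson:.4f}")
print(f"skewed: {skewed_simpson:.4f}")
```

Compared with Shannon's index, this form gives more weight to the dominant features, so rare features change the value very little.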

### Visualize - Shannon Diversity

In [None]:
%%bash
qiime diversity alpha-group-significance \
    --i-alpha-diversity AR-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/shannon-group-significance.qzv

# including biofilm
qiime diversity alpha-group-significance \
    --i-alpha-diversity biofilm-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization biofilm-core-metrics-results/shannon-group-significance.qzv

### Visualize - Simpson Diversity

In [None]:
%%bash
qiime diversity alpha-group-significance \
    --i-alpha-diversity AR-core-metrics-results/simpson_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/simpson-group-significance.qzv

# including biofilm
qiime diversity alpha-group-significance \
    --i-alpha-diversity biofilm-core-metrics-results/simpson_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization biofilm-core-metrics-results/simpson-group-significance.qzv

### Visualize - Faith PD

In [None]:
%%bash
qiime diversity alpha-group-significance \
    --i-alpha-diversity AR-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/faith-pd-group-significance.qzv

# including biofilm
qiime diversity alpha-group-significance \
    --i-alpha-diversity biofilm-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization biofilm-core-metrics-results/faith-pd-group-significance.qzv

## Determine whether any numeric metadata column correlates with alpha diversity

In [None]:
%%bash
# faith-pd
qiime diversity alpha-correlation \
    --i-alpha-diversity AR-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/faith-correlation.qzv

# simpson diversity
qiime diversity alpha-correlation \
    --i-alpha-diversity AR-core-metrics-results/simpson_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/simpson-correlation.qzv

# shannon diversity
qiime diversity alpha-correlation \
    --i-alpha-diversity AR-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization AR-core-metrics-results/shannon-correlation.qzv

## Split feature table by pipe to test for pipe-specific correlations

The next cell mirrors the analysis performed above, but on feature tables that include only the cast iron or only the cement pipe samples.

In [None]:
%%bash
# Split into Cast Iron and Cement Tables
qiime feature-table filter-samples \
    --i-table AR-filtered-table.qza \
    --m-metadata-file sample-metadata.tsv \
    --p-where "Pipe_Material = 'Cast Iron'" \
    --o-filtered-table cast-iron-table.qza
    
qiime feature-table filter-samples \
    --i-table AR-filtered-table.qza \
    --m-metadata-file sample-metadata.tsv \
    --p-where "Pipe_Material = 'Cement'" \
    --o-filtered-table cement-table.qza

# Calculate core-metrics for each
qiime diversity core-metrics-phylogenetic \
    --i-table cast-iron-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-sampling-depth 34585 \
    --m-metadata-file sample-metadata.tsv \
    --output-dir cast-iron-core-metrics-results
    
qiime diversity core-metrics-phylogenetic \
    --i-table cement-table.qza \
    --i-phylogeny rooted-tree.qza \
    --p-sampling-depth 34585 \
    --m-metadata-file sample-metadata.tsv \
    --output-dir cement-core-metrics-results

# Perform correlation testing
qiime diversity alpha-correlation \
    --i-alpha-diversity cast-iron-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cast-iron-core-metrics-results/faith-correlation.qzv
    
qiime diversity alpha-correlation \
    --i-alpha-diversity cast-iron-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cast-iron-core-metrics-results/shannon-correlation.qzv
    
qiime diversity alpha-correlation \
    --i-alpha-diversity cement-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cement-core-metrics-results/faith-correlation.qzv
    
qiime diversity alpha-correlation \
    --i-alpha-diversity cement-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cement-core-metrics-results/shannon-correlation.qzv

# Perform significance testing
qiime diversity alpha-group-significance \
    --i-alpha-diversity cast-iron-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cast-iron-core-metrics-results/faith-alpha-group-significance.qzv
    
qiime diversity alpha-group-significance \
    --i-alpha-diversity cast-iron-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cast-iron-core-metrics-results/shannon-alpha-group-significance.qzv
    
qiime diversity alpha-group-significance \
    --i-alpha-diversity cement-core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cement-core-metrics-results/faith-alpha-group-significance.qzv
    
qiime diversity alpha-group-significance \
    --i-alpha-diversity cement-core-metrics-results/shannon_vector.qza \
    --m-metadata-file sample-metadata.tsv \
    --o-visualization cement-core-metrics-results/shannon-alpha-group-significance.qzv

## Hand off the results to R for plotting

See the `plot-alpha-diversity.R` and `plot-alpha-correlation.R` scripts in the /bin/scripts/ directory.

## Alpha Rarefaction Plot of AR samples

To check whether our sampling depth was deep enough, we can create a rarefaction plot across a range of depths. The results below indicate that sampling at a depth of ~34,000 reads was more than sufficient (i.e. the alpha diversity metrics plateau).

In [None]:
%%bash
qiime diversity alpha-rarefaction \
    --i-table AR-table.qza \
    --p-max-depth 80000 \
    --p-metrics simpson \
    --m-metadata-file sample-metadata.tsv \
    --p-min-depth 5000 \
    --o-visualization AR-rarefaction-curve.qzv

# with Biofilm
qiime diversity alpha-rarefaction \
    --i-table biofilm-filtered-table.qza \
    --p-max-depth 80000 \
    --p-metrics simpson \
    --m-metadata-file sample-metadata.tsv \
    --p-min-depth 5000 \
    --o-visualization biofilm-rarefaction-curve.qzv