# Amplicon data analysis - workflow
This page describes the usual steps in an amplicon data analysis after you have obtained an ASV table.

## Quality control
The first step of any analysis is to ensure that the data is of sufficient quality. The four main focus points of such a quality check are described below.

### Check composition
Usually we have prior knowledge of which microbes to expect in a given environment. Check whether you find the expected microbes in your samples, or whether, for example, your fecal samples are filled with common soil-dwelling bacteria. This should be done on a per-sample basis: a few samples may be completely off and should be discarded. You should also create rarefaction curves (see notebook on [Compositionality](https://microucph.github.io/amplicon_data_analysis/html/compositionality.html)) to check whether any samples have too low a read depth to cover the diversity.
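
As a rough illustration of what a rarefaction curve measures, the sketch below subsamples the reads of a single sample at increasing depths and counts how many distinct ASVs are observed. The function name `rarefaction_curve` and the example counts are hypothetical, and only the Python standard library is used; the linked notebook shows the approach used in practice.

```python
import random

def rarefaction_curve(counts, depths, seed=0):
    """Return (depth, observed ASVs) pairs for one sample.

    counts: list of read counts, one entry per ASV (hypothetical input format).
    depths: subsampling depths to evaluate; depths above the total are skipped.
    """
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by ASV index
    reads = [asv for asv, n in enumerate(counts) for _ in range(n)]
    curve = []
    for depth in depths:
        if depth > len(reads):
            continue
        # Subsample without replacement and count distinct ASVs
        subsample = rng.sample(reads, depth)
        curve.append((depth, len(set(subsample))))
    return curve

# Hypothetical sample with 6 ASVs and 1000 reads in total
sample_counts = [500, 300, 150, 40, 8, 2]
curve = rarefaction_curve(sample_counts, depths=[10, 100, 500, 1000])
for depth, richness in curve:
    print(depth, richness)
```

If the curve is still climbing steeply at the full read depth of a sample, that sample has probably not been sequenced deeply enough to cover its diversity.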

### Positive control
For each sequencing run we always include a mock community, which is a sample with known composition. The observed composition of this sample should match the expected composition to a reasonable degree.
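
One possible way to quantify "match to a reasonable degree" is to compare observed and expected relative abundances with the Bray-Curtis dissimilarity (0 means identical, 1 means no shared taxa). This is a minimal sketch: the taxa, counts, and expected proportions are made up, and what counts as an acceptable value is a judgment call for your own mock community.

```python
def relative_abundance(counts):
    """Convert a dict of taxon -> read count into taxon -> proportion."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance dicts."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0.0) - b.get(t, 0.0)) for t in taxa)
    denominator = sum(a.get(t, 0.0) + b.get(t, 0.0) for t in taxa)
    return numerator / denominator

# Hypothetical mock community with four equally abundant genera
expected = {"Escherichia": 0.25, "Bacillus": 0.25,
            "Staphylococcus": 0.25, "Pseudomonas": 0.25}
# Hypothetical read counts observed for the sequenced mock sample
observed_counts = {"Escherichia": 300, "Bacillus": 200,
                   "Staphylococcus": 250, "Pseudomonas": 240,
                   "Ralstonia": 10}

observed = relative_abundance(observed_counts)
dissimilarity = bray_curtis(expected, observed)
# Taxa that should not be in the mock community at all
unexpected = sorted(set(observed) - set(expected))
print(round(dissimilarity, 3), unexpected)
```

Taxa listed in `unexpected` are worth a closer look, since they may point to contamination or misclassification.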

### Negative controls
We always include three different negative controls: water added at the DNA extraction step, water added at the 1st PCR, and water added at the 2nd PCR. In an ideal world these would all be completely devoid of microbes, but lab reagents contain tiny amounts of bacteria or their DNA (also called the [kitome](https://link.springer.com/article/10.1186/s12915-014-0087-z)), which get amplified during PCR. Samples can also be contaminated from the lab environment or from nearby samples when handled on a 96-well plate (cross-contamination or [splashome](https://bmcmicrobiol.biomedcentral.com/articles/10.1186/s12866-020-01839-y)). The composition of the negative controls should differ considerably from that of the actual samples. Contamination is often worst when samples have low biomass; in this case utmost care should be taken during handling in the lab. If there is compositional overlap between actual samples and negative controls, one mitigation is [decontamination](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-018-0605-2) during data analysis, where taxa that are found in negative controls are removed from the dataset across all samples. However, if your actual samples contain microbes that are commonly found in lab reagents, other measures should be taken to ensure that the data is trustworthy.
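
The simplest form of the removal described above can be sketched as follows: every taxon seen in any negative control is dropped from all actual samples. This is only the naive presence-based filter, not the method of any particular decontamination tool; the function name, the `min_control_reads` threshold, and the example table are assumptions made for illustration.

```python
def remove_control_taxa(table, control_ids, min_control_reads=1):
    """Drop taxa found in negative controls from all other samples.

    table: dict of sample id -> dict of taxon -> read count.
    control_ids: ids of the negative control samples in `table`.
    min_control_reads: hypothetical threshold for calling a taxon
        present in a control.
    """
    contaminants = {
        taxon
        for sample_id in control_ids
        for taxon, n in table[sample_id].items()
        if n >= min_control_reads
    }
    cleaned = {
        sample_id: {t: n for t, n in counts.items() if t not in contaminants}
        for sample_id, counts in table.items()
        if sample_id not in control_ids
    }
    return cleaned, contaminants

# Hypothetical ASV table with one negative control
table = {
    "sample_1": {"ASV1": 900, "ASV2": 80, "ASV3": 20},
    "sample_2": {"ASV1": 500, "ASV2": 450, "ASV4": 50},
    "neg_extraction": {"ASV3": 15, "ASV4": 0},
}
cleaned, contaminants = remove_control_taxa(table, {"neg_extraction"})
print(contaminants)
print(cleaned)
```

Note that this filter also removes a genuinely abundant taxon if a few of its reads leak into a control, which is one reason the caveat above about overlapping compositions matters.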

### Technical variation - batch effects
Lastly, check whether the composition correlates with any technical covariates (it shouldn't). Examples of technical covariates are DNA extraction being done on different days, by different people, or with different batches of kit. Similarly, if samples come from different sequencing runs, were stored for different durations, or were otherwise handled differently during the process from sampling to sequencing, make sure that this does not correlate with the microbial composition. Because these technical differences can create variation in the resulting data, it is crucial to randomize samples during lab handling if possible. If there is technical variation, control for it during data analysis or ensure that it doesn't affect the results.
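
Randomizing samples across extraction batches can be as simple as shuffling the sample list before assigning batches, so that the batch does not follow the sampling order or the study groups. The sketch below assumes hypothetical sample ids and a fixed seed so that the layout can be reproduced.

```python
import random

def randomize_batches(sample_ids, n_batches, seed=42):
    """Shuffle samples and deal them into batches of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    batches = {batch: [] for batch in range(1, n_batches + 1)}
    for i, sample_id in enumerate(shuffled):
        # Deal samples round-robin so batch sizes differ by at most one
        batches[i % n_batches + 1].append(sample_id)
    return batches

# Hypothetical sample ids from two study groups
sample_ids = [f"case_{i}" for i in range(1, 6)] + [f"control_{i}" for i in range(1, 6)]
batches = randomize_batches(sample_ids, n_batches=3)
for batch, members in batches.items():
    print(batch, members)
```

With few samples per group, a plain shuffle can still produce unbalanced batches by chance, so it is worth inspecting the layout before going to the lab.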

## Rarefaction or not
After quality control, one would usually decide on a strategy for normalization and for handling read-depth bias. Either rarefy the data and use the rarefied dataset throughout, or rely on other ways of controlling for read-depth bias. See details in the notebook on [compositionality](https://microucph.github.io/amplicon_data_analysis/html/compositionality.html).
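
For readers unfamiliar with the term, rarefying means randomly subsampling every sample, without replacement, to the same number of reads. A minimal sketch with the standard library is shown below; the function name, the choice to return `None` for samples that are too shallow, and the example counts are assumptions for illustration.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample one sample to `depth` reads without replacement.

    counts: dict of taxon -> read count.
    Returns a new dict of counts, or None if the sample has fewer
    reads than `depth` (such samples are usually discarded).
    """
    if sum(counts.values()) < depth:
        return None
    rng = random.Random(seed)
    # One entry per read, labelled by taxon
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    rarefied = {}
    for taxon in rng.sample(reads, depth):
        rarefied[taxon] = rarefied.get(taxon, 0) + 1
    return rarefied

# Hypothetical samples with very different read depths
deep = {"ASV1": 6000, "ASV2": 3000, "ASV3": 1000}
shallow = {"ASV1": 30, "ASV2": 20}
rarefied_deep = rarefy(deep, depth=1000)
rarefied_shallow = rarefy(shallow, depth=1000)
print(rarefied_deep, rarefied_shallow)
```

Rare taxa can disappear entirely after subsampling, which is part of the trade-off discussed in the compositionality notebook.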

## Analysis
There is no one-size-fits-all analysis. The choices depend entirely on the aim and hypotheses of your study:
* To test for differences in diversity between samples, use [alpha diversity](https://microucph.github.io/amplicon_data_analysis/html/alpha.html).
* To test for differences in overall composition between samples, use [beta diversity](https://microucph.github.io/amplicon_data_analysis/html/beta.html).
* If you hypothesize that specific microbes have different abundances across your samples, use [differential abundance](https://microucph.github.io/amplicon_data_analysis/html/da.html).
* To visualize which microbes are present in your samples, use [pretty heatmaps](https://microucph.github.io/amplicon_data_analysis/html/pheatmap.html), relative abundance bar charts, or [phylogenetic trees](https://itol.embl.de/).
* To identify a core microbiome or microbes specific to groups of samples, make a [venn diagram](https://microucph.github.io/amplicon_data_analysis/html/venn.html).
* To predict the origin of samples based on the microbes present, use [supervised machine learning](https://microucph.github.io/amplicon_data_analysis/html/superlearn.html).
* If you think your data separates into distinct groups, use [clustering](https://microucph.github.io/amplicon_data_analysis/html/cluster.html).
* To explore how microbes co-vary with each other, use [microbial association networks](https://microucph.github.io/amplicon_data_analysis/html/network.html).
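
As a small taste of the first item in the list above, the Shannon index is one common alpha diversity measure: $H = -\sum_i p_i \ln p_i$, where $p_i$ is the relative abundance of taxon $i$. The sketch below computes it for two made-up samples; the linked notebooks cover the measures and tests actually recommended.

```python
import math

def shannon(counts):
    """Shannon diversity (natural log) from a list of read counts."""
    total = sum(counts)
    # Taxa with zero reads contribute nothing to the sum
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# Hypothetical samples: one perfectly even, one dominated by a single taxon
even_sample = [25, 25, 25, 25]
dominated_sample = [97, 1, 1, 1]
print(shannon(even_sample), shannon(dominated_sample))
```

The even sample reaches the maximum for four taxa, $\ln 4$, while the dominated sample scores much lower despite having the same richness.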