# Qiime2 tutorial

## Jupyter notebooks

![image.png](attachment:image.png)

Instead of logging in to a server or installing Linux and Qiime on your laptops, we are going to use notebooks in the cloud.

Jupyter is a web app that lets you run code interactively in the cloud.

I prepared the notebooks so that all the tools are already installed.

Note that this is a free service provided by MyBinder, and that some limitations apply:

- At most 1 hour of computation.
- After 10 minutes of inactivity, the notebook disconnects.
- A notebook can stay open for at most 6 hours.

So: save often. If you need to take a break, type the word `pause` in any code cell, and when you are ready, click on the stop button to resume.

Jupyter notebooks are composed of cells, blocks of text like this one, or blocks of code like the following one:

In [None]:
conda activate qiime2-2023.2

In [None]:
echo Today is $(date)

To execute the cell above, either press Ctrl + Enter, or click the "Play" button in the top bar.

!image of the top bar

Text cells have a white background and code cells are grey.
If you want to add notes to this notebook, select a cell, and then click on the + symbol on the top bar.

!insert image

To switch between text and code cells, use the dropdown menu on the top bar and choose "code" or "markdown".

!insert image

The panel to the left shows the files of this tutorial. If you double click any of them, a new tab will open showing its content.

!insert image

Let's start with the tutorial!

## QIIME2

Qiime2 is a set of tools to analyze metabarcoding data from bacterial communities, from the raw reads that a genome sequencer produces to tables and plots ready for publication.

The dataset that we are going to analyze is called "Moving Pictures", from [Caporaso et al. (2011)](https://www.ncbi.nlm.nih.gov/pubmed/21624126).

It contains microbiome samples from:
- 2 individuals
- 4 different body sites
- 5 different time points
- before and after the application of antibiotics

The data to analyze is composed of three files:
- Sample metadata
- Barcodes
- Sequences

Sample metadata is the information that you know about your samples:
- The name you have given them.
- Where and when they were taken.
- Under what conditions.

In [None]:
# Add this to the repo
wget \
  --continue \
  --output-document "sample-metadata.tsv" \
  "https://data.qiime2.org/2023.2/tutorials/moving-pictures/sample_metadata.tsv"

These samples are organized in a table called `sample-metadata.tsv`. You can open it by double-clicking it on the left panel. This is its content:

| sample-id | barcode-sequence | body-site   | year    | month   | day     | subject     | reported-antibiotic-usage | days-since-experiment-start |
|-----------|------------------|-------------|---------|---------|---------|-------------|---------------------------|-----------------------------|
| #q2:types | categorical      | categorical | numeric | numeric | numeric | categorical | categorical               | numeric                     |
| L1S8      | AGCTGACTAGTC     | gut         | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L1S57     | ACACACTATGGC     | gut         | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L1S76     | ACTACGTGTGGT     | gut         | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L1S105    | AGTGCGATGCGT     | gut         | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L2S155    | ACGATGCGACCA     | left palm   | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L2S175    | AGCTATCCACGA     | left palm   | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L2S204    | ATGCAGCTCAGT     | left palm   | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L2S222    | CACGTGACATGT     | left palm   | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L3S242    | ACAGTTGCGCGA     | right palm  | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L3S294    | CACGACAGGCTA     | right palm  | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L3S313    | AGTGTCACGGTG     | right palm  | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L3S341    | CAAGTGAGAGAG     | right palm  | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L3S360    | CATCGTATCAAC     | right palm  | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L5S104    | CAGTGTCAGGAC     | tongue      | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L5S155    | ATCTTAGACTGC     | tongue      | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L5S174    | CAGACATTGCGT     | tongue      | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L5S203    | CGATGCACCAGA     | tongue      | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L5S222    | CTAGAGACTCTT     | tongue      | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L1S140    | ATGGCAGCTCTA     | gut         | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L1S208    | CTGAGATACGCG     | gut         | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L1S257    | CCGACTGAGATG     | gut         | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L1S281    | CCTCTCGTGATC     | gut         | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L2S240    | CATATCGCAGTT     | left palm   | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L2S309    | CGTGCATTATCA     | left palm   | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L2S357    | CTAACGCAGTCA     | left palm   | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L2S382    | CTCAATGACTCA     | left palm   | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L3S378    | ATCGATCTGTGG     | right palm  | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L4S63     | CTCGTGGAGTAG     | right palm  | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L4S112    | GCGTTACACACA     | right palm  | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L4S137    | GAACTGTATCTC     | right palm  | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L5S240    | CTGGACTCATAG     | tongue      | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L6S20     | GAGGCTCATCAT     | tongue      | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L6S68     | GATACGTCCTGA     | tongue      | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L6S93     | GATTAGCACTCT     | tongue      | 2009    | 4       | 14      | subject-2   | No                        | 168                         |


It contains data about:
- 34 samples
- in 9 columns

TSV stands for __tab-separated values__: a tab character separates each value from the next one. Excel is capable of opening and saving this type of file.

The first row is the __header__. It contains the names of the properties that each sample has. The second row, `#q2:types`, says whether each column is categorical or numeric. For example, the first sample:
- Is named L1S8
- Has the barcode AGCTGACTAGTC
- Comes from the gut
- Was taken in 2008

and so on...
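
To get a feeling for how a TSV file is organized, here is a minimal shell sketch. It does not touch the real `sample-metadata.tsv`: it writes a toy copy of its first rows (only four columns) to a temporary folder, and the file name `toy-metadata.tsv` is made up for the example.

```shell
# Toy copy of the first rows of the metadata table (tab-separated),
# written to a temporary folder so the real file is left untouched.
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-metadata.tsv"
printf 'sample-id\tbarcode-sequence\tbody-site\tyear\n'  > "$toy"
printf '#q2:types\tcategorical\tcategorical\tnumeric\n' >> "$toy"
printf 'L1S8\tAGCTGACTAGTC\tgut\t2008\n'                >> "$toy"
printf 'L1S57\tACACACTATGGC\tgut\t2009\n'               >> "$toy"

# Number of columns: count the tab-separated fields of the header
n_columns=$(head -n 1 "$toy" | awk -F '\t' '{ print NF }')

# Number of samples: every row except the header and the #q2:types row
n_samples=$(awk -F '\t' 'NR > 1 && $1 !~ /^#/ { n++ } END { print n + 0 }' "$toy")

echo "columns: $n_columns, samples: $n_samples"

# Show a single column by its position (the third one is body-site)
cut -f 3 "$toy"
```

Replacing `"$toy"` with `sample-metadata.tsv` should give the numbers for the real table.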


It is very important to add as many variables as possible:
- Day that you took the sample
- Day that it was processed in the lab
- Day that it was processed for sequencing
- Geographical coordinates of the sampling sites
- pH / salinity of the samples
- Chemical composition of the sample (Fe, N, CO2, etc).
- Purification / extraction protocol (if more than one is used).

The next file contains the barcodes used (`data/barcodes.fastq.gz`).

Each sample is assigned one short sequence that identifies it. Think about it as the molecular ID card number of the sample.

This file contains 302,581 barcode reads, one for each sequenced 16S fragment.

In [4]:
gzip -dc data/barcodes.fastq.gz | paste - - - - | head -10


For example, the barcode of the first sequence of the experiment is `ATGCAGCTCAGT`, which belongs to sample `L2S204`.
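
Looking up which sample a barcode belongs to can also be done from the command line. The following is a small sketch that uses a toy table with three rows copied from the metadata; the file name `toy-metadata.tsv` and the temporary folder are made up for the example.

```shell
# Toy metadata table: sample-id, barcode-sequence, body-site
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-metadata.tsv"
printf 'sample-id\tbarcode-sequence\tbody-site\n'  > "$toy"
printf 'L1S8\tAGCTGACTAGTC\tgut\n'                >> "$toy"
printf 'L2S204\tATGCAGCTCAGT\tleft palm\n'        >> "$toy"
printf 'L5S104\tCAGTGTCAGGAC\ttongue\n'           >> "$toy"

# Print the sample-id (column 1) of the row whose barcode (column 2) matches
barcode="ATGCAGCTCAGT"
sample=$(awk -F '\t' -v bc="$barcode" '$2 == bc { print $1 }' "$toy")

echo "$barcode belongs to ${sample:-no sample in the table}"
```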

Use Ctrl + F to find the following barcodes:
- CCCCTCAGCGGC
- GACGAGTCAGTC

Solution:
<Details> 
They are not in the table.
    
Sequencing machines often make mistakes, and some barcodes belong to samples that are not part of this experiment.
    
Either:
- use a program to find the closest barcode in the table, or
- throw away the sequences
</Details>
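
As an illustration of the first option, the sketch below finds the barcode in a table that differs from an observed barcode in the fewest positions. It is only a toy: the observed barcode is hypothetical (the barcode of L2S204 with its last base changed), the file name is made up, and this is not necessarily how Qiime corrects barcodes internally.

```shell
# Toy table: sample-id and barcode (three real rows of the metadata)
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-barcodes.tsv"
printf 'L1S8\tAGCTGACTAGTC\n'    > "$toy"
printf 'L2S204\tATGCAGCTCAGT\n' >> "$toy"
printf 'L5S104\tCAGTGTCAGGAC\n' >> "$toy"

# Hypothetical barcode read by the sequencer, with one wrong base at the end
observed="ATGCAGCTCAGA"

# For every barcode in the table, count the positions that differ from the
# observed one, and remember the sample with the fewest differences.
closest=$(awk -F '\t' -v obs="$observed" '
  {
    dist = 0
    for (i = 1; i <= length(obs); i++)
      if (substr(obs, i, 1) != substr($2, i, 1)) dist++
    if (NR == 1 || dist < best) { best = dist; id = $1 }
  }
  END { print id, best }
' "$toy")

echo "closest sample and number of mismatches: $closest"
```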




In [None]:
wget \
    --continue \
    --output-document "data/sequences.fastq.gz" \
    "https://data.qiime2.org/2023.2/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"

And now the file with the 16S fragments:

In [None]:
gzip -dc data/sequences.fastq.gz | paste - - - - | head -5

As you can see, sequences can be pretty long, and the file goes on and on: it contains more than 300,000 sequences. Real datasets can contain hundreds of millions of sequences.
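
The number of sequences in a FASTQ file can be checked without opening it, because every sequence takes exactly four lines (name, bases, a `+` line and qualities). The sketch below shows the idea on a made-up file with two reads, `toy.fastq.gz`, created in a temporary folder.

```shell
# Two made-up reads; every FASTQ record takes exactly four lines
tmp_dir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'TTGA' '+' 'IIII' | gzip > "$tmp_dir/toy.fastq.gz"

# Count the lines and divide by four to get the number of reads
n_lines=$(gzip -dc "$tmp_dir/toy.fastq.gz" | awk 'END { print NR }')
n_reads=$((n_lines / 4))
echo "reads: $n_reads"

# "paste - - - -" joins each group of four lines into one tab-separated row,
# so that every read fits on a single line; keep only name and bases
gzip -dc "$tmp_dir/toy.fastq.gz" | paste - - - - | cut -f 1,2
```

The same two pipelines should work on `data/sequences.fastq.gz`, although they take longer there.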

Processing this data by hand, or with a word processor and spreadsheets, is impossible. You need tools tailored for this purpose, like Qiime2.

Qiime is a toolbox that contains almost all the programs you need to process metabarcoding data. To see what it can do, just type `qiime` in a code cell:

In [None]:
qiime

The tools we are going to focus on in this tutorial are:
- tools
- demux
- dada2
- feature-table
- phylogeny
- diversity
- emperor
- feature-classifier
- taxa
- composition

To see how any of them work, type `qiime [TOOL] [SUBTOOL] --help`, and it will show how to use it:

In [None]:
qiime demux --help

In [None]:
qiime taxa barplot --help

The tools included in Qiime2 follow the same pattern:
```
qiime command subcommand \
  --i-input      files \
  --m-metadata   metadata-file.tsv \
  --p-parameters parameter1 \
  --o-output     output-file
```

All commands transform input files into output files:
- **I**nput files are specified with `--i-*`
- **M**etadata is specified with `--m-*`
- **P**arameters that control how the command runs are specified with `--p-*`
- **O**utput files are specified with `--o-*`

As a rule of thumb, use that order: input, metadata, parameters, output. This will make the code cleaner and easier for you to follow.

The next picture is the map of a typical analysis in Qiime. Drag and drop it to another tab in your browser if you need to look at it later on.

<img src="assets/img/qiime_map.svg"  width="1200" height="600">

## 1. Importing sequences

The first step is to tell Qiime where your sequences are. To do so, use the `qiime tools import` command:

In [None]:
qiime tools import --help

To import the sequences execute the following:

In [None]:
qiime tools import \
  --input-path  data `# the folder with the data` \
  --type        EMPSingleEndSequences `# the type of sequences` \
  --output-path sequences.qza `# the file in which to store it`

We can see the content with `peek`:

In [None]:
qiime tools peek sequences.qza

It is not very useful, but it says that it contains sequences of the type EMP Single End.

Double-clicking `sequences.qza` does not work: the file is too big to inspect by hand. To look at the data we generate reports instead, the ".qz**v**" files.

Note: Qiime generates two types of files:
- qz**a** - artifacts: the data (sequences, tables, trees)
- qz**v** - visualizations: plots and reports

The artifacts are for the programs, the visualizations are for us.

## 2. Demux: demultiplex sequences

Demultiplexing consists of grouping the 16S sequences according to the sample they belong to, using the barcode that was read together with each sequence.

To do so, we will use `qiime demux`:

In [None]:
qiime demux --help

Since our dataset is of the type `emp-single`, we will use that subcommand:

In [None]:
qiime demux emp-single --help

We have to specify:
- The input qza
- The metadata file that contains the barcodes
- The exact column that holds them
- Where to store the demultiplexed sequences
- Where to store the error-correction details of the procedure

In [None]:
qiime demux emp-single \
    --i-seqs                     sequences.qza \
    --m-barcodes-file            sample-metadata.tsv \
    --m-barcodes-column          barcode-sequence \
    --o-per-sample-sequences     demux.qza \
    --o-error-correction-details demux-details.qza
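
Conceptually, demultiplexing boils down to a lookup: read the barcode of each read, find the sample it belongs to, and label the 16S fragment with that sample. The sketch below shows the idea with three made-up reads and two real barcodes from the metadata. All file names are hypothetical, it assumes that barcodes and sequences are stored in the same order, and it ignores barcode errors, so it is an illustration rather than what `qiime demux emp-single` does internally.

```shell
tmp_dir=$(mktemp -d)

# Sample -> barcode table (two real rows of the metadata table)
printf 'L1S8\tAGCTGACTAGTC\nL2S204\tATGCAGCTCAGT\n' > "$tmp_dir/samples.tsv"

# Three made-up reads: barcodes and 16S fragments, stored in the same order
printf '%s\n' \
  '@r1' 'ATGCAGCTCAGT' '+' 'IIIIIIIIIIII' \
  '@r2' 'AGCTGACTAGTC' '+' 'IIIIIIIIIIII' \
  '@r3' 'ATGCAGCTCAGT' '+' 'IIIIIIIIIIII' > "$tmp_dir/barcodes.fastq"
printf '%s\n' \
  '@r1' 'TACGGAGG' '+' 'IIIIIIII' \
  '@r2' 'TACGTAGG' '+' 'IIIIIIII' \
  '@r3' 'TACGGAGG' '+' 'IIIIIIII' > "$tmp_dir/sequences.fastq"

# One read per row: columns 1-4 come from the barcodes, 5-8 from the sequences
paste - - - - < "$tmp_dir/barcodes.fastq"  > "$tmp_dir/barcodes.tsv"
paste - - - - < "$tmp_dir/sequences.fastq" > "$tmp_dir/sequences.tsv"

# Label each 16S fragment (column 6) with the sample of its barcode (column 2)
demultiplexed=$(paste "$tmp_dir/barcodes.tsv" "$tmp_dir/sequences.tsv" |
  awk -F '\t' '
    NR == FNR    { sample[$2] = $1; next }   # first input: the sample table
    $2 in sample { print sample[$2], $6 }    # second input: the reads
  ' "$tmp_dir/samples.tsv" - | sort)

echo "$demultiplexed"
```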

Let's see what happened in this step. To do so, convert `demux.qza` into `demux.qzv`:

In [None]:
qiime demux summarize \
  --i-data          demux.qza \
  --o-visualization demux.qzv

Download `demux.qzv` and upload it to https://view.qiime2.org

Or click this [link](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Fdemux.qzv)

## 3. Quality Control with DADA2

DADA2 is a program to detect and correct errors in amplicon sequences. The command to run is the following:

In [None]:
qiime dada2 --help

In [None]:
qiime dada2 denoise-single --help

The mandatory parameters are the following:
- **I**nputs:
  - the demultiplexed sequences: `demux.qza`
- **P**arameters:
  - how many bases to keep (`--p-trunc-len`): 120
  - how many bases to trim on the 5' end, i.e. the left (`--p-trim-left`): 0
- **O**utputs:
  - representative sequences: the distinct denoised amplicon sequences
  - table: the table that says how many copies of **each sequence** are in **each sample**
  - denoising stats: how many reads survive each step (stats for the nerds).

Let's compose the command

In [None]:
qiime dada2 denoise-single \
    --i-demultiplexed-seqs       demux.qza \
    --p-trim-left                0 \
    --p-trunc-len                120 \
    --o-representative-sequences rep-seqs.qza \
    --o-table                    table.qza   \
    --o-denoising-stats          dada2-stats.qza

It takes 1-2 minutes.
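
To see what the two parameters of the command do, here is a toy sketch of the trimming step alone. The three reads are made up and the truncation length is 8 instead of 120 to keep the example readable. It only mimics the cutting: reads shorter than the truncation length are discarded, the rest are cut to that length. The actual error correction that DADA2 performs is far more involved and is not shown.

```shell
trim_left=0   # bases removed from the 5' end
trunc_len=8   # position at which reads are cut (the tutorial uses 120)

# Three made-up reads of 10, 5 and 12 bases
trimmed=$(printf 'ACGTACGTAC\nACGTA\nTTGACCGTAGGA\n' |
  awk -v left="$trim_left" -v len="$trunc_len" '
    # keep only reads that are long enough, then cut them
    length($0) >= len { print substr($0, left + 1, len - left) }
  ')

echo "$trimmed"
```

The 5-base read disappears, and the other two come out with exactly 8 bases.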

And now let's see what is inside `dada2-stats.qza`.

In [None]:
qiime metadata tabulate \
  --m-input-file    dada2-stats.qza \
  --o-visualization dada2-stats.qzv

We can visualize the qzv with [the qiime view web](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Fstats-dada2.qzv)

Finally, we can get the feature table (also called OTU table), which says how many copies of each sequence each sample contains:

In [None]:
qiime feature-table summarize \
  --i-table                table.qza \
  --o-visualization        table.qzv \
  --m-sample-metadata-file sample-metadata.tsv
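
A feature table is just a grid of counts. The sketch below builds a made-up one (three features, two samples; the numbers and the file name `toy-table.tsv` are invented) and adds up the counts of each sample, which is one of the figures that the summary reports.

```shell
# Made-up counts: one row per feature, one column per sample
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-table.tsv"
printf 'feature-id\tL1S8\tL2S204\n'  > "$toy"
printf 'feature-A\t10\t0\n'         >> "$toy"
printf 'feature-B\t5\t7\n'          >> "$toy"
printf 'feature-C\t0\t3\n'          >> "$toy"

# Total number of sequences per sample: sum every column except the first
totals=$(awk -F '\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }
          { for (i = 2; i <= NF; i++) total[i] += $i }
  END     { for (i = 2; i <= n; i++) print name[i], total[i] }
' "$toy")

echo "$totals"
```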

And also the one that contains the relationship between each feature and its sequence:

In [None]:
qiime feature-table tabulate-seqs \
  --i-data          rep-seqs.qza \
  --o-visualization rep-seqs.qzv

[table.qzv](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftable.qzv)

## 4. Diversity analysis

We want to know:
- the diversity of each sample (alpha diversity)
- and how different each pair of samples is (beta diversity)
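
To make the two concepts concrete, here is a toy calculation on a made-up table with two samples. It uses two of the simplest measures: the number of observed features as alpha diversity, and the Bray-Curtis dissimilarity as beta diversity. The counts and the file name are invented, and the phylogenetic measures used later in this tutorial need a tree and are not shown.

```shell
# Made-up counts: one row per feature, one column per sample
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-table.tsv"
printf 'feature-id\tsample-1\tsample-2\n'  > "$toy"
printf 'feature-A\t10\t5\n'               >> "$toy"
printf 'feature-B\t0\t5\n'                >> "$toy"
printf 'feature-C\t5\t5\n'                >> "$toy"

# Alpha diversity (observed features): features with a non-zero count per sample
observed=$(awk -F '\t' '
  NR > 1 { if ($2 > 0) a++; if ($3 > 0) b++ }
  END    { print a + 0, b + 0 }
' "$toy")

# Beta diversity (Bray-Curtis): sum of the absolute differences between the
# two samples, divided by the sum of all their counts (0 = identical)
bray_curtis=$(LC_ALL=C awk -F '\t' '
  NR > 1 { d = $2 - $3; if (d < 0) d = -d; diff += d; total += $2 + $3 }
  END    { printf "%.3f\n", diff / total }
' "$toy")

echo "observed features: $observed"
echo "Bray-Curtis:       $bray_curtis"
```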

But before doing these analyses, we need to make all the sequences comparable.

### 4.1 Alignment and tree construction

To do so we are going to:
- align all the sequences
- compute their phylogenetic tree

Thankfully, we can do it in one go with `qiime phylogeny`:

In [None]:
qiime phylogeny --help

There are three methods to align and build the tree:
- MAFFT + FastTree (fastest)  <- the one we will use
- MAFFT + IQTREE
- MAFFT + RAxML (most precise)

Let's see the help to know what we need to give it in order to work:

In [None]:
qiime phylogeny align-to-tree-mafft-fasttree --help

In [None]:
qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences        rep-seqs.qza \
  --o-alignment        diversity-aligned-seqs.qza \
  --o-masked-alignment diversity-masked-aligned-seqs.qza \
  --o-tree             diversity-unrooted-tree.qza \
  --o-rooted-tree      diversity-rooted-tree.qza

### 4.2 Core metrics

The alpha and beta diversities are also computed with a single command, `qiime diversity core-metrics-phylogenetic`:

In [None]:
qiime diversity core-metrics-phylogenetic --help

In [None]:
qiime diversity core-metrics-phylogenetic \
  --i-phylogeny      diversity-rooted-tree.qza \
  --i-table          table.qza \
  --p-sampling-depth 1200 \
  --m-metadata-file  sample-metadata.tsv \
  --output-dir       diversity-core-metrics-results
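
The `--p-sampling-depth` parameter deserves a comment: every sample is randomly subsampled to that many sequences so that samples are comparable, and samples with fewer sequences are left out of the analysis. The sketch below only illustrates the second part, using made-up totals for three hypothetical samples.

```shell
sampling_depth=1200

# Made-up number of sequences per sample (hypothetical values)
kept=$(printf 'sample-1\t4500\nsample-2\t900\nsample-3\t1200\n' |
  awk -F '\t' -v depth="$sampling_depth" '$2 >= depth { print $1 }')

echo "samples kept at depth $sampling_depth:"
echo "$kept"
```

Here `sample-2` would be dropped, which is why the depth is usually chosen after looking at the per-sample totals in `table.qzv`.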

The results appear in the `diversity-core-metrics-results` folder:

In [None]:
ls -l diversity-core-metrics-results

### 4.3 Alpha diversity

In [5]:
# Faith
qiime diversity alpha-group-significance \
  --i-alpha-diversity diversity-core-metrics-results/faith_pd_vector.qza \
  --m-metadata-file   sample-metadata.tsv \
  --o-visualization   diversity-core-metrics-results/faith-pd-group-significance.qzv


In [None]:
# evenness
qiime diversity alpha-group-significance \
  --i-alpha-diversity diversity-core-metrics-results/evenness_vector.qza \
  --m-metadata-file   sample-metadata.tsv \
  --o-visualization   diversity-core-metrics-results/evenness-group-significance.qzv

### 4.4 Beta diversity

In [None]:
qiime diversity beta-group-significance \
  --i-distance-matrix diversity-core-metrics-results/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file   sample-metadata.tsv \
  --m-metadata-column body-site \
  --o-visualization   diversity-core-metrics-results/unweighted-unifrac-body-site-significance.qzv \
  --p-pairwise

In [None]:
qiime diversity beta-group-significance \
  --i-distance-matrix diversity-core-metrics-results/unweighted_unifrac_distance_matrix.qza \
  --m-metadata-file   sample-metadata.tsv \
  --m-metadata-column subject \
  --o-visualization   diversity-core-metrics-results/unweighted-unifrac-subject-group-significance.qzv \
  --p-pairwise

In [None]:
qiime emperor plot \
  --i-pcoa          diversity-core-metrics-results/unweighted_unifrac_pcoa_results.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-custom-axes   days-since-experiment-start \
  --o-visualization diversity-core-metrics-results/unweighted-unifrac-emperor-days-since-experiment-start.qzv

In [None]:
qiime emperor plot \
  --i-pcoa          diversity-core-metrics-results/bray_curtis_pcoa_results.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-custom-axes   days-since-experiment-start \
  --o-visualization diversity-core-metrics-results/bray-curtis-emperor-days-since-experiment-start.qzv

## 5. Taxonomy

We have analyzed these bacteria without knowing who they are!

Maybe it is time to know who is behind each one of those 120 bp 16S sequences, don't you think? We are going to do it using machine learning.

First, let's download a classifier that has already been trained on Greengenes, a reference database of 16S sequences:

In [6]:
wget \
    --continue \
    --output-document gg-13-8-99-515-806-nb-classifier.qza \
    https://data.qiime2.org/2023.2/common/gg-13-8-99-515-806-nb-classifier.qza


In [None]:
qiime feature-classifier classify-sklearn \
  --i-classifier     gg-13-8-99-515-806-nb-classifier.qza \
  --i-reads          rep-seqs.qza \
  --o-classification taxonomy.qza

In [None]:
qiime metadata tabulate \
  --m-input-file    taxonomy.qza \
  --o-visualization taxonomy.qzv

[taxonomy.qzv](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftaxonomy.qzv)

In [None]:
qiime taxa barplot \
  --i-table         table.qza \
  --i-taxonomy      taxonomy.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-visualization taxa-bar-plots.qzv

[taxa-bar-plots.qzv](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftaxa-bar-plots.qzv)

## 6. Composition with ANCOM

We may want to know whether the composition of the communities differs between groups of samples, for example between body sites or between subjects. To do so, we can use ANCOM.

The procedure is as follows:
- Filter the table to keep the samples of interest, in this case those whose body-site is the gut:

In [None]:
qiime feature-table filter-samples \
    --i-table          table.qza \
    --m-metadata-file  sample-metadata.tsv \
    --p-where          "[body-site]='gut'" \
    --o-filtered-table gut-table.qza
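
The `--p-where` expression selects samples by looking at the metadata. To check in advance which samples a condition like `[body-site]='gut'` would keep, the same selection can be sketched with `awk` on the metadata file. The toy table below has three rows taken from the metadata and a made-up file name.

```shell
# Toy metadata table: sample-id, body-site, subject
tmp_dir=$(mktemp -d)
toy="$tmp_dir/toy-metadata.tsv"
printf 'sample-id\tbody-site\tsubject\n'   > "$toy"
printf 'L1S8\tgut\tsubject-1\n'           >> "$toy"
printf 'L2S204\tleft palm\tsubject-1\n'   >> "$toy"
printf 'L1S140\tgut\tsubject-2\n'         >> "$toy"

# Keep the sample-ids whose body-site (column 2) is exactly "gut"
gut_samples=$(awk -F '\t' 'NR > 1 && $2 == "gut" { print $1 }' "$toy")

echo "$gut_samples"
```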

Then, add pseudo-counts to the table, because ANCOM cannot work with zeros:

In [None]:
qiime composition add-pseudocount \
    --i-table             gut-table.qza \
    --o-composition-table comp-gut-table.qza
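
Adding a pseudo-count simply means adding a small constant to every cell of the table, so that no zero is left. A toy sketch with a pseudo-count of 1 and made-up counts for two features and two samples:

```shell
# Made-up counts: feature name followed by the counts in two samples
with_pseudocount=$(printf 'feature-A\t10\t0\nfeature-B\t0\t3\n' |
  awk -F '\t' -v OFS='\t' -v pseudo=1 '
    # add the pseudo-count to every count column, then print the row
    { for (i = 2; i <= NF; i++) $i += pseudo; print }
  ')

echo "$with_pseudocount"
```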

Finally, run ANCOM to compare the two subjects:

In [None]:
qiime composition ancom \
    --i-table           comp-gut-table.qza \
    --m-metadata-file   sample-metadata.tsv \
    --m-metadata-column subject \
    --o-visualization   ancom-subject.qzv

[ancom-subject.qzv](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Fancom-subject.qzv)