# Quality control with QIIME2

In [14]:
conda activate qiime2-2023.2


The dataset that we are going to analyze is called "Moving Pictures", from [Caporaso et al. (2011)](https://www.ncbi.nlm.nih.gov/pubmed/21624126).

It contains 16S metabarcoding samples from:
- 2 individuals
- 4 different body sites
- 5 different time points
- before and after the application of antibiotics

The data to analyze is composed of three files:
- Sample metadata
- Barcodes
- Sequences

Sample metadata is the information that you know about your samples:
- The name you have given them.
- Where and when they were taken.
- Under what conditions.

This information is organized in a table called `sample-metadata.tsv`. Either click it on the left panel or see it below:

| sample-id | barcode-sequence | body-site   | year    | month   | day     | subject     | reported-antibiotic-usage | days-since-experiment-start |
|-----------|------------------|-------------|---------|---------|---------|-------------|---------------------------|-----------------------------|
| #q2:types | categorical      | categorical | numeric | numeric | numeric | categorical | categorical               | numeric                     |
| L1S8      | AGCTGACTAGTC     | gut         | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L1S57     | ACACACTATGGC     | gut         | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L1S76     | ACTACGTGTGGT     | gut         | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L1S105    | AGTGCGATGCGT     | gut         | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L2S155    | ACGATGCGACCA     | left palm   | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L2S175    | AGCTATCCACGA     | left palm   | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L2S204    | ATGCAGCTCAGT     | left palm   | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L2S222    | CACGTGACATGT     | left palm   | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L3S242    | ACAGTTGCGCGA     | right palm  | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L3S294    | CACGACAGGCTA     | right palm  | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L3S313    | AGTGTCACGGTG     | right palm  | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L3S341    | CAAGTGAGAGAG     | right palm  | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L3S360    | CATCGTATCAAC     | right palm  | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L5S104    | CAGTGTCAGGAC     | tongue      | 2008    | 10      | 28      | subject-1   | Yes                       | 0                           |
| L5S155    | ATCTTAGACTGC     | tongue      | 2009    | 1       | 20      | subject-1   | No                        | 84                          |
| L5S174    | CAGACATTGCGT     | tongue      | 2009    | 2       | 17      | subject-1   | No                        | 112                         |
| L5S203    | CGATGCACCAGA     | tongue      | 2009    | 3       | 17      | subject-1   | No                        | 140                         |
| L5S222    | CTAGAGACTCTT     | tongue      | 2009    | 4       | 14      | subject-1   | No                        | 168                         |
| L1S140    | ATGGCAGCTCTA     | gut         | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L1S208    | CTGAGATACGCG     | gut         | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L1S257    | CCGACTGAGATG     | gut         | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L1S281    | CCTCTCGTGATC     | gut         | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L2S240    | CATATCGCAGTT     | left palm   | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L2S309    | CGTGCATTATCA     | left palm   | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L2S357    | CTAACGCAGTCA     | left palm   | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L2S382    | CTCAATGACTCA     | left palm   | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L3S378    | ATCGATCTGTGG     | right palm  | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L4S63     | CTCGTGGAGTAG     | right palm  | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L4S112    | GCGTTACACACA     | right palm  | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L4S137    | GAACTGTATCTC     | right palm  | 2009    | 4       | 14      | subject-2   | No                        | 168                         |
| L5S240    | CTGGACTCATAG     | tongue      | 2008    | 10      | 28      | subject-2   | Yes                       | 0                           |
| L6S20     | GAGGCTCATCAT     | tongue      | 2009    | 1       | 20      | subject-2   | No                        | 84                          |
| L6S68     | GATACGTCCTGA     | tongue      | 2009    | 3       | 17      | subject-2   | No                        | 140                         |
| L6S93     | GATTAGCACTCT     | tongue      | 2009    | 4       | 14      | subject-2   | No                        | 168                         |


The table describes:
- 34 samples
- with 9 columns each

TSV stands for __tab separated values__: the columns are separated by tab characters. Excel can open and save this type of file.

The first row is the __header__. It contains the names of the properties that each sample has. The second row (`#q2:types`) tells QIIME 2 whether each column is categorical or numeric. For example, the first sample:
- Is named L1S8
- Has the barcode AGCTGACTAGTC
- Comes from the gut
- Was taken in 2008

and so on...
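
If you prefer the command line to Excel, the same counts can be checked with standard tools. The following is a minimal sketch, not part of the original tutorial: it builds a three-sample, three-column excerpt of the table in a temporary folder (the file name `metadata.tsv` and the variable names are only for illustration) and counts columns and samples with `awk`.

```bash
# Build a tiny excerpt of the metadata table: header, types row, three samples.
# Only three of the nine columns are kept here for brevity.
tmpdir=$(mktemp -d)
printf 'sample-id\tbarcode-sequence\tbody-site\n'  >  "$tmpdir/metadata.tsv"
printf '#q2:types\tcategorical\tcategorical\n'     >> "$tmpdir/metadata.tsv"
printf 'L1S8\tAGCTGACTAGTC\tgut\n'                 >> "$tmpdir/metadata.tsv"
printf 'L2S204\tATGCAGCTCAGT\tleft palm\n'         >> "$tmpdir/metadata.tsv"
printf 'L5S104\tCAGTGTCAGGAC\ttongue\n'            >> "$tmpdir/metadata.tsv"

# Columns: number of tab-separated fields in the header row.
n_columns=$(awk -F '\t' 'NR == 1 { print NF }' "$tmpdir/metadata.tsv")

# Samples: every row except the header (row 1) and rows starting with "#".
n_samples=$(awk -F '\t' 'NR > 1 && $1 !~ /^#/ { n++ } END { print n + 0 }' "$tmpdir/metadata.tsv")

echo "columns: $n_columns, samples: $n_samples"   # prints "columns: 3, samples: 3"
```

To check the real table, point the two `awk` commands at `sample-metadata.tsv` instead of the excerpt.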


It is very important to add as many variables as possible:
- Day that you took the sample
- Day that it was processed in the lab
- Day that it was processed for sequencing
- Geographical coordinates of the sampling sites
- pH / salinity of the samples
- Chemical composition of the sample (Fe, N, CO2, etc).
- Purification / extraction protocol (if more than one is used).

The next file contains the barcodes (`data/barcodes.fq.g`).

Each sample is assigned one short sequence that identifies it. Think of it as the molecular ID card number of the sample.

This file contains 302,581 barcode reads, one for each sequenced 16S fragment.

| read_name                            | sequence     | separator | quality      |
|--------------------------------------|--------------|-----------|--------------|
| @HWI-EAS440_0386:1:23:17547:1423#0/1 | ATGCAGCTCAGT | +         | IIIIIIIIIIIH |
| @HWI-EAS440_0386:1:23:14818:1533#0/1 | CCCCTCAGCGGC | +         | DDD@D?@B<<+/ |
| @HWI-EAS440_0386:1:23:14401:1629#0/1 | GACGAGTCAGTC | +         | GGEGDGGGGGDG |
| @HWI-EAS440_0386:1:23:15259:1649#0/1 | AGCAGTCGCGAT | +         | IIIIIIIIIIII |
| @HWI-EAS440_0386:1:23:13748:2482#0/1 | AGCACACCTACA | +         | GGGGBGGEEGGD |
| @HWI-EAS440_0386:1:23:6532:3028#0/1  | GAGAGAATGATC | +         | HIIIIIIIIIII |
| @HWI-EAS440_0386:1:23:8677:3027#0/1  | CACAGTGGACGT | +         | FHHHHHHHHHHH |
| @HWI-EAS440_0386:1:23:5678:3052#0/1  | ATAGCTCCATAC | +         | IIIIIIIIIIII |
| @HWI-EAS440_0386:1:23:11889:3171#0/1 | ACGTTAGCACAC | +         | IIIIIIGIIIII |
| @HWI-EAS440_0386:1:23:2112:3374#0/1  | GAGAGAATGATC | +         | FEEBBCEEEEDG |


For example, the barcode of the first read is `ATGCAGCTCAGT`, which belongs to sample `L2S204`.

Use Ctrl + F to look for the following barcodes in the sample metadata table:
- CCCCTCAGCGGC
- GACGAGTCAGTC

Solution:
<Details> 
They are not in the table.
    
Either the sequencing machine made a mistake (this happens often), or the reads come from samples that are not part of this particular experiment.
    
Either:
- use a program to find the closest barcode in the table, or
- throw away the sequences
</Details>
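
The same lookup can be scripted instead of using Ctrl + F. This is a small sketch under stated assumptions: the two-row table is an excerpt of `sample-metadata.tsv`, and `find_sample` is a hypothetical helper written for this example, not a QIIME 2 command. It only finds exact matches, so it does not replace the error correction that a real demultiplexer performs.

```bash
# Excerpt of the metadata table: sample-id and barcode-sequence only.
tmpdir=$(mktemp -d)
metadata="$tmpdir/metadata.tsv"
printf 'sample-id\tbarcode-sequence\n' >  "$metadata"
printf 'L1S8\tAGCTGACTAGTC\n'          >> "$metadata"
printf 'L2S204\tATGCAGCTCAGT\n'        >> "$metadata"

# Print the sample that owns a barcode, or "unassigned" if no row matches exactly.
find_sample() {
    awk -F '\t' -v bc="$1" '
        $2 == bc { print $1; found = 1 }
        END      { if (!found) print "unassigned" }
    ' "$metadata"
}

find_sample ATGCAGCTCAGT   # barcode of the first read: prints "L2S204"
find_sample CCCCTCAGCGGC   # one of the barcodes from the exercise
```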




And now the file with the 16S fragments:

| sequence_name                        | sequence                                                                                                                                                 | separator | quality                                                                                                                                                  |
|:--------------------------------------|:----------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------|:----------------------------------------------------------------------------------------------------------------------------------------------------------|
| @HWI-EAS440_0386:1:23:17547:1423#0/1 | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGGGGGGATTGGTGTG | +         | IIIE)EEEEEEEEGFIIGIIIHIHHGIIIGIIHHHGIIHGHEGDGIFIGEHGIHHGHHGHHGGHEEGHEGGEHEBBHBBEEDCEDDD>B?BE@@B>@@@@@CB@ABA@@?@@=>?08;3=;==8:5;@6?###################### |
| @HWI-EAS440_0386:1:23:14818:1533#0/1 | CCCCNCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAGCCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGGCACATTAAATTTAAACTGATGATTCCACTGCAACAA | +         | 64<2$24;1)/:*B<?BBDDBBD<>BDD############################################################################################################################ |
| @HWI-EAS440_0386:1:23:14401:1629#0/1 | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGTGTCTTGAGTACAGTAGAGGCAGGGGGGGGGTTGGGGG | +         | GGGC'ACC8;;>;HHHHGHDHHHHHEEHHEHHHHHECHEEEHCHFHHHAGGEHHFBCCBABBBE>>>E=>A>A<>>B8B:B=BBABA@AAAA@?>???>>>9>@AA@@@@AA######################################## |
| @HWI-EAS440_0386:1:23:15259:1649#0/1 | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGTGTCTTGAGTACAGTAGAGGCAGGGGGGAGTTTGGGGG | +         | IIIE)DEE?CCCBIIIIIIGIIIIIHIIIIIIIIIHIIIHFIDIGIIIHHIHIGIIIHFIHBGGBFDGEHHEI=CBGBEEEEEHEEGECD?>B@=?@BAA=9A?@A>ABBBCDB:C:@??9>?;:?;?BA@?B################### |
| @HWI-EAS440_0386:1:23:13748:2482#0/1 | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATATCTTGAGTGCAGTTGAGGCAGGCGGGATTCGTGGTG | +         | GGGC'?CC4<5<6HHHHHHH@@HGGEEGHHFHHHHBBHGFFGCGBBGG@DGBDGFDFEEHHHFHEHEHBHHEEDEEEAB@BBEAEEBEEAEBB8:>>:EEEB@>@>>B@:@@@9=@>:B:>=>8:/7>=>@##################### |
| @HWI-EAS440_0386:1:23:6532:3028#0/1  | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGCGGACGCTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGGTGTCTTGAGTACAGTAGAGGCAGGGGGGAGTCTTGGGG | +         | IIIG)EGGBDEDDIIIIHIGIIIIGFIIIDIIIFIHIIHHIIHFIHIIGGHIIIHHIHIEEHHHGIGHEIIEFBA8G?EEEEECHBHDEFEHDGEECDEEBEEDFGC@BEEEEE@EBBBBBBBADBAB9?;=9@????############## |
| @HWI-EAS440_0386:1:23:8677:3027#0/1  | TACANAGGTCTCAAGCGTTGTTCGGAATCACTGGGCGTAAAGCGTGCGTAGGCTGTTTCGTAAGTCGTGTGTGAAAGGCGCGGGCTCAACCCGCGGCCGGCACATGATACTGTGAGACTAGCGTAACGGAGGGGGAACCGGAATTCTTGGTG | +         | HHHA'CCCEFFFFHHHHHHHHHHHGHHHHHHHHHHHHHHHHHGHHHHHHHHHGHHHHHHHG>HGDDGDGCGCEHFE>EEEB@EFEBB>B;D@@########################################################### |
| @HWI-EAS440_0386:1:23:5678:3052#0/1  | TACGNAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGAAGAGCAAGTCTGATGTGAAAGGCTGGGGCTTAACCCCAGGACTGCATTGGAAACTGTTTTTCTTGAGTGCCGGAGAGGTAAGCGGAATTCCTGGGG | +         | IIIF)FFFBBBCCIHIIIIIIIIIHIIIIHIIIIIGIHIIHIIHIIIIEIFIIGHIIBFIIHHIEEIIIIIBEGGGGEC>;@@;?@A?EBBEEBADA;=5<==>=3+>:4<=9@B>>1@9.=5=?=??######################## |
| @HWI-EAS440_0386:1:23:11889:3171#0/1 | TACGNAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGGGAGCGTAGACGGTTTTGCAAGTCTGAAGTGAAAGCCCGGGGCTTAACCCCGGGACTGCTTTGGAAACTGTATGACTAGAGTGCAGGAGAGGTAAGTGGAATTCCTAGTG | +         | IIIF)EEFFFFEEIHIIIIIIIIIIDIIIIIIIIIGIGEEEGGIHHIIHIHFIIFHHIIIIGGIGGIIH@FEBFDEDD<EBCBBEEEBBEBBF:=829?>>??AA3;;@@@B@@BB@################################### |
| @HWI-EAS440_0386:1:23:2112:3374#0/1  | TACGNAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGATGGATGTTTAAGTCAGTTGTGAAAGTTTGCGGCTCAACCGTAAAATTGCAGTTGATACTGGATGTCTTGAGTGCAGTTGAGGCAGGGGGGATTCGTGTGG | +         | GGGE(DEE;;@@@HHHHHHHHHHHHHHHHGHHHHHHGHEHHHHFHHHHFHHHHHHFHHEHHFHHHHHHEHHEHDBBG=?@C@C@HEFEE@FFE3?;??AED3C@1@@3?=<>?DADCBEBBBC?EB?B######################## |


As you can see, the sequences are fairly long, and the file goes on and on: it contains more than 300,000 sequences. Real datasets can contain hundreds of millions of sequences.
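
Each read in these files takes four lines (name, sequence, separator, quality), which is what the four columns of the tables above represent. Because of that, you can count the reads of a compressed file without opening it. The sketch below uses a made-up two-read file (`toy.fastq.gz` is a hypothetical name), so the numbers are not those of the tutorial data.

```bash
# A made-up FASTQ file with two reads, compressed like the tutorial data.
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nHHHH\n' | gzip -c > "$tmpdir/toy.fastq.gz"

# Each read takes exactly four lines, so reads = lines / 4.
n_reads=$(gzip -dc "$tmpdir/toy.fastq.gz" | awk 'END { print NR / 4 }')
echo "reads: $n_reads"   # prints "reads: 2"

# Peek at the first read without decompressing the file to disk.
gzip -dc "$tmpdir/toy.fastq.gz" | head -n 4
```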

Processing this data by hand, or with a word processor or a spreadsheet, is impossible. You need tools tailored for this purpose, like QIIME 2.

QIIME 2 is a toolbox that contains almost all the programs you need to process metabarcoding data. To see what it can do, just type `qiime` in a code cell:

The tools we are going to focus on in this tutorial are:
- tools
- demux
- dada2
- feature-table
- phylogeny
- diversity
- emperor
- feature-classifier
- taxa
- composition

To see how any of them work, type `qiime [TOOL] [SUBTOOL] --help`, and it will show how to use it:

In [15]:
qiime demux --help

Usage: qiime demux [OPTIONS] COMMAND [ARGS]...

  Description: This QIIME 2 plugin supports demultiplexing of single-end and
  paired-end sequence reads and visualization of sequence quality information.

  Plugin website: https://github.com/qiime2/q2-demux

  Getting user support: Please post to the QIIME 2 forum for help with this
  plugin: https://forum.qiime2.org

Options:
  --version            Show the version and exit.
  --example-data PATH  Write example data and exit.
  --citations          Show citations and exit.
  --help               Show this message and exit.

Commands:
  emp-paired        Demultiplex paired-end sequence data generated with the
                    EMP protocol.
  emp-single        Demultiplex sequence data generated with the EMP protocol.
  filter-samples    Filter samples out of demultiplexed data.
  subsample-paired  Subsample paired-end sequences without r

In [16]:
qiime taxa barplot --help

Usage: qiime taxa barplot [OPTIONS]

  This visualizer produces an interactive barplot visualization of taxonomies.
  Interactive features include multi-level sorting, plot recoloring, sample
  relabeling, and SVG figure export.

Inputs:
  --i-table ARTIFACT FeatureTable[Frequency]
                         Feature table to visualize at various taxonomic
                         levels.                                    [required]
  --i-taxonomy ARTIFACT FeatureData[Taxonomy]
                         Taxonomic annotations for features in the provided
                         feature table. All features in the feature table must
                         have a corresponding taxonomic annotation. Taxonomic
                         annotations that are not present in the feature table
                         will be ignored.                           [required]
Parameters:
  --m-metadata-file

The tools included in Qiime2 follow the same pattern:
```
qiime command subcommand \
  --i-input      files \
  --m-metadata   metadata-file.tsv \
  --p-parameters parameter1 \
  --o-output     output-file
```

All commands transform input files into output files:
- **I**nput files are specified with `--i-*`
- **M**etadata is specified with `--m-*`
- **P**arameters on how to execute the command are specified with `--p-*`
- **O**utput files are specified with `--o-*`

As a rule of thumb, use that order: input, metadata, parameters, output. This will make the code cleaner and easier for you to follow.

The next picture is a map of a typical analysis in QIIME 2. Drag and drop it to another tab in your browser if you need to look at it later on.

<img src="assets/img/qiime_map.svg"  width="1200" height="600">

## 1. Importing sequences

The first step is to tell QIIME 2 where your sequences are. To do so, use the `qiime tools import` command:

In [17]:
qiime tools import --help

Usage: qiime tools import [OPTIONS]

  Import data to create a new QIIME 2 Artifact. See https://docs.qiime2.org/
  for usage examples and details on the file types and associated semantic
  types that can be imported.

Options:
  --type TEXT             The semantic type of the artifact that will be
                          created upon importing. Use --show-importable-types
                          to see what importable semantic types are available
                          in the current deployment.                [required]
  --input-path PATH       Path to file or directory that should be imported.
                                                                    [required]
  --output-path ARTIFACT  Path where output artifact should be written.
                                                                    [required]
  --input-format TEXT     The format of the data to be imported.

To import the sequences execute the following:

In [18]:
qiime tools import \
  --type        EMPSingleEndSequences `# the type of sequences` \
  --input-path  data `# the folder with the data` \
  --output-path sequences.qza `# the file in which to store it`

Imported data as EMPSingleEndDirFmt to sequences.qza

We can see the content with `peek`:

In [19]:
qiime tools peek sequences.qza

UUID:        4cf09396-1f8d-4299-90d3-1a39b8dea7a5
Type:        EMPSingleEndSequences
Data format: EMPSingleEndDirFmt

This is not very informative, but it tells us that the artifact contains sequences of the type EMP Single End.

Double-clicking `sequences.qza` does not work: the file is too big to inspect manually. Instead, we generate reports from it, the ".qz**v**" files.

Note: QIIME 2 generates two types of files:
- qz**a** - artifacts: tables, for the machine
- qz**v** - visualizations: plots, for the humans

The artifacts are for the programs, the visualizations for us.

## 2. Demux: demultiplex sequences

Demultiplexing consists of grouping the 16S sequences according to the sample they belong to. The barcode of each read tells us which sample it came from.
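
To make the idea concrete, here is a toy version of that grouping with `awk`. It is only a sketch: the three barcode rows are taken from the metadata table, the five reads are made up, and the file names are hypothetical. It counts reads per sample by exact barcode match, whereas the QIIME 2 plugin also writes out the sequences of each sample and can correct barcode errors.

```bash
tmpdir=$(mktemp -d)
# Barcode-to-sample map: three rows from the sample metadata table.
printf 'L1S8\tAGCTGACTAGTC\nL2S204\tATGCAGCTCAGT\nL5S104\tCAGTGTCAGGAC\n' > "$tmpdir/barcodes.tsv"
# Barcodes of five made-up reads, in read order (the last one matches no sample).
printf 'ATGCAGCTCAGT\nAGCTGACTAGTC\nATGCAGCTCAGT\nCAGTGTCAGGAC\nCCCCTCAGCGGC\n' > "$tmpdir/reads.txt"

# First file: remember which sample owns each barcode.
# Second file: assign every read to its sample and count the reads per sample.
counts=$(awk -F '\t' '
    NR == FNR { sample[$2] = $1; next }
    { id = ($1 in sample) ? sample[$1] : "unassigned"; n[id]++ }
    END { for (id in n) print id "\t" n[id] }
' "$tmpdir/barcodes.tsv" "$tmpdir/reads.txt" | sort)

echo "$counts"
```

In this toy run, sample `L2S204` gets two reads, `L1S8` and `L5S104` get one each, and one read stays unassigned.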

To demultiplex our reads, we will use `qiime demux`:

In [20]:
qiime demux --help

Usage: qiime demux [OPTIONS] COMMAND [ARGS]...

  Description: This QIIME 2 plugin supports demultiplexing of single-end and
  paired-end sequence reads and visualization of sequence quality information.

  Plugin website: https://github.com/qiime2/q2-demux

  Getting user support: Please post to the QIIME 2 forum for help with this
  plugin: https://forum.qiime2.org

Options:
  --version            Show the version and exit.
  --example-data PATH  Write example data and exit.
  --citations          Show citations and exit.
  --help               Show this message and exit.

Commands:
  emp-paired        Demultiplex paired-end sequence data generated with the
                    EMP protocol.
  emp-single        Demultiplex sequence data generated with the EMP protocol.
  filter-samples    Filter samples out of demultiplexed data.
  subsample-paired  Subsample paired-end sequences without r

Since our dataset consists of single-end EMP reads, we will use `emp-single`:

In [21]:
qiime demux emp-single --help

Usage: qiime demux emp-single [OPTIONS]

  Demultiplex sequence data (i.e., map barcode reads to sample ids) for data
  generated with the Earth Microbiome Project (EMP) amplicon sequencing
  protocol. Details about this protocol can be found at
  http://www.earthmicrobiome.org/protocols-and-standards/

Inputs:
  --i-seqs ARTIFACT RawSequences | EMPSingleEndSequences |
    EMPPairedEndSequences
                       The single-end sequences to be demultiplexed.
                                                                    [required]
Parameters:
  --m-barcodes-file METADATA
  --m-barcodes-column COLUMN  MetadataColumn[Categorical]
                       The sample metadata column containing the per-sample
                       barcodes.                                    [required]
  --p-golay-error-correction / --p-no-golay-error-correction

We have to specify:
- The input qza
- The metadata file that contains the barcodes
- The exact column that holds them
- Where to store the demultiplexed sequences
- Where to store the statistics of the error correction

In [22]:
qiime demux emp-single \
    --i-seqs                     sequences.qza \
    --m-barcodes-file            sample-metadata.tsv \
    --m-barcodes-column          barcode-sequence \
    --o-per-sample-sequences     demux.qza \
    --o-error-correction-details demux-details.qza

Saved SampleData[SequencesWithQuality] to: demux.qza
Saved ErrorCorrectionDetails to: demux-details.qza

Let's see what happened in this step. To do so, convert `demux.qza` into `demux.qzv`:

In [23]:
qiime demux summarize \
  --i-data          demux.qza \
  --o-visualization demux.qzv

Saved Visualization to: demux.qzv

Download `demux.qzv` and upload it to https://view.qiime2.org

Or click this [link](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Fdemux.qzv)

## 3. DADA2: correct errors

DADA2 is a program that detects and corrects errors in amplicon sequences. Its QIIME 2 plugin provides the following commands:

In [25]:
qiime dada2 --help

Usage: qiime dada2 [OPTIONS] COMMAND [ARGS]...

  Description: This QIIME 2 plugin wraps DADA2 and supports sequence quality
  control for single-end and paired-end reads using the DADA2 R library.

  Plugin website: http://benjjneb.github.io/dada2/

  Getting user support: Please post to the QIIME 2 forum for help with this
  plugin: https://forum.qiime2.org

Options:
  --version            Show the version and exit.
  --example-data PATH  Write example data and exit.
  --citations          Show citations and exit.
  --help               Show this message and exit.

Commands:
  denoise-ccs     Denoise and dereplicate single-end Pacbio CCS
  denoise-paired  Denoise and dereplicate paired-end sequences
  denoise-pyro    Denoise and dereplicate single-end pyrosequences
  denoise-single  Denoise and dereplicate single-end sequences

In [26]:
qiime dada2 denoise-single --help

Usage: qiime dada2 denoise-single [OPTIONS]

  This method denoises single-end sequences, dereplicates them, and filters
  chimeras.

Inputs:
  --i-demultiplexed-seqs ARTIFACT SampleData[SequencesWithQuality |
    PairedEndSequencesWithQuality]
                         The single-end demultiplexed sequences to be
                         denoised.                                  [required]
Parameters:
  --p-trunc-len INTEGER  Position at which sequences should be truncated due
                         to decrease in quality. This truncates the 3' end of
                         the of the input sequences, which will be the bases
                         that were sequenced in the last cycles. Reads that
                         are shorter than this value will be discarded. If 0
                         is provided, no truncation or length filtering will
                         be performed

The options we need to set are the following:
- **I**nputs:
  - the demultiplexed sequences: `demux.qza`
- **P**arameters:
  - how many bases to keep (`--p-trunc-len`): 120
  - how many bases to trim on the 5' end, the left (`--p-trim-left`): 0
- **O**utputs:
  - representative sequences: the unique denoised amplicon sequences
  - table: the table that says how many of **each bacteria** are in **each sample**
  - denoising stats: stats for the nerds.
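
To get a feeling for what the two parameters do to a read, here is a toy sketch with `awk`. It assumes that reads shorter than the truncation length are discarded (as the help text above says) and that the kept reads end up with a length of truncation length minus left trim, which is how the DADA2 documentation describes it. The reads and the truncation length of 6 are made up, and the real denoising step does much more than this.

```bash
trim_left=0   # bases removed from the 5' end (the value used in this tutorial)
trunc_len=6   # toy value; the tutorial uses 120

# Three made-up reads with lengths 10, 6 and 4.
reads=$(printf 'ACGTACGTAC\nTTGACA\nACGT\n')

# Drop reads shorter than trunc_len, cut the rest at trunc_len,
# then remove trim_left bases from the start.
kept=$(printf '%s\n' "$reads" | awk -v left="$trim_left" -v len="$trunc_len" '
    length($0) >= len { print substr($0, left + 1, len - left) }
')

echo "$kept"   # prints ACGTAC and TTGACA; the 4-base read is discarded
```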

Let's compose the command

In [27]:
qiime dada2 denoise-single \
    --i-demultiplexed-seqs       demux.qza \
    --p-trim-left                0 \
    --p-trunc-len                120 \
    --o-representative-sequences rep-seqs.qza \
    --o-table                    table.qza   \
    --o-denoising-stats          dada2-stats.qza

Saved FeatureTable[Frequency] to: table.qza
Saved FeatureData[Sequence] to: rep-seqs.qza
Saved SampleData[DADA2Stats] to: dada2-stats.qza

It takes 1-2 minutes.

And now let's see what is inside `dada2-stats.qza`.

In [None]:
qiime metadata tabulate \
    --m-input-file    dada2-stats.qza \
    --o-visualization dada2-stats.qzv

We can visualize the qzv with [the qiime view web](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Fstats-dada2.qzv)

Finally, we can get the feature table (also called OTU table), which says how many copies of each bacteria each sample contains:

In [None]:
qiime feature-table summarize \
    --i-table                table.qza \
    --o-visualization        table.qzv \
    --m-sample-metadata-file sample-metadata.tsv

And also the one that contains the relationship between each OTU and its sequence:

In [None]:
qiime feature-table tabulate-seqs \
    --i-data          rep-seqs.qza \
    --o-visualization rep-seqs.qzv

[table.qzv](https://view.qiime2.org/?src=https%3A%2F%2Fdocs.qiime2.org%2F2023.2%2Fdata%2Ftutorials%2Fmoving-pictures%2Ftable.qzv)

In [None]:
pause