# Welcome to the Bioinformatics Analysis of Microbiome Data


We will use [**QIIME 2**](https://qiime2.org) to analyse the read data. Instead of working locally, we will run our analyses on **Google Colab**, a free, cloud-based platform that lets you write and execute code in a web browser without installing software on your computer.

This notebook and corresponding setup script have been adapted from the [**uzh-microbiome-tutorial**](https://github.com/bokulich-lab/uzh-microbiome-tutorial.git); all source code is licensed under the Apache License 2.0.


**Notes:**

- **Bash commands**
Google Colab interprets code as Python by default. However, many tasks, such as downloading files, moving directories, or running software like QIIME 2, are done with bash commands. To run a bash command in Colab, we prefix it with `!`. This lets us interact with QIIME 2 through [`q2cli`](https://github.com/qiime2/q2cli/) (the QIIME 2 command-line interface). You do not need this prefix when working in a terminal.


- **Read before you run**
You can run all cells in the notebook by going to `Runtime > Run all`. However, it is best to run the cells one at a time so that you can follow each step and understand what it does.
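
To make the `!` prefix less mysterious, here is a minimal sketch of roughly what it does: a shell command is run as a separate process and its output is shown. It uses Python's standard `subprocess` module rather than Colab's own mechanism, and it assumes a Unix-like system where `echo` is available. The helper name `run_shell` is made up for this illustration.

```python
import subprocess

def run_shell(command: str) -> str:
    """Run a shell command and return what it printed to standard output."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, check=True
    )
    return result.stdout

# In a notebook cell, `! echo hello` prints the same text.
print(run_shell("echo hello"))
```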

## Setup

QIIME 2 is usually installed by following the [official installation instructions](https://docs.qiime2.org/2023.9/install/). However, because we are using Google Colab and there are some caveats to using conda here, we will use the setup script obtained from our collaborators at the Bokulich lab. 

We start by cloning the repository down from GitHub into a directory named "materials" (this is within the "content" directory). 

Note: This command is intended for use in Google Colab. If working locally, you would clone the repository on your machine. 

In [None]:
! git clone https://github.com/natashaztarora/BME307_2024.git materials
! mkdir /content/prefetch_cache ## This directory is necessary for Google Colab.

Next we navigate to the "materials" directory and create a new subdirectory called "uzh" within it.

In [None]:
%cd materials
! mkdir uzh

Now we are ready to set up our environment: we will install the dependencies and configure the environment. This will take about 10 minutes.
**Note:** This setup is only relevant for Google Colaboratory and will not work on your local machine.

In [None]:
%run setup_qiime2

And we will use some Python packages below, so let's load these here:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

Before we import the raw sequence data into QIIME2, let's create a new directory inside "content/materials/uzh" called "raw_data_zipped" and upload the fastq files you have downloaded from SwitchDrive. 

In [None]:
%cd uzh
! mkdir raw_data_zipped

## Import data into QIIME 2

Next, we will import our reads into QIIME 2 and convert them into the file format that is required for QIIME2 analyses.

### How we do this

We run `qiime tools import`, specifying the following parameters:

- `--type`: the semantic type of the data, which here indicates whether the reads are single-end or paired-end
- `--input-path`: the directory containing the fastq files
- `--input-format`: the format of the data. The available choices are listed [**here**](https://docs.qiime2.org/2024.5/tutorials/importing/)
- `--output-path`: the path of the artefact you generate

Run the following command:

In [None]:
! qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path raw_data_zipped \
    --input-format CasavaOneEightSingleLanePerSampleDirFmt \
    --output-path demux-paired-end.qza
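
If the import fails, a common cause is file names that do not follow the Casava 1.8 convention expected by `CasavaOneEightSingleLanePerSampleDirFmt` (sample ID, barcode, lane, read direction and set number, separated by underscores). The sketch below is one possible way to check names before importing. The regular expression is an approximation written for this example, not QIIME 2's own validation code, and the file names are hypothetical.

```python
import re

# Approximate pattern for Casava 1.8 file names:
# sample ID, barcode, lane, read direction, set number.
CASAVA_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq\.gz$"
)

def check_casava_names(filenames):
    """Return (pairs, problems): samples with both reads, and names that do not fit."""
    reads_by_sample = {}
    problems = []
    for name in filenames:
        match = CASAVA_PATTERN.match(name)
        if match is None:
            problems.append(name)
            continue
        reads_by_sample.setdefault(match["sample"], set()).add(match["read"])
    pairs = sorted(s for s, reads in reads_by_sample.items() if reads == {"1", "2"})
    return pairs, problems

# Hypothetical file names, for illustration only.
example = [
    "sampleA_S1_L001_R1_001.fastq.gz",
    "sampleA_S1_L001_R2_001.fastq.gz",
    "sampleB.fastq.gz",
]
pairs, problems = check_casava_names(example)
print("paired samples:", pairs)
print("unexpected names:", problems)
```

In the notebook you could pass `os.listdir("raw_data_zipped")` instead of the example list.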

You can now check whether the data were imported by running `qiime demux summarize`, specifying the input artefact and the name of the visualisation you want to generate. You can view this visualisation by dropping it into QIIME 2 View (https://view.qiime2.org/).

In [None]:
! qiime demux summarize \
    --i-data demux-paired-end.qza \
    --o-visualization qualities.qzv

Now let's explore the outputs (QZV) with [view.qiime2.org](https://view.qiime2.org).
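
QIIME 2 artefacts (`.qza`) and visualisations (`.qzv`) are zip archives, so you can also peek inside them with standard tools. The sketch below lists the contents of such a file with Python's `zipfile` module. So that it runs anywhere, it first builds a tiny stand-in archive. The entries in that stand-in are hypothetical and do not reproduce a real QIIME 2 layout.

```python
import os
import tempfile
import zipfile

def list_archive(path):
    """Return the names of all files stored in a .qza/.qzv (zip) archive."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())

# Build a tiny stand-in archive so the example runs anywhere.
# The entries are hypothetical and do not reproduce a real QIIME 2 layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.qzv")
    with zipfile.ZipFile(demo_path, "w") as archive:
        archive.writestr("example-uuid/metadata.yaml", "type: Visualization\n")
        archive.writestr("example-uuid/data/index.html", "<html></html>\n")
    demo_contents = list_archive(demo_path)

print(demo_contents)
```

On Colab, `list_archive("qualities.qzv")` would list the files inside the visualisation generated above.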

## Remove primers with Cutadapt

We need to remove the primers that were used for targeted amplification. 


### How we do this

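To do this we use `qiime cutadapt trim-paired`, specifying these main parameters:

- `--p-front-f`: the forward primer, `GTGYCAGCMGCCGCGGTAA`
- `--p-front-r`: the reverse primer, `CCGYCAATTYMTTTRAGTTT`
- `--p-match-adapter-wildcards`: whether the primers contain wobble bases (IUPAC wildcard characters such as Y, M and R)
- `--p-discard-untrimmed`: whether to discard reads in which no primer was found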

In [None]:
! qiime cutadapt trim-paired \
    --i-demultiplexed-sequences demux-paired-end.qza \
    --p-front-f GTGYCAGCMGCCGCGGTAA \
    --p-front-r CCGYCAATTYMTTTRAGTTT \
    --p-match-adapter-wildcards \
    --p-discard-untrimmed \
    --verbose \
    --o-trimmed-sequences paired-end-demux-trimmed.qza | tee cutadaptresults.log
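
To illustrate what wobble bases mean for primer matching, the sketch below expands the IUPAC codes in the forward primer into a regular expression and trims the primer from the start of a read. It is a deliberately simplified picture: it requires an exact match at the very start of the read, whereas Cutadapt tolerates a configurable error rate and uses its own alignment algorithm. The function names and the example reads are made up for this illustration.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def primer_to_regex(primer):
    """Turn a primer containing wobble bases into a regular expression."""
    parts = []
    for base in primer:
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else f"[{options}]")
    return "".join(parts)

def trim_front_primer(read, primer):
    """Remove the primer from the start of the read; return None if it is absent."""
    match = re.match(primer_to_regex(primer), read)
    if match is None:
        return None  # comparable to discarding untrimmed reads
    return read[match.end():]

forward_primer = "GTGYCAGCMGCCGCGGTAA"
# Hypothetical read: the primer with Y=C and M=A, followed by a short made-up insert.
read = "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGGT"
print(primer_to_regex(forward_primer))
print(trim_front_primer(read, forward_primer))
print(trim_front_primer("ACGTACGTACGT", forward_primer))
```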


Summarise the .qza artefact using the command below, and then visualise the trimmed reads in QIIME 2 view (https://view.qiime2.org/).

In [None]:
! qiime demux summarize \
    --i-data paired-end-demux-trimmed.qza \
    --o-visualization paired-end-demux-trimmed-summary.qzv 

## Denoise with DADA2

Now we will “denoise” the reads, that is, clean up the data to remove erroneous reads, endeavouring to retain only true biological reads. These reads may differ by one single nucleotide, and they are referred to as exact sequence variants (ESVs) or amplicon sequence variants (ASVs).

### How denoising is done

As we are working with paired-end reads, we use `qiime dada2 denoise-paired`. This command performs quality filtering, merging of forward and reverse reads, dereplication and removal of chimeras.

Quality filtering refers to trimming the ends of reads where quality is suboptimal; users can also discard sequences below a particular length. This step is done first to optimize the merging of forward and reverse reads. The merging is done according to default parameters (not specified in the command).

Dereplication refers to finding all identical sequencing reads and reducing each group to one “unique sequence” with a note of its abundance. Removal of chimeras refers to the removal of sequences that are “hybrids” of different parent sequences, and which do not correspond to true ASVs.
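
Dereplication is easy to picture in code. The sketch below collapses identical reads into unique sequences with their abundances. It only illustrates the idea with made-up reads and is not how DADA2 itself is implemented.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse identical reads into unique sequences with their abundances."""
    counts = Counter(reads)
    # Most abundant sequences first; ties broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Made-up reads for illustration.
reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTGA", "ACGA"]
for sequence, abundance in dereplicate(reads):
    print(sequence, abundance)
```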

Here we will be specifying the following parameters:

- Truncation length for forward reads (`--p-trunc-len-f`): the length at which forward reads are cut; reads shorter than this length are discarded
- Truncation length for reverse reads (`--p-trunc-len-r`): the length at which reverse reads are cut; reads shorter than this length are discarded

Note that we will now have 3 output files:

1. an abundance table comprising the unique sequences and their abundance
2. a fasta file with the unique sequences, which we refer to as the representative sequences
3. a file containing the statistics for the denoising steps

You can find more information on DADA2 here (https://benjjneb.github.io/dada2/).

Run the following command:

In [None]:
! qiime dada2 denoise-paired \
    --i-demultiplexed-seqs paired-end-demux-trimmed.qza \
    --p-trunc-len-f 225 \
    --p-trunc-len-r 225 \
    --o-table table.qza \
    --o-representative-sequences rep-seqs.qza \
    --o-denoising-stats denoising-stats.qza 
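
The effect of the two truncation lengths in the command above can be sketched in a few lines: each read is cut to the chosen length, and reads that are shorter than that length are dropped. This is only the truncation rule as described in this notebook, shown on made-up reads with a small length of 8 instead of 225; DADA2 applies further quality filters that are not shown here.

```python
def truncate_reads(reads, trunc_len):
    """Cut each read to trunc_len bases and drop reads shorter than trunc_len."""
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]

# Made-up reads of length 10, 6 and 8.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGT"]
print(truncate_reads(reads, 8))
```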

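Next, we summarise the feature table with `qiime feature-table summarize`. Once the command has finished, open QIIME 2 View (https://view.qiime2.org/) and drop table.qzv into the drag-and-drop window to see the results. Note that the command below reads the table from `QIIME2_files/` and the metadata from `Metadata/`, whereas the DADA2 command above wrote `table.qza` to the current directory; adjust the paths if your files are located elsewhere.

Optional: to see the unique/representative sequences after denoising with DADA2, you can use `qiime feature-table tabulate-seqs`.

Run the following command: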

In [None]:
! qiime feature-table summarize \
    --i-table QIIME2_files/table.qza \
    --o-visualization QIIME2_files/table.qzv \
    --m-sample-metadata-file Metadata/metadata.tsv

## Assign taxonomy

We now assign taxonomy to the unique/representative sequences found across all samples. We do this with the q2-feature-classifier plugin, making use of a pre-trained Naive Bayes classifier. This classifier was trained on the SILVA reference database (downloaded December 2019), which comprises hundreds of thousands of bacterial sequences with taxonomic information. The output is a file containing the assignments at the different taxonomic ranks (from domain to species), and the level of confidence for each taxonomic assignment.

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-classifier Taxonomy_classifier/silva-138-ssu-nr99-99-V4V5-classifier.qza \
    --i-reads QIIME2_files/rep-seqs.qza \
    --o-classification QIIME2_files/taxonomy.qza
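
Each taxonomic assignment is stored as a single string of rank-prefixed names separated by semicolons. The sketch below splits such a string into its ranks. It assumes the `d__`, `p__`, ... prefix style commonly seen with SILVA-based QIIME 2 classifiers; the example lineage is chosen for illustration and is not taken from our data.

```python
# Assumed rank prefixes (SILVA-style); other databases may use different ones.
RANK_NAMES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_taxon(taxon):
    """Split a rank-prefixed taxonomy string into a {rank: name} dictionary."""
    ranks = {}
    for field in taxon.split(";"):
        field = field.strip()
        if "__" not in field:
            continue  # skip labels such as "Unassigned"
        prefix, name = field.split("__", 1)
        if prefix in RANK_NAMES and name:
            ranks[RANK_NAMES[prefix]] = name
    return ranks

example = "d__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales"
print(parse_taxon(example))
```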


Tabulate the taxonomy with `qiime metadata tabulate`, passing `QIIME2_files/taxonomy.qza` as the input file.

## Generate a phylogenetic tree

In [None]:
! qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences dada2/representative_sequences.qza \
    --output-dir phylogeny

Now to view the tree, you can try [iTOL](https://itol.embl.de/upload.cgi).

After opening the web page, click Choose File and select the tree artifact we generated above. Click Upload: after a few seconds you should see the tree.

You may find it easier to navigate the tree in its "rectangular" representation: to change the view, select the Rectangular option in the Mode section of the Basic tab.

## Analyze phylogenetic diversity

In [None]:
! qiime diversity core-metrics-phylogenetic \
    --i-phylogeny phylogeny/rooted_tree.qza \
    --i-table dada2/table.qza \
    --p-sampling-depth 1100 \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --output-dir core-metrics-results
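
The `--p-sampling-depth` value controls rarefaction: every sample is randomly subsampled to that many reads, and samples with fewer reads are dropped. The sketch below illustrates this idea in plain Python on made-up counts. It is not the q2-diversity implementation, and the `seed` argument is only there to make the example repeatable.

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample a {feature: count} sample to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads (it is dropped).
    """
    pool = [feature for feature, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    rarefied = {}
    for feature in rng.sample(pool, depth):
        rarefied[feature] = rarefied.get(feature, 0) + 1
    return rarefied

# Made-up counts for two samples.
deep_sample = {"ASV1": 900, "ASV2": 500, "ASV3": 100}
shallow_sample = {"ASV1": 300, "ASV2": 200}
print(rarefy(deep_sample, 1100))     # 1100 reads remain in total
print(rarefy(shallow_sample, 1100))  # None: too few reads, sample is dropped
```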

In [None]:
! qiime diversity alpha-group-significance \
    --i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/faith_pd_group_significance.qzv

In [None]:
# Optional
! qiime diversity alpha-group-significance \
    --i-alpha-diversity core-metrics-results/evenness_vector.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/evenness_group_significance.qzv

In [None]:
! qiime emperor plot \
    --i-pcoa core-metrics-results/bray_curtis_pcoa_results.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/bray_curtis_pcoa.qzv
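
The PCoA plotted above is based on Bray-Curtis dissimilarity, which compares the counts of each feature between two samples: 0 means identical composition, 1 means no shared features. A minimal sketch of the standard formula on made-up counts follows. In the real pipeline this is computed on the rarefied table by q2-diversity.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity between two non-empty {feature: count} samples."""
    features = set(sample_a) | set(sample_b)
    difference = sum(abs(sample_a.get(f, 0) - sample_b.get(f, 0)) for f in features)
    total = sum(sample_a.values()) + sum(sample_b.values())
    return difference / total

# Made-up counts: the samples share ASV1 only.
a = {"ASV1": 6, "ASV2": 4}
b = {"ASV1": 2, "ASV3": 8}
print(bray_curtis(a, b))  # (4 + 4 + 8) / 20 = 0.8
```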

## Classify by taxonomy

There are several ways to classify your sequences into bacterial species. One of them is to use consensus assignment based on e.g. BLAST search of a sequence against a database of known sequences. Another one is using a machine learning classifier trained on a reference database to recognize corresponding bacterial species. We will use a pretrained classifier to identify bacterial species present in our samples.

We can use the `classify-sklearn` action from the feature-classifier plugin to do that. This step will require the `FeatureData[Sequence]` artifact (containing our ASVs) that we generated previously and a pre-trained taxonomic classifier.

In [None]:
! wget https://data.qiime2.org/2023.9/common/gg-13-8-99-515-806-nb-weighted-classifier.qza

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-reads dada2/representative_sequences.qza \
    --i-classifier gg-13-8-99-515-806-nb-weighted-classifier.qza \
    --p-n-jobs 2 \
    --output-dir taxonomy

In [None]:
! qiime metadata tabulate \
    --m-input-file taxonomy/classification.qza \
    --o-visualization taxonomy/classification.qzv

In [None]:
! qiime taxa barplot \
    --i-table dada2/table.qza \
    --i-taxonomy taxonomy/classification.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization taxonomy/taxa_barplot.qzv

## Optional section: Understand differentially abundant features

This section may be omitted for time, but provides an interesting mechanistic view of microbiome interactions.

In [None]:
! mkdir diff_abund

! qiime taxa collapse \
    --i-table dada2/table.qza \
    --i-taxonomy taxonomy/classification.qza \
    --p-level 6 \
    --o-collapsed-table diff_abund/table_l6.qza
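
Collapsing a table at a taxonomic level means summing the counts of all features that share the same lineage down to that level (level 6 corresponds to genus in a seven-rank taxonomy). The sketch below shows the idea with plain dictionaries and made-up data, using level 2 to keep the example small. Details such as how missing ranks are handled may differ in `qiime taxa collapse`.

```python
def collapse_table(table, taxonomy, level):
    """Sum feature counts that share the same taxonomy down to `level` ranks.

    table:    {sample: {feature_id: count}}
    taxonomy: {feature_id: "rank1; rank2; ..."}
    """
    collapsed = {}
    for sample, counts in table.items():
        row = collapsed.setdefault(sample, {})
        for feature_id, count in counts.items():
            ranks = [r.strip() for r in taxonomy[feature_id].split(";")]
            label = ";".join(ranks[:level])
            row[label] = row.get(label, 0) + count
    return collapsed

# Made-up table and taxonomy for illustration.
table = {"s1": {"f1": 5, "f2": 3, "f3": 2}}
taxonomy = {
    "f1": "k__Bacteria; p__Firmicutes; c__Bacilli",
    "f2": "k__Bacteria; p__Firmicutes; c__Clostridia",
    "f3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia",
}
print(collapse_table(table, taxonomy, 2))
```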

In [None]:
! qiime composition add-pseudocount \
    --i-table diff_abund/table_l6.qza \
    --o-composition-table diff_abund/comp_table_l6.qza
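
The pseudocount step exists because ANCOM works with log-ratios of abundances, and the logarithm of zero is undefined. Adding a small constant to every count removes the zeros. A minimal sketch on a made-up table, assuming a pseudocount of 1:

```python
import math

def add_pseudocount(table, pseudocount=1):
    """Add a constant to every count in a {sample: {feature: count}} table."""
    return {
        sample: {feature: count + pseudocount for feature, count in counts.items()}
        for sample, counts in table.items()
    }

# Made-up table in which genusA was not observed.
table = {"s1": {"genusA": 0, "genusB": 9}}
shifted = add_pseudocount(table)
print(shifted)
# A log-ratio is now defined even though genusA had a count of zero.
print(math.log(shifted["s1"]["genusA"] / shifted["s1"]["genusB"]))
```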

In [None]:
! qiime feature-table filter-samples \
    --i-table diff_abund/comp_table_l6.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --p-where "[body-site]='gut'" \
    --o-filtered-table diff_abund/comp_gut_table_l6.qza

In [None]:
! qiime composition ancom \
    --i-table diff_abund/comp_gut_table_l6.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --m-metadata-column subject \
    --o-visualization diff_abund/ancom_gut_subject_l6.qzv

# Additional Tools
* `q2-fondue`
* Beta diversity methods in `q2-diversity`:
  * `qiime diversity beta-group-significance`
  * `qiime diversity adonis`