# Welcome to the Bioinformatics Analysis of Microbiome Data


We will be using [**QIIME 2**](https://qiime2.org) to analyse the read data. Instead of working locally, we will run our analyses on **Google Colab**, a free, cloud-based platform that allows you to write and execute code in a web browser without needing to install software on your computer.

This notebook and corresponding setup script have been adapted from the [**uzh-microbiome-tutorial**](https://github.com/bokulich-lab/uzh-microbiome-tutorial.git); all source code is licensed under the Apache License 2.0.


**Notes:**

- **Bash commands**
  Google Colab, by default, interprets code as Python. However, many tasks (such as downloading files, moving directories, or running software like QIIME 2) are done using bash commands. To run these bash commands in Colab, we prefix them with `!`. This allows us to interact with QIIME 2 using [`q2cli`](https://github.com/qiime2/q2cli/), the QIIME 2 command-line interface. You would not need this prefix when working in a terminal.

- **Read before you run**
  You can run all cells in the notebook by going to `Runtime > Run all`. However, it is best to run the commands bit by bit so that you can take in the information and understand what each step does.
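To make the `!` prefix from the notes above more concrete, here is a minimal sketch of roughly what it does: the command is handed to a shell and its output is collected. The helper name `run_shell` is hypothetical, and Colab's actual implementation differs; this is only an approximation built on Python's standard `subprocess` module.

```python
import subprocess


def run_shell(command: str) -> str:
    """Run a command in a shell and return its standard output as text."""
    completed = subprocess.run(
        command,
        shell=True,           # let the shell parse the command line
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


# Roughly comparable to typing `! echo "hello from bash"` in a Colab cell.
print(run_shell('echo "hello from bash"'))
```

In the notebook itself the `!` prefix is all you need; the sketch is only meant to show that nothing more mysterious than a shell call is involved.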

## Setup

QIIME 2 is usually installed by following the [official installation instructions](https://docs.qiime2.org/2023.9/install/). However, because we are using Google Colab and there are some caveats to using conda here, we will use the setup script obtained from our collaborators at the Bokulich lab. 

We start by cloning the repository from GitHub into a directory named `materials`.
Note: This command is intended for use in Google Colab. If working locally, you would clone the repository on your machine.

In [None]:
! git clone https://github.com/natashaztarora/BME307_2024.git materials
! mkdir /content/prefetch_cache ## This directory is necessary for Google Colab.

Next we navigate to our newly created directory "materials" in Google Colab. 

In [None]:
%cd materials

Now we are ready to set up our environment: we will install the dependencies and configure the environment. This will take about 10 minutes.
**Note:** This setup is only relevant for Google Colaboratory and will not work on your local machine.

In [None]:
%run setup_qiime2

And we will use some Python packages below, so let's load these here:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

## Import data into QIIME 2
Run the following cells first! Feel free to run these first few cells while Anton explains the basics of QIIME 2.

In [1]:
! mkdir -p data

In [None]:
! qiime tools import \
    --type 'SampleData[PairedEndSequencesWithQuality]' \
    --input-path Raw_data_zipped \
    --input-format CasavaOneEightSingleLanePerSampleDirFmt \
    --output-path demux-paired-end.qza
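The import above uses the `CasavaOneEightSingleLanePerSampleDirFmt` format, which expects one forward and one reverse `fastq.gz` file per sample, named in the Casava 1.8 style. The sketch below checks file names against that naming scheme before importing. The regular expression is an assumption based on the usual `sampleID_barcode_L001_R1_001.fastq.gz` layout, and the helper `check_filenames` and the example file names are made up for illustration.

```python
import re

# Assumed Casava 1.8 layout: sampleID_barcode_L<lane>_R<1 or 2>_001.fastq.gz
CASAVA_PATTERN = re.compile(
    r"^(?P<sample>.+)_[^_]+_L\d{3}_R(?P<read>[12])_001\.fastq\.gz$"
)


def check_filenames(filenames):
    """Split file names into those that match the pattern and those that do not."""
    matching, non_matching = [], []
    for name in filenames:
        if CASAVA_PATTERN.match(name):
            matching.append(name)
        else:
            non_matching.append(name)
    return matching, non_matching


# Hypothetical file names, not taken from the course data.
example_files = [
    "sampleA_S1_L001_R1_001.fastq.gz",
    "sampleA_S1_L001_R2_001.fastq.gz",
    "sampleB.fastq.gz",
]
matching, non_matching = check_filenames(example_files)
print("looks fine:", matching)
print("needs renaming:", non_matching)
```

To check the real input directory you could pass `os.listdir("Raw_data_zipped")` to `check_filenames` instead of the example list.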

In [None]:
! qiime demux summarize \
    --i-data demux-paired-end.qza \
    --o-visualization qualities.qzv

Now let's explore the outputs (QZV) with [view.qiime2.org](https://view.qiime2.org).

## Denoise amplicon sequence variants

A feature table is a type of artifact accepted by many QIIME 2 plugins and actions and used in many downstream analyses. It maps features (e.g. specific DNA sequences) to samples, typically by recording how often each feature was counted in each sample. There are several ways to construct a feature table in QIIME 2. The major choice to make when working with sequencing data is between ASVs (amplicon sequence variants) and OTUs (operational taxonomic units). Below you will see how to denoise sequences to produce a table of ASVs.
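As a rough mental model, a feature table can be pictured as a samples-by-features matrix of counts. The sketch below builds a toy version with pandas; the sample names, feature names and counts are invented, and a real QIIME 2 feature table is stored in a `.qza` artifact rather than a DataFrame.

```python
import pandas as pd

# Toy feature table: rows are samples, columns are features (e.g. ASVs),
# and each cell is the number of reads of that feature in that sample.
feature_table = pd.DataFrame(
    {
        "ASV_1": [10, 0, 3],
        "ASV_2": [5, 8, 0],
        "ASV_3": [0, 2, 7],
    },
    index=["sample_A", "sample_B", "sample_C"],
)

# Total number of reads per sample (the "sequencing depth" of each sample).
reads_per_sample = feature_table.sum(axis=1)

# Number of samples in which each feature was observed at least once.
samples_per_feature = (feature_table > 0).sum(axis=0)

print(feature_table)
print(reads_per_sample)
print(samples_per_feature)
```

These two summaries, reads per sample and samples per feature, are the same kind of information that the `feature-table summarize` visualization reports for the real table.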

### DADA2: Amplicon Sequence Variants

There exist several tools one can use for denoising of NGS reads. Here, we will use DADA2 to create a feature table of ASVs. DADA2 builds an error model which can identify differences between sequences, filters out noisy sequences and generates a feature table with error-corrected sequences.

To denoise the single-end reads we execute the cell below, specifying some additional parameters/outputs:

* `p-trunc-len` - we will truncate the reads to 135 bp (sequences shorter than this will be removed automatically)
* `p-n-threads` - if we have more than 1 CPU available, we can specify the number here to make the processing faster
* `output-dir` - instead of naming each output separately, we write all of them to one directory, which will contain:
  * `o-table` - this will be our ASV feature table
  * `o-representative-sequences` - this will be a list of all the denoised features (DNA sequences)
  * `o-denoising-stats` - this will be some stats from the denoising process
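The effect of `p-trunc-len` described in the list above can be sketched in a few lines of Python. This only illustrates the length rule (shorter reads are dropped, longer reads are cut); it is not how DADA2 is implemented and it ignores quality scores and the error model. The function name `truncate_reads` and the toy reads are made up.

```python
def truncate_reads(reads, trunc_len):
    """Drop reads shorter than trunc_len and cut the rest to exactly trunc_len."""
    kept = []
    for read in reads:
        if len(read) < trunc_len:
            continue  # too short: the read is removed
        kept.append(read[:trunc_len])
    return kept


# Toy reads with lengths 10, 4 and 7; a truncation length of 6 stands in for 135.
toy_reads = ["ACGTACGTAC", "ACGT", "ACGTACG"]
truncated = truncate_reads(toy_reads, trunc_len=6)
print(truncated)
```

After truncation all remaining reads have the same length, which is why a truncation length that is too long can discard many reads.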

In [None]:
! qiime dada2 denoise-single \
    --i-demultiplexed-seqs sequences.qza \
    --p-trunc-len 135 \
    --p-n-threads 2 \
    --output-dir dada2

In [None]:
# Optional
! qiime metadata tabulate \
    --m-input-file dada2/denoising_stats.qza \
    --o-visualization dada2/denoising_stats.qzv

In [None]:
! qiime feature-table summarize \
    --i-table dada2/table.qza \
    --m-sample-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization dada2/table.qzv

In [None]:
# Optional
! qiime feature-table tabulate-seqs \
    --i-data dada2/representative_sequences.qza \
    --o-visualization dada2/representative_sequences.qzv

## Generate a phylogenetic tree

We build the tree from the representative sequences with the `align-to-tree-mafft-fasttree` pipeline, which writes its outputs to the `phylogeny` directory.

In [None]:
! qiime phylogeny align-to-tree-mafft-fasttree \
    --i-sequences dada2/representative_sequences.qza \
    --output-dir phylogeny

To view the tree, you can try [iTOL](https://itol.embl.de/upload.cgi).

After opening the web page, click Choose File and select the tree artifact we generated above. Click Upload: after a few seconds you should see the tree.

You may find it easier to navigate the tree in its "rectangular" representation: to change the view, select the Rectangular option in the Mode section of the Basic tab.

## Analyze phylogenetic diversity

In [None]:
! qiime diversity core-metrics-phylogenetic \
    --i-phylogeny phylogeny/rooted_tree.qza \
    --i-table dada2/table.qza \
    --p-sampling-depth 1100 \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --output-dir core-metrics-results
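The `--p-sampling-depth` value controls rarefaction: every sample is randomly subsampled, without replacement, to the same number of reads, and samples with fewer reads than the chosen depth are dropped. The sketch below illustrates that idea on toy counts with NumPy. It is not the QIIME 2 implementation; the function `rarefy`, the sample names and the counts are invented for illustration.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample a vector of feature counts to `depth` reads without replacement.

    Returns None when the sample has fewer than `depth` reads in total.
    """
    counts = np.asarray(counts)
    if counts.sum() < depth:
        return None  # sample is too shallow and would be dropped
    # Expand the counts into one entry per read, labelled by feature index.
    reads = np.repeat(np.arange(counts.size), counts)
    chosen = rng.choice(reads, size=depth, replace=False)
    # Count how many reads of each feature were drawn.
    return np.bincount(chosen, minlength=counts.size)


rng = np.random.default_rng(0)
toy_table = {
    "sample_A": [900, 400, 200],  # 1500 reads in total: kept
    "sample_B": [300, 500, 100],  # 900 reads in total: dropped at depth 1100
}
rarefied = {name: rarefy(counts, 1100, rng) for name, counts in toy_table.items()}
print(rarefied)
```

Choosing the depth is a trade-off: a higher value keeps more reads per sample but loses more samples.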

In [None]:
! qiime diversity alpha-group-significance \
    --i-alpha-diversity core-metrics-results/faith_pd_vector.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/faith_pd_group_significance.qzv

In [None]:
# Optional
! qiime diversity alpha-group-significance \
    --i-alpha-diversity core-metrics-results/evenness_vector.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/evenness_group_significance.qzv
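The evenness vector tested above holds Pielou's evenness, which is the Shannon diversity of a sample divided by the maximum Shannon diversity possible for the number of features observed in it. The sketch below computes it for toy counts, as an illustration of the formula rather than of the QIIME 2 code; the function name `pielou_evenness` and the counts are made up.

```python
import numpy as np


def pielou_evenness(counts):
    """Pielou's evenness J = H / ln(S) for a vector of feature counts.

    H is the Shannon diversity and S the number of observed features.
    """
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]  # only observed features contribute
    proportions = counts / counts.sum()
    shannon = -(proportions * np.log(proportions)).sum()
    return shannon / np.log(counts.size)


even_sample = [5, 5, 5, 5]     # all features equally abundant
uneven_sample = [97, 1, 1, 1]  # one feature dominates

print(pielou_evenness(even_sample))
print(pielou_evenness(uneven_sample))
```

Values close to 1 mean the reads are spread evenly across features; values close to 0 mean that a few features dominate the sample.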

In [None]:
! qiime emperor plot \
    --i-pcoa core-metrics-results/bray_curtis_pcoa_results.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization core-metrics-results/bray_curtis_pcoa.qzv
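The PCoA plotted above is based on Bray-Curtis dissimilarity, which compares two samples by the sum of absolute count differences divided by the total count of both samples. The sketch below applies that formula to toy count vectors; the function name and the numbers are invented, and QIIME 2 computes the full distance matrix for you.

```python
import numpy as np


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two vectors of feature counts."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return np.abs(u - v).sum() / (u + v).sum()


sample_a = [10, 0, 5]
sample_b = [5, 5, 5]

print(bray_curtis(sample_a, sample_b))  # partly overlapping communities
print(bray_curtis(sample_a, sample_a))  # identical samples
print(bray_curtis([1, 0], [0, 1]))      # no shared features
```

The dissimilarity ranges from 0 for identical samples to 1 for samples that share no features, so points that lie close together in the PCoA plot have similar community composition.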

## Classify by taxonomy

There are several ways to classify your sequences into bacterial species. One is consensus assignment based on, for example, a BLAST search of each sequence against a database of known sequences. Another is to use a machine learning classifier trained on a reference database to recognize the corresponding bacterial species. We will use a pretrained classifier to identify the bacterial species present in our samples.

We can use the `classify-sklearn` action from the feature-classifier plugin to do that. This step will require the `FeatureData[Sequence]` artifact (containing our ASVs) that we generated previously and a pre-trained taxonomic classifier.

In [None]:
! wget https://data.qiime2.org/2023.9/common/gg-13-8-99-515-806-nb-weighted-classifier.qza

In [None]:
! qiime feature-classifier classify-sklearn \
    --i-reads dada2/representative_sequences.qza \
    --i-classifier gg-13-8-99-515-806-nb-weighted-classifier.qza \
    --p-n-jobs 2 \
    --output-dir taxonomy

In [None]:
! qiime metadata tabulate \
    --m-input-file taxonomy/classification.qza \
    --o-visualization taxonomy/classification.qzv

In [None]:
! qiime taxa barplot \
    --i-table dada2/table.qza \
    --i-taxonomy taxonomy/classification.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --o-visualization taxonomy/taxa_barplot.qzv

## Optional section: Understand differentially abundant features

This section may be omitted for time, but provides an interesting mechanistic view of microbiome interactions.

In [None]:
! mkdir diff_abund

! qiime taxa collapse \
    --i-table dada2/table.qza \
    --i-taxonomy taxonomy/classification.qza \
    --p-level 6 \
    --o-collapsed-table diff_abund/table_l6.qza
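Collapsing the table at level 6 sums the counts of all features that share the same taxonomy down to the sixth rank, which is the genus in a seven-rank Greengenes-style annotation (kingdom to species). The pandas sketch below illustrates the idea on toy data. The taxonomy strings, counts and the helper `truncate_taxonomy` are made up, and the exact labels QIIME 2 writes may differ.

```python
import pandas as pd


def truncate_taxonomy(taxon: str, level: int) -> str:
    """Keep only the first `level` ranks of a semicolon-separated taxonomy string."""
    ranks = [rank.strip() for rank in taxon.split(";")]
    return ";".join(ranks[:level])


# Toy feature table (samples x ASVs) and toy taxonomy assignments.
table = pd.DataFrame(
    {"ASV_1": [10, 0], "ASV_2": [5, 8], "ASV_3": [0, 2]},
    index=["sample_A", "sample_B"],
)
taxonomy = {
    "ASV_1": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Roseburia; s__",
    "ASV_2": "k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Lachnospiraceae; g__Roseburia; s__faecis",
    "ASV_3": "k__Bacteria; p__Bacteroidetes; c__Bacteroidia; o__Bacteroidales; f__Bacteroidaceae; g__Bacteroides; s__",
}

# Map each ASV to its genus-level label, then sum the ASVs that share a label.
genus_labels = {asv: truncate_taxonomy(taxon, 6) for asv, taxon in taxonomy.items()}
collapsed = table.T.groupby(genus_labels).sum().T
print(collapsed)
```

Here `ASV_1` and `ASV_2` end up in the same genus column, so the collapsed table has two columns instead of three.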

In [None]:
! qiime composition add-pseudocount \
    --i-table diff_abund/table_l6.qza \
    --o-composition-table diff_abund/comp_table_l6.qza
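The pseudocount step exists because ANCOM works with log-ratios of abundances, and a logarithm or ratio involving zero is undefined. The sketch below shows the idea on toy counts, assuming a pseudocount of 1 added to every cell; the numbers are invented and this is not the plugin's code.

```python
import numpy as np

# Toy counts: rows are samples, columns are taxa. Note the zeros.
counts = np.array([
    [10, 0, 5],
    [0, 3, 7],
])

# Assumed pseudocount of 1, added to every cell so that no zeros remain.
with_pseudocount = counts + 1

# Log-ratio of taxon 0 to taxon 1 in each sample is now finite.
log_ratios = np.log(with_pseudocount[:, 0] / with_pseudocount[:, 1])
print(with_pseudocount)
print(log_ratios)
```

Without the pseudocount, both samples in this toy table would produce an infinite or undefined log-ratio for this pair of taxa.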

In [None]:
! qiime feature-table filter-samples \
    --i-table diff_abund/comp_table_l6.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --p-where "[body-site]='gut'" \
    --o-filtered-table diff_abund/comp_gut_table_l6.qza

In [None]:
! qiime composition ancom \
    --i-table diff_abund/comp_gut_table_l6.qza \
    --m-metadata-file data/moving_pictures/moving_pictures_metadata.tsv \
    --m-metadata-column subject \
    --o-visualization diff_abund/ancom_gut_subject_l6.qzv

## Additional tools
* `q2-fondue`
* Beta diversity methods in `q2-diversity`:
  * `qiime diversity beta-group-significance`
  * `qiime diversity adonis`