# Phylogenetic analysis with QIIME 2

In this bonus exercise we will learn how to build and inspect phylogenies using QIIME 2.

Phylogenetics is the study of the evolutionary history of organisms. This contrasts with taxonomic classification and systematics, which are concerned with the naming and classification of organisms. In theory, the taxonomy of microbes should reflect their phylogenetic relationships, but this is not always the case: historical classification techniques (based first on morphological and later on biochemical traits) produced groupings that are still being revised as molecular techniques uncover the actual evolutionary relationships.

The rRNA gene operon has a long history of use as a "molecular clock" for estimating the evolutionary relationships between organisms. It is ubiquitous in cellular life forms and relatively highly conserved, yet it also contains divergent regions that are less conserved and allow many species to be told apart. Hence, genes like the bacterial 16S rRNA gene can be used for phylogeny estimation, as we will do in this notebook. This generally requires first aligning our sequences and then building a phylogeny that represents the distances between them. Other methods instead place sequences into an existing phylogeny or alignment, as shown at the end of this notebook.

If this is your first time building and inspecting phylogenies, we highly recommend the following article as a primer:
* Baldauf SL. Phylogeny for the faint of heart: a tutorial. Trends Genet. 2003 Jun;19(6):345-51. doi: [10.1016/S0168-9525(03)00112-4](https://doi.org/10.1016/s0168-9525(03)00112-4).

Many software packages exist for visualizing phylogenies, and some are much more feature-rich than those available in QIIME 2. However, most are general-purpose tools that are not designed for applications in microbial ecology. We will use [Empress](https://github.com/biocore/empress), a fast and scalable phylogenetic tree viewer that allows exploration of hierarchical relationships between features in a dataset (e.g., microbial species). In addition to functionality common to established tree viewers (e.g., metadata coloring, clade collapsing, and barplots), Empress supports new functionality useful for (microbial) ecology research, including integration and synchronized animations with ordination plots. We will also try out [iTOL](https://itol.embl.de/upload.cgi), which provides a feature-rich web interface for visualizing phylogenies.

**Exercise overview:**<br>
[1. Phylogeny analysis](#phylogeny_analysis)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1 _De novo_](#de_novo)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2 Fragment insertion](#fragm_insert)


<a id='setup'></a>

## 0. Setup

The cell below will import all the packages required in the downstream analyses as well as set all the necessary variables and data paths.

In [None]:
from qiime2 import Visualization

# location of this week's data and all the results produced by this notebook 
# - this should be a path relative to your working directory
data_dir = 'phylogeny_data'

In [None]:
%%bash -s $data_dir
# Please do NOT modify this cell - here we copy the required data into
# your personal Jupyter workspace.

mkdir -p "$1"
cp -rn /data/phylo_data/* "$1"
chmod -R +rxw "$1"

<a id='phylogeny_analysis'></a>

## 1. Phylogeny analysis

Several diversity metrics calculated downstream (e.g., Faith's PD and UniFrac) require a phylogenetic tree built from either OTUs or ASVs. We can distinguish two main approaches to phylogeny reconstruction:

1. [_de novo_ reconstruction](#de_novo)
2. [reference-based fragment insertion](#fragm_insert)

Below, you will see how to use both.

<a id='de_novo'></a>

### 1.1 Phylogeny _de novo_

In this approach we align marker gene sequences (like those of the 16S rRNA gene) across divergent taxa and reconstruct the tree from the resulting alignment. One limitation of this approach is that short sequences (like the ones we are using in this experiment) may not carry enough information to capture a meaningful phylogeny.

#### 1.1.1 Sequence alignment
Let's first use the `mafft` action from the `alignment` plugin to obtain a multiple sequence alignment of our sequences:

In [None]:
! qiime alignment mafft \
    --i-sequences $data_dir/rep-seqs-filtered.qza \
    --o-alignment $data_dir/aligned-rep-seqs.qza

#### 1.1.2 Alignment masking

Some authors have suggested that masking (removing) ambiguously aligned regions of the alignment (i.e., regions that are phylogenetically uninformative, for example because of alignment errors) can improve the accuracy of the reconstructed phylogeny. To mask the alignment, run the following cell:

In [None]:
! qiime alignment mask \
    --i-alignment $data_dir/aligned-rep-seqs.qza \
    --o-masked-alignment $data_dir/masked-aligned-rep-seqs.qza
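
To make the idea of masking more concrete, here is a minimal sketch in plain Python on a made-up toy alignment. It drops columns that are too gappy or too variable. The function `mask_alignment` and the toy data are hypothetical, and "conservation" is simplified here to the frequency of the most common non-gap character, which is only a stand-in for the metric the plugin actually uses. The threshold names are borrowed from the `--p-max-gap-frequency` and `--p-min-conservation` options of `alignment mask`; check `qiime alignment mask --help` for their defaults.

```python
from collections import Counter


def mask_alignment(seqs, max_gap_frequency=1.0, min_conservation=0.4):
    """Drop alignment columns that are too gappy or too variable.

    seqs maps sequence IDs to aligned sequences of equal length.
    Conservation is simplified to the frequency of the most common
    non-gap character in a column (an assumption of this sketch).
    """
    ids = list(seqs)
    length = len(seqs[ids[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        column = [seqs[i][col] for i in ids]
        gap_freq = column.count("-") / len(column)
        residues = [c for c in column if c != "-"]
        if residues:
            conservation = Counter(residues).most_common(1)[0][1] / len(residues)
        else:
            conservation = 0.0
        if gap_freq <= max_gap_frequency and conservation >= min_conservation:
            keep.append(col)

    # Rebuild every sequence from the retained columns only.
    return {i: "".join(seqs[i][c] for c in keep) for i in ids}


# Made-up toy alignment: column 3 is highly variable, column 5 is mostly gaps.
toy = {
    "seq1": "ACGT-A",
    "seq2": "ACTTGA",
    "seq3": "ACAT-A",
    "seq4": "ACCT-A",
}
masked = mask_alignment(toy, max_gap_frequency=0.5, min_conservation=0.4)
for seq_id, seq in masked.items():
    print(seq_id, seq)
```

With these thresholds the variable third column and the gappy fifth column are removed, while the four fully conserved columns are kept.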

#### 1.1.3 Tree construction

Finally, we can use the masked alignment to construct our phylogenetic tree. There are many methods for doing so, e.g., FastTree, RAxML, or IQ-TREE (all of which are supported in QIIME 2). Here, we will use FastTree, mainly because of its speed. FastTree produces an unrooted tree, so in a second step we will place the root at the midpoint of the longest tip-to-tip distance in the unrooted tree.

In [None]:
! qiime phylogeny fasttree \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --o-tree $data_dir/fasttree-tree.qza

! qiime phylogeny midpoint-root \
    --i-tree $data_dir/fasttree-tree.qza \
    --o-rooted-tree $data_dir/fasttree-tree-rooted.qza
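
The midpoint rooting step can be sketched in plain Python as follows. This is only an illustration on a made-up four-tip tree, not the code QIIME 2 runs: the tree, its branch lengths, and the helper functions `path_between`, `path_length`, and `midpoint` are all hypothetical. The sketch finds the two tips that are farthest apart and then walks along the path between them until half of that distance has been covered.

```python
from itertools import combinations

# Made-up unrooted tree: tips A-D, internal nodes n1 and n2, arbitrary branch lengths.
edges = [("A", "n1", 1.0), ("B", "n1", 2.0), ("n1", "n2", 1.0),
         ("C", "n2", 2.0), ("D", "n2", 6.0)]
tree = {}
for u, v, w in edges:
    tree.setdefault(u, {})[v] = w
    tree.setdefault(v, {})[u] = w


def path_between(tree, start, end):
    """Return the unique path from start to end as a list of nodes."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            return path
        for nxt in tree[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    raise ValueError("nodes are not connected")


def path_length(tree, path):
    """Sum the branch lengths along a path."""
    return sum(tree[a][b] for a, b in zip(path, path[1:]))


def midpoint(tree):
    """Return (tip1, tip2, branch, offset): the root lies on `branch`,
    `offset` away from the first node of that branch."""
    tips = [n for n, nbrs in tree.items() if len(nbrs) == 1]
    longest = max((path_between(tree, a, b) for a, b in combinations(tips, 2)),
                  key=lambda p: path_length(tree, p))
    half = path_length(tree, longest) / 2
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + tree[a][b] >= half:
            return longest[0], longest[-1], (a, b), half - walked
        walked += tree[a][b]


tip1, tip2, branch, offset = midpoint(tree)
print(f"Longest path runs from {tip1} to {tip2}; "
      f"root on branch {branch}, {offset} from {branch[0]}")
```

In this toy tree the longest tip-to-tip path runs from B to D (total length 9), so the root is placed on the long branch leading to D, at equal distance (4.5) from both tips.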

#### 1.1.4 Tree visualization

Let's try to visualize the tree. We can do this using the `empress` plugin for QIIME 2 or an online tool: [iTOL](https://itol.embl.de/upload.cgi).

First, we will use QIIME 2:

In [None]:
! qiime empress tree-plot \
    --i-tree $data_dir/fasttree-tree-rooted.qza \
    --m-feature-metadata-file $data_dir/taxonomy.qza \
    --o-visualization $data_dir/fasttree-tree-rooted.qzv

Open the resulting `.qzv` file on [view.qiime2.org](https://view.qiime2.org).

Now, for comparison, you can try [iTOL](https://itol.embl.de/upload.cgi).

After opening the web page, click _Choose File_ and select the tree artifact we generated above. Click _Upload_; after a few seconds you should see the tree. To label all the nodes with their corresponding taxonomies, find the _taxonomy.qza_ artifact and drag-and-drop it onto the tree. This will add the labels (don't worry if a warning about a couple of missing features appears; these are the taxa we filtered out earlier). If you want, you can also add the alignment itself to the tree by dragging and dropping it onto the tree in the same way.

You may find it easier to navigate the tree in its "rectangular" representation: to change the view, select the _Rectangular_ option in the _Mode_ section of the _Basic_ tab.

#### 1.1.5 Bootstrapping

Bootstrapping is a statistical approach to assessing the robustness of the branch splits in a tree. In simple terms, it rebuilds the tree _n_ times from resampled versions of the alignment (columns drawn at random with replacement) and counts how often each branch split reappears. Bootstrapping is a lengthy process, but if you are interested you can see below how it can be done in QIIME 2. The tree generated with this method will carry an additional set of _bootstrap values_ that you will then be able to see on the tree (in the iTOL browser).

**Note:** This step takes >30 min to run.

In [None]:
! qiime phylogeny raxml-rapid-bootstrap \
    --i-alignment $data_dir/masked-aligned-rep-seqs.qza \
    --p-seed 1723 \
    --p-rapid-bootstrap-seed 9384 \
    --p-bootstrap-replicates 100 \
    --p-substitution-model GTRCAT \
    --p-n-threads 3 \
    --o-tree $data_dir/raxml-cat-bootstrap-tree.qza
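
The resampling idea behind bootstrapping can be illustrated with a small sketch in plain Python. Everything here is an assumption made for illustration: the four-sequence alignment is made up, and a simple distance-based rule for four taxa (`supports_ab_cd`, a hypothetical helper) stands in for full tree inference, whereas RAxML infers a maximum-likelihood tree for each replicate. The sketch resamples alignment columns with replacement and counts how often the split separating A and B from C and D is recovered.

```python
import random

# Made-up, already aligned toy sequences (four taxa, ten columns).
alignment = {
    "A": "ACGTACGTAC",
    "B": "ACGTACGTTC",
    "C": "TCGAACCTAG",
    "D": "TCGAACCTTG",
}


def hamming(s, t):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(s, t))


def supports_ab_cd(aln):
    """True if pairing A with B and C with D gives the strictly smallest
    sum of within-pair distances among the three possible pairings."""
    def d(x, y):
        return hamming(aln[x], aln[y])

    ab_cd = d("A", "B") + d("C", "D")
    ac_bd = d("A", "C") + d("B", "D")
    ad_bc = d("A", "D") + d("B", "C")
    return ab_cd < min(ac_bd, ad_bc)


def bootstrap_support(aln, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which the split AB|CD is recovered."""
    rng = random.Random(seed)
    n_cols = len(aln["A"])
    hits = 0
    for _ in range(replicates):
        # Draw as many columns as the alignment has, with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = {name: "".join(seq[c] for c in cols)
                     for name, seq in aln.items()}
        hits += supports_ab_cd(resampled)
    return hits / replicates


support = bootstrap_support(alignment, replicates=200, seed=42)
print(f"Bootstrap support for the split AB|CD: {support:.2f}")
```

A split that is backed by many alignment columns will be recovered in most replicates and receive a high support value, while a split that depends on only one or two columns will often disappear after resampling.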

Now visualize the new tree using your method of choice. Remember to root the tree first, as the `raxml-rapid-bootstrap` action produces an unrooted tree.

In [None]:
# your code goes here

<a id='fragm_insert'></a>

### 1.2 Fragment insertion

An alternative to _de novo_ tree reconstruction is **fragment insertion**. In this method, instead of constructing the entire tree from scratch, we take a tree that has already been built and only insert our sequences into it.
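
To make the idea concrete, here is a deliberately simplified sketch in plain Python. It is not how SEPP works internally: the reference tree, the sequences, and the function `insert_fragment` are all made up, the query is assumed to be already aligned to the reference columns, and the placement rule (attach the query next to the reference tip with the smallest Hamming distance) is a crude stand-in for a proper placement algorithm. What it does show is the key property of the approach: the reference topology stays fixed and only the new sequence is added.

```python
# Made-up reference tree (nested tuples) and aligned reference sequences.
reference_tree = (("ref1", "ref2"), ("ref3", "ref4"))
reference_seqs = {
    "ref1": "ACGTACGT",
    "ref2": "ACGTACGA",
    "ref3": "TTGTACCA",
    "ref4": "TTGTACCC",
}


def hamming(s, t):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(s, t))


def insert_fragment(tree, refs, name, seq):
    """Attach `name` as a sister of its most similar reference tip.

    The rest of the reference tree is left untouched.
    """
    nearest = min(sorted(refs), key=lambda r: hamming(refs[r], seq))

    def graft(node):
        if node == nearest:
            return (node, name)
        if isinstance(node, tuple):
            return tuple(graft(child) for child in node)
        return node

    return graft(tree)


new_tree = insert_fragment(reference_tree, reference_seqs, "query", "TTGTACCA")
print(new_tree)
```

The query is identical to `ref3` in this toy example, so it is grafted next to that tip while the `ref1`/`ref2` clade is unchanged.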

As our reference, we will use a tree that was built from the Greengenes 13_8 database at 99% identity.

In [None]:
! wget -nv -O $data_dir/sepp-refs-gg-13-8.qza https://data.qiime2.org/2021.4/common/sepp-refs-gg-13-8.qza

**Note:** This is a resource-intensive command that requires a large amount of memory and may take a long time to run (>30 min). Do not set the number of threads below to more than 2, as this also increases memory demand and may cause your workspace to crash.

In [None]:
! qiime fragment-insertion sepp \
    --i-representative-sequences $data_dir/rep-seqs-filtered.qza \
    --i-reference-database $data_dir/sepp-refs-gg-13-8.qza \
    --p-threads 2 \
    --o-tree $data_dir/sepp-tree.qza \
    --o-placements $data_dir/sepp-tree-placements.qza

Finally, you can proceed to tree visualization with your method of choice. Keep in mind that this tree is already rooted, so there is no need to run the `phylogeny midpoint-root` action.

### 1.3 Checkpoint

Look at the trees obtained using the _de novo_ and fragment insertion approaches. What is the main difference between them?