# Workshop 8
# Phylogenetics

In this workshop we will be working with a multiple sequence alignment of bacterial genomes. We will then create and visualise a phylogenetic tree from the alignment and use geographical information to answer questions related to a public-health crisis.


## Background

It's 2010 and there has been a massive earthquake in Haiti. This has led to a dramatic increase in the number of cases of cholera, the disease caused by the bacterium *Vibrio cholerae*. This is the first time in more than 100 years that there has been a cholera outbreak in Haiti. Urgent efforts are underway to identify the source of the outbreak.

Fortunately, in addition to field epidemiologists carrying out on-site investigations in the affected areas, we have whole-genome sequencing (WGS) data for some of the isolates from the outbreak.
 
You are the lead bioinformatician in charge of analysing this data. You need to process and interpret the data, and provide feedback to the public health organisations in charge of outbreak response.

**Why might there be an increase in cholera cases following an event like an earthquake?**

Your answer here.


**What is unusual about the genome of _V. cholerae_?**

Your answer here.

## Alignment file

`v_cholerae.aln` contains an alignment of all the high-quality variable sites found in the core genome from 19 *V. cholerae* isolates from Haiti and 21 other *V. cholerae* isolates from around the world.

[`snippy-core`](https://github.com/tseemann/snippy) was used to generate the alignment. Snippy was also used in workshop 5.

**Describe the content of the file. What format is the file in?**

Your answer here.

<br>

## Calculate distance matrix

First we need to know how different the aligned sequences are from each other. We will use a tool called [`snp-dists`](https://github.com/tseemann/snp-dists) to calculate the SNP distance between each pair of sequences in `v_cholerae.aln`. The tool outputs these distances as a matrix, which we will use to plot a histogram in Task 1.

In [None]:
# Use the following command to run snp-dists to create a file of pairwise distances.
!snp-dists v_cholerae.aln > pairwise_distances.tsv
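
To make the idea of a pairwise SNP distance concrete, the sketch below counts the differing positions between a few toy aligned sequences. The sequences and the helper `snp_distance` are invented for illustration. This is a simplified approximation, not `snp-dists` itself, and it assumes that only positions where both sequences carry A, C, G or T are compared.

```python
from itertools import product

# Toy alignment (hypothetical sequences, all the same length).
toy_alignment = {
    "isolate_A": "ACGTACGT",
    "isolate_B": "ACGTACGA",
    "isolate_C": "TCGTANGA",
}

VALID_BASES = set("ACGT")


def snp_distance(seq1, seq2):
    """Count positions where two aligned sequences differ.

    Assumption: positions with a gap or ambiguous base in either
    sequence are skipped.
    """
    return sum(
        1
        for a, b in zip(seq1.upper(), seq2.upper())
        if a in VALID_BASES and b in VALID_BASES and a != b
    )


names = list(toy_alignment)

# Distance for every ordered pair, including each sample against itself.
toy_matrix = {
    (n1, n2): snp_distance(toy_alignment[n1], toy_alignment[n2])
    for n1, n2 in product(names, repeat=2)
}

# Print the result as a tab-separated matrix.
print("\t".join([""] + names))
for n1 in names:
    print("\t".join([n1] + [str(toy_matrix[(n1, n2)]) for n2 in names]))
```

Every sample is at distance 0 from itself, and the distance from one sample to another is the same in both directions.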

## Task 1 - Summary of distances

Import the pairwise distance matrix created in the previous step into a pandas data frame. If your `.tsv` file is in a different directory, update the file path accordingly. The data is stored in a tab-separated format, so set `delimiter='\t'`.

**NOTE:** The resulting matrix is a **symmetric matrix**, which is a property of distances: $$\mathrm{dist}[i,j] = \mathrm{dist}[j,i]$$

In [None]:
# Import the relevant libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import skbio

# Import the pairwise distances file.
pwd = pd.read_csv('pairwise_distances.tsv', delimiter='\t', encoding='utf-8', index_col=0)

# Display a section of the distance matrix.
pwd.iloc[:5,:5]

In [None]:
# Get the dimensions of the dataframe
pwd_dim = pwd.shape
pwd_dim

<br>

#### Plot a distance heatmap

To visualise the distance matrix, try plotting a heatmap below.

**HINT:** We made a heatmap plot using `seaborn` (<a href="https://seaborn.pydata.org/generated/seaborn.heatmap.html">See documentation</a>) in workshop 7. 

In [None]:
# Plot a heatmap of the pairwise SNP distances to show that distances are symmetric


# Your code here

You can also check <a href="https://matplotlib.org/stable/tutorials/colors/colormaps.html">here</a> for some alternative colour maps! But be aware, some are more appropriate for heatmaps than others.

As you can see above, the distance matrix is symmetric. To avoid counting each distance twice when we generate summary statistics, we must use only half of the distances (one triangle of the distance matrix).
We can do this using <a href="https://numpy.org/doc/stable/reference/generated/numpy.triu_indices.html">`np.triu_indices`</a>.

In [None]:
# Get the indices of the upper triangle, excluding the diagonal (k=1)
triu_ind = np.triu_indices(pwd_dim[0], k=1)

# Extract the distances at these indices into a 1D numpy array.
distances = pwd.values[triu_ind]

# Find the median pairwise distance.
np.median(distances)
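
The sketch below repeats the same steps on a small toy matrix to show why only one triangle is used. The matrix values are invented for illustration and are unrelated to the *V. cholerae* data.

```python
import numpy as np

# Toy symmetric distance matrix for 4 hypothetical samples.
toy = np.array([
    [0, 2, 5, 7],
    [2, 0, 4, 6],
    [5, 4, 0, 3],
    [7, 6, 3, 0],
])

# A symmetric matrix is equal to its own transpose.
is_symmetric = np.array_equal(toy, toy.T)

# Indices strictly above the diagonal (k=1): each pair appears exactly once.
rows, cols = np.triu_indices(toy.shape[0], k=1)
upper = toy[rows, cols]

n = toy.shape[0]
print("Symmetric:", is_symmetric)
print("Upper-triangle distances:", upper)
print("Number of pairs:", len(upper), "expected:", n * (n - 1) // 2)

# Using the whole matrix includes the zero diagonal and shifts the median.
print("Median of upper triangle:", np.median(upper))
print("Median of full matrix:", np.median(toy))
```

For `n` samples there are `n * (n - 1) / 2` distinct pairs, which is the number of values the upper triangle contains.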

#### Pairwise distance histogram

Now plot a histogram of the pairwise distances. Histograms were covered in workshop 3.

In [None]:
# Plot a histogram of the pairwise SNP distances using your choice of plotting tool.


# Your code here



<br>


**How well do you think the median alone describes the dataset?**

Your answer here.

<br>

**What does the histogram tell you about the relatedness of the samples in your dataset? What further analysis or plotting could you do to investigate this?**

NOTE: There is no correct/incorrect answer.

Your answer here.

There appear to be various 'clusters' of samples: some samples are very close to each other but different from the rest. We can demonstrate this by performing clustering with <a href="https://seaborn.pydata.org/generated/seaborn.clustermap.html">`sns.clustermap`</a>.

In [None]:
# Plot a clustered heatmap using seaborn
import seaborn as sns

# sns.clustermap returns a ClusterGrid object rather than a single Axes
grid = sns.clustermap(pwd, vmin=0, cmap="plasma_r", linewidths=0.05,
                      cbar_kws={'label': "Pairwise distances"},   # Set colour bar label
                      xticklabels=False, yticklabels=True)  # Only show y axis labels

# Show plot
plt.show()

<br>

## Task 2 - Create a phylogenetic tree

We will now use the <a href="https://github.com/iqtree/iqtree2">`IQ-TREE2`</a> tool to generate a **maximum likelihood** tree.

By default, IQ-TREE will test a wide range of **nucleotide substitution models** to see which **best fits** the data. For larger alignments this can be a very time-consuming step, so if speed is important, it might make sense to skip it and specify a general model (e.g. GTR) instead. You can find more information about how to run a specific model by looking at `iqtree2 -h`.
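
As a rough illustration of how candidate models can be compared, the sketch below ranks three models by the Bayesian information criterion (BIC), which, as far as we are aware, is the criterion IQ-TREE uses by default for model selection. The log-likelihoods and alignment length are invented numbers, not output from this dataset, and the helper `bic` is hypothetical. Branch-length parameters are ignored for simplicity because they would be the same for every candidate on a fixed tree.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: lower values indicate a better trade-off."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


n_sites = 1000  # hypothetical alignment length

# Hypothetical (log-likelihood, free substitution-model parameters) per model.
candidates = {
    "JC": (-5200.0, 0),
    "HKY": (-5100.0, 4),
    "GTR": (-5095.0, 8),
}

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best_model = min(scores, key=scores.get)

# Print the models from best (lowest BIC) to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}\t{score:.1f}")
print("Best model by BIC:", best_model)
```

In this made-up example the most parameter-rich model has the highest likelihood, but the penalty for its extra parameters means a simpler model is preferred.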

IQ-TREE2 is installed on SWAN. 

If you wish to install it on your (Unix based) personal device, use the following command:
> `conda install -c bioconda iqtree`

In [None]:
# Use the following command to run iqtree2 and create a phylogenetic tree.
!iqtree2 -s v_cholerae.aln

<br>

To answer the questions below, you may need to read parts of the IQ-TREE2 <a href="https://github.com/iqtree/iqtree2">documentation</a>. A list of common substitution models used by IQ-TREE can be found <a href="http://www.iqtree.org/doc/Substitution-Models">here</a>.

**NOTE**: If you clear the above output, the contents are still available in `v_cholerae.aln.log` - one of the output files.

**Examine the IQ-TREE output. Which model was selected as the best?**

Your answer here.


**What is the full name of the model?  Hint: what do the initials stand for?**

Your answer here.


**Examine the output file `v_cholerae.aln.treefile`. What format is this file in?**

Your answer here.

<br>

`v_cholerae.aln.iqtree` provides a summary of the results, including an ASCII-style phylogenetic tree diagram. In Task 3 we will improve on this level of tree visualisation.

## Task 3 - Visualising the tree with added geographical information

Download the tree (the file with the extension *.treefile*) to your local computer. 

Change the file extension of the tree file to *.nwk*. You will also need the `geog_loc_microreact.csv` file. 

Follow the link below and upload both the `v_cholerae.aln.nwk` file and the `geog_loc_microreact.csv` file. 

>https://microreact.org/upload 

<br>

Spend some time becoming familiar with the various options and buttons. Microreact is a great tool for investigating relationships between phylogeny and geography (phylogeographical relationships).

Try experimenting with the different tree visualisation settings. Use the show controls button in the top right of the white tree pane.

**Identify the genome with the ID SRR135545 in the tree. Which country was this isolate from?**

Your answer here.

**Search this ID in the [SRA (sequence read archive)](https://www.ncbi.nlm.nih.gov/sra). What instrument was used to sequence the genome?**

Your answer here.

**Are the isolates from Haiti a monophyletic group? Is this consistent with the heatmap plot above?**

Your answer here.

**Is there phylogeographical signal in the *V. cholerae* genomes from within Haiti? In other words, are the isolates from different regions of Haiti genetically distinct from each other? If you were given another Haitian isolate with no geographical information, how confident would you be in your prediction of its geographic origin?**

Your answer here.

**Where does the closest genetic neighbour of the Haitian genomes originate from?**

Your answer here.

**What would you write if you were asked to summarise your results for the public health officials?**

Your answer here.

<br>

## Task 4 - Average distances between countries

We can use `skbio` to import the tree file produced by `IQ-TREE2`.

`skbio` has a range of methods for tree objects, including visualising and traversing, which we will explore below.

See the <a href="http://scikit-bio.org/docs/0.5.6/generated/skbio.tree.TreeNode.html#skbio.tree.TreeNode">documentation</a> for more methods to apply to `skbio` tree objects.

In [None]:
import skbio
iqtree_output = skbio.tree.TreeNode.read("v_cholerae.aln.treefile")

In [None]:
# Visualise tree
print(iqtree_output.ascii_art())

In [None]:
# Access a specific node
reference_node = iqtree_output.find("Reference")
reference_node

In [None]:
# Compute the distance between two nodes
query_node = iqtree_output.find("ERR1879431")
reference_node.distance(query_node)
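
As we understand it, the `distance` method sums the branch lengths along the path joining two nodes. The sketch below reproduces that idea on a toy tree in plain Python. The tree, its node names and the helpers `path_to_root` and `tree_path_distance` are hypothetical and are not part of `skbio`.

```python
# Toy rooted tree (hypothetical names and branch lengths).
# Each node maps to (parent, length of the branch leading up to that parent).
toy_tree = {
    "root": (None, 0.0),
    "inner": ("root", 0.5),
    "tip_A": ("inner", 0.25),
    "tip_B": ("inner", 0.75),
    "tip_C": ("root", 1.5),
}


def path_to_root(tree, node):
    """Return {ancestor: distance from node to that ancestor}, including node itself."""
    distances = {}
    total = 0.0
    while node is not None:
        distances[node] = total
        parent, length = tree[node]
        total += length
        node = parent
    return distances


def tree_path_distance(tree, node1, node2):
    """Sum of branch lengths along the path joining two nodes."""
    up1 = path_to_root(tree, node1)
    up2 = path_to_root(tree, node2)
    # The path passes through the closest shared ancestor,
    # which is the shared ancestor giving the smallest total.
    return min(up1[n] + up2[n] for n in up1 if n in up2)


print(tree_path_distance(toy_tree, "tip_A", "tip_B"))
print(tree_path_distance(toy_tree, "tip_A", "tip_C"))
```

Two tips that share a recent ancestor are separated by a short path, while tips that only meet at the root are separated by a longer one.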

<br>

Let's say that we want to know how different the samples from Haiti are, compared to samples from another country. How could we use skbio tree methods to find the mean distance between the reference and all of the samples from Haiti?

For this we will need to load the metadata (*geog_loc_microreact.csv*) provided on the LMS. This *.csv* file contains information about each sample, including which country it is from.

In [None]:
metadata = pd.read_csv('geog_loc_microreact.csv', sep="\t")
metadata.head(5)

Let's write a function that takes an skbio tree object and returns the mean distance between the "*Reference*" sample and all samples (nodes) from a given country. Assume `country_name` is a string. Assume the tree is an skbio tree object. Assume that the reference sample will always be named "*Reference*" and that the column names of `meta_data` are as above.

In [None]:
import numpy as np

def mean_distance(tree, country_name, meta_data):
    """
    Compute the mean distance between the reference node and all samples from country_name. 
    Assume that tree is an skbio tree object
    Assume that country_name is a string that corresponds to a country in meta_data.
    Assume that meta_data is a pandas dataframe with column names "id" for samples and "country" for country name.
    Return the mean as a floating-point number.
    """
    # Your code here
        

In [None]:
# Test your function here
print(mean_distance(iqtree_output, "Haiti", metadata))

print(mean_distance(iqtree_output, "Mexico", metadata))

Here we can say that, on average, the samples from Mexico are more closely related to the reference than the new samples from Haiti. This is because they have a smaller distance.

**The distances here are very different from the pairwise distances that we saw in Task 1. Why is that?**

<br>

<br>

## Extension - Plotting pairwise distances to observe country-of-origin differences

Use the pairwise distances from `snp-dists` to plot another `seaborn` clustermap. This time, try incorporating information about country of origin so that the plot illustrates differences between countries.



In [None]:
# Plot a clustermap using seaborn

<br>

`Workshop developed by Steven Morgan, Dr. Dieter Bulach, Dharmesh Bhuva and Holly Whitfield.`