# Microbial Genomics: Lab 5
## Topic: Phylogenetic Trees
#### Tools used: BLAST, FastTree

In [None]:
# Compatibility code for running in Google Colab - please run it, but feel free to ignore the details
import os

if "colab" in str(get_ipython()):
    from google.colab import drive

    drive.mount("/content/drive")
    os.chdir("/content/drive/My Drive/microbial_genomics_labs/labs")

In [None]:
!pip install biopython
!apt install ncbi-blast+ clustalw fasttree -y

## Part A: Lab Exercises (10 pts)
### Exercise 1: Understanding Tree Structures & Filetypes
In bioinformatics, trees are used for a large range of purposes: to reconstruct evolutionary history, compare multiple species in a study, relate phenotypes to genes, and more. To begin understanding how to use trees to accomplish these feats, we must first understand how they are constructed and stored.

Most tree-building algorithms take some kind of aligned sequence file as input:
* Aligned FASTA: similar to a standard FASTA file, except that the sequences it contains are aligned. We worked with these in Lab 3.
* PHYLIP: an alternative to aligned FASTA files, available in both interleaved and sequential (non-interleaved) versions

Both of these formats are text-based, and most alignment tools can output either. We'll often use PHYLIP format to avoid ambiguity with other fasta formats, but feel free to use either throughout this course.
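To make the PHYLIP layout concrete, here is a minimal sketch that writes a small alignment in sequential PHYLIP form. The sequence names and the `to_phylip` helper are made up for illustration; the sketch assumes the strict convention of a header line with the sequence and column counts, followed by rows whose first 10 characters hold the name.

```python
def to_phylip(aligned):
    """Format a dict of {name: aligned_sequence} as sequential PHYLIP text."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an alignment must have the same length")
    n_cols = lengths.pop()
    # Header line: number of sequences, then number of alignment columns.
    lines = [f" {len(aligned)} {n_cols}"]
    for name, seq in aligned.items():
        # Strict PHYLIP reserves the first 10 characters of each row for the name.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Hypothetical toy alignment; '-' marks a gap.
example = {"seq_one": "ACGT-ACGT", "seq_two": "ACGTTACGT", "seq_three": "AC-TTACGT"}
phylip_text = to_phylip(example)
print(phylip_text)
```

In practice your alignment tool will write this file for you; the point is only that the format is plain text you can inspect by eye.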

Though there are a few different tree output formats, the most common is Newick. A Newick tree file (often using a `.tree` or `.newick` suffix) is a text-based file that contains information about the nodes, edges, and leaves of the tree. It will look something like this:

`(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);`

This tree has 4 leaves (named `A` through `D`) and includes a distance for each branch. A sub-tree appears within parentheses, and the number after each colon is the length of the branch connecting that leaf (or subtree) to its parent node. A visual representation would look like this (with each `-` representing 0.1 distance units):
```
|-A
|--B  
|     |---C
|-----|
      |----D
```
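The way parentheses, commas, and colons encode the tree can be sketched as a small recursive parser. This is an illustration only, assuming simple Newick with no quoted labels or bracketed comments; `parse_newick` and `leaf_depths` are hypothetical helper names, not part of any library.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {"name": str, "length": float or None, "children": list}.
    Quoted labels and bracketed comments are not handled.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick tree must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional label, then optional ':length'.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_depths(node, depth=0.0):
    """Return {leaf name: summed branch length from the top of the tree}."""
    if not node["children"]:
        return {node["name"]: depth}
    depths = {}
    for child in node["children"]:
        depths.update(leaf_depths(child, depth + (child["length"] or 0.0)))
    return depths


tree = parse_newick("(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);")
depths = leaf_depths(tree)
for leaf, depth in depths.items():
    print(leaf, round(depth, 2))
```

Each printed value is the number of `-` characters (times 0.1) between the left edge of the drawing and that leaf.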

If we added in names for the internal nodes, the tree would look like this:

`(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;`

That is, we now have `E` and `F` as internally named nodes:
```
|-A
|--B
F     |---C
|-----E
      |----D
```

Note that the Newick file will always have all required information to build the structure of the tree, but may or may not contain node names and distances; by convention, the minimal possible tree contained in a Newick file would be `(,,(,));`, i.e., no node names or distances, which would look like:
```
|-----*
|-----*
|     |-----*
|-----|
      |-----*
```
The branches are all set to an arbitrary length, and the names are left blank. This is technically a valid tree, but it's not very useful.
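As a quick check of the claim that the structure survives without names or distances, the sketch below strips a Newick string down to its bare topology and counts nodes from the remaining punctuation. It assumes simple Newick (no quoted labels or comments), and the helper names are illustrative.

```python
import re


def strip_to_topology(newick):
    """Drop names and branch lengths, keeping only the Newick structure characters."""
    # Assumes simple Newick: no quoted labels or bracketed comments.
    return re.sub(r"[^(),;]", "", newick)


def count_nodes(newick):
    """Return (leaves, internal nodes) for a simple Newick string."""
    topology = strip_to_topology(newick)
    internal = topology.count("(")    # every '(' opens one internal node
    leaves = topology.count(",") + 1  # a tree with n leaves has n - 1 commas
    return leaves, internal


named = "(A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;"
print(strip_to_topology(named))  # (,,(,));
print(count_nodes(named))        # (4, 2)
```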

**Use the file `ex_tree.newick` in the `lab5` folder to answer the following questions:**
1. How many nodes are in this tree? How many sub-trees? Manually draw the tree out as shown above.
2. Go to the [ETE Toolkit Newick Viewer](http://etetoolkit.org/treeview/) and upload this file. Uncheck "Resolve Taxonomic IDs" and select "Do not display alignment", then click "View Tree". This should display a visualization of your tree. How close was your manually-drawn tree? Based on the node names, what do you think this tree is showing?
3. Re-run the same steps from above, but this time check "Resolve Taxonomic IDs" and select "Aligned Blocks" before clicking "View Tree". What is being shown this time? Based on this information, can you tell more accurately what the tree is describing?

In [None]:
# Exercise 1

### Exercise 2: Using FastTree
FastTree is a program that builds trees from aligned sequences using an approximately-maximum-likelihood algorithm (although a full discussion of the implementation is out of scope for this course, you can read more about the algorithm [here](http://www.microbesonline.org/fasttree/)). As input, the program takes either an aligned FASTA file or a PHYLIP file, and it outputs a Newick-formatted tree.

Running fasttree is relatively simple; at a minimum, you provide it an input alignment and the output target:
`fasttree input.aln > output.tree`

There are many other options that control different models of tree construction, etc., but we will mostly be using the `-gtr` option (the generalized time-reversible nucleotide model) and the `-gamma` option (which rescales branch lengths under a Gamma model of rate variation). Note that if performing tree-building on nucleotide sequences, we'll also need to include the `-nt` flag.
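If you would rather drive FastTree from Python than from a Bash cell, the flags above can be assembled into a command list and passed to `subprocess`. This is a sketch under the assumption that the executable is installed as `fasttree`, as in the install cell at the top of this notebook; the function names and keyword arguments are invented for illustration.

```python
import shutil
import subprocess


def fasttree_command(alignment_path, nucleotide=True, gtr=True, gamma=True):
    """Build the argument list for a FastTree run."""
    if gtr and not nucleotide:
        raise ValueError("-gtr is a nucleotide model, so it requires -nt")
    command = ["fasttree"]
    if nucleotide:
        command.append("-nt")
    if gtr:
        command.append("-gtr")
    if gamma:
        command.append("-gamma")
    command.append(alignment_path)
    return command


def run_fasttree(alignment_path, tree_path, **options):
    """Run FastTree and write its stdout (the Newick tree) to tree_path."""
    command = fasttree_command(alignment_path, **options)
    with open(tree_path, "w") as tree_file:
        subprocess.run(command, stdout=tree_file, check=True)


command = fasttree_command("input.aln")
print(" ".join(command))

# Report whether a real run would be possible on this machine.
if shutil.which("fasttree") is None:
    print("fasttree not found on PATH; install it before calling run_fasttree")
```

Calling `run_fasttree("input.aln", "output.tree")` would then be equivalent to the shell redirect shown earlier, with the three flags added.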

**Use the short sequences in `lab5/short_seqs.fasta` to complete the following exercise:**
1. Align `short_seqs.fasta` using ClustalW (or any other alignment tool you'd like). Save the output as a new file, `aligned_seqs.phy` (using PHYLIP format rather than `.aln`). The [Clustal README](http://www.clustal.org/omega/README) linked here describes Clustal Omega; for ClustalW itself, run `clustalw -help` to see how to set the output file format.
2. Run `fasttree` on the alignment you generated using the `-nt`, `-gtr` and `-gamma` flags. Save the output to a file called `short_seqs.tree`. Remember to use `%%bash` at the top of a new cell for access to Bash.
3. Open the `short_seqs.tree` file. How many subtrees are there? What is the longest branch in the tree? Based on looking at the raw sequences in `short_seqs.fasta`, which two organisms do you think are furthest away from each other on the tree?
4. Visualize the tree using the ETE Viewer used in Exercise 1. Were your answers above correct?

In [None]:
# Exercise 2

In [None]:
%%bash

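# Minimal FastTree run: read the alignment and redirect the Newick tree (written to stdout) into a file
fasttree lab5/short_seqs.phy > lab5/output.tree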

### Exercise 3: Tree Rooting & Ancestry
The trees we have generated and analysed so far have been "unrooted"; that is, they describe how several sequences or organisms relate to each other, but make no assertions about how the sequences got to their current state. However, what if we want to know how a set of organisms evolved over time, and which emerged more recently than others? To do this type of analysis, we need to know which organism (represented by a leaf node in the tree) is the common ancestor of the tree. For example, if we knew that leaf node A was the common ancestor of the tree from above, we could root the tree at that node, which would look like this:
```
  |--B
A-|     |---C
  |-----|
        |----D
```
Note that the distances between all of the leaves in the tree are still the same as above; all we've done is shifted where we draw the internal nodes.
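One way to see why rooting leaves the leaf-to-leaf distances untouched is to store the example tree as a plain list of branches, with no root at all, and sum branch lengths along the path between each pair of leaves. The sketch below does this; `n0` and `n1` are hypothetical labels for the two unnamed internal nodes.

```python
from itertools import combinations

# Branches of the example tree as undirected edges. "n0" and "n1" are
# hypothetical labels for the two unnamed internal nodes.
EDGES = [("n0", "A", 0.1), ("n0", "B", 0.2), ("n0", "n1", 0.5),
         ("n1", "C", 0.3), ("n1", "D", 0.4)]


def build_adjacency(edges):
    """Map each node to its list of (neighbour, branch length) pairs."""
    adjacency = {}
    for left, right, length in edges:
        adjacency.setdefault(left, []).append((right, length))
        adjacency.setdefault(right, []).append((left, length))
    return adjacency


def distances_from(start, adjacency):
    """Walk the tree from start, summing branch lengths along each path."""
    distances = {start: 0.0}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in adjacency[node]:
            if neighbour not in distances:
                distances[neighbour] = distances[node] + length
                stack.append(neighbour)
    return distances


adjacency = build_adjacency(EDGES)
# Leaves are the nodes attached to exactly one branch.
leaves = sorted(node for node, links in adjacency.items() if len(links) == 1)
pairwise = {}
for first, second in combinations(leaves, 2):
    pairwise[(first, second)] = distances_from(first, adjacency)[second]

for pair, distance in pairwise.items():
    print(pair, round(distance, 2))
```

Nothing in this calculation refers to a root, so redrawing the tree from any node gives the same table of distances.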

Because a Newick file contains all the information for the connections in a tree regardless of how we draw it, rooting is typically done in the visualization stage of tree-building, rather than in the actual computational building phase. That is, FastTree is agnostic to how we draw our tree; it just tells us how much distance is between each node and leaves the rest to us. We won't spend too much time on tree visualization strategies for now, but understanding rooting is important to the interpretation of tree data.

**Answer the questions below based on our discussion:**
1. What would you expect a rooted tree to look like, if we wrote it in Newick format? Try writing out the above tree in Newick.
2. What are some of the ways that you would go about determining a common ancestor in a tree of bacterial strains? List some of the tools you'd use, as well as an outline of the steps you would take.

In [None]:
# Exercise 3

## Part B: Homework (30 pts)

In this lab, we will build two trees and compare them. The first tree will consist of a single gene that is extracted from several bacterial genomes; we'll look at how this gene differs based on the organism it comes from. The second will use a set of 7 genes that collectively define the "Multi-Locus Sequence Type", or MLST, of an organism. By looking at the presence or absence of each of these genes, we can essentially tell how closely or distantly related a genome is to a well-defined reference. Each reference has several well-defined characteristics, such as pathogenicity, virulence, geographic prevalence, etc., that we may be interested in for clinical studies.

#### Question 1: Single-gene MSA (8 pts)

The `lab5/MLST` folder contains sequences from 7 genes. Choose one of these genes to build a tree.

For each bacterial genome in the `lab5/genomes` folder, BLAST your chosen gene against the genome and save the sequence of the top hit. Combine the resulting sequences from all 10 genomes into a single file and use clustalw to align them; save the result in a PHYLIP file.
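The "save the sequence of the top hit" step can be sketched as follows, assuming you ran BLAST with tabular output (`-outfmt 6`) and already have the genome sequence in memory as a string. The rows, contig, and helper names below are made up for illustration and are not taken from the lab data.

```python
def top_hit(blast_tabular):
    """Return the highest-bitscore row of BLAST tabular (outfmt 6) text, or None."""
    best = None
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        hit = {
            "subject": fields[1],
            "sstart": int(fields[8]),
            "send": int(fields[9]),
            "bitscore": float(fields[11]),
        }
        if best is None or hit["bitscore"] > best["bitscore"]:
            best = hit
    return best


def reverse_complement(sequence):
    """Complement each base, then reverse the result."""
    return sequence.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def extract_hit(subject_sequence, hit):
    """Slice the hit out of the subject; BLAST coordinates are 1-based and inclusive."""
    start, end = hit["sstart"], hit["send"]
    if start <= end:
        return subject_sequence[start - 1:end]
    # sstart > send means the hit lies on the reverse strand.
    return reverse_complement(subject_sequence[end - 1:start])


# Hypothetical two-row result. Default outfmt 6 columns: qseqid sseqid pident
# length mismatch gapopen qstart qend sstart send evalue bitscore
rows = "\n".join([
    "geneX\tcontig1\t90.0\t6\t0\t0\t1\t6\t3\t8\t1e-5\t20.1",
    "geneX\tcontig1\t100.0\t6\t0\t0\t1\t6\t12\t7\t1e-9\t35.5",
])
contig = "AACCGGTTAACCGGTT"
best = top_hit(rows)
print(best["bitscore"], extract_hit(contig, best))
```

Handling the reverse strand matters here: if some hits are extracted in the opposite orientation from others, the alignment in the next step will be meaningless.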

In [None]:
# Question 1

#### Question 2: Single-gene tree (7 pts)

Use FastTree to build a tree out of your alignment from Question 1 and view it in the ETE Tree Viewer. Answer the following questions:
1. Which two organisms are furthest away on the tree? Look these two organisms up in the NCBI database and provide some potential reasons for this.
2. How many subtrees are present in your tree? Are there any obvious clusters you can see?
3. What do you think this tree is saying?

In [None]:
# Question 2

#### Question 3: Multi-locus tree (10 pts)

Repeat Question 1, but instead of choosing a single gene, perform the BLAST search for all 7 MLST genes against every genome (i.e., you should be performing 7 * 10 = 70 BLAST searches).
Your pipeline should follow this logical "pseudocode":
```pseudocode

all_mlsts = empty list

for organism in organism_list:

    organism_mlst = empty fasta sequence
    
    for gene in gene_list:
        blast gene against organism
        retrieve top hit
        extract hit sequence
        append hit sequence to organism_mlst
    
    append organism_mlst to all_mlsts

save all_mlsts as all_mlsts.fasta
align all_mlsts.fasta > organism_mlst_aligned.fasta (make sure to set output format to FASTA)
use gblocks (see below) to process organism_mlst_aligned.fasta, save as organism_mlst_conserved.aln
    
use fasttree to build a tree out of organism_mlst_conserved.aln > mlst.tree
        
```
This code is essentially performing an MLST analysis on our genomes, and telling us which are closest according to the MLST model (see [this paper](https://www.nature.com/articles/s41598-017-04707-4#Sec2) for a similar example). We use [Gblocks](https://ngphylogeny.fr/tools/tool/276/form) in order to find conserved regions within our alignment (i.e., most of the genome will not align to the gene, and we don't need to keep that extra information). Gblocks should be used with the default settings; just upload each of your alignment files to the tool and save the output as a new alignment. All of your files should be saved in the folder `lab5/results`. Feel free to write this code either in a standalone script or within this notebook.
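The bookkeeping in the nested loop (one concatenated record per organism, genes always joined in the same order) might be sketched like this. The BLAST step is deliberately left as a caller-supplied function, `find_hit`, which is a hypothetical placeholder; the `fake_hit` stand-in exists only so the sketch runs on its own.

```python
def concatenate_hits(organisms, genes, find_hit):
    """Build {organism: concatenated hit sequences}, one entry per organism.

    find_hit(gene, organism) is a caller-supplied function (hypothetical here)
    that returns the top-hit sequence as a string.
    """
    all_mlsts = {}
    for organism in organisms:
        # Genes are always joined in the same order so columns stay comparable.
        all_mlsts[organism] = "".join(find_hit(gene, organism) for gene in genes)
    return all_mlsts


def to_fasta(records, width=60):
    """Format {name: sequence} as FASTA text with wrapped sequence lines."""
    lines = []
    for name, sequence in records.items():
        lines.append(f">{name}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Stand-in for the real BLAST step, used only to exercise the bookkeeping.
def fake_hit(gene, organism):
    return {"geneA": "ACGT", "geneB": "TTGA"}[gene]


mlsts = concatenate_hits(["org1", "org2"], ["geneA", "geneB"], fake_hit)
print(to_fasta(mlsts))
```

Keeping the BLAST call behind a single function also makes it easy to test the rest of your pipeline before running all 70 searches.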

In [None]:
# Question 3

#### Question 4: Tree comparison (5 pts)

We now have two trees. Visualize both in the ETE Tree Viewer and answer the following questions:
1. How do the two trees compare? Are there any significant differences between the two? Describe any major shifts you see.
2. Which tree do you think is more reliable as a measure of similarity between our organisms? Why?
3. How would you go about rooting these trees?

In [None]:
# Question 4