# Week 7 Lab: Phylogenetics and COVID-19

Skills: Multiple sequence alignment, Phylogenetics, Installing tools, Finding data

Note: there are no exercise notebooks this week, just one report notebook to fill out. This will give you extra time to work on your proposals, which are due on Friday!

## Intro

COVID-19 is caused by a novel type of coronavirus, a class of viruses that can cause disease ranging from the common cold to more severe conditions like MERS or SARS. In this lab, we'll analyze the genomes of multiple strains of the novel coronavirus (SARS-CoV-2), and use comparative genomics techniques to explore how the virus relates to other types of coronaviruses.

In your final project, you will aim to reproduce results of a published paper by obtaining their data and following their methods. In this lab, we will similarly use published work to guide us (although we're combining data from multiple sources and deviating a bit, since we don't have access to all the same datasets and are trying to use tools that take less time to run).

Our primary goal will be to reconstruct the phylogenetics analysis from [A Novel Coronavirus from Patients with Pneumonia in China, 2019](https://www.nejm.org/doi/full/10.1056/NEJMoa2001017), one of the earliest reports of the full SARS-CoV-2 genome. Specifically, we'll be producing something like their [Figure 4b](https://www.nejm.org/doi/full/10.1056/NEJMoa2001017):

<img src="COVID-19-tree.png">

This tree shows the relationship between multiple SARS-CoV-2 genomes and other coronavirus strains, including SARS, MERS, and coronavirus strains isolated from bats.

Unfortunately, they don't give us a whole lot of methods info to go on. Here is a snippet from the relevant methods part: "Multiple-sequence alignment of the 2019-nCoV and reference sequences was performed with the use of Muscle. Phylogenetic analysis of the complete genomes was performed with RAxML (13) with 1000 bootstrap replicates and a general time-reversible model used as the nucleotide substitution model." Not a whole lot of details, but with enough detective work we'll still be able to construct a similar tree!

## Overview

Our overall goal will be to compare genomes of different virus species. Whereas in previous labs, we usually started from raw reads (in fastq format), for this lab, we'll work with existing assemblies for the viral genomes of interest and skip doing the assembly ourselves. If you choose to do the extra credit part, you'll try your hand at starting from the raw reads and using your own assembly in the phylogenetics analysis.

In this lab, we'll go through:
1. Obtaining genome assemblies from NCBI.
2. Performing multiple sequence alignment.
3. Building a phylogenetic tree to explore the evolutionary relationship between virus strains.
4. Visualizing phylogenetic trees.
5. (Extra credit) Obtaining raw reads from SRA and performing assembly.

This lab will also be a warm-up for your final project. In previous labs, we have mostly set up the data and tools for you beforehand. But when you start working on your own research projects, you'll probably find that the setup process, and wrangling datasets into the correct formats, is often harder to figure out than actually doing the analysis! So in this lab, we'll give you some pointers on how to obtain the data and tools you need. We'll also give some tips in lecture. But you'll be doing a lot of those steps on your own.

### Summary of tools covered
In this lab we'll be using or referring to the following tools:

* [mafft](https://mafft.cbrc.jp/alignment/software/): for performing multiple sequence alignment
* [RAxML](https://github.com/stamatak/standard-RAxML): for building phylogenetic trees


## Note

Unlike in previous weeks, rather than scattering questions throughout the instructions notebook, this week all report instructions are just given in the report notebook.

# 1. Downloading the data

Our first step will be to download the assembled genomes for the viruses we'd like to compare. To help you get started, we've provided (based on manually copying from the figure above...) the NCBI accession numbers for the genomes to compare:

```
/datasets/cs185-sp21-A00-public/week7/lab7_accessions.txt
```

This is just a text file, with one accession per line, where each accession is unique to a virus strain. These accessions, and brief descriptions, are also listed here: https://docs.google.com/spreadsheets/d/1p1JpKKj1lUmGqrq2fdnX-FHvv7wt-jIdyY4qrRXyT9I/edit?usp=sharing

To see info about a certain accession, you can go to NCBI using a link like: https://www.ncbi.nlm.nih.gov/nuccore/AY508724.1

You'll see for instance that this genome is from "SARS coronavirus NS-1, complete genome" (the original SARS from the early 2000s). You can also scroll through the entire genome sequence since it's so short!

To see another accession, replace the last value after the "/" in the URL with the accession you're interested in.

Our goal in this first section is to download the genomes for each of these accessions into one big fasta file, which we will need to input into the tools we use below. So you'd like to create a fasta file that looks something like this:


```
>AY508724.1 SARS coronavirus NS-1, complete genome
TACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGT...
>AY485277.1 SARS coronavirus Sino1-11, complete genome
ATATTAGGTTTTTACCTACCCAGGAAAAGCCAACCAACCTCGATCTCTTGTAGATCTGTTCTCTAAACG...
```

Put your fasta file in `~/week7/lab7_virus_genomes.fa`.

Some things you might find helpful:
* URLs of the form: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AY508724&rettype=fasta&retmode=text (similar for other accessions) will point you to the genome sequence for each accession (in fasta format)
* You can use the commands `wget` or `curl` to directly download files from URLs.

A good way to go about this would be to write a for loop or similar script that loops through the accessions, constructs the URL to the fasta file for each, then uses `wget` to download it. Our solution is able to do all of this in just one line of UNIX commands, although yours doesn't have to.
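The loop described above could be sketched roughly as follows. This is just one possible approach, not our reference solution. The function names `efetch_url`, `download_genomes`, and `count_fasta_records` are made up for illustration, and the sketch assumes `wget` is available and that efetch accepts the versioned accessions (e.g. `AY508724.1`) as they appear in the accession file.

```bash
# Build the efetch URL for one accession (same URL pattern as in the hint above).
efetch_url() {
    local acc="$1"
    echo "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${acc}&rettype=fasta&retmode=text"
}

# Usage: download_genomes ACCESSION_FILE OUTPUT_FASTA
# Downloads each accession (one per line) and appends it to a single fasta file.
download_genomes() {
    local acc_file="$1" out_fa="$2" acc
    mkdir -p "$(dirname "$out_fa")"
    : > "$out_fa"                      # start from an empty output file
    while read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue      # skip blank lines
        wget -q -O - "$(efetch_url "$acc")" >> "$out_fa"
        sleep 1                        # short pause between requests to NCBI
    done < "$acc_file"
}

# Number of sequences in a fasta file (one ">" header line per sequence).
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, you could run `download_genomes /datasets/cs185-sp21-A00-public/week7/lab7_accessions.txt ~/week7/lab7_virus_genomes.fa`. Afterwards, `count_fasta_records ~/week7/lab7_virus_genomes.fa` should match the number of accessions in the list.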

# 2. Performing multiple sequence alignment

Now that we've collected all the genome sequences we need in a single fasta file, we'll want to compare them to each other. A typical and critical step in doing so is to create a "multiple sequence alignment" between them, so we can compare nucleotides at corresponding positions across strains.

The NEJM paper used a tool called [MUSCLE](https://www.ebi.ac.uk/Tools/msa/muscle/) for this. We'll deviate from their methods and use an alternative tool called [mafft](https://mafft.cbrc.jp/alignment/software/), since we found that it runs quite a bit faster. Both `mafft` and `MUSCLE` take as input a fasta file with multiple genomes and output a new fasta-like file showing the alignment between those genomes.

## 2.1 Install mafft
You'll first need to install `mafft`. Since we don't have "root" permissions, you'll want to follow their instructions here: https://mafft.cbrc.jp/alignment/software/installation_without_root.html to install to your own home directory.

Some notes that may be helpful:
* If you change the top line of the Makefile to `PREFIX=$(HOME)/local`, it will install to your home directory
* This will install `mafft` to `$HOME/local/bin`. You can run on the command line with the full path: `$HOME/local/bin/mafft`. 
* Alternatively, you can add this directory to your `$PATH`, which is where UNIX searches for tools. For instance if you do
```
export PATH=$PATH:$HOME/local/bin
```
You should then be able to just type `mafft` at the command line like you would for other tools. Before moving on, make sure that typing `mafft` or `$HOME/local/bin/mafft` at the command line brings up the mafft instructions.

Note: if you close and reopen your terminal, you'll have to type this line again to reset the `$PATH`.
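If you'd rather not retype the `export` line in every new terminal, one option is to append it to your shell startup file. The sketch below assumes your shell is bash and reads `~/.bashrc` at startup (an assumption about your setup, so check which shell you use). The helper name `add_path_line` is made up for illustration.

```bash
# Append the PATH line to a startup file, but only if it is not already there.
# Usage: add_path_line [STARTUP_FILE]   (defaults to ~/.bashrc, an assumption)
add_path_line() {
    local rc="${1:-$HOME/.bashrc}"
    # Single quotes keep $PATH and $HOME unexpanded until the shell starts.
    local line='export PATH=$PATH:$HOME/local/bin'
    touch "$rc"
    grep -qxF "$line" "$rc" || echo "$line" >> "$rc"
}
```

After calling `add_path_line`, open a new terminal (or run `source ~/.bashrc`) for the change to take effect.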

## 2.2 Run mafft

Once mafft is installed, you'll need to run it! If you type the `mafft` command with no arguments, it will walk you through inputting the fasta file and ask for a name for the output file. Run `mafft` and save the output to a file `lab7_virus_genomes.aln`. Look at the output file. It should have many gap "-" characters in addition to nucleotides in it. This is so that all the comparable nucleotides are lined up in the final multiple sequence alignment.
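If you'd prefer to skip the interactive prompts, mafft can also be run directly from the command line. Here is a minimal sketch, along with a quick sanity check that every aligned sequence has the same length (gaps included), which should hold for any valid multiple sequence alignment. The function names `run_mafft` and `alignment_lengths` and the `MAFFT` variable are made up for illustration.

```bash
# Name of the mafft executable; override if it is not on your PATH,
# e.g. MAFFT=$HOME/local/bin/mafft
MAFFT="${MAFFT:-mafft}"

# Usage: run_mafft INPUT_FASTA OUTPUT_ALIGNMENT
run_mafft() {
    # --auto lets mafft choose an alignment strategy based on the input size.
    # The alignment is written to stdout; progress messages go to stderr.
    "$MAFFT" --auto "$1" > "$2"
}

# Print the distinct aligned sequence lengths (gaps included), one per line.
# A valid multiple sequence alignment gives exactly one value.
alignment_lengths() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }
             { len += length($0) }
        END  { if (seen) print len }
    ' "$1" | sort -n | uniq
}
```

For example: `run_mafft ~/week7/lab7_virus_genomes.fa ~/week7/lab7_virus_genomes.aln`, then `alignment_lengths ~/week7/lab7_virus_genomes.aln`. If the second command prints more than one number, something went wrong with the alignment file.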

# 3. Building a tree

Now, we're ready to build our tree. We'll use a tool called RAxML (the same one used in the NEJM paper). You can find the RAxML manual here: https://github.com/stamatak/standard-RAxML/blob/master/manual/NewManual.pdf. I found this manual *very overwhelming*. We'll give some guidance on how to use this tool to build our tree.

## 3.1 Install RAxML

You'll first have to install RAxML. We downloaded the `.tar.gz` file for v8.2.12 from the GitHub releases page: https://github.com/stamatak/standard-RAxML/releases and followed the instructions on their GitHub home page (https://github.com/stamatak/standard-RAxML) for compiling the version "SSE3.PTHREADS". After it compiles, it should create a binary file `raxmlHPC-PTHREADS-SSE3`. It is convenient to copy it to a location on your `$PATH`, e.g.:

```
cp raxmlHPC-PTHREADS-SSE3 ~/local/bin/raxml
```

Then you can just run the commands by typing `raxml` at the command line.
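Put together, the install steps above might look something like the sketch below. It assumes you have already downloaded the release `.tar.gz` into the current directory, and that the makefile for this variant is named `Makefile.SSE3.PTHREADS.gcc`, as described in the RAxML README (double check against the version you downloaded). The function name `build_raxml` is made up for illustration.

```bash
# Usage: build_raxml TARBALL [BIN_DIR]
# Unpacks the RAxML source, compiles the SSE3 + Pthreads variant,
# and copies the binary to BIN_DIR (default: ~/local/bin) under the name "raxml".
build_raxml() {
    local tarball="$1" bin_dir="${2:-$HOME/local/bin}" src_dir
    # Name of the top-level directory stored inside the archive.
    src_dir=$(tar -tzf "$tarball" | head -n 1 | cut -d/ -f1)
    tar -xzf "$tarball" || return 1
    # Compiling should produce a binary called raxmlHPC-PTHREADS-SSE3.
    make -C "$src_dir" -f Makefile.SSE3.PTHREADS.gcc || return 1
    mkdir -p "$bin_dir"
    cp "$src_dir/raxmlHPC-PTHREADS-SSE3" "$bin_dir/raxml"
}
```

You would call it as `build_raxml` followed by the name of the tarball you downloaded. If `~/local/bin` is on your `$PATH`, typing `raxml -h` afterwards should print the RAxML help text.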

## 3.2 Run RAxML

Now, use RAxML to build a tree. We will first build a maximum likelihood tree, and annotate the branches with confidence values, as was done in the NEJM figure. Even though they didn't give us much detail, we do know:
* They used bootstrapping to annotate support for each split. Their methods say they did 1000 bootstraps. We'll do fewer to save time. Our solution used 100. But you might want to try with fewer to make sure things are working first.
* They used a "general time-reversible model". (Hint: look for "GTRCAT" in the RAxML manual.)

We'll actually need to use multiple RAxML commands:
* The first will find the maximum likelihood tree based on our mafft alignment.
* The second will perform the bootstrap search. This one can take a while. You might want to use `nohup`.
* The third will draw the bipartitions (bootstrap values) on the best tree generated by the first command. This should be fast.

Figure out from the RAxML manual, or from googling, how to do this. You might find Step 4 of this tutorial helpful (more helpful than the manual...): https://cme.h-its.org/exelixis/web/software/raxml/hands_on.html
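As a rough guide, the three commands described above might look like the following sketch, which is adapted from the pattern in Step 4 of that tutorial and is not necessarily identical to our solution. The run names `lab7_raxml_ml` and `lab7_raxml_boot`, the seed `12345`, the function name `build_tree`, and the `RAXML` variable are arbitrary choices made for illustration. `-T` sets the number of threads for the PTHREADS build.

```bash
# Name of the RAxML executable; override if yours is called something else.
RAXML="${RAXML:-raxml}"

# Usage: build_tree ALIGNMENT [N_BOOTSTRAPS] [N_THREADS]
build_tree() {
    local aln="$1" nboot="${2:-100}" threads="${3:-2}"

    # 1. Maximum likelihood search: writes RAxML_bestTree.lab7_raxml_ml
    "$RAXML" -T "$threads" -m GTRCAT -p 12345 -s "$aln" -n lab7_raxml_ml || return 1

    # 2. Bootstrap replicates (-b seed, -N count): writes RAxML_bootstrap.lab7_raxml_boot
    "$RAXML" -T "$threads" -m GTRCAT -p 12345 -b 12345 -N "$nboot" -s "$aln" -n lab7_raxml_boot || return 1

    # 3. Draw bootstrap support on the best tree (-f b):
    #    writes RAxML_bipartitions.lab7_raxml_bs and
    #    RAxML_bipartitionsBranchLabels.lab7_raxml_bs
    "$RAXML" -T "$threads" -m GTRCAT -p 12345 -f b \
        -t RAxML_bestTree.lab7_raxml_ml -z RAxML_bootstrap.lab7_raxml_boot -n lab7_raxml_bs
}
```

A quick trial run could be `build_tree lab7_virus_genomes.aln 10 2`. Note that RAxML refuses to overwrite output files from an earlier run with the same run name, so remove the old `RAxML_*` files (or change the names) before rerunning with more bootstraps.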

You should end up with a file named something like `RAxML_bipartitionsBranchLabels.lab7_raxml_bs` with your final tree containing branch labels with the bootstrap values. This file is in [newick format](https://en.wikipedia.org/wiki/Newick_format), a common file format to describe trees.

# 4. Visualize the tree

Finally, visualize your tree. We recommend you use one of the several online tools for viewing trees from Newick files, e.g.:

* [iTOL](https://itol.embl.de/upload.cgi)
* [ETE Treeview](http://etetoolkit.org/treeview/)

For both of these, you can directly copy the text of your newick file into the text box provided and visualize the output. You might want to improve on the initial visualization they provide, e.g.:
* Make sure bootstrap support values are displayed at the branchpoints.
* You may want to change the node labels, either programmatically or manually, to make interpreting your tree easier. For instance, rather than displaying accession codes like AY508724.1, you could label the nodes something like "SARS coronavirus NS-1" based on the annotations here: https://docs.google.com/spreadsheets/d/1p1JpKKj1lUmGqrq2fdnX-FHvv7wt-jIdyY4qrRXyT9I/edit?usp=sharing.
* You'll want to at least be able to pick out which leaves correspond to COVID-19 samples, the original SARS, and MERS.
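If you want to relabel the leaves programmatically, here is one possible sketch. It assumes you have saved the annotations as a tab-separated file with the accession in the first column and the desired label in the second (a hypothetical file, e.g. `labels.tsv`, that you would create yourself from the spreadsheet). Spaces and other special characters in labels are replaced with underscores, since unquoted Newick labels can't contain them. The function name `relabel_newick` is made up for illustration.

```bash
# Usage: relabel_newick LABELS_TSV TREE_FILE > RELABELED_TREE
# LABELS_TSV: accession <TAB> label, one pair per line.
relabel_newick() {
    local map="$1" tree="$2"
    awk -F'\t' '
        # First file: remember a Newick-safe label for each accession.
        NR == FNR { gsub(/[^A-Za-z0-9_.-]/, "_", $2); label[$1] = $2; next }
        # Second file: replace every literal occurrence of each accession.
        {
            for (acc in label) {
                out = ""
                rest = $0
                while ((i = index(rest, acc)) > 0) {
                    out = out substr(rest, 1, i - 1) label[acc]
                    rest = substr(rest, i + length(acc))
                }
                $0 = out rest
            }
            print
        }
    ' "$map" "$tree"
}
```

For example: `relabel_newick labels.tsv RAxML_bipartitionsBranchLabels.lab7_raxml_bs > lab7_tree_labeled.nwk` (the output file name is arbitrary). Accessions that are missing from the labels file are left unchanged.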

# 5. Write up

Once you've finished building your tree, head over to the report notebook to complete the lab!