# Environment:

Log in to the GDAV server:

`ssh youruser@IP`

Create a directory in your home folder called `compgenomics_ex2`

`mkdir compgenomics_ex2`

and enter the directory:

`cd compgenomics_ex2`

All the files needed for this exercise are available on the GDAV server at `/home/compgenomics/4proteomes/`. Make sure you can see them and take a few seconds to understand what they contain:

```
$ ls /home/compgenomics/4proteomes/
4proteomes.faa    -> all protein sequences from 4 species: human, elephant, Zebrafish and Ciona intestinalis
G3T0S8_LOXAF.faa  -> protein sequence of the elephant gene called G3T0S8
TPH1A_rerio.faa   -> protein sequence of the Zebrafish gene called TPH1A
TPH2_human.faa    -> protein sequence of Human TPH2
scripts/          -> a directory with ad hoc programs and scripts
```

## Tools needed
(already installed in the GDAV server)

- BLAST+
- IQ-Tree
- MAFFT
- BioPython
- ete3
- scripts/extract_seqs_from_blast_result.py
- scripts/midpoint_rooting.py

# Exercise

## Goal 1

Reconstruct the phylogeny of all TPH2 proteins in the 4 target proteomes, and interpret the tree to identify orthologs. 

### 1. Identify TPH homologs
As in exercise 1, use BLAST to identify all significant hits of the query protein `TPH2_human.faa` in the 4proteomes dataset. 

Tip: You can reuse the BLAST database of exercise 1 and use a command similar to:

```
$ blastp -task blastp -query /home/compgenomics/ex1/TPH2_human.faa -db ~/compgenomics_ex1/4proteomes.blastdb -outfmt 6 -evalue 0.001 > TPH2_homologs.blastout
```
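The `-outfmt 6` option produces a tab-separated table with 12 standard columns (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score). As a minimal sketch of how such a file could be inspected in Python, the snippet below parses the table and lists the unique subject IDs. The function names and the example rows are made up for illustration; they are not part of the course scripts.

```python
import csv
import io

# The 12 default columns of BLAST tabular output (-outfmt 6)
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def read_blast6(handle, max_evalue=0.001):
    """Yield one dict per hit whose e-value is at or below max_evalue."""
    reader = csv.DictReader(handle, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        if row["evalue"] <= max_evalue:
            yield row

def unique_subjects(hits):
    """Return subject IDs in order of first appearance, without duplicates."""
    seen = []
    for hit in hits:
        if hit["sseqid"] not in seen:
            seen.append(hit["sseqid"])
    return seen

# Made-up rows (hypothetical IDs and scores), only to exercise the parser
example = (
    "query1\tsubjA\t100.0\t490\t0\t0\t1\t490\t1\t490\t0.0\t1010\n"
    "query1\tsubjB\t65.2\t440\t150\t3\t40\t478\t10\t448\t1e-180\t520\n"
    "query1\tsubjB\t30.0\t50\t35\t0\t1\t50\t400\t449\t5.0\t25.0\n"
)
hits = list(read_blast6(io.StringIO(example)))
subjects = unique_subjects(hits)
print(subjects)
```

To try it on your own result, replace the `io.StringIO(example)` handle with `open("TPH2_homologs.blastout")`. Note that one subject can appear in several rows, which is why the IDs are de-duplicated.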

### 2. Extract all homologs in FASTA format

Extract all homologs of the TPH2 sequences using the script provided in `/home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py`. 

```
$ python /home/compgenomics/ex1/scripts/extract_seqs_from_blast_result.py TPH2_homologs.blastout /home/compgenomics/ex1/4proteomes.faa > TPH2_homologs.faa
```
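The internals of `extract_seqs_from_blast_result.py` are not shown here, so the following is only a sketch of the general idea under simple assumptions: read a FASTA file, and keep the records whose ID (taken to be the first word of the header) appears in a list of hit IDs. The helper names and the toy sequences are hypothetical.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def select_records(fasta_handle, wanted_ids):
    """Keep records whose ID (first word of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    selected = []
    for header, seq in read_fasta(fasta_handle):
        seq_id = (header.split() or [""])[0]
        if seq_id in wanted:
            selected.append((header, seq))
    return selected

# Toy proteome with made-up IDs and sequences
proteome = (
    ">P1_HUMAN_X some description\nMKTA\nYIAK\n"
    ">P2_DANRE_y\nMKTIAK\n"
    ">P3_CIOIN_z\nMAAA\n"
)
selected = select_records(io.StringIO(proteome), ["P1_HUMAN_X", "P3_CIOIN_z"])
for header, seq in selected:
    print(">" + header)
    print(seq)
```

Sequences split over several lines are joined back into one string, so the output can differ from the input in line wrapping but not in content.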

### 3. Multiple Sequence Alignment (MSA)
Before inferring a phylogenetic tree, homologous sequences need to be aligned. Several programs can do this: ClustalOmega, MAFFT, MUSCLE, etc. Here, we will use MAFFT, which has a very simple command line.

```
$ mafft TPH2_homologs.faa > TPH2_homologs.alg
```

Check the content of the output (saved in TPH2_homologs.alg). What's the main difference compared to the input FASTA file?
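
One way to explore this question is to compare, for every sequence, its length and its number of `-` characters in both files. The snippet below is a small sketch of that comparison on made-up toy sequences; the function name `fasta_stats` is hypothetical.

```python
def fasta_stats(text):
    """Map each sequence ID to (length, number of '-' characters)."""
    stats, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Use the first word of the header as the sequence ID
            current = (line[1:].split() or [""])[0]
            stats[current] = [0, 0]
        elif line and current is not None:
            stats[current][0] += len(line)
            stats[current][1] += line.count("-")
    return {name: tuple(values) for name, values in stats.items()}

# Made-up toy sequences, only to show the comparison
unaligned = ">seq1\nMKTAYIAK\n>seq2\nMKTIAK\n"
aligned = ">seq1\nMKTAYIAK\n>seq2\nMKT--IAK\n"

before = fasta_stats(unaligned)
after = fasta_stats(aligned)
for name in before:
    print(name, "input:", before[name], "alignment:", after[name])
```

To run it on the real data, pass `open("TPH2_homologs.faa").read()` and `open("TPH2_homologs.alg").read()` instead of the toy strings.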

### 4. Phylogenetic Reconstruction
As with MSA programs, there are many programs for building phylogenetic trees: RAxML, IQ-TREE, PhyML, FastTree, MrBayes, PhyloBayes, etc. Here we will use IQ-Tree, which uses a Maximum Likelihood approach.

You only need to provide the MSA file as input, plus some parameters defining how exhaustive the inference should be. To get a fast result, the following arguments are recommended (they skip the model testing step, which is very slow).

```
$ iqtree -s TPH2_homologs.alg -m LG
```

### 5. Visualize tree

The main IQ-Tree output is the file ending with the `.treefile` extension. The tree file is in [Newick format](https://en.wikipedia.org/wiki/Newick_format).

You can use the command line tool `ete3` to display the tree directly in the terminal:

```
$ ete3 view --text -t TPH2_homologs.alg.treefile 

   /-A0A0R4ILE6_DANRE_tph2
  |
  |                  /-A0A2R8RPJ0_DANRE_th
  |               /-|
  |              |  |   /-TY3H_HUMAN_TH
  |            /-|   \-|
  |           |  |      \-G3U1E7_LOXAF_TH
  |         /-|  |
  |        |  |   \-Q1LWZ5_DANRE_th2
  ...
```
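
Because the `.treefile` is plain Newick text, simple questions can be answered without a tree library. The sketch below pulls the leaf names out of a Newick string with a regular expression and counts leaves per species code, assuming names follow the `ID_SPECIES_gene` pattern seen in the output above and contain no quoted labels. The toy tree reuses a few of those names, but its topology, support values, and branch lengths are made up.

```python
import re
from collections import Counter

def newick_leaf_names(newick):
    """Return leaf labels: the names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def species_counts(leaf_names):
    """Count leaves per species code (second '_'-separated field)."""
    return Counter(name.split("_")[1] for name in leaf_names)

# Toy tree: made-up topology and branch lengths, for illustration only
toy = (
    "(((TY3H_HUMAN_TH:0.1,G3U1E7_LOXAF_TH:0.2)0.9:0.3,"
    "A0A2R8RPJ0_DANRE_th:0.4)0.8:0.2,"
    "Q1LWZ5_DANRE_th2:0.4,F6Y7Q5_CIOIN_th:0.5);"
)
names = newick_leaf_names(toy)
counts = species_counts(names)
print(names)
print(dict(counts))
```

Internal node labels such as support values follow a `)` rather than a `(` or `,`, which is why the pattern skips them.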

### 6. Root the tree

By default, the phylogenetic trees returned by almost all programs are UNROOTED. There are many methods to root a tree, and a common one is midpoint rooting, which places the root halfway along the path between the two most distant leaves. For convenience, an *ad hoc* script to root Newick trees is provided in `/home/compgenomics/ex1/scripts/midpoint_rooting.py`.

You can use it to root your tree before visualizing it:

```
$ python /home/compgenomics/ex1/scripts/midpoint_rooting.py TPH2_homologs.alg.treefile | ete3 view --text 

            /-A0A2R8RPJ0_DANRE_th
         /-|
        |  |   /-TY3H_HUMAN_TH
      /-|   \-|
     |  |      \-G3U1E7_LOXAF_TH
   /-|  |
  |  |   \-Q1LWZ5_DANRE_th2
  |  |
  |   \-F6Y7Q5_CIOIN_th
  |
  |         /-Q7SYH6_DANRE_pah
--|      /-|
  |     |  |   /-PH4H_HUMAN_PAH
  |   /-|   \-|
...
```
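
To make the idea behind midpoint rooting concrete, here is a minimal sketch that does not reproduce the provided script (whose internals are not shown here). It treats an unrooted tree as a weighted graph, finds the two most distant leaves, and reports the branch that contains the midpoint of the path between them. The node names, branch lengths, and function names are all hypothetical.

```python
def add_edge(graph, a, b, length):
    """Store an undirected branch of the given length between a and b."""
    graph.setdefault(a, {})[b] = length
    graph.setdefault(b, {})[a] = length

def farthest(graph, start):
    """Return (node, distance, parent_map) for the node farthest from start."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for neigh, length in graph[node].items():
            if neigh not in dist:
                dist[neigh] = dist[node] + length
                parent[neigh] = node
                stack.append(neigh)
    far = max(dist, key=dist.get)
    return far, dist[far], parent

def midpoint_edge(graph):
    """Locate the midpoint of the longest leaf-to-leaf path in the tree."""
    any_node = next(iter(graph))
    leaf_a, _, _ = farthest(graph, any_node)
    leaf_b, diameter, parent = farthest(graph, leaf_a)
    half = diameter / 2.0
    # Walk from leaf_b toward leaf_a until half of the path is covered
    node, walked = leaf_b, 0.0
    while True:
        prev = parent[node]
        step = graph[node][prev]
        if walked + step >= half:
            return (node, prev), half - walked, diameter
        walked += step
        node = prev

# Toy unrooted tree: leaves A-D, internal nodes n1 and n2 (made-up lengths)
tree = {}
add_edge(tree, "A", "n1", 1.0)
add_edge(tree, "B", "n1", 2.0)
add_edge(tree, "n1", "n2", 3.0)
add_edge(tree, "C", "n2", 1.0)
add_edge(tree, "D", "n2", 6.0)

edge, offset, diameter = midpoint_edge(tree)
print("longest path length:", diameter)
print("root goes on branch", edge, "at distance", offset, "from", edge[0])
```

In this toy tree the most distant leaves are B and D, so the root would be placed on the long branch leading to D. For real trees, `ete3` offers `get_midpoint_outgroup()` and `set_outgroup()` on its tree objects, which perform the rerooting itself.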


Alternatively, you can print the content of the file in the terminal and paste it into one of the online tree visualization servers (or transfer the file and upload it):

- <http://etetoolkit.org/treeview>
- <http://itol.embl.de>

### Questions: 

- What's the evolutionary relationship between F1R1D3_DANRE_tph1a and Q6IWP4_DANRE_tph1b?
- What's the evolutionary relationship between TPH2_HUMAN_TPH2 and TPH1_HUMAN_TPH1?
- What are the Zebrafish (*Danio rerio*) ortholog(s) of the human sequence `TPH1_HUMAN_TPH1`?
- How many duplication events can you identify?
- How many putative orthologous groups can you identify?  

## Goal 2

Repeat the same protocol to find all homologs of the P53 sequence in the 4 target proteomes, build a phylogeny, and identify orthologs.
```
>P53_HUMAN_TP53
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAP
PQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRH
KKLMFKTEGPDSD
```

### Questions:
1. Could you identify duplication and speciation events?
2. How many putative orthologous groups can you identify?  
3. Is there anything unusual in the evolution of this gene family?
4. Is the tree rooted? Where would you root it? 
5. Upload the tree into http://itol.embl.de and explore rooting options. Does it change the inference of duplication events?

