# Phylogenetic Analysis

The goal of this notebook is to perform phylogenetic inference on the haplotypes identified in the brain. 

**Requirements:**
Make sure you have `Biopython`, `MAFFT`, and `ETE3` installed in your environment. 


In [2]:
import os 
import pandas as pd
import numpy as np
from copy import deepcopy
from subprocess import call 
from Bio import SeqIO, AlignIO, SeqRecord, Seq
from Bio.Align.Applications import MafftCommandline 

# from ete3 import Tree, TreeStyle, NodeStyle, TextFace, SequenceFace, CircleFace, faces, AttrFace 

In [3]:
# Path to output directory
outpath = "../../results/phylogeny"

if not os.path.exists(outpath):
    os.mkdir(outpath)

## Make haplotypes from mutations

I'll take the mutations from the variant calling analysis and make the haplotypes using the reference genome. 

In [7]:
# SSPE reference
reference = [base for record in SeqIO.parse("../../config/ref/MeVChiTok-SSPE.fa", "fasta") for base in record.seq]

# Import the mutations
haplotypes_df = pd.read_csv("../../results/variants/validated_variants.csv")

# We only need the mutations that have been assigned to a haplotype
haplotypes_df = haplotypes_df[~haplotypes_df.Haplotype.isin(['subclonal', 'both'])]

haplotypes_df.head()

Unnamed: 0,POS,REF,ALT,AF,DP,Effect,Gene_Name,AA_Change,Accession,Tissue,SNP,Gene,Background,Haplotype,Proposed_Background,Haplotype_Name
0,537,T,C,0.3265,132284,Missense,N,Ser144Pro,SSPE_1,SSPE 1,T537C,N,genome-2,genome-2,unknown,genome 2
1,684,C,T,0.114,127945,Synonymous,N,Leu193Leu,SSPE_1,SSPE 1,C684T,N,genome-1,cluster 1,unknown,cluster 1
5,1328,T,C,0.6809,131283,Synonymous,N,Ser407Ser,SSPE_1,SSPE 1,T1328C,N,genome-1,genome-1-1,unknown,genome 1
7,1632,G,A,0.3126,121322,Missense,N,Gly509Ser,SSPE_1,SSPE 1,G1632A,N,genome-2,cluster 7,unknown,cluster 11
9,2139,T,C,0.6473,62808,Synonymous,P/V/C,His111His,SSPE_1,SSPE 1,T2139C,P/V/C,genome-1,genome-1,unknown,genome 01


In [8]:
def make_haplotype(haplotype_name, reference):
    """
    Add the SNPs the reference to make a haplotype.
    """
    
    assert haplotype_name in set(haplotypes_df.Haplotype_Name.to_list())
    
    # SNPs for a given haplotype
    snps_df = haplotypes_df[haplotypes_df.Haplotype_Name == haplotype_name]
    
    # get a list of tuples for the SNPs
    snps = [(pos, alt) for pos, alt in zip(snps_df.POS, snps_df.ALT)]
    
    # Copy the reference 
    haplotype = deepcopy(reference)
    
    # Change the mutation in the list.
    for pos, alt in snps:
        haplotype[pos + 1] = alt
        
    return haplotype
    

### Haplotype Relationship

![haplotypes](example_tree.png)

I'll make the fasta files for each haplotype by cumulativley building them alone the above tree. Starting with the main backgrounds (Genome-1 and Genome-2)

In [9]:
# 1. Make genome-1 and genome-2
genome_1 = make_haplotype("genome 01", reference)
genome_2 = make_haplotype("genome 2", reference)

Then, I'll add the main branch-point from Genome-1

In [11]:
# 2. Make genome-1-1 and cluster 6
genome_1_1 = make_haplotype("genome 1", genome_1)
cluster_6 = make_haplotype("cluster 6", genome_1)

Then I'll add the Genome-1-1 children nodes

In [12]:
# 4. Make genome-1-1 children 
cluster_1 = make_haplotype("cluster 1", genome_1_1)
cluster_2 = make_haplotype("cluster 2", genome_1_1)
cluster_4 = make_haplotype("cluster 4", genome_1_1)
cluster_5 = make_haplotype("cluster 5", genome_1_1)
cluster_7 = make_haplotype("cluster 7", genome_1_1)
cluster_8 = make_haplotype("cluster 8", genome_1_1)

# 5. Cluster 3 is descended from cluster 4
cluster_3 = make_haplotype("cluster 3", cluster_4)

Then, I'll add the Genome-2 children.

In [13]:
# 6. Make genome-2 clusters
cluster_9 = make_haplotype("cluster 9", genome_2)
cluster_10 = make_haplotype("cluster 10", genome_2)
cluster_11 = make_haplotype("cluster 11", genome_2)
cluster_12 = make_haplotype("cluster 12", genome_2)
cluster_13 = make_haplotype("cluster 13", genome_2)

Then, I'll write all of these out to a fasta file. 

In [14]:
# 7. Save all the termimal nodes in a dictionary
haplotype_seqs = {
    "sspe_ancestor":"".join(nt for nt in reference),
    "genome_1":"".join(nt for nt in genome_1),
    "genome_2":"".join(nt for nt in genome_2),
    "genome_1_1":"".join(nt for nt in genome_1_1),
    "cluster_1":"".join(nt for nt in cluster_1),
    "cluster_2":"".join(nt for nt in cluster_2),
    "cluster_3":"".join(nt for nt in cluster_3),
    "cluster_4":"".join(nt for nt in cluster_4),
    "cluster_5":"".join(nt for nt in cluster_5),
    "cluster_6":"".join(nt for nt in cluster_6),
    "cluster_7":"".join(nt for nt in cluster_7),
    "cluster_8":"".join(nt for nt in cluster_8),
    "cluster_9":"".join(nt for nt in cluster_9),
    "cluster_10":"".join(nt for nt in cluster_10),
    "cluster_11":"".join(nt for nt in cluster_11),
    "cluster_12":"".join(nt for nt in cluster_12),
    "cluster_13":"".join(nt for nt in cluster_13)
}

# 8. Write out to a fasta file
haplotype_records = [SeqRecord.SeqRecord(Seq.Seq(seq), id = hap) for hap, seq in haplotype_seqs.items()]
SeqIO.write(haplotype_records, os.path.join(outpath, "haplotype-sequences.fa"), "fasta")

17

## Visualize the Tree

Make a tree with the Fasta file of each terminal haplotype. These are alredady aligned by the nature of how I made them. 

In [15]:
# Building the phylogeny with IQtree with 1000 bootstrap iterations with GTR+I+G (Invariable site plus discrete Gamma model) with asr
alignfasta = os.path.join(outpath, "haplotype-sequences.fa")
call(f"iqtree -s {alignfasta} -m GTR+I+G -bb 1000 -redo", shell=True)


IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    rhino02 (AVX512, FMA3, 754 GB RAM)
Command: iqtree -s ../../results/phylogeny/haplotype-sequences.fa -m GTR+I+G -bb 1000 -redo
Seed:    789293 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Thu Mar  2 20:40:40 2023
Kernel:  AVX+FMA - 1 threads (72 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 72 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file ../../results/phylogeny/haplotype-sequences.fa ... Fasta format detected
Alignment most likely contains DNA/RNA sequences
Alignment has 17 sequences with 15894 columns, 69 distinct patterns
47 parsimony-informative, 50 singleton sites, 15797 constant sites
               Gap/Ambiguity  Composition  p-

Iteration 460 / LogL: -22895.752 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 470 / LogL: -22895.757 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 480 / LogL: -22895.693 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 490 / LogL: -22895.759 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 500 / LogL: -22895.760 / Time: 0h:0m:4s (0h:0m:0s left)
Log-likelihood cutoff on original alignment: -22906.696
NOTE: Bootstrap correlation coefficient of split occurrence frequencies: 0.957
NOTE: UFBoot does not converge, continue at least 100 more iterations
Iteration 510 / LogL: -22895.693 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 520 / LogL: -22895.750 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 530 / LogL: -22895.789 / Time: 0h:0m:4s (0h:0m:0s left)
Iteration 540 / LogL: -22895.794 / Time: 0h:0m:5s (0h:0m:0s left)
Iteration 550 / LogL: -22895.693 / Time: 0h:0m:5s (0h:0m:0s left)
Log-likelihood cutoff on original alignment: -22906.696
Iteration 560 / LogL: -22895.794 / Time: 0h:0m:5s (0h:0m:0s left)
Iteration 570

0