# Phylogenetic Analysis

The goal of this notebook is to perform phylogenetic inference on the haplotypes identified in the brain. 

**Requirements:**
Make sure you have `Biopython`, `MAFFT`, `ETE3`, and `IQ-TREE` (called as `iqtree` below) installed in your environment. 
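
A quick way to confirm the requirements before running the notebook is sketched below. The helper name `check_requirements` is hypothetical; `Bio` and `ete3` are the import names used in this notebook, `iqtree` is the executable name called later on, and `mafft` is an assumed executable name that may differ on your system.

```python
import importlib.util
import shutil


def check_requirements(modules=("Bio", "ete3"), executables=("mafft", "iqtree")):
    """Return a dict mapping each requirement name to True if it can be found."""
    status = {}
    # Python packages: look for an importable top-level module
    for module in modules:
        status[module] = importlib.util.find_spec(module) is not None
    # Command-line tools: look for the executable on the PATH
    for exe in executables:
        status[exe] = shutil.which(exe) is not None
    return status


for name, found in check_requirements().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```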


In [3]:
import os 
import pandas as pd
import numpy as np
from copy import deepcopy
from subprocess import call 
from Bio import SeqIO, AlignIO, SeqRecord, Seq
from Bio.Align.Applications import MafftCommandline 

# from ete3 import Tree, TreeStyle, NodeStyle, TextFace, SequenceFace, CircleFace, faces, AttrFace 

In [4]:
# Path to output directory
outpath = "../../results/phylogeny"

# Create the output directory (and any missing parents) if it doesn't exist
os.makedirs(outpath, exist_ok=True)

## Make haplotypes from mutations

I'll take the mutations from the variant calling analysis and make the haplotypes using the reference genome. 
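
As a minimal sketch of the idea, the toy example below applies a few substitutions to a short made-up reference. It assumes, as the `pos - 1` indexing in this notebook does, that positions are 1-based; the function name `apply_snps` and the toy data are hypothetical.

```python
def apply_snps(reference, snps):
    """Return a copy of `reference` (a list of bases) with each (pos, alt) applied.

    `pos` is assumed to be 1-based, so it is shifted by one for list indexing.
    """
    haplotype = list(reference)  # copy, so the reference itself is not modified
    for pos, alt in snps:
        haplotype[pos - 1] = alt
    return haplotype


toy_reference = list("ACGTACGT")
toy_snps = [(2, "T"), (8, "A")]
toy_haplotype = apply_snps(toy_reference, toy_snps)

print("".join(toy_reference))  # ACGTACGT (unchanged)
print("".join(toy_haplotype))  # ATGTACGA
```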

In [5]:
# Final tree from SPRUCE/MACHINA
cluster_relationships = pd.read_csv("../../results/phylogeny/filtered_spruce_tree.csv")
cluster_relationships['from'] = cluster_relationships['from'].replace('_', ' ', regex=True)
cluster_relationships['to'] = cluster_relationships['to'].replace('_', ' ', regex=True)
rename_clusters = {"Anc": "SSPE ancestor"}
cluster_relationships['from'] = cluster_relationships['from'].replace(rename_clusters)

# SSPE reference
reference = [base for record in SeqIO.parse("../../config/ref/MeVChiTok-SSPE.fa", "fasta") for base in record.seq]

# Import the mutations
haplotypes_df = pd.read_csv("../../results/variants/validated_variants.csv")
rename_haplotypes = {"genome-1": "genome 01", 
                     "genome-1-1": "genome 1",
                     "genome-2": "genome 2"}
haplotypes_df['Haplotype'] = haplotypes_df['Haplotype'].replace(rename_haplotypes)

# We only need the mutations that have been assigned to a haplotype
haplotypes_df = haplotypes_df[~haplotypes_df.Haplotype.isin(['subclonal', 'both'])]

haplotypes_df.head()

Unnamed: 0,POS,REF,ALT,AF,DP,Effect,Gene_Name,AA_Change,Accession,Tissue,SNP,Gene,Cluster,Background,Haplotype
0,537,T,C,0.3265,132284,Missense,N,Ser144Pro,SSPE_1,SSPE 1,T537C,N,genome-2,genome-2,genome 2
1,684,C,T,0.114,127945,Synonymous,N,Leu193Leu,SSPE_1,SSPE 1,C684T,N,cluster 1,genome-1,cluster 4
5,1328,T,C,0.6809,131283,Synonymous,N,Ser407Ser,SSPE_1,SSPE 1,T1328C,N,genome-1-1,genome-1,genome 1
7,1632,G,A,0.3126,121322,Missense,N,Gly509Ser,SSPE_1,SSPE 1,G1632A,N,cluster 7,genome-2,cluster 10
9,2139,T,C,0.6473,62808,Synonymous,P/V/C,His111His,SSPE_1,SSPE 1,T2139C,P/V/C,genome-1,genome-1,genome 01


In [9]:
def make_haplotype(to_node, from_seq):
    """
    Add the SNPs assigned to `to_node` to the ancestral sequence to build a haplotype.
    """
    
    # SNPs for a given haplotype
    snps_df = haplotypes_df[haplotypes_df.Haplotype == to_node]
    
    # Get the set of (position, alternate allele) tuples for the SNPs
    snps = {(pos, alt) for pos, alt in zip(snps_df.POS, snps_df.ALT)}
    
    # Copy the parent sequence so the ancestor is left unchanged
    haplotype = deepcopy(from_seq)
    
    # Apply each mutation to the copy (POS is treated as 1-based)
    for pos, alt in snps:
        haplotype[pos - 1] = alt
        
    return haplotype
    

def traverse_tree(node, tree_df, haplotypes):
    """
    Traverse the finalized SPRUCE tree and build the haplotypes into a 
    dictionary before alignment.
    """
    children = tree_df[tree_df['from'] == node]

    if children.empty:
        return haplotypes

    for _, row in children.iterrows():
        from_node = row['from']
        to_node = row['to']
        from_seq = haplotypes[from_node]
        print(f"Makeing haplotype for {to_node} using background {from_node}")
        new_haplotype = make_haplotype(to_node, from_seq)
        haplotypes[to_node] = new_haplotype
        traverse_tree(to_node, tree_df, haplotypes)

    return haplotypes


### Haplotype Relationships

I'll make the FASTA sequences for each haplotype by cumulatively building them along the tree that we determined using `SPRUCE` and filtered down using bridging reads. 
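
The cumulative build can be illustrated on a toy tree. This is only a sketch under assumptions: the function `build_along_tree`, the edge list, and the SNP assignments are all made up for illustration, and it walks the tree with a queue rather than the recursion used in this notebook.

```python
from collections import deque


def build_along_tree(root, root_seq, edges, snps_by_node):
    """Build a sequence for every node by applying its SNPs to its parent's sequence.

    edges: list of (parent, child) pairs
    snps_by_node: {node: [(pos, alt), ...]} with 1-based positions
    """
    # Index the children of each parent for quick lookup
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)

    sequences = {root: list(root_seq)}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in children.get(parent, []):
            seq = list(sequences[parent])  # start from a copy of the parent
            for pos, alt in snps_by_node.get(child, []):
                seq[pos - 1] = alt
            sequences[child] = seq
            queue.append(child)

    return {node: "".join(seq) for node, seq in sequences.items()}


toy_edges = [("root", "A"), ("A", "A1"), ("root", "B")]
toy_snps = {"A": [(1, "T")], "A1": [(3, "C")], "B": [(4, "G")]}
toy_tree_seqs = build_along_tree("root", "AAAA", toy_edges, toy_snps)
print(toy_tree_seqs)
```

In the toy tree, `A1` carries both its own mutation and the one inherited from `A`, while `root` and `B` are unaffected by mutations on the other branch.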

In [10]:
cluster_relationships

Unnamed: 0,from,to,tree,treenum
0,genome 2,cluster 10,tree 1,1
1,genome 2,cluster 12,tree 1,1
2,genome 2,cluster 11,tree 1,1
3,genome 2,cluster 13,tree 1,1
4,genome 2,cluster 9,tree 1,1
5,genome 1,cluster 4,tree 1,1
6,genome 1,cluster 2,tree 1,1
7,genome 1,cluster 1,tree 1,1
8,genome 1,cluster 8,tree 1,1
9,genome 1,cluster 7,tree 1,1


In [11]:
# Root is the SSPE Reference Sequence
root = 'SSPE ancestor'
haplotypes = {}
haplotypes[root] = reference
# Build the haplotypes 
resulting_haplotypes = traverse_tree(root, cluster_relationships, haplotypes)
haplotype_seqs = {hap: "".join(seq) for hap, seq in resulting_haplotypes.items()}

Makeing haplotype for genome 2 using background SSPE ancestor
Makeing haplotype for cluster 10 using background genome 2
Makeing haplotype for cluster 12 using background genome 2
Makeing haplotype for cluster 11 using background genome 2
Makeing haplotype for cluster 13 using background genome 2
Makeing haplotype for cluster 9 using background genome 2
Makeing haplotype for genome 01 using background SSPE ancestor
Makeing haplotype for genome 1 using background genome 01
Makeing haplotype for cluster 4 using background genome 1
Makeing haplotype for cluster 2 using background genome 1
Makeing haplotype for cluster 1 using background genome 1
Makeing haplotype for cluster 5 using background cluster 1
Makeing haplotype for cluster 8 using background genome 1
Makeing haplotype for cluster 7 using background genome 1
Makeing haplotype for cluster 3 using background genome 1
Makeing haplotype for cluster 6 using background genome 01


In [12]:
haplotype_seqs['cluster 9'] == haplotype_seqs['genome 2']

False
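
The equality check only says whether two haplotypes differ. A small helper like the hypothetical `snp_differences` below (not part of the original analysis) would list where they differ, which makes it easier to confirm that a child carries exactly the SNPs assigned to it.

```python
def snp_differences(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for every mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]


# Toy strings standing in for two haplotype sequences
toy_diffs = snp_differences("ACGTAC", "ACGAAT")
print(toy_diffs)  # [(4, 'T', 'A'), (6, 'C', 'T')]
```

With the real data, a call such as `snp_differences(haplotype_seqs['genome 2'], haplotype_seqs['cluster 9'])` could be compared against the rows of `haplotypes_df` assigned to `cluster 9`.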

In [13]:
# Write out to a fasta file
haplotype_records = [SeqRecord.SeqRecord(Seq.Seq(seq), id = hap) for hap, seq in haplotype_seqs.items()]
SeqIO.write(haplotype_records, os.path.join(outpath, "haplotype-sequences.fa"), "fasta")

17

## Visualize the Tree

Make a tree with the FASTA file of each terminal haplotype. These are already aligned by the nature of how I made them: each haplotype is the reference with only substitutions applied. 
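
Since only substitutions are applied, every sequence should have the same length as the reference. A sanity check along these lines could be run before tree building; the helper name `check_equal_lengths` and the toy input are hypothetical.

```python
def check_equal_lengths(seqs):
    """Return the shared sequence length, or raise if the lengths differ.

    seqs: {name: sequence string}
    """
    lengths = {name: len(seq) for name, seq in seqs.items()}
    distinct = set(lengths.values())
    if len(distinct) > 1:
        raise ValueError(f"sequences are not the same length: {lengths}")
    return distinct.pop() if distinct else 0


toy_alignment = {"ancestor": "ACGTACGT", "hap 1": "ATGTACGT", "hap 2": "ACGTACGA"}
print(check_equal_lengths(toy_alignment))  # 8
```

In this notebook the same check could be applied to `haplotype_seqs`.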

In [14]:
# Build the phylogeny with IQ-TREE using 1000 ultrafast bootstrap replicates (-bb 1000) and the GTR+I+G model (invariable sites plus discrete Gamma)
alignfasta = os.path.join(outpath, "haplotype-sequences.fa")
call(f"iqtree -s {alignfasta} -m GTR+I+G -bb 1000 -redo", shell=True)


IQ-TREE multicore version 2.1.4-beta COVID-edition for Linux 64-bit built Jun 24 2021
Developed by Bui Quang Minh, James Barbetti, Nguyen Lam Tung,
Olga Chernomor, Heiko Schmidt, Dominik Schrempf, Michael Woodhams.

Host:    rhino02 (AVX512, FMA3, 754 GB RAM)
Command: iqtree -s ../../results/phylogeny/haplotype-sequences.fa -m GTR+I+G -bb 1000 -redo
Seed:    55499 (Using SPRNG - Scalable Parallel Random Number Generator)
Time:    Fri Mar 31 15:04:29 2023
Kernel:  AVX+FMA - 1 threads (72 CPU cores detected)

HINT: Use -nt option to specify number of threads because your CPU has 72 cores!
HINT: -nt AUTO will automatically determine the best number of threads to use.

Reading alignment file ../../results/phylogeny/haplotype-sequences.fa ... Fasta format detected
NOTE: Change sequence name 'SSPE ancestor <unknown description>' -> SSPE_ancestor
NOTE: Change sequence name 'genome 2 <unknown description>' -> genome_2
NOTE: Change sequence name 'cluster 10 <unknown description>' -> cluster_10


0
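
The trailing `0` is the exit status returned by `call`. As an alternative sketch, the command can be built as an argument list and run without `shell=True`, so that a failed run raises an error instead of passing silently. The helper names `build_iqtree_command` and `run_iqtree` are hypothetical; the flags are the ones used above, plus `-nt`, which the IQ-TREE log suggests for setting the thread count.

```python
import subprocess


def build_iqtree_command(alignment, model="GTR+I+G", bootstraps=1000, threads=None):
    """Build the IQ-TREE argument list used in this notebook.

    `threads` (e.g. "AUTO" or 4) is optional and maps to the -nt option
    mentioned in the IQ-TREE log.
    """
    cmd = ["iqtree", "-s", alignment, "-m", model, "-bb", str(bootstraps), "-redo"]
    if threads is not None:
        cmd += ["-nt", str(threads)]
    return cmd


def run_iqtree(alignment, **kwargs):
    """Run IQ-TREE and return its exit status; raises CalledProcessError on failure."""
    cmd = build_iqtree_command(alignment, **kwargs)
    result = subprocess.run(cmd, check=True)
    return result.returncode


print(" ".join(build_iqtree_command("haplotype-sequences.fa", threads="AUTO")))
```

Passing a list avoids shell quoting problems if the output path ever contains spaces.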