# Examine Fixed Mutations 

What are the mutations that have fixed in samples isolated from the brain? We call this consensus the "SSPE Reference", and the Cattaneo lab calls it the "Brain Founder".

In [1]:
import os 
import pandas as pd
import numpy as np
from Bio import SeqIO

In [2]:
# Get the list of consensus mutations in the brain
consensus_snps_df = pd.read_csv("../../config/ref/annotated_SSPE_consensus_snps.csv")

# Get the reference sequence as a list
reference_seq = [base for record in SeqIO.parse("../../config/ref/MeVChiTok.fa", "fasta") for base in record.seq]

## Mutations to Stop Codons

Some of the most interesting mutations involve stop codons: either mutations that remove an existing stop codon or mutations that introduce a new one. Because these mutations are fixed, they change the length of the affected open reading frame in all of the measles isolates.

In [3]:
stop_codon_mutations = consensus_snps_df[(consensus_snps_df["MUT_AA"] == "*") | (consensus_snps_df["WT_AA"] == "*")]
stop_codon_mutations

Unnamed: 0,POS,REF,ALT,Gene,WT_AA,MUT_AA,POS_AA
76,9124,G,T,H,*,Y,618.0
129,3812,G,A,M,W,*,125.0


There are two mutations that change stop codons. One of these is a very significant truncation of the Matrix protein (**W125***) and the other removes the stop codon in Hemagglutinin (***618Y**).  
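The two cases can also be labelled programmatically. This is a minimal sketch that assumes the rows are supplied as `(Gene, WT_AA, MUT_AA, POS_AA)` tuples copied from the table above; `classify_stop_change` is a hypothetical helper and not part of the original analysis.

```python
def classify_stop_change(wt_aa, mut_aa):
    """Label a coding change by its effect on stop codons ('*')."""
    if wt_aa == "*" and mut_aa != "*":
        return "stop lost"
    if wt_aa != "*" and mut_aa == "*":
        return "stop gained"
    return "no stop change"

# Rows copied from the table above: (Gene, WT_AA, MUT_AA, POS_AA)
rows = [("H", "*", "Y", 618), ("M", "W", "*", 125)]

# Key each label by a readable mutation name such as "M W125*"
labels = {
    f"{gene} {wt}{pos}{mut}": classify_stop_change(wt, mut)
    for gene, wt, mut, pos in rows
}

for name, label in labels.items():
    print(f"{name}: {label}")
```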

### Matrix Premature Stop

Stop codons are known driver mutations of SSPE. It's thought that an inactive or truncated Matrix protein is important for SSPE progression. 

Based on the accumulation of mutations in the Matrix protein, it almost seems like this mutation was acquired twice. One plausible explanation is that the mutation is at a site particularly susceptible to ADAR-mediated mutations. 

ADAR has some [sequence specificity](https://www.nature.com/articles/ncomms1324). Is this sequence specificity also present around the site of the introduced stop codon? 

In [4]:
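# Introduce the G -> A stop mutation at position 3812 (lowercase marks the mutated base).
# Note: this modifies reference_seq in place, so later cells see the mutated base.
reference_seq[3812-1] = "a"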

"".join(reference_seq[(3812-1-5):(3812-1+6)])

'CCTTGaAGAAA'

It doesn't look like the sequence context for ADAR, and what's more, it can't be an ADAR mutation. Those will always be **A -> G** or **T -> C**, whereas this mutation is G -> A.
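The substitution check can be written down explicitly. This is a small sketch that assumes the only ADAR-compatible changes are A -> G and T -> C in the coordinates used here, as stated above; `is_adar_compatible` and `ADAR_SUBSTITUTIONS` are illustrative names, not part of the original notebook.

```python
# Substitutions treated as ADAR-compatible in this notebook (an assumption taken from the text)
ADAR_SUBSTITUTIONS = {("A", "G"), ("T", "C")}

def is_adar_compatible(ref, alt):
    """Return True if REF -> ALT matches one of the ADAR-type substitutions."""
    return (ref.upper(), alt.upper()) in ADAR_SUBSTITUTIONS

# The Matrix stop mutation at position 3812 is G -> A
print(is_adar_compatible("G", "A"))
```

The same test applied to `REF` and `ALT` for every row gives the set of possible ADAR mutations examined later in the notebook.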

### Haemagglutinin Stop Mutation 

The normal stop codon at the end of Haemagglutinin is mutated to a tyrosine. Where is the next stop codon? 

In [5]:
def parse_gff(gff_path, genes_to_keep=('N', 'P', 'M', 'F', 'H', 'L')):
    """
    Parse the Measles GFF file to get the coordinates of the coding sequences.
    """
    # Hold the parsed genes
    gff_dict = {}
    
    # Parse the GFF file
    with open(gff_path) as gff_file:
        for line in gff_file:
            # Ignore the header
            if line.startswith('#'):
                continue
            # Only take the coding sequences
            else:
                record = line.strip().split("\t")
                if record[2] == "CDS":
                   
                    # Get the start and stop 
                    start = int(record[3])
                    stop = int(record[4])
                    
                    # Get the gene and product names
                    annot_dict = {annot.split("=")[0]: annot.split("=")[1] for annot in record[8].split(";")}
                    gene = annot_dict['gene']

                    # There are four P/V/C reading frames
                    if gene == 'P/V/C':
                        gene = annot_dict['product'][0].upper()
                    
                    # Only keep the annotations of interest
                    if gene in genes_to_keep:
                        gff_dict[gene] = [start, stop]
                        
    return gff_dict

In [6]:
# Parse the coordinates of H from the GFF file.
H_coordinates = parse_gff("../../config/gff/MeVChiTok.gff")['H']

In [7]:
stop_codons = ['TAA','TAG','TGA']

after_H_stop = reference_seq[H_coordinates[1]:]

codons_after_H = ["".join([a,b,c]) for a,b,c in zip(after_H_stop[0::3], after_H_stop[1::3], after_H_stop[2::3])]

for i, codon in enumerate(codons_after_H):
    if codon in stop_codons:
        print(f"There is another stop codon at {i+1}: {codon}")
        break

There is another stop codon at 3: TAG


It looks like there is another in-frame stop codon only three codons downstream of the original stop. This means that there are three additional amino acids in the SSPE Haemagglutinin sequence. What are these? 

In [8]:
# Update the coordinates of H
new_H_coordinates = [H_coordinates[0], H_coordinates[1] + (3*3)]
new_H_coordinates

[7271, 9133]

In [9]:
new_codons = reference_seq[H_coordinates[1]:new_H_coordinates[1]]
new_codons = ["".join([a,b,c]) for a,b,c in zip(new_codons[0::3], new_codons[1::3], new_codons[2::3])]
print(*new_codons)

GGC TGC TAG


The old stop codon becomes a Tyrosine, and it is followed by a Glycine and a Cysteine before the new stop codon. 
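The read-through extension can be translated directly as a check. This sketch uses a hand-built standard genetic code table rather than any library call. It assumes the original stop codon is `TAG`, inferred from the G -> T change at position 9124 producing a Tyrosine (`TAT`). The `translate` helper is hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq):
    """Translate a nucleotide string codon by codon, ignoring any trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))

# Mutated stop codon (assumed TAG -> TAT) followed by the codons printed above
extension = "TAT" + "GGC" + "TGC" + "TAG"
print(translate(extension))
```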

## Distribution of Mutation Effects

What is the distribution of fixed mutations and their effects? Are there lots of apparent ADAR mutations? Where are these mutations in the genome? 

In [10]:
print(f"There are {consensus_snps_df.shape[0]} mutations fixed in the SSPE samples.")

There are 137 mutations fixed in the SSPE samples.


In [11]:
possible_ADAR_df = consensus_snps_df.query("(REF == 'T' and ALT == 'C') or (REF == 'A' and ALT == 'G')")

print(f"There are {possible_ADAR_df.shape[0]} mutations that are possibly a result of ADAR.")

There are 57 mutations that are possibly a result of ADAR.


In [12]:
possible_ADAR_df.head()

Unnamed: 0,POS,REF,ALT,Gene,WT_AA,MUT_AA,POS_AA
4,5256,T,C,intergenic,,,
5,1872,A,G,P,E,E,22.0
8,4318,T,C,M,L,S,294.0
11,1458,T,C,N,Y,H,451.0
12,4143,T,C,M,Y,H,236.0


In [13]:
possible_ADAR_df.value_counts('Gene')

Gene
intergenic    23
M              7
L              7
F              7
N              5
P              4
H              4
dtype: int64

In [14]:
consensus_snps_df.value_counts('Gene')

Gene
intergenic    49
L             22
F             18
M             14
P             13
H             11
N             10
dtype: int64
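To compare the two tables, it helps to express the possible ADAR mutations as a fraction of all fixed mutations in each gene. This is a quick sketch with the counts copied by hand from the two outputs above; the dictionary names are illustrative.

```python
# Counts copied from the two value_counts outputs above
adar_counts = {"intergenic": 23, "M": 7, "L": 7, "F": 7, "N": 5, "P": 4, "H": 4}
total_counts = {"intergenic": 49, "L": 22, "F": 18, "M": 14, "P": 13, "H": 11, "N": 10}

# Fraction of fixed mutations in each gene that are A -> G or T -> C
adar_fraction = {gene: adar_counts[gene] / total_counts[gene] for gene in total_counts}

for gene, frac in sorted(adar_fraction.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gene}: {adar_counts[gene]}/{total_counts[gene]} = {frac:.0%}")

overall = sum(adar_counts.values()) / sum(total_counts.values())
print(f"overall: {overall:.1%}")
```

These raw fractions do not account for gene length or base composition, so they are only a first look.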

I'll look more deeply into the distribution of these mutation effects in `R`. 

It seems unlikely that the stop codon mutation happened more than once, since it's not an ADAR-mediated mutation. However, it's possible that there are very few in-frame codons in Matrix that are only one mutation away from a stop codon, in which case independent truncating mutations would be more likely to land at the same site. 

In [55]:
M_coordinates = parse_gff("../../config/gff/MeVChiTok.gff")['M']

M_sequence = reference_seq[M_coordinates[0]-1:M_coordinates[1]]

len(M_sequence)

1008

In [56]:
M_codons = ["".join([a,b,c]) for a,b,c in zip(M_sequence[0::3], M_sequence[1::3], M_sequence[2::3])]
M_codons

['ATG',
 'ACA',
 'GAG',
 'ATC',
 'TAC',
 'GAC',
 'TTC',
 'GAC',
 'AAG',
 'TCG',
 'GCA',
 'TGG',
 'GAC',
 'ATC',
 'AAA',
 'GGG',
 'TCG',
 'ATC',
 'GCT',
 'CCG',
 'ATA',
 'CAA',
 'CCT',
 'ACC',
 'ACC',
 'TAC',
 'AGT',
 'GAT',
 'GGC',
 'AGG',
 'CTG',
 'GTG',
 'CCC',
 'CAG',
 'GTC',
 'AGA',
 'GTC',
 'ATA',
 'GAT',
 'CCT',
 'GGT',
 'CTA',
 'GGT',
 'GAT',
 'AGG',
 'AAG',
 'GAT',
 'GAA',
 'TGC',
 'TTT',
 'ATG',
 'TAC',
 'ATG',
 'TTT',
 'CTG',
 'CTG',
 'GGG',
 'GTT',
 'GTT',
 'GAG',
 'GAC',
 'AGC',
 'GAT',
 'CCC',
 'CTA',
 'GGG',
 'CCT',
 'CCA',
 'ATC',
 'GGG',
 'CGA',
 'GCA',
 'TTC',
 'GGG',
 'TCC',
 'CTG',
 'CCC',
 'TTA',
 'GGT',
 'GTT',
 'GGT',
 'AGA',
 'TCC',
 'ACA',
 'GCA',
 'AAA',
 'CCC',
 'GAG',
 'GAA',
 'CTC',
 'CTC',
 'AAA',
 'GAG',
 'GCC',
 'ACT',
 'GAG',
 'CTT',
 'GAC',
 'ATA',
 'GTT',
 'GTT',
 'AGA',
 'CGT',
 'ACA',
 'GCA',
 'GGG',
 'CTC',
 'AAT',
 'GAA',
 'AAA',
 'CTG',
 'GTG',
 'TTC',
 'TAC',
 'AAC',
 'AAC',
 'ACC',
 'CCA',
 'CTA',
 'ACC',
 'CTC',
 'CTC',
 'ACA',
 'CCT',
 'TGa',


In [57]:
def hammingDist(str1, str2):
    """Count the positions at which two equal-length strings differ (case-sensitive)."""
    return sum(a != b for a, b in zip(str1, str2))

hammingDist("TAG", "CAG")

1

In [58]:
stop_codons

['TAA', 'TAG', 'TGA']

In [59]:
stop_count = 0

for i, codon in enumerate(M_codons):
    
    for stop in stop_codons:
        
        if hammingDist(stop, codon) == 1:
            print(f"The codon {codon} at position {i+1} is only 1 mutation away from a stop ({stop}).")
            
            stop_count += 1
            break

The codon GAG at position 3 is only 1 mutation away from a stop (TAG).
The codon TAC at position 5 is only 1 mutation away from a stop (TAA).
The codon AAG at position 9 is only 1 mutation away from a stop (TAG).
The codon TCG at position 10 is only 1 mutation away from a stop (TAG).
The codon TGG at position 12 is only 1 mutation away from a stop (TAG).
The codon AAA at position 15 is only 1 mutation away from a stop (TAA).
The codon TCG at position 17 is only 1 mutation away from a stop (TAG).
The codon CAA at position 22 is only 1 mutation away from a stop (TAA).
The codon TAC at position 26 is only 1 mutation away from a stop (TAA).
The codon CAG at position 34 is only 1 mutation away from a stop (TAG).
The codon AGA at position 36 is only 1 mutation away from a stop (TGA).
The codon AAG at position 46 is only 1 mutation away from a stop (TAG).
The codon GAA at position 48 is only 1 mutation away from a stop (TAA).
The codon TGC at position 49 is only 1 mutation away from a stop (T

In [62]:
print(f"Over {(stop_count/len(M_codons)) * 100:.2f}% of codons in M are only a single mutation away from a stop codon.")

Over 25.89% of codons in M are only a single mutation away from a stop codon.
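The Hamming-distance count does not record which substitution creates the stop. The sketch below lists the single-base changes that turn a sense codon into a stop and checks whether any of them are ADAR-type. It assumes, as earlier in the notebook, that ADAR-type changes are only A -> G and T -> C. `stop_creating_substitutions` is a hypothetical helper.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
ADAR_SUBSTITUTIONS = {("A", "G"), ("T", "C")}  # assumption taken from the text

def stop_creating_substitutions(codon):
    """Return (codon_position, ref, alt, stop) for each single-base change of a
    sense codon that yields a stop codon. Stop codons themselves return []."""
    codon = codon.upper()
    hits = []
    if codon in STOP_CODONS:
        return hits
    for pos, ref in enumerate(codon):
        for alt in "ACGT":
            if alt == ref:
                continue
            mutant = codon[:pos] + alt + codon[pos + 1:]
            if mutant in STOP_CODONS:
                hits.append((pos + 1, ref, alt, mutant))
    return hits

# Codon 125 of M (TGG, Tryptophan) in the unmutated reference
print(stop_creating_substitutions("TGG"))

# Across all 64 codons, collect stop-creating changes that are ADAR-type
all_codons = ["".join(c) for c in product("ACGT", repeat=3)]
adar_stops = [
    (codon, hit)
    for codon in all_codons
    for hit in stop_creating_substitutions(codon)
    if (hit[1], hit[2]) in ADAR_SUBSTITUTIONS
]
print(len(adar_stops))
```

Under this assumption, no A -> G or T -> C change converts a sense codon into a stop, which is consistent with the Matrix stop mutation not being ADAR-mediated.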
