# Semester Project
###  07 May, 2019
### John Coffin


#### Motivation: 
The main focus of my dissertation is to understand how populations of fish persist in extreme environments. One potential mechanism enabling their persistence is adaptive evolution at genes involved in coping with environmental stressors, but predicting which genes are under positive selection is difficult. The purpose of my semester project is to write a script that analyzes DNA sequence variation in several related extremophile fish species that my lab works with, which will enable me to identify genes that may be experiencing positive natural selection.


#### Background:
Mutations change the DNA sequence, but due to the degeneracy of the genetic code, several different codons can code for the same amino acid.

![Codon table](https://github.com/jlcoffin/project/blob/master/codon_table.png)

Therefore, not all mutations cause a functional difference at the protein level. Mutations that leave the amino acid unchanged are called **synonymous substitutions**. Because they do not change the amino acid, they generally do not cause a phenotypic difference and are thus largely "invisible" to natural selection. In contrast, mutations that change the amino acid sequence can have important functional consequences, primarily by altering protein function. Most of these **nonsynonymous substitutions** are deleterious and are removed from the population by purifying selection, but occasionally they confer a selective advantage, in which case they increase in frequency through positive selection in what is called a selective sweep.
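
As a minimal sketch of this distinction, the snippet below classifies a single codon change by comparing the amino acid before and after the mutation. The function name `substitution_type` and the three-codon lookup table are illustrative assumptions, not part of the analysis script.

```python
# Tiny illustrative lookup (assumption: only three codons taken from the standard genetic code)
MINI_TABLE = {"CTT": "L", "CTA": "L", "TTT": "F"}

def substitution_type(ref_codon, new_codon, table=MINI_TABLE):
    """Label a codon change as 'identical', 'synonymous', or 'nonsynonymous'."""
    if ref_codon == new_codon:
        return "identical"
    if table[ref_codon] == table[new_codon]:
        return "synonymous"
    return "nonsynonymous"

print(substitution_type("CTT", "CTA"))  # leucine -> leucine
print(substitution_type("CTT", "TTT"))  # leucine -> phenylalanine
```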

![Gambusia affinis](https://github.com/jlcoffin/project/blob/master/gaffinis.jpeg)

Typically, we compare the number of nonsynonymous substitutions to the number of synonymous substitutions in a gene (the dN/dS ratio). A ratio greater than one suggests that the gene is experiencing positive selection, because amino acid changes are accumulating faster than the effectively neutral synonymous changes. A ratio near one is consistent with neutral evolution, where the random force of genetic drift alone could explain the observed nonsynonymous substitutions, and a ratio less than one indicates purifying selection. If I can identify which genes are experiencing positive selection, then I can further explore how sequence differences in those genes help fish survive in extreme environments.

To identify selection, my script needs to accomplish several tasks:
* Obtain sequence data from transcriptomes I have generated and several related species my labmates have generated
* Detect nucleotide variants in each species
* Determine whether variants change the amino acid coded (nonsynonymous substitutions, dN) or not (synonymous substitutions, dS)
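
Once dN and dS have been counted, their ratio is usually read against a threshold of one. The helper below is a rough sketch of that reading. The name `interpret_omega` is hypothetical, and a simple cutoff like this is only a first pass, not a statistical test.

```python
def interpret_omega(omega):
    """Map a dN/dS ratio onto its usual qualitative interpretation."""
    if omega > 1:
        return "positive selection"
    if omega < 1:
        return "purifying selection"
    # Exactly one: nonsynonymous and synonymous changes are equally common
    return "neutral evolution"

for value in (0.45, 1.0, 2.3):
    print(value, "->", interpret_omega(value))
```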

In [4]:
#Import modules

import pandas as pd

import numpy as np

In [5]:
#Input sequences

Gaff = ("ATGACTCCTCTTATTTACTCCACCCTAATTATCAGCCTTGGAATTGGCACCACCCTAACATTTGCCAGCACCCACTGATATCTCGCCTGAATAGGCATTGAAATCAACACATTAGCTATTATTCCCCTTATAACACAAAGCCACAGCCCCCGAGCAACTGAAGCCACCACTAAGTACTTCTTCGCACAAGCCACCGCCTCAGCCACACTTCTCTTCGCTGCTACCTCTAACGCTTTCTTCACTGGAAGTTGAGACATTCTTCAAACCAATAATCCCCTTACCTATACCCTAATAACCCTTGCCCTGGCCATAAAAATTGGCCTAGCCCCGCTTCACAGCTGAATGCCCGAAGTAATGCAAGGCCTAACTTTACTTACAGGCCTGATCCTTTCCACCTGACAAAAACTTGCCCCCTTTTGCCTTATTTATCAAATTCAACCGCACAGCCCGAACCTTTTCATGGCCTTAGGGCTCCTGTCCATTGTTGCAGGGGGATGAGGAGGCTTCAACCAAGTACAGCTACGAAAAATCCTTGCATACTCATCAATTGCCCATCTCGGATGAATAGTTCTTATCCTTTCTTTCTCGCCCCCTCTAACACTTGTTGCCTTTTCCACTTATCTTATAATAACTTTCTCCCTATTTTATTCTTTTATAATAACCAAAACCACACACATTAATTCCCTTTCCACCTCCTGAGCCAAAATTCCAACTCTCACCGCTTTAACCCCCCTAATTCTTCTCTCCCTTGGAGGCCTGCCCCCTCTAACAGGATTTTTACCAAAATGAATAATCCTTCAAGAACTGACCAAACAAGACCTAGCCCTAGTCGCCGTCCTCGCCGCCCTATTTTCACTCTTTAGTCTTTATTTTTACCTCCGACTATCATACGCAATAACATTAACTATACCCCCGAATAACCCCGCAGGAACCCTTCCTTGGCGACTCAACCCCTCCCACAATACCCTCCCCCCAGCTTTAACTACCACAATAACAATTTTTCTTCTCCCAATCACCCCAGCCTTTATAACACTCTTTTTCCTCTAA")
Pmex = ("ATGGCTCCCCTAATCTACTCCGCCCTCATCATTAGTCTTGGCCTCGGTACAACCATAACCTTTGCCAGCACCCATTGATACCTTGCCTGAATAGGCATCGAAATTAATACATTAGCCATCATCCCTCTAATAGCCCAAAACCACATCCCCCGAGCAATTGAAGCTACCACTAAATACTTTTTTGTCCAAGCGACTGCCTCAGCAACACTTCTATTTGCTGGTATCTCCAACGCATTTCTAACCGGACAATGAGATATTACCTACACCCCCTATACCCTTACTTCTACACTAATTACCCTTGCCCTTGCAATAAAAATTGGCCTTGCCCCCCTTCACAGCTGAATACCAGAAGTAATGCAAGGCCTTAATTTACTTACCGGGCTGATTTTATCCACTTGACAAAAACTTGCCCCCTTATACCTAATTTATCAAATTCAACCAAACAACCCAAACATTTTTATTGCCTTAGGCCTTATATCTATTATTGTTGGAGGATGAGGAGGATTCAACCAAGTCCAACTTCGAAAAATCCTAGCATACTCATCAATCGCCCACCTAGGGTGAATAATCTTAATTCTCTCATTTTCACCCCCACTCGCCCTACTCACCATTATCATTTACATTTTAATAACCTTCTCATTATTCTCCTCTTTTATACTAACCCGCACCACCCATATCAACTCCCTTTCCACCACCTGAGCCAAAATTCCAATCCTTACTATTTCCACCCCTCTAATCCTTCTATCCCTAGGCGGACTACCCCCCCTTACAGGATTTATACCAAAATGAATAATCCTCCAAGAATTAACCAAACAAAGCCTATGCCCACTCGCCACCATAGCTGCACTCTCATCCCTTTTCAGCCTTTATTTTTACCTCCGACTATCATACGCAATAACACTCACTATACCACCAAATAACCCTGCAGGGACACTCCCATGACGACTTAACCCTCGACACAACACCCTTCCCCTAGCCCTAACAACCACCTCAACAATTTGCCTCCTTCCCATAACCCCCGCTATCATATCCCTAATACCTTTCTGA")
Psul = ("ATGGCTCCCCCAATCTACTCCGCCCTCATCATCAGTCTTGGCCTCGGTACAACCATAACCTTTGCCAGCACCCATTGATTCCTTGCCTGAATAGGCATCGAAATTAATACATTAGCCATCATCCCTCTAATAGCCCAAAACCACATCCCCCGAGCAATTGAAGCTACCACTAAATACTTTTTTGTCCAAGCGACTGCCTCAGCAACACTTCTATTTGCTGGTATCTCCAACGCATTTCTTACCGGACAATGAGATATTACCCACACCCCCTATACCCTCACTTCTACACTAATTACCCTTGCCCTTGCAATAAAAATTGGCCTTGCCCCCCTTCACAGCTGAATACCAGAAGTAATGCAAGGCCTTAATTTACTTACCGGACTGATTCTATCCACTTGACAAAAACTTGCCCCCTTATACCTAATTTATCAAATTCAACCAAACAACCCAAACATTTTTATTGCCTTAGGCCTTCTATCTATTATTGTTGGAGGATGAGGAGGATTCAACCAAGTCCAACTTCGAAAAATCCTAGCATACTCATCAATCGCCCACCTAGGGTGAATAATCTTAATTCTCCCATTTTCACCCCCACTCGCCCTACTCACCATTATTATTTACATTTTAATAACCTTCTCATTATTCTCCTCTTTTATACTAACCCGCACCACCCACATCAACTCCCTTTCCACTACCTGAGCCAAAATTCCAATCCTTACTATCTCCACCCCCCTAATCCTTCTATCCCTAGGAGGACTACCCCCTCTTACAGGATTTATACCAAAATGAATAATCCTCCAAGAATTAACCAAACAAAGCCTATGCCCACTCGCCACCATAGCTGCACTCTCATCCCTTTTCAGCCTTTATTTTTACCTCCGACTATCATACGCAATAACACTCACTATACCACCAAATAACCCTGCAGGGACACTCCCATGACGACTTAACCCTCGACACAATACCCTTCCCCTAGCCCTAACAACCACCTCAACAATTTGCCTCCTTCCCATAACCCCCACTATCATATCCCTAATACCTTTCTAG")
Xmac = ("ATGGCTCCCTTTGTTTACTCCACCCTTATCATTAGTCTTGGTCTCGGCACAGCCATAACCTTCGCCAGCACCCACTGATACCTCGCCTGAATAGGCATCGAAATCAATACACTAGCCATTATCCCCCTAATAACCCAAAACCACAACCCTCGAGCAATTGAAGCCACCACCAAATACTTTTTTGCACAAGCCACCGCTTCAGCAACACTGCTATTCGCCGCTGTCTCAAATGCATTCTTGACCGGGGGATGAGACATTCTTCAAATCAACCACCCCCTAACTTCCACACTCACCACCCTAGCTCTAGCTATAAAAATCGGTCTCGCCCCCCTCCACAGCTGAATGCCAGAAGTAATACAAGGCTTAAGCCTACTTACCGGCCTTATCCTTTCCACTTGGCAAAAACTTGCCCCCTTTTGCCTCATTTACCAAATTCAGCCAGACAACCCCAATGTTTTCATTACCTTAGGCCTCTTATCTGTGATCGTGGGCGGCTGAGGGGGTTTCAACCAAGTCCAACTCCGAAAAATTCTTGCATACTCATCAATCGCCCATCTGGGCTGAATAATTCTCATTCTCCCATTCTCACCCCCTCTCACCTTACTTACCCTTTTTACTTACCTAATAATGACCTTTTCACTTTTTTCCTCCTTCATACTAGTTCGCACCACACATATTAACTCTTTATCCATCTCCTGAGCCAAAATCCCAACTCTTACTGCTTCTGTCCCCCTCATCCTCTTATCCCTGGGAGGTCTACCACCCCTAACAGGATTCTTACCAAAATGACTTATCCTCCAAGAGTTAACCAAACAAGACCTGGCCCCAATTGCCACTCTCGCTGCCCTTTCATCCCTCTTCAGCCTTTATTTTTATCTCCGACTATCATACACAATAACACTCACTATGCCCCCAAACAACCCCGCCGGAACACTCCCCTGACGACTTAACCCCCAACACAACACCCTCCCCCTGGCTTTAACCACCACCACAACAATTTTCCTTCTCCCTGCCACCCCAACCATTCTGGCTCTTTTCACTTTCTGA")


In [6]:
#Split each sequence into codon triplets, which will be translated into amino acids

Gaff_codons = []
for i in np.arange(0,len(Gaff),3):
    Gaff_codons.append(Gaff[i:i+3])

Pmex_codons = []
for i in np.arange(0,len(Pmex),3):
    Pmex_codons.append(Pmex[i:i+3])

Psul_codons = []
for i in np.arange(0,len(Psul),3):
    Psul_codons.append(Psul[i:i+3])
    
Xmac_codons = []
for i in np.arange(0,len(Xmac),3):
    Xmac_codons.append(Xmac[i:i+3])

print(Gaff_codons[:10])
print(Xmac_codons[:10])

['ATG', 'ACT', 'CCT', 'CTT', 'ATT', 'TAC', 'TCC', 'ACC', 'CTA', 'ATT']
['ATG', 'GCT', 'CCC', 'TTT', 'GTT', 'TAC', 'TCC', 'ACC', 'CTT', 'ATC']
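
The four loops above do the same thing for each species, so one possible simplification is a small helper that splits any in-frame sequence. This is a sketch only, and `split_codons` is a name introduced here for illustration.

```python
def split_codons(sequence):
    """Split an in-frame DNA string into a list of three-letter codons."""
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

# The first twelve nucleotides of the Gambusia affinis sequence
print(split_codons("ATGACTCCTCTT"))
```

With a helper like this, each species needs a single call, for example `Gaff_codons = split_codons(Gaff)`.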


In [7]:
#Identify which codons differ between each sequence and the reference sequence, Xiphophorus maculatus (Xmac)

Gaff_variants_index = []
Gaff_invariants_index = []
Gaff_variants = []
Gaff_invariants = []
for codon in np.arange(0, len(Gaff_codons)):
    if Gaff_codons[codon] == Xmac_codons[codon]:
        Gaff_invariants_index.append(codon) #indices are zero-based, so the first codon is stored as 0 and not 1
        Gaff_invariants.append(Gaff_codons[codon]) #store the codon itself, not the nucleotide at this position in the raw sequence
    else:
        Gaff_variants_index.append(codon)
        Gaff_variants.append(Gaff_codons[codon])
        
affinis_variants = {"CodonNumber": Gaff_variants_index, "VariantCodon": Gaff_variants}
Gaff_snps = pd.DataFrame(data = affinis_variants)
Gaff_snps.head(5)

|   | CodonNumber | VariantCodon |
|---|---|---|
| 0 | 1 | ACT |
| 1 | 2 | CCT |
| 2 | 3 | CTT |
| 3 | 4 | ATT |
| 4 | 8 | CTA |
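
The comparison against the reference can also be written as one reusable function. The sketch below assumes both codon lists are aligned and in the same reading frame. `find_variant_codons` is a hypothetical name, and the two short lists are the first ten codons printed earlier for Gaff and Xmac.

```python
def find_variant_codons(query_codons, ref_codons):
    """Return (index, query_codon, ref_codon) for each position where the codons differ.

    zip() stops at the shorter list, so trailing codons in a longer sequence are ignored.
    """
    return [
        (i, q, r)
        for i, (q, r) in enumerate(zip(query_codons, ref_codons))
        if q != r
    ]

gaff_first10 = ['ATG', 'ACT', 'CCT', 'CTT', 'ATT', 'TAC', 'TCC', 'ACC', 'CTA', 'ATT']
xmac_first10 = ['ATG', 'GCT', 'CCC', 'TTT', 'GTT', 'TAC', 'TCC', 'ACC', 'CTT', 'ATC']

variants = find_variant_codons(gaff_first10, xmac_first10)
print([index for index, _, _ in variants])
```

Returning the reference codon alongside the query codon keeps both alleles available for the later synonymous versus nonsynonymous check.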


In [8]:
Pmex_variants_index = []
Pmex_invariants_index = []
Pmex_variants = []
Pmex_invariants = []
for codon in np.arange(0, len(Pmex_codons)):
    if Pmex_codons[codon] == Xmac_codons[codon]:
        Pmex_invariants_index.append(codon)
        Pmex_invariants.append(Pmex_codons[codon]) #store the codon itself, not the nucleotide at this position in the raw sequence
    else:
        Pmex_variants_index.append(codon)
        Pmex_variants.append(Pmex_codons[codon])
mexicana_variants = {"CodonNumber": Pmex_variants_index, "VariantCodon": Pmex_variants}
Pmex_snps = pd.DataFrame(data = mexicana_variants)
Pmex_snps.head(5)

|   | CodonNumber | VariantCodon |
|---|---|---|
| 0 | 3 | CTA |
| 1 | 4 | ATC |
| 2 | 7 | GCC |
| 3 | 8 | CTC |
| 4 | 13 | GGC |


In [9]:
Psul_variants_index = []
Psul_invariants_index = []
Psul_variants = []
Psul_invariants = []
for codon in np.arange(0, len(Psul_codons)):
    if Psul_codons[codon] == Xmac_codons[codon]:
        Psul_invariants_index.append(codon)
        Psul_invariants.append(Psul_codons[codon]) #store the codon itself, not the nucleotide at this position in the raw sequence
    else:
        Psul_variants_index.append(codon)
        Psul_variants.append(Psul_codons[codon])
sulphuraria_variants = {"CodonNumber": Psul_variants_index, "VariantCodon": Psul_variants}
Psul_snps = pd.DataFrame(data = sulphuraria_variants)
Psul_snps.head(5)

|   | CodonNumber | VariantCodon |
|---|---|---|
| 0 | 3 | CCA |
| 1 | 4 | ATC |
| 2 | 7 | GCC |
| 3 | 8 | CTC |
| 4 | 10 | ATC |


In [10]:
#Create a function to translate DNA into amino acids

def translate(sequence): 
       
    conversion_table = { 
        'ATA':'I', 
        'ATC':'I', 
        'ATT':'I', 
        'ATG':'M', 
        'ACA':'T', 
        'ACC':'T', 
        'ACG':'T', 
        'ACT':'T', 
        'AAC':'N', 
        'AAT':'N', 
        'AAA':'K', 
        'AAG':'K', 
        'AGC':'S', 
        'AGT':'S', 
        'AGA':'R', 
        'AGG':'R',                  
        'CTA':'L', 
        'CTC':'L', 
        'CTG':'L', 
        'CTT':'L', 
        'CCA':'P', 
        'CCC':'P', 
        'CCG':'P', 
        'CCT':'P', 
        'CAC':'H', 
        'CAT':'H', 
        'CAA':'Q', 
        'CAG':'Q', 
        'CGA':'R', 
        'CGC':'R', 
        'CGG':'R', 
        'CGT':'R', 
        'GTA':'V', 
        'GTC':'V', 
        'GTG':'V', 
        'GTT':'V', 
        'GCA':'A', 
        'GCC':'A', 
        'GCG':'A', 
        'GCT':'A', 
        'GAC':'D', 
        'GAT':'D', 
        'GAA':'E', 
        'GAG':'E', 
        'GGA':'G', 
        'GGC':'G', 
        'GGG':'G', 
        'GGT':'G', 
        'TCA':'S', 
        'TCC':'S', 
        'TCG':'S', 
        'TCT':'S', 
        'TTC':'F', 
        'TTT':'F', 
        'TTA':'L', 
        'TTG':'L', 
        'TAC':'Y', 
        'TAT':'Y', 
        'TAA':'_', 
        'TAG':'_', 
        'TGC':'C', 
        'TGT':'C', 
        'TGA':'_', 
        'TGG':'W'
    } 
    protein_sequence =[] 
    if len(sequence)%3 == 0: #codons are in multiples of 3 nucleotides
        for i in range(0, len(sequence), 3): 
            codon = sequence[i:i + 3] 
            protein_sequence += conversion_table[codon] 
    else:
        print("This sequence is missing data. The sequence must be in-frame to compare codons.")
    return protein_sequence
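
The same lookup table can be built more compactly. The sketch below relies on the conventional TCAG ordering of the standard genetic code and keeps `_` as the stop symbol used above. `CODON_TABLE` and `translate_to_string` are names introduced here for illustration, and raising an error for out-of-frame input is a design assumption rather than the behavior of `translate`.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the standard genetic code in TCAG order; '_' marks stop codons
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest: TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_to_string(sequence):
    """Translate an in-frame DNA string into a one-letter amino acid string."""
    if len(sequence) % 3 != 0:
        raise ValueError("Sequence length must be a multiple of 3 to stay in-frame.")
    return "".join(CODON_TABLE[sequence[i:i + 3]] for i in range(0, len(sequence), 3))

print(translate_to_string("ATGACTCCT"))
```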

In [11]:
#Now, at each of the variant sites identified, this mutation could change the amino acid coded by those nucleotides (a non-synonymous mutation), or it could be a synonymous mutation

Gaff_synonymous_sites = []
Gaff_nonsynonymous_sites = []
for codon in Gaff_snps.CodonNumber:
    if translate(Gaff_codons[codon]) == translate(Xmac_codons[codon]):
        Gaff_synonymous_sites += translate(Gaff_codons[codon])
    else:
        Gaff_nonsynonymous_sites += translate(Gaff_codons[codon])
        
Gaff_omega = len(Gaff_nonsynonymous_sites) / len(Gaff_synonymous_sites)

if Gaff_omega < 1:
    print("The dN/dS ratio for Gambusia affinis is", Gaff_omega, "which is indicative of purifying selection for this gene.") 
else:
    print("The dN/dS ratio for Gambusia affinis is", Gaff_omega, "which is indicative of positive selection for this gene.")

The dN/dS ratio for Gambusia affinis is 0.45384615384615384 which is indicative of purifying selection for this gene.


In [12]:
Pmex_synonymous_sites = []
Pmex_nonsynonymous_sites = []
for codon in Pmex_snps.CodonNumber:
    if translate(Pmex_codons[codon]) == translate(Xmac_codons[codon]):
        Pmex_synonymous_sites += translate(Pmex_codons[codon])
    else:
        Pmex_nonsynonymous_sites += translate(Pmex_codons[codon])
Pmex_omega = len(Pmex_nonsynonymous_sites) / len(Pmex_synonymous_sites)

if Pmex_omega < 1:
    print("The dN/dS ratio for Poecilia mexicana is", Pmex_omega, "which is indicative of purifying selection for this gene.") 
else:
    print("The dN/dS ratio for Poecilia mexicana is", Pmex_omega, "which is indicative of positive selection for this gene.")

The dN/dS ratio for Poecilia mexicana is 0.5221238938053098 which is indicative of purifying selection for this gene.


In [13]:
Psul_synonymous_sites = []
Psul_nonsynonymous_sites = []
for codon in Psul_snps.CodonNumber:
    if translate(Psul_codons[codon]) == translate(Xmac_codons[codon]):
        Psul_synonymous_sites += translate(Psul_codons[codon])
    else:
        Psul_nonsynonymous_sites += translate(Psul_codons[codon])
Psul_omega = len(Psul_nonsynonymous_sites) / len(Psul_synonymous_sites)

if Psul_omega < 1:
    print("The dN/dS ratio for Poecilia sulphuraria is", Psul_omega, "which is indicative of purifying selection for this gene.") 
else:
    print("The dN/dS ratio for Poecilia sulphuraria is", Psul_omega, "which is indicative of positive selection for this gene.")

The dN/dS ratio for Poecilia sulphuraria is 0.4830508474576271 which is indicative of purifying selection for this gene.
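
The three per-species cells repeat the same counting logic, so it could be collected into shared helpers. This is a sketch under stated assumptions: `count_substitutions` and `count_ratio` are hypothetical names, the ratio is the same raw count of nonsynonymous to synonymous codon differences used above, and returning `None` when no synonymous differences exist is one way to avoid a division by zero.

```python
from itertools import product

# Standard genetic code built compactly (TCAG order; '_' marks stop codons)
CODON_TABLE = dict(zip(
    ("".join(codon) for codon in product("TCAG", repeat=3)),
    "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

def count_substitutions(query_codons, ref_codons, table=CODON_TABLE):
    """Return (nonsynonymous, synonymous) counts of codons that differ from the reference."""
    nonsynonymous = synonymous = 0
    for query, ref in zip(query_codons, ref_codons):
        if query == ref:
            continue  # invariant codon
        if table[query] == table[ref]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return nonsynonymous, synonymous

def count_ratio(nonsynonymous, synonymous):
    """Return nonsynonymous / synonymous, or None if there are no synonymous differences."""
    return nonsynonymous / synonymous if synonymous else None

# First ten codons of Gaff and Xmac, as printed earlier; too few to interpret biologically
gaff_first10 = ['ATG', 'ACT', 'CCT', 'CTT', 'ATT', 'TAC', 'TCC', 'ACC', 'CTA', 'ATT']
xmac_first10 = ['ATG', 'GCT', 'CCC', 'TTT', 'GTT', 'TAC', 'TCC', 'ACC', 'CTT', 'ATC']

n_count, s_count = count_substitutions(gaff_first10, xmac_first10)
print(n_count, s_count, count_ratio(n_count, s_count))
```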
