# Study on the Effects of the D614G Mutation on the Infectivity of SARS-CoV-2 Using California as a Case Study

by Joshua Delgadillo, Lauren Craft, Adam Perlin, Efrain Rangel, Brett Triebold, Ritta Liu

# Background

Our team chose to research the novel coronavirus. In 2020, SARS-CoV-2 began spreading across the world, prompting a widespread effort to understand the virus and find ways to combat it. Bioinformatics is at the forefront of understanding the genetic makeup of the virus and its mutations. Our contribution is to trace similarities between SARS-CoV-2 and two coronaviruses that spread previously, SARS-CoV-1 and MERS-CoV. Much of our research and data collection involved taking sequences of these three viruses and comparing them to identify their similarities and differences.

The data used in our research came from the National Center for Biotechnology Information (NCBI). The NCBI hosts a large number of bioinformatics databases, including several that contain virus-specific sequence data. We focused our data extraction on the NCBI Nucleotide database, from which we collected several sequence samples to use in our research.

# Project Overview

After successfully collecting sequence data for SARS-CoV-1, SARS-CoV-2 and MERS-CoV, we analyzed the similarities and differences between example sequences of each of the viruses. We wanted to expand upon this research and begin looking into mutations of SARS-CoV-2 that have appeared since the initial identification. The specific mutation that we focused on is the D614G mutation.

The D614G mutation is an A-to-G missense mutation at genome position 23,403 that substitutes glycine for aspartic acid at residue 614 of the spike protein (1). This changes the structure of the viral spike glycoprotein. Viruses carrying this mutation have since been assigned to clade G, which was first detected on January 21, 2020 (2). Because of the increasing global dominance of clade G, the mutation is suspected to increase infectivity. One study found that retroviruses with SG614 infected ACE2-expressing cells markedly more efficiently than those with SD614 (6). The mutation was initially most prevalent in European regions (1). Italy experienced an outbreak during the early stage of the global pandemic that resulted in 400 deaths in one month (3), and 87% of samples taken from Italy were found to carry the D614G mutation.
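To make the coordinates concrete, the following minimal sketch maps genome position 23,403 to a spike codon. It assumes that the S gene starts at nucleotide 21,563 of the reference genome (NC_045512.2) and that the wild-type codon at this site is GAT; neither value is stated above. The codon table covers only the two codons involved.

```python
# Assumption: the spike (S) gene starts at 1-based genome position 21,563
# in the reference genome NC_045512.2 (not stated in the text above).
S_GENE_START = 21563
MUTATION_POS = 23403  # 1-based genome position of the A-to-G change

# Minimal codon table covering only the two codons used in this example.
CODON_TO_AA = {"GAT": "D", "GGT": "G"}

def locate_in_gene(pos, gene_start):
    """Return (1-based codon number, 0-based index within that codon)."""
    offset = pos - gene_start  # 0-based offset into the gene
    return offset // 3 + 1, offset % 3

codon_number, index_in_codon = locate_in_gene(MUTATION_POS, S_GENE_START)

wild_type = "GAT"  # assumed wild-type codon (aspartic acid)
mutant = wild_type[:index_in_codon] + "G" + wild_type[index_in_codon + 1:]

print(codon_number, index_in_codon)
print(CODON_TO_AA[wild_type], "->", CODON_TO_AA[mutant])
```

Under these assumptions the position falls on the second base of codon 614, and changing that base from A to G turns an aspartic acid codon into a glycine codon, which is where the name D614G comes from.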

California is used as a case study. Samples taken from COVID-19 patients across California allow the prevalence of the D614G mutation in the California population to be estimated. We randomly selected 30 samples from the NCBI and searched them for mutations in the spike glycoprotein (S protein) (5). Since the mutation was more commonly found in Europe, it was likely introduced to the California population later on. The presence of the D614G mutation among Californian cases could indicate increased infectivity.


# Implementation

### Smith-Waterman Algorithm

Our implementation began with a comparison of SARS-CoV-1, SARS-CoV-2 and MERS-CoV. To perform these comparisons we used the Smith-Waterman algorithm. Earlier research on the similarities and differences between the viruses used the Needleman-Wunsch algorithm, which finds the optimal global alignment of two sequences. We chose Smith-Waterman because it finds the best local alignment instead. We use the individual local alignment scores to draw conclusions about overall viral genome similarity; i.e., if all local alignments exhibit high similarity and the local alignments cover a large proportion of the genome, then the two genomes are likely to be quite similar.


Pseudocode:

~~~

Given: String s1 with length m , String s2 with length n

    // initialize matrix M of size (m+1) x (n+1); row 0 and column 0 are all 0
    
    // score cells in matrix
    for i=1 to m
        for j=1 to n
        
            // initialization: max is 0
            max = 0 
            
            // first comparison: west cell (deletion)
            score = M[i][j-1] + gapScore
            if( score > max )
                max = score
            
            // second comparison: north cell (insertion)
            score = M[i-1][j] + gapScore
            if( score > max )
                max = score
            
            // last comparison: north-west cell (alignment)
            // i indexes s1 (length m) and j indexes s2 (length n)
            base1 = s1[i-1]
            base2 = s2[j-1]
            
            if( base1 == base2 )              // match
                alignmentScore = matchScore
            else                              // mismatch
                alignmentScore = mismatchScore
            
            score = M[i-1][j-1] + alignmentScore
            if( score > max )
                max = score
            
            // finished all comparisons
            M[i][j] = max
    
    // return completed matrix
    return M
~~~
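The pseudocode above can be turned into a short runnable sketch. The scoring values used here (match 2, mismatch -1, gap -2) are illustrative assumptions, not the parameters used elsewhere in this project, and `best_local_score` is a helper name introduced only for this example.

```python
def smith_waterman_matrix(s1, s2, match_score=2, mismatch_score=-1, gap_score=-2):
    """Fill the Smith-Waterman scoring matrix with s1 on rows and s2 on columns."""
    m, n = len(s1), len(s2)
    # Row 0 and column 0 stay at 0, the base case for local alignment.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                diagonal = M[i - 1][j - 1] + match_score
            else:
                diagonal = M[i - 1][j - 1] + mismatch_score
            west = M[i][j - 1] + gap_score    # gap in s1
            north = M[i - 1][j] + gap_score   # gap in s2
            # A cell never drops below 0, so an alignment can restart anywhere.
            M[i][j] = max(0, west, north, diagonal)
    return M

def best_local_score(M):
    """The best local alignment score is the largest value in the matrix."""
    return max(max(row) for row in M)

M = smith_waterman_matrix("ACGTT", "CGT")
print(best_local_score(M))
```

With these example scores, the shared substring `CGT` contributes three matches, so the best local score is 6. Recovering the alignment itself would require a traceback from the highest-scoring cell, which is omitted here.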

### BLAST

We chose to use the Basic Local Alignment Search Tool (BLAST) to perform our sequence analysis.

From the [BLAST website]: "BLAST finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches."

In our research, we found that BLAST uses a faster (but less precise) approximation of Smith-Waterman. A heuristic eliminates sequences that are unlikely to produce a good local alignment. This shortcut is necessary because the full algorithm is too slow when run on large datasets. Source: [The NCBI Handbook]
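As a rough illustration of the kind of shortcut involved, the sketch below finds exact shared words of length `k` between two sequences, which is the general idea behind seeding an alignment. It is a simplification and not BLAST's actual implementation: real BLAST also considers near matches and then extends and scores each seed. The function name `find_seeds` and the example strings are invented for this example.

```python
from collections import defaultdict

def find_seeds(query, subject, k=3):
    """Return (query_index, subject_index) pairs where a length-k word matches exactly."""
    # Index every length-k word of the subject by its start position.
    index = defaultdict(list)
    for j in range(len(subject) - k + 1):
        index[subject[j:j + k]].append(j)

    # Look up each query word; only positions with a hit are kept as seeds.
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds

seeds = find_seeds("MESLVLGV", "AAESLVKK")
print(seeds)
```

Only the regions around the seeds would then need the expensive dynamic-programming comparison, which is what makes a seeded search faster than running Smith-Waterman over every full pair of sequences.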

[BLAST website]: https://blast.ncbi.nlm.nih.gov/Blast.cgi
[The NCBI Handbook]: https://www.unmc.edu/bsbc/docs/NCBI_blast.pdf

A function to print the contents of .fasta files:

In [20]:
from Bio import SeqIO
import pandas as pd 
  
def print_fasta(virus_name, file_name):
    # initialize list of lists 
    data = []

    # parse every record in the FASTA file; the with-block closes the handle
    with open(file_name) as in_file:
        for fasta in SeqIO.parse(in_file, 'fasta'):
            name, sequence = fasta.id, str(fasta.seq)
            data.append([name, sequence])

    # Create the pandas DataFrame 
    df = pd.DataFrame(data, columns = [virus_name, 'sequence']) 

    # print dataframe. 
    print(df) 
        

Printing one example sequence of each of the viruses:

In [21]:
from glob import glob

data_dir = "test_data"
sample_types = [
    "SARS-CoV-1",
    "SARS-CoV-2",
    "MERS-CoV"
]

for sample_type in sample_types:
    for sample in glob(data_dir+"/{}*.fasta".format(sample_type)):
        print_fasta(sample_type, sample)
    print()

   SARS-CoV-1                                           sequence
0  ATO98191.1  MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...
   SARS-CoV-1                                           sequence
0  AGT21121.1  MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...
    SARS-CoV-1                                           sequence
0   AEA10518.1  MFIFLLFLTLTSGSDLDRCTTFDDVQAPNYTQHTSSMRGVYYPDEI...
1   AEA10517.1  MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...
2   AEA10516.1  MESLVLGVNEKTHVQLSLPVLQVRDVLVRGFGDSVEEALSEAREHL...
3   AEA10528.1  MSDNGPQSNQRSAPRITFGGPTDSTDNNQNGGRNGARPKQRRPQGL...
4   AEA10522.1  MADNGTITVEKLKQLLEQWNLVIGFLFLAWIMLLQFAYSNRNRFLY...
5   AEA10527.1  MCLKILVRYNTRGNTYSTAWLCALGKVLPFHRWHTMVQTCTPNVTI...
6   AEA10526.1            MKLLIVLTCISLCSCICTVVQRCASNKPHVLEDPCKVQH
7   AEA10524.1  MKIILFLTLIVFTSCELYHYQECVRGTTVLLKEPCPSGTYEGNSPF...
8   AEA10523.1  MFHLVDFQVTIAEILIIIMRTFRIAIWNLDVIISSIVRQLFKPLTK...
9   AEA10520.1  MMPTTLFAGTHITMTTVYHITVSQIQLSLLKVTAFQHQNSKKTTKL...
10  AEA10519.1

#### Database Creation for Utilizing BLAST

In [22]:
%%bash
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in test_data/SARS-CoV-1-ATO98191.fasta -dbtype prot
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in test_data/SARS-CoV-2-QOP81761.fasta -dbtype prot
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in test_data/MERS-CoV-AUM60013.fasta -dbtype prot



Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/test_data/SARS-CoV-1-ATO98191.fasta
New DB title:  test_data/SARS-CoV-1-ATO98191.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.00022006 seconds.




Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/test_data/SARS-CoV-2-QOP81761.fasta
New DB title:  test_data/SARS-CoV-2-QOP81761.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.000207901 seconds.




Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/test_data/MERS-CoV-AUM60013.fasta
New DB title:  test_data/MERS-CoV-AUM60013.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 1 sequences in 0.00020504 seconds.




Running blast on each pair:

In [23]:
%%bash
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./test_data/SARS-CoV-1-ATO98191.fasta -db ./test_data/SARS-CoV-2-QOP81761.fasta -evalue 1e-6 -num_threads 16 -out ./test_data/blast_S1_S2.txt
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./test_data/SARS-CoV-1-ATO98191.fasta -db ./test_data/MERS-CoV-AUM60013.fasta -evalue 1e-6 -num_threads 16 -out ./test_data/blast_S1_M.txt
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./test_data/SARS-CoV-2-QOP81761.fasta -db ./test_data/MERS-CoV-AUM60013.fasta -evalue 1e-6 -num_threads 16 -out ./test_data/blast_S2_M.txt

BLAST results of SARS-CoV-1 and SARS-CoV-2:

In [24]:
!grep -A1 "Score = " ./test_data/blast_S1_S2.txt | head -2

 Score = 12938 bits (33578),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 6126/7101 (86%), Positives = 6599/7101 (93%), Gaps = 33/7101 (0%)


BLAST results of SARS-CoV-1 and MERS-CoV:

In [25]:
!grep -A1 "Score = " ./test_data/blast_S1_M.txt | head -2

 Score = 6032 bits (15650),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 3061/6037 (51%), Positives = 4050/6037 (67%), Gaps = 195/6037 (3%)


BLAST results of SARS-CoV-2 and MERS-CoV:

In [26]:
!grep -A1 "Score = " ./test_data/blast_S2_M.txt | head -2

 Score = 6061 bits (15723),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 3055/5946 (51%), Positives = 4030/5946 (68%), Gaps = 177/5946 (3%)


Commands to clean up the test_data directory:

In [27]:
!rm ./test_data/*.fasta.*
!rm ./test_data/*.txt

#### BLAST Database Creation

In [28]:
%%bash
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in data/SARS-CoV-1-Sequences_2.fasta -dbtype prot
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in data/SARS-CoV-2-Sequences_2.fasta -dbtype prot
/usr/local/ncbi-blast-2.10.1+/bin/makeblastdb -in data/MERS-CoV-Sequences_2.fasta -dbtype prot



Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/data/SARS-CoV-1-Sequences_2.fasta
New DB title:  data/SARS-CoV-1-Sequences_2.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 157 sequences in 0.00476289 seconds.




Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/data/SARS-CoV-2-Sequences_2.fasta
New DB title:  data/SARS-CoV-2-Sequences_2.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 108 sequences in 0.00394607 seconds.




Building a new DB, current time: 11/30/2020 21:52:28
New DB name:   /home/jupyter-jdelga26/448/data/MERS-CoV-Sequences_2.fasta
New DB title:  data/MERS-CoV-Sequences_2.fasta
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 135 sequences in 0.00464487 seconds.




Run blast on every pair

In [29]:
%%bash
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./data/SARS-CoV-1-Sequences_2.fasta -db ./data/SARS-CoV-2-Sequences_2.fasta -evalue 1e-6 -num_threads 16 -out ./data/blast_S1_S2.txt
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./data/SARS-CoV-1-Sequences_2.fasta -db ./data/MERS-CoV-Sequences_2.fasta -evalue 1e-6 -num_threads 16 -out ./data/blast_S1_M.txt
/usr/local/ncbi-blast-2.10.1+/bin/blastp -query ./data/SARS-CoV-2-Sequences_2.fasta -db ./data/MERS-CoV-Sequences_2.fasta -evalue 1e-6 -num_threads 16 -out ./data/blast_S2_M.txt

Print out results of comparisons (top alignment score in every output file)

In [30]:
%%bash
echo "SARS-CoV-1 x SARS-CoV-2"
grep -A1 -m1 "Score =" ./data/blast_S1_S2.txt | head -2
echo "SARS-CoV-1 x MERS-CoV"
grep -A1 -m1 "Score =" ./data/blast_S1_M.txt | head -2
echo "SARS-CoV-2 x MERS-CoV"
grep -A1 -m1 "Score =" ./data/blast_S2_M.txt | head -2

SARS-CoV-1 x SARS-CoV-2
 Score = 12927 bits (33549),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 6120/7103 (86%), Positives = 6595/7103 (93%), Gaps = 37/7103 (1%)
SARS-CoV-1 x MERS-CoV
 Score = 5983 bits (15522),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 3066/6175 (50%), Positives = 4083/6175 (66%), Gaps = 262/6175 (4%)
SARS-CoV-2 x MERS-CoV
 Score = 5995 bits (15553),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 3041/5978 (51%), Positives = 4002/5978 (67%), Gaps = 202/5978 (3%)


In [31]:
!cat ./data/blast_S2_M.txt

BLASTP 2.10.1+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer,
L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri
I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001),
"Improving the accuracy of PSI-BLAST protein database searches with
composition-based statistics and other refinements", Nucleic Acids
Res. 29:2994-3005.



Database: data/MERS-CoV-Sequences_2.fasta
           135 sequences; 155,015 total letters



Query= QPF16038.1 |ORF1ab polyprotein [Severe acute respiratory syndrome
coronavirus 2]

Length=7096
                                                                      Score     E
Sequences producing significant alignments:                        

                 +K             +T+ V I   D ++ AK    +V+VNAAN +LKHGGG+AGA
Sbjct  1100  KRLRIKRNVDPLSNFEHKVITECVTIVLGDAIQVAKCYGESVLVNAANTHLKHGGGIAGA  1159

Query  1075  LNKATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSA  1134
             +N A+  A+Q ESD+YI   GPL+VG S +L GH+LAK+ LHVVGP+    +D+ LL   
Sbjct  1160  INAASKGAVQKESDEYILAKGPLQVGDSVLLQGHSLAKNILHVVGPDARAKQDVSLLSKC  1219

Query  1135  YENFNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKL  1186
             Y+  N + +++ PL+SAGIFG  P  S    +   +T V + V  +++Y  L
Sbjct  1220  YKAMNAYPLVVTPLVSAGIFGVKPAVSFDYLIREAKTRVLVVVNSQDVYKSL  1271


>YP_009047203.1 |1A polyprotein [Middle East respiratory syndrome-related 
coronavirus]
Length=4391

 Score = 2196 bits (5691),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1252/3283 (38%), Positives = 1857/3283 (57%), Gaps = 193/3283 (6%)

Query  1259  INGNLHPDSATLVSDIDITFLKKDAPYIVGD-VVQEGVLTA-----VVIPTKKAGGTTEM  1312
             I G ++  S   V      ++  

                   +T  +   VEV PQ+ ++L    QTI+                           
Sbjct  1048  L----QETPVVSDTVEVPPQV-VKLPSEPQTIQPEVKEVAPVYEADTEQTQSVTVKPKRL  1102

Query  1025  -----VNSFSGYLK--LTDNVYIKNADIVEEAKKVKPTVVVNAANVYLKHGGGVAGALNK  1077
                  V+  S +    +T+ V I   D ++ AK    +V+VNAAN +LKHGGG+AGA+N 
Sbjct  1103  RKKRNVDPLSNFEHKVITECVTIVLGDAIQVAKCYGESVLVNAANTHLKHGGGIAGAINA  1162

Query  1078  ATNNAMQVESDDYIATNGPLKVGGSCVLSGHNLAKHCLHVVGPNVNKGEDIQLLKSAYEN  1137
             A+  A+Q ESD+YI   GPL+VG S +L GH+LAK+ LHVVGP+    +D+ LL   Y+ 
Sbjct  1163  ASKGAVQKESDEYILAKGPLQVGDSVLLQGHSLAKNILHVVGPDARAKQDVSLLSKCYKA  1222

Query  1138  FNQHEVLLAPLLSAGIFGADPIHSLRVCVDTVRTNVYLAVFDKNLYDKL  1186
              N + +++ PL+SAGIFG  P  S    +   +T V + V  +++Y  L
Sbjct  1223  MNAYPLVVTPLVSAGIFGVKPAVSFDYLIREAKTRVLVVVNSQDVYKSL  1271


>AGN72639.1 |orf1ab [Middle East respiratory syndrome-related 
coronavirus]
Length=7078

 Score = 5994 bits (15551),  Expect = 0.0, Method: Compositiona

Sbjct  2982  ITTNGSWAIFNDHHLNRPGVYCGSDFIDIVRRLAVSLFQPITYFQLTTSLVLGIGLCAFL  3041

Query  3058  TCLAYYFMRFRRAFGEYSHVVAFNTLLFLMSFTVLCLTPVYSFLPGVYSVIYLYLTFYLT  3117
             T L YY  + +RAF +Y+       +  +++   +C           Y+ +Y Y TFY T
Sbjct  3042  TLLFYYINKVKRAFADYTQCAVIAVVAAVLNSLCICFVASIPLCIVPYTALYYYATFYFT  3101

Query  3118  NDVSFLAHIQWMVMFTPLVPFWITIAYIICISTKHFYWFFSNYLKRRV-VF-NGVSFSTF  3175
             N+ +F+ H+ W +MF P+VP W+T  Y + +  +HF+W  + + K+ V VF +G    +F
Sbjct  3102  NEPAFIMHVSWYIMFGPIVPIWMTCVYTVAMCFRHFFWVLAYFSKKHVEVFTDGKLNCSF  3161

Query  3176  EEAALCTFLLNKEMYLKLRSDVLLPLTQYNRYLALYNKYKYFSGAMDTTSYREAACCHLA  3235
             ++AA   F++NK+ Y  LR+   L    Y+R+L L+NKYKYFSGAM+T +YREAA CHLA
Sbjct  3162  QDAASNIFVINKDTYAALRNS--LTNDAYSRFLGLFNKYKYFSGAMETAAYREAAACHLA  3219

Query  3236  KALNDFSNSGSDVLYQPPQTSITSAVLQSGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWL  3295
             KAL  +S +GSD+LYQPP  SITS VLQSG  KM+ PSG VE CMVQVTCG+ TLNGLWL
Sbjct  3220  KALQTYSETGSDLLYQPPNCSITSGVLQSGL

              N + +++ PL+SAGIFG  P  S    +   +T V + V  +++Y  L
Sbjct  1223  MNAYPLVVTPLVSAGIFGVKPAVSFDYLIREAKTRVLVVVNSQDVYKSL  1271


>AKS48060.1 |ORF1ab [Middle East respiratory syndrome-related 
coronavirus]
Length=7078

 Score = 5990 bits (15541),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 3038/5978 (51%), Positives = 3998/5978 (67%), Gaps = 202/5978 (3%)

Query  1259  INGNLHPDSATLVSDIDITFLKKDAPYIVGD-VVQEGVLTA-----VVIPTKKAGGTTEM  1312
             I G ++  S   V      ++    P  VGD V+ +G   A     VV P  +A     +
Sbjct  1156  IAGAINAASKGAVQKESDEYILAKGPLQVGDSVLLQGHSLAKNILHVVGPDARAKQDVSL  1215

Query  1313  LAK---ALRKVPTDNYITTYPGQGLNG--------YTVEEAKTVLKKCKSAFYILPSIIS  1361
             L+K   A+   P    +T     G+ G        Y + EAKT +    ++  +  S+  
Sbjct  1216  LSKCYKAMNAYPL--VVTPLVSAGIFGVKPAVSFDYLIREAKTRVLVVVNSQDVYKSLTI  1273

Query  1362  NEKQEILGTVSWNLREMLAHAEETRKLMPVCVETKAIVSTIQRKYKGIKIQEGVVDYGAR  1421
              +  + L      LR  +  A++    + VC 




In [32]:
import re
import numpy as np

def average_identity_rate_for_all_alignments(blast_output_file):
    # running sum of per-alignment identity fractions and the alignment count
    total_identity_frac = 0
    n_alignments = 0

    with open(blast_output_file) as f:
        for line in f:
            # matches lines such as " Identities = 6120/7103 (86%), ..."
            m = re.match(r"\W*Identities = (\d+)/(\d+)", line)
            if m is not None:
                num = float(m.group(1))
                denom = float(m.group(2))
                total_identity_frac += num/denom
                n_alignments += 1

    if n_alignments == 0:
        raise ValueError("no alignments found in " + blast_output_file)
    return total_identity_frac / n_alignments

pairings = [('S1', 'S2'), ('S1', 'M'), ('S2', 'M')]
rows = np.zeros((3 ,3))
df = pd.DataFrame(rows, columns=['S1', 'S2', 'M'], index=['S1', 'S2', 'M'])
df.loc['S1', 'S1'] = df.loc['S2', 'S2'] = df.loc['M', 'M'] = 100
for pair in pairings:
    f = './data/blast_' + '_'.join(pair) + '.txt'
    df.loc[pair[0], pair[1]] = df.loc[pair[1], pair[0]] = round(average_identity_rate_for_all_alignments(f)*100)

df

       S1     S2      M
S1  100.0   84.0   40.0
S2   84.0  100.0   38.0
M    40.0   38.0  100.0
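The averages in this table are lower than the top-hit identities reported earlier (for example, 84% versus 86% for SARS-CoV-1 and SARS-CoV-2) because the function averages over every alignment in the output file and gives short and long alignments equal weight. The sketch below compares that unweighted mean with a length-weighted alternative. It is only an illustration: `identity_rates` is a hypothetical helper, and the two sample lines are copied from the SARS-CoV-2 x MERS-CoV output shown above.

```python
import re

IDENTITY_RE = re.compile(r"\s*Identities = (\d+)/(\d+)")

def identity_rates(lines):
    """Return (unweighted_mean, length_weighted_mean) over all 'Identities' lines."""
    pairs = []
    for line in lines:
        m = IDENTITY_RE.match(line)
        if m:
            pairs.append((int(m.group(1)), int(m.group(2))))
    if not pairs:
        raise ValueError("no 'Identities' lines found")
    # Unweighted: every alignment counts equally, regardless of its length.
    unweighted = sum(num / den for num, den in pairs) / len(pairs)
    # Weighted: total identical positions over total aligned positions.
    weighted = sum(num for num, _ in pairs) / sum(den for _, den in pairs)
    return unweighted, weighted

sample = [
    " Identities = 1252/3283 (38%), Positives = 1857/3283 (57%), Gaps = 193/3283 (6%)",
    " Identities = 3038/5978 (51%), Positives = 3998/5978 (67%), Gaps = 202/5978 (3%)",
]
unweighted, weighted = identity_rates(sample)
print(round(unweighted, 3), round(weighted, 3))
```

For these two lines the weighted value is the higher one, because the longer alignment is also the more similar one. Which summary is more appropriate depends on whether the question is about typical alignments or about overall sequence coverage.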


Clean up data directory

In [33]:
!rm ./data/*.fasta.*

### Mutation Research

After analyzing the similarities and differences between the SARS-CoV-1, SARS-CoV-2 and MERS-CoV sequences, we furthered our research on SARS-CoV-2 by inspecting the D614G mutation. The D614G mutation causes a change in the structure of the viral spike glycoprotein.

To focus the study, we used California as a case study of the D614G missense mutation. We analyzed 30 samples taken from the NCBI database and checked whether they contained the mutation.

The functions below read in the sequence data and find which samples contain the mutation.

In [34]:
def get_sequence(filename):
    # read a single-record FASTA file: skip the header line, join the rest
    with open("./mutation_data/" + filename) as file:
        next(file, None)
        return "".join(line.strip() for line in file)


def find_mutations():
    reference_genome = get_sequence("reference_seq.fasta")

    # samples are stored as seq_01.fasta ... seq_30.fasta
    mutations = [get_sequence(f"seq_{i:02}.fasta") for i in range(1, 31)]

    # 0-based index 23402 is genome position 23,403 (the D614G site);
    # this assumes each sample uses the same coordinates as the reference
    with open("./mutation_data/results.txt", "w") as out_file_1:
        for i, mutation in enumerate(mutations, start=1):
            if len(mutation) > 23402 and mutation[23402] == 'G':
                out_file_1.write(str(i) + " - " + mutation[23402] + "\n")

    with open("./mutation_data/differences.txt", "w") as out_file_2:
        for i, mutation in enumerate(mutations, start=1):
            out_file_2.write(str(i) + " - differences\n")
            min_len = min(len(reference_genome), len(mutation))

            # "-" where the sample matches the reference,
            # the reference base where it differs
            differences = "".join(
                "-" if reference_genome[j] == mutation[j] else reference_genome[j]
                for j in range(min_len)
            )

            out_file_2.write(reference_genome + "\n")
            out_file_2.write(differences + "\n")

find_mutations()

#### Mutation Results
Below are the results of searching for the D614G missense mutation in the data set. Each line gives the sample number and the base found at position 23,403.

In [35]:
%%bash
cat ./mutation_data/results.txt

4 - G
21 - G
24 - G
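Three of the 30 samples carry a G at this position. A small sketch of how that prevalence could be computed from lines in the `results.txt` format follows; `mutation_prevalence` is a hypothetical helper, and the sample count of 30 is taken from the text above.

```python
def mutation_prevalence(result_lines, n_samples):
    """Fraction of samples flagged in results.txt-style lines such as '4 - G'."""
    # The number before " - " is the sample number; a set guards against repeats.
    flagged = {int(line.split(" - ")[0]) for line in result_lines if line.strip()}
    return len(flagged) / n_samples

lines = ["4 - G", "21 - G", "24 - G"]
prevalence = mutation_prevalence(lines, n_samples=30)
print(f"{prevalence:.0%}")
```

This gives a prevalence of 10% in this sample. With only 30 sequences, that figure should be read as a rough estimate rather than a population-level measurement.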


### Pulling Data from NCBI

There are third-party resources for pulling data from the NCBI database. Using these resources to streamline or automate sequence collection would be convenient, but the ones we tried were either too simple to be useful or too complicated to be worth the time.

Furthermore, there are problems with the NCBI database itself. The data labelled as SARS-CoV-1 and SARS-CoV-2 were part of the same dataset, so manual verification is required to confirm that you are getting the proper sequences. Any attempt at automation would run into this problem, and it is likely that there are similar issues with other data entries.
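Part of the manual verification could in principle be scripted by checking the organism name in each FASTA header. The sketch below assumes headers end with the organism in square brackets, as in the BLAST output shown earlier (for example `[Severe acute respiratory syndrome coronavirus 2]`). The helper names, the second record `XX000000.1`, and the placeholder sequences are hypothetical and used only for illustration.

```python
def split_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def organism(header):
    """Return the text inside the last [...] of a header, or '' if absent."""
    start, end = header.rfind("["), header.rfind("]")
    return header[start + 1:end] if 0 <= start < end else ""

SARS2 = "Severe acute respiratory syndrome coronavirus 2"

# Placeholder sequences; XX000000.1 is a made-up accession for illustration.
example = """>QPF16038.1 |ORF1ab polyprotein [Severe acute respiratory syndrome coronavirus 2]
AAAA
CCCC
>XX000000.1 |hypothetical record [Severe acute respiratory syndrome-related coronavirus]
GGGG
"""

kept = [(h.split()[0], s) for h, s in split_fasta(example) if organism(h) == SARS2]
print(kept)
```

A filter like this only catches records whose headers are labelled consistently, so it would reduce rather than replace the manual checks described above.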


### Relevant GitHubs

Analysis using k-mer composition
https://anderson-github-classroom.github.io/csc-448-project/eagranof/

Even more analysis using k-mer composition
https://anderson-github-classroom.github.io/csc-448-project/skurdogh/ 

Vitulgin Experimentation
https://anderson-github-classroom.github.io/csc-448-project/cilg/

The Levenshtein distance experiment
https://anderson-github-classroom.github.io/csc-448-project/awengel/ 

Mutation Rate Comparison and Spike Proteins
https://anderson-github-classroom.github.io/csc-448-project/pamidi/