<a href="https://colab.research.google.com/github/lestimpe/SARS-CoV-2-genome/blob/main/latest_SARS_CoV_2(3).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Messenger RNA Vaccines (3)

The sequence of the Wuhan isolate of SARS-CoV-2 was published in January 2020.  Scientists at the NIH who had previous experience designing mRNA vaccines immediately chose part of the sequence to include in an mRNA vaccine for SARS-CoV-2.  They picked the spike protein because it projects from the surface of the virus and hence is likely to be antigenic (recall that an *antigen* is antibody-generating: it provokes the immune system to make antibodies).  They shared the sequence with Moderna, a biotech company that was attempting to use mRNA for various therapeutics.

The idea of the vaccine is to inject mRNA into a person's muscle.  The mRNA should be taken up by muscle cells and direct the synthesis of spike proteins.  These would be expressed on the surface of the muscle cells and, since they are foreign to the body, elicit an immune response.  If the person were exposed to SARS-CoV-2 later, the body would already be prepared to fight off the infection.

There are potential problems, however.  Messenger RNA is not very stable extracellularly (outside of cells), and it is not taken up efficiently across the cell membrane.  Moderna (and other companies) developed a *lipid nanoparticle* technology that helps solve both problems.  So, once the best sequence was chosen, the mRNA itself could be easily synthesized, then packaged in nanoparticles for delivery.

We are going to compare the sequences of the BioNTech/Pfizer and Moderna mRNA vaccines with that of the Wuhan isolate.  To complete this notebook you need to copy two files from iLearn into your Downloads folder.

Run the following code cell to prepare the notebook:

In [1]:
!pip install biopython
import Bio
from Bio import Entrez
from Bio import SeqIO
from Bio import GenBank

Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (2.3 MB)
Installing collected packages: biopython
Successfully installed biopython-1.79


The next code cell fetches the nucleotide sequence of the Wuhan isolate from NCBI:

In [12]:
Entrez.email = 'A.N.Other@example.com'
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="NC_045512.2"
) as handle:
    Wuhan_record = SeqIO.read(handle, "genbank")
print("%s with %i features" % (Wuhan_record.id, len(Wuhan_record.features)))

NC_045512.2 with 57 features


Some scientists at Stanford determined the mRNA sequences of the Moderna and Pfizer vaccines.  (Apparently the companies have not been eager to publish the sequences themselves.)  The next code cell loads those sequences.  You should have already copied the two files from GitHub to your Downloads folder.  After you start the code cell, click on Choose Files at the bottom, and select the Moderna and BioNTech sequence files from your Downloads folder.

In [3]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

Saving mRNA_Moderna.txt to mRNA_Moderna.txt
Saving mRNA_BioNTech.txt to mRNA_BioNTech.txt
User uploaded file "mRNA_Moderna.txt" with length 4184 bytes
User uploaded file "mRNA_BioNTech.txt" with length 4367 bytes


The next two code cells display the BioNTech and Moderna nucleotide sequences:

In [13]:
for seq_record in SeqIO.parse("mRNA_BioNTech.txt", "fasta"):
    print(seq_record.id)
    print(seq_record.seq)
    BioNTech_record = seq_record
    print(len(seq_record))

Figure1_032321_Spike-encoding_contig_assembled_from_BioNTech/Pfizer_BNT-162b2_vaccine
GAGAATAAACTAGTATTCTTCTGGTCCCCACAGACTCAGAGAGAACCCGCCACCATGTTCGTGTTCCTGGTGCTGCTGCCTCTGGTGTCCAGCCAGTGTGTGAACCTGACCACCAGAACACAGCTGCCTCCAGCCTACACCAACAGCTTTACCAGAGGCGTGTACTACCCCGACAAGGTGTTCAGATCCAGCGTGCTGCACTCTACCCAGGACCTGTTCCTGCCTTTCTTCAGCAACGTGACCTGGTTCCACGCCATCCACGTGTCCGGCACCAATGGCACCAAGAGATTCGACAACCCCGTGCTGCCCTTCAACGACGGGGTGTACTTTGCCAGCACCGAGAAGTCCAACATCATCAGAGGCTGGATCTTCGGCACCACACTGGACAGCAAGACCCAGAGCCTGCTGATCGTGAACAACGCCACCAACGTGGTCATCAAAGTGTGCGAGTTCCAGTTCTGCAACGACCCCTTCCTGGGCGTCTACTACCACAAGAACAACAAGAGCTGGATGGAAAGCGAGTTCCGGGTGTACAGCAGCGCCAACAACTGCACCTTCGAGTACGTGTCCCAGCCTTTCCTGATGGACCTGGAAGGCAAGCAGGGCAACTTCAAGAACCTGCGCGAGTTCGTGTTTAAGAACATCGACGGCTACTTCAAGATCTACAGCAAGCACACCCCTATCAACCTCGTGCGGGATCTGCCTCAGGGCTTCTCTGCTCTGGAACCCCTGGTGGATCTGCCCATCGGCATCAACATCACCCGGTTTCAGACACTGCTGGCCCTGCACAGAAGCTACCTGACACCTGGCGATAGCAGCAGCGGATGGACAGCTGGTGCCGCCGCTTACTATGTGGGCTACCTGCAGCCTAGAACCTTCCTGCTGAAGTACAACGAGAACGGCACCATCACCGA

In [14]:
for seq_record in SeqIO.parse("mRNA_Moderna.txt", "fasta"):
    print(seq_record.id)
    print(seq_record.seq)
    Moderna_record = seq_record
    print(len(seq_record))

Figure_2_32321_Spike-encoding_contig_assembled_from_Moderna_mRNA-1273_vaccine
GGGAAATAAGAGAGAAAAGAAGAGTAAGAAGAAATATAAGACCCCGGCGCCGCCACCATGTTCGTGTTCCTGGTGCTGCTGCCCCTGGTGAGCAGCCAGTGCGTGAACCTGACCACCCGGACCCAGCTGCCACCAGCCTACACCAACAGCTTCACCCGGGGCGTCTACTACCCCGACAAGGTGTTCCGGAGCAGCGTCCTGCACAGCACCCAGGACCTGTTCCTGCCCTTCTTCAGCAACGTGACCTGGTTCCACGCCATCCACGTGAGCGGCACCAACGGCACCAAGCGGTTCGACAACCCCGTGCTGCCCTTCAACGACGGCGTGTACTTCGCCAGCACCGAGAAGAGCAACATCATCCGGGGCTGGATCTTCGGCACCACCCTGGACAGCAAGACCCAGAGCCTGCTGATCGTGAATAACGCCACCAACGTGGTGATCAAGGTGTGCGAGTTCCAGTTCTGCAACGACCCCTTCCTGGGCGTGTACTACCACAAGAACAACAAGAGCTGGATGGAGAGCGAGTTCCGGGTGTACAGCAGCGCCAACAACTGCACCTTCGAGTACGTGAGCCAGCCCTTCCTGATGGACCTGGAGGGCAAGCAGGGCAACTTCAAGAACCTGCGGGAGTTCGTGTTCAAGAACATCGACGGCTACTTCAAGATCTACAGCAAGCACACCCCAATCAACCTGGTGCGGGATCTGCCCCAGGGCTTCTCAGCCCTGGAGCCCCTGGTGGACCTGCCCATCGGCATCAACATCACCCGGTTCCAGACCCTGCTGGCCCTGCACCGGAGCTACCTGACCCCAGGCGACAGCAGCAGCGGGTGGACAGCAGGCGCGGCTGCTTACTACGTGGGCTACCTGCAGCCCCGGACCTTCCTGCTGAAGTACAACGAGAACGGCACCATCACCGACGCCG

In a sequence *alignment*, two or more sequences are arranged to line up the corresponding nucleotides or amino acids.  We are going to do pairwise sequence alignments.  

There are very many ways to align sequences of the size we are working with.  For each candidate alignment the computer calculates a score, adding points for matching nucleotides and subtracting points for mismatches and for gaps.  The program then returns the alignment (or perhaps more than one) with the highest score.  The following code cell imports the module we need and assigns values to the scoring parameters.

In [6]:
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'local'            # local alignment: find the best-matching region
aligner.mismatch_score = -2       # penalty for each mismatched pair
# open_gap_score and extend_gap_score set the corresponding target/query,
# internal/left/right gap scores all at once.
aligner.open_gap_score = -10.0    # penalty for opening a gap
aligner.extend_gap_score = -0.5   # penalty for each further gap position
# Align the Wuhan genome with the BioNTech mRNA (nucleotide sequences)
alignments = aligner.align(Wuhan_record.seq, BioNTech_record.seq)
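
To make the scoring concrete, here is a minimal pure-Python sketch of how a single, already gapped alignment could be scored with the parameter values set above.  The function name `score_alignment` is hypothetical.  The gap rule used here (the open penalty for the first position of a gap, the extend penalty for each further position) and the match score of 1 are assumptions about how Biopython scores by default, not a copy of its code.

```python
def score_alignment(top, bottom, match=1.0, mismatch=-2.0,
                    gap_open=-10.0, gap_extend=-0.5):
    """Score two aligned rows of equal length; '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    gap_row = None  # which row the gap we are currently inside belongs to
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            # Assumption: the first gap position costs gap_open,
            # every further position of the same gap costs gap_extend.
            score += gap_extend if row == gap_row else gap_open
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(score_alignment("ACGT", "ACGT"))    # 4.0  (four matches)
print(score_alignment("ACGT", "ACCT"))    # 1.0  (three matches, one mismatch)
print(score_alignment("ACGTT", "A--TT"))  # -7.5 (three matches, one gap of length 2)
```

With these values a gap is far more expensive than a mismatch, so the aligner will generally prefer an alignment with a few mismatches over one that opens a gap.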

Now let's compare the nucleotide sequences of the two mRNA vaccines.  Run the code cell.  You will see the two sequences displayed horizontally, with Moderna on top.  Vertical lines connect nucleotides that are the same.  Within the aligned region, dots appear where the two nucleotides differ.  Outside the aligned region there are 5' and 3' sequences that do not match at all.  Use the scroll bar to scan horizontally to the right-hand end of the sequence.  The two nucleotide sequences are similar, but not identical.

In [15]:
alignmentBvsM = aligner.align(Moderna_record.seq, BioNTech_record.seq)
print(alignmentBvsM[0])

GGGAAATAAGAGAGAAAAGAAGAGTAAGAAGAAATATAAGACCCCGGCGCCGCCACCATGTTCGTGTTCCTGGTGCTGCTGCCCCTGGTGAGCAGCCAGTGCGTGAACCTGACCACCCGGACCCAGCTGCCACCAGCCTACACCAACAGCTTCACCCGGGGCGTCTACTACCCCGACAAGGTGTTCCGGAGCAGCGTCCTGCACAGCACCCAGGACCTGTTCCTGCCCTTCTTCAGCAACGTGACCTGGTTCCACGCCATCCACGTGAGCGGCACCAACGGCACCAAGCGGTTCGACAACCCCGTGCTGCCCTTCAACGACGGCGTGTACTTCGCCAGCACCGAGAAGAGCAACATCATCCGGGGCTGGATCTTCGGCACCACCCTGGACAGCAAGACCCAGAGCCTGCTGATCGTGAATAACGCCACCAACGTGGTGATCAAGGTGTGCGAGTTCCAGTTCTGCAACGACCCCTTCCTGGGCGTGTACTACCACAAGAACAACAAGAGCTGGATGGAGAGCGAGTTCCGGGTGTACAGCAGCGCCAACAACTGCACCTTCGAGTACGTGAGCCAGCCCTTCCTGATGGACCTGGAGGGCAAGCAGGGCAACTTCAAGAACCTGCGGGAGTTCGTGTTCAAGAACATCGACGGCTACTTCAAGATCTACAGCAAGCACACCCCAATCAACCTGGTGCGGGATCTGCCCCAGGGCTTCTCAGCCCTGGAGCCCCTGGTGGACCTGCCCATCGGCATCAACATCACCCGGTTCCAGACCCTGCTGGCCCTGCACCGGAGCTACCTGACCCCAGGCGACAGCAGCAGCGGGTGGACAGCAGGCGCGGCTGCTTACTACGTGGGCTACCTGCAGCCCCGGACCTTCCTGCTGAAGTACAACGAGAACGGCACCATCACCGACGCCGTGGACTGCGCCCTGGACCCTCTGAGCGAGACCAAGTGCACCCTGAAGAGCTTCACCGTGGAGAAGGGCATCTACCAGA
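
The display convention described above can be reproduced by hand on a short stretch.  The sketch below is an illustration only, not Biopython's implementation.  The helper names `match_line` and `percent_identity` are hypothetical, and the two strings are the first 39 coding nucleotides (starting at ATG) copied from the Moderna and BioNTech sequences printed earlier.

```python
def match_line(top, bottom):
    """Build the middle line of an alignment display for two aligned rows."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            symbols.append("-")  # gap in one of the sequences
        elif a == b:
            symbols.append("|")  # identical nucleotides
        else:
            symbols.append(".")  # mismatch
    return "".join(symbols)


def percent_identity(top, bottom):
    """Percentage of aligned columns in which the two rows are identical."""
    line = match_line(top, bottom)
    return 100.0 * line.count("|") / len(line)


moderna_start = "ATGTTCGTGTTCCTGGTGCTGCTGCCCCTGGTGAGCAGC"
biontech_start = "ATGTTCGTGTTCCTGGTGCTGCTGCCTCTGGTGTCCAGC"

print(moderna_start)
print(match_line(moderna_start, biontech_start))
print(biontech_start)
print(round(percent_identity(moderna_start, biontech_start), 1))
```

In this short stretch the two vaccines differ at three nucleotide positions, all of them within two codons.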

We are more interested in the amino acid sequence than in the nucleotide sequence, since the mRNA will be translated into protein, in lung cells for the virus or in muscle cells for the vaccine.  The next two code cells translate the two mRNA sequences, giving the amino acid sequences in their single-letter abbreviations.

In [16]:
BioNTech_aa = BioNTech_record.seq.translate()
print(BioNTech_aa)

ENKLVFFWSPQTQREPATMFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILS



In [17]:
Moderna_aa = Moderna_record.seq.translate()
print(Moderna_aa)

GK*ERKEE*EEI*DPGAATMFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDIL



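In the Moderna sequence you see some asterisks; these are stop codons.  They all lie upstream of the first methionine (M), in the 5' untranslated region, because we translated from the very first nucleotide of the mRNA rather than from the start codon.  They therefore do not interrupt the spike protein itself.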
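
The following pure-Python sketch shows where the asterisks come from.  It is a stand-in for Biopython's `translate()`, not its actual implementation.  The names `build_codon_table` and `translate` are hypothetical, and the table is the standard genetic code written in the usual T, C, A, G order.  The test string is the first 66 nucleotides of the Moderna sequence printed earlier.

```python
def build_codon_table():
    """Standard genetic code; '*' marks the three stop codons."""
    bases = "TCAG"
    amino_acids = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
                   "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
    table = {}
    index = 0
    for first in bases:
        for second in bases:
            for third in bases:
                table[first + second + third] = amino_acids[index]
                index += 1
    return table


CODON_TABLE = build_codon_table()


def translate(nucleotides):
    """Translate from the first nucleotide, ignoring a trailing partial codon."""
    protein = []
    for pos in range(0, len(nucleotides) - 2, 3):
        protein.append(CODON_TABLE[nucleotides[pos:pos + 3]])
    return "".join(protein)


moderna_5prime = ("GGGAAATAAGAGAGAAAAGAAGAGTAAGAAGAAATATAAGACC"
                  "CCGGCGCCGCCACCATGTTCGTG")

print(translate(moderna_5prime))          # GK*ERKEE*EEI*DPGAATMFV
start = moderna_5prime.find("ATG")        # position of the first ATG
print(translate(moderna_5prime[start:]))  # MFV
```

Translating from the first ATG instead of from the first nucleotide skips the untranslated region, so the stop codons in front of the start codon no longer appear.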

We also need to translate the Wuhan isolate sequence, but the sequence downloaded from NCBI is in the wrong reading frame for the spike protein.  So, in the next code cell we remove the first nucleotide, shifting into the correct reading frame.  Not all genes in this virus share the same reading frame, but we are interested in the spike protein here, so we need the right one for spike.

In [18]:
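from Bio.Seq import MutableSeq
mutable_seq = MutableSeq(Wuhan_record.seq)
# remove() deletes the first occurrence of "A".  The genome begins with A,
# so this drops the first nucleotide and shifts the reading frame by one.
mutable_seq.remove("A")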
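
A small sketch may help show why dropping one nucleotide matters.  The helper `codons` and the toy sequence are hypothetical.  The spike start coordinate of 21563 (1-based) is an assumption taken from the GenBank annotation of NC_045512.2 as I understand it; it can be checked against `Wuhan_record.features`.

```python
def codons(nucleotides, frame):
    """Split a sequence into triplets, starting at 0-based offset `frame`."""
    return [nucleotides[i:i + 3]
            for i in range(frame, len(nucleotides) - 2, 3)]


toy = "AATGTTCGTG"
print(codons(toy, 0))  # ['AAT', 'GTT', 'CGT']  start codon is split up
print(codons(toy, 1))  # ['ATG', 'TTC', 'GTG']  start codon is in frame

# Assumption: the spike coding sequence starts at position 21563 (1-based).
spike_start = 21563
offset = (spike_start - 1) % 3  # nucleotides to drop from the 5' end
print(offset)                   # 1
```

An offset of 1 means that removing a single nucleotide from the start of the genome puts the spike start codon in frame, which is what the code cell above does.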

Now we are ready to align the Wuhan isolate (entire sequence) with the translated spike protein sequence from the BioNTech vaccine.  Remember that the spike gene is near the 3' end of the genome; scroll to the right to find it.  The first amino acid of the spike protein is methionine (M), which is the second amino acid in the alignment.

In [19]:
alignmentWvsB_aa = aligner.align(mutable_seq.translate(), BioNTech_aa)
print(alignmentWvsB_aa[0])



L K V Y T F P G N K P T N F R S L V D L F S K R T L K S V W L S L G C M L S A L T Q Y N * * L I T V V D R T R V T R L S S A G C L R F R P C C S R S S A H L G F V R V * P K G K M E S L V P G F N E K T H V Q L S L P V L Q V R D V L V R G F G D S V E E V L S E A R Q H L K D G T C G L V E V E K G V L P Q L E Q P Y V F I K R S D A R T A P H G H V M V E L V A E L E G I Q Y G R S G E T L G V L V P H V G E I P V A Y R K V L L R K N G N K G A G G H S Y G A D L K S F D L G D E L G T D P Y E D F Q E N W N T K H S S G V T R E L M R E L N G G A Y T R Y V D N N F C G P D G Y P L E C I K D L L A R A G K A S C T L S E Q L D F I D T K R G V Y C C R E H E H E I A W Y T E R S E K S Y E L Q T P F E I K L A K K F D T F N G E C P N F V F P L N S I I K T I Q P R V E K K K L D G F M G R I R S V Y P V A S P N E C N Q M C L S T L M K C D H C G E T S W Q T G D F V K A T C E F C G T E N L T K E G A T T C G Y L P Q N A V V K I Y C P A C H N S E V G P E H S L A E Y H N E S G L K T I L R K G G R T I A F G G C V F S 

Here is the Wuhan isolate aligned with the Moderna translated sequence:

In [20]:
alignmentWvsM_aa = aligner.align(mutable_seq.translate(), Moderna_aa)
print(alignmentWvsM_aa[0])

L K V Y T F P G N K P T N F R S L V D L F S K R T L K S V W L S L G C M L S A L T Q Y N * * L I T V V D R T R V T R L S S A G C L R F R P C C S R S S A H L G F V R V * P K G K M E S L V P G F N E K T H V Q L S L P V L Q V R D V L V R G F G D S V E E V L S E A R Q H L K D G T C G L V E V E K G V L P Q L E Q P Y V F I K R S D A R T A P H G H V M V E L V A E L E G I Q Y G R S G E T L G V L V P H V G E I P V A Y R K V L L R K N G N K G A G G H S Y G A D L K S F D L G D E L G T D P Y E D F Q E N W N T K H S S G V T R E L M R E L N G G A Y T R Y V D N N F C G P D G Y P L E C I K D L L A R A G K A S C T L S E Q L D F I D T K R G V Y C C R E H E H E I A W Y T E R S E K S Y E L Q T P F E I K L A K K F D T F N G E C P N F V F P L N S I I K T I Q P R V E K K K L D G F M G R I R S V Y P V A S P N E C N Q M C L S T L M K C D H C G E T S W Q T G D F V K A T C E F C G T E N L T K E G A T T C G Y L P Q N A V V K I Y C P A C H N S E V G P E H S L A E Y H N E S G L K T I L R K G G R T I A F G G C V F S 



Find the overlap between the translated Wuhan isolate and vaccine mRNA sequences in each of the two alignments.  Are they different, or are they the same?  What conclusion might you draw about the mechanism of action of the two vaccines?