<a href="https://colab.research.google.com/github/lestimpe/SARS-CoV-2-genome/blob/main/ComparingWuhanAndUnknown(4).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Comparing the Wuhan isolate with two variants**

We will download the genomic sequences of the Wuhan isolate, a Delta variant, and an unknown genome that you will classify at the end of this notebook. First we install Biopython:

In [None]:
!pip install biopython
import Bio
from Bio import Entrez
from Bio import SeqIO
from Bio import GenBank
from Bio import Align
from Bio import AlignIO

The next code cell downloads the three genomic sequences:

In [None]:
# NCBI asks for a contact address with Entrez requests; setting it once is enough.
Entrez.email = 'A.N.Other@example.com'

# Wuhan isolate
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="NC_045512.2"
) as handle:
    Wuhan_record = SeqIO.read(handle, "genbank")
print("%s with %i features" % (Wuhan_record.id, len(Wuhan_record.features)))

# Delta variant
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="MZ077342.1"
) as handle:
    Delta_record = SeqIO.read(handle, "genbank")
print("%s with %i features" % (Delta_record.id, len(Delta_record.features)))

# Unknown genome, to be classified at the end of the notebook
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="MZ007342.1"
) as handle:
    Unknown_record = SeqIO.read(handle, "genbank")
print("%s with %i features" % (Unknown_record.id, len(Unknown_record.features)))

In a sequence *alignment*, two or more sequences are arranged to line up the corresponding nucleotides or amino acids.  Remember that there are three reading frames per strand. We are going to use PairwiseAligner to make sure the spike genes are in the same reading frame.
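
To see why the reading frame matters, here is a minimal sketch in plain Python (no Biopython needed) that translates one short DNA string in each of the three forward frames. The toy sequence and the names `CODON_TABLE` and `translate_frame` are illustrative assumptions and are not part of the notebook's data.

```python
from itertools import product

# Standard genetic code, built in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_frame(dna, frame):
    """Translate dna starting at offset frame (0, 1 or 2).

    An incomplete codon at the end of the sequence is dropped.
    """
    usable = len(dna) - (len(dna) - frame) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(frame, usable, 3))

# Made-up example: ATG TTT GTT TAA, a start codon, two codons, and a stop.
toy_dna = "ATGTTTGTTTAA"
for frame in range(3):
    print(frame, translate_frame(toy_dna, frame))
```

Only frame 0 gives a protein that starts with M and ends with a stop codon (`*`); shifting by one or two nucleotides produces unrelated amino acids. This is why both spike genes must be read in the same frame before they are compared.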

In [None]:
aligner = Align.PairwiseAligner()
aligner.mode = 'local'
aligner.mismatch_score = -10
aligner.target_internal_open_gap_score = -20.000000
aligner.target_internal_extend_gap_score = -5.00000
aligner.target_left_open_gap_score = -10.000000
aligner.target_left_extend_gap_score = -0.500000
aligner.target_right_open_gap_score = -10.000000
aligner.target_right_extend_gap_score = -0.500000
aligner.query_internal_open_gap_score = -10.000000
aligner.query_internal_extend_gap_score = -0.500000
aligner.query_left_open_gap_score = -10.000000
aligner.query_left_extend_gap_score = -0.500000
aligner.query_right_open_gap_score = -10.000000
aligner.query_right_extend_gap_score = -0.500000

Now we run the aligner. The output shows the two amino acid sequences in the single-letter code, with the Wuhan isolate on top.

In [None]:
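# Slice the spike gene region out of each genome (Python slices are 0-based and end-exclusive).
spikeW = Wuhan_record.seq[21562:25384]
spikeD = Delta_record.seq[21525:25427]
# Translate both to protein and align them; the first argument (Wuhan) is printed on top.
alignmentWvsD_aa = aligner.align(spikeW.translate(), spikeD.translate())

# Show the first alignment returned.
print(alignmentWvsD_aa[0])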

Notice that the first amino acid is M, or methionine, as expected.  If you scroll all the way to the end of the sequence you will see an asterisk corresponding to the stop codon, the end of translation and the end of the open reading frame.
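
The slice `[21562:25384]` used for the Wuhan spike gene follows Python's 0-based, end-exclusive convention. The arithmetic sketch below, using a hypothetical helper `slice_to_one_based`, converts the slice to 1-based coordinates and checks that its length is a whole number of codons, as a complete open reading frame requires.

```python
def slice_to_one_based(start, end):
    """Convert a Python slice [start:end] to 1-based inclusive coordinates."""
    return start + 1, end

start, end = 21562, 25384            # slice used for the Wuhan spike gene
first, last = slice_to_one_based(start, end)
length = end - start                 # number of nucleotides in the slice
codons, remainder = divmod(length, 3)

print("1-based coordinates: %i..%i" % (first, last))
print("%i nucleotides = %i codons, remainder %i" % (length, codons, remainder))
```

A remainder of 0 means the slice contains only complete codons. The last codon is the stop codon, so the translated protein has one amino acid fewer than the codon count.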

The missense mutations and gaps are represented by dots and dashes, respectively. Scrolling back to the beginning, you will see where the program has aligned LIV in the Wuhan isolate with XXX in Delta. There is no amino acid X in the single-letter code. This is a region of nine nucleotides in Delta that the sequencing failed to identify. Because these three codons are unknown, the translation step writes X for each of them.
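
Stretches of X are easy to miss when scrolling through a long alignment. As a small sketch, the hypothetical helper `unknown_runs` below locates each run of X in a protein string and reports how many nucleotides it corresponds to (three per X). The fragment is made up for illustration.

```python
import re

def unknown_runs(protein):
    """Return (position, length_aa, length_nt) for each run of X.

    position is the 1-based index of the first X in the run.
    """
    return [
        (m.start() + 1, len(m.group()), 3 * len(m.group()))
        for m in re.finditer(r"X+", protein)
    ]

toy_protein = "MFVFXXXLPLV"   # made-up fragment, not real spike sequence
print(unknown_runs(toy_protein))
```

Positions reported this way should not be scored as mutations, because the true amino acids are simply unknown.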

Near the beginning of the sequence you will see T (threonine) in the Wuhan isolate paired with R (arginine) in Delta. The dot indicates a mismatch. This missense mutation is written as T19R, where 19 is the position of the amino acid. Each variant is defined by its mutations. Lists of the mutations for a few common variants are in a document called *Variants_Spike_mutations* on iLearn.
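
The notation can also be generated mechanically from two aligned sequences. The sketch below is one possible approach, with a hypothetical function `list_mutations`. Only the substitution format (such as T19R) comes from the text above; the labels used here for deletions and insertions are assumptions made for illustration, and the sequences are invented.

```python
def list_mutations(ref_aligned, var_aligned):
    """Compare two aligned protein strings (gaps as '-') and name the changes.

    Positions are counted along the reference, so a gap in the reference
    (an insertion in the variant) does not advance the position counter.
    """
    mutations = []
    position = 0
    for ref_aa, var_aa in zip(ref_aligned, var_aligned):
        if ref_aa != "-":
            position += 1
        if ref_aa == var_aa or "X" in (ref_aa, var_aa):
            continue                      # identical, or residue unknown
        if ref_aa == "-":
            mutations.append("ins%i%s" % (position, var_aa))    # assumed label
        elif var_aa == "-":
            mutations.append("%s%idel" % (ref_aa, position))    # assumed label
        else:
            mutations.append("%s%i%s" % (ref_aa, position, var_aa))
    return mutations

# Made-up aligned fragments (not real spike sequence).
ref = "MKTLIVAG"
var = "MKRXXX-G"
print(list_mutations(ref, var))
```

Here the third residue changes from T to R, the three X positions are skipped, and the seventh residue is missing in the variant.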


We are interested in a multiple sequence alignment, which includes all three viral genomes. Multiple sequence alignment is a more difficult problem than pairwise alignment, and can't be done through Biopython without installing another program on your computer. So the alignment is provided as a file called *alignment.clustal_num*, which is available from GitHub and should already be in your Downloads directory. CLUSTAL is the program used to do the aligning.

Biopython will not display it in the most useful format, so you should open the file on your computer. It is a text file. When I open it with Word, the file does not look like a proper amino acid alignment, but this is because the default page margins are too large. Click on the Layout tab and set the margins to 1/2 inch all around, and the amino acids should line up nicely.

# Assignment

The assignment is to classify the unknown genome in the multiple alignment. You need to have the multiple alignment open and displayed properly, plus the *Variants_Spike_mutations* file. Go through the alignment comparing the Wuhan isolate with the unknown. Write down the mutations using the notation described above, then compare your results with the lists in *Variants_Spike_mutations* to figure out which variant the unknown is. If you are interested in learning more about the variants, look [here](https://www.nytimes.com/interactive/2021/health/coronavirus-variant-tracker.html?searchResultPosition=83). The article also has some useful diagrams of the genome and pictures of the spike protein.
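
Once you have written down your mutations by hand, you can check the comparison step with a few lines of Python. The sketch below ranks candidate variants by how much their mutation sets overlap with yours (Jaccard similarity). The function name `rank_variants`, the variant names, and every mutation in it are placeholders, not real variant definitions; substitute the lists from *Variants_Spike_mutations* and your own observations.

```python
def rank_variants(observed, variant_table):
    """Rank variants by Jaccard similarity between mutation sets.

    A score of 1.0 means the two sets are identical; 0.0 means no overlap.
    """
    scores = []
    for name, defining in variant_table.items():
        shared = observed & defining
        union = observed | defining
        score = len(shared) / len(union) if union else 0.0
        scores.append((score, name, sorted(shared)))
    return sorted(scores, reverse=True)

# Placeholder data only: these are NOT real variant definitions.
variant_table = {
    "variant_1": {"A10V", "L20R", "E30K"},
    "variant_2": {"A10V", "G40S"},
}
observed = {"A10V", "L20R", "E30K", "K50N"}

ranked = rank_variants(observed, variant_table)
for score, name, shared in ranked:
    print("%s: %.2f, shared %s" % (name, score, shared))
```

A genome rarely matches a published list exactly, so treat the top score as a suggestion and confirm it against the alignment yourself.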