
# Getting the SARS-CoV-2 Genome, Wuhan isolate; the spike protein (2)

We will use the database available online from the National Center for Biotechnology Information (NCBI), which is part of the National Library of Medicine (NLM), which in turn is part of the National Institutes of Health (NIH).  This database includes the sequences of genes from many different organisms (not just humans), protein sequences, and some entire genomes. Much of this work was supported financially by the U.S. government, hence by taxpayers, and the results are freely available.

Soon we will use Python to get the SARS-CoV-2 genome sequence, but first we need an identifier for it, its RefSeq accession number. Go to the [NCBI home page](https://www.ncbi.nlm.nih.gov). Under Popular Resources (right-hand column) you will see some items that might interest you, such as PubMed (access to published papers) and Bookshelf (textbooks).  Now visit the [SARS-CoV-2 Data Hub](https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=taxid:2697049).

The top entry on the list is the Wuhan isolate of the virus (Wuhan-Hu-1).  This was the first published SARS-CoV-2 sequence; it was made public in January 2020 by a group led by Yong-Zhen Zhang at the Shanghai Public Health Clinical Center and Fudan University.  The published genomic sequence was used to design the mRNA vaccines, as we will see in the next notebook.  Copy down the RefSeq accession number.



Before getting the SARS-CoV-2 sequence, the notebook needs to be prepared.  We will use a package called *Biopython*, which adds functions important for us to the basic Python distribution, and one called *ReportLab* for drawing the genome.  Run the following code cell to install and import both packages.

In [1]:
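# Install the two third-party packages used in this notebook
!pip install biopython
!pip install ReportLab

# Biopython: fetch records from NCBI and parse GenBank files
from Bio import Entrez
from Bio import SeqIO

# ReportLab and GenomeDiagram: draw the genome map
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram

# Colab helper that sends a file to your computer's Downloads folder
from google.colab import files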

Collecting biopython
  Downloading biopython-1.81-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: biopython
Successfully installed biopython-1.81
Collecting ReportLab
  Downloading reportlab-4.0.6-py3-none-any.whl (1.9 MB)
Installing collected packages: ReportLab
Successfully installed ReportLab-4.0.6


The following code cell will download the SARS-CoV-2 sequence, plus associated data that we want.  Look at the end of line three, where you will see the *id* argument.  Click on the code cell, enter the RefSeq identifier for the Wuhan isolate between the quotation marks (in this copy it is already filled in), and execute the cell.

In [3]:
Entrez.email = 'A.N.Other@example.com'
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="NC_045512"
) as handle:
    seq_record = SeqIO.read(handle, "genbank")
print("%s with %i features" % (seq_record.id, len(seq_record.features)))

NC_045512.2 with 57 features


Run the next cell to display part of the sequence, beginning at the 5' end:

In [4]:
str(seq_record.seq)

'ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAGATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAACTGCACCTCATGGTCATGTTATGGTTGAGCTGGTAGCAGAACTCGAAGGCATTCAGTACGGTCGTAGTGGTGAGACACTTGGTGTCCTTGTCCCTCATGTGGGCGAAATACCAGTGGCTTACCGCAAGGTTCTTCTTCGTAAGAACGGTAATAAAGGAGCTGGTGGCCATAGTTACGGCGCCGATCTAAAGTCATTTGACTTAGGCGACGAGCTTGGCACTGATCCTTATGAAGATTTTCAAGAAAACTGGAACACTAAACATAGCAGTGGTGTTACCCGTGAACTCATGCGTGAGCTTAACGGAGGGGCATACACTCGCTATGTCGATAACAACTTCTGTGGCCCTGATGGCTACCCTCTTGAGTGCATTAAAGACCTTCTAGCACGTGCTGGTAAAGCTTCATGCACTTTGTCCGAACAACTGGACTTTATTGACACTAAGAGGGGTGTATACTGCTGCCGTGAACATGAGCATGAAATTGCTTGGTACACGGAACGTTC

The genome of SARS-CoV-2 is a + strand RNA genome, 29903 nucleotides in length.
*+ strand* means that the single RNA strand itself contains the coding sequences; it is sometimes called the *sense* strand (its complement would be the - strand, also called the *antisense* or *template* strand).  So you can apply the universal genetic code table to the sequence directly.  Even though it is an RNA genome, the database presents it as DNA, with T's rather than U's.
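
Because the + strand carries the coding sequence, a stretch of it can be read codon by codon with the standard genetic code. The following is a minimal sketch in plain Python, not part of the original notebook: the names `CODON_TABLE`, `translate`, and `orf1ab_start` are chosen here for illustration, and the 39-nucleotide string is copied from the sequence displayed above, beginning at the ATG that starts the orf1ab coding sequence.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G order for each
# codon position ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA-alphabet coding sequence, ignoring any trailing
    one or two nucleotides that do not complete a codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# First 39 nucleotides of the orf1ab coding sequence (assumption: copied
# by hand from the displayed genome sequence).
orf1ab_start = "ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACAC"

# The database shows T's; the actual viral RNA has U's in their place.
rna = orf1ab_start.replace("T", "U")
print(rna)                      # AUGGAGAGCCUUGUCCCUGGUUUCAACGAGAAAACACAC

# The + strand can be translated directly, with no complementing step.
print(translate(orf1ab_start))  # MESLVPGFNEKTH
```

In the notebook itself, Biopython's `Seq` objects provide `transcribe()` and `translate()` methods, which should give the same result when applied to a slice of `seq_record.seq`.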

In addition to the sequence, the record you downloaded also contains many features.  Run the next cell:

In [5]:
seq_record.features

[SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(29903), strand=1), type='source', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(0), ExactPosition(265), strand=1), type="5'UTR"),
 SeqFeature(SimpleLocation(ExactPosition(265), ExactPosition(21555), strand=1), type='gene', qualifiers=...),
 SeqFeature(CompoundLocation([SimpleLocation(ExactPosition(265), ExactPosition(13468), strand=1), SimpleLocation(ExactPosition(13467), ExactPosition(21555), strand=1)], 'join'), type='CDS', location_operator='join', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(265), ExactPosition(805), strand=1), type='mat_peptide', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(805), ExactPosition(2719), strand=1), type='mat_peptide', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(2719), ExactPosition(8554), strand=1), type='mat_peptide', qualifiers=...),
 SeqFeature(SimpleLocation(ExactPosition(8554), ExactPosition(10054), strand=1), type='mat_peptide', qualifi

Each row has information about one feature, including its starting and ending positions, counted in nucleotides from the 5' end of the genome (Biopython counts from 0, and the ending position itself is not included), and also the feature type:


**5' UTR** - 5' untranslated region

**mat_peptide** - coding sequence for a mature peptide

**gene** - in general could include introns, promoters, etc.

**CDS** - coding sequence.  In this virus the coding sequences and genes are the same.

**stem_loop** - an RNA secondary structure

**3' UTR** - 3' untranslated region
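
The start and end positions in the listing follow Python slicing conventions: counting starts at 0 and the end position is excluded. The sketch below is an illustration, not part of the original notebook. It uses a few (type, start, end) triples copied from the output above, and the helper names `feature_length` and `to_genbank_range` are hypothetical.

```python
# (type, start, end) triples copied by hand from the feature listing above
features = [
    ("source", 0, 29903),
    ("5'UTR", 0, 265),
    ("gene", 265, 21555),
    ("mat_peptide", 265, 805),
]


def feature_length(start, end):
    """Length in nucleotides; the end position is exclusive."""
    return end - start


def to_genbank_range(start, end):
    """Convert 0-based, end-exclusive positions to 1-based, inclusive
    notation of the form start..end."""
    return f"{start + 1}..{end}"


for ftype, start, end in features:
    span = to_genbank_range(start, end)
    print(f"{ftype:12s} {span:>12s} {feature_length(start, end):6d} nt")
```

For a real Biopython feature, `int(feature.location.start)` and `int(feature.location.end)` should supply the same two numbers, and `feature.type` the label.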



In class we studied examples in which each gene has its own promoter.  Hence each gene gives rise to one transcript (or a family related by different splicing patterns) and one protein (or related splice variants).  In this virus the situation is different:  an open reading frame can be translated into one long polyprotein, which is then cut into mature proteins by protease enzymes.  The longest open reading frame is *orf1ab*.  Open [this](https://www.ncbi.nlm.nih.gov/projects/sviewer/?id=NC_045512&tracks=[key:sequence_track,name:Sequence,display_name:Sequence,id:STD649220238,annots:Sequence,ShowLabel:false,ColorGaps:false,shown:true,order:1][key:gene_model_track,name:Genes,display_name:Genes,id:STD3194982005,annots:Unnamed,Options:ShowAllButGenes,CDSProductFeats:true,NtRuler:true,AaRuler:true,HighlightMode:2,ShowLabel:true,shown:true,order:9]&v=1:29903&c=null&select=null&slim=0) display from the NCBI website, and you will see orf1ab at the top.
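
One way to see that the mature peptides are cut from a single translation product is to check that each one starts exactly where the previous one ends. The sketch below is illustrative only: it uses the first four `mat_peptide` coordinates from the feature listing shown earlier, and the variable names are chosen here.

```python
# (start, end) of the first four mat_peptide features, copied by hand
# from the feature listing (0-based, end-exclusive positions)
mat_peptides = [(265, 805), (805, 2719), (2719, 8554), (8554, 10054)]

# Adjacent peptides share a boundary: the end of one is the start of the next
contiguous = all(
    previous[1] == following[0]
    for previous, following in zip(mat_peptides, mat_peptides[1:])
)
print("contiguous:", contiguous)  # contiguous: True

# Three nucleotides encode one amino acid
lengths_aa = [(end - start) // 3 for start, end in mat_peptides]
print(lengths_aa)                 # [180, 638, 1945, 500]
```

Every span is a whole number of codons, as expected for pieces cut from one continuous reading frame.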

The viral genome encodes over two dozen proteins, including the proteases needed to process the proteins, and an RNA polymerase to copy the RNA genome.
Some of the SARS-CoV-2 proteins have unknown functions.  For a description of some of the genes and their functions, click [here](https://www.nytimes.com/interactive/2020/04/03/science/coronavirus-genome-bad-news-wrapped-in-protein.html?searchResultPosition=9).

We will look at the gene for the spike protein, which is at index 34 in the feature list (Python counts from 0, so it is the 35th entry).  Run the following code cell:


In [6]:
print(seq_record.features[34])

type: CDS
location: [21562:25384](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GeneID:43740568']
    Key: gene, Value: ['S']
    Key: gene_synonym, Value: ['spike glycoprotein']
    Key: locus_tag, Value: ['GU280_gp02']
    Key: note, Value: ['structural protein; spike protein']
    Key: product, Value: ['surface glycoprotein']
    Key: protein_id, Value: ['YP_009724390.1']
    Key: translation, Value: ['MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL
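
A hard-coded index such as 34 depends on the exact order of features in the record. As a hedged alternative, the spike CDS could be located by its `gene` qualifier instead. The function name `find_cds_by_gene` is hypothetical, and the demo below uses small stand-in objects rather than real Biopython features; it assumes only that each feature has a `.type` string and a `.qualifiers` dictionary whose values are lists, as in the printout above.

```python
from types import SimpleNamespace


def find_cds_by_gene(features, gene_name):
    """Return (index, feature) for the first CDS whose 'gene' qualifier
    contains gene_name, or None if there is no such feature."""
    for index, feature in enumerate(features):
        if feature.type == "CDS" and gene_name in feature.qualifiers.get("gene", []):
            return index, feature
    return None


# Stand-in features for demonstration only (not the real record)
demo_features = [
    SimpleNamespace(type="source", qualifiers={}),
    SimpleNamespace(type="gene", qualifiers={"gene": ["S"]}),
    SimpleNamespace(
        type="CDS",
        qualifiers={"gene": ["S"], "product": ["surface glycoprotein"]},
    ),
]

index, feature = find_cds_by_gene(demo_features, "S")
print(index, feature.qualifiers["product"][0])  # 2 surface glycoprotein
```

In the notebook, calling `find_cds_by_gene(seq_record.features, "S")` should return index 34 for this record, if the assumptions above hold.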

A spike protein binds with two others to form a trimer, which projects out from the surface of the virus.  When you see a picture or drawing of the virus that resembles a medieval mace, those are the spike protein trimers sticking out.  In electron micrographs the spike proteins form a hazy layer, or corona, surrounding the spherical virus.  The trimer can bind with a protein, **ACE2**, on the surface of lung cells.  This binding is specific for **ACE2**, which is why SARS-CoV-2 infects cells that display ACE2, such as those lining the airways and lungs, and not cells that lack it.  The gene is 3822 nucleotides in length.
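
The gene length quoted above follows directly from the location `[21562:25384]` printed for the feature. As a quick check (a sketch with variable names chosen here, not code from the original notebook):

```python
# Spike CDS location from the feature printout (0-based, end-exclusive)
start, end = 21562, 25384

length_nt = end - start
codons, leftover = divmod(length_nt, 3)

print(length_nt)  # 3822
print(codons)     # 1274 codons, the last of which is the stop codon
print(leftover)   # 0, so the CDS is a whole number of codons
```

That leaves 1273 amino acids for the spike protein itself, which should match the length of the `translation` qualifier.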

You can see that the spike protein is called a *glycoprotein*.  Proteins that are extracellular, or outside of cells, are usually coated with glycans (carbohydrates).  The glycans are thought to protect the proteins from being digested by extracellular protease enzymes.  (Recall from earlier in the semester that the A and B blood type antigens are different carbohydrates attached to the H substance.)

Run the following code cell.  It will make a simple map of the spike protein gene on the SARS-CoV-2 genome and save it as a PDF, which will appear in your Downloads folder.  Click on the file to see the map.

In [7]:
# Build a diagram with one track that holds a single feature: the spike CDS
gd_diagram = GenomeDiagram.Diagram("SARS CoV-2 genome")
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()
gd_feature_set.add_feature(seq_record.features[34], color=colors.lightblue, label=True)

# Lay out the whole genome as one linear fragment
gd_diagram.draw(
    format="linear",
    orientation="landscape",
    pagesize="A4",
    fragments=1,
    start=0,
    end=len(seq_record),
)

# Write the drawing to a PDF, then send it to your Downloads folder
gd_diagram.write("SARS-CoV-2.pdf", "PDF")
files.download("SARS-CoV-2.pdf")


The light blue rectangle shows the location of the spike gene.  It lies in the final third of the genome, toward the 3' end, consistent with its location of 21562 to 25384 out of 29903 nucleotides.
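
To put the spike gene's location in proportion to the whole genome, its positions can be expressed as fractions of the 29903-nucleotide length. A small sketch, with variable names chosen here for illustration:

```python
genome_length = 29903
start, end = 21562, 25384  # spike CDS location from the feature printout

start_fraction = start / genome_length
end_fraction = end / genome_length

# Roughly 72% and 85% of the way from the 5' end to the 3' end
print(f"start: {start_fraction:.1%} of the way along the genome")
print(f"end:   {end_fraction:.1%} of the way along the genome")
```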

You can see a structural model of the spike trimer at the [PDB](https://www.rcsb.org/structure/6ZB5).  The PDB (Protein Data Bank) is a repository for protein structures determined experimentally, for example by X-ray crystallography or cryo-electron microscopy.  (More recently, a computer program called AlphaFold has been predicting structures successfully.)  In the lower left of the page, click on *3D View: Structure*.  The protein you see comprises three copies of the S protein, each in a different color.  If you have studied a little biochemistry, you will recognize two types of secondary structure, the alpha helix and the beta sheet.  The blue boxes dangling off the protein are the glycans.  You can rotate the structure by dragging it with the mouse, and see that it actually does have the shape of a spike.