# Lexi's Bioinformatics Workbook

My experiments using Biopython to transcribe and translate the SARS-CoV-2 genome.

In [6]:
from Bio import SeqIO
covid_seq_record = SeqIO.read("dataset/genome.fna", "fasta")
print(covid_seq_record)

ID: MN908947.3
Name: MN908947.3
Description: MN908947.3 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome
Number of features: 0
Seq('ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGT...AAA', SingleLetterAlphabet())


Translation reads the sequence three characters (one codon) at a time, so the length must be a multiple of 3. We remove the last 2 characters from the virus DNA sequence to get there, then check the result with the modulus operator.

In [7]:
covid_seq_record.seq = covid_seq_record.seq[:-2]
covid_seq = covid_seq_record.seq
assert len(covid_seq) % 3 == 0
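
The trimming step can be sketched in a general form that works for any remainder, not only 2. The helper name `trim_to_codon_multiple` is hypothetical (it is not part of the notebook or of Biopython), and it relies only on `len` and slicing, so it should behave the same on a plain string or a `Seq`.

```python
def trim_to_codon_multiple(seq):
    """Drop trailing characters so that len(seq) is a multiple of 3."""
    remainder = len(seq) % 3
    # Slice from the front so that a remainder of 0 keeps the whole sequence.
    return seq[:len(seq) - remainder]


trimmed = trim_to_codon_multiple("ATTAAAGG")  # 8 bases, remainder 2
print(trimmed)  # ATTAAA
```

Slicing with `len(seq) - remainder` rather than `seq[:-remainder]` avoids the case where a remainder of 0 would return an empty sequence.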

In [12]:
print(f"DNA: {covid_seq[:100]}...{covid_seq[-3:]}")

DNA: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTC...AAA


Next we transcribe the virus DNA into mRNA. Transcription replaces every thymine (T) in the DNA with a uracil (U) in the mRNA.

In [13]:
mRNA = covid_seq.transcribe()
print("mRNA: " + mRNA[:99] + "..." + mRNA[-3:])

mRNA: AUUAAAGGUUUAUACCUUCCCAGGUAACAAACCAACCAACUUUCGAUCUCUUGUAGAUCUGUUCUCUAAACGAACUUUAAAAUCUGUGUGGCUGUCACU...AAA
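
As a rough illustration of what the transcription step does to the letters, here is a plain-string sketch. The function name `transcribe_dna` is hypothetical, and it assumes the input is the coding strand written 5' to 3', as in the FASTA file above; it is not Biopython's implementation.

```python
def transcribe_dna(dna):
    """Return mRNA for a coding-strand DNA string by replacing T with U."""
    return dna.upper().replace("T", "U")


# First 16 bases of the genome shown above
print(transcribe_dna("ATTAAAGGTTTATACC"))  # AUUAAAGGUUUAUACC
```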


Next we translate the virus mRNA into an amino acid sequence. Translation reads the mRNA three characters (one codon) at a time and outputs one amino acid per codon.

In [14]:
aa_seq = mRNA.translate()
print("Amino Acid Sequence: " + aa_seq[:33])

Amino Acid Sequence: IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVT
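
To make the codon lookup concrete, here is a minimal pure-Python sketch of translation. It assumes the standard genetic code (NCBI translation table 1) and an mRNA string containing only U, C, A, and G; the names `CODON_TABLE` and `translate_mrna` are hypothetical and this is not how Biopython implements `translate()`.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_mrna(mrna):
    """Translate an mRNA string codon by codon; '*' marks a stop codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore any incomplete trailing codon
    codons = (mrna[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# First 27 bases (9 codons) of the mRNA printed above
print(translate_mrna("AUUAAAGGUUUAUACCUUCCCAGGUAA"))  # IKGLYLPR*
```

The sketch raises a `KeyError` on any character outside U, C, A, G (for example an ambiguous `N`), which a real pipeline would need to handle.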


The * symbol marks a stop codon: UAA, UAG, or UGA.
Splitting the amino acid sequence on * gives a list of candidate protein chains, one for each stretch between stop codons.

In [15]:
proteins = aa_seq.split("*")
print(proteins[:10])

[Seq('IKGLYLPR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('QTNQLSISCRSVL', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('TNFKICVAVTRLHA', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('CTHAV', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('LITNYCR', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('QDTSNSSIFCRLLTVSSVLQPIISTSRFRPGVTER', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('DGEPCPWFQRENTRPTQFACFTGSRRARTWLWRLRGGGLIRGTSTS', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('RWHLWLSRS', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('KRRFAST', HasStopCodon(ExtendedIUPACProtein(), '*')), Seq('TALCVHQTFGCSNCTSWSCYG', HasStopCodon(ExtendedIUPACProtein(), '*'))]
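
The split can be illustrated on a plain string, using the 33-character amino acid prefix printed earlier. This is only a sketch with ordinary `str.split`; the variable names are hypothetical, and the last fragment is shorter than in the real output because the input string was cut off at 33 characters.

```python
aa_text = "IKGLYLPR*QTNQLSISCRSVL*TNFKICVAVT"

# Each '*' ends one fragment and starts the next.
fragments = aa_text.split("*")
print(fragments)  # ['IKGLYLPR', 'QTNQLSISCRSVL', 'TNFKICVAVT']

# Two stop codons in a row yield an empty fragment, which can be dropped.
non_empty = [fragment for fragment in "MK**LV".split("*") if fragment]
print(non_empty)  # ['MK', 'LV']
```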
