# Unknown Sequence Identification with Biopython
This notebook uses the online BLAST service (through Biopython) to identify an unknown nucleotide sequence (assumed to be a complete genome) by aligning it against the NCBI database.

#### Real life
In real-world scenarios, a sequence rarely matches a database entry completely (especially for viruses), so these services align the query and return the top-scoring sequences that share matching regions.

#### References
* https://blast.ncbi.nlm.nih.gov/Blast.cgi
* https://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html
* https://github.com/chris-rands/biopython-coronavirus/blob/master/biopython-coronavirus-notebook.ipynb
* https://medium.com/analytics-vidhya/coronavirus-covid-19-genome-analysis-using-biopython-8b8cb1f4a041
* https://github.com/peterjc/biopython_workshop
* https://www.youtube.com/watch?v=QIZ8QH6JcC8
* https://www.youtube.com/watch?v=8A-msg23u0w&list=PLQaBWYcv0RlZnZYwyAQUf8BdnlTTuHKPb

In [1]:
import os
import sys

from urllib.request import urlretrieve

import Bio
from Bio import SeqIO, SearchIO, Entrez
from Bio.Seq import Seq
from Bio.SeqUtils import GC
from Bio.Blast import NCBIWWW
from Bio.Data import CodonTable

print("Python version:", sys.version_info)
print("Biopython version:", Bio.__version__)
input_file = "unknown-sequence.fa"

Python version: sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
Biopython version: 1.76


#### Load Sequence
Here we load the sequence and print its content and size. We assume that the genome has been sequenced completely.

This sequence is around 30 kb (about 30,000 nucleotides), a genome size that points towards a virus.

##### GC Content
GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C).
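
The definition above can be checked by hand on a short string. The following is a minimal sketch, not the Biopython implementation: `gc_percent` is a hypothetical helper name, and it deliberately ignores IUPAC ambiguity codes, which library helpers may treat differently.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a nucleotide string."""
    sequence = sequence.upper()
    if not sequence:
        # Avoid dividing by zero for an empty sequence
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return 100.0 * gc_count / len(sequence)


# Toy input: 4 of the 6 bases are G or C
print(gc_percent("ATGCGC"))
```

For the real genome, the notebook relies on `GC` from `Bio.SeqUtils`, as in the next cell.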

In [2]:
record = SeqIO.read(input_file, "fasta")
print('Sequence size:', len(record.seq))
# Just print the first 200 nucleotides
print('Sequence:', str(record.seq)[0:200])

if len(record.seq) < 150000:
    print('Its a Virus because the complete sequence is small')
    
# Print the GC content
# The GC content is about 38%, so the sequence is somewhat AT-rich, but within a 'normal' range.
print("GC content (%)", GC(record.seq))

Sequence size: 29903
Sequence: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGT
Its a Virus because the complete sequence is small
GC content (%) 37.97277865097148


#### Compare to other genome sequences
We will use BLAST to align the unknown sequence to other annotated sequences in the NCBI nt database, which contains sequences from many different species from across the tree of life.
This may take ~10 minutes since we are doing an online search against many sequences (for larger queries, it would be sensible to run BLAST locally instead; see Bio.Blast.Applications).
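
Because the online search is slow, it can help to keep the raw XML result on disk and reuse it when the notebook is re-run. The sketch below shows one possible way to do this; `cached_blast_xml` and the file name `unknown-sequence-blast.xml` are hypothetical names chosen for illustration, and the search itself is passed in as a function so the helper does not depend on the network.

```python
from pathlib import Path


def cached_blast_xml(cache_path, run_search):
    """Return BLAST XML text, calling run_search() only if no cached copy exists."""
    path = Path(cache_path)
    if path.exists():
        # Reuse the result saved by an earlier run
        return path.read_text()
    xml_text = run_search()
    path.write_text(xml_text)
    return xml_text


# Intended use in this notebook (needs network access, so it is left commented out):
# from Bio import SearchIO
# from Bio.Blast import NCBIWWW
# cached_blast_xml(
#     "unknown-sequence-blast.xml",
#     lambda: NCBIWWW.qblast("blastn", "nt", record.seq).read(),
# )
# blast_qresult = SearchIO.read("unknown-sequence-blast.xml", "blast-xml")
```

The cells below call `NCBIWWW.qblast` directly; the cached variant is only a convenience for repeated runs.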

In [3]:
# Use BlastN (nucleotide-nucleotide) to look for matching sequences
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

In [None]:
blast_qresult = SearchIO.read(result_handle, "blast-xml")

In [12]:
print(blast_qresult)

Program: blastn (2.10.0+)
  Query: No (29903)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|1798174254|ref|NC_045512.2|  Severe acute respirator...
            1      1  gi|1819735426|gb|MT121215.1|  Severe acute respiratory ...
            2      1  gi|1805293633|gb|MT019531.1|  Severe acute respiratory ...
            3      1  gi|1820247323|dbj|LC529905.1|  Severe acute respiratory...
            4      1  gi|1818244594|gb|MT135041.1|  Severe acute respiratory ...
            5      1  gi|1805293611|gb|MT019529.1|  Severe acute respiratory ...
            6      1  gi|1802633808|gb|MN996528.1|  Severe acute respiratory ...
            7      1  gi|1818244616|gb|MT135043.1|  Severe acute respiratory ...
            8      1  gi|1808633715|gb|MT049951.1|  Severe acute res

#### It's SARS-CoV-2 (COVID-19)

In [13]:
[hit.description for hit in blast_qresult[:5]]

['Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome',
 'Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/CHN/SH01/2020, complete genome',
 'Severe acute respiratory syndrome coronavirus 2 isolate BetaCoV/Wuhan/IPBCAMS-WH-03/2019, complete genome',
 'Severe acute respiratory syndrome coronavirus 2 TKYE6182_2020 RNA, complete genome',
 'Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/CHN/105/2020, complete genome']
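
The descriptions alone do not show how closely each hit matches the query. One common measure is percent identity, the share of alignment columns in which query and hit carry the same base. The sketch below is a rough illustration: `percent_identity` is a hypothetical helper, the numbers in the example call are made up, and the commented loop assumes the `ident_num`, `aln_span` and `evalue` attributes that `SearchIO` exposes on HSPs parsed from BLAST XML.

```python
def percent_identity(ident_num, aln_span):
    """Percentage of alignment columns where query and hit are identical."""
    if aln_span <= 0:
        raise ValueError("aln_span must be positive")
    return 100.0 * ident_num / aln_span


# Made-up toy numbers: 95 identical columns in an alignment of 100 columns
print(percent_identity(95, 100))

# Possible use with the result parsed above (left commented out, since it
# needs the blast_qresult object from the earlier cells):
# for hit in blast_qresult[:5]:
#     hsp = hit.hsps[0]
#     print(hit.id, hsp.evalue, percent_identity(hsp.ident_num, hsp.aln_span))
```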

#### What's next?
Logical next steps at the genome level might include building a multiple sequence alignment from many coronavirus genomes (check out the Biopython wrappers/parsers for Clustal and Mafft and Bio.Align/Bio.pairwise2 plus Bio.AlignIO), building a phylogeny with an external tool like iq-tree and then viewing the tree with Bio.Phylo, the ete3 toolkit, or Jalview.
Other protein-level analyses could involve building protein trees, annotating the proteins and visualisation (e.g. Bio.Graphics), doing evolutionary rate analyses (e.g. Bio.Phylo.PAML), looking at protein structure, clustering and much more.
These kinds of analyses can provide useful biological and epidemiological information, which is very important for this coronavirus in an outbreak situation. For example, they allow tracking of how the outbreak spreads and can indicate appropriate infection control measures, although caution in the interpretation of results is always required. See Nextstrain for an excellent resource of this kind.
Note that there is a lot of other functionality in Biopython and this is just a very small fraction of the modules; see Peter Cock's Biopython workshop and the extensive official tutorial documentation.