# Unknown Sequence Identification with BioPython

#### References
* https://blast.ncbi.nlm.nih.gov/Blast.cgi
* https://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html
* https://github.com/chris-rands/biopython-coronavirus/blob/master/biopython-coronavirus-notebook.ipynb
* https://medium.com/analytics-vidhya/coronavirus-covid-19-genome-analysis-using-biopython-8b8cb1f4a041
* https://github.com/peterjc/biopython_workshop

In [1]:
import os
import sys

from urllib.request import urlretrieve

import Bio
from Bio import SeqIO, SearchIO, Entrez
from Bio.Seq import Seq
from Bio.SeqUtils import GC
from Bio.Blast import NCBIWWW
from Bio.Data import CodonTable

print("Python version:", sys.version_info)
print("Biopython version:", Bio.__version__)
input_file = "unknown-sequence.fa"

Python version: sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
Biopython version: 1.76


#### Load Sequence
Here we load the sequence and print its content and size. We assume that the genome has been sequenced completely.

##### GC Content
GC-content (or guanine-cytosine content) is the percentage of nitrogenous bases in a DNA or RNA molecule that are either guanine (G) or cytosine (C).
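
To make the definition concrete, here is a minimal sketch that computes the same percentage by hand. The function name `gc_percent` is hypothetical and not part of Biopython. It counts only the plain letters G and C and ignores IUPAC ambiguity codes, so it may differ slightly from a library implementation on sequences that contain them.

```python
def gc_percent(sequence):
    """Return the percentage of G and C bases in a nucleotide string.

    Assumption: only unambiguous G/C letters are counted; ambiguity
    codes such as S or N are treated as non-GC.
    """
    seq = str(sequence).upper()
    if not seq:
        # Avoid dividing by zero for an empty sequence
        return 0.0
    gc_count = seq.count("G") + seq.count("C")
    return 100.0 * gc_count / len(seq)


print(gc_percent("ATGC"))
```

Because the function converts its argument with `str()`, it should also accept a Biopython `Seq` object such as `record.seq`.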

In [2]:
record = SeqIO.read(input_file, "fasta")
print('Sequence size:', len(record.seq))
# Just print the first 200 nucleotides
print('Sequence:', str(record.seq)[0:200])

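# Rough heuristic only: a complete genome this short is most likely viral,
# but size alone does not prove it.
if len(record.seq) < 150000:
    print('Its a Virus because the complete sequence is small')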
    
# Print the GC content
# GC() returns a percentage. At about 38%, the sequence is somewhat AT-rich, but within a 'normal' range.
print("GC content (%)", GC(record.seq))

Sequence size: 29903
Sequence: ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGT
Its a Virus because the complete sequence is small
GC content (%) 37.97277865097148


#### Compare to other genome sequences
We will use BLAST to search for this particular sequence.
BLAST aligns the unknown sequence to other annotated sequences in the NCBI nt database, which contains sequences from many different species from across the tree of life.
This may take ~10 minutes since we are doing an online search against many sequences (for larger queries, it would be sensible to run BLAST locally instead; see Bio.Blast.Applications).

In [None]:
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
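
Once the search returns, the result still has to be read. The sketch below is one possible way to list the top hits, assuming the result is in the classic BLAST XML layout (`Hit`, `Hit_hsps`, `Hsp` elements), which `qblast` produces with its default `format_type="XML"`. It uses only the standard library instead of Biopython's own parsers (`Bio.SearchIO` with the `"blast-xml"` format is the Biopython route). The function name `top_blast_hits` and the `limit` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def top_blast_hits(xml_text, limit=5):
    """Return (accession, description, e-value, percent identity) tuples.

    Assumption: xml_text is classic BLAST XML. Only the first listed
    HSP of each hit is used to compute the percent identity.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP listed for this hit
        if hsp is None:
            continue
        identity = int(hsp.findtext("Hsp_identity"))
        align_len = int(hsp.findtext("Hsp_align-len"))
        hits.append((
            hit.findtext("Hit_accession"),
            hit.findtext("Hit_def"),
            float(hsp.findtext("Hsp_evalue")),
            100.0 * identity / align_len,
        ))
        if len(hits) == limit:
            break
    return hits
```

A possible usage, given the `result_handle` from the cell above, is `xml_text = result_handle.read()` followed by `top_blast_hits(xml_text)`. The handle can be read only once, so it is worth writing `xml_text` to a file before parsing to avoid repeating the slow online search.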