# BLAST in Biopython 
BLAST stands for **Basic Local Alignment Search Tool**. A BLAST search lets a researcher compare a protein or nucleotide sequence (called the query) with a library or database of sequences.

For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

Biopython provides the **Bio.Blast** module for working with NCBI BLAST. You can run BLAST either locally on your own machine or over the Internet.

### Running queries against the online version of BLAST

We use the function qblast() in the Bio.Blast.NCBIWWW module to call the online version of BLAST. It has three required arguments:

- The first argument is the BLAST program to use for the search, as a lower-case string. Currently qblast only works with **blastn**, **blastp**, **blastx**, **tblastn** and **tblastx**.
- The second argument specifies the database to search against.
- The third argument is a string containing your query sequence. This can be the sequence itself, the sequence in FASTA format, or an identifier such as a GI number.

API: https://ncbi.github.io/blast-cloud/dev/api.html
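
A small wrapper can catch a mistyped program name before any request is sent. The sketch below is only an illustration: `run_qblast` and `SUPPORTED_PROGRAMS` are hypothetical names that are not part of Biopython, and the `expect` and `hitlist_size` keywords in the commented-out call come from the qblast signature. It assumes Biopython is installed and, when called with a valid program, that NCBI is reachable.

```python
# Programs supported by qblast.
SUPPORTED_PROGRAMS = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}


def run_qblast(program, database, sequence, **options):
    """Validate the program name, then forward the call to NCBIWWW.qblast.

    `run_qblast` is a hypothetical helper for illustration only.
    """
    if program not in SUPPORTED_PROGRAMS:
        raise ValueError(
            f"Unsupported program {program!r}; expected one of "
            f"{sorted(SUPPORTED_PROGRAMS)}"
        )
    # Imported lazily so an invalid name fails fast, before Biopython is
    # loaded or any request is made.
    from Bio.Blast import NCBIWWW

    return NCBIWWW.qblast(program, database, sequence, **options)


# Example call (needs Internet access, so it is left commented out):
# handle = run_qblast("blastn", "nt", "ggtaagtcctctagtacaaacacc",
#                     expect=1e-5, hitlist_size=10)
```
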

In [15]:
from Bio.Blast import NCBIWWW
help(NCBIWWW.qblast)

Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(program, database, sequence, url_base='https://blast.ncbi.nlm.nih.gov/Blast.cgi', auto_format=None, composition_based_statistics=None, db_genetic_code=None, endpoints=None, entrez_query='(none)', expect=10.0, filter=None, gapcosts=None, genetic_code=None, hitlist_size=50, i_thresh=None, layout=None, lcase_mask=None, matrix_name=None, nucl_penalty=None, nucl_reward=None, other_advanced=None, perc_ident=None, phi_pattern=None, query_file=None, query_believe_defline=None, query_from=None, query_to=None, searchsp_eff=None, service=None, threshold=None, ungapped_alignment=None, word_size=None, short_query=None, alignments=500, alignment_view=None, descriptions=500, entrez_links_new_window=None, expect_low=None, expect_high=None, format_entrez_query=None, format_object=None, format_type='XML', ncbi_gi=None, results_file=None, show_overview=None, megablast=None, template_type=None, template_length=None)
    BLAST search using NCBI's

In [24]:
result_handle = NCBIWWW.qblast("blastn", "nt", """ggtaagtcctctagtacaaacacccccaatattgtgatataattaaa
attatattcatattctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc""")

In [25]:
# qblast returns XML by default (format_type='XML'); NCBIXML parses it into record objects
from Bio.Blast import NCBIXML

blast_records = NCBIXML.parse(result_handle)

You can only iterate through the BLAST records once, so store them (for example in a list) if you need them again. If your BLAST output is huge, though, keeping every record in a list may cause memory problems.

In [26]:
blast_records = list(blast_records)
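
If holding every record in a list is too costly, another option is to write the raw XML to disk first and parse it from the file whenever it is needed. The handle returned by qblast can itself be read only once. The sketch below shows the pattern with an `io.StringIO` object standing in for the real handle; the file name `my_blast.xml` and the XML snippet are placeholders.

```python
import io
import os
import tempfile

# Stand-in for the handle returned by NCBIWWW.qblast (placeholder content).
result_handle = io.StringIO("<BlastOutput>...</BlastOutput>")

# Write the raw XML to disk; reading the handle consumes it.
xml_path = os.path.join(tempfile.mkdtemp(), "my_blast.xml")
with open(xml_path, "w") as out_file:
    out_file.write(result_handle.read())
result_handle.close()

# The saved file can be reopened as many times as needed.
with open(xml_path) as saved:
    first_pass = saved.read()
with open(xml_path) as saved:
    second_pass = saved.read()

print(first_pass == second_pass)
```

With a real result, each reopened file handle could be passed to NCBIXML.parse() and iterated record by record instead of building a list.
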

In [29]:
blast_records

[<Bio.Blast.Record.Blast at 0x7fc46b4e5880>]

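### What is in a BLAST record?
Each BLAST record holds a list of alignments (one per database hit), and each alignment holds one or more HSPs (high-scoring pairs), the individual aligned segments. Every HSP carries an E-value: the number of hits with a similar or better score that would be expected just by chance in a database of that size. The lower the E-value, the more significant the match.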
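
To make the definition concrete, the E-value can be approximated from a hit's bit score `S` as `E = m * n * 2**(-S)`, where `m` is the query length and `n` is the total database length. The numbers below are made up for illustration, and BLAST itself uses effective (edge-corrected) lengths, so the values it reports will differ somewhat.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2**(-S), with S the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical values: a 120-base query against a database of 1e9 bases.
for bits in (30, 50, 80):
    print(bits, expected_hits(bits, query_len=120, db_len=1_000_000_000))
```

Each additional bit of score halves the expected number of chance hits, which is why strong matches have E-values close to zero.
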

In [44]:
E_VALUE_THRESH = 1e-11  # keep only HSPs with an E-value below this cutoff
count = 0
for record in blast_records:
    for alignment in record.alignments:  # one alignment per database hit
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                count += 1
                print("****Alignment****")
                print("sequence:", alignment.title)
                print("length:", alignment.length)
                print(hsp.query[0:75] + "...")
                print(hsp.match[0:75] + "...")
                print(hsp.sbjct[0:75] + "...")
                print()
                
print(f"There are {count} HSPs below the E-value threshold in the BLAST output")

****Alignment****
sequence: gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA
length: 1173
ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTC...
|| ||||||||| |||| | |||| ||  |||| |||| | |||| ||| | |||| ||| ||| ||||| | ||...
ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGAGCAACTAGCCGCGGCCAATTTGATCAACTC...

****Alignment****
sequence: gi|1226796956|ref|XM_021992092.1| PREDICTED: Spinacia oleracea cold-regulated 413 plasma membrane protein 2-like (LOC110787470), mRNA
length: 672
AAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGAT...
|||||||| |||  |||| | || ||||| |||||||| || ||||| |||| ||| ||| ||||||||||||||...
AAAATGGGTAGACGAATGGATTATTTGGCGATGAAAACCGAGCAATTAGCCGCGGCCAATTTGATCGATTCCGAT...

****Alignment****
sequence: gi|731339628|ref|XM_010682658.1| PREDICTED: Beta vulgaris subsp. vulgaris cold-regulated 413 plasma membrane protein 2 (LOC104895996), mRNA
length
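
The same loop can collect results instead of printing them. The sketch below gathers each HSP that passes the threshold into a tuple and sorts the tuples by E-value. `summarize_hits` is a hypothetical helper, and it assumes the HSP objects expose `identities` and `align_length`, as Biopython's `Bio.Blast.Record.HSP` does. Mock objects with invented numbers stand in for real records so the block runs without a network connection.

```python
from types import SimpleNamespace


def summarize_hits(records, e_value_thresh=1e-11):
    """Return (title, E-value, percent identity) for each HSP under the cutoff."""
    rows = []
    for record in records:
        for alignment in record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < e_value_thresh:
                    pct_identity = 100.0 * hsp.identities / hsp.align_length
                    rows.append((alignment.title, hsp.expect, pct_identity))
    # Most significant (smallest E-value) first.
    return sorted(rows, key=lambda row: row[1])


# Mock record with made-up numbers, mimicking the attributes used above.
mock_records = [
    SimpleNamespace(alignments=[
        SimpleNamespace(title="hit A", hsps=[
            SimpleNamespace(expect=1e-20, identities=90, align_length=100)]),
        SimpleNamespace(title="hit B", hsps=[
            SimpleNamespace(expect=1e-30, identities=45, align_length=50),
            SimpleNamespace(expect=0.5, identities=10, align_length=20)]),
    ])
]

for title, e_value, pct in summarize_hits(mock_records):
    print(f"{title}\t{e_value:.1e}\t{pct:.1f}%")
```
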

Want to learn more? Chapter 7 of the Biopython tutorial covers BLAST in Biopython in more depth. Link: http://biopython.org/DIST/docs/tutorial/Tutorial.pdf