### Lab:4 - 03.Remote Blast

## Instructions:

### Preliminaries
Figure out what Blast does and how it works.
Go to the NCBI Blast web page (find it yourself...) and start a Blast comparison with CST3 against the so-called non-redundant protein database: nr. This means that your comparison should use the "blastp" subprogram (protein query against protein DB).
Find the highest scoring hit in mouse.
What is the E-value, and what does that mean?
Look at the actual alignment. How alike are the sequences?

### Programming
Write a Python program that conducts a Blast search of a given protein sequence against the nr database at NCBI. There is good support for this in BioPython.

### Requirements
Your program is an executable Python file taking one input: a file containing a protein sequence.
Output is the blast report given in XML format, presented to stdout.
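
A minimal sketch of a program that would meet these requirements, assuming Biopython is installed and still ships `Bio.Blast.NCBIWWW`. The names `read_query` and `main` are illustrative choices, not part of the assignment, and the search itself needs network access to NCBI.

```python
#!/usr/bin/env python3
"""remoteblast: BLAST a protein sequence file against nr, print the XML report."""
import sys


def read_query(path):
    """Return the contents of the query file (FASTA or raw sequence)."""
    with open(path) as handle:
        return handle.read().strip()


def main(argv):
    if len(argv) != 2:
        print(f"usage: {argv[0]} <protein-sequence-file>", file=sys.stderr)
        return 2
    # Imported here so the usage check above works even without Biopython.
    from Bio.Blast import NCBIWWW

    query = read_query(argv[1])
    # qblast submits the search to NCBI and returns a handle on the report.
    result_handle = NCBIWWW.qblast("blastp", "nr", query, format_type="XML")
    sys.stdout.write(result_handle.read())
    result_handle.close()
    return 0
```

To run it from the shell, end the file with the usual `if __name__ == "__main__": sys.exit(main(sys.argv))` guard, mark it executable, and call it as, for example, `./remoteblast.py cst3.fa > report.xml` (the script name is hypothetical).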

### To present:
You should be able to give a brief explanation of Blast.
What is the E-value, and what does that mean?
You should understand the online output from a Blast run.
Your code for the remoteblast program.
Demonstrate a successful run of remoteblast.

One of the main goals of genomics is to determine whether a particular sequence is similar to another sequence. This is done by comparing the new sequence with sequences that have already been reported and stored in the databases.
Global sequence alignment is a powerful tool to identify homologous sequences, whereas local sequence alignment is another powerful tool to identify conserved, functional sequence motifs.

If we have a new sequence of unknown function, we need to align it with the sequences available in the databases. This analysis can tell us two things:
1. Whether the public databases contain any sequence that is a potential homolog of the newly derived sequence.
2. Whether the sequence contains a known functional motif that is present in other protein families.

Global alignment:
Compares the two sequences over their entire length. This method does not always work very well, because in many cases only a portion of the two sequences can be aligned.

Local alignment:
When a local alignment is performed, a short, highly similar sequence motif that is present in both sequences is uncovered. Starting from this seed, the alignment is quickly extended.
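
To make the difference concrete, here is a small dynamic-programming sketch that scores the same pair of sequences globally (Needleman-Wunsch style) and locally (Smith-Waterman style). It is not how BLAST searches a database; the function name and the match, mismatch and gap values are assumptions chosen for illustration.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b; local=True restarts at zero."""
    # prev[j] is the best score for the previous row of the DP table.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may start anywhere, so never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# Only the central "ACG" is shared by these two sequences.
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=False))
```

The local score rewards the shared motif, while the global score is dragged down by the unrelated flanks, which is the situation described above.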

BLAST: Basic Local Alignment Search Tool.

There are several BLAST programs (blastn, blastp, blastx, tblastn and tblastx); which one we use depends on the type of query sequence and the type of database being searched.

A word is a short run of consecutive characters from the query sequence. For example, if the sequence is NYLENFVQATFN and the word size is 3, the words are the overlapping triplets NYL, YLE, LEN, ENF, and so on up to TFN. These words are then compared with the database sequences; once a match is found in a target protein, a scoring matrix is used to compute the local alignment score.
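
The word list for a query can be sketched in a few lines. The helper name `query_words` is made up for this note; real blastp additionally considers "neighborhood" words that score above a threshold against each query word, which this sketch leaves out.

```python
def query_words(sequence, word_size=3):
    """Return every overlapping word of length word_size in the sequence."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


words = query_words("NYLENFVQATFN", 3)
print(len(words), words[:4], words[-1])
```

A sequence of length 12 yields 12 - 3 + 1 = 10 words of size 3.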

After the initial matches are found, each local alignment is extended in both directions for as long as the alignment score keeps improving; the extension stops once the score starts to fall off. The resulting high-scoring alignments are then compared with the scores obtained by random searches.
The idea is that if there is a real degree of homology, the observed sequence similarity should not be reproducible by random searches of the database. Against a pool of random searches, the expected chance of finding the observed sequence similarity is computed (this is what the E-value reports).
The lower this value, the less likely it is that your alignment is a random hit and the more likely that it is a signature of true homology.
This expected value depends on two factors:
1. the size and quality of the alignment.
2. the size of the database against which the comparison is being made.
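
The extension step can be sketched as follows. This is a simplified, ungapped version under stated assumptions: plain match/mismatch scores stand in for a substitution matrix, and the stopping rule (give up once the score has dropped more than `x_drop` below the best seen) and all function names are illustrative, not BLAST's actual code.

```python
def _extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk in one direction from (q, s); return (best gain, residues kept)."""
    score = best = best_len = n = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        n += 1
        if score > best:
            best, best_len = score, n
        elif best - score > x_drop:
            break  # the score has fallen too far below the best seen
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_start, s_start, word_size,
                match=1, mismatch=-1, x_drop=2):
    """Extend a word hit both ways; return (score, query_begin, query_end)."""
    seed = sum(match if query[q_start + k] == subject[s_start + k] else mismatch
               for k in range(word_size))
    right, r_len = _extend(query, subject, q_start + word_size,
                           s_start + word_size, 1, match, mismatch, x_drop)
    left, l_len = _extend(query, subject, q_start - 1, s_start - 1,
                          -1, match, mismatch, x_drop)
    return seed + left + right, q_start - l_len, q_start + word_size + r_len


query = "CCCCACGTACGTCCCC"
subject = "TTTTACGTACGTTTTT"
score, begin, end = extend_seed(query, subject, 6, 6, 3)  # seed word "GTA"
print(score, query[begin:end])
```

Starting from the three-letter seed, the extension recovers the whole shared region `ACGTACGT` and stops when it runs into the unrelated flanks.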
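
Both factors appear in the standard Karlin-Altschul relation, E = K·m·n·e^(-λS), which can be rewritten with the bit score S' as E = m·n·2^(-S'). The sketch below assumes plain query and database lengths; BLAST itself uses edge-corrected "effective" lengths, so the numbers are illustrative rather than a reproduction of a real report. The function names are made up for this note.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Doubling the database size doubles E for the same alignment.
print(expect_value(40, 146, 1_000_000))
print(expect_value(40, 146, 2_000_000))
# A better alignment (more bits) lowers E exponentially.
print(expect_value(60, 146, 1_000_000))
```

As a rough check, assuming the commonly quoted gapped BLOSUM62 (gap costs 11, 1) parameters λ ≈ 0.267 and K ≈ 0.041, `bit_score(543, 0.267, 0.041)` gives about 213.8, close to the 213 bits reported for a raw score of 543 in the mouse alignment in this note.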

Expected threshold: choose a low expect threshold if we want only phylogenetically very similar sequences; choose a high expect threshold if we also want more dissimilar sequences.

BLAST is designed to identify local regions of sequence similarity. In the graphical summary, the colors correspond to the scores of the alignments. The results are sorted by increasing E-value, and we want a very small E-value.

https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins

### Mus musculus/human:

Score	Expect	Method	Identities	Positives	Gaps
213 bits(543)	6e-69	Compositional matrix adjust.	98/146(67%)	116/146(79%)	6/146(4%)
Query  1    MAGPLRAPLLLLAILAVALAVSPAAGSSPGKPPRLVGGPMDASVEEEGVRRALDFAVGEY  60
            MA PLR+ L LLA+L VA A +P  G      PR++G P +A   EEGVRRALDFAV EY
Sbjct  1    MASPLRSLLFLLAVLGVAWAATPKQG------PRMLGAPEEADANEEGVRRALDFAVSEY  54

Query  61   NKASNDMYHSRALQVVRARKQIVAGVNYFFDVELGRTTCTKTQPNFDNCPFHDQPHLKRK  120
            NK SND YHSRA+QVVRARKQ+VAGVNYFFDVE+GRTTCTK+Q N  +CPFHDQPHL RK
Sbjct  55   NKGSNDAYHSRAIQVVRARKQLVAGVNYFFDVEMGRTTCTKSQTNLTDCPFHDQPHLMRK  114

Query  121  AFCSFQIYAVPWQGTMTFSKSTCQDA  146
            A CSFQIY+VPW+GT + +K +C++A
Sbjct  115  ALCSFQIYSVPWKGTHSLTKFSCKNA  140

98 of the 146 aligned positions are identical, and a further (116 - 98) = 18 are similar (marked with +), giving 116 positives in total.
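
These counts can be recomputed from the middle ("match") lines of the alignment above: a letter marks an identity, a `+` marks a similar residue, and a space marks a mismatch or gap. The helper name `summarize_midline` is made up for this note.

```python
# Match lines copied from the three alignment blocks above.
MATCH_LINES = [
    "MA PLR+ L LLA+L VA A +P  G      PR++G P +A   EEGVRRALDFAV EY",
    "NK SND YHSRA+QVVRARKQ+VAGVNYFFDVE+GRTTCTK+Q N  +CPFHDQPHL RK",
    "A CSFQIY+VPW+GT + +K +C++A",
]


def summarize_midline(lines):
    """Return (identities, positives, alignment length) for the match lines."""
    midline = "".join(lines)
    identities = sum(ch.isalpha() for ch in midline)
    similar = midline.count("+")
    return identities, identities + similar, len(midline)


identities, positives, length = summarize_midline(MATCH_LINES)
print(f"Identities {identities}/{length} ({identities / length:.0%})")
print(f"Positives  {positives}/{length} ({positives / length:.0%})")
```

The result agrees with the report header: 98/146 (67%) identities and 116/146 (79%) positives.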

### Mus musculus
1. https://www.ncbi.nlm.nih.gov/protein/AAA63298.1?report=fasta
2. https://www.ncbi.nlm.nih.gov/protein/AAG40283.1?report=fasta

### The Expect value (E):

The Expect value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases. Essentially, the E value describes the random background noise. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.

The lower the E-value, or the closer it is to zero, the more "significant" the match is. However, keep in mind that virtually identical short alignments have relatively high E values. This is because the calculation of the E value takes into account the length of the query sequence. These high E values make sense because shorter sequences have a higher probability of occurring in the database purely by chance. For more details please see the calculations in the BLAST Course.

The Expect value can also be used as a convenient way to create a significance threshold for reporting results. You can change the Expect value threshold on most BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.

In [5]:
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIWWW


In [20]:

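# Read the FASTA query and submit it to NCBI's remote blastp service.
with open("cst3.fa") as fasta_file:
    fasta_string = fasta_file.read()
result_handle = NCBIWWW.qblast("blastp", "nr", fasta_string,
                               word_size=2, gapcosts='11 1',
                               composition_based_statistics='no adjustment')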

with open("my-output.xml", 'w') as f:
    f.write(result_handle.getvalue())

In [21]:
print(result_handle.getvalue())

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
  <BlastOutput_version>BLASTP 2.8.1+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_36161</BlastOutput_query-ID>
  <BlastOutput_query-def>CST3</BlastOutput_query-def>
  <BlastOutput_query-len>146</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_matrix>BLOSUM62</Parameters_matrix>
      <Parameters_expect>10</Parameters_expect>
      <Parameters_gap-open>11</Parameters_gap-open>
      <Parameters_gap-ex

In [22]:
blast_records = NCBIXML.parse(result_handle)

In [23]:
for record in blast_records:
    print(record.query)
    for alignment in record.alignments:
        print(alignment.title)
        print(alignment.hit_id)
        print(alignment.hit_def)
        for hsp in alignment.hsps:
            print(hsp.score)
            print(hsp.bits)
            print(hsp.expect)
            print(hsp.query)
            print(hsp.match)
            print(hsp.sbjct)

CST3
gi|4503107|ref|NP_000090.1| cystatin-C precursor [Homo sapiens] >gi|568599832|ref|NP_001275543.1| cystatin-C precursor [Homo sapiens] >gi|1034136378|ref|XP_016793041.1| cystatin-C [Pan troglodytes] >gi|1351477886|ref|XP_024094403.1| cystatin-C [Pongo abelii] >gi|118183|sp|P01034.1|CYTC_HUMAN RecName: Full=Cystatin-C; AltName: Full=Cystatin-3; AltName: Full=Gamma-trace; AltName: Full=Neuroendocrine basic polypeptide; AltName: Full=Post-gamma-globulin; Flags: Precursor >gi|296643|emb|CAA36497.1| cystatin C [Homo sapiens] >gi|755738|emb|CAA29096.1| cystatin C [Homo sapiens] >gi|4490944|emb|CAA43856.2| cystatin C [Homo sapiens] >gi|15341822|gb|AAH13083.1| Cystatin C [Homo sapiens] >gi|30582517|gb|AAP35485.1| cystatin C (amyloid angiopathy and cerebral hemorrhage) [Homo sapiens] >gi|49456929|emb|CAG46785.1| CST3 [Homo sapiens] >gi|49456989|emb|CAG46815.1| CST3, partial [Homo sapiens] >gi|60820979|gb|AAX36556.1| cystatin C [synthetic construct] >gi|61360748|gb|AAX41918.1| cystatin C [sy
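
For the "highest scoring hit in mouse" question, the saved XML report can also be searched with the standard library alone. The sketch below assumes the classic BLAST XML element names (`Hit`, `Hit_def`, `Hsp`, `Hsp_bit-score`, `Hsp_evalue`); the function name and the tiny `SAMPLE` document are invented for illustration and are not real search results.

```python
import xml.etree.ElementTree as ET


def best_hit_for_organism(xml_text, organism):
    """Return (hit_def, bit_score, evalue) of the best HSP for an organism."""
    root = ET.fromstring(xml_text)
    best = None
    for hit in root.iter("Hit"):
        hit_def = hit.findtext("Hit_def", default="")
        # nr merges identical proteins, so one Hit_def can list many species.
        if f"[{organism}]" not in hit_def:
            continue
        for hsp in hit.iter("Hsp"):
            bits = float(hsp.findtext("Hsp_bit-score"))
            evalue = float(hsp.findtext("Hsp_evalue"))
            if best is None or bits > best[1]:
                best = (hit_def, bits, evalue)
    return best


def _hit(definition, bits, evalue):
    # Builds one made-up <Hit> element for the sample document.
    return (f"<Hit><Hit_def>{definition}</Hit_def><Hit_hsps><Hsp>"
            f"<Hsp_bit-score>{bits}</Hsp_bit-score>"
            f"<Hsp_evalue>{evalue}</Hsp_evalue></Hsp></Hit_hsps></Hit>")


SAMPLE = ("<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>"
          + _hit("protein A [Homo sapiens]", 300.0, "1e-100")
          + _hit("protein A [Mus musculus]", 200.0, "1e-60")
          + _hit("protein B [Mus musculus]", 50.0, "1e-05")
          + "</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>")

print(best_hit_for_organism(SAMPLE, "Mus musculus"))
```

On a real report the same function could be fed the text of the saved `my-output.xml` file.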

## TEST

In [2]:
from Bio.Blast import NCBIWWW

fastaSequence = ">TEST 1-211670\nAGACTGCGATCCGAACTGAGAAC"

result_handle1 = NCBIWWW.qblast("blastn", "nr",
                                fastaSequence,
                                word_size=7,
                                gapcosts='5 2',
                                nucl_reward=1,
                                nucl_penalty=-3,
                                expect=1000)

In [3]:
print(result_handle1.getvalue())

<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
  <BlastOutput_program>blastn</BlastOutput_program>
  <BlastOutput_version>BLASTN 2.8.1+</BlastOutput_version>
  <BlastOutput_reference>Stephen F. Altschul, Thomas L. Madden, Alejandro A. Sch&amp;auml;ffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), &quot;Gapped BLAST and PSI-BLAST: a new generation of protein database search programs&quot;, Nucleic Acids Res. 25:3389-3402.</BlastOutput_reference>
  <BlastOutput_db>nr</BlastOutput_db>
  <BlastOutput_query-ID>Query_13829</BlastOutput_query-ID>
  <BlastOutput_query-def>TEST 1-211670</BlastOutput_query-def>
  <BlastOutput_query-len>23</BlastOutput_query-len>
  <BlastOutput_param>
    <Parameters>
      <Parameters_expect>1000</Parameters_expect>
      <Parameters_sc-match>1</Parameters_sc-match>
      <Parameters_sc-mismatch>-3</Parameters_sc-mismatch>
      <Para

In [6]:
blast_records1 = NCBIXML.parse(result_handle1)

In [24]:
for record in blast_records1:
    print(record.query)
    for alignment in record.alignments:
        print(alignment.title)
        print(alignment.hit_id)
        print(alignment.hit_def)
        for hsp in alignment.hsps:
            print(hsp.score)
            print(hsp.bits)
            print(hsp.expect)
            print(hsp.query)
            print(hsp.match)
            print(hsp.sbjct)

In [None]:
# Dump every parsed field (Python 3 version of the old scratch code).
# Rewind first: the parsing loop above has already consumed the handle.
result_handle.seek(0)
for blast_record in NCBIXML.parse(result_handle):
    print("printing header stuff")
    print("blast_record.query", blast_record.query)
    print("blast_record.query_letters", blast_record.query_letters)
    print("printing descriptions")
    for description in blast_record.descriptions:
        print("description.title", description.title)
        print("description.score", description.score)  # I think this is the best score
        print("description.e", description.e)  # I think this is the best e-value
        print("num_alignments", description.num_alignments)  # number of alignments per query
    print("printing alignments")
    for alignment in blast_record.alignments:
        print("alignment.title", alignment.title)
        print("alignment.length", alignment.length)
        print("printing hsps")
        for hsp in alignment.hsps:  # multiple hits means multiple hsps
            print("hsp.align_length", hsp.align_length)
            print("hsp.score", hsp.score)
            print("hsp.bits", hsp.bits)
            print("hsp.expect", hsp.expect)
            print("hsp.num_alignments", hsp.num_alignments)
            print("hsp.identities", hsp.identities)
            print("hsp.positives", hsp.positives)
            print("hsp.gaps", hsp.gaps)
            print("hsp.strand", hsp.strand)
            print("hsp.frame", hsp.frame)
            print("hsp.query", hsp.query)
            print("hsp.query_start", hsp.query_start)
            print("hsp.query_end", hsp.query_end)
            print("hsp.match", hsp.match)
            print("hsp.sbjct", hsp.sbjct)
            print("hsp.sbjct_start", hsp.sbjct_start)
            print("hsp.sbjct_end", hsp.sbjct_end)
    print("-----------------------------------")

result_handle.close()