In [1]:
import warnings
warnings.filterwarnings(action='ignore')

# Chapter 8. BLAST and other sequence search tools

In this chapter, we’ll go through the main features of `Bio.SearchIO` to show what it can do for you. We’ll use two popular search tools along the way: BLAST and BLAT. They are used merely for illustrative purposes, and you should be able to adapt the workflow to any other search tool supported by `Bio.SearchIO` in a breeze. You’re very welcome to follow along with the search output files we’ll be using. The BLAST output file can be downloaded [here (`my_blast.xml`)](https://github.com/biopython/biopython/blob/master/Doc/examples/my_blast.xml), and the BLAT output file [here (`my_blat.psl`)](https://raw.githubusercontent.com/biopython/biopython/master/Doc/examples/my_blat.psl). Both files are also included with the Biopython source code under the `Doc/examples/` folder. Both output files were generated using this sequence:
```
>mystery_seq
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
```
The BLAST result is an XML file generated using blastn against the NCBI `refseq_rna` database. For BLAT, the sequence database was the February 2009 `hg19` human genome draft and the output format is PSL.

## 8.1 The SearchIO object model
The object model consists of a nested hierarchy of Python objects. These objects are:
* `QueryResult`, to represent a single search query.
* `Hit`, to represent a single database hit. `Hit` objects are contained within `QueryResult`, and in each `QueryResult` there are zero or more `Hit` objects.
* `HSP` (short for high-scoring pair), to represent region(s) of significant alignments between query and hit sequences. `HSP` objects are contained within `Hit` objects and each `Hit` has one or more `HSP` objects.
* `HSPFragment`, to represent a single contiguous alignment between query and hit sequences. `HSPFragment` objects are contained within `HSP` objects. Most sequence search tools like BLAST and HMMER unify `HSP` and `HSPFragment` objects, as each `HSP` will only have a single `HSPFragment`. However, there are tools like BLAT and Exonerate that produce `HSP` objects containing multiple `HSPFragment` objects.
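
The nesting described above can be pictured with a few toy dataclasses. This is only a sketch of the containment relationships: the `Toy*` class names and their fields are made up for illustration and are not the `Bio.SearchIO` classes, which carry many more attributes. The coordinates are borrowed from the BLAT example later in this chapter.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyFragment:
    """One contiguous aligned block (hypothetical stand-in for HSPFragment)."""
    query_range: Tuple[int, int]
    hit_range: Tuple[int, int]


@dataclass
class ToyHSP:
    """One or more fragments (hypothetical stand-in for HSP)."""
    fragments: List[ToyFragment] = field(default_factory=list)


@dataclass
class ToyHit:
    """All HSPs against one database entry (hypothetical stand-in for Hit)."""
    id: str
    hsps: List[ToyHSP] = field(default_factory=list)


@dataclass
class ToyQueryResult:
    """All hits for one query (hypothetical stand-in for QueryResult)."""
    id: str
    hits: List[ToyHit] = field(default_factory=list)


# A BLAT-style HSP made of two fragments
two_block_hsp = ToyHSP([
    ToyFragment((0, 18), (54233104, 54233122)),
    ToyFragment((18, 61), (54264420, 54264463)),
])
toy_qresult = ToyQueryResult("mystery_seq", [ToyHit("chr19", [two_block_hsp])])

# Walk the hierarchy from the query down to each fragment
for hit in toy_qresult.hits:
    for hsp in hit.hsps:
        for frag in hsp.fragments:
            print(toy_qresult.id, hit.id, frag.query_range, frag.hit_range)
```

A BLAST-style HSP would simply hold a single `ToyFragment` in this picture.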

These four objects are the ones you will interact with when you use `Bio.SearchIO`. They are created using one of the main `Bio.SearchIO` functions: `read`, `parse`, `index`, or `index_db`. These functions behave similarly to their `Bio.SeqIO` and `Bio.AlignIO` counterparts:
* `read` is used for search output files with a single query and returns a `QueryResult` object.
* `parse` is used for search output files with multiple queries and returns a generator that yields `QueryResult` objects.
* `index` and `index_db` are used for search output files with multiple queries and return a dictionary-like object.

With that settled, let’s start probing each `Bio.SearchIO` object, beginning with `QueryResult`.

### 8.1.1 QueryResult
The `QueryResult` object represents a single search query and contains zero or more `Hit` objects. Let’s see what it looks like using the BLAST file we have:

In [2]:
from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
print(blast_qresult)

Program: blastn (2.14.0+)
  Query: gi|8332116|gb|BE037100.1|BE037100 (1111)
         MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acc...
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|1219041180|ref|XM_021875076.1|  PREDICTED: Chenopodi...
            1      1  gi|1226796956|ref|XM_021992092.1|  PREDICTED: Spinacia ...
            2      1  gi|2252585423|ref|XM_010682658.2|  PREDICTED: Beta vulg...
            3      1  gi|2031543140|ref|XM_041168865.1|  PREDICTED: Juglans m...
            4      1  gi|2247117892|ref|XM_048479995.1|  PREDICTED: Ziziphus ...
            5      1  gi|2082357255|ref|XM_043119049.1|  PREDICTED: Carya ill...
            6      1  gi|2082357253|ref|XM_043119041.1|  PREDICTED: Carya ill...
            7      1  gi|1882610310|ref|XM_035691634.1|  PREDIC

By invoking `print` on the `QueryResult` object, you can see:
* The program name and version (blastn version 2.14.0+)
* The query ID, description, and its sequence length (the ID is `gi|8332116|gb|BE037100.1|BE037100`, the description starts with ‘MP14H09 MP Mesembryanthemum crystallinum cDNA’, and it is 1111 nucleotides long)
* The target database to search against (`nt`)
* A quick overview of the resulting hits. For our query sequence, there are 50 hits (numbered 0–49 in the table). For each hit, we can also see how many HSPs it contains, its ID, and a snippet of its description. Notice that `Bio.SearchIO` truncates the hit table overview for long hit lists, showing only the first hits and the last few.

Now let’s check our BLAT results using the same procedure as above:

In [3]:
blat_qresult = SearchIO.read("my_blat.psl", "blat-psl")
print(blat_qresult)

Program: blat (<unknown version>)
  Query: mystery_seq (61)
         <unknown description>
 Target: <unknown target>
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0     17  chr19  <unknown description>


All the details you saw when invoking `print` can be accessed individually using Python’s object attribute access notation (a.k.a. the dot notation). There are also other format-specific attributes that you can access using the same notation.

In [4]:
print("%s %s" % (blast_qresult.program, blast_qresult.version))
print("%s %s" % (blat_qresult.program, blat_qresult.version))
blast_qresult.param_evalue_threshold # blast-xml specific

blastn 2.14.0+
blat <unknown version>


10.0

Like Python lists and dictionaries, `QueryResult` objects are iterable. Each iteration returns a `Hit` object:

In [5]:
for hit in blast_qresult:
    print(hit)

Query: gi|8332116|gb|BE037100.1|BE037100
       MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold accli...
  Hit: gi|1219041180|ref|XM_021875076.1| (1173)
       PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane prote...
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0  7.1e-117     435.90     624         [58:678]              [277:901]
Query: gi|8332116|gb|BE037100.1|BE037100
       MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold accli...
  Hit: gi|1226796956|ref|XM_021992092.1| (672)
       PREDICTED: Spinacia oleracea cold-regulated 413 plasma membrane protei...
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  

To check how many items (hits) a `QueryResult` has, you can simply invoke Python’s `len` function:

In [6]:
print(len(blast_qresult))
print(len(blat_qresult))

50
1

Like Python lists, you can retrieve items (hits) from a `QueryResult` using index and slice notation. Slicing returns a new `QueryResult` that contains only the sliced hits:


In [7]:
blast_qresult[0] # retrieves the top hit
blast_qresult[-1] # retrieves the last hit

Hit(id='gi|349709091|emb|FQ378501.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps)

In [8]:
blast_slice = blast_qresult[:3] # slices the first three hits
print(blast_slice)

Program: blastn (2.14.0+)
  Query: gi|8332116|gb|BE037100.1|BE037100 (1111)
         MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acc...
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|1219041180|ref|XM_021875076.1|  PREDICTED: Chenopodi...
            1      1  gi|1226796956|ref|XM_021992092.1|  PREDICTED: Spinacia ...
            2      1  gi|2252585423|ref|XM_010682658.2|  PREDICTED: Beta vulg...


Like Python dictionaries, you can also retrieve hits using the hit’s ID. This is particularly useful if you know a given hit ID exists within the results of a search query:

In [9]:
blast_qresult["gi|1219041180|ref|XM_021875076.1|"] # "gi|262205317|ref|NR_030195.1|" in original code

Hit(id='gi|1219041180|ref|XM_021875076.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps)

You can also get the full list of `Hit` objects using `hits` and the full list of hit IDs using `hit_keys`:

In [10]:
blast_qresult.hits

[Hit(id='gi|1219041180|ref|XM_021875076.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|1226796956|ref|XM_021992092.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|2252585423|ref|XM_010682658.2|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|2031543140|ref|XM_041168865.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|2247117892|ref|XM_048479995.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|2082357255|ref|XM_043119049.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|2082357253|ref|XM_043119041.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|1882610310|ref|XM_035691634.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|1882610309|ref|XM_018970776.2|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),
 Hit(id='gi|1350315641|ref|XM_024180293.1|', query_id='gi|8332116|gb|BE037100.1|BE037100', 1 hsps),


In [11]:
blast_qresult.hit_keys

['gi|1219041180|ref|XM_021875076.1|',
 'gi|1226796956|ref|XM_021992092.1|',
 'gi|2252585423|ref|XM_010682658.2|',
 'gi|2031543140|ref|XM_041168865.1|',
 'gi|2247117892|ref|XM_048479995.1|',
 'gi|2082357255|ref|XM_043119049.1|',
 'gi|2082357253|ref|XM_043119041.1|',
 'gi|1882610310|ref|XM_035691634.1|',
 'gi|1882610309|ref|XM_018970776.2|',
 'gi|1350315641|ref|XM_024180293.1|',
 'gi|1350315638|ref|XM_006425719.2|',
 'gi|1350315636|ref|XM_006425716.2|',
 'gi|1350315634|ref|XM_006425717.2|',
 'gi|1204884098|ref|XM_021445554.1|',
 'gi|2395983800|ref|XM_006466626.4|',
 'gi|2395983799|ref|XM_006466625.3|',
 'gi|2395983798|ref|XM_006466623.4|',
 'gi|2395983797|ref|XM_006466624.4|',
 'gi|2395983796|ref|XM_025094967.2|',
 'gi|1227938481|ref|XM_022049453.1|',
 'gi|1063463253|ref|XM_007047033.2|',
 'gi|1063463252|ref|XM_007047032.2|',
 'gi|1269881407|ref|XM_022895605.1|',
 'gi|1269881405|ref|XM_022895604.1|',
 'gi|1269881403|ref|XM_022895603.1|',
 'gi|2082386146|ref|XM_043113302.1|',
 'gi|2082386

What if you just want to check whether a particular hit is present in the query results? You can do a simple Python membership test using the `in` keyword:

In [12]:
print("gi|1219041180|ref|XM_021875076.1|" in blast_qresult) # "gi|262205317|ref|NR_030195.1|" in blast_qresult
print("gi|262205317|ref|NR_030195.1|" in blast_qresult) # "gi|262205317|ref|NR_030194.1|" in blast_qresult

True
False

Sometimes, knowing whether a hit is present is not enough; you also want to know the rank of the hit. Here, the `index` method comes to the rescue:


In [13]:
blast_qresult.index("gi|1861285698|gb|MN544658.1|") # "gi|301171437|ref|NR_035870.1|" in original code

47
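
The mix of list-like and dictionary-like access shown above can be mimicked with a small toy container. The sketch below is not how `Bio.SearchIO` implements `QueryResult`; `ToyResult` is a hypothetical class that stores (ID, sequence length) pairs, using three hits from this chapter, just to show how one object can answer positional, slice, and ID lookups.

```python
class ToyResult:
    """Ordered, ID-keyed container (hypothetical; not the Bio.SearchIO class)."""

    def __init__(self, pairs):
        # dict preserves insertion order, so it doubles as the ranked list
        self._items = dict(pairs)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items.values())

    def __contains__(self, key):
        return key in self._items

    def __getitem__(self, key):
        if isinstance(key, str):  # dictionary-style lookup by ID
            return self._items[key]
        if isinstance(key, slice):  # slicing returns a new container
            return ToyResult(list(self._items.items())[key])
        return list(self._items.values())[key]  # list-style lookup by rank

    def index(self, key):
        """Return the rank of the given ID."""
        return list(self._items).index(key)


toy = ToyResult([
    ("gi|1219041180|ref|XM_021875076.1|", 1173),
    ("gi|1226796956|ref|XM_021992092.1|", 672),
    ("gi|2252585423|ref|XM_010682658.2|", 898),
])

print(toy[0])                                       # by rank
print(toy["gi|1226796956|ref|XM_021992092.1|"])     # by ID
print(len(toy[:2]))                                 # slice gives a new container
print("gi|262205317|ref|NR_030195.1|" in toy)       # membership test
print(toy.index("gi|2252585423|ref|XM_010682658.2|"))
```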

If the native hit ordering doesn’t suit your taste, you can use the `sort` method of the `QueryResult` object. It is very similar to Python’s `list.sort` method, with the addition of an `in_place` option that controls whether the object itself is sorted or a new, sorted `QueryResult` object is returned.

For this particular sort, we’ll set the `in_place` flag to False so that sorting will return a new `QueryResult` object and leave our initial object unsorted.

In [14]:
for hit in blast_qresult[:5]: # id and sequence length of the first five hits
    print("%s %i" % (hit.id, hit.seq_len))

gi|1219041180|ref|XM_021875076.1| 1173
gi|1226796956|ref|XM_021992092.1| 672
gi|2252585423|ref|XM_010682658.2| 898
gi|2031543140|ref|XM_041168865.1| 1020
gi|2247117892|ref|XM_048479995.1| 1030


In [15]:
sort_key = lambda hit: hit.seq_len
sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False)
for hit in sorted_qresult[:5]:
    print("%s %i" % (hit.id, hit.seq_len))

gi|2396494064|ref|XM_024605027.2| 1178
gi|1219041180|ref|XM_021875076.1| 1173
gi|743838297|ref|XM_011027373.1| 1132
gi|1860377399|ref|XM_035077205.1| 1109
gi|764593175|ref|XM_004300526.2| 1105
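
The effect of the `in_place` flag can be illustrated on a plain Python list. The helper below, `sort_hits`, is a hypothetical stand-in written for this sketch, not a `Bio.SearchIO` function; it only mirrors the behaviour described above, using the accessions and lengths of the first five hits.

```python
def sort_hits(hits, key=None, reverse=False, in_place=True):
    """Sort a list of hits, either in place or into a new list (toy helper)."""
    if in_place:
        hits.sort(key=key, reverse=reverse)
        return None
    return sorted(hits, key=key, reverse=reverse)


# (accession, sequence length) pairs for the first five hits
hits = [
    ("XM_021875076.1", 1173),
    ("XM_021992092.1", 672),
    ("XM_010682658.2", 898),
    ("XM_041168865.1", 1020),
    ("XM_048479995.1", 1030),
]

by_length = sort_hits(hits, key=lambda hit: hit[1], reverse=True, in_place=False)

print([length for _, length in by_length])  # sorted copy, longest first
print([length for _, length in hits])       # original order is untouched
```

With `in_place=True` the same call would reorder `hits` itself and return `None`, just like `list.sort`.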


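Here is an example of using `hit_filter` to keep only the `Hit` objects that have at least one HSP. Every hit in this BLAST result has exactly one HSP, so no hits are removed here; a stricter callback such as `len(hit.hsps) > 1` would instead drop the hits that have only one HSP: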

In [16]:
filter_func = lambda hit: len(hit.hsps) > 0 # the callback function
print(len(blast_qresult)) # no. of hits before filtering

filtered_qresult = blast_qresult.hit_filter(filter_func)
print(len(filtered_qresult)) # no. of hits after filtering

for hit in filtered_qresult[:5]: # quick check of the number of HSPs in each hit
    print("%s %i" % (hit.id, len(hit.hsps)))

50
50
gi|1219041180|ref|XM_021875076.1| 1
gi|1226796956|ref|XM_021992092.1| 1
gi|2252585423|ref|XM_010682658.2| 1
gi|2031543140|ref|XM_041168865.1| 1
gi|2247117892|ref|XM_048479995.1| 1


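`hsp_filter` works the same as `hit_filter`, only instead of looking at the `Hit` objects, it performs filtering on the `HSP` objects in each hit.

`QueryResult` objects also provide the `hit_map` and `hsp_map` methods, which apply a callback function to every hit or HSP, respectively, and return a new `QueryResult`. Here is an example of using `hit_map` to rename the hit IDs: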

In [17]:
def map_func(hit):
    # renames "gi|301171322|ref|NR_035857.1|" to "NR_035857.1"
    hit.id = hit.id.split("|")[3]
    return hit

mapped_qresult = blast_qresult.hit_map(map_func)
for hit in mapped_qresult[:5]:
    print(hit.id)

XM_021875076.1
XM_021992092.1
XM_010682658.2
XM_041168865.1
XM_048479995.1
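
The callback contracts used by the filter and map methods can be sketched on plain dictionaries. `filter_hits` and `map_hits` below are hypothetical helpers written for this illustration, and the two toy hits (one BLAST hit with 1 HSP, the BLAT hit with 17 HSPs) are mixed into one list only for the sake of the example.

```python
import copy


def filter_hits(hits, func=None):
    """Keep the hits for which the callback returns a true value."""
    return [hit for hit in hits if func is None or func(hit)]


def map_hits(hits, func=None):
    """Apply the callback to a copy of each hit, leaving the originals alone."""
    copies = copy.deepcopy(hits)
    return [func(hit) for hit in copies] if func else copies


def shorten_id(hit):
    """Rename 'gi|...|ref|ACCESSION|' to 'ACCESSION' when the ID has that shape."""
    parts = hit["id"].split("|")
    if len(parts) > 3:
        hit["id"] = parts[3]
    return hit


hits = [
    {"id": "gi|1219041180|ref|XM_021875076.1|", "hsps": 1},
    {"id": "chr19", "hsps": 17},
]

multi_hsp = filter_hits(hits, lambda hit: hit["hsps"] > 1)
renamed = map_hits(hits, shorten_id)

print([hit["id"] for hit in multi_hsp])
print([hit["id"] for hit in renamed])
print(hits[0]["id"])  # the original hit keeps its full ID
```

Copying before mapping is a design choice of this sketch: it keeps the callback from silently changing the object you started with.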


### 8.1.2 Hit
`Hit` objects represent all query results from a single database entry. Let’s see what they look like, beginning with our BLAST search:

In [18]:
from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
blast_hit = blast_qresult[3] # fourth hit from the query result
print(blast_hit)

Query: gi|8332116|gb|BE037100.1|BE037100
       MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold accli...
  Hit: gi|2031543140|ref|XM_041168865.1| (1020)
       PREDICTED: Juglans microcarpa x Juglans regia cold-regulated 413 plasm...
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0  1.8e-105     398.93     593         [64:655]              [253:838]


Now let’s contrast this with the BLAT search.

In [19]:
blat_qresult = SearchIO.read("my_blat.psl", "blat-psl")
blat_hit = blat_qresult[0] # the only hit
print(blat_hit)

Query: mystery_seq
       <unknown description>
  Hit: chr19 (59128983)
       <unknown description>
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         ?          ?       ?           [0:61]    [54204480:54204541]
          1         ?          ?       ?           [0:61]    [54233104:54264463]
          2         ?          ?       ?           [0:61]    [54254477:54260071]
          3         ?          ?       ?           [1:61]    [54210720:54210780]
          4         ?          ?       ?           [0:60]    [54198476:54198536]
          5         ?          ?       ?           [0:61]    [54265610:54265671]
          6         ?          ?       ?           [0:61]    [54238143:54240175]
          7         ?          ?       ?           [0:60]    [54189735:54189795]
        

Here, we’ve got a similar level of detail as with the BLAST hit we saw earlier. There are some differences worth explaining, though:
* The e-value and bit score column values. As BLAT HSPs do not have e-values and bit scores, the display defaults to ‘?’.
* What about the span column? The span value is meant to display the complete alignment length, which consists of all residues and any gaps that may be present. The PSL format does not have this information readily available and `Bio.SearchIO` does not attempt to guess what it is, so we get a ‘?’ similar to the e-value and bit score columns.

In terms of Python objects, `Hit` behaves almost the same as a Python list, but contains `HSP` objects exclusively. If you’re familiar with lists, you should encounter no difficulties working with the `Hit` object.

Just like Python lists, `Hit` objects are iterable, and each iteration returns one HSP object it contains:

In [20]:
for hsp in blast_hit:
    print(hsp) # a bare expression inside a loop displays nothing in a notebook

You can invoke `len` on a `Hit` to see how many `HSP` objects it has:

In [21]:
len(blast_hit)

1

In [22]:
len(blat_hit)

17

You can use the slice notation on `Hit` objects, whether to retrieve a single `HSP` or multiple `HSP` objects. Like `QueryResult`, slicing for multiple `HSP` objects returns a new `Hit` object containing only the sliced `HSP` objects:

In [23]:
blat_hit[0] # retrieve single items
sliced_hit = blat_hit[4:9] # retrieve multiple items
len(sliced_hit)
print(sliced_hit)

Query: mystery_seq
       <unknown description>
  Hit: chr19 (59128983)
       <unknown description>
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         ?          ?       ?           [0:60]    [54198476:54198536]
          1         ?          ?       ?           [0:61]    [54265610:54265671]
          2         ?          ?       ?           [0:61]    [54238143:54240175]
          3         ?          ?       ?           [0:60]    [54189735:54189795]
          4         ?          ?       ?           [0:61]    [54185425:54185486]


### 8.1.3 HSP
`HSP` (high-scoring pair) represents region(s) in the hit sequence that contain significant alignment(s) to the query sequence. It contains the actual match between your query sequence and a database entry. As this match is determined by the sequence search tool’s algorithms, the `HSP` object contains the bulk of the statistics computed by the search tool. This also makes the distinction between `HSP` objects from different search tools more apparent compared to the differences you’ve seen in `QueryResult` or `Hit` objects.

Let’s see some examples from our BLAST and BLAT searches. We’ll look at the BLAST HSP first:

In [24]:
from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
blast_hsp = blast_qresult[0][0] # first hit, first hsp
print(blast_hsp)

      Query: gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cr...
        Hit: gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa ...
Query range: [58:678] (1)
  Hit range: [277:901] (1)
Quick stats: evalue 7.1e-117; bitscore 435.90
  Fragments: 1 (624 columns)
     Query - ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGG~~~ATGTA
             || ||||||||| |||| | |||| ||  |||| |||| | |||| ||| | |||| ||~~~|| ||
       Hit - ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGAGCAACTAGCCGCGG~~~ATTTA


Just like `QueryResult` and `Hit`, invoking print on an HSP shows its general details:
* There are the query and hit IDs and descriptions. We need these to identify our HSP.
* We’ve also got the matching range of the query and hit sequences. The slice notation we’re using here is an indication that the range is displayed using Python’s indexing style (zero-based, half open). The number inside the parentheses denotes the strand. In this case, both sequences are on the plus strand.
* Some quick statistics are available: the e-value and bitscore.
* There is information about the HSP fragments. Ignore this for now; it will be explained later on.
* And finally, we have the query and hit sequence alignment itself.

These details can be accessed on their own using the dot notation, just like in `QueryResult` and `Hit`:

In [25]:
print(blast_hsp.query_range)
print(blast_hsp.evalue)

(58, 678)
7.07413e-117


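In addition to these, `Bio.SearchIO` provides predefined properties that make it easier to probe an HSP, such as its start coordinates and spans. Check out the HSP [documentation](http://biopython.org/docs/1.80/api/Bio.SearchIO.html#module-Bio.SearchIO) for a full list of these predefined properties.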

In [26]:
print(blast_hsp.hit_start) # start coordinate of the hit sequence
print(blast_hsp.query_span) # how many residues in the query sequence
print(blast_hsp.aln_span) # how long the alignment is

277
620
624


In [27]:
print(blast_hsp.gap_num) # number of gaps
print(blast_hsp.ident_num) # number of identical residues

4
473
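
The numbers printed above are enough to derive a couple of common summary figures by hand. The sketch below is plain arithmetic on the values shown (`ident_num` 473, `aln_span` 624, `query_span` 620); `percent_identity` is a hypothetical helper, not a `Bio.SearchIO` attribute.

```python
def percent_identity(ident_num, aln_span):
    """Identical columns as a percentage of all alignment columns."""
    if aln_span <= 0:
        raise ValueError("aln_span must be positive")
    return 100.0 * ident_num / aln_span


# Values copied from the BLAST HSP shown above
ident_num = 473
query_span = 620
aln_span = 624

identity = percent_identity(ident_num, aln_span)
print(f"{identity:.1f}% identity")

# Alignment columns not occupied by a query residue are gaps in the query row
query_gaps = aln_span - query_span
print(query_gaps)
```

Here the gap count in the query row equals the `gap_num` of 4 reported by BLAST, which is consistent with the hit row (span 624) containing no gaps.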


To see which details are available for a given sequence search tool, you should check the format’s documentation in `Bio.SearchIO`.
Alternatively, you may also use `.__dict__.keys()` for a quick list of what’s available:
```
>>> blast_hsp.__dict__.keys()
['bitscore', 'evalue', 'ident_num', 'gap_num', 'bitscore_raw', 'pos_num', '_items']
```

Finally, you may have noticed that the `query` and `hit` attributes of our HSP are not just regular strings:

In [28]:
blast_hsp.query

SeqRecord(seq=Seq('ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGC...GTA'), id='gi|8332116|gb|BE037100.1|BE037100', name='aligned query sequence', description="MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acclimation protein, mRNA sequence", dbxrefs=[])

In [29]:
blast_hsp.hit

SeqRecord(seq=Seq('ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGAGCAACTAGC...TTA'), id='gi|1219041180|ref|XM_021875076.1|', name='aligned hit sequence', description='PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA', dbxrefs=[])

In [30]:
print(blast_hsp.aln)

Alignment with 2 rows and 624 columns
ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTG...GTA gi|8332116|gb|BE037100.1|BE037100
ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTG...TTA gi|1219041180|ref|XM_021875076.1|


Let’s now take a look at HSPs from our BLAT results for a different kind of HSP. As usual, we’ll begin by invoking print on it:

In [31]:
blat_qresult = SearchIO.read("my_blat.psl", "blat-psl")
blat_hsp = blat_qresult[0][0] # first hit, first hsp
print(blat_hsp)

      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54204480:54204541] (1)
Quick stats: evalue ?; bitscore ?
  Fragments: 1 (? columns)


Some of the outputs you may have already guessed. We have the query and hit IDs and descriptions and the sequence coordinates. Values for evalue and bitscore are ‘?’ as BLAT HSPs do not have these attributes. But the biggest difference here is that you don’t see any sequence alignments displayed. If you look closer, the PSL format itself does not store any hit or query sequences, so `Bio.SearchIO` won’t create any sequence or alignment objects. What happens if you try to access `HSP.query`, `HSP.hit`, or `HSP.aln`? You’ll get the default value for these attributes, which is `None`:

In [32]:
print(blat_hsp.hit is None)
print(blat_hsp.query is None)
print(blat_hsp.aln is None)

True
True
True


This does not affect other attributes, though. For example, you can still access the length of the query or hit alignment. Despite not storing any sequences, the PSL format still has this information, so `Bio.SearchIO` can extract it:

In [33]:
print(blat_hsp.query_span) # length of query match
print(blat_hsp.hit_span) # length of hit match

61
61


Other format-specific attributes are still present as well:

In [34]:
print(blat_hsp.score) # PSL score
print(blat_hsp.mismatch_num) # the mismatch column

61
0


Let’s take a look at a BLAT HSP that contains multiple blocks to see how `Bio.SearchIO` deals with this:

In [35]:
blat_hsp2 = blat_qresult[0][1] # first hit, second hsp
print(blat_hsp2)

      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54233104:54264463] (1)
Quick stats: evalue ?; bitscore ?
  Fragments: ---  --------------  ----------------------  ----------------------
               #            Span             Query range               Hit range
             ---  --------------  ----------------------  ----------------------
               0               ?                  [0:18]     [54233104:54233122]
               1               ?                 [18:61]     [54264420:54264463]


Take a look at the hit coordinate of the HSP above. In the `Hit range:` field, we see that the coordinate is `[54233104:54264463]`. But looking at the table rows, we see that not the entire region spanned by this coordinate matches our query. Specifically, the intervening region spans from `54233122` to `54264420`. Why, then, do the query coordinates seem to be contiguous, you ask? This is perfectly fine. In this case it means that the query match is contiguous (no intervening regions), while the hit match is not.

All these attributes are accessible from the HSP directly, by the way:

In [36]:
blat_hsp2.hit_range # hit start and end coordinates of the entire HSP

(54233104, 54264463)

In [37]:
blat_hsp2.hit_range_all # hit start and end coordinates of each fragment

[(54233104, 54233122), (54264420, 54264463)]

In [38]:
blat_hsp2.hit_span # hit span of the entire HSP

31359

In [39]:
blat_hsp2.hit_span_all # hit span of each fragment

[18, 43]

In [40]:
blat_hsp2.hit_inter_ranges # start and end coordinates of intervening regions in the hit sequence

[(54233122, 54264420)]

In [41]:
blat_hsp2.hit_inter_spans # span of intervening regions in the hit sequence

[31298]

Most of these attributes are not readily available from the PSL file we have, but `Bio.SearchIO` calculates them for you on the fly when you parse the PSL file. All it needs are the start and end coordinates of each fragment.
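
To make that arithmetic concrete, here is a small sketch that derives the same numbers from nothing but the fragment coordinates. `summarize_fragments` is a hypothetical helper written for this illustration and is not part of `Bio.SearchIO`; it assumes zero-based, half-open ranges that do not overlap.

```python
def summarize_fragments(ranges):
    """Derive overall range, spans, and intervening regions from fragment ranges."""
    ranges = sorted(ranges)
    overall = (min(start for start, _ in ranges), max(end for _, end in ranges))
    span = overall[1] - overall[0]
    span_all = [end - start for start, end in ranges]
    # Each intervening region runs from one fragment's end to the next one's start
    inter_ranges = [(left[1], right[0]) for left, right in zip(ranges, ranges[1:])]
    inter_spans = [end - start for start, end in inter_ranges]
    return overall, span, span_all, inter_ranges, inter_spans


# Hit coordinates of the two fragments in the BLAT HSP above
hit_fragments = [(54233104, 54233122), (54264420, 54264463)]
overall, span, span_all, inter_ranges, inter_spans = summarize_fragments(hit_fragments)

print(overall)
print(span)
print(span_all)
print(inter_ranges)
print(inter_spans)
```

Applying the same helper to the query coordinates `[(0, 18), (18, 61)]` gives a single zero-length intervening region, which is another way of saying that the query match is contiguous.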

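What about the `query`, `hit`, and `aln` attributes? If the HSP has multiple fragments, you won’t be able to use these attributes, as they only fetch a single `SeqRecord` or `MultipleSeqAlignment` object. However, you can use their `*_all` counterparts: `query_all`, `hit_all`, and `aln_all`. These properties return a list containing the `SeqRecord` or `MultipleSeqAlignment` objects from each of the HSP fragments. There are other attributes that behave similarly, i.e. they only work for HSPs with one fragment. Check out the HSP documentation for a full list.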

Finally, to check whether you have multiple fragments or not, you can use the `is_fragmented` property like so:

In [42]:
blat_hsp2.is_fragmented # BLAT HSP with 2 fragments

True

In [43]:
blat_hsp.is_fragmented # BLAT HSP from earlier, with one fragment

False

### 8.1.4 HSP Fragment
`HSPFragment` represents a single, contiguous match between the query and hit sequences. You could consider it the core of the object model and search result, since it is the presence of these fragments that determines whether your search has results or not.

In most cases, you don’t have to deal with `HSPFragment` objects directly since not that many sequence search tools fragment their HSPs. When you do have to deal with them, what you should remember is that `HSPFragment` objects were written to be as compact as possible. In most cases, they only contain attributes directly related to sequences: strands, reading frames, molecule types, coordinates, the sequences themselves, and their IDs and descriptions.

These attributes are readily shown when you invoke `print` on an `HSPFragment`. Here’s an example, taken from our BLAST search:

In [44]:
from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
blast_frag = blast_qresult[0][0][0] # first hit, first hsp, first fragment
print(blast_frag)

      Query: gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cr...
        Hit: gi|1219041180|ref|XM_021875076.1| PREDICTED: Chenopodium quinoa ...
Query range: [58:678] (1)
  Hit range: [277:901] (1)
  Fragments: 1 (624 columns)
     Query - ACAGAAAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGG~~~ATGTA
             || ||||||||| |||| | |||| ||  |||| |||| | |||| ||| | |||| ||~~~|| ||
       Hit - ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGAGCAACTAGCCGCGG~~~ATTTA


At this level, the BLAT fragment looks quite similar to the BLAST fragment, save for the query and hit sequences which are not present:

In [45]:
blat_qresult = SearchIO.read("my_blat.psl", "blat-psl")
blat_frag = blat_qresult[0][0][0] # first hit, first hsp, first fragment
print(blat_frag)

      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54204480:54204541] (1)
  Fragments: 1 (? columns)


In [46]:
blast_frag.query_start # query start coordinate

58

In [47]:
blast_frag.hit_strand # hit sequence strand

1

In [48]:
blast_frag.hit # hit sequence, as a SeqRecord object

SeqRecord(seq=Seq('ACCGAAAATGGGCAGAGGAGTGAATTATATGGCAATGACACCTGAGCAACTAGC...TTA'), id='gi|1219041180|ref|XM_021875076.1|', name='aligned hit sequence', description='PREDICTED: Chenopodium quinoa cold-regulated 413 plasma membrane protein 2-like (LOC110697660), mRNA', dbxrefs=[])

In all cases, these attributes are accessible using our favorite dot notation, as the examples above show.

## 8.2 A note about standards and conventions
One of the goals of `Bio.SearchIO` is to create a common, easy to use interface to deal with various search output files. This means creating standards that extend beyond the object model you just saw.

There are three implicit standards that you can expect when working with `Bio.SearchIO`:

* The first one pertains to sequence coordinates. In `Bio.SearchIO`, all sequence coordinates follow Python’s coordinate style: zero-based and half open. For example, if in a BLAST XML output file the start and end coordinates of an HSP are 10 and 28, they would become 9 and 28 in `Bio.SearchIO`. The start coordinate becomes 9 because Python indices start from zero, while the end coordinate remains 28 as Python slices omit the last item in an interval.
* The second is on sequence coordinate orders. In `Bio.SearchIO`, start coordinates are always less than or equal to end coordinates. This isn’t always the case with all sequence search tools, as some of them have larger start coordinates when the sequence strand is minus.
* The last one is on strand and reading frame values. For strands, there are only four valid choices: `1` (plus strand), `-1` (minus strand), `0` (protein sequences), and `None` (no strand). For reading frames, the valid choices are integers from `-3` to `3` and `None`.
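
The first two conventions can be sketched as a tiny conversion function. `to_python_range` is a hypothetical helper written for this illustration (it is not a `Bio.SearchIO` function); it assumes the input coordinates are one-based and inclusive, as in the BLAST XML example above, and may arrive with start greater than end for minus-strand matches.

```python
def to_python_range(start, end):
    """Convert 1-based inclusive coordinates to a zero-based, half-open pair."""
    low, high = sorted((start, end))  # enforce start <= end
    return low - 1, high


plus = to_python_range(10, 28)    # plus-strand style input
minus = to_python_range(28, 10)   # minus-strand style input, start > end

print(plus)
print(minus)

# The half-open span equals the number of residues covered (10 through 28)
print(plus[1] - plus[0])
```

A convenient side effect of the half-open style is that the span is simply end minus start, with no off-by-one correction.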

## 8.3 Reading search output files
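There are two functions you can use for reading search output files into `QueryResult` objects: `read` and `parse`. `Bio.SearchIO.read` is used for reading search output files with only one query and returns a `QueryResult` object. It also accepts format-specific keyword arguments, such as `comments=True` for commented BLAST tabular files.

* To download the test files: [`tab_2226_tblastn_*.txt`](https://github.com/biopython/biopython/tree/master/Tests/Blast)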

In [49]:
from Bio import SearchIO
qresult = SearchIO.read("tab_2226_tblastn_003.txt", "blast-tab")
qresult

QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

In [50]:
qresult2 = SearchIO.read("tab_2226_tblastn_007.txt", "blast-tab", comments=True)
qresult2

QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

As for `Bio.SearchIO.parse`, it is used for reading search output files with any number of queries. The function returns a generator object that yields a `QueryResult` object in each iteration. Like `Bio.SearchIO.read`, it also accepts format-specific keyword arguments:

In [51]:
from Bio import SearchIO
qresults = SearchIO.parse("tab_2226_tblastn_001.txt", "blast-tab")
for qresult in qresults:
    print(qresult.id)

gi|16080617|ref|NP_391444.1|
gi|11464971:4-101


In [52]:
qresults2 = SearchIO.parse("tab_2226_tblastn_005.txt", "blast-tab", comments=True)
for qresult in qresults2:
    print(qresult.id)

random_s00
gi|16080617|ref|NP_391444.1|
gi|11464971:4-101
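
The difference between the two functions comes down to a simple contract: `parse` yields results one at a time, while `read` expects exactly one. The sketch below expresses that contract on an ordinary iterable; `read_single` is a hypothetical helper written for this illustration and the error messages are made up, so treat it as a picture of the idea rather than the library's code.

```python
def read_single(results):
    """Return the only item of an iterable, or raise if there are zero or several."""
    iterator = iter(results)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("no query results found") from None
    try:
        next(iterator)
    except StopIteration:
        return first
    raise ValueError("more than one query result found")


# Query IDs from the examples above, standing in for parsed results
one_query = ["gi|16080617|ref|NP_391444.1|"]
two_queries = ["gi|16080617|ref|NP_391444.1|", "gi|11464971:4-101"]

print(read_single(one_query))

try:
    read_single(two_queries)
except ValueError as err:
    print("error:", err)
```

If you are unsure how many queries a file holds, iterating with `parse` is the safer choice, since it works for any number of queries.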


## 8.4 Dealing with large search output files with indexing
You can use `index` with just the filename and format name:

In [53]:
from Bio import SearchIO
idx = SearchIO.index("tab_2226_tblastn_001.txt", "blast-tab")
sorted(idx.keys())

['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|']

In [54]:
idx["gi|16080617|ref|NP_391444.1|"]

QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

In [55]:
idx.close()

Or also with the format-specific keyword argument:

In [56]:
idx = SearchIO.index("tab_2226_tblastn_005.txt", "blast-tab", comments=True)
sorted(idx.keys())

['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|', 'random_s00']

In [57]:
idx["gi|16080617|ref|NP_391444.1|"]

QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

In [58]:
idx.close()

Or with the `key_function` argument, as in `Bio.SeqIO`:

In [59]:
key_function = lambda query_id: query_id.upper() # converts the keys to uppercase
idx = SearchIO.index("tab_2226_tblastn_001.txt", "blast-tab", key_function=key_function)
sorted(idx.keys())

['GI|11464971:4-101', 'GI|16080617|REF|NP_391444.1|']

In [60]:
idx["GI|16080617|REF|NP_391444.1|"]

QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

In [61]:
idx.close()
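
The general idea behind indexing a large file is to remember where each record starts instead of loading every record into memory. The sketch below shows that idea on a made-up, in-memory tabular file; `build_offset_index`, the file contents, and the query names are all hypothetical, and this is not a description of how `Bio.SearchIO.index` is implemented internally.

```python
import io


def build_offset_index(handle, key_function=None):
    """Map each query ID (first tab-separated column) to its first byte offset."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#") or not line.strip():
            continue  # skip comment and blank lines
        query_id = line.split(b"\t", 1)[0].decode()
        key = key_function(query_id) if key_function else query_id
        index.setdefault(key, offset)  # keep only the first occurrence
    return index


# Hypothetical miniature tabular file, held in memory
data = io.BytesIO(
    b"query_a\thit_1\t98.5\n"
    b"query_a\thit_2\t91.0\n"
    b"query_b\thit_3\t88.2\n"
)

index = build_offset_index(data, key_function=str.upper)
print(sorted(index))

# Jump straight to the first line of one query without rereading the file
data.seek(index["QUERY_B"])
first_line = data.readline().decode().rstrip()
print(first_line)
```

The `key_function` hook plays the same role as in the example above: it changes the keys you look records up by, not the records themselves.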

## 8.5 Writing and converting search output files
It is occasionally useful to be able to manipulate search results from an output file and write them to a new file. `Bio.SearchIO` provides a `write` function for this. It takes an iterable of `QueryResult` objects, the output file name, and the output format name, and returns a four-item tuple, which denotes the number of `QueryResult`, `Hit`, `HSP`, and `HSPFragment` objects that were written.
* To download the test file: [`mirna.xml`](https://github.com/biopython/biopython/blob/master/Tests/Blast/mirna.xml)

In [62]:
from Bio import SearchIO
qresults = SearchIO.parse("mirna.xml", "blast-xml") # read XML file
SearchIO.write(qresults, "results.tab", "blast-tab") # write to tabular file

(3, 239, 277, 277)

`Bio.SearchIO` also provides a `convert` function, which is simply a shortcut for `Bio.SearchIO.parse` and `Bio.SearchIO.write`. Using the `convert` function, our example above would be:

In [63]:
from Bio import SearchIO
SearchIO.convert("mirna.xml", "blast-xml", "results.tab", "blast-tab")

(3, 239, 277, 277)
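
The four counts in the returned tuple follow the nesting of the object model. As a rough illustration, the sketch below tallies the same four levels over nested Python lists; `count_objects` is a hypothetical helper and the nested lists are made-up stand-ins for query results, hits, HSPs, and fragments.

```python
def count_objects(qresults):
    """Count query results, hits, HSPs, and fragments in a nested structure."""
    n_qresult = n_hit = n_hsp = n_frag = 0
    for qresult in qresults:
        n_qresult += 1
        for hit in qresult:
            n_hit += 1
            for hsp in hit:
                n_hsp += 1
                n_frag += len(hsp)  # each HSP holds one or more fragments
    return n_qresult, n_hit, n_hsp, n_frag


# Toy data: the first query has one hit with two HSPs (1 and 2 fragments),
# the second query has no hits at all
toy_qresults = [
    [[["frag"], ["frag", "frag"]]],
    [],
]

counts = count_objects(toy_qresults)
print(counts)
```

In the `mirna.xml` result above, the HSP and fragment counts are equal (277 each), which fits the earlier observation that each BLAST HSP holds a single fragment.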