# Biopython 3: SearchIO and Entrez

In this workshop, we will take a quick look at SearchIO, Biopython's tool for unifying results from various sequence search tools, before exploring NCBI's Entrez system and its E-utilities API.

## SearchIO

Biopython provides the [SearchIO package](https://biopython.org/docs/latest/api/Bio.SearchIO.html) for dealing with outputs from various sequence searching utilities, allowing them to be compared directly. Similar to SeqIO and AlignIO, it has capabilities for parsing, reading, writing, and indexing different search results, as well as converting between file types. 

### Reading and Parsing

Parsing and reading work the same way as in SeqIO and AlignIO: `parse` returns an iterator over all the queries in a file, while `read` returns a single result and raises an error unless the file contains exactly one query. Note that each item is a query (with all of its hits), not an individual sequence.

Let's start with an example of reading an XML result file from a BLAST search.

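```
# Read a BLAST XML file that contains a single query
from Bio import SearchIO

# SearchIO accepts either a file name or an open file handle
handle = "searchio-data/blast.xml"

my_result = SearchIO.read(handle=handle, format="blast-xml")
print(dir(my_result))  # attributes and methods of the QueryResult

# len() of a QueryResult is its number of hits
print("Search {} has {} hits".format(my_result.id, len(my_result)))

print(dir(my_result.hsps[0]))  # attributes and methods of a single HSP

# Print the hit ID and E-value of every HSP in the result
for hsp in my_result.hsps:
    print(hsp.hit_id, hsp.evalue)
```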

['_NON_STICKY_ATTRS', '_QueryResult__alt_hit_ids', '_QueryResult__marker', '__annotations__', '__bool__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'blast_id', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort',
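The difference between `parse` and `read` is easiest to see on input with more than one query. The sketch below avoids needing a results file by parsing a few made-up rows of BLAST tabular output from memory. The data and the name `TAB_TEXT` are invented for illustration.

```python
from io import StringIO

from Bio import SearchIO

# Three made-up rows in the default 12-column BLAST tabular layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
TAB_TEXT = (
    "query_1\thit_A\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360.0\n"
    "query_1\thit_B\t75.00\t120\t30\t0\t5\t124\t1\t120\t2e-03\t35.5\n"
    "query_2\thit_A\t88.00\t50\t6\t0\t1\t50\t300\t349\t1e-30\t120.2\n"
)

# parse() yields one QueryResult per query in the input
summary = [
    (qresult.id, len(qresult))
    for qresult in SearchIO.parse(StringIO(TAB_TEXT), "blast-tab")
]
print(summary)

# read() expects exactly one query, so it fails on this two-query input
try:
    SearchIO.read(StringIO(TAB_TEXT), "blast-tab")
    read_failed = False
except ValueError as error:
    read_failed = True
    print("read() raised:", error)
```

Use `parse` whenever a file may hold several queries, and reserve `read` for files you know contain exactly one.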

### Writing and Format Conversion

You can also write search results to a file in several formats using the `Bio.SearchIO.write()` function. The example below parses a file in the blast-xml format and writes its contents to a new file in the blast-tab tabular format.

```
from Bio import SearchIO
qresults = SearchIO.parse(handle = 'sample.xml', format = 'blast-xml')
SearchIO.write(qresults, handle = 'sample.tab', format = 'blast-tab')
```
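Parsing and writing as separate steps is useful when the results should be changed in between. As a minimal sketch, the example below filters HSPs by E-value before writing. The input rows, the cutoff, and the names `TAB_TEXT` and `E_VALUE_CUTOFF` are assumptions made for illustration, and the output goes to an in-memory handle rather than a file.

```python
from io import StringIO

from Bio import SearchIO

# Made-up rows in the default 12-column BLAST tabular layout
TAB_TEXT = (
    "query_1\thit_A\t98.50\t200\t3\t0\t1\t200\t11\t210\t1e-100\t360.0\n"
    "query_1\thit_B\t75.00\t120\t30\t0\t5\t124\t1\t120\t2e-03\t35.5\n"
    "query_2\thit_A\t88.00\t50\t6\t0\t1\t50\t300\t349\t1e-30\t120.2\n"
)

E_VALUE_CUTOFF = 1e-5  # arbitrary threshold chosen for this example

# Keep only HSPs at or below the cutoff; hits left with no HSPs are dropped
filtered = [
    qresult.hsp_filter(lambda hsp: hsp.evalue <= E_VALUE_CUTOFF)
    for qresult in SearchIO.parse(StringIO(TAB_TEXT), "blast-tab")
]

# write() accepts a file name or a handle; here the output stays in memory
output = StringIO()
counts = SearchIO.write(filtered, output, "blast-tab")
print(counts)
print(output.getvalue())
```

The value returned by `write()` holds the number of QueryResult, Hit, HSP, and HSPFragment objects written, which gives a quick check that the filter removed what you expected.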

You can also use the `convert()` function to directly convert between file formats without reading/parsing and then writing. The supported formats are below, although not all of the pairs will work for conversion (see [documentation](https://biopython.org/docs/latest/api/Bio.SearchIO.html#Bio.SearchIO.convert) for details).

'blast-tab', 'blast-xml', 'blat-psl', 'hmmer3-tab', 'hmmscan3-domtab', 'hmmsearch3-domtab', 'phmmer3-domtab'

The function will return a tuple of four values: the number of QueryResult, Hit, HSP, and HSPFragment objects it writes to the output file.

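```
from Bio import SearchIO

in_file = "searchio-data/blast.xml"
out_file = "searchio-data/blast_tab.tab"

in_format = "blast-xml"
out_format = "blast-tab"

# convert() returns the counts of QueryResult, Hit, HSP, and HSPFragment objects written
counts = SearchIO.convert(in_file=in_file, in_format=in_format, out_file=out_file, out_format=out_format)
print(counts)
```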

## Entrez

The Entrez system is a collection of [NCBI databases](https://www.ncbi.nlm.nih.gov/guide/all/) ranging from literature to sequences, along with a text search tool for exploring them. You can [search online](https://www.ncbi.nlm.nih.gov/search/) in a browser, or programmatically through NCBI's E-utilities API.

### Entrez Rules and Etiquette

### UIDs

Every entry in an NCBI database has a UID (unique identifier). The type of UID depends on the database: for example, PubMed uses PMIDs, while protein records use GI numbers.

| Entrez Database    | UID common name | E-utility Database Name |
|--------------------|-----------------|-------------------------|
| BioProject         | BioProject ID  | bioproject              |
| BioSample          | BioSample ID   | biosample               |
| Books              | Book ID        | books                   |
| Conserved Domains  | PSSM-ID        | cdd                     |
| dbGaP              | dbGaP ID       | gap                     |
| dbVar              | dbVar ID       | dbvar                   |
| Gene               | Gene ID        | gene                    |
| Genome             | Genome ID      | genome                  |
| GEO Datasets       | GDS ID         | gds                     |
| GEO Profiles       | GEO ID         | geoprofiles             |
| HomoloGene         | HomoloGene ID  | homologene              |
| MeSH               | MeSH ID        | mesh                    |
| NCBI C++ Toolkit   | Toolkit ID     | toolkit                 |
| NLM Catalog        | NLM Catalog ID | nlmcatalog              |
| Nucleotide         | GI number      | nuccore                 |
| PopSet             | PopSet ID      | popset                  |
| Probe              | Probe ID       | probe                   |
| Protein            | GI number      | protein                 |
| Protein Clusters   | Protein Cluster ID | proteinclusters      |
| PubChem BioAssay   | AID            | pcassay                 |
| PubChem Compound   | CID            | pccompound              |
| PubChem Substance  | SID            | pcsubstance             |
| PubMed             | PMID           | pubmed                  |
| PubMed Central     | PMCID          | pmc                     |
| SNP                | rs number      | snp                     |
| SRA                | SRA ID         | sra                     |
| Structure          | MMDB-ID        | structure               |
| Taxonomy           | TaxID          | taxonomy                |

**Accession.Version vs GI Number**

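NCBI assigns sequence records two parallel identifiers: the GI number and the Accession.Version. The Accession.Version combines a stable accession with a version number that is incremented whenever the sequence changes. For a full explanation, see NCBI's page on [sequence identifiers](https://www.ncbi.nlm.nih.gov/genbank/sequenceids/).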
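As a small illustration of the Accession.Version format, the sketch below splits an identifier into its accession and version parts. The helper name `split_accession_version` is hypothetical, the identifier is used only to show the format, and the parsing is deliberately simple (it only looks for a trailing `.N`).

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no numeric suffix.
    """
    accession, dot, version = identifier.rpartition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


print(split_accession_version("NM_000546.6"))  # -> ('NM_000546', 6)
print(split_accession_version("NM_000546"))    # -> ('NM_000546', None)
```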

### E-Utilities on the Unix Command Line

While Biopython gives us nice tools for using the E-utilities within Python code, they are also available as [command line tools](https://www.ncbi.nlm.nih.gov/books/NBK179288/).

### Core Concepts of the E-utilities

**What are the E-utilities?**

There are nine E-utilities, each of which performs a different task on the NCBI databases (descriptions from [A General Introduction to the E-utilities](https://www.ncbi.nlm.nih.gov/books/NBK25497/)).

1. EInfo (database statistics) eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi

    Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.

2. ESearch (text searches) eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

    Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.

3. EPost (UID uploads) eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi

    Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.

4. ESummary (document summary downloads) eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi

    Responds to a list of UIDs from a given database with the corresponding document summaries.

5. EFetch (data record downloads) eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

    Responds to a list of UIDs in a given database with the corresponding data records in a specified format.

6. ELink (Entrez links) eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi

    Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database; checks for the existence of a specified link from a list of one or more UIDs; creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.

7. EGQuery (global query) eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi

    Responds to a text query with the number of records matching the query in each Entrez database.

8. ESpell (spelling suggestions) eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi

    Retrieves spelling suggestions for a text query in a given database.

9. ECitMatch (batch citation searching in PubMed) eutils.ncbi.nlm.nih.gov/entrez/eutils/ecitmatch.cgi

    Retrieves PubMed IDs (PMIDs) corresponding to a set of input citation strings.
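Each E-utility in the list above is reached through a URL under a shared base address, with the request details passed as query parameters. The sketch below assembles such URLs without sending any request. The helper `build_eutils_url` is a hypothetical name; `db`, `term`, and `retmax` are ESearch parameters, and the search term is only an example.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the full URL for an E-utility call (no request is sent)."""
    # ECitMatch uses a .cgi endpoint; the others in the list above use .fcgi
    suffix = ".cgi" if utility == "ecitmatch" else ".fcgi"
    return BASE_URL + utility + suffix + "?" + urlencode(params)


print(build_eutils_url("einfo", db="pubmed"))
print(build_eutils_url("esearch", db="pubmed", term="biopython", retmax=5))
```

Biopython's `Bio.Entrez` module builds and sends requests like these for you, so in practice you rarely need to write the URLs by hand.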



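```
from Bio import Entrez

# Tell NCBI who is making the request; replace this with your own email address
Entrez.email = "cwarner@rockefeller.edu"

# Calling EInfo with no arguments lists the available Entrez databases
handle = Entrez.einfo()
record = Entrez.read(handle)
handle.close()
print(record)
```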



{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}
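The same pattern of calling a function, then reading the returned handle, extends to the other E-utilities. As a rough sketch, ESearch can be chained into EFetch: search a database for UIDs, then download the matching records. The function name `search_and_fetch_fasta`, its default arguments, and the query in the usage comment are assumptions for illustration. The calls need network access, so the block only defines the function and does not run it.

```python
from Bio import Entrez


def search_and_fetch_fasta(term, email, db="nuccore", max_records=3):
    """Search `db` for `term` and return the matching records as FASTA text."""
    Entrez.email = email  # contact address sent along with each request

    # ESearch: text query -> list of matching UIDs
    handle = Entrez.esearch(db=db, term=term, retmax=max_records)
    search_result = Entrez.read(handle)
    handle.close()

    uids = search_result["IdList"]
    if not uids:
        return ""

    # EFetch: UIDs -> data records in the requested format
    handle = Entrez.efetch(db=db, id=",".join(uids), rettype="fasta", retmode="text")
    fasta_text = handle.read()
    handle.close()
    return fasta_text


# Example call (hypothetical query; requires an internet connection):
# print(search_and_fetch_fasta("Opuntia[orgn] AND rpl16", "your.name@example.org"))
```

Keeping `max_records` small while experimenting limits the load on NCBI's servers and keeps the responses easy to inspect.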


## Useful Links and Documentation

* [SearchIO](https://biopython.org/docs/latest/api/Bio.SearchIO.html)
* [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/)

## Credits and Inspiration

* [Biopython Tutorial and Cookbook](https://biopython.org/DIST/docs/tutorial/Tutorial.html)
