# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 5: BioPython - Entrez E-utilities

1. NCBI - Entrez Databases
2. E-Utils
    - `Entrez.esearch()`
    - `Entrez.esummary()`
    - `Entrez.efetch()`
    - `Entrez.epost()`
    - `Entrez.einfo()`

#### Requirements

- Python 2.7 or 3.x
- `Bio` (BioPython) module (`conda install biopython`)
- Miscellaneous Files
    - `./images/ncbi_ids.jpg`

In [1]:
from __future__ import print_function, division

## NCBI - Entrez Databases

- Global Query Cross‐Database Search System
    - Allows metasearch across NCBI's health science databases
    - The National Center for Biotechnology Information (NCBI) has maintained GenBank since 1992
    - [http://www.ncbi.nlm.nih.gov/gquery/](http://www.ncbi.nlm.nih.gov/gquery/)
- E-utilities
    - Supported by NCBI to provide a stable interface to the Entrez query and database system
    - Queries are submitted via web URLs, and results are returned as XML by default (other formats, such as plain text, can be requested)
    - The `Entrez` module from BioPython provides a programming interface to E-utils
        - Make no more than 3 queries per second (enforced by BioPython)
        - Queries should be accompanied by your email address
        - For large/regular queries consider downloading and accessing a local copy of the database
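
To make the "queries are web URLs" point concrete, here is a minimal Python 3 sketch that only builds the URL for an ESearch request, without sending it. The helper name `build_eutils_url` and the placeholder email address are assumptions made for illustration; in practice the `Entrez` module assembles the request for you.

```python
from urllib.parse import urlencode

# Base address of the E-utilities web service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the URL for an E-utility call such as 'esearch' or 'efetch'.

    Hypothetical helper for illustration only; it sends nothing.
    """
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(params)


url = build_eutils_url(
    "esearch",
    db="nuccore",
    term="sonic",
    retmax=20,
    email="your.name@example.org",  # placeholder address
)
print(url)
```

Pasting a URL like this into a browser returns the raw XML that `Entrez.read()` would otherwise parse.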

## E-Utils

[http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)

NCBI limits the frequency of requests sent to its servers (3 per second without an API key, 10 per second with one):<br />
[https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/](https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/)
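
BioPython already spaces out `Entrez` calls, so the following is only a sketch of how such a limit could be enforced in code that contacts NCBI some other way. The `Throttle` class is a hypothetical helper, not part of BioPython, and the loop makes no network requests.

```python
import time


class Throttle:
    """Space out calls so that at most `per_second` happen each second."""

    def __init__(self, per_second=3):
        self.min_interval = 1.0 / per_second
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(per_second=3)
start = time.monotonic()
for i in range(4):
    throttle.wait()  # a request would be sent here
elapsed = time.monotonic() - start
print("4 calls took %.2f s" % elapsed)
```

The first call goes through immediately and each later call waits for the remainder of a third of a second, so four calls take roughly one second.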

### `Entrez.esearch()`

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch](http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch)

The `Entrez.esearch()` function searches a specific NCBI database for entries that match a search term. It returns a handle to XML results that include a list of unique identifiers (UIDs). The type of UID depends on the database searched. By default, only the first 20 UIDs are returned (use the `retmax` parameter to change this).

<img src="./images/ncbi_ids.jpg" align="left"/>

In [2]:
from Bio import Entrez

In [3]:
help(Entrez.esearch)

Help on function esearch in module Bio.Entrez:

esearch(db, term, **keywds)
    Run an Entrez search and return a handle to the results.
    
    ESearch searches and retrieves primary IDs (for use in EFetch, ELink
    and ESummary) and term translations, and optionally retains results
    for future use in the user's environment.
    
    See the online documentation for an explanation of the parameters:
    http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
    
    Return a handle to the results which are always in XML format.
    
    Raises an IOError exception if there's a network error.
    
    Short example:
    
    >>> from Bio import Entrez
    >>> Entrez.email = "Your.Name.Here@example.org"
    >>> handle = Entrez.esearch(db="nucleotide", retmax=10, term="opuntia[ORGN] accD", idtype="acc")
    >>> record = Entrez.read(handle)
    >>> handle.close()
    >>> int(record["Count"]) >= 2
    True
    >>> "EF590893.1" in record["IdList"]
    True
    >>> "EF590892.1" in

In [4]:
## Provide your email address
email = "mooneymi@ohsu.edu"
Entrez.email = email

## Submit a query
handle = Entrez.esearch(db="nuccore", term="sonic")

## Entrez.read() parses XML results
## A dictionary is returned
record = Entrez.read(handle)
record.keys()

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])
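
To show roughly what `Entrez.read()` is parsing, here is a sketch that reads a hand-written, abbreviated ESearch response with the standard library. The XML string is an assumption about the general shape of the response (several elements are omitted); its values are copied from this notebook's outputs, and the name `parsed` is chosen so that it does not overwrite `record`.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample of an ESearch response. The values are
# copied from the notebook output; only three of the 20 IDs are kept.
sample_xml = """<eSearchResult>
  <Count>5839</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>2296767475</Id>
    <Id>2296663653</Id>
    <Id>2296125215</Id>
  </IdList>
</eSearchResult>"""

root = ET.fromstring(sample_xml)
parsed = {
    "Count": root.findtext("Count"),
    "RetMax": root.findtext("RetMax"),
    "RetStart": root.findtext("RetStart"),
    "IdList": [node.text for node in root.findall("IdList/Id")],
}
print(parsed["Count"], parsed["IdList"])
```

As with the parsed BioPython record, the counts come back as strings, so convert them with `int()` before doing arithmetic.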

In [5]:
ids = record["IdList"]
ids

['2296767475', '2296663653', '2296125215', '2295684180', '2295677668', '1447199263', '1063759739', '1036551422', '1036551081', '922304385', '922304383', '262205520', '171543834', '118130385', '59858556', '1077156777', '922304372', '651914274', '167736374', '167736372']

In [6]:
record["Count"]

'5839'

In [7]:
record['RetMax']

'20'

In [8]:
handle.close()
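
The search matched 5839 records but returned only 20 UIDs. One way to collect more is to request the results page by page with the `retstart` and `retmax` parameters. The sketch below only computes the page offsets; `page_offsets` is a hypothetical helper and the page size of 500 is an arbitrary choice.

```python
def page_offsets(count, retmax):
    """Yield the retstart value for each page needed to cover `count` hits."""
    for retstart in range(0, count, retmax):
        yield retstart


total = int("5839")  # record["Count"] is a string, so convert it first
offsets = list(page_offsets(total, retmax=500))
print(len(offsets), offsets[:3], offsets[-1])
```

Each offset would then be passed to a separate call such as `Entrez.esearch(db="nuccore", term="sonic", retstart=offset, retmax=500)`, and the `IdList` values from each page concatenated.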

### `Entrez.esummary()`

The `Entrez.esummary()` function provides a document summary for a specified UID. The provided summary is useful for initial filtering of the UID list returned by `Entrez.esearch()`.

#### UIDs Matter!

When searching multiple databases, make sure to use the appropriate UID for the given database. 

For example, <b>Gene ID != GI number</b> (although both are integers).

In [9]:
handle = Entrez.esummary(db="nuccore", id=ids[0])
summary = Entrez.read(handle)
summary

[{'Item': [], 'Id': '2296767475', 'Caption': 'XM_050606295', 'Title': 'PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA', 'Extra': 'gi|2296767475|ref|XM_050606295.1|[2296767475]', 'Gi': IntegerElement(2296767475, attributes={}), 'CreateDate': '2022/09/09', 'UpdateDate': '2022/09/09', 'Flags': IntegerElement(512, attributes={}), 'TaxId': IntegerElement(1086592, attributes={}), 'Length': IntegerElement(3804, attributes={}), 'Status': 'live', 'ReplacedBy': '', 'Comment': '  ', 'AccessionVersion': 'XM_050606295.1'}]

In [10]:
for k,v in summary[0].items():
    print(k+":", v)

Item: []
Id: 2296767475
Caption: XM_050606295
Title: PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA
Extra: gi|2296767475|ref|XM_050606295.1|[2296767475]
Gi: IntegerElement(2296767475, attributes={})
CreateDate: 2022/09/09
UpdateDate: 2022/09/09
Flags: IntegerElement(512, attributes={})
TaxId: IntegerElement(1086592, attributes={})
Length: IntegerElement(3804, attributes={})
Status: live
ReplacedBy: 
Comment:   
AccessionVersion: XM_050606295.1


In [11]:
handle.close()
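
As a sketch of the "initial filtering" idea, the function below keeps only the IDs whose summaries pass a few simple checks. `filter_summaries` is a hypothetical helper, and the sample data is a single summary reduced to four fields from the output above, with plain `int` and `str` values standing in for BioPython's parsed element types.

```python
# One summary, reduced to a few fields and copied from the output above.
summaries = [
    {
        "Id": "2296767475",
        "Title": "PREDICTED: Cataglyphis hispanica sonic hedgehog protein "
                 "(LOC126857136), mRNA",
        "Length": 3804,
        "Status": "live",
    },
]


def filter_summaries(summaries, min_length=0, title_contains=""):
    """Return the IDs of live records that pass the length and title filters."""
    keep = []
    for s in summaries:
        if s["Status"] != "live":
            continue  # skip withdrawn or replaced records
        if int(s["Length"]) < min_length:
            continue  # sequence too short
        if title_contains.lower() not in s["Title"].lower():
            continue  # title does not mention the text of interest
        keep.append(s["Id"])
    return keep


print(filter_summaries(summaries, min_length=1000, title_contains="hedgehog"))
print(filter_summaries(summaries, min_length=5000))
```

The first call keeps the record and the second drops it, because 3804 is below the 5000-base threshold. Only the surviving IDs would then be passed on to `Entrez.efetch()`.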

### `Entrez.efetch()`

The `Entrez.efetch()` function retrieves entire records in a specified format. In addition to the `db` and `id` parameters, you can specify the retrieval type (`rettype`) and retrieval mode (`retmode`).

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_)

In [12]:
## Use Entrez.efetch() to get a fasta record
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_record = handle.read()
print(fasta_record)

>XM_050606295.1 PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA
TAGGATAATCTTAAAAGAAGAAGTGCGAGAGGAGAAAAGAGAACGATGTACTTCGATAAAGAGAGAAGAA
CGAAGTAACAAGGTGAGACGGTATATGACGAAGAAGAAGTGCAATCCGATCGTGCTGATAGTGGGGCAGA
CGAGAAACATCGAAAGAAGAAAAATACAAGAGACGGACAAGAACGAAGAAATGGTTCGCGAGAAAGCGGA
GAGACAGAAGGATCGAGAAGACGAGTAGGGGAGAGAACGAGAGACCGATCGAAGGAGAAAGCTAGAAGAT
CATTCTCCGAGTCTCTTTCTTCTTCTTCTCCCTTTCTCTCGCACTCCCAATCTCTTCTTTCTCTCTCTCC
CTCTCTCTTTCTCCTTATCCCTCCTGTCACGTGCACCTGTCTTCGTTTACCTGGTGAACGCCCCGCCCCA
GCCAATAGAGGTACGGCGGGCCTGGTGACGTCAGTCACGCGTACCAGGAGGCCCTAGGTAGAGGGGACGA
GATCATCCACCACCACCTTTGGAGGACACGAGAGCCCCACGGGGGCCAGCGGCTAACAGCAGCCAGCGGC
AGCGGTCTGGCCGGCCGTTTGCCACGGGGTGGGGGCAGGAAGAGGGGTACTCTGGACCTGGCCAGGTCGC
AGAAACAACGCAGGAGTAACCCACCAAAACCCATCCTGTATTTCCCACCACCGTGGCTCGTCCGGAAAAC
GTGCGCCAGCTTCCGGACAGCGGTTTCGTGCGAGTTCAGTCAAGTGAGTCGAGTCGCACATACATCCTCT
CGCGACGAAAACCAATCGCAGGGGGTTTTCGAGTGAGGAAAGACGGGAACGCGTTGGCACGTGCCGACGA
GGCAATTTAAAGGGCTCTCACGGGTGTCGAATCGTTCGACAGTGTCATACAATCG

In [13]:
handle.close()

In [14]:
## Use Entrez.efetch() to get the same fasta record as XML
handle = Entrez.efetch(db="nuccore", id=ids[0], rettype="fasta", retmode="xml")

## Use Entrez.read() to parse XML output
fasta_record = Entrez.read(handle)
fasta_record[0].keys()

dict_keys(['TSeq_seqtype', 'TSeq_accver', 'TSeq_taxid', 'TSeq_orgname', 'TSeq_defline', 'TSeq_length', 'TSeq_sequence'])

In [15]:
fasta_record[0]['TSeq_taxid']

'1086592'

In [16]:
fasta_record[0]['TSeq_sequence']

'TAGGATAATCTTAAAAGAAGAAGTGCGAGAGGAGAAAAGAGAACGATGTACTTCGATAAAGAGAGAAGAACGAAGTAACAAGGTGAGACGGTATATGACGAAGAAGAAGTGCAATCCGATCGTGCTGATAGTGGGGCAGACGAGAAACATCGAAAGAAGAAAAATACAAGAGACGGACAAGAACGAAGAAATGGTTCGCGAGAAAGCGGAGAGACAGAAGGATCGAGAAGACGAGTAGGGGAGAGAACGAGAGACCGATCGAAGGAGAAAGCTAGAAGATCATTCTCCGAGTCTCTTTCTTCTTCTTCTCCCTTTCTCTCGCACTCCCAATCTCTTCTTTCTCTCTCTCCCTCTCTCTTTCTCCTTATCCCTCCTGTCACGTGCACCTGTCTTCGTTTACCTGGTGAACGCCCCGCCCCAGCCAATAGAGGTACGGCGGGCCTGGTGACGTCAGTCACGCGTACCAGGAGGCCCTAGGTAGAGGGGACGAGATCATCCACCACCACCTTTGGAGGACACGAGAGCCCCACGGGGGCCAGCGGCTAACAGCAGCCAGCGGCAGCGGTCTGGCCGGCCGTTTGCCACGGGGTGGGGGCAGGAAGAGGGGTACTCTGGACCTGGCCAGGTCGCAGAAACAACGCAGGAGTAACCCACCAAAACCCATCCTGTATTTCCCACCACCGTGGCTCGTCCGGAAAACGTGCGCCAGCTTCCGGACAGCGGTTTCGTGCGAGTTCAGTCAAGTGAGTCGAGTCGCACATACATCCTCTCGCGACGAAAACCAATCGCAGGGGGTTTTCGAGTGAGGAAAGACGGGAACGCGTTGGCACGTGCCGACGAGGCAATTTAAAGGGCTCTCACGGGTGTCGAATCGTTCGACAGTGTCATACAATCGGGTGCGAATCCGACGTAAGATATAATTCTCGGGATGTAGAATGTAGATTGTAGAAAGATCCTCCTCGATTCGGTACCAACACCGATTCGATGACACCCGGTATA

In [17]:
fasta_record[0]['TSeq_accver']+' '+fasta_record[0]['TSeq_defline']

'XM_050606295.1 PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA'

In [18]:
handle.close()

#### Downloading Records in Bulk

Multiple IDs can be supplied to `Entrez.efetch()` as a single comma-separated string.

In [19]:
print(','.join(ids[0:3]))

2296767475,2296663653,2296125215


In [20]:
## Use Entrez.efetch() to get multiple fasta records in one request
handle = Entrez.efetch(db="nuccore", id=','.join(ids[0:3]), rettype="fasta", retmode="text")

## Here we use the handle's read() method, not Entrez.read(),
## since the retmode parameter is text, not XML
fasta_records = handle.read()
print(fasta_records)

>XM_050606295.1 PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA
TAGGATAATCTTAAAAGAAGAAGTGCGAGAGGAGAAAAGAGAACGATGTACTTCGATAAAGAGAGAAGAA
CGAAGTAACAAGGTGAGACGGTATATGACGAAGAAGAAGTGCAATCCGATCGTGCTGATAGTGGGGCAGA
CGAGAAACATCGAAAGAAGAAAAATACAAGAGACGGACAAGAACGAAGAAATGGTTCGCGAGAAAGCGGA
GAGACAGAAGGATCGAGAAGACGAGTAGGGGAGAGAACGAGAGACCGATCGAAGGAGAAAGCTAGAAGAT
CATTCTCCGAGTCTCTTTCTTCTTCTTCTCCCTTTCTCTCGCACTCCCAATCTCTTCTTTCTCTCTCTCC
CTCTCTCTTTCTCCTTATCCCTCCTGTCACGTGCACCTGTCTTCGTTTACCTGGTGAACGCCCCGCCCCA
GCCAATAGAGGTACGGCGGGCCTGGTGACGTCAGTCACGCGTACCAGGAGGCCCTAGGTAGAGGGGACGA
GATCATCCACCACCACCTTTGGAGGACACGAGAGCCCCACGGGGGCCAGCGGCTAACAGCAGCCAGCGGC
AGCGGTCTGGCCGGCCGTTTGCCACGGGGTGGGGGCAGGAAGAGGGGTACTCTGGACCTGGCCAGGTCGC
AGAAACAACGCAGGAGTAACCCACCAAAACCCATCCTGTATTTCCCACCACCGTGGCTCGTCCGGAAAAC
GTGCGCCAGCTTCCGGACAGCGGTTTCGTGCGAGTTCAGTCAAGTGAGTCGAGTCGCACATACATCCTCT
CGCGACGAAAACCAATCGCAGGGGGTTTTCGAGTGAGGAAAGACGGGAACGCGTTGGCACGTGCCGACGA
GGCAATTTAAAGGGCTCTCACGGGTGTCGAATCGTTCGACAGTGTCATACAATCG

In [21]:
handle.close()
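
Text returned for several IDs is one long string, so it usually needs to be split back into individual records. The sketch below does this with plain string handling. `split_fasta` is a hypothetical helper, and the sample text is a shortened stand-in: the first header comes from the output above, while the sequences are truncated and the second record is a made-up placeholder.

```python
def split_fasta(text):
    """Split FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened stand-in for the text returned by handle.read().
sample = """>XM_050606295.1 PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA
TAGGATAATC
TTAAAAGAAG

>placeholder_2 hypothetical second record
ACGTACGT
"""

for header, seq in split_fasta(sample):
    print(header.split()[0], len(seq))
```

BioPython's own parser, `Bio.SeqIO.parse(handle, "fasta")`, handles the same job and returns richer record objects; the manual version is shown only to make the file layout explicit.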

### `Entrez.epost()`

Alternatively, use the `Entrez.epost()` function to cache a large number of IDs (too many IDs can make the URL-based requests fail). This function uploads the ID list to the NCBI servers and returns a `WebEnv` value and a `QueryKey` value that can be supplied to `Entrez.efetch()` to retrieve the query results.

[http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_](http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EPost_)

In [22]:
## Use Entrez.epost() to cache multiple IDs
handle = Entrez.epost(db="nuccore", id=','.join(ids[0:3]))
epost_results = Entrez.read(handle)
web_env = epost_results['WebEnv']
query_key = epost_results['QueryKey']
handle.close()

## Use the WebEnv and QueryKey values to retrieve
## the query results with Entrez.efetch()
handle = Entrez.efetch(db="nuccore", rettype="fasta", retmode="text", webenv=web_env, query_key=query_key)
fasta_records = handle.read()
print(fasta_records[:500])

>XM_050606295.1 PREDICTED: Cataglyphis hispanica sonic hedgehog protein (LOC126857136), mRNA
TAGGATAATCTTAAAAGAAGAAGTGCGAGAGGAGAAAAGAGAACGATGTACTTCGATAAAGAGAGAAGAA
CGAAGTAACAAGGTGAGACGGTATATGACGAAGAAGAAGTGCAATCCGATCGTGCTGATAGTGGGGCAGA
CGAGAAACATCGAAAGAAGAAAAATACAAGAGACGGACAAGAACGAAGAAATGGTTCGCGAGAAAGCGGA
GAGACAGAAGGATCGAGAAGACGAGTAGGGGAGAGAACGAGAGACCGATCGAAGGAGAAAGCTAGAAGAT
CATTCTCCGAGTCTCTTTCTTCTTCTTCTCCCTTTCTCTCGCACTCCCAATCTCTTCTTTCTCTCTCTCC
CTCTCTCTTTCTCCTTATCCCTCCTGTCACGTGCACCTGTCTTCGTTTACCT
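
When an ID list is very long, it can also help to work through it in fixed-size batches. The sketch below shows only the batching step. `chunked` is a hypothetical helper, the batch size of 2 is chosen to keep the output short, and the IDs are the first five from the search output earlier in this notebook.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# First five IDs from the esearch output earlier in this notebook.
example_ids = ["2296767475", "2296663653", "2296125215",
               "2295684180", "2295677668"]

for batch in chunked(example_ids, size=2):
    id_string = ",".join(batch)  # the form expected by the `id` parameter
    print(id_string)
```

Each `id_string` could be passed to `Entrez.epost()` or `Entrez.efetch()` in turn, keeping every individual request small.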


### `Entrez.einfo()`

The `Entrez.einfo()` function can be used to retrieve information about the structure of Entrez databases.

In [23]:
## To list available databases
handle = Entrez.einfo()
result = handle.read()
print(result)

b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>homologene</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t<DbName>nlmcatalog</DbName>\n\t<DbName>omim</DbName>\n\t<DbName>orgtrack</DbName>\n\t<DbName>pmc</DbName>\n

In [24]:
handle.close()

In [25]:
## Or you can parse the XML
handle = Entrez.einfo()
result = Entrez.read(handle)
print(result.keys())
result['DbList']

dict_keys(['DbList'])


['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']

In [26]:
handle.close()



Passing a database name to `Entrez.einfo()` returns details about that database, including its searchable fields.

In [27]:
## To get info about a specific database
handle = Entrez.einfo(db="nuccore")
result = Entrez.read(handle)
for field in result['DbInfo']['FieldList']:
    print("%(Name)s: %(Description)s" % field)

ALL: All terms from all searchable fields
UID: Unique number assigned to each sequence
FILT: Limits the records
WORD: Free text associated with record
TITL: Words in definition line
KYWD: Nonstandardized terms provided by submitter
AUTH: Author(s) of publication
JOUR: Journal abbreviation of publication
VOL: Volume number of publication
ISS: Issue number of publication
PAGE: Page number(s) of publication
ORGN: Scientific and common names of organism, and all higher levels of taxonomy
ACCN: Accession number of sequence
PACC: Does not include retired secondary accessions
GENE: Name of gene associated with sequence
PROT: Name of protein associated with sequence
ECNO: EC number for enzyme or CAS registry number
PDAT: Date sequence added to GenBank
MDAT: Date of last update
SUBS: CAS chemical name or MEDLINE Substance Name
PROP: Classification by source qualifiers and molecule type
SQID: String identifier for sequence
GPRJ: BioProject
SLEN: Length of sequence
FKEY: Feature annotated on sequ

In [28]:
result.keys()

dict_keys(['DbInfo'])

In [29]:
result['DbInfo'].keys()

dict_keys(['DbName', 'MenuName', 'Description', 'DbBuild', 'Count', 'LastUpdate', 'FieldList', 'LinkList'])

In [30]:
handle.close()
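
The field names listed by `Entrez.einfo()` can be used as tags in a search term, in the same `value[TAG]` form as the `opuntia[ORGN]` example in the `esearch` help text. The sketch below assembles such a term from a dictionary. `build_term` is a hypothetical helper, and combining the parts with `AND` is an assumption about what the search should mean.

```python
def build_term(fields):
    """Join {field_tag: value} pairs into an Entrez query string with AND."""
    parts = []
    for tag, value in fields.items():
        # Quote multi-word values so they are searched as a phrase.
        if " " in value:
            value = '"%s"' % value
        parts.append("%s[%s]" % (value, tag))
    return " AND ".join(parts)


term = build_term({"ORGN": "Cataglyphis hispanica", "TITL": "hedgehog"})
print(term)
```

The resulting string can be passed as the `term` argument of `Entrez.esearch()`.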

## In-Class Exercises

In [None]:
## Exercise 1.
## Use the BioPython Entrez module to retrieve fasta records
## for 3 RefSeq mRNA sequences for the TP53 gene.
## Use the following search term: 
## "TP53[Gene] AND Homo sapiens[Organism] AND mRNA[Filter] AND Refseq[Filter]"
##
## Remember to provide your email address
##


In [None]:
## Exercise 2.
## Parse the 3 fasta records and save each sequence in
## a separate fasta file.
##
## description_line = fasta_record[0]['TSeq_accver']+' '+fasta_record[0]['TSeq_defline']


## References

- Python for Bioinformatics, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/Entrez](http://en.wikipedia.org/wiki/Entrez)
- [http://www.ncbi.nlm.nih.gov/books/NBK1058/](http://www.ncbi.nlm.nih.gov/books/NBK1058/)
- [http://www.ncbi.nlm.nih.gov/books/NBK25499/](http://www.ncbi.nlm.nih.gov/books/NBK25499/)
- [http://biopython.org/DIST/docs/tutorial/Tutorial.html](http://biopython.org/DIST/docs/tutorial/Tutorial.html)
- [http://biopython.org/DIST/docs/api/](http://biopython.org/DIST/docs/api/)

#### Last Updated: 15-Sep-2022