# Querying NCBI's databases

### Entrez Programming Utilities (EUtils)

- nine server-side programs
- a stable interface to NCBI's Entrez system
- see [EUtils](https://www.ncbi.nlm.nih.gov/books/NBK25501/) for more information

### Entrez
- query and database system
- several dozen NCBI databases, such as PubMed and GenBank (the exact number changes over time)
- Help and more information: [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.The_Entrez_Databases)

- Access:
    - Manual via web browser: [Entrez](https://www.ncbi.nlm.nih.gov/gquery/)
    - Programmatic via __Bio.Entrez__ module
    
    
#### Understanding EUtils
Entrez databases:
- EUtils accesses data already in the Entrez system
- Entrez identifies database records using unique identifiers (UIDs)
    - e.g., GI numbers for Nucleotide and Protein, PMIDs for PubMed
    - EUtils use UIDs for both data input and output
    
Utilities:
- __ESearch__: list of UIDs matching a query in one database (__EGQuery__: number of matches in each database)
- __ESummary__: summary record for each UID
- __EInfo__: database statistics
- __EPost__: UID uploads
- __EFetch__: data record downloads
- __ELink__: Entrez links
- __ESpell__: spelling suggestions
- __ECitMatch__: batch citation search in PubMed
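
Each utility is a server-side program reached through a URL under a common base address. As a minimal sketch, the snippet below builds an ESearch request URL by hand with the standard library; it performs no network request. The base URL and the `db`, `term` and `retmax` parameters follow the EUtils documentation linked above, while the helper name `build_eutils_url` is made up for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities, as given in the NCBI documentation.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the request URL for one utility (hypothetical helper)."""
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(params)}"


# Build (but do not send) an ESearch query for PubMed.
url = build_eutils_url("esearch", db="pubmed", term="biopython", retmax=5)
print(url)
```

Bio.Entrez assembles URLs of this kind internally, so in practice there is rarely a need to write them by hand.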


### Biopython's Entrez module (Bio.Entrez)
The Bio.Entrez API uses EUtils and provides:
- __Python functions__ for eight EUtils tools
- a __parser__ for the XML output of EUtils

It takes care that:
- the correct URL is used for the queries
- the __NCBI request limit__ is respected: no more than three requests per second without an API key

Attributes (required by Entrez):
- email (contact address of the user)
- tool (default is 'biopython')
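
To illustrate what respecting a request limit involves, here is a minimal sketch of a throttle that spaces calls at least a fixed interval apart. This is not Biopython's actual implementation; the class name `Throttle` and the interval of one third of a second are assumptions chosen for illustration.

```python
import time


class Throttle:
    """Block so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None  # time of the previous call, None before the first

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=1 / 3)  # assumed interval, for illustration only
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a request would be sent here
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} s")
```

The first call passes immediately; each later call sleeps only for the part of the interval that has not yet elapsed.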

## 1.) einfo

In [1]:
from Bio import Entrez
Entrez.email = "A.N.Other@example.com" # tell NCBI who you are
# use the einfo tool
handle = Entrez.einfo()

In [2]:
# read the information
result = handle.read()
# list of databases in XML format
print(result)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">
<eInfoResult>
<DbList>

	<DbName>pubmed</DbName>
	<DbName>protein</DbName>
	<DbName>nuccore</DbName>
	<DbName>ipg</DbName>
	<DbName>nucleotide</DbName>
	<DbName>nucgss</DbName>
	<DbName>nucest</DbName>
	<DbName>structure</DbName>
	<DbName>sparcle</DbName>
	<DbName>genome</DbName>
	<DbName>annotinfo</DbName>
	<DbName>assembly</DbName>
	<DbName>bioproject</DbName>
	<DbName>biosample</DbName>
	<DbName>blastdbinfo</DbName>
	<DbName>books</DbName>
	<DbName>cdd</DbName>
	<DbName>clinvar</DbName>
	<DbName>clone</DbName>
	<DbName>gap</DbName>
	<DbName>gapplus</DbName>
	<DbName>grasp</DbName>
	<DbName>dbvar</DbName>
	<DbName>gene</DbName>
	<DbName>gds</DbName>
	<DbName>geoprofiles</DbName>
	<DbName>homologene</DbName>
	<DbName>medgen</DbName>
	<DbName>mesh</DbName>
	<DbName>ncbisearch</DbName>
	<DbName>nlmcatalog</DbName>
	<DbName

In [3]:
handle = Entrez.einfo()

# or parse the data 
record = Entrez.read(handle)

# print the dictionary's keys
print(record.keys())

dict_keys(['DbList'])


In [4]:
# get the entries --> all databases available
print(record["DbList"])

['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']
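
Conceptually, `Entrez.read` turns the XML shown earlier into Python lists and dictionaries. The sketch below does the equivalent by hand for a shortened copy of the einfo output, using only the standard library. It is not how Bio.Entrez is implemented, and the sample string contains just three of the database names.

```python
import xml.etree.ElementTree as ET

# Shortened sample of the einfo XML output (three databases only).
sample_xml = """<eInfoResult>
<DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nuccore</DbName>
</DbList>
</eInfoResult>"""

root = ET.fromstring(sample_xml)
# Collect the text of every <DbName> element below <DbList>.
db_names = [element.text for element in root.findall("./DbList/DbName")]
parsed = {"DbList": db_names}
print(parsed)
```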


In [5]:
# get information about a specific database
record2 = Entrez.read(Entrez.einfo(db="pubmed"))
print(record2["DbInfo"]["Description"])

PubMed bibliographic record


In [6]:
record2["DbInfo"]["Count"]

'29528432'

## 2.) esearch / esummary
Query a specific Entrez database

In [7]:
# Search PubMed for biopython-related publications
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print("Number of found publications: ", record["Count"])
print(record["IdList"])

Number of found publications:  24
['30013827', '29641230', '28011774', '24929426', '24497503', '24267035', '24194598', '23842806', '23157543', '22909249', '22399473', '21666252', '21210977', '20015970', '19811691', '19773334', '19304878', '18606172', '21585724', '16403221']
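
The search reports 24 matches but lists only 20 UIDs, because ESearch returns at most `retmax` UIDs per request (20 by default). One way to collect all of them is to page through the results with `retstart`. The sketch below only computes the offsets and makes no request; the helper name `batch_offsets` is hypothetical.

```python
def batch_offsets(total_count, batch_size=20):
    """Return the retstart offsets needed to page through total_count hits."""
    return list(range(0, total_count, batch_size))


# 24 hits in batches of 20 need two requests, starting at hits 0 and 20.
offsets = batch_offsets(24, batch_size=20)
print(offsets)
```

Each offset would then be passed on as `Entrez.esearch(db="pubmed", term="biopython", retstart=offset, retmax=20)`, and the resulting `IdList` values concatenated. For a result set this small, a single call with a larger `retmax` is simpler.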


Retrieving summaries from UIDs

In [8]:
handle = Entrez.esummary(db="pubmed", id="12230038")
record = Entrez.read(handle)
record[0]["Id"]

'12230038'

In [9]:
print("Publication title: ", record[0]["Title"], " publication date: ", record[0]["PubDate"])

Publication title:  The Bio* toolkits--a brief overview.  publication date:  2002 Sep


## 3.) efetch
Request and download data records

In [10]:
from Bio import SeqIO

handle = Entrez.efetch(db="protein", id="349839", rettype="gb")
record = SeqIO.read(handle, "gb")
handle.close()
print(record)

ID: 1ATP_E
Name: 1ATP_E
Description: Chain E, 2.2 Angstrom Refined Crystal Structure Of The Catalytic Subunit Of Camp-Dependent Protein Kinase Complexed With Mnatp And A Peptide Inhibitor
Number of features: 26
/topology=linear
/data_file_division=ROD
/date=24-SEP-2008
/accessions=['1ATP_E']
/db_source=pdb: molecule 1ATP, chain 69, release Aug 27, 2007; deposition: Jan 8, 1993; class: Transferase(Phosphotransferase); source: Mol_id: 1; Organism_scientific: Mus Musculus; Mol_id: 2; Organism_scientific: Mus Musculus; Exp. method: X-Ray Diffraction.
/keywords=['']
/source=Mus musculus (house mouse)
/organism=Mus musculus
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Glires', 'Rodentia', 'Sciurognathi', 'Muroidea', 'Muridae', 'Murinae', 'Mus', 'Mus']
/references=[Reference(title='Expression of the catalytic subunit of cAMP-dependent protein kinase in Escherichia coli', ...), Reference(title='Crystal str



In [11]:
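# or query with multiple UIDs (comma-separated)
handle = Entrez.efetch(db="protein", id="349839,349840", rettype="fasta")
# SeqIO.parse returns a lazy iterator, so read the records before closing the handle
records = list(SeqIO.parse(handle, "fasta"))
handle.close()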

In [12]:
# save the record to a file before parsing so you don't have to download it each time
import os
file_name = "data/output/M27569.gbk"
if not os.path.isfile(file_name):
    # download only if the file is not already on disk
    net_handle = Entrez.efetch(db="nucleotide", id="M27569", rettype="gb")
    with open(file_name, "w") as out_handle:
        out_handle.write(net_handle.read())
    net_handle.close()
record = SeqIO.read(file_name, "genbank")
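
The same download-once pattern can be wrapped in a small helper. This is a sketch under assumptions: `fetch_cached` and its `fetch` argument are hypothetical names, and `fetch` stands for any zero-argument callable that returns the record text, for example a function wrapping `Entrez.efetch(...).read()`. The demonstration uses a stand-in instead of a real network call.

```python
import os
import tempfile


def fetch_cached(file_name, fetch):
    """Return the text stored in file_name, calling fetch() only if the file is missing."""
    if not os.path.isfile(file_name):
        directory = os.path.dirname(file_name)
        if directory:
            os.makedirs(directory, exist_ok=True)  # create the output folder if needed
        with open(file_name, "w") as out_handle:
            out_handle.write(fetch())
    with open(file_name) as in_handle:
        return in_handle.read()


# Demonstration with a stand-in for the network call.
calls = []


def fake_fetch():
    calls.append(1)  # record that a "download" happened
    return "LOCUS       DEMO\n"


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output", "demo.gbk")
    first = fetch_cached(path, fake_fetch)
    second = fetch_cached(path, fake_fetch)  # served from disk, no second fetch

print(len(calls))
```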

#### An example:

In [13]:
from Bio import Entrez, SeqIO
from Bio.Seq import Seq  # import the Seq class, not the Bio.Seq module

Entrez.email = "A.N.Other@example.com" # tell NCBI who you are
# get the GenBank record for a specific accession
handle = Entrez.efetch(db="nucleotide", id="M27569", rettype="gb")
record = SeqIO.read(handle, "gb")
handle.close()
print(record)
print("Length of record: ", len(record))
# M27569 is only 771 bp long, so a slice starting at position 24608 is empty
print("Last 11% of the genome: ", record[24608:])
print(record.features)

# extract the protein sequences that the record encodes
# (the only feature here is 'source', so the list is empty for M27569)
# Bio.Alphabet was removed in Biopython 1.78, so no alphabet is passed to Seq
translations = (f.qualifiers["translation"] for f in record.features[1:])
proteins = [Seq(t[0]) for t in translations]
print(proteins)

ID: M27569.1
Name: M27569
Description: Figure 3. 770-bp sequence of part of IBV cDNA clone C5 136
Number of features: 1
/molecule_type=DNA
/topology=linear
/data_file_division=UNA
/date=04-AUG-1993
/accessions=['M27569']
/sequence_version=1
/keywords=['']
/source=unclassified
/organism=unclassified unclassified.
/taxonomy=[]
/references=[Reference(title='sequencing of coronavirus ibv genomic rna: a 195-base open reading frame encoded by mrna b', ...)]
Seq('TACCTTTCAAGTAGATAATGGAAAAGTCTACTACGAAGGAACACCAGTTTTCCA...GCC', IUPACAmbiguousDNA())
Length of record:  771
Last 11% of the genome:  ID: M27569.1
Name: M27569
Description: Figure 3. 770-bp sequence of part of IBV cDNA clone C5 136
Number of features: 0
Seq('', IUPACAmbiguousDNA())
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(771), strand=1), type='source')]
[]
