# Querying NCBI's databases

### Entrez Programming Utilities (EUtils)

- 9 server-side programs
- stable interface into NCBI's Entrez system
- see [EUtils](https://www.ncbi.nlm.nih.gov/books/NBK25501/) for more information

### Entrez
- query and database system
- 39 NCBI databases such as PubMed and GenBank
- Help and more information: [Entrez Help](https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.The_Entrez_Databases)

- Access:
    - Manual via web browser: [Entrez](https://www.ncbi.nlm.nih.gov/gquery/)
    - Programmatic via __Bio.Entrez__ module
    
    
#### Understanding EUtils
Entrez databases:
- EUtils accesses data already in the Entrez system
- Entrez identifies database records using unique identifiers (UIDs)
    - e.g., GI numbers for Nucleotide and Protein, PMIDs for PubMed
    - EUtils use UIDs for both data input and output
    
Utilities:
- __ESearch__: list of matching UIDs in one database (__EGQuery__: hit counts for a query across all databases)
- __ESummary__: summary record for each UID
- __EInfo__: database statistics
- __EPost__: UID uploads
- __EFetch__: data record downloads
- __ELink__: Entrez links
- __ESpell__: spelling suggestions
- __ECitMatch__: batch citation search in PubMed
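
Each utility is reached through its own URL under a common base address, with the database, query, or UIDs passed as parameters. The sketch below only builds such URLs with the standard library and sends nothing; `build_eutils_url` is a hypothetical helper written for this illustration, not part of Biopython.

```python
from urllib.parse import urlencode

# Base address of the EUtils service; each utility is a separate .fcgi program.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, **params):
    """Return the request URL for one utility, e.g. 'esearch' or 'esummary'.

    Hypothetical helper for illustration; no request is sent.
    """
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(params)}"


# ESearch takes a query term and returns matching UIDs ...
search_url = build_eutils_url("esearch", db="pubmed", term="biopython")
# ... and ESummary takes a UID and returns a summary record.
summary_url = build_eutils_url("esummary", db="pubmed", id="12230038")

print(search_url)
print(summary_url)
```

This shows why UIDs matter: the output of one utility (a list of UIDs) is the input of the next.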


### Biopython's Entrez module (Bio.Entrez)
The Bio.Entrez API uses EUtils and provides:
- __Python functions__ for eight EUtils tools
- a __parser__ for the XML output of EUtils

It takes care that:
- the correct URL is used for the queries
- the __NCBI requirement__ on request rate is respected: no more than three requests per second without an API key

Attributes (required by Entrez):
- email (contact address of the user)
- tool (default is 'biopython')
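
To give a feel for what enforcing a request-rate limit involves, here is a minimal throttle sketch. It is not Biopython's implementation; the `Throttle` class and the interval used are placeholders chosen for illustration, and Bio.Entrez already handles this for you.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=0.05)  # placeholder interval, not NCBI's limit
start = time.monotonic()
for _ in range(3):
    throttle.wait()  # a real client would send its request after this returns
elapsed = time.monotonic() - start
print(f"3 calls took {elapsed:.2f} s")
```

The clock and sleep functions are injectable so the class can be exercised without real waiting.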

## 1.) einfo

In [None]:
from Bio import Entrez
Entrez.email = "A.N.Other@example.com" # tell NCBI who you are
# use the einfo tool
handle = Entrez.einfo()

In [None]:
# read the information
result = handle.read()
# list of databases in XML format
print(result)

In [None]:
handle = Entrez.einfo()

# or parse the data 
record = Entrez.read(handle)

# print the dictionary's keys
print(record.keys())

In [None]:
# get the entries --> all databases available
print(record["DbList"])
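
To make the shape of this result concrete without a network call, the sketch below parses a hand-written, heavily abbreviated stand-in for an einfo response using only the standard library. The sample XML is an assumption written for illustration, not real server output; `Entrez.read` does the equivalent parsing for you.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for an einfo response (not real server output).
SAMPLE_EINFO_XML = """<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nucleotide</DbName>
  </DbList>
</eInfoResult>"""

root = ET.fromstring(SAMPLE_EINFO_XML)
# Mimic the parsed structure: a "DbList" key holding the database names.
sample_record = {"DbList": [node.text for node in root.iter("DbName")]}
print(sample_record["DbList"])
```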

In [None]:
# get information about a specific database
record2 = Entrez.read(Entrez.einfo(db="pubmed"))
print(record2["DbInfo"]["Description"])

In [None]:
record2["DbInfo"]["Count"]

## 2.) esearch / esummary
Query a specific Entrez database

In [None]:
# Search PubMed for biopython-related publications
handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print("Number of found publications: ", record["Count"])
print(record["IdList"])
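
By default, ESearch returns only the first 20 matching UIDs even when `Count` is larger; the EUtils parameters `retstart` and `retmax` select which slice of the hit list comes back. The sketch below computes the slices needed to cover all hits. `page_windows` is a hypothetical helper and the numbers are made up for illustration.

```python
def page_windows(total, page_size):
    """Yield (retstart, retmax) pairs that together cover `total` hits."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for start in range(0, total, page_size):
        yield start, min(page_size, total - start)


# Example: 45 hits requested 20 at a time (the counts are made up).
for retstart, retmax in page_windows(45, 20):
    print(f"retstart={retstart} retmax={retmax}")
```

Each pair could then be passed on as `Entrez.esearch(db=..., term=..., retstart=..., retmax=...)`. Note that `record["Count"]` is returned as a string, so convert it with `int()` before using it as `total`.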

Retrieving summaries from UIDs

In [None]:
handle = Entrez.esummary(db="pubmed", id="12230038")
record = Entrez.read(handle)
record[0]["Id"]

In [None]:
print("Publication title: ", record[0]["Title"], " publication date: ", record[0]["PubDate"])

## 3.) efetch
Request and download data records

In [None]:
from Bio import SeqIO

handle = Entrez.efetch(db="protein", id="349839", rettype="gb")
record = SeqIO.read(handle, "gb")
handle.close()
print(record)

In [None]:
# or query with multiple UIDs (comma-separated, no spaces)
handle = Entrez.efetch(db="protein", id="349839,349840", rettype="fasta")
records = list(SeqIO.parse(handle, "fasta"))  # parse while the handle is still open
handle.close()
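
When the UIDs come from a Python list rather than a hand-typed string, they have to be joined into the comma-separated form, and long lists are usually split into smaller requests. A minimal sketch, assuming a hypothetical helper `id_batches` and an arbitrary batch size:

```python
def id_batches(uids, batch_size):
    """Yield comma-separated UID strings with at most `batch_size` UIDs each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    cleaned = [str(uid).strip() for uid in uids]
    for i in range(0, len(cleaned), batch_size):
        yield ",".join(cleaned[i:i + batch_size])


uids = ["349839", "349840"]
print(list(id_batches(uids, batch_size=2)))
print(list(id_batches(uids, batch_size=1)))
```

Each yielded string can be used as the `id` argument of one `Entrez.efetch` call.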

In [None]:
# save the record in a file before parsing so you don't have to download it each time
import os
file_name = "data/output/M27569.gbk"
if not os.path.isfile(file_name):
    os.makedirs(os.path.dirname(file_name), exist_ok=True)  # make sure the folder exists
    net_handle = Entrez.efetch(db="nucleotide", id="M27569", rettype="gb")
    with open(file_name, "w") as out_handle:
        out_handle.write(net_handle.read())
    net_handle.close()
record = SeqIO.read(file_name, "genbank")
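
The same download-once pattern can be wrapped in a small reusable function. The sketch below uses a stand-in fetch function instead of a real download so it runs offline; `read_cached` and `fake_fetch` are hypothetical names introduced for this illustration.

```python
import os
import tempfile


def read_cached(path, fetch):
    """Return the text stored at `path`, calling `fetch()` only if the file is missing."""
    if not os.path.isfile(path):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "w") as out_handle:
            out_handle.write(fetch())
    with open(path) as in_handle:
        return in_handle.read()


calls = []


def fake_fetch():
    """Stand-in for a download; records that it was called."""
    calls.append(1)
    return "placeholder record text\n"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output", "example.txt")
    first = read_cached(target, fake_fetch)   # file missing: fetch is called
    second = read_cached(target, fake_fetch)  # file present: fetch is skipped

print(len(calls))
```

With Bio.Entrez, the `fetch` argument could be a small function that calls `Entrez.efetch(...)` and returns the handle's text.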

#### An example:

In [None]:
from Bio import Entrez, SeqIO
from Bio.Seq import Seq  # the Seq class; Bio.Alphabet was removed in Biopython 1.78

Entrez.email = "A.N.Other@example.com" # tell NCBI who you are
# get genbank record for a specific gene
handle = Entrez.efetch(db="nucleotide", id="M27569", rettype="gb")
record = SeqIO.read(handle, "gb")
handle.close()
print(record)
print("Length of record: ", len(record))
print("Last 11% of the genome: ", record[24608:])
print(record.features)

# extract the protein sequences that the genome encodes
# (only features that carry a "translation" qualifier)
translations = (f.qualifiers["translation"] for f in record.features if "translation" in f.qualifiers)
proteins = [Seq(t[0]) for t in translations]
print(proteins)