# Using the Entrez package to work with NCBI data

Small, simple examples of downloading and working with NCBI data.
Requirements: biopython, pandas

__Documentation__: https://www.ncbi.nlm.nih.gov/books/NBK25501/

Check which data and types are supported by Entrez: 

Before using Biopython to access the NCBI’s online resources (via Bio.Entrez or some of the other modules), please read the NCBI’s Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!

To paraphrase: For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey. Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address. Make no more than three requests every second (relaxed from at most one request every three seconds in early 2009). This is automatically enforced by Biopython.
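Since the weekend/off-peak rule is left to the caller, a long job can check the clock before it starts. The sketch below assumes the off-peak window given in NCBI's E-utilities usage guidelines (weekends, or 9:00 PM to 5:00 AM US Eastern time on weekdays). `is_off_peak` and `current_eastern_time` are illustrative names, not part of Biopython.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # standard library, Python 3.9+


def is_off_peak(eastern_time: datetime) -> bool:
    """Return True if the given US Eastern wall-clock time is off-peak.

    Assumption: off-peak means Saturday or Sunday, or a weekday before
    05:00 or from 21:00 onwards.
    """
    if eastern_time.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return eastern_time.hour < 5 or eastern_time.hour >= 21


def current_eastern_time() -> datetime:
    """Current wall-clock time in US Eastern (needs system time zone data)."""
    return datetime.now(ZoneInfo("America/New_York"))
```

A batch script could then start with `if not is_off_peak(current_eastern_time()): ...` and postpone or shrink the job.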

For large queries, the NCBI also recommend using their session history feature (the WebEnv session cookie string, see Section 8.15 of the Biopython Tutorial). This is only slightly more complicated.
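A minimal sketch of the session history approach, assuming the usual Biopython pattern: search once with `usehistory="y"`, then pass the returned `WebEnv` and `QueryKey` to `efetch` and walk through the results with `retstart`/`retmax`. The function names, the batch size of 500, and the `max_records` cap are illustrative choices. `Entrez.email` is assumed to be set before calling.

```python
def batch_starts(total, batch_size):
    """Return the retstart offset of each batch needed to cover `total` records."""
    return range(0, total, batch_size)


def fetch_with_history(db, term, batch_size=500, max_records=None):
    """Search once with usehistory="y", then yield FASTA text batch by batch."""
    # Imported here so that batch_starts() can be used without Biopython.
    from Bio import Entrez

    handle = Entrez.esearch(db=db, term=term, usehistory="y")
    search = Entrez.read(handle)
    handle.close()

    total = int(search["Count"])  # Count is returned as a string
    if max_records is not None:
        total = min(total, max_records)

    for start in batch_starts(total, batch_size):
        handle = Entrez.efetch(
            db=db,
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=min(batch_size, total - start),
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        data = handle.read()
        handle.close()
        yield data
```

For example, `for chunk in fetch_with_history("nucleotide", "COX1", max_records=1000): ...` would download at most 1000 records in two requests.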

Tech support (and to register the tool, accessing NCBI): 


In [51]:
import Bio.Entrez as etz
import Bio.SeqIO as SeqIO
import pandas as pd

#### Best practice is to register an email address before querying NCBI
Otherwise, a warning that you have not registered is shown with each response.

Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a parameter with each call to Entrez (e.g. include email="A.N.Other@example.com" in the argument list), or as of Biopython 1.48, you can set a global email address:

In [52]:
etz.email="oksana.korol@agr.gc.ca"

If you are using Biopython within some larger software suite, use the tool parameter to specify this. You can either explicitly set the tool name as a parameter with each call to Entrez (e.g. include tool="MyLocalScript" in the argument list), or as of Biopython 1.54, you can set a global tool name:

In [53]:
etz.tool = "AAFC-reference-data-manager"
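As an alternative to the global settings, the per-call `email` and `tool` parameters mentioned above can be built in one place. This is only a sketch: `entrez_identity` is an illustrative helper, and `NCBI_EMAIL` / `NCBI_TOOL` are hypothetical environment variable names chosen to keep the address out of the notebook.

```python
import os


def entrez_identity(default_tool="MyLocalScript"):
    """Build the email/tool keyword arguments for a single Entrez call.

    NCBI_EMAIL and NCBI_TOOL are hypothetical environment variable names.
    """
    email = os.environ.get("NCBI_EMAIL")
    if not email:
        raise RuntimeError("Set NCBI_EMAIL so the NCBI can contact you about problems")
    return {"email": email, "tool": os.environ.get("NCBI_TOOL", default_tool)}
```

The result can be unpacked into any call, for example `etz.einfo(**entrez_identity())`.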

### List databases available at NCBI

In [54]:
handler = etz.einfo()
response = handler.read()
handler.close()
response

'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20130322//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20130322/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>nucgss</DbName>\n\t<DbName>nucest</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>sparcle</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>clone</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>homologene</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t

In [55]:
handler = etz.einfo()
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'nucgss', 'nucest', 'structure', 'sparcle', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'clone', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'probe', 'proteinclusters', 'pcassay', 'biosystems', 'pccompound', 'pcsubstance', 'pubmedhealth', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'unigene', 'gencoll', 'gtr']}
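The parsed `DbList` is a plain list of names, so it can be used to catch a mistyped database name before sending a query. A small sketch, with `require_database` as an illustrative helper and only an excerpt of the list above:

```python
import difflib


def require_database(name, db_list):
    """Return `name` if Entrez lists it, otherwise raise ValueError with suggestions."""
    if name in db_list:
        return name
    hints = difflib.get_close_matches(name, db_list)
    raise ValueError(f"Unknown Entrez database {name!r}; close matches: {hints}")


# Excerpt of the DbList shown above.
db_list = ["pubmed", "protein", "nuccore", "nucleotide", "taxonomy"]
db = require_database("nucleotide", db_list)
```

With the live response, `parsed_response['DbList']` would take the place of the hard-coded excerpt.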

### Query a database

In [56]:
handler = etz.einfo(db="nucleotide")
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'DbInfo': {'DbName': 'nuccore', 'MenuName': 'Nucleotide', 'Description': 'Core Nucleotide db', 'DbBuild': 'Build180512-2140m.1', 'Count': '260069855', 'LastUpdate': '2018/05/14 12:21', 'FieldList': [{'Name': 'ALL', 'FullName': 'All Fields', 'Description': 'All terms from all searchable fields', 'TermCount': '4119477627', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'UID', 'FullName': 'UID', 'Description': 'Unique number assigned to each sequence', 'TermCount': '0', 'IsDate': 'N', 'IsNumerical': 'Y', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'Y'}, {'Name': 'FILT', 'FullName': 'Filter', 'Description': 'Limits the records', 'TermCount': '426', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'WORD', 'FullName': 'Text Word', 'Description': 'Free text associated with record', 'TermCount': '1871287831', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'Is

In [57]:
parsed_response.keys()

dict_keys(['DbInfo'])

In [58]:
parsed_response['DbInfo'].keys()

dict_keys(['DbName', 'MenuName', 'Description', 'DbBuild', 'Count', 'LastUpdate', 'FieldList', 'LinkList'])

In [59]:
pd.DataFrame(parsed_response['DbInfo']['FieldList'])

| | Description | FullName | Hierarchy | IsDate | IsHidden | IsNumerical | Name | SingleToken | TermCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | All terms from all searchable fields | All Fields | N | N | N | N | ALL | N | 4119477627 |
| 1 | Unique number assigned to each sequence | UID | N | N | Y | Y | UID | Y | 0 |
| 2 | Limits the records | Filter | N | N | N | N | FILT | Y | 426 |
| 3 | Free text associated with record | Text Word | N | N | N | N | WORD | N | 1871287831 |
| 4 | Words in definition line | Title | N | N | N | N | TITL | N | 140789434 |
| 5 | Nonstandardized terms provided by submitter | Keyword | N | N | N | N | KYWD | Y | 15637719 |
| 6 | Author(s) of publication | Author | N | N | N | N | AUTH | Y | 2773103 |
| 7 | Journal abbreviation of publication | Journal | N | N | N | N | JOUR | Y | 34918 |
| 8 | Volume number of publication | Volume | N | N | N | N | VOL | Y | 3693 |
| 9 | Issue number of publication | Issue | N | N | N | N | ISS | Y | 4219 |


In [60]:
pd.DataFrame(parsed_response['DbInfo']['LinkList'])

| | DbTo | Description | Menu | Name |
|---|---|---|---|---|
| 0 | ccds | Link to Consensus CDS | ccds | nucleotide_ccds |
| 1 | genome | Genome record containing nucleotide sequence | Assembly to Genome | nucleotide_genome |
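The `FieldList` entries are what the bracketed tags in a search term (such as `[All Fields]`) refer to, so it can be handy to turn them into a lookup. A sketch with an illustrative helper name, using three entries copied from the output above with the other keys left out:

```python
def visible_field_tags(field_list):
    """Map each non-hidden field's short name to its full name."""
    return {f["Name"]: f["FullName"] for f in field_list if f["IsHidden"] == "N"}


# Entries copied from the FieldList output above (other keys omitted).
field_list = [
    {"Name": "ALL", "FullName": "All Fields", "IsHidden": "N"},
    {"Name": "UID", "FullName": "UID", "IsHidden": "Y"},
    {"Name": "TITL", "FullName": "Title", "IsHidden": "N"},
]
tags = visible_field_tags(field_list)
```

With the live response, `visible_field_tags(parsed_response['DbInfo']['FieldList'])` would cover every field of the database.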


## Queries

#### Get all fungal ITS sequences

In [61]:
query = "txid4751[Organism:exp] AND \"Internal Transcribed Spacer\"[All Fields]"
#query = "\"Internal Transcribed Spacer\"[All Fields]"
handler = etz.esearch(db='nucleotide', term=query)
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'Count': '1075149', 'RetMax': '20', 'RetStart': '0', 'IdList': ['1388874334', '1388874333', '1388874332', '1388874331', '1388874330', '1388874329', '1388874328', '1388874327', '1386813114', '1386812756', '1386811116', '1386808931', '1386808869', '1386808066', '1386807225', '1386806404', '1386806368', '1386806366', '1278989629', '1268013305'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'txid4751[Organism:exp]', 'Field': 'Organism', 'Count': '7241877', 'Explode': 'Y'}, {'Term': '"Internal Transcribed Spacer"[All Fields]', 'Field': 'All Fields', 'Count': '1647071', 'Explode': 'N'}, 'AND'], 'QueryTranslation': 'txid4751[Organism:exp] AND "Internal Transcribed Spacer"[All Fields]'}

In [62]:
parsed_response.keys()

dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', 'TranslationSet', 'TranslationStack', 'QueryTranslation'])
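Note that `Count` reports over a million hits while `IdList` holds only `RetMax` (20) of them, and that all the numbers arrive as strings. A small sketch of the arithmetic, using the values from the response above; `pages_needed` is an illustrative helper:

```python
def pages_needed(result):
    """Number of esearch requests needed to list every hit at the current RetMax."""
    total = int(result["Count"])      # values arrive as strings
    per_page = int(result["RetMax"])
    if per_page == 0:
        return 0
    return (total + per_page - 1) // per_page  # integer ceiling division


# Values copied from the response above.
result = {"Count": "1075149", "RetMax": "20", "RetStart": "0"}
pages = pages_needed(result)
```

The page size and offset are controlled with the `retmax` and `retstart` parameters of `esearch`, e.g. `etz.esearch(db='nucleotide', term=query, retstart=20, retmax=20)` for the second page.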

In [63]:
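# The last element is the boolean operator ('AND'), a plain string rather than a dict.
# Note: val is the same list object as parsed_response['TranslationStack'],
# so the deletion also changes parsed_response.
val = parsed_response['TranslationStack']
del val[-1]
pd.DataFrame(val)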

| | Count | Explode | Field | Term |
|---|---|---|---|---|
| 0 | 7241877 | Y | Organism | txid4751[Organism:exp] |
| 1 | 1647071 | N | All Fields | "Internal Transcribed Spacer"[All Fields] |
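`del val[-1]` removes the operator from the list held by `parsed_response` itself, and it only works when the operator happens to be last. A sketch of a variant that leaves the response untouched, assuming the parsed terms behave like dicts and the operators like strings (as they do in the output above); `split_translation_stack` is an illustrative name:

```python
def split_translation_stack(stack):
    """Separate term dicts from boolean operator strings without modifying `stack`."""
    terms = [item for item in stack if isinstance(item, dict)]
    operators = [item for item in stack if isinstance(item, str)]
    return terms, operators


# TranslationStack copied from the response above.
stack = [
    {"Term": "txid4751[Organism:exp]", "Field": "Organism",
     "Count": "7241877", "Explode": "Y"},
    {"Term": '"Internal Transcribed Spacer"[All Fields]', "Field": "All Fields",
     "Count": "1647071", "Explode": "N"},
    "AND",
]
terms, operators = split_translation_stack(stack)
```

`pd.DataFrame(terms)` then gives the same table, also for stacks where operators sit between the terms.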


In [64]:
parsed_response['QueryTranslation']

'txid4751[Organism:exp] AND "Internal Transcribed Spacer"[All Fields]'

In [65]:
pd.DataFrame(parsed_response['TranslationStack'])

| | Count | Explode | Field | Term |
|---|---|---|---|---|
| 0 | 7241877 | Y | Organism | txid4751[Organism:exp] |
| 1 | 1647071 | N | All Fields | "Internal Transcribed Spacer"[All Fields] |


#### Now get the data

In [70]:
# efetch selects the output format with rettype/retmode
handle = etz.efetch(db='nucleotide', id="1388874334", rettype='fasta', retmode='text')

#record = SeqIO.read(handle, "fasta")
#handle.close()
#record

<_io.TextIOWrapper encoding='latin-1'>
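The handle returned by `efetch` still has to be parsed. A sketch of fetching several UIDs per request and parsing them with `SeqIO.parse`, using the `rettype`/`retmode` parameters from the E-utilities documentation. The helper names and the chunk size of 200 are illustrative choices, and `Entrez.email` is assumed to be set.

```python
def chunk_ids(ids, size):
    """Split a list of UIDs into lists of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_fasta_records(ids, chunk_size=200):
    """Yield SeqRecord objects for the given nucleotide UIDs."""
    # Imported here so that chunk_ids() works without Biopython installed.
    from Bio import Entrez, SeqIO

    for chunk in chunk_ids(ids, chunk_size):
        handle = Entrez.efetch(
            db="nucleotide",
            id=",".join(chunk),  # several UIDs in one request
            rettype="fasta",
            retmode="text",
        )
        try:
            yield from SeqIO.parse(handle, "fasta")
        finally:
            handle.close()
```

For example, `records = list(fetch_fasta_records(parsed_response['IdList']))` would download the 20 records listed by the search above.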

#### Get all CO1

In [19]:
query = "COI OR CO1 OR COX1"
handler = etz.esearch(db='nucleotide', term=query)
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'Count': '2849202', 'RetMax': '20', 'RetStart': '0', 'IdList': ['1388875724', '1388875722', '1388875720', '1388875718', '1388875716', '1388875714', '1388875712', '1388875710', '1388875708', '1388875706', '1388875704', '1388875702', '1388875700', '1388875698', '1388875696', '1388875694', '1388875692', '1388875690', '1388875688', '1388875686'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'COI[All Fields]', 'Field': 'All Fields', 'Count': '2543439', 'Explode': 'N'}, {'Term': 'CO1[All Fields]', 'Field': 'All Fields', 'Count': '54254', 'Explode': 'N'}, 'OR', {'Term': 'COX1[All Fields]', 'Field': 'All Fields', 'Count': '1585496', 'Explode': 'N'}, 'OR'], 'QueryTranslation': 'COI[All Fields] OR CO1[All Fields] OR COX1[All Fields]'}
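The `QueryTranslation` shows that untagged terms are searched in `All Fields`, so the hits include any record that mentions one of the names anywhere. A sketch of building the same query with an explicit field tag, which makes it easy to try a narrower field later; `or_query` is an illustrative helper:

```python
def or_query(terms, field="All Fields"):
    """Join search terms with OR, tagging each with the same Entrez field."""
    return " OR ".join(f"{term}[{field}]" for term in terms)


query = or_query(["COI", "CO1", "COX1"])
```

Passing another `FullName` from the database's `FieldList`, for example `or_query(["COI", "CO1", "COX1"], field="Title")`, restricts the search to that field.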

#### Complete blast database
ncbi_nr_nt?

In [72]:
handler = etz.einfo(db="blastdbinfo")
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'DbInfo': {'DbName': 'blastdbinfo', 'MenuName': 'BlastdbInfo', 'Description': 'BlastdbInfo Database', 'DbBuild': 'Build180514-1442.1', 'Count': '3869192', 'LastUpdate': '2018/05/14 15:45', 'FieldList': [{'Name': 'ALL', 'FullName': 'All Fields', 'Description': 'All terms from all searchable fields', 'TermCount': '6600339', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'N', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'UID', 'FullName': 'UID', 'Description': 'Unique number assigned to publication', 'TermCount': '0', 'IsDate': 'N', 'IsNumerical': 'Y', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'Y'}, {'Name': 'FILT', 'FullName': 'Filter', 'Description': 'Limits the records', 'TermCount': '8', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden': 'N'}, {'Name': 'DB', 'FullName': 'Database Name', 'Description': 'Official name of the database', 'TermCount': '229004', 'IsDate': 'N', 'IsNumerical': 'N', 'SingleToken': 'Y', 'Hierarchy': 'N', 'IsHidden':

In [73]:
parsed_response['DbInfo'].keys()

dict_keys(['DbName', 'MenuName', 'Description', 'DbBuild', 'Count', 'LastUpdate', 'FieldList', 'LinkList'])

In [74]:
pd.DataFrame(parsed_response['DbInfo']['FieldList'])

| | Description | FullName | Hierarchy | IsDate | IsHidden | IsNumerical | Name | SingleToken | TermCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | All terms from all searchable fields | All Fields | N | N | N | N | ALL | N | 6600339 |
| 1 | Unique number assigned to publication | UID | N | N | Y | Y | UID | Y | 0 |
| 2 | Limits the records | Filter | N | N | N | N | FILT | Y | 8 |
| 3 | Official name of the database | Database Name | N | N | N | N | DB | Y | 229004 |
| 4 | Words in the title of database (e.g., "NCBI Tr... | Database Title | N | N | N | N | TITL | Y | 4418573 |
| 5 | Date of last database update | Last Update | N | Y | N | N | DATE | Y | 3321 |
| 6 | Organism Taxid | Database Organism Taxid | N | N | N | N | ORGN | Y | 1862391 |
| 7 | Genome Collection Assembly Name | Genome Collection Assembly Name | N | N | N | N | ASM | Y | 12921 |
| 8 | One of genomic, cdna, other-dna, or protein, o... | Blast Sequence Type | N | N | N | N | SEQT | Y | 4 |
| 9 | Appropriate sequence strategy for the sequence... | Blast Sequence Strategy | N | N | N | N | SEQS | Y | 5 |


In [75]:
parsed_response['DbInfo']['Description']

'BlastdbInfo Database'

In [78]:
query = "nr.*"
handler = etz.esearch(db='blastdbinfo', term=query)
parsed_response = etz.read(handler)
handler.close()
parsed_response

{'Count': '128', 'RetMax': '20', 'RetStart': '0', 'IdList': ['73451714', '73451704', '73451694', '73451684', '73451674', '73451664', '73451654', '73451644', '73067674', '73067534', '73067524', '67250904', '63379034', '63379004', '59445324', '59445274', '59445264', '58914924', '57406204', '57406194'], 'TranslationSet': [], 'TranslationStack': [{'Term': 'nr 1[All Fields]', 'Field': 'All Fields', 'Count': '5', 'Explode': 'N'}, {'Term': 'nr 10492[All Fields]', 'Field': 'All Fields', 'Count': '3', 'Explode': 'N'}, 'OR', {'Term': 'nr 12[All Fields]', 'Field': 'All Fields', 'Count': '1', 'Explode': 'N'}, 'OR', {'Term': 'nr 2[All Fields]', 'Field': 'All Fields', 'Count': '5', 'Explode': 'N'}, 'OR', {'Term': 'nr 272[All Fields]', 'Field': 'All Fields', 'Count': '1', 'Explode': 'N'}, 'OR', {'Term': 'nr 28534[All Fields]', 'Field': 'All Fields', 'Count': '3', 'Explode': 'N'}, 'OR', {'Term': 'nr 3[All Fields]', 'Field': 'All Fields', 'Count': '9', 'Explode': 'N'}, 'OR', {'Term': 'nr 4[All Fields]'