In [17]:
import entrezpy
from Bio import Entrez
from Bio import Medline
import json
import io
from tqdm.notebook import tqdm
import pandas as pd
import requests
from collections import defaultdict
import utils


# Set up the API key and email to access the Entrez database
Entrez.api_key = "d4419da12a995f11504887366d19a2830c07"

# Don't use this email for your regular retrieval; use your own email
Entrez.email = "hslexample@gmail.com"

# Define the number of records to retrieve
RET_NUM = 20



## Basic function

### ESearch
ESearch searches a database and returns the primary IDs (UIDs) of the matching records, for use in EFetch, ELink and ESummary, together with term translations. It can optionally retain the results on the History server for later use; to enable this, set ***usehistory*** to "y".

In [5]:
query_term = "Diabetes[title] AND Female[title]"

# Returns a handle to the results, which are in XML format by default.
handleSearch = Entrez.esearch(db="pubmed", 
                        term = query_term, 
                        retmode = 'xml',
                        sort = 'relevance',
                        retmax = RET_NUM )

# Parse the handle into a dictionary of search metadata
recordhandleSearch = Entrez.read(handleSearch)
idlist = recordhandleSearch["IdList"]
# print(idlist)


['31965302', '32352894', '32742387', '32488735', '32814108', '32708907', '32354622', '32470060', '32073219', '32267365', '31502642', '32299459', '31867989', '31642603', '32046486', '32213183', '32146632', '32685429', '32248149', '31433270']
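
Besides `IdList`, the parsed ESearch result also carries the total hit count and the query as PubMed interpreted it (the term translations mentioned above). The following is a minimal sketch, assuming the result is the dictionary-like object returned by `Entrez.read`; `summarize_search` is a hypothetical helper name, not part of Biopython.

```python
def summarize_search(record):
    """Pick the most useful fields out of a parsed ESearch result.

    `record` is assumed to be the dict-like object returned by Entrez.read()
    for an ESearch call. WebEnv and QueryKey are only present when the search
    was run with usehistory="y", so they are read with .get().
    """
    return {
        "total_hits": int(record["Count"]),   # all matches on the server
        "returned": len(record["IdList"]),    # capped by retmax
        "translated_query": record.get("QueryTranslation", ""),
        "webenv": record.get("WebEnv"),
        "query_key": record.get("QueryKey"),
    }
```

Calling it on the dictionary returned by `Entrez.read(handleSearch)` shows at a glance whether `retmax` truncated the result, since `total_hits` can be much larger than `returned`.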


### EFetch
EFetch retrieves the full records for a list of input UIDs from an Entrez database.
The available return formats depend on the database you query; see [detail](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly).
For PubMed, it supports XML (see [example](https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=11748933,11700088&retmode=xml)) and MEDLINE (see [example](https://www.nlm.nih.gov/bsd/mms/medlineelements.html)).

In [None]:
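handleFetch = Entrez.efetch(db="pubmed", id = idlist, retmode="xml")
recordhandleSearch = Entrez.read(handleFetch)
handleFetch.close()
# print(recordhandleSearch)

# You can also use parse_handle, a predefined function in utils, to parse the handle
# object into a dataframe. Entrez.read() has already consumed handleFetch above,
# so fetch a fresh handle before passing it on.
handleFetch = Entrez.efetch(db="pubmed", id = idlist, retmode="xml")
df = utils.parse_handle(handleFetch)
# print(df)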
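
The parsed PubMed XML is a nested structure, so turning it into rows usually means walking it record by record. The sketch below is not the `utils.parse_handle` function used above, whose implementation is not shown here. `records_to_rows` is a hypothetical helper, and it assumes the usual layout of `Entrez.read` output for PubMed, with articles listed under the `PubmedArticle` key.

```python
def records_to_rows(records):
    """Flatten parsed PubMed records into a list of plain dictionaries.

    Assumes `records` looks like the output of Entrez.read() on a PubMed
    EFetch call in XML mode: {"PubmedArticle": [ {...}, ... ], ...}.
    Missing optional fields fall back to an empty string.
    """
    rows = []
    for article in records.get("PubmedArticle", []):
        citation = article["MedlineCitation"]
        info = citation["Article"]
        rows.append({
            "pmid": str(citation["PMID"]),
            "title": str(info.get("ArticleTitle", "")),
            "journal": str(info.get("Journal", {}).get("Title", "")),
        })
    return rows
```

The resulting list can be passed straight to `pd.DataFrame(...)` to get one row per article.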

### EPost
This function accepts a list of UIDs from a given database, stores the set on the History server, and responds with a query key and web environment for the uploaded dataset. It is mainly used to make an existing set of records available on the History server for later retrieval. You can think of it as checking in your baggage and receiving a ticket to collect it later. To demonstrate, we post the list of PMIDs retrieved by ESearch to the History server and get back the query_key and WebEnv parameters.


In [21]:
# Post the idlist retrieved from the search to the History server
handlePost = Entrez.epost(db="pubmed", id = ",".join(idlist))
search_results = Entrez.read(handlePost)
# search_results

{'QueryKey': '1', 'WebEnv': 'MCID_5f59a926a9148b16ed299650'}

### ELink
This function finds entities in a database that are linked to a target set of entities. It returns a list of IDs related to the target ID list, if any, optionally with relevancy scores. Below, we retrieve the documents related to the input IDs. The optional parameter ***cmd*** controls the action and output of ELink (see [detail](https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ELink)). For example, setting it to "neighbor_score" retrieves the similarity scores between the target documents and the retrieved documents.

In [10]:
# Get related articles' id of the targeted records with the relevancy score
handleLink = Entrez.elink(dbfrom = "pubmed", id = idlist, cmd = "neighbor_score")
record = Entrez.read(handleLink)
# record
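
The ELink result is a list of link sets, each holding the source IDs and one or more groups of linked IDs. The sketch below pulls out the related PMIDs with their scores. `related_with_scores` is a hypothetical helper. It assumes the layout that `Entrez.read` typically produces for `cmd="neighbor_score"` (`IdList`, `LinkSetDb`, `LinkName`, `Link`, `Id`, `Score`) and that the related-articles group is named `pubmed_pubmed`.

```python
def related_with_scores(linksets, link_name="pubmed_pubmed"):
    """Map each source ID (or ID group) to (related_id, score) pairs.

    `linksets` is assumed to be the list returned by Entrez.read() for an
    ELink call with cmd="neighbor_score". Pairs are sorted by descending score.
    """
    related = {}
    for linkset in linksets:
        # A link set may describe one source ID or several mixed together.
        source = ",".join(str(uid) for uid in linkset["IdList"])
        pairs = []
        for linksetdb in linkset.get("LinkSetDb", []):
            if linksetdb["LinkName"] != link_name:
                continue  # skip other link types, e.g. review-only links
            for link in linksetdb["Link"]:
                pairs.append((str(link["Id"]), int(link.get("Score", 0))))
        related[source] = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return related
```

Applied to the `record` parsed above, this gives a dictionary that is easy to filter, for example to keep only the top few neighbours per article.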

### XML format parser
You may notice that, regardless of the return type you select, the API first returns a *handle* of type *_io.TextIOWrapper* that contains the publication data in XML format. If you download an XML file from the PubMed web interface, opening it with **open()** gives you the same kind of handle.
Entrez provides two functions to parse XML files, **read()** and **parse()**. Both parse an XML file from the NCBI Entrez Utilities into Python objects and share many similarities. The difference lies in how they deliver results: **read()** reads the complete file and returns a single Python object (a list or a dictionary), while **parse()** is a generator function that returns results one by one.
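
The eager versus lazy distinction can be sketched as two small wrappers. Both helper names are hypothetical, and the `entrez` argument is assumed to be the `Bio.Entrez` module (it is passed in so the functions can also be tried with a stand-in object). Note that, as far as the Biopython documentation describes it, `parse()` only suits XML whose top level is a list of records, so whether it works depends on the result you feed it.

```python
def load_all(entrez, handle):
    """Eager: read() parses the whole document and returns one object."""
    try:
        return entrez.read(handle)
    finally:
        handle.close()


def iter_records(entrez, handle):
    """Lazy: parse() yields one record at a time.

    The handle must stay open while records are being consumed, so it is
    closed only once the generator is exhausted or discarded.
    """
    try:
        for record in entrez.parse(handle):
            yield record
    finally:
        handle.close()
```

With `read()` the memory cost grows with the size of the result, while iterating with `parse()` keeps only the current record in memory, which matters for large downloads.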


### History server and activities
With the Entrez E-utilities, you can store UIDs on the History server through the ESearch, EPost or ELink functions and access them later. As described before, you can store search results (ESearch), upload particular UIDs (EPost) and find related records (ELink). You can then access the history for further data operations. The ***WebEnv*** and ***query_key*** parameters are the keys to this.
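
As an illustration, a search can be stored on the History server and then fetched in batches by passing `WebEnv` and `query_key` instead of an explicit ID list. This is a minimal sketch under stated assumptions: `fetch_from_history` is a hypothetical helper, the `entrez` argument is expected to be the `Bio.Entrez` module, and the batch size and record cap are arbitrary illustrative values.

```python
def fetch_from_history(entrez, db, term, batch_size=10, max_records=20):
    """ESearch with usehistory="y", then EFetch in batches from the History server."""
    # retmax=0: we only need the count and the history keys, not the IDs.
    handle = entrez.esearch(db=db, term=term, usehistory="y", retmax=0)
    search = entrez.read(handle)
    handle.close()

    total = min(int(search["Count"]), max_records)
    batches = []
    for start in range(0, total, batch_size):
        handle = entrez.efetch(
            db=db,
            webenv=search["WebEnv"],        # identifies the stored session
            query_key=search["QueryKey"],   # identifies the stored result set
            retstart=start,
            retmax=min(batch_size, total - start),
            retmode="xml",
        )
        batches.append(entrez.read(handle))
        handle.close()
    return batches
```

A call such as `fetch_from_history(Entrez, "pubmed", query_term)` would return a list with one parsed result per batch.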

## Pipeline - Advanced use of functions

Building on the functions above, we can develop [pipeline operations](https://www.ncbi.nlm.nih.gov/books/NBK25497/) on PubMed by chaining those functions to perform basic and advanced tasks.

#### Retrieve data records
1. ESearch --> EFetch

#### Retrieving data records matching a list of UIDs
1. EPost --> EFetch

#### Finding UIDs linked to a set of records
1. EPost --> ELink
2. ESearch --> ELink

#### Limiting a set of records with an Entrez query
1. EPost --> ESearch
2. ELink --> ESearch
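
One of these pipelines, EPost --> ESearch, can be sketched as follows: the IDs are uploaded first, and the follow-up search refers to the uploaded set as `#<query_key>` within the same web environment. `limit_ids_with_query` is a hypothetical helper name, the `entrez` argument is assumed to be the `Bio.Entrez` module, and the filter term in the usage note is only an illustration.

```python
def limit_ids_with_query(entrez, db, ids, extra_term, retmax=20):
    """Keep only those IDs that also match `extra_term` (EPost --> ESearch)."""
    # Step 1 (EPost): upload the IDs; the History server answers with
    # QueryKey and WebEnv for the stored set.
    handle = entrez.epost(db=db, id=",".join(str(uid) for uid in ids))
    posted = entrez.read(handle)
    handle.close()

    # Step 2 (ESearch): "#<query_key>" refers to the posted set, so the
    # combined term intersects it with the extra query.
    term = f"#{posted['QueryKey']} AND {extra_term}"
    handle = entrez.esearch(
        db=db,
        term=term,
        webenv=posted["WebEnv"],
        usehistory="y",   # needed when a WebEnv is supplied
        retmax=retmax,
    )
    result = entrez.read(handle)
    handle.close()
    return list(result["IdList"])
```

For example, `limit_ids_with_query(Entrez, "pubmed", idlist, "review[pt]")` would keep only the posted PMIDs that PubMed classifies as reviews.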

## Reference

Eric Sayers (2018), A General Introduction to the E-utilities: https://www.ncbi.nlm.nih.gov/books/NBK25499/
biopython: https://biopython.readthedocs.io/en/latest/api/Bio.Entrez.html?highlight=entrez