In [7]:
import entrezpy
from Bio import Entrez
from Bio import Medline
import json
import io
from tqdm.notebook import tqdm
import pandas as pd
import requests
from collections import defaultdict

# Set the API key and contact email used to access the Entrez databases
Entrez.api_key = "d4419da12a995f11504887366d19a2830c07"

# Don't use this email for your regular retrieval; use your own email
Entrez.email = "hslexample@gmail.com"

# Maximum number of records to retrieve per query
RET_NUM = 20



## Basic functions

### XML format parser

Whichever return type you select, the API first returns a *handle* (in this environment, an *_io.TextIOWrapper*) that holds the publication data, in XML format by default. If you download an XML file from the PubMed web interface, open it with **open()** to get the same kind of handle.

Entrez provides two functions for parsing XML: **read()**, which loads the whole result at once, and **parse()**, which yields records one at a time and so suits large result sets.


### ESearch

In [20]:
# ESearch searches and retrieves primary IDs (for use in EFetch, ELink and ESummary) and term translations, 
# and optionally retains results for future use in the user's environment.

query_term = "Diabetes[title] AND Female[title]"

# Return a handle to the results, which are in XML format by default.
handleSearch = Entrez.esearch(db="pubmed",
                              term=query_term,
                              retmode="xml",
                              sort="relevance",
                              retmax=RET_NUM)
recordhandleSearch = Entrez.read(handleSearch)
handleSearch.close()
idlist = recordhandleSearch["IdList"]
# print(recordhandleSearch)
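
`retmax` caps a single ESearch call at `RET_NUM` IDs. A minimal sketch of paging through a larger result set with the E-utilities `retstart` offset might look like the following. `page_starts` and `search_all_ids` are hypothetical helper names, not part of Biopython, and the Biopython import is kept inside the function so the offset arithmetic can be checked offline.

```python
def page_starts(total_count, page_size):
    """Return the retstart offsets needed to cover total_count records."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_count, page_size))


def search_all_ids(term, page_size=20, db="pubmed"):
    """Collect the IDs matching term, one ESearch page at a time (needs network)."""
    from Bio import Entrez  # assumes Entrez.email / Entrez.api_key are already set

    # The first call only reads the total hit count.
    handle = Entrez.esearch(db=db, term=term, retmax=0)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for start in page_starts(total, page_size):
        handle = Entrez.esearch(db=db, term=term,
                                retstart=start, retmax=page_size)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

Calling `search_all_ids(query_term, page_size=RET_NUM)` would then issue one request per page instead of stopping at the first `RET_NUM` hits.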





### Efetch
This function retrieves the full records for a list of input UIDs from Entrez.
The available return formats depend on the database you query; see the [details](https://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.T._valid_values_of__retmode_and/?report=objectonly).
For PubMed, the available formats are XML, MEDLINE, PMID list, and Abstract.
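
As a rough sketch, the four PubMed formats can be written down as the `rettype`/`retmode` pairs listed in the NCBI table linked above. `PUBMED_FORMATS`, `efetch_format`, and `fetch_pubmed` are hypothetical names introduced only for illustration.

```python
# rettype/retmode pairs for db="pubmed", as listed in the NCBI EFetch table.
# XML is the default output, so it needs no rettype.
PUBMED_FORMATS = {
    "xml": {"retmode": "xml"},
    "medline": {"rettype": "medline", "retmode": "text"},
    "pmid_list": {"rettype": "uilist", "retmode": "text"},
    "abstract": {"rettype": "abstract", "retmode": "text"},
}


def efetch_format(name):
    """Return a copy of the EFetch keyword arguments for a PubMed format."""
    try:
        return dict(PUBMED_FORMATS[name])
    except KeyError:
        raise ValueError(f"unknown PubMed format: {name!r}") from None


def fetch_pubmed(ids, fmt="xml"):
    """Fetch PubMed records in the chosen format; returns the handle (needs network)."""
    from Bio import Entrez  # assumes Entrez.email / Entrez.api_key are already set

    return Entrez.efetch(db="pubmed", id=",".join(ids), **efetch_format(fmt))
```

For instance, `Medline.parse(fetch_pubmed(idlist, "medline"))` should yield one dictionary per record, since MEDLINE text is the format `Bio.Medline` parses.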

In [134]:
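handleFetch = Entrez.efetch(db="pubmed", id=idlist, retmode="xml")
recordhandleSearch = Entrez.read(handleFetch)
handleFetch.close()

# A handle can only be read once, so request a fresh one before passing it to
# parse_handle, a predefined helper that turns the handle into a dataframe
handleFetch = Entrez.efetch(db="pubmed", id=idlist, retmode="xml")
df = parse_handle(handleFetch)
# print(df)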





### Epost
This function accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset. It is mainly useful for reusing an existing set of records in later calls.

In [22]:
# Post the idlist retrieved from the search to the History Server
handlePost = Entrez.epost(db="pubmed", id = idlist)
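
The call above only returns a handle; the query key and web environment still have to be read out of the response. The sketch below does this with the standard-library XML parser on an illustrative response string (the `WebEnv` value is made up). `parse_epost_result` and `fetch_from_history` are hypothetical helper names. In the notebook itself, `Entrez.read(handlePost)` should expose the same values under the `"QueryKey"` and `"WebEnv"` keys.

```python
import xml.etree.ElementTree as ET

# Illustrative EPost response; the WebEnv token here is a placeholder.
SAMPLE_EPOST_XML = """<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
</ePostResult>"""


def parse_epost_result(xml_text):
    """Extract the History Server coordinates from an EPost XML response."""
    root = ET.fromstring(xml_text)
    return {
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


def fetch_from_history(query_key, webenv, retmax=20):
    """Fetch records stored on the History Server (needs network)."""
    from Bio import Entrez  # assumes Entrez.email / Entrez.api_key are already set

    return Entrez.efetch(db="pubmed", query_key=query_key, webenv=webenv,
                         retmax=retmax, retmode="xml")
```

Passing the two values to EFetch in place of an explicit `id` list avoids resending the UIDs with every request.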

### Elink
This function finds entities in one database that are linked to a set of entities in the same or another database. Specifically, it:
1. Returns UIDs linked to an input set of UIDs in either the same or a different Entrez database
2. Returns UIDs linked to other UIDs in the same Entrez database that match an Entrez query
3. Checks for the existence of Entrez links for a set of UIDs within the same database
4. Lists the available links for a UID
5. Lists LinkOut URLs and attributes for a set of UIDs
6. Lists hyperlinks to primary LinkOut providers for a set of UIDs
7. Creates hyperlinks to the primary LinkOut provider for a single UID

In [33]:
handleLink = Entrez.elink(db = "pmc", dbfrom = "pubmed", id = idlist, cmd = "neighbor_score")

In [34]:
print(idlist)
handleLink.read()

['32814108', '32742387', '32488735', '32708907', '32354622', '32470060', '32267365', '31502642', '32073219', '31867989', '32352894', '32299459', '31642603', '32213183', '32046486', '32685429', '32146632', '32248149', '31965302', '31433270']


'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD elink 20101123//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20101123/elink.dtd">\n<eLinkResult>\n\n  <LinkSet>\n    <DbFrom>pubmed</DbFrom>\n    <IdList>\n      <Id>32814108</Id>\n    </IdList>\n    \n    \n    \n    \n    \n  </LinkSet>\n\n  <LinkSet>\n    <DbFrom>pubmed</DbFrom>\n    <IdList>\n      <Id>32742387</Id>\n    </IdList>\n    <LinkSetDb>\n      <DbTo>pmc</DbTo>\n      <LinkName>pubmed_pmc</LinkName>\n      \n        <Link>\n\t\t\t\t<Id>7388399</Id>\n\t\t\t\t<Score>1</Score>\n\t\t\t</Link>\n      \n    </LinkSetDb>\n    \n    \n    \n    <LinkSetDb>\n      <DbTo>pmc</DbTo>\n      <LinkName>pubmed_pmc_local</LinkName>\n      \n        <Link>\n\t\t\t\t<Id>7388399</Id>\n\t\t\t\t<Score>0</Score>\n\t\t\t</Link>\n      \n    </LinkSetDb>\n  </LinkSet>\n\n  <LinkSet>\n    <DbFrom>pubmed</DbFrom>\n    <IdList>\n      <Id>32488735</Id>\n    </IdList>\n    \n    \n    \n    \n    \n  </LinkSet>\n\
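
The raw XML above is hard to read directly. A minimal sketch of turning it into a PubMed-to-PMC mapping with the standard-library parser follows. It assumes one source `Id` per `LinkSet`, as in the response printed above; `linked_ids` is a hypothetical helper name, and the sample string is a trimmed copy of the first two `LinkSet` entries.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the first two LinkSets from the ELink response shown above.
SAMPLE_ELINK_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>32814108</Id></IdList>
  </LinkSet>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>32742387</Id></IdList>
    <LinkSetDb>
      <DbTo>pmc</DbTo>
      <LinkName>pubmed_pmc</LinkName>
      <Link><Id>7388399</Id><Score>1</Score></Link>
    </LinkSetDb>
    <LinkSetDb>
      <DbTo>pmc</DbTo>
      <LinkName>pubmed_pmc_local</LinkName>
      <Link><Id>7388399</Id><Score>0</Score></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def linked_ids(xml_text, link_name="pubmed_pmc"):
    """Map each source UID to the target UIDs listed under link_name."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for link_set in root.iter("LinkSet"):
        source_id = link_set.findtext("IdList/Id")
        mapping[source_id] = [
            link.findtext("Id")
            for link_db in link_set.findall("LinkSetDb")
            if link_db.findtext("LinkName") == link_name
            for link in link_db.findall("Link")
        ]
    return mapping
```

Articles without a PMC counterpart map to an empty list. Since a handle can only be read once, `linked_ids` needs the text from a fresh `Entrez.elink` call rather than the handle already consumed by `handleLink.read()`.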

## Basic pipeline

Building on the functions above, we can develop [pipeline operations](https://www.ncbi.nlm.nih.gov/books/NBK25497/) on PubMed by chaining those functions to perform basic and advanced tasks.
#### Retrieve data records
1. Esearch --> Esummary
2. Esearch --> Efetch
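
The two pipelines above can be sketched as one search step chained into a retrieval step. All function names below are hypothetical, and the Biopython imports sit inside the network-facing steps so that the chaining logic can be exercised offline with stand-in functions.

```python
def run_pipeline(term, search, retrieve):
    """Chain a search step into a retrieval step; skip retrieval on no hits."""
    ids = search(term)
    if not ids:
        return []
    return retrieve(ids)


def esearch_ids(term, retmax=20):
    """ESearch step: return the PubMed IDs matching term (needs network)."""
    from Bio import Entrez  # assumes Entrez.email / Entrez.api_key are already set

    handle = Entrez.esearch(db="pubmed", term=term, retmax=retmax)
    ids = list(Entrez.read(handle)["IdList"])
    handle.close()
    return ids


def esummary_docs(ids):
    """ESummary step: return the parsed document summaries (needs network)."""
    from Bio import Entrez

    handle = Entrez.esummary(db="pubmed", id=",".join(ids))
    docs = Entrez.read(handle)
    handle.close()
    return docs


def efetch_records(ids):
    """EFetch step: return the parsed full records (needs network)."""
    from Bio import Entrez

    handle = Entrez.efetch(db="pubmed", id=",".join(ids), retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    return records
```

With these, `run_pipeline(query_term, esearch_ids, esummary_docs)` corresponds to pipeline 1 and `run_pipeline(query_term, esearch_ids, efetch_records)` to pipeline 2.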










