# Overview

# Imports

In [1]:
import requests

In [2]:
requests.__version__

'2.31.0'

# Identifying the workflow for PubMed article abstract extraction using a small test case

## Step 1 - Querying PubMed and retrieving a list of IDs matching the search

**References and guides**

1. The base query, some example searches, and XML output parameters can be found here:
https://www.ncbi.nlm.nih.gov/books/NBK25500/

2. Parameters and syntax in depth:
https://www.ncbi.nlm.nih.gov/books/NBK25499/

3. The PubMed-specific search fields and their tags, which should be used to construct the query:
https://pubmed.ncbi.nlm.nih.gov/help/#tiab

4. The url_suffix variable below calls the 'Esearch' utility. E-utilities (see the base URL) offers nine utilities in total; the others are described here:
https://www.ncbi.nlm.nih.gov/books/NBK25497/

**Notes**
* By default, this API returns XML output rather than JSON.
* By default, it returns only the first 20 matching IDs.
* However, the Count element in the XML output gives the total number of records matching the search query.
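
Because only 20 IDs come back by default, retrieving all matches means requesting the results in pages. The sketch below shows one way to do that with the Esearch `retmax` and `retstart` parameters. The helper names `build_esearch_url` and `page_starts` are made up for illustration, and the count of 76 is taken from the response shown later in this notebook. No request is sent here; the code only builds the URLs.

```python
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_esearch_url(term, database='pubmed', retmax=20, retstart=0):
    """Build an Esearch URL that requests one page of IDs (hypothetical helper)."""
    params = {
        'db': database,
        'term': term,
        'retmax': retmax,      # number of IDs to return in this page
        'retstart': retstart,  # zero-based index of the first ID to return
        'usehistory': 'y',
    }
    return BASE_URL + 'esearch.fcgi?' + urlencode(params)


def page_starts(count, retmax):
    """Return the retstart value of every page needed to cover `count` records."""
    return list(range(0, count, retmax))


# Plain spaces are used in the term; urlencode turns them into '+'
# and percent-escapes the square brackets.
term = 'glioblastoma[tiab] AND ALK[tiab]'
urls = [build_esearch_url(term, retmax=20, retstart=start)
        for start in page_starts(76, 20)]
```

Each URL in `urls` could then be passed to `requests.get` in turn. Raising `retmax` instead would reduce the number of requests needed.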

In [59]:
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [35]:
database = 'pubmed'

In [41]:
query = 'glioblastoma[tiab]+AND+ALK[tiab]'

In [118]:
url_suffix = f'esearch.fcgi?db={database}&term={query}&usehistory=y'

In [119]:
url_suffix

'esearch.fcgi?db=pubmed&term=glioblastoma[tiab]+AND+ALK[tiab]&usehistory=y'

In [120]:
final_url = base_url + url_suffix

In [121]:
final_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=glioblastoma[tiab]+AND+ALK[tiab]&usehistory=y'

In [122]:
response = requests.get(final_url)

* A status code of 200 means the request was successful. Other codes indicate problems; 400, for example, means a bad request.

In [123]:
response.status_code

200

In [124]:
response.encoding

'UTF-8'

* As expected, the output is XML.
* The number of records (Count) was confirmed by running the same search manually on PubMed: https://pubmed.ncbi.nlm.nih.gov/?term=%28ALK%5BTitle%2FAbstract%5D%29+AND+%28glioblastoma%5BTitle%2FAbstract%5D%29&sort=

In [125]:
print(response.text)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>76</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>MCID_65bac6ca82e59474417f79cb</WebEnv><IdList>
<Id>38285799</Id>
<Id>37939020</Id>
<Id>37861443</Id>
<Id>37271069</Id>
<Id>37260294</Id>
<Id>37240478</Id>
<Id>37168365</Id>
<Id>36823756</Id>
<Id>36780194</Id>
<Id>36707425</Id>
<Id>35724395</Id>
<Id>35625997</Id>
<Id>35190826</Id>
<Id>34702773</Id>
<Id>34626238</Id>
<Id>34341009</Id>
<Id>34323181</Id>
<Id>34015889</Id>
<Id>33966367</Id>
<Id>33887544</Id>
</IdList><TranslationSet/><QueryTranslation>"glioblastoma"[Title/Abstract] AND "ALK"[Title/Abstract]</QueryTranslation></eSearchResult>
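
Rather than copying values out of the printed XML by hand, the response can be parsed. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`. The function name `parse_esearch` is hypothetical, and `sample_xml` is an abbreviated copy of the response above (only two of the 20 IDs are kept) so that the example runs without a network call.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the Esearch response shown above (two of the 20 IDs kept).
sample_xml = (
    '<eSearchResult><Count>76</Count><RetMax>20</RetMax><RetStart>0</RetStart>'
    '<QueryKey>1</QueryKey><WebEnv>MCID_65bac6ca82e59474417f79cb</WebEnv>'
    '<IdList><Id>38285799</Id><Id>37939020</Id></IdList></eSearchResult>'
)


def parse_esearch(xml_data):
    """Extract the total count, history values and ID list from Esearch XML."""
    root = ET.fromstring(xml_data)
    return {
        'count': int(root.findtext('Count')),       # total matching records
        'query_key': root.findtext('QueryKey'),     # history server query key
        'web_env': root.findtext('WebEnv'),         # history server session
        'ids': [elem.text for elem in root.findall('./IdList/Id')],
    }


result = parse_esearch(sample_xml)
```

In this notebook the same function could be called as `parse_esearch(response.content)`. The `query_key` and `web_env` values it returns are the ones needed for Efetch in Step 2.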



## Step 2 - Extracting title and abstract for a given PubMed identifier

**Notes**

* The Esearch utility used above only returns a list of identifiers matching the query terms.
* To extract other information, such as the title and abstract, another utility called Efetch has to be used.
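
Efetch can be pointed at records in two ways: with an explicit, comma-separated list of IDs, or with the `query_key` and `WebEnv` values that Esearch returns when `usehistory=y` is set. The sketch below wraps both in one helper. `build_efetch_url` is a hypothetical name rather than part of any library, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

BASE_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'


def build_efetch_url(ids=None, query_key=None, web_env=None,
                     database='pubmed', rettype='abstract', retmode='text'):
    """Build an Efetch URL from either a list of IDs or history server values."""
    params = {'db': database, 'rettype': rettype, 'retmode': retmode}
    if ids:
        # Explicit IDs: joined into one comma-separated value.
        params['id'] = ','.join(str(pmid) for pmid in ids)
    elif query_key and web_env:
        # History server: refers to the stored result set of an earlier Esearch.
        params['query_key'] = query_key
        params['WebEnv'] = web_env
    else:
        raise ValueError('Provide either ids, or both query_key and web_env.')
    # safe=',' keeps the commas in the ID list readable instead of '%2C'.
    return BASE_URL + 'efetch.fcgi?' + urlencode(params, safe=',')


url_by_ids = build_efetch_url(ids=['38285799', '37939020'])
url_by_history = build_efetch_url(query_key='1',
                                  web_env='MCID_65bac6ca82e59474417f79cb')
```

The history route avoids building long ID lists when a search matches many records, which is why the cells below use it.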

### 2A. Option 1 - Get content for PMIDs as text

In [63]:
# The base url and database fields have already been defined above:
base_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [62]:
database

'pubmed'

In [111]:
# Test one of the ids from the XML output above:
# ids_of_interest = '38285799,37939020'

rettype = 'abstract'
retmode = 'text'

In [112]:
# fetchurl_suffix = f'efetch.fcgi?db={database}&id={ids_of_interest}&rettype={rettype}&retmode={retmode}'
# fetchurl_suffix

'efetch.fcgi?db=pubmed&id=38285799,37939020&rettype=abstract&retmode=text'

In [126]:
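# QueryKey and WebEnv values copied from the Esearch XML output in Step 1.
# The ID-based cells above are commented out, so the history server is used instead.
query_key = '1'
web_env = 'MCID_65bac6ca82e59474417f79cb'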

fetchurl_suffix = f'efetch.fcgi?db={database}&query_key={query_key}&WebEnv={web_env}&rettype={rettype}&retmode={retmode}'

In [127]:
fetch_url = base_url + fetchurl_suffix
fetch_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_65bac6ca82e59474417f79cb&rettype=abstract&retmode=text'

In [128]:
fetch_response = requests.get(fetch_url)

In [129]:
fetch_response.status_code

200

In [130]:
type(fetch_response.text)

str

In [131]:
print(fetch_response.text)

1. Asian Pac J Cancer Prev. 2024 Jan 1;25(1):317-323. doi: 
10.31557/APJCP.2024.25.1.317.

Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 
and Its Correlation with IDH1-R132H Mutation Status.

Khairy RA(1), Momtaz EM(1), Abd El Aziz AM(1), Shibel PEE(1).

Author information:
(1)Department of Pathology, Faculty of Medicine, Cairo University, Egypt.

BACKGROUND: Glioblastoma (GB), a grade 4 glioma is the most common primary 
malignant brain tumor in adults. Recently, the mutation status of isocitrate 
dehydrogenase (IDH) has been crucial in the treatment of GB. IDH mutant cases 
display a more favorable prognosis than IDH-wild type ones. The anaplastic 
lymphoma kinase (ALK) is expressed as a receptor tyrosine kinase in both the 
developing central and peripheral nervous systems. Increasing lines of evidence 
suggest that ALK is over-expressed in GB and represents a potential therapeutic 
target.
OBJECTIVES: The goal of the current study was to investigate 

### 2B. Option 2 - Get content for PMIDs as XML

In [None]:
# Here only the retmode (retrieval mode) value needs to be changed, from 'text' to 'xml'.

retmode = 'xml'
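
When the retrieval mode is XML, the title and abstract can be read from named elements instead of being split out of free text. The sketch below assumes the usual PubMed layout, in which each `PubmedArticle` holds `MedlineCitation/Article/ArticleTitle` and one or more `Abstract/AbstractText` elements. The `sample_xml` fragment is hand-written and heavily abbreviated from the text record shown in 2A, not an actual Efetch response, and `extract_records` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated fragment modelled on the record printed in 2A.
sample_xml = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>38285799</PMID>
      <Article>
        <ArticleTitle>Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 and Its Correlation with IDH1-R132H Mutation Status.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Glioblastoma (GB), a grade 4 glioma is the most common primary malignant brain tumor in adults.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_records(xml_data):
    """Return a list of dicts with the PMID, title and abstract of each article."""
    root = ET.fromstring(xml_data)
    records = []
    for article in root.iter('PubmedArticle'):
        pmid = article.findtext('./MedlineCitation/PMID')
        title_elem = article.find('./MedlineCitation/Article/ArticleTitle')
        # itertext() also collects text nested inside inline markup such as <i>.
        title = ''.join(title_elem.itertext()).strip() if title_elem is not None else ''
        sections = []
        for part in article.iterfind('./MedlineCitation/Article/Abstract/AbstractText'):
            text = ''.join(part.itertext()).strip()
            label = part.get('Label')  # e.g. BACKGROUND; absent in unstructured abstracts
            sections.append(f'{label}: {text}' if label else text)
        records.append({'pmid': pmid, 'title': title, 'abstract': ' '.join(sections)})
    return records


records = extract_records(sample_xml)
```

With a real response, the call would be `extract_records(fetch_response.content)`. Articles without an abstract simply yield an empty string.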