# Overview

# Imports

In [1]:
import requests

In [2]:
requests.__version__

'2.31.0'

# Identifying the workflow for PubMed article abstract extraction using a small test case

## Step 1 - querying PubMed and retrieving a list of IDs matching the search

**References and guides**

1. The base query, some example searches, and XML output parameters can be found here:
https://www.ncbi.nlm.nih.gov/books/NBK25500/

2. Parameters and syntax in depth:
https://www.ncbi.nlm.nih.gov/books/NBK25499/

3. These are the PubMed specific search term fields and their tags that should be used to construct the query:
https://pubmed.ncbi.nlm.nih.gov/help/#tiab

4. As the search_suffix variable below shows, this step uses the ESearch utility. E-utilities provides nine utilities in total (see the base URL). The others are described here:
https://www.ncbi.nlm.nih.gov/books/NBK25497/

5. Alternative python package: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6821292/ (Not used here.)

6. EDirect - for batch access using command line
https://www.ncbi.nlm.nih.gov/books/NBK179288/

**Notes**
* By default, this API returns XML rather than JSON.
* By default, it returns only 20 IDs.
* However, the count field in the output shows the total number of records matching the search query.
* To establish the workflow, a small test query was used: papers with ALK and GBM (glioblastoma) in the title or abstract.

In [3]:
base_url = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [29]:
# Specify search parameters
database = 'pubmed'
query = 'glioblastoma[tiab]+AND+ALK[tiab]'

# Index of the first record to return. When making calls in a loop, increment this by retmax on each call.
retstart = '0'

# Maximum number of records to return. The default is 20 and the maximum allowed is 10,000.
retmax = 100

# Output format. The default is XML.
retmode = 'json'
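
The paging idea in the retstart comment can be sketched as follows. This is a minimal illustration, not part of the original workflow: `page_starts` is a hypothetical helper name, and it only computes the retstart values for each call without contacting the server.

```python
def page_starts(total_count, page_size):
    """Return the retstart value for each page of results.

    total_count: the 'count' reported by ESearch (assumed to be known already).
    page_size:   the retmax value used for every call.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return list(range(0, total_count, page_size))


# 76 matching records fetched 20 at a time need four calls.
print(page_starts(76, 20))  # [0, 20, 40, 60]
```

Each value would replace retstart in the search URL, with retmax held constant.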

In [30]:
# usehistory=y saves the IDs from the search on the server so that a subsequent
# call to another utility, such as EFetch, can use them to extract the abstracts.

search_suffix = f'esearch.fcgi?db={database}&term={query}&usehistory=y&retmode={retmode}&retstart={retstart}&retmax={retmax}'

In [31]:
search_url = base_url + search_suffix

In [32]:
search_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=glioblastoma[tiab]+AND+ALK[tiab]&usehistory=y&retmode=json&retstart=0&retmax=100'
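
As an alternative to assembling the query string by hand, the parameters could be kept in a dictionary and encoded with the standard library. The sketch below is one possible way to do this, assuming the same parameter values as above. `build_esearch_url` is a hypothetical helper name, and the encoded URL differs from the one above only in that the square brackets are percent-encoded.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, retstart=0, retmax=100, retmode="json"):
    """Build an ESearch URL for PubMed from a plain search term."""
    params = {
        "db": "pubmed",
        "term": term,          # spaces become '+', brackets become %5B / %5D
        "usehistory": "y",
        "retmode": retmode,
        "retstart": retstart,
        "retmax": retmax,
    }
    return BASE_URL + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("glioblastoma[tiab] AND ALK[tiab]")
print(url)
```

Writing the term with ordinary spaces and letting `urlencode` handle the escaping avoids mistakes when the query contains characters that are not URL-safe.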

In [33]:
response = requests.get(search_url)

* A status code of 200 means the request was successful. Other codes signal errors, for example 400 means a bad request.

In [34]:
response.status_code

200
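
These are standard HTTP status codes, so their meaning can be looked up programmatically. The sketch below uses the standard library's `http.HTTPStatus` for that. `describe_status` is a hypothetical helper added only for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code."""
    status = HTTPStatus(code)
    outcome = "success" if 200 <= code < 300 else "failure"
    return f"{code} {status.phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(400))  # 400 Bad Request (failure)
```

In a longer pipeline, calling `response.raise_for_status()` on the requests response is a common way to stop early on 4xx or 5xx codes.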

In [35]:
response.encoding

'UTF-8'

* The default output is XML; however, the retrieval mode (retmode) can be set to return JSON.
* Also, the number of records was confirmed by searching manually on PubMed: https://pubmed.ncbi.nlm.nih.gov/?term=%28ALK%5BTitle%2FAbstract%5D%29+AND+%28glioblastoma%5BTitle%2FAbstract%5D%29&sort=

In [43]:
response.json()

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '76',
  'retmax': '76',
  'retstart': '0',
  'querykey': '1',
  'webenv': 'MCID_65badeba27a5c1011c1c31f8',
  'idlist': ['38285799',
   '37939020',
   '37861443',
   '37271069',
   '37260294',
   '37240478',
   '37168365',
   '36823756',
   '36780194',
   '36707425',
   '35724395',
   '35625997',
   '35190826',
   '34702773',
   '34626238',
   '34341009',
   '34323181',
   '34015889',
   '33966367',
   '33887544',
   '33853673',
   '33728771',
   '33486679',
   '33341678',
   '33109342',
   '32866816',
   '32308772',
   '31875306',
   '31776900',
   '31483918',
   '31399568',
   '30894200',
   '30065256',
   '29336268',
   '28960893',
   '28912153',
   '28837676',
   '28484053',
   '28465216',
   '28459464',
   '28090572',
   '28069875',
   '27993946',
   '27579614',
   '27178681',
   '27046135',
   '26939704',
   '26648752',
   '26498130',
   '26438251',
   '26235020',
   '26090865',
   '25882777',
   '257338
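
The fields of interest in this JSON can be pulled out with a small helper. The sketch below is illustrative only: `summarize_esearch` is a hypothetical name, and the sample dictionary is a shortened copy of the output above with just the first three IDs.

```python
def summarize_esearch(payload):
    """Extract the total count, ID list, and history keys from ESearch JSON."""
    result = payload["esearchresult"]
    return {
        "count": int(result["count"]),        # total matches, returned as a string
        "ids": list(result["idlist"]),        # IDs returned by this call
        "query_key": result.get("querykey"),  # present when usehistory=y was used
        "webenv": result.get("webenv"),
    }


# Shortened sample for illustration (only the first three IDs are kept).
sample = {
    "header": {"type": "esearch", "version": "0.3"},
    "esearchresult": {
        "count": "76",
        "retmax": "76",
        "retstart": "0",
        "querykey": "1",
        "webenv": "MCID_65badeba27a5c1011c1c31f8",
        "idlist": ["38285799", "37939020", "37861443"],
    },
}

summary = summarize_esearch(sample)
print(summary["count"], summary["ids"][0])  # 76 38285799
```

With the real response, `summarize_esearch(response.json())` would be the equivalent call.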

## Step 2 - Extracting the title and abstract for PubMed identifiers (PMIDs) with the EFetch utility, using IDs stored on the server

**Notes**

* The ESearch utility used above only returns a list of identifiers matching the query terms.
* To extract other information, such as the title and abstract, another utility called EFetch has to be used.

In [89]:
# The base url and database fields have already been defined above:
base_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

In [45]:
database

'pubmed'

In [80]:
# Output settings for testing EFetch with the IDs from the JSON output above:

rettype = 'abstract'
retmode = 'text'

In [88]:
# fetchurl_suffix = f'efetch.fcgi?db={database}&id={ids_of_interest}&rettype={rettype}&retmode={retmode}'


* Alternatively, if the IDs are not stored on the server with usehistory=y in ESearch, the EFetch URL should include the IDs directly, for example 'id=38285799,37939020'. A comma-separated list of IDs can be passed here.
* However, storing the ID list on the server and then passing the querykey and webenv values covers all the IDs at once, so there is no need to pass a very long list when there are many IDs.
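
The explicit ID variant could look like the sketch below. It is an illustration under the same settings as this notebook (database 'pubmed', rettype 'abstract', retmode 'text'). `build_efetch_url` is a hypothetical helper name, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_efetch_url(pmids, rettype="abstract", retmode="text"):
    """Build an EFetch URL from an explicit list of PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),  # EFetch expects a comma-separated list
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the commas readable instead of encoding them as %2C
    return BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


ids_url = build_efetch_url(["38285799", "37939020"])
print(ids_url)
```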

In [81]:
# The query_key and web_env values come from the ESearch result above.
# They identify the ID list stored on the server so that another utility can use it.
esearch_result = response.json()['esearchresult']
query_key = esearch_result['querykey']
web_env = esearch_result['webenv']

fetchurl_suffix = f'efetch.fcgi?db={database}&query_key={query_key}&WebEnv={web_env}&rettype={rettype}&retmode={retmode}'


In [82]:
fetch_url = base_url + fetchurl_suffix
fetch_url

'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_65badeba27a5c1011c1c31f8&rettype=abstract&retmode=text'

In [83]:
fetch_response = requests.get(fetch_url)

In [84]:
fetch_response.status_code

200

In [85]:
type(fetch_response.text)

str

* Only the first 3 abstracts are shown as example output, but this variable holds a single string containing the abstracts of all 76 papers matching the GBM and ALK terms.
* Alternatively, if retmode=xml is set in the url then the output will be of XML format.
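
If the XML format is requested, the title and abstract can be read from named elements instead of being split out of free text. The sketch below shows one way to do that with the standard library. The sample XML is a hand-written placeholder that follows the PubMed element names (PubmedArticle, ArticleTitle, AbstractText), not real server output, and `parse_articles` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written placeholder record for illustration, not real EFetch output.
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000000</PMID>
      <Article>
        <ArticleTitle>Placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Placeholder background.</AbstractText>
          <AbstractText Label="OBJECTIVES">Placeholder objectives.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Return one dict per PubmedArticle with its PMID, title, and abstract."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title_el = article.find("MedlineCitation/Article/ArticleTitle")
        # itertext() also collects text inside inline markup such as <i> or <sub>
        title = "".join(title_el.itertext()) if title_el is not None else ""
        parts = []
        for node in article.iter("AbstractText"):
            text = "".join(node.itertext()).strip()
            label = node.get("Label")  # structured abstracts label each section
            parts.append(f"{label}: {text}" if label else text)
        records.append({"pmid": pmid, "title": title, "abstract": " ".join(parts)})
    return records


records = parse_articles(SAMPLE_XML)
print(records[0]["title"])
print(records[0]["abstract"])
```

With a real response, `parse_articles(fetch_response.text)` would be the equivalent call, provided the request used retmode=xml.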

In [87]:
print(fetch_response.text[0:5000])

1. Asian Pac J Cancer Prev. 2024 Jan 1;25(1):317-323. doi: 
10.31557/APJCP.2024.25.1.317.

Evaluation of Immunohistochemical Expression of ALK-1 in Gliomas, WHO Grade 4 
and Its Correlation with IDH1-R132H Mutation Status.

Khairy RA(1), Momtaz EM(1), Abd El Aziz AM(1), Shibel PEE(1).

Author information:
(1)Department of Pathology, Faculty of Medicine, Cairo University, Egypt.

BACKGROUND: Glioblastoma (GB), a grade 4 glioma is the most common primary 
malignant brain tumor in adults. Recently, the mutation status of isocitrate 
dehydrogenase (IDH) has been crucial in the treatment of GB. IDH mutant cases 
display a more favorable prognosis than IDH-wild type ones. The anaplastic 
lymphoma kinase (ALK) is expressed as a receptor tyrosine kinase in both the 
developing central and peripheral nervous systems. Increasing lines of evidence 
suggest that ALK is over-expressed in GB and represents a potential therapeutic 
target.
OBJECTIVES: The goal of the current study was to investigate 