#PhD Juan José Oropeza Valdez

##Jul 2024

# Abstract Parser for PubMed

This notebook demonstrates how to fetch and process abstracts from PubMed using the BioPython library. The fetched abstracts are then saved into a CSV file. This process includes installing necessary packages, defining functions, and executing the workflow.


## 1. Install Required Packages

First, we need to install the BioPython and Unidecode packages to handle PubMed data and normalize text, respectively.

In [None]:
!pip install biopython
!pip install unidecode



## 2. Set Up BioPython Entrez Email

To use NCBI's Entrez utilities, you must provide an email address so that NCBI can contact you if there are issues with your queries. Replace the placeholder below with your own address.

In [None]:
from Bio import Entrez

Entrez.email = "example@email.com"

## 3. Define Function to Fetch Abstracts

The following function, `fetch_abstracts`, fetches abstracts from PubMed based on a keyword and a maximum count of articles to retrieve. It parses the XML response and extracts relevant information.
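
To make the parsing step concrete, the sketch below walks a hand-written, heavily simplified record that mimics the nesting of PubMed XML. It is only an illustration under stated assumptions: it uses the standard library's `xml.etree.ElementTree` instead of `Entrez.read`, and both the sample XML and the helper name `summarize_article` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written, heavily simplified sample (an assumption for illustration);
# real PubMed records carry many more elements and attributes.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000000</PMID>
      <Article>
        <Journal><Title>Example Journal</Title></Journal>
        <ArticleTitle>An example title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">First part.</AbstractText>
          <AbstractText Label="RESULTS">Second part.</AbstractText>
        </Abstract>
        <ELocationID EIdType="pii">S0000</ELocationID>
        <ELocationID EIdType="doi">10.0000/example</ELocationID>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

def summarize_article(article):
    """Pull a few fields out of one <PubmedArticle> element (hypothetical helper)."""
    citation = article.find("MedlineCitation")
    art = citation.find("Article")
    # Structured abstracts have several AbstractText sections; join them all.
    abstract = " ".join((node.text or "") for node in art.findall("Abstract/AbstractText"))
    # Keep only the ELocationID whose EIdType attribute is "doi".
    doi = next((n.text for n in art.findall("ELocationID") if n.get("EIdType") == "doi"), "")
    return {
        "PMID": citation.findtext("PMID", default=""),
        "Title": art.findtext("ArticleTitle", default=""),
        "Journal": art.findtext("Journal/Title", default=""),
        "Abstract": abstract,
        "DOI": doi,
    }

root = ET.fromstring(SAMPLE_XML)
records = [summarize_article(a) for a in root.findall("PubmedArticle")]
print(records[0])
```

The same path (`MedlineCitation` → `Article` → `Abstract` → `AbstractText`) is what `fetch_abstracts` follows through the dictionary-like objects returned by `Entrez.read`.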

In [15]:
import re
import pandas as pd
import time
from Bio import Entrez
from unidecode import unidecode
from urllib.error import HTTPError, URLError
from http.client import RemoteDisconnected

def fetch_abstracts(keyword, max_count):
    abstracts = []
    batch_size = 400
    retstart = 0
    
    while retstart < max_count:
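        # Request only as many IDs as are still needed, so that max_count is respected
        handle = Entrez.esearch(db="pubmed", term=keyword, retmax=min(batch_size, max_count - retstart), retstart=retstart)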
        record = Entrez.read(handle)
        ids = record["IdList"]
        if not ids:
            break

        for pmid in ids:
            retries = 3
            for i in range(retries):
                try:
                    handle = Entrez.efetch(db="pubmed", id=pmid, rettype="medline", retmode="xml")
                    record = Entrez.read(handle)
                    for article in record['PubmedArticle']:
                        # Join every AbstractText section (structured abstracts have several)
                        abstract = " ".join(str(part) for part in article['MedlineCitation']['Article'].get('Abstract', {}).get('AbstractText', []))
                        title = article['MedlineCitation']['Article'].get('ArticleTitle', '')
                        authors = ", ".join(["{} {}".format(author.get('ForeName', ''), author.get('LastName', '')) for author in article['MedlineCitation']['Article'].get('AuthorList', [])])
                        journal = article['MedlineCitation']['Article'].get('Journal', {}).get('Title', '')
                        pub_date = article['MedlineCitation']['Article'].get('Journal', {}).get('JournalIssue', {}).get('PubDate', {})
                        year = pub_date.get('Year', '')
                        month = pub_date.get('Month', '')
                        day = pub_date.get('Day', '')

                        # Extract DOI correctly
                        doi_list = article['MedlineCitation']['Article'].get('ELocationID', [])
                        doi = ""
                        for id_element in doi_list:
                            if id_element.attributes.get('EIdType') == 'doi':
                                doi = id_element

                        pmid = article['MedlineCitation'].get('PMID', '')
                        volume = article['MedlineCitation']['Article'].get('Journal', {}).get('JournalIssue', {}).get('Volume', '')
                        issue = article['MedlineCitation']['Article'].get('Journal', {}).get('JournalIssue', {}).get('Issue', '')
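                        # Note: CoiStatement holds the conflict-of-interest statement; it is stored under the "COPYRIGHT" key
                        copyright_info = article['MedlineCitation'].get('CoiStatement', '')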

                        # Apply unidecode to the extracted fields
                        title = unidecode(title)
                        authors = unidecode(authors)
                        journal = unidecode(journal)
                        abstract = unidecode(" ".join(abstract) if isinstance(abstract, list) else abstract)
                        copyright_info = unidecode(copyright_info)

                        abstracts.append({
                            "Title": title,
                            "Authors": authors,
                            "Journal": journal,
                            "Year": year,
                            "Month": month,
                            "Day": day,
                            "Volume": volume,
                            "Issue": issue,
                            "Abstract": abstract,
                            "DOI": doi,
                            "PMID": pmid,
                            "COPYRIGHT": copyright_info
                        })
                    break
                except (HTTPError, URLError, RemoteDisconnected) as e:
                    if i < retries - 1:
                        time.sleep(2 ** i)  # Exponential backoff
                        continue
                    else:
                        print(f"Failed to fetch data for PMID {pmid} after {retries} retries. Error: {e}")
                        continue

        retstart += batch_size
        time.sleep(1)  # To avoid overloading the server
    return abstracts
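
The retry logic inside `fetch_abstracts` follows an exponential backoff pattern: wait 1 s, then 2 s, and so on, before giving up. A minimal sketch of that pattern in isolation is shown below. The helper `retry_with_backoff` and the stand-in function `flaky` are hypothetical names used only for illustration; they are not part of BioPython.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Call func(); on a listed exception wait base_delay * 2**attempt and retry.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(retries):
        try:
            return func()
        except exceptions:
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in function that fails twice before succeeding.
calls = {"count": 0}
delays = []  # records the requested waits instead of actually sleeping

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = retry_with_backoff(flaky, exceptions=(ConnectionError,), sleep=delays.append)
print(result, delays)
```

Passing the `sleep` function as a parameter is a design choice that keeps the helper easy to test without real waiting; in the notebook one would leave it at its default, `time.sleep`.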

## 4. Fetch Abstracts for Specific Keywords

We can now fetch abstracts for a list of keywords. In this example, we fetch abstracts for the keywords "Pediatric" and "Newborn". Set the maximum number of abstracts to fetch per keyword (10 here), keeping in mind that larger values take longer to retrieve.

In [None]:
keywords = ["Pediatric", "Newborn"]
all_abstracts = []
for keyword in keywords:
    abstracts = fetch_abstracts(keyword, 10)
    all_abstracts.extend(abstracts)
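
Because the same article can match more than one keyword, the combined list may contain duplicates. One possible way to handle this, sketched below, is to keep only the first record seen for each PMID. The helper name `deduplicate_by_pmid` and the sample records are assumptions made for this example.

```python
def deduplicate_by_pmid(records):
    """Return records in their original order, keeping the first entry per PMID."""
    seen = set()
    unique = []
    for record in records:
        pmid = str(record.get("PMID", ""))
        # Records without a PMID cannot be compared, so they are always kept.
        if pmid and pmid in seen:
            continue
        seen.add(pmid)
        unique.append(record)
    return unique

# Made-up records standing in for the dictionaries built by fetch_abstracts.
sample = [
    {"PMID": "111", "Title": "A"},
    {"PMID": "222", "Title": "B"},
    {"PMID": "111", "Title": "A (found again under a second keyword)"},
]
unique_records = deduplicate_by_pmid(sample)
print(len(unique_records))
```

In the notebook this could be applied as `all_abstracts = deduplicate_by_pmid(all_abstracts)` before saving.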

In [None]:
all_abstracts

[{'Title': 'Early sepsis recognition: Is hypothermia the most neglected symptom?',
  'Authors': 'Georgios Papathanakos, Pedro Povoa, Stijn Blot',
  'Journal': 'Intensive & critical care nursing',
  'Year': '2024',
  'Month': 'Jul',
  'Day': '19',
  'Volume': '84',
  'Issue': '',
  'Abstract': '',
  'DOI': StringElement('10.1016/j.iccn.2024.103776', attributes={'EIdType': 'doi', 'ValidYN': 'Y'}),
  'PMID': StringElement('39032212', attributes={'Version': '1'}),
  'COPYRIGHT': 'Declaration of competing interest Georgios Papathanakos and Pedro Povoa declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Stijn Blot is Editor-in-Chief of Intensive & Critical Care Nursing.'},
 {'Title': 'Social scripts of violence among adolescent girls and young women in Zambia: Exploring how gender norms and social expectations are activated in the aftermath of violence.',
  'Authors': 'Christina Laurenz

## 5. Save Abstracts to CSV

Finally, we save the fetched abstracts to a CSV file for further analysis or reference.

In [None]:
import csv

with open("abstracts.csv", "w", newline="", encoding='utf-8') as csvfile:
    fieldnames = ["Title", "Authors", "Journal", "Month", "Day", "Year", "Volume", "Issue", "Abstract", "DOI", "PMID", "COPYRIGHT"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for abstract in all_abstracts:
        writer.writerow(abstract)


In [None]:
import pandas as pd

df = pd.read_csv("abstracts.csv")
df.head()

Unnamed: 0,Title,Authors,Journal,Month,Day,Year,Volume,Issue,Abstract,DOI,PMID,COPYRIGHT
0,Early sepsis recognition: Is hypothermia the m...,"Georgios Papathanakos, Pedro Povoa, Stijn Blot",Intensive & critical care nursing,Jul,19.0,2024,84.0,,,10.1016/j.iccn.2024.103776,39032212,Declaration of competing interest Georgios Pap...
1,Social scripts of violence among adolescent gi...,"Christina Laurenzi, Chanda Mwamba, Chuma Busak...",Social science & medicine (1982),Jul,16.0,2024,356.0,,Adolescent girls and young women ages 15-24 ex...,10.1016/j.socscimed.2024.117133,39032194,Declaration of competing interests The authors...
2,A case-control study to investigate determinan...,"Raden Ahmad Dedy Mardani, Zuhratul Hajri, Zurr...",Journal for specialists in pediatric nursing :...,Jul,,2024,29.0,3.0,This study aimed to examine determinants of un...,10.1111/jspn.12435,39032153,
3,The most optimal school recruitment strategies...,"Aliye B Cepni, Reshma Vilson, Rachel R Helbing...",Obesity reviews : an official journal of the I...,Jul,20.0,2024,,,This systematic review with the Delphi study a...,10.1111/obr.13808,39032149,
4,Identifying barriers and facilitators for the ...,"Gloria Lau, Roz Walker, Pamela Laird, Philomen...",Journal of paediatrics and child health,Jul,20.0,2024,,,To identify the barriers and facilitators for ...,10.1111/jpc.16626,39032110,


We now have a dataframe, as shown above. However, the month format is not standardized: months appear as three-letter abbreviations, and some date fields are missing. To standardize the dates, we can do the following:

In [None]:
import numpy as np

# Standardizing months: map three-letter abbreviations to two-digit strings
month_map = {
    'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
    'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
    'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12',
}
# Assign the result back instead of using inplace=True on a selected column
df['Month'] = df['Month'].replace(month_map).replace('', np.nan)

# Converting year to numeric (empty or malformed values become NaN)
df['Year'] = pd.to_numeric(df['Year'], errors='coerce')

# Converting day to numeric (empty or malformed values become NaN)
df['Day'] = pd.to_numeric(df['Day'], errors='coerce')
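
As a complementary sketch, the three date parts can also be combined into a single ISO-style string. The function below is a hypothetical helper, not part of the original workflow. It assumes the raw string values as stored by `fetch_abstracts` (for example `'2024'`, `'Jul'`, `'19'`), that is, before the CSV round trip turns some of them into floats.

```python
MONTH_NUMBERS = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

def to_iso_date(year, month, day):
    """Build 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD' from possibly missing parts."""
    year = str(year).strip()
    if not year.isdigit():
        return ""  # without a year there is nothing meaningful to report
    parts = [f"{int(year):04d}"]

    month = str(month).strip()
    # Accept abbreviations ('Jul'), full names ('July') or digits ('7', '07').
    month_number = MONTH_NUMBERS.get(month[:3].title())
    if month_number is None and month.isdigit():
        month_number = int(month)
    if month_number is None or not 1 <= month_number <= 12:
        return parts[0]
    parts.append(f"{month_number:02d}")

    day = str(day).strip()
    if day.isdigit() and 1 <= int(day) <= 31:
        parts.append(f"{int(day):02d}")
    return "-".join(parts)

print(to_iso_date("2024", "Jul", "19"))
print(to_iso_date("2024", "Jul", ""))
```

Falling back to a shorter string when the month or day is missing avoids inventing a date that the record does not actually state.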

Optionally, we can replace NaNs with a placeholder value, in this case "no_data":

In [None]:
# Replacing NaNs with 'no_data'
df.fillna('no_data', inplace=True)

In [None]:
df.head()

Unnamed: 0,Title,Authors,Journal,Month,Day,Year,Volume,Issue,Abstract,DOI,PMID,COPYRIGHT
0,Early sepsis recognition: Is hypothermia the m...,"Georgios Papathanakos, Pedro Povoa, Stijn Blot",Intensive & critical care nursing,7,19.0,2024,84.0,no_data,no_data,10.1016/j.iccn.2024.103776,39032212,Declaration of competing interest Georgios Pap...
1,Social scripts of violence among adolescent gi...,"Christina Laurenzi, Chanda Mwamba, Chuma Busak...",Social science & medicine (1982),7,16.0,2024,356.0,no_data,Adolescent girls and young women ages 15-24 ex...,10.1016/j.socscimed.2024.117133,39032194,Declaration of competing interests The authors...
2,A case-control study to investigate determinan...,"Raden Ahmad Dedy Mardani, Zuhratul Hajri, Zurr...",Journal for specialists in pediatric nursing :...,7,no_data,2024,29.0,3.0,This study aimed to examine determinants of un...,10.1111/jspn.12435,39032153,no_data
3,The most optimal school recruitment strategies...,"Aliye B Cepni, Reshma Vilson, Rachel R Helbing...",Obesity reviews : an official journal of the I...,7,20.0,2024,no_data,no_data,This systematic review with the Delphi study a...,10.1111/obr.13808,39032149,no_data
4,Identifying barriers and facilitators for the ...,"Gloria Lau, Roz Walker, Pamela Laird, Philomen...",Journal of paediatrics and child health,7,20.0,2024,no_data,no_data,To identify the barriers and facilitators for ...,10.1111/jpc.16626,39032110,no_data
