PubMed is a free resource supporting the search and retrieval of biomedical and life sciences literature with the aim of improving health–both globally and personally.
The PubMed database contains more than 35 million citations and abstracts of biomedical literature. It does not include full text journal articles; however, links to the full text are often present when available from other sources, such as the publisher's website or PubMed Central (PMC).
Available to the public online since 1996, PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI), at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).



*   Install the PubMed Python API package
*   Get abstracts
*   Clean texts
*   Save data


Install the package if it has not been installed before.

In [None]:
#! pip install metapub
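
The check described above can be automated. A minimal sketch, assuming the standard-library `importlib.util.find_spec` is an acceptable test for "already installed"; the helper names `is_installed` and `ensure_installed` are hypothetical and not part of the original notebook:

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name):
    """Return True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(package_name):
    """Install the package with pip only if it is not importable yet.

    Assumes the import name equals the pip name, which holds for metapub.
    """
    if not is_installed(package_name):
        # Use the running interpreter's pip so the package lands in this environment.
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
```

Calling `ensure_installed("metapub")` once at the top of the notebook would then replace the commented-out `pip` line.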

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('gdrive')
import os

In [None]:
data_dir = "/content/gdrive/My Drive/DagDataScienceMaterial/data_folder/TextFolder/"


In [None]:
# Fetch the abstracts of articles matching the keywords


def get_pubmed_abs(keywords, num_article):
  """Return a DataFrame with the PMID and abstract of up to num_article matches."""
  from metapub import PubMedFetcher
  fetch = PubMedFetcher()
  pmids = fetch.pmids_for_query(keywords, retmax=num_article)
  abstracts = {}
  for pmid in pmids:
    # the abstract may be missing (None) for some articles
    abstracts[pmid] = fetch.article_by_pmid(pmid).abstract
  return pd.DataFrame(list(abstracts.items()), columns=['pmid', 'Abstract'])

def get_pubmed_title(keywords, num_article):
  """Return a DataFrame with the PMID and title of up to num_article matches."""
  from metapub import PubMedFetcher
  fetch = PubMedFetcher()
  pmids = fetch.pmids_for_query(keywords, retmax=num_article)
  titles = {}
  for pmid in pmids:
    titles[pmid] = fetch.article_by_pmid(pmid).title
  return pd.DataFrame(list(titles.items()), columns=['pmid', 'Title'])
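
The two functions above each run the query and download every article separately, so requesting both fields doubles the network traffic. One possible alternative, sketched here with a hypothetical helper `get_pubmed_fields` that takes an already constructed fetcher (for example `PubMedFetcher()`), reads several attributes from each article in a single pass:

```python
import pandas as pd


def get_pubmed_fields(fetch, keywords, num_article, fields=("title", "abstract")):
    """Return one row per PMID with the requested article attributes.

    `fetch` is assumed to offer pmids_for_query() and article_by_pmid(),
    as metapub's PubMedFetcher does in the functions above.
    """
    rows = []
    for pmid in fetch.pmids_for_query(keywords, retmax=num_article):
        article = fetch.article_by_pmid(pmid)  # one download per article
        row = {"pmid": pmid}
        for field in fields:
            # Missing attributes become None so dropna() can remove them later.
            row[field] = getattr(article, field, None)
        rows.append(row)
    return pd.DataFrame(rows, columns=["pmid", *fields])
```

A call such as `get_pubmed_fields(PubMedFetcher(), "covid", 300)` would return the columns `pmid`, `title` and `abstract`. Note that the column names here are the lowercase attribute names, not the capitalized `Abstract` and `Title` used elsewhere in this notebook.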

In [None]:
keyword = "covid"
num_article = 300

In [None]:
df_abs = get_pubmed_abs(keyword, num_article)

In [None]:
df_abs.dropna(inplace = True)
df_abs.head()

In [None]:
# Remove common section labels from structured abstracts
for label in ["INTRODUCTION:", "IMPORTANCE:", "BACKGROUND:"]:
  df_abs["Abstract"] = df_abs["Abstract"].str.replace(label, "", regex=False)
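
Structured abstracts use many more section labels than the three removed above (for example METHODS, RESULTS, CONCLUSIONS). A hedged sketch of a more general approach, assuming labels are written in capitals and end with a colon; the pattern and the name `strip_section_labels` are illustrative, and the pattern may also remove other capitalized tokens that happen to precede a colon:

```python
import re

import pandas as pd

# One or more capitalized words (optionally joined by spaces, "/" or "&") ending in a colon.
LABEL_PATTERN = re.compile(r"\b[A-Z][A-Z/& ]*[A-Z]:\s*")


def strip_section_labels(abstracts):
    """Remove labels such as 'BACKGROUND:' from a Series of abstract strings."""
    return abstracts.str.replace(LABEL_PATTERN, "", regex=True)


sample = pd.Series(["BACKGROUND: Covid spread fast. METHODS: We surveyed 10 sites."])
print(strip_section_labels(sample)[0])
```

Because the pattern is a heuristic, it is worth checking a few cleaned abstracts by eye before applying it to the full column.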

In [None]:
df_abs.head()

**Clean text: remove punctuation**

In [None]:
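def cleanup_text(text):
    """Keep only letters and digits, separated by single spaces."""
    import re
    # replace every non-alphanumeric character (punctuation, newlines, tabs) with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # collapse repeated spaces and trim the ends
    text = re.sub(r' +', ' ', text).strip()
    return text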
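
The same cleanup can be expressed more compactly with pandas string methods applied to the whole column. A minimal sketch, assuming the column holds only strings (rows without an abstract already dropped); `cleanup_series` is a hypothetical name, and this variant also trims leading and trailing spaces:

```python
import pandas as pd


def cleanup_series(texts):
    """Replace non-alphanumeric characters with spaces and normalize spacing."""
    return (
        texts.str.replace(r"[^a-zA-Z0-9]", " ", regex=True)  # punctuation, newlines, tabs
        .str.replace(r" +", " ", regex=True)  # collapse repeated spaces
        .str.strip()  # drop leading/trailing spaces
    )


sample = pd.Series(["COVID-19: a\nreview.", "  Long-term   effects?  "])
print(cleanup_series(sample).tolist())
```

Note that hyphenated terms such as "COVID-19" are split into separate tokens by this rule, which may or may not suit the downstream analysis.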

**Save text data to a CSV file**

In [None]:
df_abs["Abstract"] = df_abs["Abstract"].apply(cleanup_text)

In [None]:
df_abs.to_csv(os.path.join(data_dir, "pubmed_abs.csv"), index = False)
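
When the file is read back later, `pd.read_csv` infers column types, so a `pmid` column stored as strings comes back as integers unless a dtype is given. A small round-trip sketch, using a temporary directory in place of `data_dir` and made-up rows (the identifiers are not real PubMed records):

```python
import os
import tempfile

import pandas as pd

# Made-up rows standing in for df_abs; the PMIDs are not real records.
df_demo = pd.DataFrame(
    {"pmid": ["00123", "456"], "Abstract": ["first text", "second text"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "pubmed_abs.csv")
    df_demo.to_csv(csv_path, index=False)
    # dtype=str keeps identifiers exactly as written (no integer conversion).
    df_back = pd.read_csv(csv_path, dtype={"pmid": str})

print(df_back["pmid"].tolist())
```

Without the `dtype` argument the first identifier would lose its leading zeros, which matters if the PMIDs are later used as join keys.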