---

## Easy Approach

This approach uses metapub. Its drawback is that a summary of the article cannot be easily retrieved.

In [1]:
# !pip install metapub

In [2]:
import pandas as pd
from metapub import PubMedFetcher
fetch = PubMedFetcher()



In [3]:
keyword = 'diabetes'
articles = fetch.pmids_for_query(keyword, retmax=10)  # Adjust retmax as needed to control the number of articles retrieved


In [4]:
articles

['37209023',
 '37209021',
 '37208984',
 '37208981',
 '37208883',
 '37208881',
 '37208873',
 '37208867',
 '37208852',
 '37208842']

In [5]:
articles = ['26144594',
'29527188',
'32082174']

In [6]:
# Initialize lists to hold abstracts and summaries
abstracts = []
summaries = []

# Iterate over each PubMed ID
for pmid in articles:
    # Fetch the article
    article = fetch.article_by_pmid(pmid)
    
    # Get the abstract and summary of the article
    if article is not None:
        if article.abstract is not None:
            abstracts.append(article.abstract)
        else:
            print(pmid, 'Abstract not available.')
        
        if article.content is not None:
            summaries.append(article.content)


29527188 Abstract not available.


In [7]:
len(abstracts)


2

In [8]:
abstracts

['AIMS/HYPOTHESIS: Glucagon-like peptide-1 (GLP-1) is an incretin hormone derived from proglucagon, which is released from intestinal L-cells and increases insulin secretion in a glucose dependent manner. GPR119 is a lipid derivative receptor present in L-cells, believed to play a role in the detection of dietary fat. This study aimed to characterize the responses of primary murine L-cells to GPR119 agonism and assess the importance of GPR119 for the detection of ingested lipid.\nMETHODS: GLP-1 secretion was measured from murine primary cell cultures stimulated with a panel of GPR119 ligands. Plasma GLP-1 levels were measured in mice lacking GPR119 in proglucagon-expressing cells and controls after lipid gavage. Intracellular cAMP responses to GPR119 agonists were measured in single primary L-cells using transgenic mice expressing a cAMP FRET sensor driven by the proglucagon promoter.\nRESULTS: L-cell specific knockout of GPR119 dramatically decreased plasma GLP-1 levels after a lipid 

In [9]:
print(summaries[0])

<Element PubmedArticle at 0x223f2dedf40>
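
The output above shows that `article.content` is a parsed XML element rather than readable text. As a rough sketch of how text could be pulled out of such an element, the cell below walks a small XML fragment with the standard library. The fragment is hand-written placeholder data shaped like a PubMed record, and `abstract_sections` is a hypothetical helper; whether the same traversal works unchanged on the object metapub returns is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written fragment shaped like a PubMed record (not real data).
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>0</PMID>
    <Article>
      <ArticleTitle>Placeholder title</ArticleTitle>
      <Abstract>
        <AbstractText Label="AIMS">First section.</AbstractText>
        <AbstractText Label="METHODS">Second section.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
</PubmedArticle>
""".strip()


def abstract_sections(element):
    """Return (label, text) pairs for every AbstractText node under element."""
    sections = []
    for node in element.iter("AbstractText"):
        # itertext() also collects text nested inside inline markup.
        text = "".join(node.itertext()).strip()
        sections.append((node.get("Label"), text))
    return sections


root = ET.fromstring(SAMPLE_XML)
sections = abstract_sections(root)
for label, text in sections:
    print(label, text)
```

Unlabelled abstracts would simply yield `None` as the label.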


---

## Alternative Approach

In [10]:
# !pip install biopython


In [11]:
from Bio import Entrez

Entrez.email = "noctis@bu.edu"

# Use the 'esearch' function to get the ids of articles that contain the keyword 'diabetes'
search_results = Entrez.read(Entrez.esearch(db="pubmed", term="diabetes", retmax=1))

idlist = search_results["IdList"]

# Use 'efetch' to get the details of the articles
fetch_results = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text")

# Read the results
abstracts = fetch_results.read()

print(abstracts)


ModuleNotFoundError: No module named 'Bio'

In [None]:
# Use 'efetch' to get the details of the articles
summary_results = Entrez.efetch(db="pubmed", id=idlist, rettype="summary", retmode="text")

In [None]:
print(summary_results.read())

-----

https://github.com/BlueBrain/Search/issues/460

In [1]:
import csv
import re
import urllib.request
from time import sleep

In [2]:
query = 'cancer'

# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'

# esearch specific settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
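# note: this setting does not switch the output to JSON; the response below is XML
# (E-utilities selects JSON with retmode=json)
search_rettype = '&rettype=json'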

In [3]:
search_url = base_url+search_eutil+db+search_term+search_usehistory+search_rettype
print(search_url)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer&usehistory=y&rettype=json
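
String concatenation works here because the query is a single word. As a hedged alternative, the sketch below builds the same kind of esearch URL with `urllib.parse.urlencode`, which escapes spaces and special characters. `build_esearch_url` is a hypothetical helper name, and the `https` form of the host is an assumption rather than what the notebook uses.

```python
from urllib.parse import urlencode

# Assumption: the https form of the E-utilities host used in this notebook.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, use_history=True):
    """Build an esearch URL for PubMed, escaping the search term."""
    params = {"db": "pubmed", "term": term}
    if use_history:
        # Ask the server to keep the result set so efetch can page through it.
        params["usehistory"] = "y"
    return BASE_URL + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("high functioning autism")
print(url)
```

With this helper the query can be written with plain spaces instead of manually inserted `+` signs.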


In [4]:
f = urllib.request.urlopen(search_url)
search_data = f.read().decode('utf-8')

Quick inspection on what we got:

In [5]:
search_data

'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">\n<eSearchResult><Count>4879606</Count><RetMax>20</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>MCID_648da6d6bdac8b45df3f7297</WebEnv><IdList>\n<Id>37328917</Id>\n<Id>37328908</Id>\n<Id>37328902</Id>\n<Id>37328893</Id>\n<Id>37328890</Id>\n<Id>37328883</Id>\n<Id>37328880</Id>\n<Id>37328876</Id>\n<Id>37328875</Id>\n<Id>37328872</Id>\n<Id>37328860</Id>\n<Id>37328857</Id>\n<Id>37328854</Id>\n<Id>37328852</Id>\n<Id>37328851</Id>\n<Id>37328842</Id>\n<Id>37328838</Id>\n<Id>37328835</Id>\n<Id>37328828</Id>\n<Id>37328826</Id>\n</IdList><TranslationSet><Translation>     <From>cancer</From>     <To>"cancer\'s"[All Fields] OR "cancerated"[All Fields] OR "canceration"[All Fields] OR "cancerization"[All Fields] OR "cancerized"[All Fields] OR "cancerous"[All Fields] OR "neoplasms"[MeSH Terms] OR "neoplasms"[All Fiel

In [6]:
# obtain total abstract count (raw strings avoid invalid-escape warnings)
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])

# obtain webenv and querykey settings for efetch command
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]

In [7]:
total_abstract_count

4879606

In [8]:
fetch_webenv

'&WebEnv=MCID_648da6d6bdac8b45df3f7297'

In [9]:
fetch_querykey

'&query_key=1'
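
The regular expressions above do the job, but the same three values can also be read with an XML parser. The following is a minimal sketch using the standard library; `SAMPLE` is an abbreviated copy of the response shown earlier (only two of the ids), and `parse_esearch` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Abbreviated version of the esearch response printed above.
SAMPLE = (
    "<eSearchResult><Count>4879606</Count><RetMax>20</RetMax><RetStart>0</RetStart>"
    "<QueryKey>1</QueryKey><WebEnv>MCID_648da6d6bdac8b45df3f7297</WebEnv>"
    "<IdList><Id>37328917</Id><Id>37328908</Id></IdList></eSearchResult>"
)


def parse_esearch(xml_text):
    """Extract the count, history keys, and id list from an esearch XML reply."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
        "ids": [node.text for node in root.findall("IdList/Id")],
    }


result = parse_esearch(SAMPLE)
print(result["count"], result["query_key"], result["webenv"])
```

A parser is less sensitive to whitespace or attribute changes in the response than pattern matching on the raw string.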

In [10]:
# other efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 100
retstart = 0
fetch_retstart = "&retstart=" + str(retstart)
fetch_retmax = "&retmax=" + str(retmax)
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"

In [11]:
fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
print(fetch_url)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_648da6d6bdac8b45df3f7297&retstart=0&retmax=100&retmode=text&rettype=abstract


In [12]:
f = urllib.request.urlopen(fetch_url)
fetch_data = f.read().decode('utf-8')

In [13]:
fetch_data[1:3000]

". J Genet Couns. 2023 Jun 16. doi: 10.1002/jgc4.1738. Online ahead of print.\n\nInterventions to support decision making in people considering germline genetic \ntesting for BRCA 1/2 pathogenic and likely pathogenic variants: A scoping \nreview.\n\nPozzar RA(1), Seven M(2).\n\nAuthor information:\n(1)Phyllis F. Cantor Center for Research in Nursing and Patient Care Services, \nDana-Farber Cancer Institute, Boston, Massachusetts, USA.\n(2)Elaine Marieb College of Nursing, University of Massachusetts, Amherst, \nMassachusetts, USA.\n\nPathogenic and likely pathogenic variants in BRCA1 and BRCA2 (BRCA1/2) are \nmedically actionable and may inform hereditary breast and ovarian cancer (HBOC) \ntreatment and prevention. However, rates of germline genetic testing (GT) in \npeople with and without cancer are suboptimal. Individuals' knowledge, \nattitudes, and beliefs may influence GT decisions. While genetic counseling (GC) \nprovides decision support, the supply of genetic counselors is ins

In [14]:
# splits the data into individual abstracts
abstracts = fetch_data.split("\n\n\n")
len(abstracts)

99

In [15]:
# For inspection
# print out the first abstract
abstracts[0]

"1. J Genet Couns. 2023 Jun 16. doi: 10.1002/jgc4.1738. Online ahead of print.\n\nInterventions to support decision making in people considering germline genetic \ntesting for BRCA 1/2 pathogenic and likely pathogenic variants: A scoping \nreview.\n\nPozzar RA(1), Seven M(2).\n\nAuthor information:\n(1)Phyllis F. Cantor Center for Research in Nursing and Patient Care Services, \nDana-Farber Cancer Institute, Boston, Massachusetts, USA.\n(2)Elaine Marieb College of Nursing, University of Massachusetts, Amherst, \nMassachusetts, USA.\n\nPathogenic and likely pathogenic variants in BRCA1 and BRCA2 (BRCA1/2) are \nmedically actionable and may inform hereditary breast and ovarian cancer (HBOC) \ntreatment and prevention. However, rates of germline genetic testing (GT) in \npeople with and without cancer are suboptimal. Individuals' knowledge, \nattitudes, and beliefs may influence GT decisions. While genetic counseling (GC) \nprovides decision support, the supply of genetic counselors is in

In [16]:
split_abstract = abstracts[1].split("\n\n")
split_abstract

['2. Hum Genomics. 2023 Jun 16;17(1):53. doi: 10.1186/s40246-023-00482-8.',
 'Evaluation of a genetic risk score computed using human chromosomal-scale length \nvariation to predict breast cancer.',
 'Ko C(1), Brody JP(2).',
 'Author information:\n(1)Department of Biomedical Engineering, University of California, Irvine, USA.\n(2)Department of Biomedical Engineering, University of California, Irvine, USA. \njpbrody@uci.edu.',
 'INTRODUCTION: The ability to accurately predict whether a woman will develop \nbreast cancer later in her life, should reduce the number of breast cancer \ndeaths. Different predictive models exist for breast cancer based on family \nhistory, BRCA status, and SNP analysis. The best of these models has an accuracy \n(area under the receiver operating characteristic curve, AUC) of about 0.65. We \nhave developed computational methods to characterize a genome by a small set of \nnumbers that represent the length of segments of the chromosomes, called \nchromosomal-

In [17]:
len(split_abstract)

7
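
Splitting on blank lines gives positional fields, and the number of fields varies between records. As a hedged sketch, the cell below pulls a few values out of the first three parts shown above by pattern rather than by position alone. `summarize` is a hypothetical helper, and the assumption that the citation line comes first and the title second is taken from the two records inspected here, not from a documented format.

```python
import re

# First three parts of the record shown above.
parts = [
    "2. Hum Genomics. 2023 Jun 16;17(1):53. doi: 10.1186/s40246-023-00482-8.",
    "Evaluation of a genetic risk score computed using human chromosomal-scale length \n"
    "variation to predict breast cancer.",
    "Ko C(1), Brody JP(2).",
]


def summarize(parts):
    """Pull the running index, DOI, and unwrapped title from a split record."""
    citation = parts[0]
    index_match = re.match(r"(\d+)\.\s", citation)
    doi_match = re.search(r"doi:\s*(\S+)", citation)
    return {
        "index": int(index_match.group(1)) if index_match else None,
        # Drop the sentence-ending period that follows the DOI.
        "doi": doi_match.group(1).rstrip(".") if doi_match else None,
        # Collapse the hard line wraps inside the title.
        "title": " ".join(parts[1].split()) if len(parts) > 1 else None,
    }


info = summarize(parts)
print(info)
```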

In [None]:
import csv
import re
import urllib.request
from time import sleep

query = "high+functioning+autism"

# common settings between esearch and efetch
base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
db = 'db=pubmed'

# esearch settings
search_eutil = 'esearch.fcgi?'
search_term = '&term=' + query
search_usehistory = '&usehistory=y'
# note: this setting does not switch the output to JSON; the response is parsed as XML below
search_rettype = '&rettype=json'

# call the esearch command for the query and read the web result
search_url = base_url+search_eutil+db+search_term+search_usehistory+search_rettype
print("this is the esearch command:\n" + search_url + "\n")
f = urllib.request.urlopen(search_url)
search_data = f.read().decode('utf-8')

# extract the total abstract count
total_abstract_count = int(re.findall(r"<Count>(\d+?)</Count>", search_data)[0])

# efetch settings
fetch_eutil = 'efetch.fcgi?'
retmax = 20
retstart = 0
fetch_retmode = "&retmode=text"
fetch_rettype = "&rettype=abstract"

# obtain webenv and querykey settings from the esearch results
fetch_webenv = "&WebEnv=" + re.findall(r"<WebEnv>(\S+)</WebEnv>", search_data)[0]
fetch_querykey = "&query_key=" + re.findall(r"<QueryKey>(\d+?)</QueryKey>", search_data)[0]

# call efetch commands using a loop until all abstracts are obtained
run = True
all_abstracts = list()
loop_counter = 1

while run:
    print("this is efetch run number " + str(loop_counter))
    loop_counter += 1
    fetch_retstart = "&retstart=" + str(retstart)
    fetch_retmax = "&retmax=" + str(retmax)
    # create the efetch url
    fetch_url = base_url+fetch_eutil+db+fetch_querykey+fetch_webenv+fetch_retstart+fetch_retmax+fetch_retmode+fetch_rettype
    print(fetch_url)
    # open the efetch url
    f = urllib.request.urlopen(fetch_url)
    fetch_data = f.read().decode('utf-8')
    # split the data into individual abstracts
    abstracts = fetch_data.split("\n\n\n")
    # append to the list all_abstracts
    all_abstracts = all_abstracts+abstracts
    print("A total of " + str(len(all_abstracts)) + " abstracts have been downloaded.\n")
    # wait 2 seconds so we don't get blocked
    sleep(2)
    # update retstart to download the next chunk of abstracts
    retstart = retstart + retmax
    if retstart >= total_abstract_count:
        run = False
    
    

this is the esearch command:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=high+functioning+autism&usehistory=y&rettype=json

this is efetch run number 1
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_648daa38cba8930fa66f7385&retstart=0&retmax=20&retmode=text&rettype=abstract
A total of 20 abstracts have been downloaded.

this is efetch run number 2
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_648daa38cba8930fa66f7385&retstart=20&retmax=20&retmode=text&rettype=abstract
A total of 40 abstracts have been downloaded.

this is efetch run number 3
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_648daa38cba8930fa66f7385&retstart=40&retmax=20&retmode=text&rettype=abstract
A total of 60 abstracts have been downloaded.

this is efetch run number 4
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=MCID_

In [20]:
len(all_abstracts)

1400
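
The paging logic in the loop above comes down to stepping `retstart` through the result set in chunks of `retmax`. A minimal sketch of just that arithmetic, with no network access, is shown below; `page_offsets` is a hypothetical helper and the totals are made-up illustration values.

```python
def page_offsets(total, page_size):
    """Return the retstart values needed to cover `total` records."""
    # range() stops before `total`, so no request is made past the last record.
    return list(range(0, total, page_size))


offsets = page_offsets(45, 20)
print(offsets)  # [0, 20, 40]

# When the total is an exact multiple of the page size there is no extra, empty page.
print(page_offsets(40, 20))  # [0, 20]
```

Each offset would then be placed into the `&retstart=` part of the efetch URL, with a pause between requests as in the loop above.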

In [21]:
with open("abstracts.csv", "w", encoding="utf8", newline="") as abstracts_file, open("partial_abstracts.csv", "w", encoding="utf8", newline="") as partial_abstracts:
    # csv writer for full abstracts
    abstract_writer = csv.writer(abstracts_file)
    abstract_writer.writerow(['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc'])
    # csv writer for partial abstracts
    partial_abstract_writer = csv.writer(partial_abstracts)
    # For each abstract, split into categories and write it to the csv file
    for abstract in all_abstracts:
        # To obtain categories, split every double newline.
        split_abstract = abstract.split("\n\n")
        if len(split_abstract) > 5:
            abstract_writer.writerow(split_abstract)
        else:
            partial_abstract_writer.writerow(split_abstract)
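
Because records split into a varying number of parts, rows in the CSV can end up with more or fewer cells than the header. One possible way to keep every row the same width is sketched below, writing to an in-memory buffer instead of a file. `normalize_row` is a hypothetical helper, and folding surplus parts into the last column is an assumption about what is acceptable, not something the original code does.

```python
import csv
import io

HEADER = ['Journal', 'Title', 'Authors', 'Author_Information', 'Abstract', 'DOI', 'Misc']


def normalize_row(fields, width=len(HEADER)):
    """Unwrap each field and pad or fold the row to exactly `width` cells."""
    fields = [" ".join(field.split()) for field in fields]
    if len(fields) > width:
        # Fold any surplus parts into the final column.
        fields = fields[:width - 1] + [" | ".join(fields[width - 1:])]
    return fields + [""] * (width - len(fields))


# Placeholder records standing in for split abstracts.
records = [["a", "b"], list("abcdefghi")]

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(HEADER)
for record in records:
    writer.writerow(normalize_row(record))

rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(rows[1])
print(rows[2])
```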