# BioSearch Scraper

MBDS Universidad de Navarra 2023/2024

Paula Sanjuan Campos


## 1. Introduction

In this project, data from scientific articles will be obtained from the PubMed database through its API. PubMed is a leading database for biomedical and health sciences literature, offering a vast collection of articles, books, reviews, and other resources. PubMed was developed and is maintained by the National Center for Biotechnology Information (NCBI). This project gathers detailed information on scientific articles (title, keywords, publication date, authors).

The *Entrez* API, provided by the U.S. National Library of Medicine (NLM), is used. This API allows access to the PubMed database, enabling advanced queries and retrieval of detailed information about scientific articles. The *Entrez* API offers a wide range of functions that allow users to search for articles using search terms, filters, article identifiers, retrieve specific article information, and more. The *Entrez* *E-utilities* use a fixed URL syntax to translate a standard set of input parameters into the necessary values for searching and retrieving the requested data.
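To make the fixed URL syntax concrete, the following minimal sketch assembles an ESearch request URL by hand. The helper name `build_eutils_url` is hypothetical and only illustrates the idea; the project itself goes through *Biopython*, which builds these URLs internally.

```python
from urllib.parse import urlencode

# Base address of the E-utilities; each utility is a separate .fcgi endpoint.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(utility, **params):
    """Return the request URL for an E-utility such as 'esearch' or 'efetch'."""
    # urlencode escapes special characters (a space becomes '+').
    return f"{EUTILS_BASE}/{utility}.fcgi?{urlencode(params)}"


url = build_eutils_url("esearch", db="pubmed", term="brain cancer", retmax=10)
print(url)
```

The same helper would work for the other utilities by changing the first argument, since they all share the `utility.fcgi?parameter=value` pattern.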

As demonstrated in the project code, *Entrez* provided access to 40 databases at the time of execution, covering a variety of biomedical data, including nucleotide and protein sequences, genetic records, three-dimensional molecular structures, and biomedical literature. This project focuses on the latter, the biomedical literature database.

The usage guidelines for NCBI utilities set a limit of three requests per second. An API key, which can be requested through the NCBI account settings page, raises this limit to 10 requests per second. Furthermore, to minimize the number of requests and thus be more efficient in data retrieval, particularly when dealing with large numbers of records, it is recommended to use the *Entrez History*. With this tool, instead of sending individual requests for each record, users can make a single request for a set of records and then work with that data iteratively or in batches. This is especially useful when working with thousands or even millions of records, as it significantly reduces the number of individual requests needed and the time required to complete the task. The procedure is explained in section 3, although it was not used in this project because I did not retrieve such a large number of records.
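As an illustration of these limits, the sketch below spaces out calls so that they stay under three (or ten) requests per second. The names `min_interval` and `Throttle` are hypothetical. *Biopython* is understood to apply its own delay between requests, so a helper like this would mainly matter for code that calls the E-utilities directly.

```python
import time


def min_interval(has_api_key):
    """Minimum spacing in seconds between requests under the NCBI limits."""
    requests_per_second = 10 if has_api_key else 3
    return 1.0 / requests_per_second


class Throttle:
    """Sleep as needed so that calls to wait() are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval(has_api_key=False))
for _ in range(3):
    throttle.wait()  # call this right before sending each request
```

The clock and sleep functions are injectable only so that the behaviour can be checked without real waiting.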


Another important aspect is avoiding IP blocks when using the *Entrez* services; for this, it is essential to register an email address, which allows NCBI to contact the user about a problem before blocking access. NCBI implements this process to ensure the responsible and equitable use of its resources, as well as to maintain the integrity and availability of its services.


A limitation of the system is that these utilities can only retrieve data that is already in *Entrez*, although the majority of NCBI data is available there. In fact, the search performed in the following code retrieves articles from the past year, and as can be seen in the dataframe, content from this same month can be downloaded.

Regarding system operation, the data records contained in each *Entrez* database are identified by an integer ID called UID (Unique Identifier). The core of the system consists of two tasks: gathering the list of UIDs that match a text query and retrieving a brief summary record called a Document Summary (DocSum) for each UID.


Access to this API is carried out through the *Biopython* library.



### Biopython API Documentation

Biopython API documentation link: https://biopython.org/docs/latest/api/index.html

This API is used not only to access scientific article data but also for bioinformatics purposes. The package used in this project is *Bio.Entrez*, specifically to access the PubMed database (biomedical literature), although other options include *GenBank* (genetic sequences) and *BLAST* (biological sequence analysis).

Since it is a bioinformatics API, the rest of the packages within the API have functionalities related to the retrieval and manipulation of DNA and RNA sequences (*Bio.Seq*), working with protein structures (*Bio.PDB*), obtaining graphical representations of biological data (*Bio.Graphics*), among others.

***Bio.Entrez*** package utilities: 
- https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4
- https://biopython.org/docs/latest/api/Bio.Entrez.html

## 2. Libraries and modules

In [1]:
!pip install biopython
# pip install --upgrade biopython
# pip uninstall biopython






Biopython requires NumPy (automatically installed with Biopython). 

In [2]:
from Bio import Entrez
import pandas as pd

To use the API, an email address must be specified for the requests: `Entrez.email = 'A.N.Other@example.com'`. NCBI uses it to manage requests, to keep the service operating efficiently, and to ensure compliance with the API's usage policies.

In [3]:
Entrez.email = "sanjuansanjuanp@gmail.com"

## 3. API Functions

- `Entrez.email`: User's email address. A string without spaces containing a valid email address. To exceed 3 requests per second, an API key must also be set (`Entrez.api_key`).
- `Entrez.einfo`: Returns a list of available databases. If a database is passed as a parameter, for example, `info = Entrez.einfo(db='pubmed')`, it provides detailed information about the database (available fields).
- `Entrez.esearch`: Performs a search in a specified database based on a query. Returns the article identifiers (list of UIDs): `handle = Entrez.esearch(db=db, term=term)`. When the E-utilities are called directly, special characters in the query must be URL-encoded (for example, a space becomes `+`) and HTTP POST is used for long queries; *Biopython* takes care of the encoding when it builds the request. Optionally, it allows specifying a search filter based on date.
- `Entrez.esummary`: Retrieves the summaries of the articles (*DocSum*) found using the identifiers (parameter: id). Without needing to retrieve the full records, it provides the key information.
- `Entrez.efetch`: Retrieves the full records of the articles (parameter: id), in the format chosen with `rettype` and `retmode`; for PubMed this includes the abstract rather than the full text. To display the result, call `.read()` on the returned handle.
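The optional date filter of `Entrez.esearch` is expressed through the E-utilities parameters `datetype`, `mindate`, and `maxdate`. The sketch below builds those arguments from Python dates; the helper name `date_filter_params` is hypothetical, and this is an alternative to embedding the date in the query term rather than what the project code does.

```python
from datetime import date


def date_filter_params(start, end, datetype="pdat"):
    """Build the optional date-range arguments for an ESearch request."""
    if start > end:
        raise ValueError("start must not be later than end")
    return {
        "datetype": datetype,  # 'pdat' = publication date
        "mindate": start.strftime("%Y/%m/%d"),
        "maxdate": end.strftime("%Y/%m/%d"),
    }


params = date_filter_params(date(2023, 1, 1), date(2023, 12, 31))
print(params)

# The dictionary could then be passed along with the query, for example:
# Entrez.esearch(db="pubmed", term="brain cancer", **params)
```

Both `mindate` and `maxdate` are supplied together because the E-utilities expect a complete range.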

As mentioned earlier, *Entrez* allows temporarily storing sets of UIDs in a history. These sets are kept on the *Entrez History* server, where each set of UIDs is identified by a query key within a web environment. To use this service, the list of UIDs is first obtained with the previously mentioned `esearch` function. The list is then uploaded to the history server with `Entrez.epost`, which returns the web environment (`WebEnv`) and the query key that identify the stored set. Alternatively, calling `esearch` with `usehistory="y"` stores the search results on the server directly. To access the records stored in the history, `esummary` and `efetch` accept the `webenv` and `query_key` values as optional parameters in place of a list of ids.
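The batch retrieval that the history server enables can be sketched as follows. The helpers `batch_ranges` and `fetch_all` are hypothetical names, and the batch size of 200 is an arbitrary assumption. The part that talks to NCBI is left as comments because it needs network access; it was not run as part of this project.

```python
def batch_ranges(total, batch_size):
    """Yield (retstart, retmax) pairs that together cover `total` records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def fetch_all(fetch_batch, total, batch_size=200):
    """Call fetch_batch(retstart, retmax) once per batch and collect the results."""
    return [fetch_batch(start, size) for start, size in batch_ranges(total, batch_size)]


print(list(batch_ranges(450, 200)))

# Possible use with Biopython (commented out because it requires network access):
# from Bio import Entrez
# record = Entrez.read(Entrez.esearch(db="pubmed", term="brain cancer", usehistory="y"))
# def fetch_batch(retstart, retmax):
#     handle = Entrez.efetch(db="pubmed", rettype="abstract", retmode="text",
#                            webenv=record["WebEnv"], query_key=record["QueryKey"],
#                            retstart=retstart, retmax=retmax)
#     return handle.read()
# abstracts = fetch_all(fetch_batch, total=int(record["Count"]))
```

Each call then retrieves a slice of the stored result set, so the number of requests grows with the number of batches rather than with the number of records.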


## 4. Code

The functionality of the following code consists of:

- Obtaining the available databases with this API.
- Retrieving detailed information about the PubMed database.
- Performing a search for up to *n* articles using a series of keywords and a publication date filter (`[PDAT]`).
- Getting a list with their IDs and a list of dictionaries with their IDs, titles, authors, publication date, DOI code, and a link to the article's PubMed page.
- Selecting one of the resulting articles to print its abstract on the screen.

In [4]:
def available_dbs():
    # Get the databases available through the Entrez API
    info = Entrez.einfo()
    record_info = Entrez.read(info)
    databases = record_info["DbList"]
    return databases

def pubmed_info(): 
    # Get detailed information about the PubMed database
    info = Entrez.einfo(db="pubmed")
    record_info = Entrez.read(info)
    db_info = record_info['DbInfo']
    df_main = {
        "Dbname": db_info["DbName"],
        "Menuname": db_info["MenuName"],
        "Description": db_info["Description"],
        "Dbbuild": db_info["DbBuild"],
        "Count": db_info["Count"],
        "Lastupdate": db_info["LastUpdate"],
    }
    df_field_list = pd.DataFrame(db_info["FieldList"])
    return df_main, df_field_list

In [5]:
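def search_articles(key_words, date, max_results=10):
    # Search PubMed for articles matching every keyword in the title or abstract
    if isinstance(key_words, str):
        key_words = [key_words]
    keywords_term = " AND ".join(f"{word}[Title/Abstract]" for word in key_words)
    term = f"({keywords_term}) AND {date}[PDAT]"
    search = Entrez.esearch(db="pubmed", term=term, retmax=max_results)
    record_search = Entrez.read(search)
    search.close()
    return record_search["IdList"]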

def article_details(id_list):
    # Retrieve the document summary (DocSum) of each article and keep the key fields
    summary = Entrez.esummary(db="pubmed", id=",".join(id_list))
    record_summary = Entrez.read(summary)
    articles_info = []
    for article in record_summary:
        article_info = {
            "ID": article["Id"],
            "Title": article["Title"],
            "AuthorList": article["AuthorList"],
            "PubDate": article["PubDate"],
            "DOI": article.get("DOI", "No disponible"),
            "Link PubMed": f"https://www.ncbi.nlm.nih.gov/pubmed/{article['Id']}"
        }
        articles_info.append(article_info)
    return articles_info

def article_abstract(article_id):
    # Fetch the record of one article as plain text (citation and abstract) and print it
    fetch = Entrez.efetch(db="pubmed", id=article_id, rettype="abstract", retmode="text")
    abstract = fetch.read()
    print(abstract)


In [6]:
# Get the available databases
print("Available databases in Entrez:")
print(available_dbs())
print('-'*20)

# Get detailed information about PubMed
df_main, df_field_list = pubmed_info()
print('Main information in PubMed:\n', df_main)
print('\ndatabase info:\n')
df_field_list.head()

Available databases in Entrez:
['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'medgen', 'mesh', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']
--------------------
Main information in PubMed:
 {'Dbname': 'pubmed', 'Menuname': 'PubMed', 'Description': 'PubMed bibliographic record', 'Dbbuild': 'Build-2024.09.06.23.59', 'Count': '37708996', 'Lastupdate': '2024/09/06 23:59'}

database info:



| | Name | FullName | Description | TermCount | IsDate | IsNumerical | SingleToken | Hierarchy | IsHidden |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ALL | All Fields | All terms from all searchable fields | | N | N | N | N | N |
| 1 | UID | UID | Unique number assigned to publication | | N | Y | Y | N | Y |
| 2 | FILT | Filter | Limits the records | | N | N | Y | N | N |
| 3 | TITL | Title | Words in title of publication | | N | N | N | N | N |
| 4 | MESH | MeSH Terms | Medical Subject Headings assigned to publication | | N | N | Y | Y | N |


In [7]:
# Search for articles
key_words = ["cancer", "brain"]
date = "2023"
print(f"\nSearch for articles on '{key_words}' published from {date}:")
id_list = search_articles(key_words, date)
print("IDs of the articles found: ", id_list)


Search for articles on '['cancer', 'brain']' published from 2023:
IDs of the articles found:  ['38370347', '38328712', '38303306', '38260227', '38201564', '38187734', '38175350', '38149244', '38145439', '38142850']


In [8]:
# Get detailed information about the articles in a dataframe
articles_info = article_details(id_list)
print("\nArticles detailed information:")
df = pd.DataFrame(articles_info)
df


Articles detailed information:


| | ID | Title | AuthorList | PubDate | DOI | Link PubMed |
|---|---|---|---|---|---|---|
| 0 | 38370347 | Neurotoxicity-sparing radiotherapy for brain m... | [Buczek D, Zaucha R, Jassem J] | 2023 | 10.3389/fonc.2023.1215426 | https://www.ncbi.nlm.nih.gov/pubmed/38370347 |
| 1 | 38328712 | Experimental models for cancer brain metastasis. | [Liu Z, Dong S, Liu M, Liu Y, Ye Z, Zeng J, Ya... | 2024 Jan | 10.1016/j.cpt.2023.10.005 | https://www.ncbi.nlm.nih.gov/pubmed/38328712 |
| 2 | 38303306 | [A Case of Breast Cancer Brain Metastases Succ... | [Hikino H, Otani A, Makino Y, Murata Y] | 2023 Dec | No disponible | https://www.ncbi.nlm.nih.gov/pubmed/38303306 |
| 3 | 38260227 | Effects of Ataxia-Telangiectasia Mutated Varia... | [Floyd W, Carpenter D, Vaios E, Shenker R, Hen... | 2024 Jan | 10.1016/j.adro.2023.101320 | https://www.ncbi.nlm.nih.gov/pubmed/38260227 |
| 4 | 38201564 | Stereotactic Radiosurgery for Women Older than... | [Upadhyay R, Klamer BG, Perlow HK, White JR, B... | 2023 Dec 27 | 10.3390/cancers16010137 | https://www.ncbi.nlm.nih.gov/pubmed/38201564 |
| 5 | 38187734 | Discovery of novel brain permeable human ACSS2... | [Esquea E, Ciraku L, Young RG, Merzy J, Talari... | 2023 Dec 23 | 10.1101/2023.12.22.573073 | https://www.ncbi.nlm.nih.gov/pubmed/38187734 |
| 6 | 38175350 | Breast Cancer Brain Metastases: Achilles' Heel... | [Ferraro E, Seidman AD] | 2023 | 10.1007/978-3-031-33602-7_11 | https://www.ncbi.nlm.nih.gov/pubmed/38175350 |
| 7 | 38149244 | Unlocking molecular mechanisms and identifying... | [Najjary S, de Koning W, Kros JM, Mustafa DAM] | 2023 | 10.3389/fimmu.2023.1305644 | https://www.ncbi.nlm.nih.gov/pubmed/38149244 |
| 8 | 38145439 | Air quality and cancer risk in the All of Us R... | [Craver A, Luo J, Kibriya MG, Randorf N, Bahl ... | 2024 May | 10.1007/s10552-023-01823-7 | https://www.ncbi.nlm.nih.gov/pubmed/38145439 |
| 9 | 38142850 | Microtubule destabilising activity of selected... | [Perużyńska M, Birger R, Piotrowska K, Kwiecie... | 2024 Feb 5 | 10.1016/j.ejphar.2023.176308 | https://www.ncbi.nlm.nih.gov/pubmed/38142850 |


In [9]:
# Select one article and print its abstract
if id_list:
    article_id = id_list[0]
    print("\nFirst article found summary:")
    article_abstract(article_id)
else:
    print("\nNo articles found.")


First article found summary:
1. Front Oncol. 2024 Feb 2;13:1215426. doi: 10.3389/fonc.2023.1215426.
eCollection  2023.

Neurotoxicity-sparing radiotherapy for brain metastases in breast cancer: a 
narrative review.

Buczek D(#)(1), Zaucha R(#)(1), Jassem J(1).

Author information:
(1)Department of Oncology and Radiotherapy, Medical University of Gdańsk, 
Gdańsk, Poland.
(#)Contributed equally

Breast cancer brain metastasis (BCBM) has a devastating impact on patient 
survival, cognitive function and quality of life. Radiotherapy remains the 
standard management of BM but may result in considerable neurotoxicity. Herein, 
we describe the current knowledge on methods for reducing radiation-induced 
cognitive dysfunction in patients with BCBM. A better understanding of the 
biology and molecular underpinnings of BCBM, as well as more sophisticated 
prognostic models and individualized treatment approaches, have appeared to 
enable more effective neuroprotection. The therapeutic armamenta