# 3.1 Searching PubMed

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-1-searching-pubmed.ipynb) 

In this notebook, we will show how to use the PubMed API to validate a search query against a set of seed studies. We will then use a large language model to generate a search query for a systematic review and validate it against the same set of seed studies.

The first cell below sets up the environment and installs the required packages.

In [1]:
!pip install requests huggingface transformers torch accelerate -q
import requests
import torch
from transformers import pipeline
import transformers.utils



## Validating a Search Query

We can use the PubMed API to check whether our search string retrieves the seed studies we already know about, while keeping the total number of studies to screen at an acceptable level.

In [2]:
search_string = """
("Acne Vulgaris"[Mesh] OR Acne[tiab] OR Blackheads[tiab] OR Whiteheads[tiab] OR Pimples[tiab]) AND ("Phototherapy"[Mesh] OR "Blue light"[tiab] OR Phototherapy[tiab] OR Phototherapies[tiab] OR "Photoradiation therapy"[tiab] OR "Photoradiation Therapies"[tiab] OR "Light Therapy"[tiab] OR "Light Therapies"[tiab]) AND (Randomized controlled trial[pt] OR controlled clinical trial[pt] OR randomized[tiab] OR randomised[tiab] OR placebo[tiab] OR "drug therapy"[sh] OR randomly[tiab] OR trial[tiab] OR groups[tiab]) NOT (Animals[Mesh] not (Animals[Mesh] and Humans[Mesh]))
"""
seed_studies = ["27575854", "25594129", "20098847", "22091799", "23278295", "24313686", "29152718", "10809858",
                "18664153", "15379878"]

In [3]:
response = requests.get(  # GET request
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",  # URL of the API
    params={  # Parameters of the request
        "db": "pubmed",
        "term": search_string,
        "retmax": 10_000,  # ESearch returns at most 10,000 PubMed IDs per request
        "format": "json"
    }
).json()  # Parse the response as JSON
response

{'header': {'type': 'esearch', 'version': '0.3'},
 'esearchresult': {'count': '495',
  'retmax': '495',
  'retstart': '0',
  'idlist': ['38464448',
   '38316341',
   '38297922',
   '38243786',
   '38070633',
   '37984524',
   '37951327',
   '37943263',
   '37931693',
   '37881149',
   '37843672',
   '37805765',
   '37643988',
   '37634105',
   '37592125',
   '37558093',
   '37446050',
   '37328000',
   '37097168',
   '37095152',
   '37073672',
   '36946749',
   '36842473',
   '36740179',
   '36704881',
   '36676145',
   '36587863',
   '36564903',
   '36505308',
   '36384211',
   '36381183',
   '36208057',
   '36199062',
   '35951024',
   '35917260',
   '35829974',
   '35792252',
   '35791482',
   '35789996',
   '35763391',
   '35760351',
   '35500742',
   '35490959',
   '35309844',
   '35286707',
   '35176498',
   '35132604',
   '35044116',
   '34981580',
   '34929354',
   '34919759',
   '34840709',
   '34799388',
   '34797414',
   '34752842',
   '34696155',
   '34674364',
   '34613687

In [4]:
def validate_search(seed_studies, response):
    """Return the total hit count, the number of seed studies retrieved, and their PMIDs."""
    total_studies = int(response["esearchresult"]["count"])
    # Use a set so that each membership test below is a constant-time lookup.
    retrieved_studies = set(response["esearchresult"]["idlist"])
    retrieved_seed_studies = [study for study in seed_studies if study in retrieved_studies]
    return total_studies, len(retrieved_seed_studies), retrieved_seed_studies
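
Note that `validate_search` only sees the IDs present in `idlist`. If a query matches more studies than `retmax` allows, the list is cut short and seed studies may be wrongly counted as missed. The following is a minimal sketch of a completeness check; the function name `idlist_is_complete` is hypothetical and not part of the original notebook, and the toy responses use made-up IDs.

```python
def idlist_is_complete(response):
    """Return True when the ESearch response lists every matching PMID."""
    result = response["esearchresult"]
    # `count` is the total number of matches; `idlist` holds only the IDs returned.
    return len(result["idlist"]) >= int(result["count"])


# Toy responses shaped like the ESearch JSON shown above (IDs are made up for illustration).
complete = {"esearchresult": {"count": "2", "idlist": ["111", "222"]}}
truncated = {"esearchresult": {"count": "5", "idlist": ["111", "222"]}}
print(idlist_is_complete(complete), idlist_is_complete(truncated))
```

Running a check like this before validating would flag cases where the recall figure cannot be trusted.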

In [5]:
retrieved_studies, retrieved_seed_studies, _ = validate_search(seed_studies, response)
print(f"Total studies retrieved: {retrieved_studies}")
print(f"Retrieved {retrieved_seed_studies} out of {len(seed_studies)} studies.")

Total studies retrieved: 495
Retrieved 10 out of 10 studies.


In [6]:
print("Precision:", retrieved_seed_studies / retrieved_studies)
print("Recall:", retrieved_seed_studies / len(seed_studies))

Precision: 0.020202020202020204
Recall: 1.0
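
The two ratios above can be wrapped in a small helper that also guards against empty denominators. This is a sketch under the notebook's own definitions: "precision" here is the number of retrieved seed studies divided by the total number of studies retrieved, so it is likely a lower bound on true precision, because relevant studies outside the seed set are not counted. The function name `seed_precision_recall` is hypothetical.

```python
def seed_precision_recall(num_retrieved, num_seed_retrieved, num_seeds):
    """Compute seed-based precision and recall, returning 0.0 for empty denominators."""
    precision = num_seed_retrieved / num_retrieved if num_retrieved else 0.0
    recall = num_seed_retrieved / num_seeds if num_seeds else 0.0
    return precision, recall


# Values from the manually written query above: 495 studies retrieved, 10 of 10 seeds found.
precision, recall = seed_precision_recall(495, 10, 10)
print(f"Precision: {precision:.4f}, Recall: {recall:.1f}")
```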


In [7]:
# This cell is required if running on Apple Silicon. It can otherwise be ignored.
%env PYTORCH_ENABLE_MPS_FALLBACK=1

env: PYTORCH_ENABLE_MPS_FALLBACK=1


## Generating Search Queries with LLMs

A [recent paper](https://dl.acm.org/doi/pdf/10.1145/3539618.3591703) used ChatGPT to generate search queries for systematic reviews. Since then, many open-source models have been released, some of which are arguably as good as, if not better than, ChatGPT. In the next set of cells, we will use one of these models to replicate the results of the paper.

In the first cell below, we define a few lines of code that will help us interact with the model.

In [8]:

class ChatAssistant:
    def __init__(self):
        if torch.cuda.is_available():
            torch.backends.cuda.enable_mem_efficient_sdp(False)
            torch.backends.cuda.enable_flash_sdp(False)
            device = "cuda"
        elif transformers.utils.is_torch_mps_available():
            device = "mps"
        else:
            device = "cpu"
        self.pipe = pipeline("text-generation", model="BioMistral/BioMistral-7B-SLERP", torch_dtype=torch.bfloat16,
                             device_map=device)
        self.context = []
        self.outputs = []

    def new_chat(self):
        self.context = []
        self.outputs = []

    def chat(self, message):
        prompt = self.pipe.tokenizer.apply_chat_template(self.context + [{"role": "user", "content": message}],
                                                         tokenize=False, add_generation_prompt=True)
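        outputs = self.pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
        # The pipeline returns the prompt followed by the completion,
        # so keep only the text after the last [/INST] tag.
        reply = outputs[0]["generated_text"].split("[/INST]")[-1]
        self.context.append({"role": "user", "content": message})
        # Store only the reply, so later prompts do not embed earlier prompts inside assistant turns.
        self.context.append({"role": "assistant", "content": reply})
        self.outputs.append(reply)
        print(reply)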


assistant = ChatAssistant()


Next, we define some data that we will use inside the prompts to the model.

In [9]:
statement = "Blue-Light Therapy for Acne Vulgaris"
title = "A randomized controlled study for the treatment of acne vulgaris using high-intensity 414 nm solid state diode arrays."

Then, we can start a new chat with the model we created and have it follow a series of instructions to generate a search query.

In [25]:
assistant.new_chat()

assistant.chat(
    f"Follow my instructions precisely to develop a highly effective Boolean query for a medical systematic review literature search. Do not explain or elaborate. First, given the following statement and title from a relevant study, identify 10 terms or phrases that are relevant. The terms you identify will be used to retrieve more relevant studies. statement: {statement} title: {title}")

assistant.chat(
    "For each item in step 1, classify it as one of three categories: terms relating to health conditions (A), terms relating to a treatment (B), terms relating to types of study design (C). When an item does not fit one of these categories, mark it as (N/A). Do not explain or elaborate.")

assistant.chat(
    "Using the list in step 2, use your expert knowledge to create a valid Boolean query that can be submitted to PubMed which groups together items from each category. Also add relevant MeSH terms into the query where necessary. Each main clause of the query must correspond to a PICO element. Do not explain or elaborate.")

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.

 1. Blue-light therapy
2. Acne vulgaris
3. Solid state diode arrays
4. High-intensity 414 nm
5. Treatment
6. Randomized controlled study
7. Diode arrays
8. 414 nm
9. Acne
10. High-intensity


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


 1. Blue-light therapy (B)
2. Acne vulgaris (A)
3. Solid state diode arrays (B)
4. High-intensity 414 nm (B)
5. Treatment (B)
6. Randomized controlled study (C)
7. Diode arrays (B)
8. 414 nm (B)
9. Acne (A)
10. High-intensity (B)
 PubMed Boolean query: ((((((((((((((((((((blue-light therapy) OR (blue light therapy)) OR (blue light treatment)) OR (blue light phototherapy)) OR (blue light therapy for acne)) OR (blue light therapy for acne vulgaris)) OR (blue light therapy for acne treatment)) OR (blue light therapy for acne treatment options)) OR (blue light therapy for acne vulgaris treatment)) OR (blue light therapy for acne vulgaris treatment options)) OR (blue light therapy for acne vulgaris treatment methods)) OR (blue light therapy for acne vulgaris treatment outcomes)) OR (blue light therapy for acne vulgaris treatment results)) OR (blue light therapy for acne vulgaris treatment effects)) OR (blue light therapy for acne vulgaris treatment efficacy)) OR (blue light therapy for acne

We should see the generated query above, but we can also extract it from the model directly.

In [26]:
generated_search_string = assistant.outputs[-1]
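
The raw model output can carry a leading label (such as "PubMed Boolean query:") and, when generation stops at the token limit, unbalanced parentheses. The sketch below shows one way such a string could be tidied before it is sent to PubMed. It is an illustration only and not part of the original notebook: the function name `tidy_generated_query` is hypothetical, the example string is made up rather than actual model output, and the approach assumes that no parentheses appear inside quoted phrases.

```python
def tidy_generated_query(text):
    """Drop any label before the first '(' and close unbalanced parentheses."""
    start = text.find("(")
    query = text[start:].strip() if start != -1 else text.strip()
    depth = 0  # number of parentheses currently open
    for char in query:
        if char == "(":
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1  # stray closing parentheses are left in place and not counted
    return query + ")" * depth


# Made-up example: a label followed by a query that was cut off mid-clause.
example = " PubMed Boolean query: ((acne) OR (acne vulgaris)) AND ((blue light) OR (phototherapy"
print(tidy_generated_query(example))
```

Closing the open parentheses only makes the string well-formed; it cannot recover any clauses lost when generation was cut off.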

We can then reuse the code from our manually created search string to validate the query generated by the model. This time, we search PubMed with the generated search string.

In [27]:
response = requests.get(  # GET request
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",  # URL of the API
    params={  # Parameters of the request
        "db": "pubmed",
        "term": generated_search_string,
        "retmax": 10_000,  # ESearch returns at most 10,000 PubMed IDs per request
        "format": "json"
    }
).json()  # Parse the response as JSON

generated_retrieved_studies, generated_retrieved_seed_studies, _ = validate_search(seed_studies, response)

In [28]:
print(f"Total studies retrieved: {generated_retrieved_studies}")
print(f"Retrieved {generated_retrieved_seed_studies} out of {len(seed_studies)} studies.")

Total studies retrieved: 877
Retrieved 10 out of 10 studies.


In [29]:
print("Precision:", generated_retrieved_seed_studies / generated_retrieved_studies)
print("Recall:", generated_retrieved_seed_studies / len(seed_studies))

Precision: 0.011402508551881414
Recall: 1.0


## Summary

In this notebook, we've shown how to use the PubMed API to automatically validate searches against a set of seed studies. We then used a local large language model to generate a search query for a systematic review and validated it against the same set of seed studies.

---
[top](https://github.com/hscells/apis-for-evidence-identification)<br/>
[next: Searching ClinicalTrials.gov](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-2-clinicaltrials-gov.ipynb)<br/>
[previous: Using APIs via HTTPie](https://github.com/hscells/apis-for-evidence-identification/blob/main/3-use-cases/3-1-searching-pubmed.ipynb)<br/>