## PubMedQA - Metadata Enrichment using NLP + LLM
This notebook demonstrates how to enrich PubMedQA metadata using NLP techniques and Large Language Models (LLMs). The workflow includes:

- Loading and exploring the PubMedQA dataset
- Labeling and preparing data for analysis
- Extracting keywords using BERT-based models
- Extracting topics, phrases, synonyms, and acronyms using LLMs (LLAMA3, MIXTRAL)
- Saving the enriched metadata for downstream tasks

The goal is to make the dataset more useful for downstream analysis and applications in biomedical research.

In [1]:
import os
import re
from datetime import date
import pandas as pd
import json
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer
from keybert import KeyBERT
import numpy as np



In [4]:
from dotenv import load_dotenv

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
watsonx_url = os.getenv("watsonx_url")
watsonx_apikey = os.getenv("watsonx_apikey")
watsonx_projectID = os.getenv("watsonx_projectID")

print(watsonx_url)

https://us-south.ml.cloud.ibm.com


In [2]:
from datasets import load_dataset

# Load the expert-labelled subset of PubMedQA
ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")

In [None]:
df_train = pd.DataFrame(ds['train'])
df_train.head()

| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 21645374 | Do mitochondria play a role in remodelling lac... | {'contexts': ['Programmed cell death (PCD) is ... | Results depicted mitochondrial dynamics in viv... | yes |
| 1 | 16418930 | Landolt C and snellen e acuity: differences in... | {'contexts': ['Assessment of visual acuity dep... | Using the charts described, there was only a s... | no |
| 2 | 9488747 | Syncope during bathing in infants, a pediatric... | {'contexts': ['Apparent life-threatening event... | "Aquagenic maladies" could be a pediatric form... | yes |
| 3 | 17208539 | Are the long-term results of the transanal pul... | {'contexts': ['The transanal endorectal pull-t... | Our long-term study showed significantly bette... | no |
| 4 | 10808977 | Can tailored interventions increase mammograph... | {'contexts': ['Telephone counseling and tailor... | The effects of the intervention were most pron... | yes |


In [4]:
len(df_train)

1000

In [5]:
# Custom label: mark every row of this subset as expert-labelled
df_train['labelled'] = "Yes"

In [6]:
dict_data = df_train['context'][0]

In [7]:
dict_data

{'contexts': ['Programmed cell death (PCD) is the regulated death of cells within an organism. The lace plant (Aponogeton madagascariensis) produces perforations in its leaves through PCD. The leaves of the plant consist of a latticework of longitudinal and transverse veins enclosing areoles. PCD occurs in the cells at the center of these areoles and progresses outwards, stopping approximately five cells from the vasculature. The role of mitochondria during PCD has been recognized in animals; however, it has been less studied during PCD in plants.',
  'The following paper elucidates the role of mitochondrial dynamics during developmentally regulated PCD in vivo in A. madagascariensis. A single areole within a window stage leaf (PCD is occurring) was divided into three areas based on the progression of PCD; cells that will not undergo PCD (NPCD), cells in early stages of PCD (EPCD), and cells in late stages of PCD (LPCD). Window stage leaves were stained with the mitochondrial dye MitoT
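
Each `context` value is a dictionary whose `contexts` key holds a list of passage strings, so most downstream steps need those passages joined into a single text first. The following is a minimal sketch of one way to do that; the helper name `flatten_context` and the toy record are hypothetical and are not part of the original notebook.

```python
def flatten_context(context: dict, sep: str = " ") -> str:
    """Join the passage strings under the 'contexts' key into one text."""
    passages = context.get("contexts", [])
    # Strip each passage and skip empty ones so the separator is applied cleanly.
    return sep.join(p.strip() for p in passages if p and p.strip())


# Toy record shaped like a PubMedQA context entry (values are illustrative only).
example = {
    "contexts": [
        "Programmed cell death (PCD) is regulated.",
        "  The lace plant produces perforations. ",
    ],
}

flat_text = flatten_context(example)
print(flat_text)
```

Joining with a space rather than an empty string keeps the last sentence of one passage from running into the first word of the next.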

## Distribution of final_decision

In [7]:
df_train['final_decision'].value_counts()

final_decision
yes      552
no       338
maybe    110
Name: count, dtype: int64

In [10]:
unlabelled = load_dataset("qiaojin/PubMedQA", "pqa_unlabeled")

In [11]:
df_unlabelled = pd.DataFrame(unlabelled['train'])
df_unlabelled.head()

| | pubid | question | context | long_answer |
|---|---|---|---|---|
| 0 | 14499029 | Is naturopathy as effective as conventional th... | {'contexts': ['Although the use of alternative... | Naturopathy appears to be an effective alterna... |
| 1 | 14499049 | Can randomised trials rely on existing electro... | {'contexts': ['To estimate the feasibility, ut... | Routine data have the potential to support hea... |
| 2 | 14499672 | Is laparoscopic radical prostatectomy better t... | {'contexts': ['To compare morbidity in two gro... | The results of our non-randomized study show t... |
| 3 | 14499773 | Does bacterial gastroenteritis predispose peop... | {'contexts': ['Irritable bowel syndrome (IBS) ... | Symptoms consistent with IBS and functional di... |
| 4 | 14499777 | Is early colonoscopy after admission for acute... | {'contexts': ['Urgent colonoscopy has been pro... | No significant association is apparent between... |


In [None]:
df_unlabelled['context'][0]

In [None]:
df_unlabelled['labelled'] = "No"

In [12]:
df_unlabelled.shape

(61249, 4)

## Keyword Extraction using BERT Model

In [None]:
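kw_model = KeyBERT()

def extract_keyphrases(text, top_n=10, ngram_range=(1, 3)):
    """Return up to `top_n` diverse keyphrases from `text` using KeyBERT."""
    # Candidate phrases: all n-grams within ngram_range, minus English stop words
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words='english').fit([text])
    candidate_phrases = vectorizer.get_feature_names_out().tolist()
    # MMR with high diversity (0.9) favours dissimilar phrases over near-duplicates
    keywords = kw_model.extract_keywords(
        text, keyphrase_ngram_range=ngram_range, stop_words='english',
        use_mmr=True, candidates=candidate_phrases, top_n=top_n, diversity=0.9)
    return [key for key, _ in keywords]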
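
The `use_mmr=True` and `diversity=0.9` arguments ask KeyBERT to pick keyphrases by maximal marginal relevance (MMR), which trades relevance to the document against similarity to phrases already chosen. The sketch below illustrates the general MMR idea on toy 2-D vectors in plain Python. It is not KeyBERT's internal code, and the names `cosine` and `mmr_select` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(doc_vec, cand_vecs, top_n=2, diversity=0.9):
    """Return indices of candidates chosen by maximal marginal relevance."""
    relevance = [cosine(doc_vec, v) for v in cand_vecs]
    # Start with the single most relevant candidate.
    selected = [max(range(len(cand_vecs)), key=relevance.__getitem__)]
    while len(selected) < min(top_n, len(cand_vecs)):
        remaining = [i for i in range(len(cand_vecs)) if i not in selected]

        def score(i):
            # Penalise similarity to the closest already-selected candidate.
            redundancy = max(cosine(cand_vecs[i], cand_vecs[j]) for j in selected)
            return (1 - diversity) * relevance[i] - diversity * redundancy

        selected.append(max(remaining, key=score))
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 differs.
doc = [1.0, 1.0]
candidates = [[1.0, 0.9], [1.0, 0.8], [0.0, 1.0]]

diverse = mmr_select(doc, candidates, top_n=2, diversity=0.9)
relevant = mmr_select(doc, candidates, top_n=2, diversity=0.0)
print(diverse, relevant)
```

With high diversity the second pick skips the near-duplicate in favour of the dissimilar candidate; with diversity set to 0 the selection reduces to plain relevance ranking.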

## Topic, phrase, synonym, and acronym extraction using an LLM (LLAMA3)

In [None]:
from ibm_watson_machine_learning.foundation_models import Model
from ibm_watson_machine_learning.metanames import GenTextParamsMetaNames as GenParams

# watsonx.ai credentials, read from the .env file loaded earlier
wml_credentials = {
    "url": watsonx_url,
    "apikey": watsonx_apikey,
    "project_id": watsonx_projectID,
}

def build_prompt(context,model_id="MIXTRAL"):
    
    formatted_prompt=""

    SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
                        If you don't know the answer to a question, don't hallucinate and share false information."""

    USER_PROMPT = """You are a medical expert.
    Extract the key topics, important phrases, synonyms, and acronyms that have special meaning in the medical domain from the given text: '{context}'
    Do not include any information that is not available in the context. Provide the output only as valid JSON with the keys topics, phrases, synonyms, and acronyms.
    """

    LLAMA3_PROMPT= """
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
    {user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
    Answer Based on the Provided Context: 
    """

    MIXTRAL_PROMPT = """[INST]
    [ROLE]
    {system_prompt}
    [/ROLE]
    [USER_INSTRUCTIONS]
    {user_prompt}
    [/USER_INSTRUCTIONS]

    Answer Based on the Provided Context:
    [/INST]"""

    user_prompt = USER_PROMPT.format(context=context)
    if  model_id == "MIXTRAL":
        formatted_prompt = MIXTRAL_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    elif model_id == "LLAMA3":
        formatted_prompt = LLAMA3_PROMPT.format(system_prompt=SYSTEM_PROMPT,user_prompt=user_prompt)
    
    return formatted_prompt
      

def send_to_watsonxai(prompts,
                    model_id="MIXTRAL",
                    decoding_method="greedy",
                    max_new_tokens=2000,
                    min_new_tokens=2,
                    temperature=1.0,
                    repetition_penalty=1.0
                    ):
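    if model_id == "MIXTRAL":
        model_name = "mistralai/mistral-large"
    elif model_id == "LLAMA3":
        model_name = "meta-llama/llama-3-3-70b-instruct"
    else:
        # Fail early instead of hitting an undefined model_name below
        raise ValueError(f"Unsupported model_id: {model_id!r}")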
    # Instantiate parameters for text generation
    model_params = {
        GenParams.DECODING_METHOD: decoding_method,
        GenParams.MIN_NEW_TOKENS: min_new_tokens,
        GenParams.MAX_NEW_TOKENS: max_new_tokens,
        #GenParams.RANDOM_SEED: 42,
        GenParams.TEMPERATURE: temperature,
        GenParams.REPETITION_PENALTY: repetition_penalty,
    }
    # Instantiate a model proxy object to send your requests
    model = Model(
        model_id=model_name,
        params=model_params,
        credentials=wml_credentials,
        project_id=watsonx_projectID)

    print("Model used ---",model.model_id)
    response=model.generate_text(prompts)
    # print(response)
    return response

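def get_title(story):
    """Return the raw LLAMA3 metadata-extraction response for `story` (not a title, despite the name)."""
    model_name = "LLAMA3"
    llm_input = build_prompt(story, model_name)
    llm_response = send_to_watsonxai(llm_input, model_name)
    return llm_response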
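
Although the prompt asks for JSON only, model replies often arrive wrapped in extra text or Markdown code fences, which makes a direct `json.loads` call fail. The sketch below shows one defensive way to recover the JSON object from such a reply. The helper name `parse_llm_json` and the sample reply are hypothetical, and the approach assumes the reply contains at most one top-level JSON object.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example readable


def parse_llm_json(raw: str) -> dict:
    """Best-effort extraction of a JSON object from an LLM reply."""
    # Drop Markdown code-fence markers (with an optional "json" language tag).
    cleaned = re.sub(re.escape(FENCE) + r"(?:json)?", "", raw).strip()
    # Keep only the span from the first '{' to the last '}'.
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical reply with a preamble and a fenced JSON payload.
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"topics": ["PCD"], "acronyms": {"PCD": "programmed cell death"}}\n'
    + FENCE
)

result = parse_llm_json(reply)
print(result.get("topics", []), result.get("acronyms", {}))
```

Returning an empty dictionary on failure lets the caller keep using `.get(...)` with defaults instead of branching on exceptions.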

In [None]:
index_doc = []
for index, row in df_unlabelled.iterrows():
    print("Processing index-----", index)
    pub_id = row['pubid']
    doc_id = "p" + str(pub_id)  # avoid shadowing the built-in id()
    contexts = row['context']['contexts']
    labels = row['context']['labels']
    meshes = row['context']['meshes']
    # Join the passages into one text, separated by spaces
    context_str = " ".join(contexts)
    keywords = extract_keyphrases(context_str)
    # Send the joined text (not the raw list) to the LLM
    topics_res = get_title(context_str)
    print(topics_res)
    try:
        data = json.loads(topics_res)
        # Extract values
        topics = data.get("topics", [])
        phrases = data.get("phrases", [])
        synonyms = data.get("synonyms", {})
        # Also accept the misspelled key in case the model returns it
        acronyms = data.get("acronyms", data.get("achronyms", {}))
    except (json.JSONDecodeError, AttributeError, TypeError):
        # The response was not a JSON object: keep the raw text in every field
        topics = phrases = synonyms = acronyms = topics_res
    doc = {
        "id": doc_id,
        "pubid": pub_id,
        "contexts": contexts,
        "labels": labels,
        "meshes": meshes,
        "long_answer": row['long_answer'],
        "labelled": row['labelled'],
        "keywords": keywords,
        "topics": topics,
        "phrases": phrases,
        "synonyms": synonyms,
        "achronym": acronyms,  # key spelling kept as in the original output
    }
    index_doc.append(doc)
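
The loop above makes one LLM call per row and keeps every result in memory until the final save, so an interruption partway through the unlabelled set loses all completed work. One possible mitigation, sketched below under stated assumptions, is to append each record to a JSON Lines checkpoint file and skip already-processed `pubid` values on restart. The file name `enriched.jsonl`, the helper names, and the placeholder record are hypothetical; the real enrichment step would replace the placeholder.

```python
import json
import os
import tempfile


def load_done_ids(path):
    """Return the set of pubids already written to the JSONL checkpoint."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {json.loads(line)["pubid"] for line in f if line.strip()}


def append_record(path, record):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def run(rows, path):
    """Process rows not yet in the checkpoint; return how many were processed."""
    done = load_done_ids(path)
    processed = 0
    for row in rows:
        if row["pubid"] in done:
            continue  # already enriched in a previous run
        # Placeholder for the keyword and LLM enrichment of this row.
        append_record(path, {"pubid": row["pubid"], "topics": []})
        processed += 1
    return processed


# Toy rows standing in for DataFrame rows; the checkpoint lives in a temp directory.
checkpoint = os.path.join(tempfile.mkdtemp(), "enriched.jsonl")
rows = [{"pubid": 1}, {"pubid": 2}]

first = run(rows, checkpoint)   # processes both rows
second = run(rows, checkpoint)  # nothing left to do
print(first, second)
```

Because each record is flushed as its own line, a crash costs at most the row in flight, and the JSONL file can be converted to a single JSON array afterwards if that format is needed.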

In [None]:
# Save all generated metadata records to a JSON file
with open("pubmedqa_index_metadata.json", "w") as f:
    json.dump(index_doc, f, indent=4)  # `indent=4` makes it more readable