# PubMedQA: A Dataset for Biomedical Research Question Answering

Group: TLDR

* Federica Maria Laudizi
* Francesca Visalli
* Margherita Marino
* Tomaz Maia Suller

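This notebook computes cosine similarities between sentence embeddings and pairs each question in the labelled dataset with its closest match in the artificial dataset. Matching uses the `full_text` embeddings, and the closest match is stored as a formatted abstract. The resulting dataset is used for retrieval-augmented generation (RAG).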

In [1]:
from collections import defaultdict
from pathlib import Path

import numpy as np
import pandas as pd

from utils import load

In [2]:
EMBEDDING_MODEL = "BioSentVec_PubMed_MIMICIII-bigram_d700"

EMBEDDING_BASE_DIR = Path("embeddings")
OUTPUT_DATA_DIR = Path("data")
SENTENCE_EMBEDDINGS_DIR = EMBEDDING_BASE_DIR / "sentence"

In [3]:
TARGET_SPLIT = "labeled"
SOURCE_SPLITS = (
    "artificial",
)

In [4]:
target_df = load(TARGET_SPLIT)
target_df

Unnamed: 0,pubid,question,long_answer,final_decision,context.contexts,context.labels,context.meshes,context.reasoning_required_pred,context.reasoning_free_pred,context.reasoning_required_pred_str,context.reasoning_free_pred_str,final_decision_str
0,21645374,Do mitochondria play a role in remodelling lac...,Results depicted mitochondrial dynamics in viv...,True,[Programmed cell death (PCD) is the regulated ...,"[BACKGROUND, RESULTS]","[Alismataceae, Apoptosis, Cell Differentiation...",True,True,True,True,True
1,16418930,Landolt C and snellen e acuity: differences in...,"Using the charts described, there was only a s...",False,[Assessment of visual acuity depends on the op...,"[BACKGROUND, PATIENTS AND METHODS, RESULTS]","[Adolescent, Adult, Aged, Aged, 80 and over, A...",False,False,False,False,False
2,9488747,"Syncope during bathing in infants, a pediatric...","""Aquagenic maladies"" could be a pediatric form...",True,[Apparent life-threatening events in infants a...,"[BACKGROUND, CASE REPORTS]","[Baths, Histamine, Humans, Infant, Syncope, Ur...",True,True,True,True,True
3,17208539,Are the long-term results of the transanal pul...,Our long-term study showed significantly bette...,False,[The transanal endorectal pull-through (TERPT)...,"[PURPOSE, METHODS, RESULTS]","[Child, Child, Preschool, Colectomy, Female, H...",True,False,True,False,False
4,10808977,Can tailored interventions increase mammograph...,The effects of the intervention were most pron...,True,[Telephone counseling and tailored print commu...,"[BACKGROUND, DESIGN, PARTICIPANTS, INTERVENTIO...","[Cost-Benefit Analysis, Female, Health Mainten...",True,False,True,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...
995,8921484,Does gestational age misclassification explain...,Gestational age misclassification is an unlike...,False,"[After 34 weeks gestation, summary measures of...","[BACKGROUND, METHODS, RESULTS]","[Adult, Australia, Birth Weight, Classificatio...",True,False,True,False,False
996,16564683,Is there any interest to perform ultrasonograp...,Sonography has no place in the diagnosis of un...,False,[To evaluate the accuracy of ultrasonographic ...,"[OBJECTIVE, MATERIAL AND METHODS, RESULTS]","[Child, Child, Preschool, Cryptorchidism, Huma...",False,False,False,False,False
997,23147106,Is peak concentration needed in therapeutic dr...,These results suggest little need to use peak ...,False,[We analyzed the pharmacokinetic-pharmacodynam...,"[BACKGROUND, METHODS, RESULTS]","[Aged, Aged, 80 and over, Anti-Bacterial Agent...",True,False,True,False,False
998,21550158,Can autologous platelet-rich plasma gel enhanc...,"The PRP group recorded reduced pain, swelling,...",True,[This investigation assesses the effect of pla...,"[PURPOSE, PATIENTS AND METHODS, RESULTS]","[Adult, Bone Regeneration, Chi-Square Distribu...",False,True,False,True,True


In [5]:
source_df = pd.concat([
    load(split)
    for split in SOURCE_SPLITS
]).reset_index()
source_df

Unnamed: 0,index,pubid,question,long_answer,final_decision,context.contexts,context.labels,context.meshes,final_decision_str
0,0,25429730,Are group 2 innate lymphoid cells ( ILC2s ) in...,"As ILC2s are elevated in patients with CRSwNP,...",True,[Chronic rhinosinusitis (CRS) is a heterogeneo...,"[BACKGROUND, OBJECTIVE, METHODS, RESULTS]","[Adult, Aged, Antigens, Surface, Case-Control ...",True
1,1,25433161,Does vagus nerve contribute to the development...,Neuronal signals via the hepatic vagus nerve c...,True,[Phosphatidylethanolamine N-methyltransferase ...,"[OBJECTIVE, METHODS, RESULTS]","[Animals, Chemokine CCL2, Diet, High-Fat, Dise...",True
2,2,25445714,Does psammaplin A induce Sirtuin 1-dependent a...,PsA significantly inhibited MCF-7/adr cells pr...,True,[Psammaplin A (PsA) is a natural product isola...,"[BACKGROUND, METHODS]","[Acetylation, Animals, Antibiotics, Antineopla...",True
3,3,25431941,Is methylation of the FGFR2 gene associated wi...,We identified a novel biologically plausible c...,True,[This study examined links between DNA methyla...,"[OBJECTIVE, METHODS, RESULTS]",[],True
4,4,25432519,Do tumor-infiltrating immune cell profiles and...,Breast cancer immune cell subpopulation profil...,True,[Tumor microenvironment immunity is associated...,"[BACKGROUND, METHODS, RESULTS]","[Adult, Aged, Anthracyclines, Antibodies, Mono...",True
...,...,...,...,...,...,...,...,...,...
211264,211264,8217974,Is urine production rate related to behavioura...,During active sleep (state 2F) hourly fetal ur...,True,[To investigate the relation between hourly fe...,"[OBJECTIVE, METHODS, METHODS, METHODS, METHODS...","[Behavior, Embryonic and Fetal Development, Fe...",True
211265,211265,8204319,Does evaluation of the use of general practice...,General practice registers can provide a suita...,True,[This study set out to show how well samples f...,"[OBJECTIVE, METHODS, RESULTS]","[Adult, Age Factors, Epidemiologic Methods, Fa...",True
211266,211266,8205673,Does intracoronary angiotensin-converting enzy...,Intracoronary enalaprilat resulted in an impro...,True,[There is increasing recognition of myocardial...,"[BACKGROUND, RESULTS]","[Adult, Aged, Coronary Vessels, Diastole, Enal...",True
211267,211267,8215873,Does transfusion significantly increase the ri...,The choice between splenectomy and splenic rep...,True,[To determine if splenectomy results in an inc...,"[OBJECTIVE, METHODS, METHODS, METHODS, METHODS...","[Adult, Bacteremia, Female, Humans, Injury Sev...",True


In [6]:
embeddings: dict[str, dict[str, np.ndarray]] = defaultdict(dict)
for split in ("labeled", "unlabeled", "artificial"):
    with (SENTENCE_EMBEDDINGS_DIR / split / f"{EMBEDDING_MODEL}.npz").open("rb") as f:
        with np.load(f) as data:
            for key, value in data.items():
                embeddings[split][key] = value
embeddings

defaultdict(dict,
            {'labeled': {'question': array([[-0.03413809,  0.08783249,  0.23454772, ...,  0.18894325,
                      -0.07009058, -0.05960368],
                     [-0.01844016, -0.19973463, -0.27306175, ...,  0.1554322 ,
                       0.01059111,  0.09929473],
                     [-0.29999515, -0.5277951 , -0.4147372 , ...,  0.10904694,
                       0.10407549,  0.25947914],
                     ...,
                     [-0.21563652, -0.11012568,  0.24946418, ...,  0.09730155,
                      -0.04870246,  0.13552524],
                     [ 0.05309517,  0.21788427, -0.34573796, ..., -0.2428476 ,
                       0.18243514,  0.0655036 ],
                     [ 0.30908936,  0.43286923, -0.03107509, ..., -0.11518667,
                       0.11042855,  0.41196337]], shape=(1000, 700), dtype=float32),
              'long_answer': array([[ 0.05308781,  0.17729971,  0.06652988, ...,  0.06373439,
                      -0.05955856, 

In [7]:
list(embeddings["labeled"])

['question',
 'long_answer',
 'context.full_context',
 'context.full_meshes',
 'full_text',
 'no_reasoning']

In [8]:
source_embeddings = np.vstack(
    [embeddings[split]["full_text"] for split in SOURCE_SPLITS],
)
source_embeddings

array([[-0.00479869,  0.182581  , -0.18844976, ..., -0.11320306,
         0.13177823,  0.15832873],
       [ 0.11821024,  0.07054456, -0.02101681, ..., -0.02390636,
        -0.03392243,  0.09293078],
       [ 0.06065191,  0.18323833, -0.02539335, ...,  0.01422214,
        -0.07669193,  0.13235876],
       ...,
       [ 0.14675404, -0.03312806,  0.02538787, ..., -0.01243096,
         0.0104538 , -0.04476508],
       [ 0.06257252, -0.03196436, -0.15788035, ...,  0.00607273,
         0.03348964, -0.00370837],
       [ 0.12065759,  0.06298785, -0.05880241, ..., -0.08862936,
        -0.06808811,  0.15857315]], shape=(211269, 700), dtype=float32)

In [9]:
source_norms = np.linalg.norm(source_embeddings, axis=1).reshape((-1, 1))
normalised_source_embeddings = source_embeddings / source_norms
normalised_source_embeddings

array([[-0.00165882,  0.06311502, -0.06514375, ..., -0.0391323 ,
         0.0455534 ,  0.05473144],
       [ 0.03776748,  0.02253857, -0.00671475, ..., -0.00763794,
        -0.01083802,  0.02969084],
       [ 0.02055335,  0.06209468, -0.00860514, ...,  0.00481951,
        -0.02598889,  0.04485292],
       ...,
       [ 0.06021097, -0.01359194,  0.01041626, ..., -0.00510023,
         0.00428904, -0.01836644],
       [ 0.02458053, -0.01255664, -0.06202055, ...,  0.00238557,
         0.01315582, -0.00145677],
       [ 0.04316441,  0.02253346, -0.02103615, ..., -0.03170653,
        -0.02435805,  0.05672844]], shape=(211269, 700), dtype=float32)

In [10]:
target_embeddings = embeddings[TARGET_SPLIT]["full_text"]
target_norms = np.linalg.norm(target_embeddings, axis=1).reshape((-1, 1))
normalised_target_embeddings = target_embeddings / target_norms
normalised_target_embeddings

array([[ 0.01749045,  0.03329223,  0.00589029, ...,  0.01683456,
        -0.01601021, -0.00192221],
       [ 0.01270052,  0.02733243, -0.05338116, ...,  0.00447058,
        -0.01774536,  0.00195   ],
       [ 0.04961868, -0.02705407, -0.05419235, ...,  0.0069914 ,
         0.00834018,  0.04095153],
       ...,
       [-0.06857593, -0.02499809,  0.01961668, ...,  0.02140907,
        -0.02343112,  0.05594094],
       [-0.01125871, -0.01080015, -0.06714322, ..., -0.05599077,
        -0.00145456,  0.04321832],
       [ 0.09861822,  0.00512113, -0.04626896, ...,  0.02244047,
         0.00260826,  0.08075263]], shape=(1000, 700), dtype=float32)

In [11]:
# Rows on both sides have unit length, so each dot product is a cosine similarity.
embedding_similarities: np.ndarray = normalised_target_embeddings @ normalised_source_embeddings.T
embedding_similarities

array([[0.48322773, 0.52042145, 0.5706818 , ..., 0.45885706, 0.45454654,
        0.40899062],
       [0.44100472, 0.36688218, 0.361651  , ..., 0.48534036, 0.44923443,
        0.44422936],
       [0.5663876 , 0.4687558 , 0.38878846, ..., 0.54912317, 0.5876865 ,
        0.5840443 ],
       ...,
       [0.48836172, 0.41253626, 0.4843232 , ..., 0.5285813 , 0.57625914,
        0.50031507],
       [0.5431512 , 0.44224215, 0.43224412, ..., 0.5576449 , 0.59722626,
        0.5158511 ],
       [0.49003065, 0.4559083 , 0.45115662, ..., 0.5565902 , 0.5234872 ,
        0.6632899 ]], shape=(1000, 211269), dtype=float32)
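
The full similarity matrix above has 1000 × 211269 `float32` entries, roughly 0.8 GB. Since only the best match per target is kept, the same result can be obtained without materialising the whole matrix. The following is a minimal sketch of that idea, not the notebook's own method: `l2_normalise`, `closest_by_cosine` and `chunk_size` are hypothetical names introduced here, and the toy arrays stand in for the real embeddings.

```python
import numpy as np


def l2_normalise(matrix: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.maximum(norms, eps)


def closest_by_cosine(
    targets: np.ndarray, sources: np.ndarray, chunk_size: int = 50_000
) -> tuple[np.ndarray, np.ndarray]:
    """For each target row, return the index and cosine similarity of its nearest source row."""
    targets = l2_normalise(targets)
    best_index = np.zeros(len(targets), dtype=np.int64)
    best_score = np.full(len(targets), -np.inf, dtype=np.float64)
    rows = np.arange(len(targets))
    for start in range(0, len(sources), chunk_size):
        chunk = l2_normalise(sources[start:start + chunk_size])
        scores = targets @ chunk.T  # shape: (n_targets, rows in this chunk)
        local_index = scores.argmax(axis=1)
        local_score = scores[rows, local_index]
        # Strict comparison keeps the earliest index on ties, like argmax does.
        improved = local_score > best_score
        best_index[improved] = start + local_index[improved]
        best_score[improved] = local_score[improved]
    return best_index, best_score


# Toy data: two targets and three sources in two dimensions.
toy_targets = np.array([[1.0, 0.0], [0.0, 2.0]])
toy_sources = np.array([[0.0, 1.0], [3.0, 0.1], [-1.0, 0.0]])
indices, scores = closest_by_cosine(toy_targets, toy_sources, chunk_size=2)
print(indices)
```

Keeping the best score alongside the index also makes it possible to inspect how close each retrieved match actually is, which the plain `argmax` discards.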

In [12]:
# For each target row, the position of the most similar source row.
closest_source = embedding_similarities.argmax(axis=1)
closest_source

array([209582, 174144, 173931,  22648, 169075, 176824, 203394,  17016,
       164601,  91343,  66484, 167535,  52064, 122742,  92317, 138222,
        28890, 174982, 106129, 123157,  24857,  91474,  45710, 205535,
        39463, 164166,   5512,  31243, 136861,  82171,  57230,  95950,
       100740,  61170,  94462, 108867,  42459,  48701,  81270,  67156,
        81028,  79761, 192572, 149140, 103877, 114089, 135803,  10478,
        88240, 170125,  48030, 150469, 135893, 179734, 187534, 160450,
        20542, 209438, 153209,   1319,    213, 103296,  86494,  91499,
       150023,  50728, 156451, 163893, 155215,  77214,  69401, 178277,
       144804, 131771, 158081, 178447,  30472, 201755,  15954, 137073,
        82062, 162682,  73521, 172289,  29943,  39577, 122567, 107415,
         5663,  66785,  53796, 179924, 115997,  39952,  42035,  42063,
       127860, 149997,   3032, 162759,  57133,  79680, 202033, 151748,
       180542, 199547, 147835,  92471, 175526, 198814, 179804, 104047,
      

In [13]:
def format_abstract(series: pd.Series) -> str:
    """Render one record as labelled sections, ending with the question and its answer."""
    contexts = list(series["context.contexts"])
    labels = list(series["context.labels"])

    # The conclusion and the question are appended as two extra sections.
    labels.append("conclusion")
    contexts.append(series["long_answer"])

    labels.append("question")
    contexts.append(series["question"])

    abstract = "\n\n".join(
        f"{label.upper()}\n{text}"
        for label, text in zip(labels, contexts)
    )
    final_decision_str = "Yes" if series["final_decision"] else "No"
    abstract += f"\n\nANSWER: {final_decision_str}"
    return abstract

In [14]:
source_abstracts = source_df.apply(format_abstract, axis="columns")
source_abstracts

0         BACKGROUND\nChronic rhinosinusitis (CRS) is a ...
1         OBJECTIVE\nPhosphatidylethanolamine N-methyltr...
2         BACKGROUND\nPsammaplin A (PsA) is a natural pr...
3         OBJECTIVE\nThis study examined links between D...
4         BACKGROUND\nTumor microenvironment immunity is...
                                ...                        
211264    OBJECTIVE\nTo investigate the relation between...
211265    OBJECTIVE\nThis study set out to show how well...
211266    BACKGROUND\nThere is increasing recognition of...
211267    OBJECTIVE\nTo determine if splenectomy results...
211268    OBJECTIVE\nTo determine if low gastric intramu...
Length: 211269, dtype: object

In [17]:
print(source_abstracts.iloc[0])

BACKGROUND
Chronic rhinosinusitis (CRS) is a heterogeneous disease with an uncertain pathogenesis. Group 2 innate lymphoid cells (ILC2s) represent a recently discovered cell population which has been implicated in driving Th2 inflammation in CRS; however, their relationship with clinical disease characteristics has yet to be investigated.

OBJECTIVE
The aim of this study was to identify ILC2s in sinus mucosa in patients with CRS and controls and compare ILC2s across characteristics of disease.

METHODS
A cross-sectional study of patients with CRS undergoing endoscopic sinus surgery was conducted. Sinus mucosal biopsies were obtained during surgery and control tissue from patients undergoing pituitary tumour resection through transphenoidal approach. ILC2s were identified as CD45(+) Lin(-) CD127(+) CD4(-) CD8(-) CRTH2(CD294)(+) CD161(+) cells in single cell suspensions through flow cytometry. ILC2 frequencies, measured as a percentage of CD45(+) cells, were compared across CRS phenotype

In [18]:
# closest_source holds positions returned by argmax, so index by position.
target_df["closest_abstract"] = source_abstracts.iloc[closest_source].to_list()
target_df

Unnamed: 0,pubid,question,long_answer,final_decision,context.contexts,context.labels,context.meshes,context.reasoning_required_pred,context.reasoning_free_pred,context.reasoning_required_pred_str,context.reasoning_free_pred_str,final_decision_str,closest_abstract
0,21645374,Do mitochondria play a role in remodelling lac...,Results depicted mitochondrial dynamics in viv...,True,[Programmed cell death (PCD) is the regulated ...,"[BACKGROUND, RESULTS]","[Alismataceae, Apoptosis, Cell Differentiation...",True,True,True,True,True,BACKGROUND\nDevelopmentally regulated programm...
1,16418930,Landolt C and snellen e acuity: differences in...,"Using the charts described, there was only a s...",False,[Assessment of visual acuity depends on the op...,"[BACKGROUND, PATIENTS AND METHODS, RESULTS]","[Adolescent, Adult, Aged, Aged, 80 and over, A...",False,False,False,False,False,OBJECTIVE\nDetection of amblyopia in infants a...
2,9488747,"Syncope during bathing in infants, a pediatric...","""Aquagenic maladies"" could be a pediatric form...",True,[Apparent life-threatening events in infants a...,"[BACKGROUND, CASE REPORTS]","[Baths, Histamine, Humans, Infant, Syncope, Ur...",True,True,True,True,True,BACKGROUND\nUrticaria is the disease that has ...
3,17208539,Are the long-term results of the transanal pul...,Our long-term study showed significantly bette...,False,[The transanal endorectal pull-through (TERPT)...,"[PURPOSE, METHODS, RESULTS]","[Child, Child, Preschool, Colectomy, Female, H...",True,False,True,False,False,OBJECTIVE\nThe purpose of the investigation wa...
4,10808977,Can tailored interventions increase mammograph...,The effects of the intervention were most pron...,True,[Telephone counseling and tailored print commu...,"[BACKGROUND, DESIGN, PARTICIPANTS, INTERVENTIO...","[Cost-Benefit Analysis, Female, Health Mainten...",True,False,True,False,True,OBJECTIVE\nA randomized trial was conducted to...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,8921484,Does gestational age misclassification explain...,Gestational age misclassification is an unlike...,False,"[After 34 weeks gestation, summary measures of...","[BACKGROUND, METHODS, RESULTS]","[Adult, Australia, Birth Weight, Classificatio...",True,False,True,False,False,OBJECTIVE\nWe aimed to determine whether ethni...
996,16564683,Is there any interest to perform ultrasonograp...,Sonography has no place in the diagnosis of un...,False,[To evaluate the accuracy of ultrasonographic ...,"[OBJECTIVE, MATERIAL AND METHODS, RESULTS]","[Child, Child, Preschool, Cryptorchidism, Huma...",False,False,False,False,False,OBJECTIVE\nAn inguinal sonogram often is obtai...
997,23147106,Is peak concentration needed in therapeutic dr...,These results suggest little need to use peak ...,False,[We analyzed the pharmacokinetic-pharmacodynam...,"[BACKGROUND, METHODS, RESULTS]","[Aged, Aged, 80 and over, Anti-Bacterial Agent...",True,False,True,False,False,OBJECTIVE\nVancomycin is a common treatment fo...
998,21550158,Can autologous platelet-rich plasma gel enhanc...,"The PRP group recorded reduced pain, swelling,...",True,[This investigation assesses the effect of pla...,"[PURPOSE, PATIENTS AND METHODS, RESULTS]","[Adult, Bone Regeneration, Chi-Square Distribu...",False,True,False,True,True,OBJECTIVE\nLower third molar removal provides ...


In [19]:
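# Create the output directory if it does not exist yet.
OUTPUT_DATA_DIR.mkdir(parents=True, exist_ok=True)
target_df.to_parquet(OUTPUT_DATA_DIR / "labeled.parquet")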
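
To illustrate how the `closest_abstract` column might feed a RAG prompt downstream, here is a minimal sketch. The function `build_one_shot_prompt`, the `---` separator and the example row are all assumptions made for illustration; the notebook does not specify the prompt layout. The target's `long_answer` is left out on the assumption that the model should not see the conclusion.

```python
import pandas as pd


def build_one_shot_prompt(row: pd.Series) -> str:
    """Hypothetical layout: the retrieved example first, then the unanswered target."""
    sections = [
        f"{label.upper()}\n{text}"
        for label, text in zip(row["context.labels"], row["context.contexts"])
    ]
    sections.append(f"QUESTION\n{row['question']}")
    query = "\n\n".join(sections)
    # The retrieved abstract already ends with "ANSWER: Yes" or "ANSWER: No".
    return f"{row['closest_abstract']}\n\n---\n\n{query}\n\nANSWER:"


# Made-up row with the same columns the notebook writes to labeled.parquet.
example_row = pd.Series(
    {
        "context.labels": ["BACKGROUND", "RESULTS"],
        "context.contexts": ["Some background.", "Some results."],
        "question": "Does X affect Y?",
        "closest_abstract": "BACKGROUND\nRetrieved text.\n\nANSWER: Yes",
    }
)
prompt = build_one_shot_prompt(example_row)
print(prompt)
```

Because the retrieved abstract follows the same section layout as the query, the model sees one worked example in exactly the format it is asked to complete.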