## PubMedQA - Retriever Evaluation: Evaluating Semantic Search Strategies
This notebook evaluates several search strategies for the PubMedQA dataset using Elasticsearch. The evaluation covers:

- **BM25 Retrieval:** Baseline keyword-based retrieval using Elasticsearch's BM25 algorithm.
- **KNN Index + Hybrid Search:** Combines dense vector search (k-nearest-neighbor search over sentence embeddings) with keyword search. The keyword side is a `multi_match` query of type `best_fields` with per-field boosts and a `tie_breaker`.

**Workflow Overview:**
1. **Dataset Loading:** The PubMedQA dataset is loaded and prepared for evaluation.
2. **Elasticsearch Setup:** Connections to Elasticsearch indices are established for both BM25 and KNN-based retrieval.
3. **Retrieval & Evaluation:** For each question, documents are retrieved with each search strategy. The notebook checks whether the source document (matched by `pubid`) is among the top-k results and reports the fraction of questions for which it is.
4. **Comparison:** Results from different retrieval methods are compared to assess their effectiveness.

This notebook provides a practical guide for benchmarking semantic search approaches on biomedical QA tasks.
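
The metric described in step 3 can be sketched independently of Elasticsearch. This is a minimal illustration, not the notebook's own code: the function name `hit_rate_at_k` and the toy IDs are assumptions made for the example.

```python
def hit_rate_at_k(retrieved, gold, k):
    """Fraction of queries whose gold ID appears in the first k retrieved IDs."""
    if len(retrieved) != len(gold):
        raise ValueError("retrieved and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ids, g in zip(retrieved, gold) if g in ids[:k])
    return hits / len(gold)


# Toy data (hypothetical IDs, not taken from PubMedQA):
# one ranked list of retrieved IDs per query, plus the gold ID per query.
retrieved = [[101, 205, 309], [404, 102, 777], [555, 666, 103]]
gold = [101, 102, 999]

print(hit_rate_at_k(retrieved, gold, k=1))
print(hit_rate_at_k(retrieved, gold, k=3))
```

With a larger `k` the hit rate can only stay the same or increase, so results are comparable only when they use the same cutoff.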

In [None]:
import os
import re
from datetime import date
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError
import numpy as np

In [None]:
from dotenv import load_dotenv
import os

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
ES_URL = os.getenv("ES_URL")
ES_USER = os.getenv("ES_USER")
ES_PASS = os.getenv("ES_PASS")
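
`os.getenv` returns `None` for a variable that is not set, which would only surface later as a connection or authentication error. A small check along these lines can catch that earlier. The helper name `missing_env_vars` is hypothetical, and the example passes an explicit mapping instead of reading the real environment.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in `environ` (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping (placeholder values, not real credentials)
example_env = {"ES_URL": "https://localhost:9200", "ES_USER": "elastic"}
print(missing_env_vars(["ES_URL", "ES_USER", "ES_PASS"], example_env))
```

In the notebook this could be called as `missing_env_vars(["ES_URL", "ES_USER", "ES_PASS"])` before creating the client.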

In [None]:
# Create a global client connection to Elasticsearch
es_client = Elasticsearch(
    ES_URL,
    basic_auth=(ES_USER, ES_PASS),
    verify_certs=False,     # skips TLS certificate verification; only for a trusted test cluster
    request_timeout=10000,  # timeout in seconds
)
es_client.info()

In [3]:
print(es_client.info())

{'name': 'es-sample-es-data-master-2', 'cluster_name': 'es-sample', 'cluster_uuid': 'lxgst327RICarIi1P0c6TQ', 'version': {'number': '8.12.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '1665f706fd9354802c02146c1e6b5c0fbcddfbc9', 'build_date': '2024-01-11T10:05:27.953830042Z', 'build_snapshot': False, 'lucene_version': '9.9.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'}


## Evaluation

## BM25

In [6]:
def processESIndex(df_test, index_name, k):
    """Run a BM25 `match` query per question and record whether the gold pubid is in the top k hits."""
    df_test = df_test.copy()  # avoid mutating the caller's DataFrame
    df_test['model_op'] = False
    for ind in df_test.index:
        question = df_test.loc[ind, 'question']
        print(question)
        # BM25 keyword match against the `contexts` field
        search_query = {"match": {"contexts": question}}
        response = es_client.search(
            index=index_name,
            query=search_query,
            size=k,  # number of top hits to retrieve
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        flag = False
        for doc in all_hits:
            print('file--path', doc["_source"]['pubid'])
            if doc["_source"]['pubid'] == df_test.loc[ind, 'pubid']:
                flag = True
                break
        print(flag)
        # Use .loc to avoid pandas chained-assignment problems
        df_test.loc[ind, 'model_op'] = flag
    return df_test

In [7]:
from datasets import load_dataset
import pandas as pd
ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")

df_train = pd.DataFrame(ds['train'])
print(df_train.shape)
df_train.head()

(1000, 5)


| | pubid | question | context | long_answer | final_decision |
|---|---|---|---|---|---|
| 0 | 21645374 | Do mitochondria play a role in remodelling lac... | {'contexts': ['Programmed cell death (PCD) is ... | Results depicted mitochondrial dynamics in viv... | yes |
| 1 | 16418930 | Landolt C and snellen e acuity: differences in... | {'contexts': ['Assessment of visual acuity dep... | Using the charts described, there was only a s... | no |
| 2 | 9488747 | Syncope during bathing in infants, a pediatric... | {'contexts': ['Apparent life-threatening event... | "Aquagenic maladies" could be a pediatric form... | yes |
| 3 | 17208539 | Are the long-term results of the transanal pul... | {'contexts': ['The transanal endorectal pull-t... | Our long-term study showed significantly bette... | no |
| 4 | 10808977 | Can tailored interventions increase mammograph... | {'contexts': ['Telephone counseling and tailor... | The effects of the intervention were most pron... | yes |


In [8]:
index_name='pubmedqa_bm25_metadata'
query_json = processESIndex(df_train,index_name,20)

Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?
20
file--path 21645374
True
Landolt C and snellen e acuity: differences in strabismus amblyopia?
20
file--path 16418930
True
Syncope during bathing in infants, a pediatric form of water-induced urticaria?
20
file--path 9488747
True
Are the long-term results of the transanal pull-through equal to those of the transabdominal pull-through?
20
file--path 17208539
True
Can tailored interventions increase mammography use among HMO women?
20
file--path 10808977
True
Double balloon enteroscopy: is it efficacious and safe in a community setting?
20
file--path 22537367
file--path 26224642
file--path 19670133
file--path 18562742
file--path 21487872
file--path 24817372
file--path 21033209
file--path 25251991
file--path 19609674
file--path 27177708
file--path 25748631
file--path 17593459
file--path 21502764
file--path 21107001
file--path 17192736
file--path 11490380
file--path 22194097
file--path 19670112
fil

## K = 20

In [9]:
# Fraction of questions whose source document appears in the top-k BM25 hits
hit_rate = query_json['model_op'].sum() / len(query_json)
print(hit_rate)

0.953


## KNN Index + Hybrid search + Best Field + tie_breaker 
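
As a rough illustration of what `best_fields` with a `tie_breaker` does: the keyword score for a document is the score of its best-matching field plus `tie_breaker` times the scores of the other matching fields. The sketch below reproduces only that arithmetic. The function name and the per-field scores are made up for the example, and field boosts such as `^10` are assumed to be already applied to the per-field scores.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the sum of the other matching field scores."""
    matching = [s for s in field_scores.values() if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)


# Hypothetical per-field scores for one document (boosts already applied)
scores = {"contexts": 12.0, "meshes": 5.0, "keybert_title": 0.0, "yake_key_idea": 3.0}

# tie_breaker=0.0 keeps only the best field; 1.0 would sum all fields.
# Here the computation is 12 + 0.3 * (5 + 3).
print(best_fields_score(scores, tie_breaker=0.0))
print(best_fields_score(scores, tie_breaker=0.3))
```

A `tie_breaker` of 0.3 therefore rewards documents that match in several fields without letting secondary fields dominate the best one.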

In [14]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [None]:
def processESIndex_knn_best_field_tieBreaker_boost(df_test, index_name, k):
    """Run a hybrid (multi_match + kNN) query per question and record whether the gold pubid is in the top k hits."""
    df_test = df_test.copy()  # avoid overwriting results from earlier runs
    df_test['model_op'] = False
    for ind in df_test.index:
        question = df_test.loc[ind, 'question']
        print(question)
        # Keyword side: best_fields multi_match with per-field boosts
        keyword_query = {
            "multi_match": {
                "query": question,
                "type": "best_fields",
                "fields": ["contexts^10", "meshes^10", "yake_key_idea^4", "keybert_title^8"],
                "tie_breaker": 0.3,
            }
        }
        # Vector side: approximate kNN over the question embedding
        knn_query = {
            "field": "contexts_embedding",
            "query_vector": model.encode(question).tolist(),  # plain list for JSON serialization
            "k": k,
            "num_candidates": 100,
        }
        response = es_client.search(
            index=index_name,
            query=keyword_query,
            knn=knn_query,
            size=k,  # number of top hits to retrieve
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        flag = False
        for doc in all_hits:
            print('file--path', doc["_source"]['pubid'])
            if doc["_source"]['pubid'] == df_test.loc[ind, 'pubid']:
                flag = True
                break
        print(flag)
        # Use .loc to avoid pandas chained-assignment problems
        df_test.loc[ind, 'model_op'] = flag
    return df_test

In [36]:
index_name_knn = 'pubmedqa_knn_metadata'

In [37]:
query_knn = processESIndex_knn_best_field_tieBreaker_boost(df_train,index_name_knn,1)

Do mitochondria play a role in remodelling lace plant leaves during programmed cell death?
1
file--path 21645374
True
Landolt C and snellen e acuity: differences in strabismus amblyopia?
1
file--path 16418930
True
Syncope during bathing in infants, a pediatric form of water-induced urticaria?
1
file--path 9488747
True
Are the long-term results of the transanal pull-through equal to those of the transabdominal pull-through?
1
file--path 17208539
True
Can tailored interventions increase mammography use among HMO women?
1
file--path 10808977
True
Double balloon enteroscopy: is it efficacious and safe in a community setting?
1
file--path 22537367
False
30-Day and 1-year mortality in emergency general surgery laparotomies: an area of concern and need for improvement?
1
file--path 26037986
True
Is adjustment for reporting heterogeneity necessary in sleep disorders?
1
file--path 26852225
True
Do mutations causing low HDL-C promote increased carotid intima-media thickness?
1
file--path 1711306

## K = 1

In [38]:
# Fraction of questions whose source document is the top-1 hybrid hit
hit_rate = query_knn['model_op'].sum() / len(query_knn)
print(hit_rate)

0.787
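
The BM25 run in this notebook uses k = 20 and the hybrid run uses k = 1, so the two figures are not directly comparable. One way to compare methods fairly is to store the rank of the gold document for each question and derive the hit rate at several cutoffs from the same ranks. The sketch below assumes that approach; `gold_rank`, `summarize`, and the toy IDs are hypothetical and not part of the notebook.

```python
def gold_rank(retrieved_ids, gold_id):
    """1-based rank of gold_id in retrieved_ids, or None if it was not retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id == gold_id:
            return rank
    return None


def summarize(ranks, ks=(1, 5, 20)):
    """Hit rate at each cutoff in ks, plus mean reciprocal rank (MRR)."""
    n = len(ranks)
    if n == 0:
        raise ValueError("ranks must not be empty")
    report = {f"hit@{k}": sum(r is not None and r <= k for r in ranks) / n for k in ks}
    report["mrr"] = sum(1.0 / r for r in ranks if r is not None) / n
    return report


# Toy ranked lists (hypothetical IDs): gold at rank 1, rank 3, missing, rank 2
ranks = [
    gold_rank([7, 8, 9], 7),
    gold_rank([7, 8, 9], 9),
    gold_rank([7, 8, 9], 1),
    gold_rank([4, 5], 5),
]
print(ranks)
print(summarize(ranks))
```

To apply this here, each retrieval function would be run once with the largest cutoff of interest and would record the rank instead of a boolean flag.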
