## Natural Questions (NQ) Retriever Evaluation with Elasticsearch

This notebook evaluates retrieval methods on the Natural Questions (NQ) dataset, loaded from the KILT benchmark, using Elasticsearch. It covers:

- Loading and exploring the NQ dataset
- Setting up Elasticsearch indices for BM25 and KNN retrieval
- Implementing and evaluating different search strategies:
    - Simple BM25 search
    - Hybrid search using metadata fields
    - KNN + Hybrid retriever evaluation

The notebook provides code for querying Elasticsearch, comparing retrieved results with ground truth, and calculating retrieval accuracy.
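The comparison with ground truth boils down to a hit-at-k check: a question counts as correct when any of the top-k retrieved titles matches a gold title. The following is a minimal sketch of that idea on plain Python collections. The names `is_hit` and `retrieval_accuracy` are illustrative and do not appear in the notebook, and the data is a toy example.

```python
from typing import Iterable, List, Set


def is_hit(gold_titles: Set[str], retrieved_titles: Iterable[str]) -> bool:
    """Return True if any retrieved title exactly matches a gold title."""
    return any(title in gold_titles for title in retrieved_titles)


def retrieval_accuracy(hits: List[bool]) -> float:
    """Fraction of questions with at least one matching title in the top k."""
    return sum(hits) / len(hits) if hits else 0.0


# Toy data for illustration only (not results from the notebook).
hits = [
    is_hit({"Therefore sign"}, ["Ellipsis", "Therefore sign"]),
    is_hit({"Davos"}, ["World Economic Forum"]),
]
print(retrieval_accuracy(hits))
```

Matching is by exact title string, so differences in casing or redirects between the index and the gold annotations would count as misses.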

In [1]:
# Standard library
import json
import os
import re
import warnings
from datetime import date, datetime
from pathlib import Path

# Third-party
import pandas as pd
import requests
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

# Silence warnings (including the TLS warnings triggered by verify_certs=False)
warnings.filterwarnings('ignore')

In [None]:
from dotenv import load_dotenv
import os

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
ES_URL = os.getenv("ES_URL")
ES_USER = os.getenv("ES_USER")
ES_PASS = os.getenv("ES_PASS")
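`os.getenv` returns `None` for a variable that is not set, so a missing `.env` entry only surfaces later as a confusing connection error. A small check like the one below can catch that early. This is a sketch: `missing_settings` is a hypothetical helper, not part of the notebook.

```python
import os
from typing import List, Mapping


def missing_settings(env: Mapping[str, str], required: List[str]) -> List[str]:
    """Return the names of required settings that are absent or empty."""
    return [name for name in required if not env.get(name)]


# Check the process environment for the three variables the notebook reads.
missing = missing_settings(os.environ, ["ES_URL", "ES_USER", "ES_PASS"])
if missing:
    print("Missing settings:", ", ".join(missing))
```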

In [None]:
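# Create a global client connection to Elasticsearch
es_client = Elasticsearch(
    ES_URL,
    basic_auth=(ES_USER, ES_PASS),
    verify_certs=False,     # skips TLS certificate verification; only suitable for local or test clusters
    request_timeout=10000   # value is in seconds
)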

In [3]:
index_name_bm25 = "nq_bm25_metadata"
index_name_knn = "nq_knn_metadata"
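The notebook queries these two indices but does not show how they were created. For orientation, a mapping for the KNN index might look roughly like the sketch below. It is an assumption built from the field names used in the queries later on (`context`, `title`, `keybert_title`, `yake_key_idea`, `entities`, `contexts_embedding`) and from the 384-dimensional output of the `all-MiniLM-L6-v2` model; the field types and the cosine similarity setting are guesses, not the actual index configuration.

```python
# Assumed mapping for the KNN index; the notebook does not show the real one.
EMBEDDING_DIMS = 384  # output size of sentence-transformers/all-MiniLM-L6-v2

knn_index_mapping = {
    "properties": {
        "title": {"type": "text"},
        "context": {"type": "text"},
        "keybert_title": {"type": "text"},
        "yake_key_idea": {"type": "text"},
        "entities": {"type": "text"},
        "contexts_embedding": {
            "type": "dense_vector",
            "dims": EMBEDDING_DIMS,
            "index": True,           # required for approximate kNN search
            "similarity": "cosine",  # assumption; the real index may differ
        },
    }
}

# Creating the index would look roughly like this (requires a live cluster):
# es_client.indices.create(index="nq_knn_metadata", mappings=knn_index_mapping)
```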

## 1. Evaluating on the NQ Dataset

In [4]:
import pandas as pd
import json

from datasets import load_dataset
ds = load_dataset("facebook/kilt_tasks", "nq")

In [5]:
print(ds)

DatasetDict({
    train: Dataset({
        features: ['id', 'input', 'meta', 'output'],
        num_rows: 87372
    })
    validation: Dataset({
        features: ['id', 'input', 'meta', 'output'],
        num_rows: 2837
    })
    test: Dataset({
        features: ['id', 'input', 'meta', 'output'],
        num_rows: 1444
    })
})


In [6]:
validation_data = ds['validation']

In [7]:
type(validation_data)

datasets.arrow_dataset.Dataset

In [8]:
validation_data_df = pd.DataFrame(validation_data)
validation_data_df.head()

| | id | input | meta | output |
|---|---|---|---|---|
| 0 | 6915606477668963399 | what do the 3 dots mean in math | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'the therefore sign', 'meta': {'sc... |
| 1 | -8366545547296627039 | who wrote the song photograph by ringo starr | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Ringo Starr', 'meta': {'score': -... |
| 2 | -5004457603684974952 | who is playing the halftime show at super bowl... | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Coldplay', 'meta': {'score': -1},... |
| 3 | 7217222058435937287 | where was the world economic forum held this year | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Davos', 'meta': {'score': -1}, 'p... |
| 4 | -143054837169120955 | where are the giant redwoods located in califo... | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Humboldt County', 'meta': {'score... |


In [9]:
validation_data_df['output'][0]

[{'answer': 'the therefore sign',
  'meta': {'score': -1},
  'provenance': [{'bleu_score': 1.0,
    'start_character': 44,
    'start_paragraph_id': 1,
    'end_character': 62,
    'end_paragraph_id': 1,
    'meta': {'fever_page_id': '',
     'fever_sentence_id': -1,
     'annotation_id': '-1',
     'yes_no_answer': '',
     'evidence_span': []},
    'section': 'Section::::Abstract.',
    'title': 'Therefore sign',
    'wikipedia_id': '10593264'}]},
 {'answer': 'therefore sign',
  'meta': {'score': -1},
  'provenance': [{'bleu_score': 1.0,
    'start_character': 48,
    'start_paragraph_id': 1,
    'end_character': 62,
    'end_paragraph_id': 1,
    'meta': {'fever_page_id': '',
     'fever_sentence_id': -1,
     'annotation_id': '-1',
     'yes_no_answer': '',
     'evidence_span': []},
    'section': 'Section::::Abstract.',
    'title': 'Therefore sign',
    'wikipedia_id': '10593264'}]},
 {'answer': 'a logical consequence , such as the conclusion of a syllogism',
  'meta': {'score':

In [10]:
validation_data_df['output'][0][0]['provenance'][0]['title']

'Therefore sign'
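Each `output` entry can carry its own provenance, so a question may have several gold titles. The sketch below collects all of them into a set. `gold_titles` is a hypothetical helper, and unlike the evaluation functions in this notebook, which only look at `provenance[0]`, it reads every provenance entry. The sample data is a trimmed copy of the record shown above.

```python
from typing import Any, Dict, List, Set


def gold_titles(output: List[Dict[str, Any]]) -> Set[str]:
    """Collect every provenance title attached to a non-empty answer."""
    titles: Set[str] = set()
    for item in output:
        if not item.get("answer"):
            continue  # skip entries without answer text
        for prov in item.get("provenance") or []:
            if prov.get("title"):
                titles.add(prov["title"])
    return titles


# Trimmed version of validation_data_df['output'][0]
sample_output = [
    {"answer": "the therefore sign",
     "provenance": [{"title": "Therefore sign", "wikipedia_id": "10593264"}]},
    {"answer": "therefore sign",
     "provenance": [{"title": "Therefore sign", "wikipedia_id": "10593264"}]},
    {"answer": "", "provenance": []},
]
print(gold_titles(sample_output))  # {'Therefore sign'}
```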

In [11]:
# knn_model_name ='sentence-transformers/multi-qa-MiniLM-L6-cos-v1'
knn_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(knn_model_name, trust_remote_code=True)

In [None]:
def processESIndex_simplesearch(df_questions, index_name, k):
    """Run a BM25 match query per question and flag whether any top-k hit has a gold title."""
    df_questions['model_op'] = False
    for ind in df_questions.index:
        question = df_questions['input'][ind]
        print(question)
        # Plain BM25 match against the passage text
        search_query = {
            "query": {
                "match": {
                    "context": question
                }
            }
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        answer_ids = df_questions['output'][ind]
        flag = False
        for doc in all_hits:
            for ids in answer_ids:
                # Only entries with answer text and provenance carry a gold title
                if ids['answer'] != '' and len(ids['provenance']) > 0:
                    if ids['provenance'][0]['title'] == doc["_source"]['title']:
                        flag = True
                        break
            if flag:
                break  # stop at the first matching hit
        print(flag)
        # .at avoids chained assignment, which may silently fail to write back
        df_questions.at[ind, 'model_op'] = flag
    return df_questions

In [None]:
validation_data_df.shape

In [None]:
df_eval_k5 = validation_data_df.copy()
df_eval_k5=processESIndex_simplesearch(df_eval_k5,index_name_bm25,5)

In [None]:
df_eval_k5['model_op'].sum()/len(df_eval_k5)

#### Hybrid search with metadata fields: 1. keybert_title 2. yake_key_idea 3. entities

In [None]:
def processESIndex_Hybridsearch(df_questions, index_name, k):
    """Run a multi_match query over text and metadata fields and flag top-k title hits."""
    df_questions['model_op'] = False
    for ind in df_questions.index:
        question = df_questions['input'][ind]
        print(question)
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    "fields": ["context", "title", "keybert_title"],
                    # Alternative field sets tried:
                    # "fields": ["story^10","keybert_title^4", "title^8"],
                    # "fields": ["context^10","title^10","yake_key_idea","keybert_title^10"],
                    "tie_breaker": 0.3
                }
            },
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        answer_ids = df_questions['output'][ind]
        flag = False
        for doc in all_hits:
            for ids in answer_ids:
                # Only entries with answer text and provenance carry a gold title
                if ids['answer'] != '' and len(ids['provenance']) > 0:
                    print("Golden title:", ids['provenance'][0]['title'])
                    print("Retrieved title:", doc["_source"]['title'])
                    if ids['provenance'][0]['title'] == doc["_source"]['title']:
                        flag = True
                        break
            if flag:
                break  # stop at the first matching hit
        print(flag)
        # .at avoids chained assignment, which may silently fail to write back
        df_questions.at[ind, 'model_op'] = flag
    return df_questions
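The `best_fields` type with a `tie_breaker` scores a document as the best single-field score plus `tie_breaker` times the scores of the other matching fields. The sketch below reproduces that arithmetic with made-up per-field scores; `best_fields_score` is an illustrative function, and the numbers are not taken from the index.

```python
from typing import Dict


def best_fields_score(field_scores: Dict[str, float], tie_breaker: float = 0.3) -> float:
    """Best field score plus tie_breaker times the sum of the remaining field scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores.values())
    others = sum(field_scores.values()) - best
    return best + tie_breaker * others


# Hypothetical per-field BM25 scores for one document
scores = {"context": 4.0, "title": 2.0, "keybert_title": 1.0}
print(best_fields_score(scores))
```

With `tie_breaker` set to 0 only the best field counts, and with 1.0 the field scores are simply summed, so 0.3 rewards documents that match in several fields without letting weaker fields dominate.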

In [None]:
df_eval_k5 = validation_data_df.copy()
df_eval_k5 = processESIndex_Hybridsearch(df_eval_k5,index_name_bm25,5)

In [None]:
df_eval_k5.head()

In [None]:
score_val_k5 = (df_eval_k5['model_op']==True).sum()
print("Score Val ---",score_val_k5)
print("Score -------",score_val_k5/len(df_eval_k5))

### KNN + Hybrid Retriever Evaluation

In [12]:
def processESIndex_KNN_Hybridsearch(df_questions, index_name, k):
    """Combine a multi_match query with approximate kNN search and flag top-k title hits."""
    df_questions['model_op'] = False
    for ind in df_questions.index:
        question = df_questions['input'][ind]
        print(ind, "**********", question)
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    # Alternative field sets tried:
                    # "fields": ["context","title"],
                    # "fields": ["context","title","keybert_title"],
                    # "fields": ["story^10","keywords^2"],
                    # "fields": ["story^10","entities^2","keybert_title^8","yake_key_idea^2"],
                    "fields": ["context^10", "title^10", "yake_key_idea", "keybert_title^10"],
                    "tie_breaker": 0.3
                }
            },
            # Approximate kNN over the passage embeddings, weighted by boost
            "knn": {
                "field": "contexts_embedding",
                "query_vector": df_questions['question_embedding'][ind],
                "k": k,
                "num_candidates": 100,
                "boost": 10
            },
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        answer_ids = df_questions['output'][ind]
        flag = False
        for doc in all_hits:
            for ids in answer_ids:
                print("******************", ids)
                if ids['answer'] != '':
                    print("----------------------", len(ids['provenance']))
                    if len(ids['provenance']) > 0:
                        if ids['provenance'][0]['title'] == doc["_source"]['title']:
                            flag = True
                            break
            if flag:
                break  # stop at the first matching hit
            print(flag)
        # .at avoids chained assignment, which may silently fail to write back
        df_questions.at[ind, 'model_op'] = flag
    return df_questions
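When a search request carries both a `query` and a top-level `knn` clause, Elasticsearch documents the final score of a hit as the boosted sum of the two scores, and a document found by only one side gets only that side's contribution. The sketch below illustrates that combination with the boosts used above (default 1.0 for the query, 10 for kNN). `hybrid_score` is an illustrative function and the scores are invented.

```python
from typing import Optional


def hybrid_score(
    query_score: Optional[float],
    knn_score: Optional[float],
    query_boost: float = 1.0,
    knn_boost: float = 10.0,
) -> float:
    """Boosted sum of lexical and vector scores; None means that side did not match."""
    total = 0.0
    if query_score is not None:
        total += query_boost * query_score
    if knn_score is not None:
        total += knn_boost * knn_score
    return total


# Hypothetical scores: a BM25 score of 12.5 and a cosine-based kNN score of 0.8
print(hybrid_score(12.5, 0.8))   # matched by both clauses
print(hybrid_score(None, 0.8))   # matched only by kNN
```

Because BM25 scores are unbounded while cosine-based kNN scores stay within a small range, the kNN boost largely decides how much the vector side can influence the ranking. The field boosts (`^10`) inside `multi_match` inflate the lexical side in the same way.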

In [None]:
# Encode all questions in one batched call; each row stores its own embedding vector
question_embeddings = model.encode(validation_data_df['input'].tolist(), show_progress_bar=True)
validation_data_df['question_embedding'] = list(question_embeddings)

In [14]:
validation_data_df.head()

| | id | input | meta | output | question_embedding |
|---|---|---|---|---|---|
| 0 | 6915606477668963399 | what do the 3 dots mean in math | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'the therefore sign', 'meta': {'sc... | [-0.021617994, -0.032326434, -0.046972573, -0.... |
| 1 | -8366545547296627039 | who wrote the song photograph by ringo starr | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Ringo Starr', 'meta': {'score': -... | [0.009027813, 0.078675, 0.036336917, 0.0033178... |
| 2 | -5004457603684974952 | who is playing the halftime show at super bowl... | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Coldplay', 'meta': {'score': -1},... | [-0.015418596, 0.038507085, 0.015025766, -0.18... |
| 3 | 7217222058435937287 | where was the world economic forum held this year | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Davos', 'meta': {'score': -1}, 'p... | [0.023470119, -0.049751315, -0.0731394, 0.0047... |
| 4 | -143054837169120955 | where are the giant redwoods located in califo... | {'left_context': '', 'mention': '', 'right_con... | [{'answer': 'Humboldt County', 'meta': {'score... | [0.12426712, 0.05179882, 0.039753318, 0.065471... |


In [15]:
df_eval_k5 = validation_data_df[0:100].copy()
df_eval_k5 = processESIndex_KNN_Hybridsearch(df_eval_k5,index_name_knn,5)

0 ********** what do the 3 dots mean in math
5
****************** {'answer': 'the therefore sign', 'meta': {'score': -1}, 'provenance': [{'bleu_score': 1.0, 'start_character': 44, 'start_paragraph_id': 1, 'end_character': 62, 'end_paragraph_id': 1, 'meta': {'fever_page_id': '', 'fever_sentence_id': -1, 'annotation_id': '-1', 'yes_no_answer': '', 'evidence_span': []}, 'section': 'Section::::Abstract.', 'title': 'Therefore sign', 'wikipedia_id': '10593264'}]}
---------------------- 1
****************** {'answer': 'therefore sign', 'meta': {'score': -1}, 'provenance': [{'bleu_score': 1.0, 'start_character': 48, 'start_paragraph_id': 1, 'end_character': 62, 'end_paragraph_id': 1, 'meta': {'fever_page_id': '', 'fever_sentence_id': -1, 'annotation_id': '-1', 'yes_no_answer': '', 'evidence_span': []}, 'section': 'Section::::Abstract.', 'title': 'Therefore sign', 'wikipedia_id': '10593264'}]}
---------------------- 1
****************** {'answer': 'a logical consequence , such as the conclusion

In [17]:
score_val_k5 = (df_eval_k5['model_op']==True).sum()
print("Score Val ---",score_val_k5)
print("Score -------",score_val_k5/len(df_eval_k5))

Score Val --- 63
Score ------- 0.63
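The 0.63 above is the hit rate at k=5 on the first 100 validation questions. If the rank of the first matching hit is recorded instead of a boolean, a single retrieval run can report the hit rate at several cutoffs. The sketch below shows one way to do that; `first_hit_rank` and `hit_rate_at_k` are hypothetical helpers and the ranks are toy values, not results from this notebook.

```python
from typing import Dict, Iterable, List, Optional, Set


def first_hit_rank(gold: Set[str], retrieved: List[str]) -> Optional[int]:
    """1-based rank of the first retrieved title found in the gold set, or None."""
    for rank, title in enumerate(retrieved, start=1):
        if title in gold:
            return rank
    return None


def hit_rate_at_k(ranks: List[Optional[int]], ks: Iterable[int]) -> Dict[int, float]:
    """Share of questions whose first hit falls within each cutoff k."""
    total = len(ranks)
    if total == 0:
        return {k: 0.0 for k in ks}
    return {k: sum(1 for r in ranks if r is not None and r <= k) / total for k in ks}


# Toy ranks for four questions: hits at rank 1, 3 and 5, plus one miss
ranks = [1, 3, None, 5]
print(hit_rate_at_k(ranks, [1, 3, 5]))  # {1: 0.25, 3: 0.5, 5: 0.75}
```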
