## SQuAD - Retriever Evaluation: Evaluating various semantic search methods
This notebook evaluates various semantic search retrieval methods on the SQuAD dataset using Elasticsearch. It compares BM25 and KNN-based retrieval strategies, including enriched metadata and RAG (Retrieval-Augmented Generation) approaches.

**Sections:**
- Data loading and preprocessing
- BM25-based retrieval evaluation
- KNN + Best Field retrieval evaluation
- RAG retriever evaluation
- Enriched metadata retrieval

**Requirements:**
- Elasticsearch instance with relevant indices
- SQuAD dataset in JSON format
- Python packages: pandas, elasticsearch, sentence-transformers

In [1]:
import os
import pandas as pd
from elasticsearch import Elasticsearch
import warnings
warnings.filterwarnings('ignore')

In [2]:
from dotenv import load_dotenv

# Path to your .env file
env_path = "../.env"  # Change path if needed

# Load environment variables from .env
load_dotenv(dotenv_path=env_path)
# Access the environment variables
ES_URL = os.getenv("ES_URL")
ES_USER = os.getenv("ES_USER")
ES_PASS = os.getenv("ES_PASS")
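
`os.getenv` returns `None` for an unset variable, so a missing `.env` entry would only surface later as a connection or authentication error. A minimal sketch of an early check follows; the helper name `require_env` and the demo values are assumptions for illustration and are not part of the original notebook.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for every name, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Demo with an explicit mapping so the example does not depend on a real .env file.
# The values below are placeholders, not real credentials.
demo_env = {
    "ES_URL": "https://localhost:9200",
    "ES_USER": "elastic",
    "ES_PASS": "changeme",
}
settings = require_env(["ES_URL", "ES_USER", "ES_PASS"], environ=demo_env)
print(sorted(settings))
```

In the notebook itself, calling `require_env(["ES_URL", "ES_USER", "ES_PASS"])` right after `load_dotenv` would fail fast with a readable message.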

In [None]:

es_client = Elasticsearch(
    ES_URL,
    basic_auth=(ES_USER, ES_PASS),
    verify_certs=False,     # skips TLS certificate verification; only for test clusters
    request_timeout=10000   # per-request timeout, in seconds
)
es_client.info()

ObjectApiResponse({'name': 'es-sample-es-data-master-2', 'cluster_name': 'es-sample', 'cluster_uuid': 'lxgst327RICarIi1P0c6TQ', 'version': {'number': '8.12.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '1665f706fd9354802c02146c1e6b5c0fbcddfbc9', 'build_date': '2024-01-11T10:05:27.953830042Z', 'build_snapshot': False, 'lucene_version': '9.9.1', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [3]:
index_name_bm25 = "research_index_bm25_squad"
index_name_knn = "research_index_knn_squad"

## 1. Evaluating on the SQuAD dataset

In [None]:
def get_all_files(folder_name):
    """Return the paths of all entries in folder_name, sorted by file name."""
    # List the folder directly instead of calling os.chdir(), which would
    # silently change the working directory for every later cell.
    file_path_list = []
    for file in sorted(os.listdir(folder_name)):
        print(file)
        file_path_list.append(os.path.join(folder_name, file))
    return file_path_list

files = get_all_files('../../data/Squad')

.DS_Store
dev-v1.1.json
train-v1.1.json


In [5]:
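# Select the files by name rather than by position, so that stray entries
# such as .DS_Store cannot shift the indices.
df_docs_train = pd.read_json(next(f for f in files if f.endswith('train-v1.1.json')))
df_docs_dev = pd.read_json(next(f for f in files if f.endswith('dev-v1.1.json')))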

In [6]:
df_docs_train.head()

| | data | version |
|---|---|---|
| 0 | {'title': 'University_of_Notre_Dame', 'paragra... | 1.1 |
| 1 | {'title': 'Beyoncé', 'paragraphs': [{'context'... | 1.1 |
| 2 | {'title': 'Montana', 'paragraphs': [{'context'... | 1.1 |
| 3 | {'title': 'Genocide', 'paragraphs': [{'context... | 1.1 |
| 4 | {'title': 'Antibiotics', 'paragraphs': [{'cont... | 1.1 |


In [7]:
df_docs_train['data'][0]

{'title': 'University_of_Notre_Dame',
 'paragraphs': [{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
   'qas': [{'answers': [{'answer_start': 515,
       'text': 'Saint Bernadette Soubirous'}],
     'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
     'id': '5733be284776f41900661182'},
    {'ans

In [8]:
def create_question_on_squad(df_squad):
    """Flatten SQuAD articles into one record per question."""
    doc_question = []
    for _, row in df_squad.iterrows():
        data = row['data']
        title = data['title']
        # Each article holds several paragraphs; each paragraph has its own questions.
        for paragraph in data['paragraphs']:
            story = paragraph['context']
            for question in paragraph['qas']:
                doc_question.append({
                    "title": title,
                    "context": story,
                    "question": question['question'],
                    "answer": question['answers'],
                })
    return doc_question

In [None]:
doc_question_dev = create_question_on_squad(df_docs_dev)

In [10]:
len(doc_question_dev)

10570
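
The flattening step turns the nested article → paragraph → question layout into one row per question, which is why the dev set yields 10570 rows. A minimal sketch of the same idea on a toy article follows; the function name `flatten_squad` and the toy content are made up for illustration.

```python
def flatten_squad(articles):
    """Yield one row per question from a list of SQuAD v1.1 article dicts."""
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "title": article["title"],
                    "context": paragraph["context"],
                    "question": qa["question"],
                    "answer": qa["answers"],
                }


# Toy article in the SQuAD v1.1 layout (invented content, for illustration only).
toy_articles = [
    {
        "title": "Toy_Article",
        "paragraphs": [
            {
                "context": "The sky is blue.",
                "qas": [
                    {"id": "q1", "question": "What color is the sky?",
                     "answers": [{"answer_start": 11, "text": "blue"}]},
                    {"id": "q2", "question": "What is blue?",
                     "answers": [{"answer_start": 4, "text": "sky"}]},
                ],
            }
        ],
    }
]

rows = list(flatten_squad(toy_articles))
print(len(rows))  # one row per question
```

Every row repeats the article title and the paragraph text, so the title can later serve as the gold label for retrieval.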

In [11]:
def processESIndex_BM25_simple(df_test, index_name, k):
    """Mark each question True if one of the top-k BM25 hits has the gold title."""
    results = []
    for question, gold_title in zip(df_test['question'], df_test['title']):
        print(question)
        # Plain BM25 match of the question against the passage text.
        search_query = {
            "query": {
                "match": {
                    "story": question
                }
            }
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        flag = False
        for doc in all_hits:
            print('title', doc["_source"]['title'])
            if doc["_source"]['title'] == gold_title:
                flag = True
                break
        print(flag)
        results.append(flag)
    # Assign the whole column at once; chained assignment such as
    # df_test['model_op'][ind] = flag may write to a temporary copy.
    df_test['model_op'] = results
    return df_test

In [12]:
df_question_dev = pd.DataFrame(doc_question_dev)

In [13]:
df_question_dev.head()

| | title | context | question | answer |
|---|---|---|---|---|
| 0 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the AFC at Super Bo... | [{'answer_start': 177, 'text': 'Denver Broncos... |
| 1 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the NFC at Super Bo... | [{'answer_start': 249, 'text': 'Carolina Panth... |
| 2 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | [{'answer_start': 403, 'text': 'Santa Clara, C... |
| 3 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team won Super Bowl 50? | [{'answer_start': 177, 'text': 'Denver Broncos... |
| 4 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | What color was used to emphasize the 50th anni... | [{'answer_start': 488, 'text': 'gold'}, {'answ... |


In [15]:
doc_question = processESIndex_BM25_simple(df_question_dev,index_name_bm25,1)

Which NFL team represented the AFC at Super Bowl 50?
1
title Super_Bowl_50
True
Which NFL team represented the NFC at Super Bowl 50?
1
title Super_Bowl_50
True
Where did Super Bowl 50 take place?
1
title Super_Bowl_50
True
Which NFL team won Super Bowl 50?
1
title Super_Bowl_50
True
What color was used to emphasize the 50th anniversary of the Super Bowl?
1
title Super_Bowl_50
True
What was the theme of Super Bowl 50?
1
title Super_Bowl_50
True
What day was the game played on?
1
title Exhibition_game
False
What is the AFC short for?
1
title Red
False
What was the theme of Super Bowl 50?
1
title Super_Bowl_50
True
What does AFC stand for?
1
title Association_football
False
What day was the Super Bowl played on?
1
title Super_Bowl_50
True
Who won Super Bowl 50?
1
title Super_Bowl_50
True
What venue did Super Bowl 50 take place in?
1
title Super_Bowl_50
True
What city did Super Bowl 50 take place in?
1
title Super_Bowl_50
True
If Roman numerals were used, what would Super Bowl 50 have been

In [16]:
doc_question['model_op'].sum()/len(doc_question)

0.8467360454115421
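
The score above is a title-level hit rate: a question counts as correct when one of the top `k` retrieved passages comes from the same article as the question. A minimal sketch of that metric on toy rankings follows; the helper names `hit_at_k` and `hit_rate` are hypothetical, and the lower-ranked titles are invented for illustration.

```python
def hit_at_k(retrieved_titles, gold_title, k):
    """True if gold_title is among the first k retrieved titles."""
    return gold_title in retrieved_titles[:k]


def hit_rate(results, k):
    """Fraction of (retrieved_titles, gold_title) pairs with a hit in the top k."""
    if not results:
        return 0.0
    hits = sum(hit_at_k(titles, gold, k) for titles, gold in results)
    return hits / len(results)


# Toy rankings. The first title of each ranking echoes a printed example above;
# the second title of each ranking is invented.
toy_results = [
    (["Super_Bowl_50", "Exhibition_game"], "Super_Bowl_50"),
    (["Exhibition_game", "Super_Bowl_50"], "Super_Bowl_50"),
    (["Red", "Association_football"], "Super_Bowl_50"),
]
print(hit_rate(toy_results, k=1))
print(hit_rate(toy_results, k=2))
```

Because the check is on the article title, a hit does not guarantee that the retrieved passage is the exact paragraph containing the answer.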

In [18]:
def processESIndex_BM25_BestField(df_test, index_name, k):
    """Mark each question True if one of the top-k best_fields hits has the gold title."""
    results = []
    for question, gold_title in zip(df_test['question'], df_test['title']):
        print(question)
        # multi_match over several fields; best_fields scores by the best matching field.
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    "fields": ["story", "topics", "title"]
                }
            }
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        flag = False
        for doc in all_hits:
            print('title', doc["_source"]['title'])
            if doc["_source"]['title'] == gold_title:
                flag = True
                break
        print(flag)
        results.append(flag)
    # Assign the whole column at once instead of using chained assignment.
    df_test['model_op'] = results
    return df_test

In [None]:
doc_question = processESIndex_BM25_BestField(df_question_dev,index_name_bm25,1)

In [20]:
doc_question['model_op'].sum()/len(doc_question)

0.8405865657521286

## KNN + Best Field

In [21]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

In [22]:
def processESIndex_KNN_BestField(df_test, index_name, k):
    """Mark each question True if one of the top-k hybrid (best_fields + kNN) hits matches."""
    results = []
    for question, gold_title in zip(df_test['question'], df_test['title']):
        print(question)
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    # "fields": ["story^10", "id^10", "keywords^2", "topics^6"],
                    "fields": ["story", "id"],
                    # "tie_breaker": 0.3
                }
            },
            "knn": {
                "field": "story_embedding",
                # Convert the NumPy embedding to a plain list for JSON serialization.
                "query_vector": model.encode(question).tolist(),
                "k": k,
                "num_candidates": 100,
                "boost": 10
            },
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        flag = False
        for doc in all_hits:
            # In this index the 'id' field is compared against the article title.
            if doc["_source"]['id'] == gold_title:
                flag = True
                break
        print(flag)
        results.append(flag)
    # Assign the whole column at once instead of using chained assignment.
    df_test['model_op'] = results
    return df_test

In [None]:
doc_question= processESIndex_KNN_BestField(df_question_dev,index_name_knn,1)

In [25]:
doc_question['model_op'].sum()/len(doc_question)

0.8706717123935667
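
The `knn` clause ranks passages by the similarity between the question embedding and each stored `story_embedding`. The notebook does not show the index mapping, so the sketch below assumes cosine similarity and performs an exact (brute-force) search on toy 3-dimensional vectors, whereas Elasticsearch uses an approximate search over `num_candidates` candidates. The names `cosine_similarity`, `top_k`, and `toy_docs` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, doc_vectors, k):
    """Return the ids of the k documents most similar to query_vector (exact search)."""
    scored = sorted(
        doc_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]


# Toy embeddings; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.0],
}
print(top_k([0.9, 0.1, 0.0], toy_docs, k=2))
```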

## RAG Retrieved data

In [28]:
def processESIndex_BM25_simple_RAG(df_test, index_name, k):
    """Store the text of the top-k BM25 hits for each question in 'R_context'."""
    contexts = []
    for question in df_test['question']:
        print(question)
        search_query = {
            "query": {
                "match": {
                    "story": question
                }
            }
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        context = []
        for doc in all_hits:
            print('title', doc["_source"]['title'])
            context.append(doc["_source"]['story'])
        contexts.append(context)
    # Assign the whole column at once instead of using chained assignment.
    df_test['R_context'] = contexts
    return df_test

In [29]:
doc_question = processESIndex_BM25_simple_RAG(df_question_dev,index_name_bm25,1)

Which NFL team represented the AFC at Super Bowl 50?
1
title Super_Bowl_50
Which NFL team represented the NFC at Super Bowl 50?
1
title Super_Bowl_50
Where did Super Bowl 50 take place?
1
title Super_Bowl_50
Which NFL team won Super Bowl 50?
1
title Super_Bowl_50
What color was used to emphasize the 50th anniversary of the Super Bowl?
1
title Super_Bowl_50
What was the theme of Super Bowl 50?
1
title Super_Bowl_50
What day was the game played on?
1
title Exhibition_game
What is the AFC short for?
1
title Red
What was the theme of Super Bowl 50?
1
title Super_Bowl_50
What does AFC stand for?
1
title Association_football
What day was the Super Bowl played on?
1
title Super_Bowl_50
Who won Super Bowl 50?
1
title Super_Bowl_50
What venue did Super Bowl 50 take place in?
1
title Super_Bowl_50
What city did Super Bowl 50 take place in?
1
title Super_Bowl_50
If Roman numerals were used, what would Super Bowl 50 have been called?
1
title Super_Bowl_50
Super Bowl 50 decided the NFL champion for

In [30]:
doc_question.head()

| | title | context | question | answer | model_op | R_context |
|---|---|---|---|---|---|---|
| 0 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the AFC at Super Bo... | [{'answer_start': 177, 'text': 'Denver Broncos... | True | [Super Bowl 50 was an American football game t... |
| 1 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team represented the NFC at Super Bo... | [{'answer_start': 249, 'text': 'Carolina Panth... | True | [Super Bowl 50 was an American football game t... |
| 2 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Where did Super Bowl 50 take place? | [{'answer_start': 403, 'text': 'Santa Clara, C... | True | [CBS broadcast Super Bowl 50 in the U.S., and ... |
| 3 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | Which NFL team won Super Bowl 50? | [{'answer_start': 177, 'text': 'Denver Broncos... | True | [Super Bowl 50 was an American football game t... |
| 4 | Super_Bowl_50 | Super Bowl 50 was an American football game to... | What color was used to emphasize the 50th anni... | [{'answer_start': 488, 'text': 'gold'}, {'answ... | True | [Super Bowl 50 was an American football game t... |
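
The `R_context` column holds the retrieved passages, but the notebook does not score them. One possible check, sketched here as an assumption rather than the author's method, is whether any gold answer string occurs verbatim in a retrieved passage. The helper name `answer_in_contexts` and the toy passages are invented for illustration.

```python
def answer_in_contexts(answers, contexts):
    """True if any gold answer text occurs (case-insensitively) in any retrieved passage."""
    answer_texts = [answer["text"].lower() for answer in answers]
    return any(text in context.lower() for context in contexts for text in answer_texts)


# Toy rows in the same shape as the 'answer' and 'R_context' columns.
# The passages are made up and are not taken from the index.
toy_rows = [
    {
        "answer": [{"answer_start": 4, "text": "Denver Broncos"}],
        "R_context": ["The Denver Broncos won the game."],
    },
    {
        "answer": [{"answer_start": 0, "text": "gold"}],
        "R_context": ["A passage about an unrelated topic."],
    },
]

flags = [answer_in_contexts(row["answer"], row["R_context"]) for row in toy_rows]
print(flags)
```

On the real DataFrame, the same function could be applied row-wise, for example with `doc_question.apply(lambda r: answer_in_contexts(r['answer'], r['R_context']), axis=1).mean()`. Exact string containment is a stricter test than the title match used earlier.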


## KNN

In [32]:
def processESIndex_KNN_BestField_RAG(df_test, index_name, k):
    """Store the text of the top-k hybrid (best_fields + kNN) hits in 'R_context'."""
    contexts = []
    for question in df_test['question']:
        print(question)
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    # "fields": ["story^10", "id^10", "keywords^2", "topics^6"],
                    "fields": ["story", "id"],
                    # "tie_breaker": 0.3
                }
            },
            "knn": {
                "field": "story_embedding",
                # Convert the NumPy embedding to a plain list for JSON serialization.
                "query_vector": model.encode(question).tolist(),
                "k": k,
                "num_candidates": 100,
                "boost": 10
            },
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        contexts.append([doc["_source"]['story'] for doc in all_hits])
    # Assign the whole column at once instead of using chained assignment.
    df_test['R_context'] = contexts
    return df_test

In [33]:
doc_question= processESIndex_KNN_BestField_RAG(df_question_dev,index_name_knn,1)

Which NFL team represented the AFC at Super Bowl 50?
1
Which NFL team represented the NFC at Super Bowl 50?
1
Where did Super Bowl 50 take place?
1
Which NFL team won Super Bowl 50?
1
What color was used to emphasize the 50th anniversary of the Super Bowl?
1
What was the theme of Super Bowl 50?
1
What day was the game played on?
1
What is the AFC short for?
1
What was the theme of Super Bowl 50?
1
What does AFC stand for?
1
What day was the Super Bowl played on?
1
Who won Super Bowl 50?
1
What venue did Super Bowl 50 take place in?
1
What city did Super Bowl 50 take place in?
1
If Roman numerals were used, what would Super Bowl 50 have been called?
1
Super Bowl 50 decided the NFL champion for what season?
1
What year did the Denver Broncos secure a Super Bowl title for the third time?
1
What city did Super Bowl 50 take place in?
1
What stadium did Super Bowl 50 take place in?
1
What was the final score of Super Bowl 50? 
1
What month, day and year did Super Bowl 50 take place? 
1
What 

In [35]:
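# Note: processESIndex_KNN_BestField_RAG only fills 'R_context'; 'model_op' still
# holds the flags from the earlier processESIndex_KNN_BestField run, so this
# repeats that score.
doc_question['model_op'].sum()/len(doc_question)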

0.8706717123935667

## Enriched metadata

In [36]:
def processESIndex_KNN_enriched_metadata(df_test, index_name, k):
    """Store the top-k hits of a boosted multi-field + kNN query in 'R_context'."""
    contexts = []
    for question in df_test['question']:
        print(question)
        search_query = {
            "query": {
                "multi_match": {
                    "query": question,
                    "type": "best_fields",
                    # Boost passage text and title over the enriched topics and keywords.
                    "fields": ["story^10", "title^10", "topics^6", "keywords^2"],
                    # "fields": ["story", "id"],
                    "tie_breaker": 0.3
                }
            },
            "knn": {
                "field": "story_embedding",
                # Convert the NumPy embedding to a plain list for JSON serialization.
                "query_vector": model.encode(question).tolist(),
                "k": k,
                "num_candidates": 100,
                "boost": 10
            },
        }
        response = es_client.search(
            index=index_name,
            body=search_query,
            size=k  # number of top hits to return
        )
        all_hits = response['hits']['hits']
        print(len(all_hits))
        contexts.append([doc["_source"]['story'] for doc in all_hits])
    # Assign the whole column at once instead of using chained assignment.
    df_test['R_context'] = contexts
    return df_test
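
With `type: best_fields`, a document's score is taken from its best-matching field, and `tie_breaker` adds a fraction of the scores of the other matching fields. The sketch below illustrates that combination with made-up per-field scores. It is a simplification: it treats the `^` boosts as plain multipliers and ignores how Elasticsearch computes the per-field BM25 scores or adds the `knn` score. The function name `best_fields_score` is hypothetical.

```python
def best_fields_score(field_scores, boosts=None, tie_breaker=0.0):
    """Best boosted field score plus tie_breaker times the other boosted field scores."""
    boosts = boosts or {}
    boosted = [score * boosts.get(field, 1.0) for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    return best + tie_breaker * (sum(boosted) - best)


# Made-up per-field scores for one document; 'keywords' did not match, so it is absent.
toy_scores = {"story": 2.0, "title": 1.0, "topics": 0.5}
# Boosts taken from the fields list of the enriched-metadata query.
boosts = {"story": 10, "title": 10, "topics": 6, "keywords": 2}

print(best_fields_score(toy_scores, boosts, tie_breaker=0.0))
print(best_fields_score(toy_scores, boosts, tie_breaker=0.3))
```

With `tie_breaker=0.0` only the best field counts; a value of 0.3 lets documents that match in several fields rank above documents that match in just one.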

In [37]:
doc_question= processESIndex_KNN_enriched_metadata(df_question_dev,index_name_knn,1)

Which NFL team represented the AFC at Super Bowl 50?
1
Which NFL team represented the NFC at Super Bowl 50?
1
Where did Super Bowl 50 take place?
1
Which NFL team won Super Bowl 50?
1
What color was used to emphasize the 50th anniversary of the Super Bowl?
1
What was the theme of Super Bowl 50?
1
What day was the game played on?
1
What is the AFC short for?
1
What was the theme of Super Bowl 50?
1
What does AFC stand for?
1
What day was the Super Bowl played on?
1
Who won Super Bowl 50?
1
What venue did Super Bowl 50 take place in?
1
What city did Super Bowl 50 take place in?
1
If Roman numerals were used, what would Super Bowl 50 have been called?
1
Super Bowl 50 decided the NFL champion for what season?
1
What year did the Denver Broncos secure a Super Bowl title for the third time?
1
What city did Super Bowl 50 take place in?
1
What stadium did Super Bowl 50 take place in?
1
What was the final score of Super Bowl 50? 
1
What month, day and year did Super Bowl 50 take place? 
1
What 

In [38]:
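# Note: processESIndex_KNN_enriched_metadata only fills 'R_context'; 'model_op' still
# holds the flags from the earlier processESIndex_KNN_BestField run, so the value
# below is not a measurement of the enriched-metadata query.
doc_question['model_op'].sum()/len(doc_question)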

0.8706717123935667