# **Retrieval Experiment 1: Hybrid/Ensemble Retriever**

Combine a sparse retriever (BM25, based on keyword matching) with a dense retriever (FAISS, based on embedding/semantic similarity).

Results are reranked based on the Reciprocal Rank Fusion algorithm (RRF).

The Hybrid/Ensemble approach, which combines basic keyword search and similarity search, could suit our use case by retrieving a blend of keyword-relevant (BM25) and semantically similar (FAISS) results.
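As a rough illustration of how RRF fuses two ranked lists, the sketch below scores each document as a weighted sum of reciprocal ranks. The document IDs, the `weighted_rrf` helper and the constant `c = 60` (a commonly used value) are assumptions made for illustration; the exact formula used by `EnsembleRetriever` may differ in detail.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document IDs with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        # Ranks start at 1; a lower-ranked hit contributes a smaller reciprocal.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical rankings from a keyword retriever and an embedding retriever.
keyword_ranking = ["hike_no", "hike_is", "food_no", "food_is"]
embedding_ranking = ["food_no", "food_is", "transport_fi", "hike_no"]

fused = weighted_rrf([keyword_ranking, embedding_ranking], weights=[0.5, 0.5])
print(fused)
```

In this toy case `food_no` is only third in the keyword list but first in the embedding list, which is enough to put it first overall, while `hike_no` (first by keywords, fourth by embeddings) lands second.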

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m43.5/43.5 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m69.1/69.1 MB[0m [31m29.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m67.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.4/2.4 MB[0m [31m92.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m10.1/10.1 MB[0m [31m122.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m25.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m111.0/111.0 kB[0m [31m12.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m409.7/409.7 kB[0m [31m36.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever # Supports Ensembling of results from multiple retrievers
from langchain_community.retrievers import BM25Retriever
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd
import os
from google.colab import files
import random
import numpy as np

## **User Action Required**

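1. Run the cells below to upload and unzip the evaluation data, which creates the ```retriever_eval``` folder

2. When prompted, upload the eval zip file
- ```retriever_eval.zip```

3. When prompted again, upload the evaluation script
- ```retriever_evaluation.py```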


In [3]:
# Upload retriever_eval.zip
files.upload();

Saving retriever_eval.zip to retriever_eval.zip


In [4]:
!unzip retriever_eval.zip

Archive:  retriever_eval.zip
   creating: retriever_eval/
  inflating: retriever_eval/query_3_docs.csv  
  inflating: retriever_eval/query_3_eval.csv  
  inflating: retriever_eval/query_2_docs.csv  
  inflating: retriever_eval/query_1_docs.csv  
  inflating: retriever_eval/query_4_eval.csv  
  inflating: retriever_eval/queries.csv  
  inflating: retriever_eval/query_2_eval.csv  
  inflating: retriever_eval/query_4_docs.csv  
  inflating: retriever_eval/query_5_eval.csv  
  inflating: retriever_eval/query_5_docs.csv  
  inflating: retriever_eval/query_1_eval.csv  


In [5]:
# Upload retriever_evaluation.py
files.upload();

Saving retriever_evaluation.py to retriever_evaluation.py


In [6]:
import retriever_evaluation

Your folder structure should now look like this:

```
retriever_eval
  - queries.csv
  - query_1_docs.csv
  - query_1_eval.csv
  - query_n_docs.csv
  - query_n_eval.csv

retriever_evaluation.py
```

## **Simple experiment to easily understand behaviour of individual retrievers**

**Setup Simple Experiment Data for Experiment 1**

We first set up simple experiment data with 2 short documents for each activity category. This gives us an easier <u>preliminary understanding of the behaviour of the individual retrievers</u>, as we can quickly see which words our data contains.

In [7]:
# Simple experiment example data
docs = [
    Document(
        page_content="The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.",
        metadata={"activity": 'Hiking', "country": 'Norway'},
    ),
    Document(
        page_content="Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!",
        metadata={"activity": 'Hiking', "country": 'Iceland'},
    ),
    Document(
        page_content="Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger.",
        metadata={"activity": 'Food', "country": 'Norway'},
    ),
    Document(
        page_content="Transportation within Reykjavik is fairly convenient as there is a public bus service called BSI. All you need to do is to download their mobile app, follow the instructions, and you're good to go. Transportation to places outside Reykjavik however requires a car. Some options include car rentals as well as booking bus tours.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is easily accessible with its HSL public transportation services where all you need to do is to download a mobile app and follow the instructions.",
        metadata={"activity": 'Transportation', "country": 'Iceland'},
    ),
    Document(
        page_content="Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant and Bar which offers tailor-made menus by a professional chef using local ingredients.",
        metadata={"activity": 'Accomodation', "country": 'Finland'},
    ),
    Document(
        page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights.",
        metadata={"activity": 'Accomodation', "country": 'Norway'},
    ),
]

**Control Variables for Simple Experiment Data**

For simplicity, we
- set the number of documents retrieved to be 5
- set the weights for each retriever to be 0.5
- do not chunk/split documents as our examples are short enough

We also fix the embeddings model and vector store index for FAISS based on our <u>prior research</u>
- embeddings model: all-mpnet-base-v2
- index for FAISS: IndexFlatL2

Fix the similarity search type to be MMR based on our prior research
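For readers unfamiliar with MMR (maximal marginal relevance), the sketch below shows the general idea on toy 2-D vectors: each pick trades off relevance to the query against similarity to documents already selected. The `mmr` helper, the vectors and the `lambda_mult` default of 0.5 are assumptions for illustration, not the library's internals.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalise similarity to anything already picked (0 for the first pick).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy vectors: documents 0 and 1 are near-duplicates, document 2 is different.
query_vec = (1.0, 0.0)
doc_vecs = [(1.0, 0.1), (1.0, 0.12), (0.6, -0.8)]

print(mmr(query_vec, doc_vecs, k=2))                   # balances relevance and diversity
print(mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0))  # pure relevance ranking
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the more distinct document; with `lambda_mult=1.0` the selection reduces to a plain similarity ranking.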

In [8]:
num_docs_retrieved = 5
embeddings_model_name = "sentence-transformers/all-mpnet-base-v2" # https://www.sbert.net/docs/sentence_transformer/pretrained_models.html
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
vector_store_index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world")))
bm25_weight = 0.5
faiss_weight = 1-bm25_weight
search_type = 'mmr'


In [9]:
# Initialise the BM25 retriever (keyword-based)
bm25_retriever = BM25Retriever.from_documents(
    docs
)
bm25_retriever.k = num_docs_retrieved

# Initialise the FAISS retriever (semantic-similarity)
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
faiss_vector_store = FAISS(
    embedding_function=embeddings_model,
    index=vector_store_index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

faiss_vector_store.add_documents(docs)
faiss_retriever = faiss_vector_store.as_retriever(search_type=search_type, search_kwargs={"k": num_docs_retrieved})

# Initialise the hybrid/ensemble retriever
# Uses RRF to combine each doc's rankings from both retrievers; lower rankings contribute less to the fused score.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[bm25_weight, faiss_weight]
)

**Test Query 1: Testing Hybrid/Ensemble Retriever**

In [10]:
test_query_1 = "What are the best hikes?"

In [11]:
ensemble_retriever.invoke(test_query_1)

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Iceland'}, page_content='Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!'),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its v

**Test Query 2: Testing Hybrid/Ensemble Retriever**

In [12]:
test_query_2 = "What are the best food?"

In [13]:
ensemble_retriever.invoke(test_query_2)

[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, 

**Questions about Hybrid/Ensemble Experiment**
- For <u>Test Query 2</u>, why is a hiking-related document ranked second, ahead of the second food-related document, when the query is about food?

**Investigating BM25**

<u>Running BM25 with Test Query 2</u>

In [14]:
bm25_retriever.invoke(test_query_2)

[Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the Lofoten islands. At a modest 448 meters high, Reinebringen is far from one of the highest peaks on the Lofoten islands. Yet this is more than made up for by the iconic view from the summit of Reine. It is not suitable for winter! Also, the trail can be quite demanding as the steps are quite steep.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Iceland'}, page_content='Unique hike that can be done are volcanic hikes which can be done in Iceland. It is recommended to go with a tour of experienced people!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata=

<u>Running BM25 with the keyword that we want: 'food'</u>

In [15]:
bm25_retriever.invoke("food")

[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Document(metadata={'activity': 'Accomodation', 'country': 'Norway'}, page_content="A nice place to stay in Norway is the Lofoten Islands, in particlar Unstad which provides a breathtaking view of the mountain valley, ocean, and if you're lucky, northern lights."),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-lik

<u>Findings from BM25</u>

When using BM25 with the full query, why were the hiking-related documents ranked higher than the food-related documents?
- Every term in the query counts towards the BM25 score, including stopwords, because stopwords are not removed. For instance, the words 'best' and 'are' are present in the hiking-related documents.


<u>How can we improve BM25?</u>
- Drop stopwords from the query so that they are not factored into the BM25 calculation
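To make the effect concrete, here is a minimal, self-contained sketch of BM25-style scoring on four toy sentences. It uses a textbook BM25 formula with a smoothed IDF, so its scores will not match `rank_bm25` exactly; the `bm25_scores` helper, the toy documents and the three-word stopword set are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "The best hikes are in the north",      # hiking
    "Norway has great food",                # food
    "Buses are the easiest way to travel",  # transportation
    "Stay in a glass hut",                  # accommodation
]
toy_stopwords = {"what", "are", "the"}  # tiny stand-in for the NLTK list

toy_query = "What are the best food?"
keyword_query = " ".join(w for w in tokenize(toy_query) if w not in toy_stopwords)

full_scores = bm25_scores(toy_query, toy_docs)
keyword_scores = bm25_scores(keyword_query, toy_docs)
print("full query top doc:   ", toy_docs[full_scores.index(max(full_scores))])
print("keyword query top doc:", toy_docs[keyword_scores.index(max(keyword_scores))])
```

With the full query, the hiking sentence collects score from 'are', 'the' and 'best' and comes out on top; once the stopwords are dropped, the food sentence ranks first.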

In [16]:
nltk.download('stopwords')
print("The stopwords provided by the nltk library include:")
print(stopwords.words('english'))
nltk_stopwords = stopwords.words('english')

The stopwords provided by the nltk library include:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
full_query = test_query_2
query_words = re.findall(r'\b\w+\b', full_query)
keywords = [keyword for keyword in query_words if keyword.lower() not in nltk_stopwords]
full_query_keywords_only = ' '.join(keywords)
print(f'The original query is: "{full_query}"')
print(f'The new query after removing stopwords is: "{full_query_keywords_only}"')
print("The result is: \n")
bm25_retriever.invoke(full_query_keywords_only)

The original query is: "What are the best food?"
The new query after removing stopwords is: "best food"
The result is: 



[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Document(metadata={'activity': 'Accomodation', 'country': 'Finland'}, page_content='Finland is known for its snowy-like landscape and captivating auroras. One of the best places to stay is the Glass huts in Skyfire village in Rovaniemi, Lapland where you can admire the beautiful northern lights and snowy landscape. The village has its very own restaurant called Sky Huts Restaurant 

**We can see an improvement in the BM25 ranking after removing stopwords from the query**

**Investigating FAISS**

<u>Running FAISS (IndexFlatL2) with Test Query 2</u>

In [18]:
faiss_retriever.invoke(test_query_2)

[Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content='Popular food in Norway is seafood! The best seafood in the Nordic region can be found in Norway. The seafood is freshly caught from the arctic ocean. Popular choices include the famous norwegian salmon. Other delicacies include whale steak!'),
 Document(metadata={'activity': 'Food', 'country': 'Norway'}, page_content="The famous street food of Iceland is the Hotdog! It is called the Baejarins Beztu Pylsur hot dog is made of a mix of lamb, beef and pork. Other delicacies of iceland include Fish and Chips as well as Tommi's burger."),
 Document(metadata={'activity': 'Transportation', 'country': 'Iceland'}, page_content='Finland is easily accessible with its HSL public transportation services where all you need to do is to download a mobile app and follow the instructions.'),
 Document(metadata={'activity': 'Hiking', 'country': 'Norway'}, page_content='The best hikes in Norway include the Reinebringen hike in the 

The embedding-similarity approach works well even though the words in the query are not an exact match for the document text: the relevant documents are retrieved at the top. This is because the query and the documents are compared by the similarity of their embedding vectors rather than by exact word overlap.

<br/>
<br/>
<br/>

## **Evaluation to find best weightages for improved Hybrid/Ensemble Approach**

In this experiment, we evaluate the Hybrid/Ensemble approach on the evaluation dataset, with and without stopword removal, across a range of retriever weightages.

**Define Function to Remove Stopwords**

In [19]:
def remove_stopwords(query, stopwords):
  query_words = re.findall(r'\b\w+\b', query)
  keywords = [keyword for keyword in query_words if keyword.lower() not in stopwords]
  return ' '.join(keywords)

In [20]:
nltk.download('stopwords')
print("The stopwords provided by the nltk library include:")
print(stopwords.words('english'))
nltk_stopwords = stopwords.words('english')

The stopwords provided by the nltk library include:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Control Variables**

- Set the number of documents retrieved to be up to 10 (5 from each retriever)
- Fix the embeddings model and vector store index for FAISS based on our <u>prior research</u>
  - embeddings model: all-mpnet-base-v2
  - index for FAISS: IndexFlatL2
- Fix the document chunk/splitting method based on our <u>prior research</u>
  - RecursiveCharacterTextSplitter
    - chunk_size=250
    - chunk_overlap=50
- Fix the similarity search type to be MMR based on our prior research
- Compute NDCG at k = 5
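The evaluation below reports NDCG@K together with MAP and MRR via `retriever_evaluation.py`, which is not shown here. The sketch below gives one common formulation of these metrics for a single ranked list; the function names and details are assumptions and may differ from the actual script. In particular, the ideal ordering for NDCG is taken over the retrieved list only.

```python
import math


def dcg_at_k(scores, k):
    """Discounted cumulative gain over the first k graded relevance scores."""
    return sum(
        score / math.log2(rank + 1)
        for rank, score in enumerate(scores[:k], start=1)
    )


def ndcg_at_k(scores, k):
    """DCG normalised by the DCG of the same scores sorted best-first."""
    ideal = dcg_at_k(sorted(scores, reverse=True), k)
    return dcg_at_k(scores, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank(binary_relevance):
    """1 / rank of the first relevant result, or 0 if none is relevant."""
    for rank, rel in enumerate(binary_relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(binary_relevance):
    """Mean of precision@rank taken at each relevant result."""
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(binary_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Graded scores (0-4) for one retrieved list, and their binarised form.
raw = [3, 0, 2, 4, 0]
binary = [1 if s > 0 else 0 for s in raw]
print(ndcg_at_k(raw, 5), reciprocal_rank(binary), average_precision(binary))
```

MRR and MAP are then the means of `reciprocal_rank` and `average_precision` over all queries.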

**Experimental Variables**
- Weightage of BM25 (keyword similarity) and FAISS (vector similarity), with each pair of weights summing to 1


In [21]:
bm25_weights = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
faiss_weights = list(reversed(bm25_weights))

**Run Experiment on Eval Dataset with Stopwords Removed from the Query**

In [22]:
queries = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval','queries.csv'))

In [23]:
embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
num_docs_retrieved = 5
ndcg_k_val = 5
search_type = 'mmr'
# fetch_k = 20 # default for mmr

# Initialise results dictionary: { 'bm25_<bm25_weight>_faiss_<faiss_weight>': {mean NDCG@K, MAP, MRR} }
retriever_eval_res_no_stopwords = {}
# For each combination of weights
for i in range(len(bm25_weights)):
  ndcg_vals = []
  binary_relevance_scores = []

  # For each query
  for idx,row in queries.iterrows():
    query = row['query']
    query_num = row['query_number']

    # Get the evaluation docs (positive and negative samples) and scores of the chunks
    docs_df = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval',f'query_{query_num}_docs.csv'))
    eval_df = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval',f'query_{query_num}_eval.csv'))
    # TODO: replace this placeholder once the actual relevance scores are read in.
    # Until then the scores are random, so the metrics are not meaningful.
    eval_df['score'] = [random.choice([0, 1, 2, 3, 4]) for _ in range(len(eval_df))]

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=250, chunk_overlap=50, add_start_index=True
    )

    # Chunk the evaluation docs
    docs = list(docs_df['docs'])
    docs_document_objects = [Document(page_content=doc) for doc in docs]
    chunked_docs = text_splitter.split_documents(docs_document_objects)

    # Build the ensemble retriever
    bm25_retriever = BM25Retriever.from_documents(chunked_docs)
    bm25_retriever.k = num_docs_retrieved

    vector_store_index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world")))
    faiss_vector_store = FAISS(
      embedding_function=embeddings_model,
      index=vector_store_index,
      docstore=InMemoryDocstore(),
      index_to_docstore_id={})
    faiss_vector_store.add_documents(chunked_docs)
    faiss_retriever = faiss_vector_store.as_retriever(search_type=search_type, search_kwargs={"k": num_docs_retrieved})

    ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever], weights=[bm25_weights[i], faiss_weights[i]])

    # Run the ensemble retriever
    retrieved_docs = [d.page_content for d in ensemble_retriever.invoke(remove_stopwords(query,nltk_stopwords))]

    # For each chunk, get its raw score
    raw_scores = []
    for chunk in retrieved_docs:
      raw_scores.append(eval_df.loc[eval_df['chunks'] == chunk, 'score'].values[0])
    # Accumulate the relevance (binary) scores over all the queries
    binary_scores = [1 if score > 0 else 0 for score in raw_scores]
    binary_relevance_scores.append(binary_scores)

    # Calculate NDCG@K using the raw scores
    ndcg_vals.append(retriever_evaluation.ndcg_at_k(raw_scores,ndcg_k_val))

  # Calculate mean NDCG@K
  mean_ndcg_vals = np.mean(ndcg_vals)
  # Calculate MAP
  map_vals = retriever_evaluation.mean_average_precision(binary_relevance_scores)
  # Calculate MRR
  mrr_vals = retriever_evaluation.mean_reciprocal_rank(binary_relevance_scores)

  res = {'mean_ndcg_vals':mean_ndcg_vals, 'mean_map_vals':map_vals, 'mean_mrr_vals':mrr_vals}

  retriever_eval_res_no_stopwords[f'bm25_{bm25_weights[i]}_faiss_{faiss_weights[i]}'] = res

In [24]:
print("The results for running the ensemble retriever on the eval dataset without stopwords in the query are as follows: ")
retriever_eval_res_no_stopwords

The results for running the ensemble retriever on the eval dataset without stopwords in the query are as follows: 


{'bm25_0.1_faiss_0.9': {'mean_ndcg_vals': 0.5893276339625265,
  'mean_map_vals': 0.8651952632905013,
  'mean_mrr_vals': 0.8},
 'bm25_0.2_faiss_0.8': {'mean_ndcg_vals': 0.6743021430074715,
  'mean_map_vals': 0.8648921327790375,
  'mean_mrr_vals': 0.9},
 'bm25_0.3_faiss_0.7': {'mean_ndcg_vals': 0.48295858352186843,
  'mean_map_vals': 0.7585260770975056,
  'mean_mrr_vals': 0.7666666666666667},
 'bm25_0.4_faiss_0.6': {'mean_ndcg_vals': 0.6767150864265614,
  'mean_map_vals': 0.9195789241622574,
  'mean_mrr_vals': 1.0},
 'bm25_0.5_faiss_0.5': {'mean_ndcg_vals': 0.5679051162235468,
  'mean_map_vals': 0.8483024691358023,
  'mean_mrr_vals': 1.0},
 'bm25_0.6_faiss_0.4': {'mean_ndcg_vals': 0.5999822277071096,
  'mean_map_vals': 0.8307079868984631,
  'mean_mrr_vals': 0.8},
 'bm25_0.7_faiss_0.3': {'mean_ndcg_vals': 0.6739591149877571,
  'mean_map_vals': 0.8736309523809525,
  'mean_mrr_vals': 0.9},
 'bm25_0.8_faiss_0.2': {'mean_ndcg_vals': 0.7083921765072875,
  'mean_map_vals': 0.9765167548500882,
 

**Run Experiment on Eval Dataset without Removing Stopwords from the Query**

In [25]:
queries = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval','queries.csv'))

In [26]:
embeddings_model_name = "sentence-transformers/all-mpnet-base-v2"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
num_docs_retrieved = 5
ndcg_k_val = 5
search_type = 'mmr'
# fetch_k = 20 # default for mmr

# Initialise results dictionary: { 'bm25_<bm25_weight>_faiss_<faiss_weight>': {mean NDCG@K, MAP, MRR} }
retriever_eval_res_with_stopwords = {}
# For each combination of weights
for i in range(len(bm25_weights)):
  ndcg_vals = []
  binary_relevance_scores = []

  # For each query
  for idx,row in queries.iterrows():
    query = row['query']
    query_num = row['query_number']

    # Get the evaluation docs (positive and negative samples) and scores of the chunks
    docs_df = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval',f'query_{query_num}_docs.csv'))
    eval_df = pd.read_csv(os.path.join(os.getcwd(),'retriever_eval',f'query_{query_num}_eval.csv'))
    # TODO: replace this placeholder once the actual relevance scores are read in.
    # Until then the scores are random, so the metrics are not meaningful.
    eval_df['score'] = [random.choice([0, 1, 2, 3, 4]) for _ in range(len(eval_df))]

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=250, chunk_overlap=50, add_start_index=True
    )

    # Chunk the evaluation docs
    docs = list(docs_df['docs'])
    docs_document_objects = [Document(page_content=doc) for doc in docs]
    chunked_docs = text_splitter.split_documents(docs_document_objects)

    # Build the ensemble retriever
    bm25_retriever = BM25Retriever.from_documents(chunked_docs)
    bm25_retriever.k = num_docs_retrieved

    vector_store_index = faiss.IndexFlatL2(len(embeddings_model.embed_query("hello world")))
    faiss_vector_store = FAISS(
      embedding_function=embeddings_model,
      index=vector_store_index,
      docstore=InMemoryDocstore(),
      index_to_docstore_id={})
    faiss_vector_store.add_documents(chunked_docs)
    faiss_retriever = faiss_vector_store.as_retriever(search_type=search_type, search_kwargs={"k": num_docs_retrieved})

    ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, faiss_retriever], weights=[bm25_weights[i], faiss_weights[i]])

    # Run the ensemble retriever
    retrieved_docs = [d.page_content for d in ensemble_retriever.invoke(query)]

    # For each chunk, get its raw score
    raw_scores = []
    for chunk in retrieved_docs:
      raw_scores.append(eval_df.loc[eval_df['chunks'] == chunk, 'score'].values[0])
    # Accumulate the relevance (binary) scores over all the queries
    binary_scores = [1 if score > 0 else 0 for score in raw_scores]
    binary_relevance_scores.append(binary_scores)

    # Calculate NDCG@K using the raw scores
    ndcg_vals.append(retriever_evaluation.ndcg_at_k(raw_scores,ndcg_k_val))

  # Calculate mean NDCG@K
  mean_ndcg_vals = np.mean(ndcg_vals)
  # Calculate MAP
  map_vals = retriever_evaluation.mean_average_precision(binary_relevance_scores)
  # Calculate MRR
  mrr_vals = retriever_evaluation.mean_reciprocal_rank(binary_relevance_scores)

  res = {'mean_ndcg_vals':mean_ndcg_vals, 'mean_map_vals':map_vals, 'mean_mrr_vals':mrr_vals}

  retriever_eval_res_with_stopwords[f'bm25_{bm25_weights[i]}_faiss_{faiss_weights[i]}'] = res

In [27]:
print("The results for running the ensemble retriever on the eval dataset with stopwords in the query are as follows: ")
retriever_eval_res_with_stopwords

The results for running the ensemble retriever on the eval dataset with stopwords in the query are as follows: 


{'bm25_0.1_faiss_0.9': {'mean_ndcg_vals': 0.5004243999033039,
  'mean_map_vals': 0.870792863441673,
  'mean_mrr_vals': 0.9},
 'bm25_0.2_faiss_0.8': {'mean_ndcg_vals': 0.7421388008477431,
  'mean_map_vals': 0.8403784013605442,
  'mean_mrr_vals': 1.0},
 'bm25_0.3_faiss_0.7': {'mean_ndcg_vals': 0.607304285227612,
  'mean_map_vals': 0.892094671201814,
  'mean_mrr_vals': 1.0},
 'bm25_0.4_faiss_0.6': {'mean_ndcg_vals': 0.6434993523423779,
  'mean_map_vals': 0.8794356261022926,
  'mean_mrr_vals': 0.8666666666666666},
 'bm25_0.5_faiss_0.5': {'mean_ndcg_vals': 0.7443653956406208,
  'mean_map_vals': 0.8891964285714286,
  'mean_mrr_vals': 1.0},
 'bm25_0.6_faiss_0.4': {'mean_ndcg_vals': 0.5069670945362537,
  'mean_map_vals': 0.8628826530612244,
  'mean_mrr_vals': 0.8},
 'bm25_0.7_faiss_0.3': {'mean_ndcg_vals': 0.5353284011492496,
  'mean_map_vals': 0.7903330498866212,
  'mean_mrr_vals': 0.7},
 'bm25_0.8_faiss_0.2': {'mean_ndcg_vals': 0.6600491721327719,
  'mean_map_vals': 0.8172236394557822,
  'me

<br/>
<br/>
<br/>
<br/>
<br/>