# **Retriever Evaluation**

## **Pre-requisites**

1. You have run ```Retrieval_Experiment_2``` to get an experiment output
2. You have scored all the files inside the experiment output
3. You have zipped the experiment output to ```experiment_2_output.zip```
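
Step 3 can also be done from Python. The following is a minimal sketch, assuming the scored files sit in a local folder that contains the `normal/` and `reranked/` subfolders. `zip_experiment_output` and its `source_dir` argument are hypothetical names used for illustration; they are not part of the experiment code.

```python
import shutil


def zip_experiment_output(source_dir: str, archive_base: str = "experiment_2_output") -> str:
    """Zip the contents of source_dir into <archive_base>.zip and return the archive path."""
    # root_dir makes the paths inside the archive relative to source_dir,
    # so normal/ and reranked/ end up at the top level of the zip file.
    return shutil.make_archive(archive_base, "zip", root_dir=source_dir)
```

Keeping `normal/` and `reranked/` at the top level of the archive matches the layout that the unzip step later in this notebook extracts into `experiment_outputs/experiment_2_output`.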

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark langchain_groq ragas


In [2]:
import os
import pandas as pd
import random
import numpy as np
from google.colab import files
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_groq import ChatGroq
from google.colab import userdata

## **User Action Required**

1. Run the code below to create the ```experiment_outputs``` folder

2. Upload ```retriever_evaluation.py```

3. Upload the ```experiment_2_output.zip``` file that contains the files you have scored


In [3]:
experiment_folder = os.path.join(os.getcwd(), 'experiment_outputs', 'experiment_2_output')
os.makedirs(experiment_folder, exist_ok=True)

In [4]:
# Upload retriever_evaluation.py
files.upload();

Saving retriever_evaluation.py to retriever_evaluation.py


In [5]:
# Upload experiment_2_output.zip
files.upload();

Saving experiment_2_output.zip to experiment_2_output.zip


In [6]:
!unzip experiment_2_output.zip -d experiment_outputs/experiment_2_output

Archive:  experiment_2_output.zip
   creating: experiment_outputs/experiment_2_output/reranked/
  inflating: experiment_outputs/experiment_2_output/reranked/reranked_retriever_binary_relevance.csv  
  inflating: experiment_outputs/experiment_2_output/reranked/reranked_retriever_score_relevance.csv  
   creating: experiment_outputs/experiment_2_output/normal/
  inflating: experiment_outputs/experiment_2_output/normal/normal_retriever_binary_relevance.csv  
  inflating: experiment_outputs/experiment_2_output/normal/normal_retriever_score_relevance.csv  


In [7]:
import retriever_evaluation

## **Evaluate Experiment 2: Re-ranking**

### **Binary Relevance: Mean Average Precision, Mean Reciprocal Rank**

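Mean Average Precision: for each query, average the precision at every rank where a relevant document appears, then take the mean over all queries.

Mean Reciprocal Rank: for each query, take the reciprocal of the rank of the first relevant document (0 if none is retrieved), then take the mean over all queries.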
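
The two metrics can be sketched as follows. This is an illustration only: the functions below are hypothetical and may differ from what `retriever_evaluation.py` implements. It assumes each query is represented by a list of 0/1 relevance labels in ranked order, and that average precision is normalised by the number of relevant documents that were retrieved.

```python
from typing import Callable, List


def average_precision(relevance: List[int]) -> float:
    """Mean of precision@rank over the ranks that hold a relevant document."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    # Assumption: a query with no relevant document scores 0
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank(relevance: List[int]) -> float:
    """1 / rank of the first relevant document, or 0 if there is none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric: Callable[[List[int]], float],
                      relevance_lists: List[List[int]]) -> float:
    """Average a per-query metric over all queries."""
    return sum(metric(r) for r in relevance_lists) / len(relevance_lists)


# Two hypothetical queries with three retrieved documents each
example = [[1, 0, 1], [0, 1, 0]]
map_score = mean_over_queries(average_precision, example)
mrr_score = mean_over_queries(reciprocal_rank, example)
```

For the first query the precisions at the relevant ranks are 1/1 and 2/3, and for the second query the only relevant document sits at rank 2, so the reciprocal ranks are 1 and 1/2.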

In [20]:
experiment_2_normal_binary_relevance = os.path.join(experiment_folder, 'normal', 'normal_retriever_binary_relevance.csv')
experiment_2_reranked_binary_relevance = os.path.join(experiment_folder, 'reranked', 'reranked_retriever_binary_relevance.csv')

# Initialise result dictionaries, keyed by retriever name:
# experiment_2_mean_ave_precision_res = {'normal retriever': mean average precision, 'reranked retriever': mean average precision}
# experiment_2_mean_reciprocal_rank_res = {'normal retriever': mean reciprocal rank, 'reranked retriever': mean reciprocal rank}
experiment_2_mean_ave_precision_res = {}
experiment_2_mean_reciprocal_rank_res = {}
# For the normal retriever (no modifications), and then the reranked retriever,
# calculate the mean average precision and mean reciprocal rank over all the queries
for fp in [experiment_2_normal_binary_relevance, experiment_2_reranked_binary_relevance]:
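    # Derive the retriever name (e.g. 'normal retriever') from the file name, independent of the path separator
    f_name = ' '.join(os.path.basename(fp).split('_')[:2])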
    full_df = pd.read_csv(fp)
    df_split = [group for query, group in full_df.groupby('query')]
    relevance_scores = []
    for df in df_split:
        # TOCHANGE: Assign random 1s or 0s to the relevant col for testing purposes
        df['relevant'] = [random.choice([0, 1]) for _ in range(len(df))]
        relevance_scores.append(list(df['relevant']))
    experiment_2_mean_ave_precision_res[f_name] = retriever_evaluation.mean_average_precision(relevance_scores)
    experiment_2_mean_reciprocal_rank_res[f_name] = retriever_evaluation.mean_reciprocal_rank(relevance_scores)

In [21]:
max_value_map = max(experiment_2_mean_ave_precision_res.values())
best_maps = {key: value for key, value in experiment_2_mean_ave_precision_res.items() if value == max_value_map}
print('Between no-reranking vs re-ranking, the method with higher mean average precision is:')
for k,v in best_maps.items():
    print(f'{k} with a value of {v}')

Between no-reranking vs re-ranking, the method with higher mean average precision is:
reranked retriever with a value of 0.5086904761904762


In [22]:
max_value_mrr = max(experiment_2_mean_reciprocal_rank_res.values())
best_mrrs = {key: value for key, value in experiment_2_mean_reciprocal_rank_res.items() if value == max_value_mrr}
print('Between no-reranking vs re-ranking, the method with higher mean reciprocal rank is:')
for k,v in best_mrrs.items():
    print(f'{k} with a value of {v}')

Between no-reranking vs re-ranking, the method with higher mean reciprocal rank is:
normal retriever with a value of 0.625


### **Score Relevance: Mean Normalised Discounted Cumulative Gain**

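For score relevance, NDCG is evaluated at k = 5 for now.

Mean Normalised Discounted Cumulative Gain: for each query, the discounted cumulative gain (DCG) of the top k results divided by the DCG of the ideal ordering (scores sorted from highest to lowest), averaged over all queries.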
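
A minimal sketch of NDCG at k, for illustration only. It assumes linear gain (the raw score) and a log2 rank discount; some variants use an exponential gain of 2^score - 1 instead, and `retriever_evaluation.ndcg_at_k` may differ. The function names below are hypothetical.

```python
import math
from typing import List


def dcg_at_k(scores: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked scores (linear gain)."""
    return sum(score / math.log2(rank + 1)
               for rank, score in enumerate(scores[:k], start=1))


def ndcg_at_k_sketch(scores: List[float], k: int) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(scores, reverse=True), k)
    # Assumption: a query whose scores are all 0 gets an NDCG of 0
    return dcg_at_k(scores, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance scores (0 to 5) for one query, in ranked order
ndcg_example = ndcg_at_k_sketch([3, 2, 3, 0, 1], k=5)
```

A ranking that is already sorted from highest to lowest score yields an NDCG of 1, and any other ordering yields a value between 0 and 1.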

In [23]:
k = 5
experiment_2_normal_score_relevance = os.path.join(experiment_folder, 'normal', 'normal_retriever_score_relevance.csv')
experiment_2_reranked_score_relevance = os.path.join(experiment_folder, 'reranked', 'reranked_retriever_score_relevance.csv')

# Initialise dictionaries:
# experiment_2_mean_normalised_discounted_cumulative_gain_res = { 'normal retriever' : mean normalised discounted cumulative gain, 'reranked retriever' : mean normalised discounted cumulative gain},
experiment_2_mean_normalised_discounted_cumulative_gain_res = {}
# For the normal retriever (no modifications), and then the reranked retriever,
# calculate the mean normalised discounted cumulative gain over all the queries
for fp in [experiment_2_normal_score_relevance, experiment_2_reranked_score_relevance]:
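    # Derive the retriever name (e.g. 'normal retriever') from the file name, independent of the path separator
    f_name = ' '.join(os.path.basename(fp).split('_')[:2])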
    full_df = pd.read_csv(fp)
    df_split = [group for query, group in full_df.groupby('query')]
    relevance_scores = []
    for df in df_split:
        # TOCHANGE: Assign random score between 0 and 5 to the relevant col for testing purposes
        df['relevant'] = [random.choice([0, 1, 2, 3, 4, 5]) for _ in range(len(df))]
        relevance_scores.append(retriever_evaluation.ndcg_at_k(list(df['relevant']),k))
    experiment_2_mean_normalised_discounted_cumulative_gain_res[f_name] = np.mean(relevance_scores)

In [24]:
max_value_ndcg = max(experiment_2_mean_normalised_discounted_cumulative_gain_res.values())
best_ndcgs = {key: value for key, value in experiment_2_mean_normalised_discounted_cumulative_gain_res.items() if value == max_value_ndcg}
print('Between no-reranking vs re-ranking, the method with higher mean normalised discounted cumulative gain is:')
# Use descriptive loop names so the cutoff variable k is not overwritten
for name, value in best_ndcgs.items():
    print(f'{name} with a value of {value}')

Between no-reranking vs re-ranking, the method with higher mean normalised discounted cumulative gain is:
reranked retriever with a value of 0.6667368204468691


### **Estimated Context Recall with RAGAS**

Estimated context recall is calculated from:
- the reference (ground truth, GT) answer
- the retrieved contexts

To estimate context recall, the reference answer is first broken into claims.

An LLM then analyses each claim to determine whether it can be attributed to the retrieved context.

```
context_recall = number of reference claims that can be attributed to the retrieved context / number of reference claims
```
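
The formula above can be sketched in a few lines. This only covers the final ratio: the claim extraction and the attribution judgement are made by the LLM inside RAGAS and are not reproduced here. The function name and the example flags are hypothetical.

```python
from typing import List


def context_recall_from_claims(attributed: List[bool]) -> float:
    """Fraction of reference claims that were attributed to the retrieved context."""
    # Assumption: a reference answer with no claims scores 0
    if not attributed:
        return 0.0
    return sum(attributed) / len(attributed)


# Hypothetical judgement: 7 of 8 reference claims are supported by the context
claim_flags = [True, True, True, False, True, True, True, True]
recall_example = context_recall_from_claims(claim_flags)
```

With 7 of 8 claims attributed, the ratio is 7/8 = 0.875.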



**Extract the questions**

In [29]:
# Questions used in Retrieval Experiments 1 and 2
question_1 = "what is the best food to eat in Finland?"
question_2 = "what is the best food to eat in Iceland?"

**Fill in ground truth answers for each question**

In [30]:
question_1_gpt_answer = """
Finland's culinary traditions offer a rich array of flavors, reflecting its natural resources and cultural heritage. Here are some quintessential Finnish dishes to experience:

Karjalanpiirakka (Karelian Pie)
Originating from the Karelia region, these rye crust pastries are traditionally filled with rice porridge and often topped with egg butter. They are a beloved Finnish snack, commonly enjoyed across the country.

Ruisleipä (Rye Bread)
A staple in Finnish cuisine, this dense and dark bread is made from sourdough rye. It's typically enjoyed with butter, cheese, or cold cuts, and forms an essential part of daily meals.

Kalakukko
Hailing from the Savonia region, this traditional dish consists of fish (commonly perch or salmon) and pork baked inside a thick rye bread crust, creating a hearty and portable meal.

Poronkäristys (Sautéed Reindeer)
A specialty from Lapland, this dish features thinly sliced reindeer meat sautéed with onions and butter, typically served with mashed potatoes and lingonberry jam.

Leipäjuusto (Bread Cheese)
Also known as 'squeaky cheese' due to its texture, this mild cheese is often warmed and served with cloudberry jam, offering a unique combination of flavors.

Lohikeitto (Salmon Soup)
A creamy soup made with fresh salmon, potatoes, leeks, and dill, providing a comforting and flavorful experience, especially during colder months.

Mustikkapiirakka (Blueberry Pie)
This traditional dessert features wild Finnish blueberries baked into a pie, often enjoyed with vanilla sauce or ice cream.

Exploring these dishes will provide a genuine taste of Finland's rich culinary heritage.
"""

question_2_gpt_answer = """
Iceland's culinary scene offers a rich tapestry of traditional dishes that reflect its unique heritage and natural resources. Here are some quintessential Icelandic foods to experience:

Pylsur (Icelandic Hot Dog)
A blend of lamb, pork, and beef, served in a soft bun with toppings like ketchup, sweet mustard, remoulade, and both raw and crispy fried onions. A popular spot to try this is Bæjarins Beztu Pylsur in Reykjavík, renowned for its delicious hot dogs.

Plokkfiskur (Fish Stew)
A hearty mix of white fish (such as cod or haddock), potatoes, onions, and béchamel sauce. This comforting dish showcases Iceland's rich fishing traditions.

Hangikjöt (Smoked Lamb)
Traditionally smoked over birch or dried sheep dung, this lamb is typically served thinly sliced with flatbread or potatoes, especially during festive seasons.

Kjötsúpa (Lamb Soup)
A nourishing soup made with lamb, root vegetables, and herbs, offering warmth during Iceland's colder months.

Skyr
A thick, creamy dairy product similar to yogurt but technically a cheese. It's enjoyed plain or with added flavors like berries and is a staple in Icelandic diets.

Harðfiskur (Dried Fish)
Wind-dried fish, often cod or haddock, served with salted butter. This protein-rich snack has been a traditional staple for centuries.

Kleinur
A twisted doughnut-like pastry, deep-fried and mildly sweet, commonly enjoyed with coffee.

For a contemporary twist on traditional Icelandic cuisine, consider dining at Dill in Reykjavík. As the first Icelandic restaurant awarded a Michelin star, Dill offers innovative dishes that highlight local ingredients.

Exploring these dishes will provide a genuine taste of Iceland's culinary heritage.
"""

qna = {
    question_1: question_1_gpt_answer,
    question_2: question_2_gpt_answer
}



**Use the RAGAS library to calculate the estimated context recall over the top k retrieved contexts only**
- The ground truth answers are GPT-generated for now, so the scores are subject to what GPT returns

In [31]:
os.environ["GROQ_API_KEY"] = userdata.get('GROQ_API_KEY')
llm = ChatGroq()
context_recall = LLMContextRecall(llm=LangchainLLMWrapper(llm))

In [32]:
k = 5
experiment_2_normal_binary_relevance = os.path.join(experiment_folder, 'normal', 'normal_retriever_binary_relevance.csv')
experiment_2_reranked_binary_relevance = os.path.join(experiment_folder, 'reranked', 'reranked_retriever_binary_relevance.csv')

# Initialise dictionaries:
# experiment_2_estimated_context_recall = { 'normal retriever' : average estimated context recall, 'reranked retriever' : average estimated context recall},
experiment_2_estimated_context_recall = {}
# For the normal retriever (no modifications), and then the reranked retriever,
# calculate the average estimated context recall over all the queries
for fp in [experiment_2_normal_binary_relevance, experiment_2_reranked_binary_relevance]:
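    # Derive the retriever name (e.g. 'normal retriever') from the file name, independent of the path separator
    f_name = ' '.join(os.path.basename(fp).split('_')[:2])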
    full_df = pd.read_csv(fp)
    df_split = [group for query, group in full_df.groupby('query')]
    context_recall_scores = []
    # For each query, calculate the estimated context recall using the retrieved contexts
    for df in df_split:
      question = df['query'].iloc[0]
      reference = qna[question]
      contexts = list(df['retrieved_doc'])[:k]
      sample = SingleTurnSample(
          user_input=question,
          response="blank",
          reference=reference,
          retrieved_contexts=contexts,
      )
      context_recall_scores.append(await context_recall.single_turn_ascore(sample))
      print(context_recall_scores[-1])
    experiment_2_estimated_context_recall[f_name] = np.mean(context_recall_scores)

1.0
0.1111111111111111
0.875
0.1111111111111111


In [33]:
max_value_recall = max(experiment_2_estimated_context_recall.values())
best_recalls = {key: value for key, value in experiment_2_estimated_context_recall.items() if value == max_value_recall}
print('Between no-reranking vs re-ranking, the method with higher average estimated context recall is:')
# Use descriptive loop names so the cutoff variable k is not overwritten
for name, value in best_recalls.items():
    print(f'{name} with a value of {value}')

Between no-reranking vs re-ranking, the method with higher average estimated context recall is:
normal retriever with a value of 0.5555555555555556
