# **Retriever Evaluation**

## **Pre-requisites**

1. For each experiment that you want to evaluate, you have run the experiment code to produce an experiment output.

2. For each experiment that you want to evaluate, you have added the experiment output folder to `experiment_outputs`.

In [1]:
import os
import pandas as pd
import random
import evaluation
import numpy as np

## **Evaluate Experiment 1: Hybrid/Ensemble Retriever**

### **Binary Relevance**

```
experiment_1_binary_relevance_folder = os.path.join(os.getcwd(),'experiment_outputs','experiment_1_output','binary_relevance')
f_name = 'hybrid_retriever_bm25_0.0_faiss_1.0_binary_relevance.csv'
f_name_split = f_name.split('_')
bm25_val = f_name_split[3]
faiss_val = f_name_split[5]
fp = os.path.join(experiment_1_binary_relevance_folder, f_name)
full_df = pd.read_csv(fp)
df_split = [group for query, group in full_df.groupby('query')]
relevance_scores = []
for df in df_split:
    # Assign random 1s or 0s to the relevant col for testing purposes
    df['relevant'] = [random.choice([0, 1]) for _ in range(len(df))]
    relevance_scores.append(list(df['relevant']))
evaluation.mean_average_precision(relevance_scores)
```
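
For reference, the two binary-relevance metrics used in this section can be sketched as follows. The `evaluation` module is not shown in this notebook, so the functions below are only an assumed, minimal re-implementation of the standard definitions. The `_sketch` names are hypothetical, and the real module may treat edge cases differently (for example, queries with no relevant result).

```python
import numpy as np


def average_precision_sketch(relevance):
    """Average precision for one ranked list of 0/1 relevance labels."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0  # assumption: a query with no relevant result scores 0
    # precision@i for every rank i (1-based)
    precision_at_i = np.cumsum(relevance) / np.arange(1, len(relevance) + 1)
    # average the precision values at the ranks that hold a relevant result
    return float((precision_at_i * relevance).sum() / relevance.sum())


def mean_average_precision_sketch(relevance_lists):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision_sketch(r) for r in relevance_lists]))


def mean_reciprocal_rank_sketch(relevance_lists):
    """Mean of 1 / rank of the first relevant result (0 if there is none)."""
    reciprocal_ranks = []
    for relevance in relevance_lists:
        first_hit = next((i for i, rel in enumerate(relevance, start=1) if rel), None)
        reciprocal_ranks.append(1.0 / first_hit if first_hit else 0.0)
    return float(np.mean(reciprocal_ranks))


# Two toy queries, each with three ranked results
example = [[1, 0, 1], [0, 1, 0]]
print(mean_average_precision_sketch(example))
print(mean_reciprocal_rank_sketch(example))
```

Mean reciprocal rank only looks at the first relevant result of each query, so it reaches 1.0 whenever every query has a relevant result in first position. Mean average precision also rewards the ranks of the remaining relevant results.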

In [2]:
experiment_1_binary_relevance_folder = os.path.join(os.getcwd(),'experiment_outputs','experiment_1_output','binary_relevance')

# Initialise result dictionaries, keyed by weight combination:
# {'bm25_<bm25 weight>_faiss_<faiss weight>': mean average precision}
# {'bm25_<bm25 weight>_faiss_<faiss weight>': mean reciprocal rank}
experiment_1_mean_ave_precision_res = {}
experiment_1_mean_reciprocal_rank_res = {}
# For each bm25 weight and faiss weight combination,
# Calculate the mean average precision and mean reciprocal rank over all the queries
for f_name in os.listdir(experiment_1_binary_relevance_folder):
    f_name_split = f_name.split('_')
    bm25_val = f_name_split[3]
    faiss_val = f_name_split[5]
    fp = os.path.join(experiment_1_binary_relevance_folder, f_name)
    full_df = pd.read_csv(fp)
    df_split = [group for query, group in full_df.groupby('query')]
    relevance_scores = []
    for df in df_split:
        # TOCHANGE: Assign random 1s or 0s to the relevant col for testing purposes
        df['relevant'] = [random.choice([0, 1]) for _ in range(len(df))]
        relevance_scores.append(list(df['relevant']))
    experiment_1_mean_ave_precision_res[f'bm25_{bm25_val}_faiss_{faiss_val}'] = evaluation.mean_average_precision(relevance_scores)
    experiment_1_mean_reciprocal_rank_res[f'bm25_{bm25_val}_faiss_{faiss_val}'] = evaluation.mean_reciprocal_rank(relevance_scores)

In [3]:
max_value_map = max(experiment_1_mean_ave_precision_res.values())
best_maps = {key: value for key, value in experiment_1_mean_ave_precision_res.items() if value == max_value_map}
print('The best weightage for the mean average precision using binary relevance is:')
for k,v in best_maps.items():
    # Double quotes outside so the inner '_' also parses on Python versions before 3.12
    print(f"Weightage with bm25: {k.split('_')[1]}, faiss: {k.split('_')[-1]} with a value of {v}")

The best weightage for the mean average precision using binary relevance is:
Weightage with bm25: 0.4, faiss: 0.6 with a value of 0.8092687074829932


In [4]:
max_value_mrr = max(experiment_1_mean_reciprocal_rank_res.values())
best_mrrs = {key: value for key, value in experiment_1_mean_reciprocal_rank_res.items() if value == max_value_mrr}
print('The best weightage for the mean reciprocal rank using binary relevance is:')
for k,v in best_mrrs.items():
    # Double quotes outside so the inner '_' also parses on Python versions before 3.12
    print(f"Weightage with bm25: {k.split('_')[1]}, faiss: {k.split('_')[-1]} with a value of {v}")

The best weightage for the mean reciprocal rank using binary relevance is:
Weightage with bm25: 0.4, faiss: 0.6 with a value of 1.0
Weightage with bm25: 0.8, faiss: 0.2 with a value of 1.0
Weightage with bm25: 0.9, faiss: 0.1 with a value of 1.0


### **Score Relevance**

Start with k = 5.
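
As a rough illustration of what a cutoff of k means for graded scores, normalised discounted cumulative gain at k can be sketched as below. This is an assumed, minimal version using the common linear-gain formulation (score divided by log2 of rank + 1). The `_sketch` names are hypothetical, and `evaluation.ndcg_at_k` may use a different gain or discount.

```python
import numpy as np


def dcg_at_k_sketch(relevance, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    top_k = np.asarray(relevance, dtype=float)[:k]
    # rank i (1-based) is discounted by log2(i + 1)
    discounts = np.log2(np.arange(2, top_k.size + 2))
    return float((top_k / discounts).sum())


def ndcg_at_k_sketch(relevance, k):
    """DCG@k divided by the DCG@k of the same scores sorted best-first."""
    ideal = dcg_at_k_sketch(sorted(relevance, reverse=True), k)
    if ideal == 0:
        return 0.0  # assumption: all-zero scores give an NDCG of 0
    return dcg_at_k_sketch(relevance, k) / ideal


print(ndcg_at_k_sketch([5, 4, 3, 2, 1, 0], 5))  # scores already in ideal order
print(ndcg_at_k_sketch([0, 1, 2, 3, 4, 5], 5))  # reversed order scores lower
```

Only the first k results of each query contribute to the numerator, so results ranked below position 5 do not affect the score in this sketch beyond their role in the ideal ordering.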

In [6]:
k = 5
experiment_1_score_relevance_folder = os.path.join(os.getcwd(),'experiment_outputs','experiment_1_output','score_relevance')
# Initialise result dictionary: {'bm25_<bm25 weight>_faiss_<faiss weight>': mean normalised discounted cumulative gain}
experiment_1_mean_normalised_discounted_cumulative_gain_res = {}
# For each bm25 weight and faiss weight combination,
# Calculate the normalised discounted cumulative gain over all the queries
for f_name in os.listdir(experiment_1_score_relevance_folder):
    f_name_split = f_name.split('_')
    bm25_val = f_name_split[3]
    faiss_val = f_name_split[5]
    fp = os.path.join(experiment_1_score_relevance_folder, f_name)
    full_df = pd.read_csv(fp)
    df_split = [group for query, group in full_df.groupby('query')]
    relevance_scores = []
    for df in df_split:
        # TOCHANGE: Assign random score between 0 and 5 to the relevant col for testing purposes
        df['relevant'] = [random.choice([0, 1, 2, 3, 4, 5]) for _ in range(len(df))]
        relevance_scores.append(evaluation.ndcg_at_k(list(df['relevant']),k))
    experiment_1_mean_normalised_discounted_cumulative_gain_res[f'bm25_{bm25_val}_faiss_{faiss_val}'] = np.mean(relevance_scores)

In [7]:
max_value_ndcg = max(experiment_1_mean_normalised_discounted_cumulative_gain_res.values())
best_ndcgs = {key: value for key, value in experiment_1_mean_normalised_discounted_cumulative_gain_res.items() if value == max_value_ndcg}
print('The best weightage for the mean normalised discounted cumulative gain using score relevance is:')
# Loop variables are named weights/score so the cutoff k set earlier is not overwritten
for weights, score in best_ndcgs.items():
    print(f"Weightage with bm25: {weights.split('_')[1]}, faiss: {weights.split('_')[-1]} with a value of {score}")

The best weightage for the mean normalised discounted cumulative gain using score relevance is:
Weightage with bm25: 0.7, faiss: 0.3 with a value of 0.7836464358137009
