# Answer Matching Analysis

In [12]:
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import pickle
import torch
import matplotlib.pyplot as plt
import seaborn as sns

import os
import sys
sys.path.append("..")

from modules.utils.text_processing import process_text
from modules.extraction.preprocessing import DocumentProcessing
from modules.extraction.embedding import Embedding
from modules.retrieval.index.bruteforce import FaissBruteForce
from modules.retrieval.search import FaissSearch
from modules.retrieval.reranker import Reranker
from modules.generator.question_answering import QA_Generator
from sentence_transformers import util


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# Global Variables
STORAGE_DIR = '../storage/'
QUESTION_PATH = '../qa_resources/question.tsv'
FAISS_INDEX_DIR = '../storage/faiss_index/'

# Processing settings
CHUNK_METHODS = [
    {'chunk_strategy': 'sentence', 'num_sentences': 10, 'overlap_size': 2},
    {'chunk_strategy': 'fixed-length', 'fixed-length': 150, 'overlap_size': 1}, #TODO add fixed-length
]
PREPROC_METHODS = {
    'stem': [True, False],
    'lemma': [False, True],
}
EMBEDDING_MODELS = {
    "all-MiniLM-L6-v2": Embedding(model_name="all-MiniLM-L6-v2"),
    "multi-qa-mpnet-base-cos-v1": Embedding(model_name="multi-qa-mpnet-base-cos-v1"),
}
DISTANCE_METRIC = 'cosine' # Distance metric for similarity search in FAISS search
INDEX_TOP_K_LIST = [10, 50] # Top K results to retrieve from the index
RERANKER_TYPES = ['NA', 'cross_encoder', 'tfidf', 'bow'] # Reranker types to use, 'NA' means no reranking
TRANSFORMER_MATCH_THRESHOLD = 0.6 # Similarity score above which a generated answer counts as a TransformerMatch

# API Keys
# Read the key from the environment instead of hardcoding it in the notebook
MISTRAL_API_KEY = os.environ.get("MISTRAL_API_KEY")

In [3]:
# Helper functions
def read_text_file(file_path: str) -> str:
    """
    Reads the content of a text file.

    Args:
        file_path (str): The path to the text file.

    Returns:
        str: The content of the text file or an error message if an issue occurs.
    """
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            content = file.read()
        return content
    except FileNotFoundError:
        return f"The file at {file_path} was not found."
    except Exception as e:
        return f"An error occurred: {e}"

# Chunking
def chunking(doc_path, chunk_method):
    dp = DocumentProcessing()
    if chunk_method['chunk_strategy'] == 'sentence':
        chunks = dp.sentence_chunking(doc_path, chunk_method['num_sentences'], chunk_method['overlap_size'])
    elif chunk_method['chunk_strategy'] == 'fixed-length':
        chunks = dp.fixed_length_chunking(doc_path, chunk_method['fixed-length'], chunk_method['overlap_size'])
    else:
        raise ValueError("Invalid chunking strategy.")

    return chunks

def get_chunk_config(chunk_method):
    num_sentences = chunk_method['num_sentences'] if 'num_sentences' in chunk_method else 0
    fixed_length = chunk_method['fixed-length'] if 'fixed-length' in chunk_method else 0
    chunk_config = f"{chunk_method['chunk_strategy']}_{num_sentences}_{fixed_length}_{chunk_method['overlap_size']}"

    return chunk_config
    

# Loading FAISS index
def load_faiss_index(filepath):
    with open(filepath, 'rb') as f:
        instance = pickle.load(f)
    return instance

### Task 4

In a notebook called notebooks/answer_matching_analysis.ipynb, analyze and design your overall system's ability to produce relevant answers. You should argue your points using graphs and other visualizations.

* What configurations (e.g., reranker, chunk size, top-k) improved performance?
* How is the performance related to the difficulty of the question?
* Compare answer quality with and without reranking. Is reranking necessary?
* How important is retrieval performance on the generation performance?
* Do EM and TransformerMatch agree on what’s “correct”? How well do the two metrics measure how the answers align with the ground truth?

#### Data Preparation
The corpus is built from all 150 files under the storage directory.

Because usage of the QA generator is limited, a subset of questions is sampled, stratified by difficulty level, for downstream processing and answer generation.

In [4]:
# Load the data
doc_paths = sorted([os.path.join(STORAGE_DIR, f) for f in os.listdir(STORAGE_DIR) if f.endswith('.txt.clean')])
corpus = [read_text_file(doc_path) for doc_path in doc_paths]
print(f"{len(doc_paths)} total document paths loaded for corpus.")

# Load questions and remove empty questions, answers, or articles
questions_df = pd.read_csv(QUESTION_PATH, sep="\t")
questions_df = questions_df.dropna(subset=['Question','Answer','ArticleFile'])

# Normalize the text, then remove rows with duplicated questions
questions_df = questions_df.map(lambda x: x.lower() if isinstance(x, str) else x) # Normalize to lowercase
questions_df['Answer'] = questions_df['Answer'].str.rstrip('.') # Remove trailing periods from 'Answer' column
questions_df = questions_df.drop_duplicates(subset=['Question'], keep='first') # keep first occurrence

# Sample a subset of questions by difficulty level
questions_sample = questions_df.groupby('DifficultyFromAnswerer', group_keys=False).sample(frac=0.03, random_state=42).reset_index(drop=True)
print(f"{len(questions_sample)} questions sampled.")

# Display the top questions
questions_sample.head(10)

150 total document paths loaded for corpus.
16 questions sampled.


Unnamed: 0,ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
0,finland,are there cathedrals scattered all across finl...,yes,easy,easy,s08_set2_a4
1,leopard,is the leopard -lrb- panthera pardus -rrb- an ...,yes,,easy,s08_set1_a2
2,grover_cleveland,is he buried in the princeton cemetery of the ...,yes,,easy,s08_set3_a6
3,james_watt,was watt ranked 22nd in michael h. hart 's lis...,yes,,easy,s08_set4_a2
4,amedeo_avogadro,is avogadro 's number commonly used to compute...,yes,,easy,s08_set4_a8
5,indonesia,is it true that indonesia has vast areas of wi...,yes,,easy,s08_set2_a10
6,james_monroe,was monroe anticlerical?,no,,easy,s08_set3_a2
7,calvin_coolidge,why did coolidge not attend law school?,it was too expensive,hard,hard,s08_set3_a9
8,gray_wolf,what type of tools do biologists use to captur...,darting and foot hold traps,hard,hard,s08_set1_a6
9,gerald_ford,have more than five presidents lived past the ...,no,hard,hard,s08_set3_a10


#### Extraction and Indexing
Extraction: the corpus will be chunked, preprocessed and generated into embeddings. 

Indexing: FAISS indices will be built for the text embeddings in the corpus, based on the given combination of chunking, preprocessing and embedding method.
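
The sentence-based strategy configured above (`num_sentences`, `overlap_size`) can be pictured with the minimal sketch below. It is an illustration only: `sentence_chunks` is a hypothetical helper with a naive regex splitter, and the project's `DocumentProcessing.sentence_chunking` may split and overlap sentences differently.

```python
import re

def sentence_chunks(text: str, num_sentences: int, overlap_size: int) -> list[str]:
    """Group sentences into chunks of `num_sentences`; neighbouring chunks
    share `overlap_size` sentences. Hypothetical helper, not the project API."""
    if not 0 <= overlap_size < num_sentences:
        raise ValueError("overlap_size must be in [0, num_sentences)")
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = num_sentences - overlap_size
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + num_sentences]))
        if start + num_sentences >= len(sentences):
            break  # The last chunk already reaches the end of the text
    return chunks

demo = "A one. B two. C three. D four. E five."
demo_chunks = sentence_chunks(demo, num_sentences=3, overlap_size=1)
print(demo_chunks)
```

With `num_sentences=3` and `overlap_size=1`, the sentence "C three." appears in both chunks, which is the property the overlap is meant to provide: context that straddles a chunk boundary is not lost.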

In [5]:
# Chunking and Preprocessing
context_chunks = []

for idx, doc_path in enumerate(doc_paths):
    if idx % 20 == 0:
        print(f"Processing document {idx + 1}/{len(doc_paths)}: {os.path.basename(doc_path)}")
    # Chunking
    for chunk_method in CHUNK_METHODS:
        chunks = chunking(doc_path, chunk_method)
        for chunk in chunks:
            # Preprocessing
            for preproc_name, preproc_params in PREPROC_METHODS.items():
                use_stemming, use_lemmatization = preproc_params[0], preproc_params[1]
                preproc_chunk = process_text(chunk, use_stemming, use_lemmatization)

                result = {
                    'chunk_config': get_chunk_config(chunk_method),
                    'preproc_name': preproc_name,
                    'preproc_chunk': preproc_chunk,
                    'doc_name': os.path.basename(doc_path)[:-10].lower(),  # Remove '.txt.clean' from the filename
                }
                context_chunks.append(result)

context_chunks = pd.DataFrame(context_chunks)
context_chunks.head()

Processing document 1/150: S08_set1_a1.txt.clean
Processing document 21/150: S08_set3_a1.txt.clean
Processing document 41/150: S09_set1_a1.txt.clean
Processing document 61/150: S09_set3_a1.txt.clean
Processing document 81/150: S09_set5_a1.txt.clean
Processing document 101/150: S10_set2_a1.txt.clean
Processing document 121/150: S10_set4_a1.txt.clean
Processing document 141/150: S10_set6_a1.txt.clean


Unnamed: 0,chunk_config,preproc_name,preproc_chunk,doc_name
0,sentence_10_0_2,stem,kangaroo a kangaroo is a marsupi from the fami...,s08_set1_a1
1,sentence_10_0_2,lemma,kangaroo a kangaroo is a marsupial from the fa...,s08_set1_a1
2,sentence_10_0_2,stem,the kangaroo is an australian icon it is featu...,s08_set1_a1
3,sentence_10_0_2,lemma,the kangaroo is an australian icon it is featu...,s08_set1_a1
4,sentence_10_0_2,stem,the local respond kangaroo mean i don t unders...,s08_set1_a1


In [6]:
# Embedding and Indexing
for chunk_method in CHUNK_METHODS:
    chunk_config = get_chunk_config(chunk_method)

    for preproc_name in PREPROC_METHODS: 
        filtered_chunks = context_chunks[
            (context_chunks['chunk_config'] == chunk_config) &
            (context_chunks['preproc_name'] == preproc_name)
        ]
        
        # Error checking
        if filtered_chunks.empty:
            print(f"No chunks found for chunk_config: {chunk_config}, preprocessor: {preproc_name}")
            continue
        
        for model_name, embedding_model in EMBEDDING_MODELS.items():
            chunk_texts = filtered_chunks['preproc_chunk'].tolist()
            doc_names = filtered_chunks['doc_name'].tolist()
            embedding_chunks = [embedding_model.encode(chunk) for chunk in chunk_texts]
            
            # Create FAISS index 
            index_path = FAISS_INDEX_DIR + f"{chunk_config}/{preproc_name}/{model_name}/"
            os.makedirs(index_path, exist_ok=True)

            # Index the embeddings and save them
            faiss_index = FaissBruteForce(dim=len(embedding_chunks[0]), metric=DISTANCE_METRIC)
            metadata = [{'doc_name': doc_name, 'chunk_text': chunk_text} for doc_name, chunk_text in zip(doc_names, chunk_texts)]
            faiss_index.add_embeddings(np.array(embedding_chunks), metadata=metadata)
            faiss_index.save(index_path + "faiss_index.pkl")
            print(f"FAISS index saved in {index_path}faiss_index.pkl")


FAISS index saved in ../storage/faiss_index/sentence_10_0_2/stem/all-MiniLM-L6-v2/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/sentence_10_0_2/stem/multi-qa-mpnet-base-cos-v1/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/sentence_10_0_2/lemma/all-MiniLM-L6-v2/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/sentence_10_0_2/lemma/multi-qa-mpnet-base-cos-v1/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/fixed-length_0_150_1/stem/all-MiniLM-L6-v2/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/fixed-length_0_150_1/stem/multi-qa-mpnet-base-cos-v1/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/fixed-length_0_150_1/lemma/all-MiniLM-L6-v2/faiss_index.pkl
FAISS index saved in ../storage/faiss_index/fixed-length_0_150_1/lemma/multi-qa-mpnet-base-cos-v1/faiss_index.pkl


#### Retrieval
- Search: Perform FAISS search to retrieve text chunks relevant to the question
- ReRank: Reorder the retrieved chunks to improve the context provided to the language model
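
As a rough picture of what a brute-force cosine search does, the sketch below ranks a toy set of vectors against a query with plain NumPy. `cosine_top_k` is a hypothetical stand-in written for illustration; it is not the `FaissSearch` or `FaissBruteForce` implementation used in this notebook.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, embeddings: np.ndarray, k: int):
    """Return indices and cosine similarities of the k most similar rows.
    Hypothetical illustration of exhaustive (brute-force) search."""
    # After L2 normalization the inner product equals cosine similarity
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = e @ q
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # Highest similarity first
    return top, scores[top]

toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx, sims = cosine_top_k(np.array([1.0, 0.1]), toy_embeddings, k=2)
print(idx, sims)
```

A reranker then reorders only these k candidates, so its input quality is bounded by what this first stage returns.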

In [7]:
# Perform FAISS search for retrieval
retrieval = []

# Iterate through each question and perform the search
for idx, row in questions_sample.iterrows():
    question = row['Question']

    # Load the FAISS index for the current chunk config, preprocessor, and model
    for chunk_method in CHUNK_METHODS:
        chunk_config = get_chunk_config(chunk_method)

        for preproc_name in PREPROC_METHODS:
            for model_name, embedding_model in EMBEDDING_MODELS.items():
                # Get the FAISS index for the current configuration
                index_path = FAISS_INDEX_DIR + f"{chunk_config}/{preproc_name}/{model_name}/"
                faiss_index = load_faiss_index(index_path + "faiss_index.pkl")
                faiss_search = FaissSearch(faiss_index, metric=DISTANCE_METRIC)

                # Encode the question
                query_vector = embedding_model.encode(question) 
                
                # Perform the search for each top k value
                for k in INDEX_TOP_K_LIST:
                    try:
                        distances, indices, metadata = faiss_search.search(query_vector, k=k)
                    except Exception as e:
                        print(f"Error message: {e}")
                        print(f"query shape: {query_vector.shape}")
                        print(f"index shape: {faiss_index.dim}")
                        print(f"model_name: {model_name}")
                        print(f"index_path: {index_path}")
                        continue  # Skip this k: there are no valid search results to process

                    if distances is None or indices is None or metadata is None:
                        print(f"Search failed for question: {question}")
                        continue

                    # Extract the retrieved chunks and their corresponding files
                    retrieved_chunks = [m['chunk_text'] for m in metadata]
                    retrieved_files = list(set(m['doc_name'] for m in metadata))

                    # Reranking if applicable
                    for reranker_type in RERANKER_TYPES:
                        if reranker_type == 'NA':
                            # No reranking 
                            reranked_chunks = retrieved_chunks
                        else:
                            # Rerank the retrieved chunks with the specified reranker
                            reranker = Reranker(reranker_type)
                            reranked_chunks, reranked_indices, reranked_scores = reranker.rerank(question, retrieved_chunks)

                        # Store the result
                        retrieval.append({
                            'question': question,
                            'difficulty_from_questioner': row['DifficultyFromQuestioner'],
                            'difficulty_from_answerer': row['DifficultyFromAnswerer'],
                            'target_answer': row['Answer'],
                            'target_file': row['ArticleFile'].lower(),
                            'chunk_config': chunk_config,
                            'preprocessor': preproc_name,
                            'embedding_model': model_name,
                            'index_top_k': k,
                            'retrieved_files': retrieved_files,
                            'reranker': reranker_type,
                            'reranked_chunks': reranked_chunks,
                        })

    if idx % 5 == 0:
        print(f"Processed question {idx + 1}/{len(questions_sample)}: {question}")

# Check the length of the retrieval list
assert len(retrieval) == len(questions_sample) * len(CHUNK_METHODS) * len(PREPROC_METHODS) * len(EMBEDDING_MODELS) * len(INDEX_TOP_K_LIST) * len(RERANKER_TYPES)

# Convert the result list to a DataFrame  
retrieval = pd.DataFrame(retrieval)
print(f"Total retrieval records: {len(retrieval)}")

Processed question 1/16: are there cathedrals scattered all across finland?
Processed question 6/16: is it true that indonesia has vast areas of wilderness?
Processed question 11/16: what did ford receive on april 13, 1942?
Processed question 16/16: what method is used by kangaroos to travel?
Total retrieval records: 1024


#### Generator
Here a cloud-hosted LLM (Mistral) generates answers from the retrieved or reranked context.

Performance metrics compare the retrieved files with the expected file and the generated answers with the expected answers.

In [13]:
# Initializing the answer generator
generator = QA_Generator(api_key=MISTRAL_API_KEY)

# Generate answers for the retrieved chunks
results = []

# Iterate through each retrieval record and generate answers 
for idx, row in retrieval.iterrows():
    question = row['question']
    reranked_chunks = row['reranked_chunks']
    
    # Generate the answer using the generator
    answer = generator.generate_answer(question, reranked_chunks)
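    # `embedding_model` is the variable left over from the retrieval loop above,
    # i.e. the last entry of EMBEDDING_MODELS (multi-qa-mpnet-base-cos-v1).
    emb_expected = torch.from_numpy(embedding_model.encode(row['target_answer']))
    emb_generated = torch.from_numpy(embedding_model.encode(answer))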

    # Compute similarity score
    # apply a simple heuristic for yes/no answers to avoid misleading cosine similarity
    if row['target_answer'].lower().startswith("yes") and answer.lower().startswith("yes"): 
        similarity_score = 1.0
    elif row['target_answer'].lower().startswith("no") and answer.lower().startswith("no"):
        similarity_score = 1.0
    
    # If the answer is not a simple yes/no, compute the cosine similarity
    else:
        similarity_score = util.cos_sim(emb_expected, emb_generated).item()
        
    # Store the result
    results.append({
        # -- question metadata and processing configs -- #
        'question': question,
        'difficulty_from_questioner': row['difficulty_from_questioner'],
        'difficulty_from_answerer': row['difficulty_from_answerer'],
        'target_answer': row['target_answer'],
        'target_file': row['target_file'],
        'chunk_config': row['chunk_config'],
        'preprocessor': row['preprocessor'],
        'embedding_model': row['embedding_model'],
        'index_top_k': row['index_top_k'],
        'retrieved_files': row['retrieved_files'],
        'reranker': row['reranker'],
        'generated_answer': answer,
        # -- performance metrics -- #
        'retrieval_match': row['target_file'] in row['retrieved_files'], # Check if the target file is in the retrieved files
        'exact_match': answer == row['target_answer'],
        'transformers_score': similarity_score,
        'transformer_match': similarity_score > TRANSFORMER_MATCH_THRESHOLD, 
    })

    if idx % 100 == 0:
        print(f"Processed retrieved item {idx + 1}/{len(retrieval)}: {question}")

# Convert the result list to a DataFrame 
results = pd.DataFrame(results)

# Sanity check to ensure the number of unique combinations matches the expected length
unique_combinations = (
    len(results['question'].unique()) *
    len(results['chunk_config'].unique()) *
    len(results['preprocessor'].unique()) *
    len(results['embedding_model'].unique()) *
    len(results['index_top_k'].unique()) *
    len(results['reranker'].unique())
)

assert unique_combinations == len(retrieval), "The number of unique combinations does not match the expected length of results."

Processed retrieved item 1/1024: are there cathedrals scattered all across finland?
Retrying in 1.64 seconds...
Retrying in 1.07 seconds...
Retrying in 1.67 seconds...
Processed retrieved item 101/1024: is the leopard -lrb- panthera pardus -rrb- an old world mammal of the felidae family and the smallest of the four (`` ` big cats ('' ' of the genus panthera , along with the tiger , lion , and jaguar?
Retrying in 1.10 seconds...
Retrying in 1.14 seconds...
Retrying in 1.67 seconds...
Retrying in 1.34 seconds...
Retrying in 1.58 seconds...
Retrying in 1.58 seconds...
Retrying in 1.17 seconds...
Retrying in 1.01 seconds...
Retrying in 1.29 seconds...
Retrying in 1.71 seconds...
Retrying in 1.65 seconds...
Retrying in 1.99 seconds...
Retrying in 1.64 seconds...
Retrying in 1.63 seconds...
Retrying in 1.93 seconds...
Retrying in 1.32 seconds...
Retrying in 1.21 seconds...
Retrying in 1.77 seconds...
Retrying in 1.89 seconds...
Processed retrieved item 201/1024: was watt ranked 22nd in micha

In [14]:
# Display the top 5 results
results.head(5)

Unnamed: 0,question,difficulty_from_questioner,difficulty_from_answerer,target_answer,target_file,chunk_config,preprocessor,embedding_model,index_top_k,retrieved_files,reranker,generated_answer,retrieval_match,exact_match,transformers_score,transformer_match
0,are there cathedrals scattered all across finl...,easy,easy,yes,s08_set2_a4,sentence_10_0_2,stem,all-MiniLM-L6-v2,10,"[s08_set2_a4, s09_set3_a10]",,"Yes, there are cathedrals scattered all across...",True,False,1.0,True
1,are there cathedrals scattered all across finl...,easy,easy,yes,s08_set2_a4,sentence_10_0_2,stem,all-MiniLM-L6-v2,10,"[s08_set2_a4, s09_set3_a10]",cross_encoder,"Yes, there are cathedrals scattered all across...",True,False,1.0,True
2,are there cathedrals scattered all across finl...,easy,easy,yes,s08_set2_a4,sentence_10_0_2,stem,all-MiniLM-L6-v2,10,"[s08_set2_a4, s09_set3_a10]",tfidf,"Yes, there are cathedrals scattered all across...",True,False,1.0,True
3,are there cathedrals scattered all across finl...,easy,easy,yes,s08_set2_a4,sentence_10_0_2,stem,all-MiniLM-L6-v2,10,"[s08_set2_a4, s09_set3_a10]",bow,"Yes, there are cathedrals scattered all across...",True,False,1.0,True
4,are there cathedrals scattered all across finl...,easy,easy,yes,s08_set2_a4,sentence_10_0_2,stem,all-MiniLM-L6-v2,50,"[s10_set5_a5, s09_set3_a9, s09_set3_a8, s08_se...",,"Yes, there are cathedrals scattered all across...",True,False,1.0,True


### Performance Evaluation
In this project, we sampled 16 questions and ran each one through every combination of 2 chunking methods, 2 preprocessors, 2 embedding models, 2 index top-k values and 4 reranker settings (64 configurations per question, 1,024 records in total).
* Chunking Methods:
    * {'chunk_strategy': 'sentence', 'num_sentences': 10, 'overlap_size': 2}
    * {'chunk_strategy': 'fixed-length', 'fixed-length': 150, 'overlap_size': 1}

* Preprocessors:
    * Stemming
    * Lemmatization

* Embedding models:
    * all-MiniLM-L6-v2
    * multi-qa-mpnet-base-cos-v1

* Rerankers:
    * 'NA' (no reranking)
    * 'cross_encoder'
    * 'tfidf'
    * 'bow'

* Indexing Top K (number of results to retrieve from the index):
    * 10
    * 50

* Fixed settings:
    * DISTANCE_METRIC = 'cosine' (distance metric for the FAISS similarity search)
    * TRANSFORMER_MATCH_THRESHOLD = 0.6 (similarity score above which an answer counts as a TransformerMatch)


In addition, 3 metrics were generated for performance evaluation:
* **retrieval_match (RM)**: True if the expected file is among the retrieved files.
* **exact_match (EM)**: True if the generated answer is literally identical to the expected answer.
* **transformer_match (TM)**: True if the generated answer semantically matches the expected answer, i.e. the similarity score exceeds the threshold.

#### Q1: What configurations (chunker, preprocessor, embedding_models, rerankers, indexing_top_k) improved performance?

To answer this question, a linear mixed-effects model (an ANOVA-style analysis with a random intercept per question) is used to analyze all the variables at once and tell which ones have the most significant impact on the outcome.

In essence, it estimates the effect of each factor (e.g., the chunking method) while statistically controlling for all the other factors in the experiment. It can help us understand which components are the main drivers of performance and which ones have little to no effect.

Here the `transformers_score` metric is used as the outcome, since:
1. The model requires a continuous outcome.
2. The "Exact Match" result is False for every record, so it carries no variation to explain.


In [15]:
import statsmodels.formula.api as smf

model = smf.mixedlm("transformers_score ~ C(chunk_config) + C(preprocessor) + C(embedding_model) + C(index_top_k) + C(reranker)",
                    data=results,
                    groups=results["question"]) # Grouping by question to account for repeated measures
outcome = model.fit()
print(outcome.summary())


                           Mixed Linear Model Regression Results
Model:                     MixedLM          Dependent Variable:          transformers_score
No. Observations:          1024             Method:                      REML              
No. Groups:                16               Scale:                       0.0601            
Min. group size:           64               Log-Likelihood:              -69.0198          
Max. group size:           64               Converged:                   Yes               
Mean group size:           64.0                                                            
-------------------------------------------------------------------------------------------
                                                 Coef.  Std.Err.   z    P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------
Intercept                                         0.512    0.074  6.887 0.000  0.366  0.658
C(chunk_config)

**Significant Factors:**

* chunk_config (p < 0.001): This is the most important factor.

    * Effect: Switching to the `sentence_10_0_2` chunking method increases the transformers_score by 0.154 on average, compared to the chunker with fixed-length. This is a large and highly significant positive effect.

**Insignificant Factors:**

The model confirms that the following components had no statistically significant effect on the transformers_score:

* preprocessor (p = 0.167): Using the stem preprocessor does not significantly impact the transformers_score.
* index_top_k (p = 0.738): Increasing the top-k from its baseline (10) to 50 did not produce a meaningful change in the final score.
* embedding_model (p = 0.602): The difference between the two embedding models is negligible.
* reranker (all p > 0.3): The results show no significant difference between the bow, cross_encoder, or tfidf rerankers and the baseline (no reranking).

In summary, the chunk method of `{'chunk_strategy': 'sentence', 'num_sentences': 10, 'overlap_size': 2}` significantly helped improve the model performance. 

> Note that only 16 questions were sampled in this experiment because of the execution time and the request limits of the Mistral API. A larger sample would give more reliable results.

#### Q2: Compare answer quality with and without reranking. Is reranking necessary?


**Key Insights**

According to the results above, the rerankers have limited impact on the final generated answers compared with the other system settings. Here are a few possible explanations:

* **The retriever is already doing a good job.**

    In this case, the reranker has less room to improve. In other words, retrieved chunks might already contain the relevant context, making reranking redundant.

* **Top-k too small for retrieval**

    Rerankers often shine when top-k retrieval casts a wide net, because they can produce a more precise ranking and demote irrelevant context.

    In this case, the index top-k is set to 10 and 50, which could be too small for the reranker to make a significant contribution.

* **Noise in evaluation metric**

    `transformers_score` (the similarity score) might not be sensitive enough to register small quality improvements. In addition, an embedding-based similarity score does not always reflect whether the expected and generated answers actually agree.

    * e.g. For the question "are there cathedrals scattered all across finland?", the expected answer is "yes" and the generated answer is "Based on the provided context, yes, there are cathedrals scattered all across Finland...". With the embedding model `all-MiniLM-L6-v2`, the similarity score is low (0.15), so the record is counted as a non-match even though it should count as a transformer match.
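
One way to reduce this kind of miss for yes/no questions is to look for the first standalone "yes" or "no" anywhere in the answer, rather than only at its start. The sketch below is an assumption-laden illustration: `yes_no_polarity` is a hypothetical helper that is not part of this notebook, and it can still be fooled by answers where "no" appears as an ordinary word before the actual verdict.

```python
import re

def yes_no_polarity(text: str):
    """Return 'yes', 'no', or None based on the first standalone yes/no token.
    Hypothetical helper for illustration only."""
    match = re.search(r"\b(yes|no)\b", text.lower())
    return match.group(1) if match else None

generated = "Based on the provided context, yes, there are cathedrals scattered all across Finland..."
polarity = yes_no_polarity(generated)
print(polarity)
```

Comparing the polarity of the generated answer with the polarity of the expected answer would let an answer with a preamble such as "Based on the provided context" be scored as a match, while answers without any yes/no token would still fall back to the cosine similarity.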
    


#### Q3: How is the performance related to the difficulty of the question?

In [16]:
# Mixed model with random effect by question
model = smf.mixedlm("transformers_score ~ C(difficulty_from_answerer)", data=results, groups=results["question"])
outcome = model.fit()
print(outcome.summary())


                     Mixed Linear Model Regression Results
Model:                  MixedLM      Dependent Variable:      transformers_score
No. Observations:       1024         Method:                  REML              
No. Groups:             16           Scale:                   0.0660            
Min. group size:        64           Log-Likelihood:          -98.3868          
Max. group size:        64           Converged:               Yes               
Mean group size:        64.0                                                    
--------------------------------------------------------------------------------
                                      Coef.  Std.Err.   z    P>|z| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept                              0.696    0.107  6.480 0.000  0.485  0.906
C(difficulty_from_answerer)[T.hard]   -0.175    0.196 -0.895 0.371 -0.560  0.209
C(difficulty_from_answerer)[T.medium] -0.228    0.

**Key insights**

1. No statistically significant effect of `difficulty_from_answerer` on transformers_score:

    * Although the model estimates lower performance for medium and hard questions, these drops are not statistically reliable (p > 0.05).

    * The system performs similarly regardless of how difficult the answerers considered the question — at least within this dataset.

2. Baseline (easy) questions average 0.696 score — this is the reference point.


**Improvement**
* Increase power: With only 16 unique questions, the test has limited power to detect subtle effects of difficulty.

> Note that `difficulty_from_answerer` is used for the analysis here because it is the variable on which the questions were stratified during sampling.

#### Q4: How important is retrieval performance on the generation performance?

In [17]:
# Check the distribution of the transformer scores and retrieval matches
results.groupby(['retrieval_match', 'transformer_match']).size()

retrieval_match  transformer_match
False            False                 37
                 True                  51
True             False                539
                 True                 397
dtype: int64

In [18]:
# Check the case when the transformer match is True but retrieval match is False
results[(results['transformer_match']==True) & (results['retrieval_match']==False)].value_counts('question')

question
is avogadro 's number commonly used to compute the results of chemical reactions ?    51
Name: count, dtype: int64

In [19]:
avogadro_question = "is avogadro 's number commonly used to compute the results of chemical reactions ?"
avogadro_rows = results[results['question'] == avogadro_question]  # Avoids nested quotes in f-strings (a syntax error before Python 3.12)
print("For question: 'is avogadro's number commonly used to compute the results of chemical reactions ?'")
print(f"Retrieved Files: {avogadro_rows['retrieved_files'].values[0]}")
print(f"Expected Files: {avogadro_rows['target_file'].values[0]}")

For question: 'is avogadro's number commonly used to compute the results of chemical reactions ?'
Retrieved Files: ['s08_set4_a1', 's10_set4_a4', 's10_set4_a8', 's09_set4_a1', 's08_set1_a2']
Expected Files: s08_set4_a8


The result above identifies 51 cases where the answer matched even though the retrieval missed the target file. Interestingly, all 51 of these cases come from the same question, suggesting this is a specific edge case rather than a general capability.

Upon further analysis of these 51 records, we compared the retrieved files to the labeled target file, `s08_set4_a8`. Our investigation revealed that a retrieved file, `s10_set4_a8`, is highly similar and also contains the correct context to answer the question. This indicates a labeling gap in our ground truth, as `s10_set4_a8` should also be considered a valid target. Accordingly, these 51 records do not represent genuine retrieval failures and have been excluded from this analysis.

The corrected distribution of the transformer and retrieval matches is shown below.

In [20]:
# Remove the negative cases due to missing labels.
results_filtered = results[~((results['transformer_match']==True) & (results['retrieval_match']==False))]
results_filtered.groupby(['retrieval_match', 'transformer_match']).size()

retrieval_match  transformer_match
False            False                 37
True             False                539
                 True                 397
dtype: int64

**Key Insights**

The cleaned-up results indicate that the retrieval stage is the primary bottleneck for the system. All 37 remaining retrieval failures led to a failure in generation, indicating that enhancing the quality of the retrieved context is the most critical path to improving overall performance.

Even with the right documents, other settings still matter: 539 of the 936 records with a correct retrieval still failed the transformer match. The Q1 analysis found that the chunking strategy has a statistically significant effect on the score, whereas the preprocessor, embedding model, top-k and reranker do not.

Separately, the evaluation metric itself, `transformer_match`, relies on a semantic embedding model to compare the generated output against the ground-truth answer, so part of the remaining mismatch may be metric noise rather than generation error.

In summary, there is a clear dependency: if the retrieval is wrong, the final answer is wrong, so effort should be prioritized on strengthening this foundational retrieval stage.
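
The dependency can also be expressed as a conditional match rate. The sketch below recomputes it from the unfiltered counts printed by the first `groupby` in this section; the DataFrame is rebuilt by hand for illustration, and `match_rates` is a hypothetical name.

```python
import pandas as pd

# Unfiltered counts copied from the groupby output above
counts = pd.DataFrame(
    {
        "retrieval_match": [False, False, True, True],
        "transformer_match": [False, True, False, True],
        "n": [37, 51, 539, 397],
    }
)

# Share of transformer matches within each retrieval outcome
totals = counts.groupby("retrieval_match")["n"].sum()
hits = counts[counts["transformer_match"]].set_index("retrieval_match")["n"]
match_rates = {bool(k): float(v) for k, v in (hits / totals).items()}

for retrieved, rate in match_rates.items():
    print(f"retrieval_match={retrieved}: transformer match rate = {rate:.3f}")
```

On the unfiltered data the retrieval-miss group consists of 88 records, 51 of which are the Avogadro records discussed above; once those are excluded, the match rate for retrieval misses drops to zero, which is the dependency stated in the summary.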

#### Q5: Do Exact Match and TransformerMatch agree on what’s “correct”? How well do the two metrics measure how the answers align with the ground truth?

In [None]:
# Check the distribution of exact matches and transformer matches
results.groupby(['exact_match', 'transformer_match']).size()

exact_match  transformer_match
False        False                576
             True                 448
dtype: int64

**Key Insights**

In this test, ExactMatch is False for every record. That is, no generated answer literally matches the expected one. Part of this is an artifact of the comparison: the target answers were lowercased and stripped of trailing periods, while the generated answers are compared as-is. TransformerMatch, by contrast, flags 448 positive cases where the answers align with the ground truth.

For example, for the question "are there cathedrals scattered all across finland?", the expected answer is "yes" while the generated answer is "Yes, there are cathedrals scattered all across Finland.". Those two are semantically aligned but not literally identical.

In this case, using ExactMatch for evaluation would understate the system performance.
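
A common way to make exact match less brittle is to normalize both strings before comparing them. The sketch below is a minimal illustration under that assumption: `normalize_answer` and `normalized_exact_match` are hypothetical helpers that were not used to produce the results in this notebook.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def normalized_exact_match(generated: str, target: str) -> bool:
    """Exact match on the normalized strings (hypothetical helper)."""
    return normalize_answer(generated) == normalize_answer(target)

short_case = normalized_exact_match("Yes.", "yes")
long_case = normalized_exact_match(
    "Yes, there are cathedrals scattered all across Finland.", "yes"
)
print(short_case, long_case)
```

Normalization recovers matches that differ only in case or punctuation, but a full-sentence answer still fails against a one-word target, so a semantic metric (or a stricter answer format in the prompt) remains necessary.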