# Intro
This notebook is part of a mini-series that aims to build a simple RAG system for learning purposes.

This notebook assumes:
- we have an index of embeddings representing documents
- we have a dataset of questions that can be answered by looking at information in specific chunk_ids (to simplify, we'll assume there's only one relevant chunk_id per question); **the questions also come with embeddings that match the target documents**: `question (str) | answer (str) | relevant_chunk_id (int) | embedding (tensor(N))`

Starting from a question (typed in manually or taken from the dataset of questions), we'd now like to generate an answer for
the user by providing the matching documents as context to an LLM.

Plan:
- The first section implements the end-to-end flow to manually play with the system
- The second section runs the questions dataset through the flow to evaluate the end-to-end system. We'll evaluate the system in two ways:
  1) With exact metrics.
  2) With an evaluation made by a different LLM.
- The third section will try to tune a few parameters to improve the evaluation results 
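
To build some intuition for the retrieval step of this flow, here is a minimal sketch of what a flat (exhaustive) index does when searched: it compares the query vector to every stored vector and returns the closest ones. The helper `flat_search`, the squared Euclidean distance, and the toy vectors are assumptions made for illustration; the actual index in this notebook is a FAISS index whose metric is not shown here.

```python
def flat_search(query, vectors, k=1):
    """Brute-force search: return (distances, indices) of the k closest vectors."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance between the query and one stored vector
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy 2-d "embeddings" for three chunks (made-up values)
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
toy_chunks = ["chunk A", "chunk B", "chunk C"]

distances, indices = flat_search([0.9, 1.2], toy_vectors, k=2)
retrieved = [toy_chunks[i] for i in indices]
print(retrieved)
```

The retrieved chunks are then pasted into the prompt of the generator model, closest chunk first.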

In [None]:
import os
import re
from typing import Dict, Any
from tqdm import tqdm

import numpy as np
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient
from datasets import load_from_disk

import faiss

LOCAL_DATASET_FOLDER = "local_datasets"

In [None]:
CORPUS_DATASET_NAME = "wiki-data-chunked-recursive-CS300"
QUESTIONS_DATASET_NAME = "wiki-data-chunked-recursive-CS300-questions-llm"

# We are loading the corpus to retrieve the actual chunks to add as context in
# the response generator model. In a production setting, the chunk would probably
# be retrieved from the index directly.
ds_corpus = load_from_disk(
    os.path.join(LOCAL_DATASET_FOLDER, CORPUS_DATASET_NAME)
)

# Index contains embeddings and ids for all chunks in the corpus, but not the content of those chunks themselves
index = faiss.read_index(os.path.join(LOCAL_DATASET_FOLDER, f"flat_index_{CORPUS_DATASET_NAME}.index"))

# Questions dataset used for evaluation
ds_qas = load_from_disk(
    os.path.join(LOCAL_DATASET_FOLDER, QUESTIONS_DATASET_NAME + "with_simple_embeddings")
)

# The questions dataset has embeddings ready to use, but if you'd like to create new embeddings, load the retriever
# model as well
retriever_emb_model = SentenceTransformer(
    os.path.join(LOCAL_DATASET_FOLDER, f"embedding_model_{CORPUS_DATASET_NAME}.model")
)


In [None]:
# On text generation, are we trying to be conversational? Creative? (In that case we might want to go for a causal modeling approach.)
# Or rather simply extract the answer from the context? (In that case we might want to go for a model trained for question answering.)

# For the purpose of this exercise, we'll go with the causal modeling approach, but which one you would
# prefer really depends on the end product
model_name = "HuggingFaceTB/SmolLM2-135M-Instruct"
device = "cpu" # Speed of small models on CPU is pretty cool

tokenizer = AutoTokenizer.from_pretrained(model_name)
generator_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


In [None]:
# Look at the maximum context size we'll be able to put in the model prompt
# 8k tokens should be completely fine for this small project
# as we tried to have text_chunks with fewer than 300 words in a previous notebook
# but more ambitious projects will probably require bigger context sizes
generator_model.config.max_position_embeddings

# End-to-End response generation

## Test with random user questions

In [None]:
# This method holds the end-to-end processing.
# it could be exported to a standalone app
def generate_response(question: str, number_of_chunks_in_context:int = 1):
    # generate question embedding (encode returns a float32 numpy array, which is what faiss expects)
    embedding = retriever_emb_model.encode([question])
    
    # query the index
    distances, indices = index.search(embedding, k=number_of_chunks_in_context)

    # retrieve the text of the matching chunks from their ids
    text_chunks = ds_corpus.select(indices[0, :])["text_chunk"]

    # build a prompt for the response generator model
    system_prompt = "You are a helpful assistant that will answer users questions. Before their question, users will give you some text content as context that will help you answer the questions. The answer to their question should be inside the context. Please answer accurately and concisely."
    user_prompt = ""
    for i, text_chunk in enumerate(text_chunks):
        user_prompt += f"[Context #{i}] {text_chunk}."
    user_prompt += f"[Question] {question}"

    # generate response from prompt
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    
    # add_generation_prompt=True appends the assistant turn header to the prompt,
    # so everything generated after the prompt is the answer itself
    input_tokens = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    outputs = generator_model.generate(
        input_tokens,
        max_new_tokens=50,
        temperature=0.2, # [Exercise] how does this influence this answer? How would you tune it?
        top_p=0.9, # [Exercise] how does this influence this answer? How would you tune it?
        do_sample=True,
    )

    # Parse output: decode only the newly generated tokens (everything after the prompt)
    answer = tokenizer.decode(outputs[0, input_tokens.shape[1]:], skip_special_tokens=True)

    return answer

# Running it a few times, we notice that the answer is sometimes truncated. The prompt is far below the context
# size, so the likelier cause is the max_new_tokens=50 cap on generation -> [Exercise] What would you suggest to improve this?
generate_response("Who is Abraham Lincoln?", number_of_chunks_in_context=1)

## Test on our questions dataset

In [None]:
# Generate answers from our e2e system for our questions dataset
# Map method could benefit from batching, but fortunately our dataset is pretty small
ds_qas = ds_qas.map(
    lambda element: {"rag_answer": generate_response(element["question"], number_of_chunks_in_context=1)}
)

In [None]:
# eye-ball a few question/answer/rag_answer triplets
for i in range(10):
    print("--")
    print(ds_qas.select_columns(["question", "answer", "rag_answer"])[i])

In [None]:
# Looks like we have some wins and some misses! 

# End-to-End Evaluation

We want to automatically evaluate the quality of our answers (without human involvement) for the end-to-end system (question embedding -> index retrieval -> answer generation).
We will reuse our dataset of questions, which already contains answers that we can compare our output to.

What are we trying to measure in the answers we generate? We'd like to know whether each answer:
- contains the answer to the question as it was written in the groundtruth (recall)
- contains not much else, unless we want the answer to be pretty conversational or welcome more details (precision)


We will then compare our answer to the groundtruth via 3 methods:
- Exact metrics (e.g. ROUGE), which apply simple methods to rate the answer
- 'LLM as a judge' metrics, which use models to rate the answer
- Semantic similarity, which uses models to compare the semantic meaning of the answer provided vs the groundtruth

All methods have their pros and cons and complement each other.
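
As a rough illustration of the semantic similarity idea, two texts are usually turned into embedding vectors and compared with cosine similarity. The sketch below only shows the comparison step on made-up toy vectors; `cosine_similarity` is a hypothetical helper, and in practice the vectors would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up values, not produced by a real model)
groundtruth_vec = [1.0, 0.0]
close_answer_vec = [2.0, 0.0]      # same direction as the groundtruth
unrelated_answer_vec = [0.0, 3.0]  # orthogonal to the groundtruth

sim_close = cosine_similarity(groundtruth_vec, close_answer_vec)
sim_unrelated = cosine_similarity(groundtruth_vec, unrelated_answer_vec)
print(sim_close, sim_unrelated)
```

A threshold on this score could then decide whether an answer counts as equivalent to the groundtruth; choosing that threshold is a tuning decision in itself.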

## Exact Metrics

### Recall

To evaluate Recall, we'd like to know if our answer contains the most relevant 'keywords' that are also in the groundtruth.
A popular metric that does this is ROUGE (R stands for Recall!). It measures the overlap between the answer provided and the groundtruth through n-grams (so not only single keywords but also pairs of words, etc.), and computes the % of expected n-grams that appear in our response. For simplicity, we'll only look at unigrams here, but looking at more metrics is always useful.

The higher the better!
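
To make the unigram computation concrete, here is a simplified hand-rolled version. It is only a sketch: `unigram_overlap_scores` is a hypothetical helper that splits on whitespace and skips stemming, so its values can differ from what the ROUGE package returns.

```python
from collections import Counter


def unigram_overlap_scores(target, prediction):
    """Return (recall, precision) based on the unigram overlap of two texts."""
    target_counts = Counter(target.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts
    overlap = sum((target_counts & pred_counts).values())
    recall = overlap / max(sum(target_counts.values()), 1)
    precision = overlap / max(sum(pred_counts.values()), 1)
    return recall, precision


recall, precision = unigram_overlap_scores(
    target="paris is the capital",
    prediction="the capital of france is paris",
)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy example every groundtruth word appears in the prediction, so recall is perfect, while the two extra words in the prediction pull precision down.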

In [None]:
# Only take unigrams
r_scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

In [None]:
# to get better intuition on the metrics, display a few answers with their respective ROUGE score
for i in range(5):
    print("--")
    print(f"Groundtruth: {ds_qas['answer'][i]}")
    print(f"RAG Prediction: {ds_qas['rag_answer'][i]}")
    print(f"ROUGE score individual predictions: {r_scorer.score(target=ds_qas['answer'][i], prediction=ds_qas['rag_answer'][i])}")

In [None]:
# ROUGE recall is able to flag clear misses... but also gives pretty low score to a perfectly good enough answer
# simply because the groundtruth says 'and many others' when the generated answer lists all relevant tokens.

In [None]:
# Note that the ROUGE package also outputs a _precision_ metric!

# it's the % of n-grams that we output in our answer that is also in the groundtruth. Obviously, the more irrelevant tokens
# we add, the lower this goes (as opposed to recall, in which adding irrelevant tokens doesn't change the metric)

In [None]:
# Add individual ROUGE metrics
def compute_rouge_metrics(ds_qa_element: Dict[str, Any]):
    r_metrics = r_scorer.score(
        target=ds_qa_element['answer'],
        prediction=ds_qa_element['rag_answer']
    )

    return {
        "rouge_recall": r_metrics["rouge1"].recall,
        "rouge_precision": r_metrics["rouge1"].precision
    }

ds_qas = ds_qas.map(compute_rouge_metrics)

In [None]:
print(f"Mean ROUGE Recall: {np.mean(ds_qas['rouge_recall'])}")
print(f"Mean ROUGE Precision: {np.mean(ds_qas['rouge_precision'])}")

In [None]:
# This could suggest we manage to retrieve relevant information ~ half of the time but add too many things
# to the answer
# [Exercise] How would you tune our system to be better on ROUGE Precision? Ask the generator to be even more concise?

### Precision

Unless we are interested in additional details, we'd like to know here whether our model is as concise as possible.
If we assume the answers in our groundtruth to be concise, a really straightforward (but not without flaws) way
to measure 'Precision' is to compare the length of our generated answer to the groundtruth, and assume that
answers that are way too long are likely to lack precision.

Another popular metric is BLEU, but its computation is 'similar' to the ROUGE precision we've computed above.

In [None]:
ds_qas = ds_qas.map(
    lambda element: {"length_diff_words_answers": len(element["rag_answer"].split(" ")) - len(element["answer"].split(" "))}
)

In [None]:
ratio_answer_shorter_than_gt = len(ds_qas.filter(lambda el: el["length_diff_words_answers"] < 0)) / len(ds_qas)
print(f"Our answer is shorter than the groundtruth {100*ratio_answer_shorter_than_gt:.2f}% of the time.")

mean_length_longer_answers = np.mean(
    ds_qas.filter(
        lambda el: el["length_diff_words_answers"] > 0
    )["length_diff_words_answers"]
)
print(f"When our answer is longer, it has on average {mean_length_longer_answers} more words than the GT")

## LLM-as-a-judge
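
With this approach the judge model replies in free text, so its reply has to be mapped to a verdict. A plain substring check for "yes" would also accept words such as "yesterday", so one possible stricter parsing is sketched below. `parse_verdict` is a hypothetical helper, and returning `None` for unclear replies is a design choice made for this sketch only.

```python
import re


def parse_verdict(llm_output):
    """Return True for 'yes', False for 'no', None when the reply is unclear."""
    # Accept leading punctuation/markup, then require a whole word "yes" or "no"
    match = re.match(r"\W*(yes|no)\b", llm_output.strip().lower())
    if match is None:
        return None
    return match.group(1) == "yes"


print(parse_verdict("Yes."))
print(parse_verdict("No, the facts differ"))
print(parse_verdict("Yesterday"))
```

Keeping unclear replies separate from "no" makes it possible to count how often the judge fails to follow the yes/no instruction.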

In [None]:
# For convenience (and because our dataset is small), let's do it with the HF Inference API
token = os.environ["HF_TOKEN_SERVERLESS_API"] # ADD YOUR TOKEN TO YOUR ENV! (It's a free service)
client = InferenceClient(
    token=token,
)

In [None]:
# Let's simply ask the LLM whether it thinks the answers are equivalent
def ask_llm_opinion(question, answer, rag_answer):
    response = client.chat_completion(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            # Prompt can be improved: the LLM sometimes outputs things like "Who are notable figures mentioned in this list?"
            # which obviously doesn't work as we won't have access to the list... What would you suggest we change?
            {"role": "user", "content": "You are a helpful assistant. You will receive a question and its answer. You will then receive an alternative answer for the same question and need to determine if the alternative answer contains the same facts as the original answer. You can ignore other details. Please only answer with yes or no."},
            {"role": "assistant", "content": "Sure! Understood."},
            {"role": "user", "content": f"Does the alternative answer contain the same facts as the original answer? Here is the triplet: [QUESTION] {question} [ANSWER] {answer} [ALTERNATIVE ANSWER] {rag_answer}"},
        ],
        max_tokens=4,
    )

    llm_output = response.choices[0]["message"]["content"]

    # Any reply that does not contain "yes" maps to False
    return "yes" in llm_output.lower()

ask_llm_opinion(
    question="What is 10+10?",
    answer="20",
    rag_answer="The answer is 20"
)

In [None]:
ds_qas = ds_qas.map(
    lambda element: {
        "llm_judge_answers_equivalent_opinion": ask_llm_opinion(
            question=element["question"],
            answer=element["answer"],
            rag_answer=element["rag_answer"]
        )
    }
)
            

In [None]:
n_correct = len(ds_qas.filter(lambda el: el["llm_judge_answers_equivalent_opinion"]))

print(f"The Judge LLM thinks RAG answers contain the relevant facts {(100* n_correct / len(ds_qas)):.2f}% of the time")

In [None]:
# Very close to our ROUGE Recall value! :yay:

## Sentence similarity

In [None]:
# Left as an [Exercise]

# Specific response generation evaluation metrics
Additionally, we could evaluate the quality of the response generation step on its own.
That is, assuming the prompt contains accurate context, how likely are we to generate a good answer
in return? (Recall = we return all the relevant answers present IN THE CONTEXT; Precision = we only return
relevant answers.)

To do this, we would need a dataset that contains with certainty the right context for each question. This is
'probably' the case if we use the 'question' / 'text_chunk' pairs from the ds_qas variable, but not necessarily: we are talking
about an LLM-produced dataset here...
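
One way this could be set up is to skip retrieval and build the prompt directly from the chunk that the questions dataset marks as relevant. The sketch below uses a made-up toy corpus and question; `build_user_prompt` is a hypothetical helper that mirrors the prompt format used earlier, and it assumes `relevant_chunk_id` can be used to look up the chunk text.

```python
def build_user_prompt(text_chunks, question):
    """Format context chunks and a question into a single user prompt."""
    prompt = ""
    for i, text_chunk in enumerate(text_chunks):
        prompt += f"[Context #{i}] {text_chunk}."
    prompt += f"[Question] {question}"
    return prompt


# Toy stand-ins for the corpus and one row of the questions dataset
toy_corpus = {
    3: "Paris is the capital of France",
    7: "Abraham Lincoln was the 16th president of the United States",
}
toy_qa = {"question": "Who was the 16th US president?", "relevant_chunk_id": 7}

# Use the known-relevant ("gold") chunk instead of querying the index
gold_chunk = toy_corpus[toy_qa["relevant_chunk_id"]]
prompt = build_user_prompt([gold_chunk], toy_qa["question"])
print(prompt)
```

Feeding such prompts to the generator and scoring the answers with the metrics above would isolate generation quality from retrieval quality.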