# **07 - RAGAS (RAG Assessment)**

In this hands-on work, we will **evaluate a Retrieval Augmented Generation (RAG) system**.

To do this, we will first **build a synthetic dataset** of questions/answers with their contexts, generated by an LLM from the database of our RAG system.  
*The synthetic dataset generation part was inspired by: https://huggingface.co/learn/cookbook/rag_evaluation*.

We will then evaluate it using **LLM-as-a-judge metrics** defined in `ragas`.

In [None]:
import json
import os
import random
import re
from pathlib import Path

import datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from langchain.docstore.document import Document as LangchainDocument
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import VLLM
from langchain_community.vectorstores import Chroma
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)
DSDIR = Path(os.environ["DSDIR"])

---

## **RAG system data**

In our RAG system, we will have the **HuggingFace documentation**. We will only use a **subset of the dataset** for relatively quick calculations (in particular to initialise the database).

In [None]:
dataset = datasets.load_dataset(str(DSDIR / "HuggingFace/m-ric/huggingface_doc"), split="train").select(range(500))
dataset

Each example consists of the **text of the current documentation and the path to the file**.

You can look at some examples of data.

In [None]:
dataset[0]

## **Build a synthetic dataset for evaluation purpose**

We will first **build a synthetic dataset using an LLM and our document base**.  
Each element of our dataset must contain the following elements:
- `question`
- `ground_truth`

To generate a synthetic dataset, we could use the `ragas` Python module directly. We're going to implement the steps by hand instead, both to show you how it works and because `ragas` is still under development and has issues with open-source LLMs (with APIs like GPT, it works fine). We will use `ragas` only to compute the final metrics, where it does work with open-source LLMs.

### **Preparation of our document base**

We start by initializing our `Langchain` documents.

In [None]:
docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(dataset)
]

They are split so that only one question is generated per context. A context here corresponds to a split document.

In [None]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n\n", "\n", r"(?<=\. )",  " ", "",],
)

splitted_docs = splitter.split_documents(docs)
print(f"After splitting our document base, we have {len(splitted_docs)} contexts.")

In [None]:
splitted_docs[1]

### **Question generation from an LLM**

To generate our questions, we'll use `Mistral-7B-Instruct-v0.2`. It is a good LLM, and as an `Instruct` model it should do well on our tasks.

In [None]:
MODEL_PATH = DSDIR / "HuggingFace_Models/mistralai/Mistral-7B-Instruct-v0.2"

In [None]:
llm = VLLM(
    model=str(MODEL_PATH),
    trust_remote_code=True, 
    max_new_tokens=100,
    gpu_memory_utilization=0.75,
)  # we use vLLM via a LangChain wrapper, which will enable us to use ragas afterwards

Now we can **prepare a prompt to generate our questions**. We'll then see if this prompt alone is enough to generate a quality dataset.

In [None]:
qa_prompt = PromptTemplate.from_template("""
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::""")

To complete it, use `qa_prompt.format(context={value})`.

In [None]:
print(qa_prompt.format(context="a context"))

We're now going to write the `generate_qa_dataset` function to generate our synthetic dataset.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to complete the code for the `generate_qa_dataset` function. This function takes the following parameters:
> - `llm`: to generate the questions/answers from the context,
> - `docs`: our list of contexts
> - `prompt`: our prompt template to be completed for each example
> - `nb_questions`: the number of questions to generate.
>
> The aim is to return a list of dictionaries with the keys `context_synthetic_dataset`, `question`, `ground_truth`. The `context_synthetic_dataset` key is only used afterwards, to check whether our questions/answers are relevant.
> The function will first generate all the prompts and then send them to our LLM, which can parallelize the generation of responses.

In [None]:
def generate_qa_dataset(llm, docs, prompt, nb_questions):
    dataset = []
    contexts = random.sample(docs, nb_questions)  # we randomly select `nb_questions` contexts

    # generate all our prompts in advance
    ############ Complete here ############
    prompts = 
    #######################################
    # send our prompts to the LLM, which will parallelize the generation
    outputs = llm.generate(prompts=prompts)

    for context, output in tqdm(zip(contexts, outputs.generations), total=nb_questions):
        output_text = output[0].text

        # Parse the output
        try:
            ############ Complete here ############
            question = 
            answer = 
            #######################################

            if len(answer) >= 300: # Only relatively short answers are retained to facilitate analysis.
                raise ValueError("The answer is too long")

            ############ Complete here ############
            dataset.append(
                {
                    "context_synthetic_dataset": 
                    "question": 
                    "ground_truth": 
                }
            )
            #######################################
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print(f"{'#'*50} Problem {'#'*50}\n{output_text}\n{'#'*100}\n\n")
            continue
            
    return dataset

**Solution:**
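One possible way to fill in the parsing step is sketched below. It assumes the completion follows the `Factoid question:` / `Answer:` layout requested in `qa_prompt`. The helper name `parse_qa_output` and the sample text are made up for illustration and are not part of the original notebook.

```python
def parse_qa_output(output_text):
    """Split a raw completion into (question, answer).

    Assumes the 'Factoid question: ... Answer: ...' layout requested in the
    prompt. Raises ValueError when the layout is not respected, so the
    surrounding try/except can skip the example.
    """
    if "Factoid question: " not in output_text:
        raise ValueError("No 'Factoid question: ' marker found")
    after_question = output_text.split("Factoid question: ", 1)[1]
    if "Answer: " not in after_question:
        raise ValueError("No 'Answer: ' marker found after the question")
    question, answer = after_question.split("Answer: ", 1)
    return question.strip(), answer.strip()


# Hypothetical completion, written by hand for this example
raw = "\nFactoid question: Which library provides `AutoTokenizer`?\nAnswer: transformers\n"
question, answer = parse_qa_output(raw)
print(question)
print(answer)
```

With such a helper, the prompts could presumably be built with `[prompt.format(context=doc.page_content) for doc in contexts]`, and each parsed pair stored together with `context.page_content` under the `context_synthetic_dataset` key.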

In [None]:
generated_dataset = generate_qa_dataset(
    llm=llm,
    docs=splitted_docs,
    prompt=qa_prompt,
    nb_questions=100
)

You can view the dataset generated and try to spot any weaknesses.

In [None]:
print(f"Number of questions/answers generated: {len(generated_dataset)}")
pd.DataFrame(generated_dataset).head()

As you may have noticed, some of the questions/answers are problematic.

For example, the following questions can be generated:
- ```How many tokens are in the input sequence "I want to buy a car"?\n```
  - This question cannot be answered. It depends entirely on the context in which it was generated. The answer will depend on the tokenizer used.
- ```Who made their first contribution in pull request 1004?\n```
  - From an ML user's point of view, this question is pointless: it is not relevant to users.
 
**The questions and answers generated should be checked. So we're going to use a critical agent (LLM) to do it for us.**

### **LLM critical agent**

We're going to ask our LLM to check a number of classic criteria to ensure the quality of our dataset:
- **Groundedness**: can the question be answered from the context?
  - The generation LLM can sometimes hallucinate and generate a question whose answer is not in the context.
- **Relevance**: is the question relevant to my target users?
  - In our case, we're talking about ML users.
- **Stand-alone**: is the question understandable without a given context?
  - If the question depends entirely on the generation context, then it's not usable.
 
Other criteria can be imagined.

We're going to **ask our critical LLM to evaluate these 3 criteria for each of the elements in our dataset**.  
We'll ask our LLM to generate a **score between 1 and 5 and then explain its reasoning as feedback**.  
Ideally the reasoning would come before the score, but since we limit our outputs to 100 tokens, it's better to lose part of the explanation than to lose the score.

Here are the prompts. Of course, these could be different and improved, notably with few-shot learning (giving a few examples).

In [None]:
groundedness_prompt = PromptTemplate.from_template("""
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5. Here are some score examples:
- 1 means that the question cannot be answered with the given context
- 2 means that the question can be partially answered with the given context
- 5 means that the question is clearly and unambiguously answerable with the context. Don't make deductions. The answer must be in context.

Provide your answer as follows:

Answer:::
Total rating: (your rating, as a number between 1 and 5)
Evaluation: (your rationale for the rating, as a text)

You MUST provide values for 'Total rating:' and 'Evaluation:' in your answer.

Now here are the question and context.

Question: {question}\n
Context: {context}\n
Answer::: """)

relevant_prompt = PromptTemplate.from_template("""
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful for a machine learning developer.

Provide your answer as follows:

Answer:::
Total rating: (your rating, as a number between 1 and 5)
Evaluation: (your rationale for the rating, as a text)
You MUST provide values for 'Total rating:' and 'Evaluation:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """)

standalone_prompt = PromptTemplate.from_template("""
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "How many tokens are in the input sequence "I want to buy a car"?" should receive a 1, since there is an implicit mention of a context. Indeed, without knowing which tokenizer the context uses, we can't know the answer. Thus the question is not independent from the context.

Provide your answer as follows:

Answer:::
Total rating: (your rating, as a number between 1 and 5)
Evaluation: (your rationale for the rating, as a text)

You MUST provide values for 'Total rating:' and 'Evaluation:' in your answer.

Now here is the question.

Question: {question}\n
Answer::: """)

We can try the groundedness prompt on an example.

In [None]:
prompt = groundedness_prompt.format(
    context="France won the Handball EURO 2024 against Denmark.", 
    question="During the EURO 2024 handball final, what was the final score?"
)
print(f"{'#'*50} PROMPT {'#'*50}\n{prompt}\n{'-'*100}\n")

answer = llm.generate(
    prompts=[prompt],
).generations[0][0].text  # no trailing comma here, otherwise `answer` becomes a tuple

print(f"{'#'*50} ANSWER {'#'*50}\n{answer}{'-'*100}")

Now let's write a function to generate the scores for the various criteria on our examples.

<hr style="border:1px solid red"> 

> <span style="color:red">**Task**:</span> Your task is to complete the `generate_critic_scores` function. It has the following parameters:
> - `llm`
> - `dataset`: our list of dictionaries representing our synthetic dataset
> - `groundedness_prompt`, `relevant_prompt`, `standalone_prompt`: our prompt templates for the different criteria
>
> The function will **first generate all prompts** in the `all_prompts` variable. **For each element in our dataset, we'll have 3 prompts**. For dataset element 0, the critical prompts are associated with `all_prompts` indices 0, 1 and 2.
> Prompt responses are then generated. The results of these critical prompts will then have to be parsed to retrieve the score and feedback from our LLM agent. We then update the dictionary for the current element by adding the score and feedback.

In [None]:
def generate_critic_scores(llm, dataset, groundedness_prompt, relevant_prompt, standalone_prompt):
    all_prompts = []

    for element in tqdm(dataset, desc="prompt generation"):
        ############ Complete here ############
        g_prompt = 
        r_prompt = 
        s_prompt = 
        #######################################

        all_prompts.extend([g_prompt, r_prompt, s_prompt])

    all_evaluations = llm.generate(
        prompts=all_prompts,
    )
    
    for i in range(len(dataset)):
        start_index = i * 3
        groundedness_evaluation = all_evaluations.generations[start_index][0].text
        relevant_evaluation = all_evaluations.generations[start_index + 1][0].text
        standalone_evaluation = all_evaluations.generations[start_index + 2][0].text

        for criteria, evaluation in zip(["groundedness", "relevant", "standalone"], 
                                        [groundedness_evaluation, relevant_evaluation, standalone_evaluation]):
            try:
                # Parse the LLM output
                ############ Complete here ############
                score = int(evaluation.split("Total rating: ")[-1].split()[0])
                feedback = 
                #######################################

                ############ Complete here ############
                dataset[i].update(
                    {

                        
                    }
                #######################################
                )
            except Exception as e:
                print(f"{'#'*50} Problem {'#'*50}\n{evaluation}\n{'#'*100}\n\n")
                continue
                
    return dataset

**Solution:**
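As a sketch of the parsing step, the helper below extracts the score and the feedback from a critic completion. It assumes the `Total rating:` / `Evaluation:` layout requested in the prompts. The name `parse_critic_output` and the sample completions are hypothetical, and the regular expression is an alternative to the `split`-based line already present in the exercise.

```python
import re


def parse_critic_output(evaluation):
    """Return (score, feedback) from a critic completion.

    Assumes the 'Total rating: N' / 'Evaluation: ...' layout. The feedback is
    an empty string when generation was cut off before 'Evaluation:'.
    """
    match = re.search(r"Total rating:\s*(\d+)", evaluation)
    if match is None:
        raise ValueError("No 'Total rating:' value found")
    score = int(match.group(1))
    if not 1 <= score <= 5:
        raise ValueError(f"Rating {score} is outside the 1-5 scale")
    if "Evaluation:" in evaluation:
        feedback = evaluation.split("Evaluation:", 1)[1].strip()
    else:
        feedback = ""
    return score, feedback


# Hypothetical completions, written by hand for this example
complete = "Total rating: 1\nEvaluation: The final score is not given."
truncated = " Total rating: 4"
print(parse_critic_output(complete))
print(parse_critic_output(truncated))
```

The parsed values could then be stored under keys such as `f"{criteria}_score"` and `f"{criteria}_feedback"`. The `_score` suffix matches the column names used when filtering the dataset; the `_feedback` suffix is an assumption.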

In [None]:
generated_dataset = generate_critic_scores(
    llm=llm,
    dataset=generated_dataset,
    groundedness_prompt=groundedness_prompt,
    relevant_prompt=relevant_prompt,
    standalone_prompt=standalone_prompt
)

In [None]:
generated_dataset = pd.DataFrame.from_dict(generated_dataset)
generated_dataset.head()

Now that we have all our scores, **let's filter our dataset**. We'll only keep examples with a **score >= 4 for each criterion, for example**.

In [None]:
generated_dataset_filtered = generated_dataset.loc[
    (generated_dataset["groundedness_score"] >= 4) &
    (generated_dataset["relevant_score"] >= 4) &
    (generated_dataset["standalone_score"] >= 4)
]

print(f"Number of elements remaining in our synthetic dataset: {len(generated_dataset_filtered)}")
generated_dataset_filtered.head()

In [None]:
dataset_ragas = datasets.Dataset.from_pandas(
    generated_dataset_filtered, split="train", preserve_index=False
)

dataset_ragas

## **RAG Assessment**

Now that we have our little evaluation dataset, we can try to evaluate a RAG system.

### **RAG system initialization**

**We'll quickly create our RAG system in a similar way to yesterday's tutorial.**

First, we create our documents in `Langchain` format as earlier.

In [None]:
from langchain.docstore.document import Document as LangchainDocument

docs = [
    LangchainDocument(page_content=doc["text"], metadata={"source": doc["source"]})
    for doc in tqdm(dataset)
]

A `split_documents` function is created. It splits our list of Documents according to a certain `chunk_size` and `chunk_overlap`, **which we'll then adjust** during the evaluation.

Compared with the previous split for dataset generation, we split our documents into **smaller chunks**. We want the chunks to be **neither too small**, so that they contain enough to answer a question, **nor too large**, so that the answer does not get lost in a mass of information.

A recursive splitter is used again. The splitter could be varied during evaluation to see what works best. The recursive splitter attempts to preserve the structure of the document by treating it in a tree-like way: it first splits on the largest units and then recursively splits into smaller units (paragraphs, sentences) so that each chunk is at most `chunk_size` long.
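To make the recursive idea concrete, here is a deliberately simplified pure-Python sketch. It is not the LangChain implementation: it ignores `chunk_overlap` and regex separators, and the function name `recursive_split` is hypothetical. It only illustrates the principle of splitting on the coarsest separator first and recursing on pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified illustration of recursive character splitting (no overlap)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every `chunk_size` characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer_separators = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Still too long: retry with the next, finer separator
            pieces.extend(recursive_split(part, chunk_size, finer_separators))
        elif part:
            pieces.append(part)

    # Greedily merge neighbouring pieces while they fit in `chunk_size`
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "Intro paragraph.\n\n"
    "Second paragraph that is quite a bit longer than the first one.\n\n"
    "End."
)
chunks = recursive_split(text, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short first paragraph stays whole, while the long second paragraph is split between words because no paragraph or line break is available inside it.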

In [None]:
def split_documents(chunk_size, chunk_overlap, documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", r"(?<=\. )",  " ", "",]
    )

    splitted_docs = splitter.split_documents(documents)
    print(f"Number of documents before deleting duplicates: {len(splitted_docs)}")

    # Remove duplicates
    unique_texts = set()
    docs_processed_unique = []
    for doc in splitted_docs:
        if doc.page_content not in unique_texts:
            unique_texts.add(doc.page_content)
            docs_processed_unique.append(doc)
    print(f"After: {len(docs_processed_unique)}")


    return docs_processed_unique

In [None]:
docs_processed = split_documents(
    chunk_size=200,
    chunk_overlap=20,
    documents=docs
)

We then initialize our **vector database**. It will be a **Chroma database**, as in the RAG practical work.

In [None]:
HF_MODELS_PATH = DSDIR / "HuggingFace_Models"
EMBEDDING_PATH = HF_MODELS_PATH / "intfloat/multilingual-e5-large"
VDB_PATH = Path("./vector_db_ragas")

In [None]:
def create_vdb(docs, embedding, vdb_path):
    """Create a vector database from the documents"""

    if vdb_path.exists():
        if any(vdb_path.iterdir()):
            raise FileExistsError(
                f"Vector database directory {vdb_path} is not empty"
            )
    else:
        vdb_path.mkdir(parents=True)

    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embedding,
        persist_directory=str(vdb_path),  # Does not accept Path
    )
    vectordb.persist()  # Save database to use it later

    print(f"vector database created in {vdb_path}")
    return vectordb

In [None]:
embedding = HuggingFaceEmbeddings(
    model_name=str(EMBEDDING_PATH),
    model_kwargs={"device": "cuda"},
)

In [None]:
vectordb = create_vdb(docs_processed, embedding, VDB_PATH)

To reuse a Chroma vector database that we have already set up, use the following code:

In [None]:
vectordb = Chroma(
    embedding_function=embedding,
    persist_directory=str(VDB_PATH)
)

Our database contains just as many elements as our split document database.

In [None]:
assert len(vectordb.get()['ids']) == len(docs_processed)

### **Answer generation**

We're now going to **generate the answers to our questions using our RAG system**.

To do this, we again initialize a template that will give the LLM **the contexts found by the RAG system and the question**.

In [None]:
rag_prompt_template = PromptTemplate.from_template("""Using the information contained in the context, give a comprehensive answer to the question.
Respond only to the question asked, response should be concise and relevant to the question.
If the answer cannot be deduced from the context, please say you don't know.

Contexts:::
{context}

Now here is the question you need to answer.

Question: 
{question}
""")

This time, we give you the function for generating responses: `generate_answer`.
It takes as parameters:
- `eval_dataset`: our synthetic dataset,
- `llm`: to answer the questions,
- `vectordb`: to retrieve the contexts associated with the question,
- `prompt_template`,
- `num_retrieved_docs`: the number of contexts to be retrieved with the RAG.
- `verbose`: whether or not to display the first prompt generated. This gives you an idea of the prompt.

This function **generates answers for each question using our RAG system** and also **adds to the dataset the contexts that were used to generate the answer**.

In [None]:
def generate_answer(eval_dataset, llm, vectordb, prompt_template, num_retrieved_docs = 3, verbose=True):
    all_prompts = []
    docs_page_content = []
    
    for i, element in tqdm(enumerate(eval_dataset), total=len(eval_dataset), desc="prompt generation"):
        obtained_docs = vectordb.similarity_search(element["question"], k=num_retrieved_docs)
        obtained_docs_page_content = [context.page_content for context in obtained_docs]
        docs_page_content.append(obtained_docs_page_content)
        context = "\n\n".join(
            [f"Context:\n" + doc for doc in obtained_docs_page_content]
        )
        prompt = prompt_template.format(context=context, question=element["question"])
        all_prompts.append(prompt)

        if i==0 and verbose:
            print(prompt)
    
    all_answers = llm.generate(
        prompts=all_prompts
    )
    all_answers = [answer[0].text for answer in all_answers.generations]
    
    if "answer" in eval_dataset.column_names:
        eval_dataset = eval_dataset.remove_columns("answer")
    eval_dataset = eval_dataset.add_column("answer", all_answers)
    if "contexts" in eval_dataset.column_names:
        eval_dataset = eval_dataset.remove_columns("contexts")
    eval_dataset = eval_dataset.add_column("contexts", docs_page_content)

    return eval_dataset

We generate our results.

In [None]:
dataset_ragas = generate_answer(
    eval_dataset=dataset_ragas,
    llm=llm,
    vectordb=vectordb,
    prompt_template=rag_prompt_template,
    num_retrieved_docs=3,
    verbose=True
)

You can view our updated dataset.  
We have the 4 columns we need:
- `question`
- `ground_truth`
- `answer`
- `contexts`

In [None]:
dataset_ragas.to_pandas().head()[['question', 'ground_truth', 'answer', 'contexts']]

### **Evaluation**

Now that we have our RAG and our evaluation dataset, we can evaluate it. 

**Note**: **we'll be using the same LLM for generating RAG system responses and for the evaluation LLM** to facilitate memory management in this tutorial. In practice, it would be better to use a different LLM for evaluation. Indeed, an LLM evaluator tends to prefer responses generated by an LLM, and even more so when it's the same LLM.

We're using `ragas` this time, because implementing the metrics presented in the course by hand is more tedious.  
**Results can take a long time to achieve even for a few examples**.

In terms of metrics, we're going to use only `context_relevancy`. You can test others. In theory, you could use several metrics at once in the `metrics=[list_of_metrics]` parameter, but I've had some surprising bugs. So test the metrics one by one. 

The output is likely to be very verbose because of tqdm in vLLM. You can **collapse the output**.

In [None]:
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_recall, context_relevancy

score = evaluate(
    dataset_ragas,
    metrics=[context_relevancy],
    llm=llm,
    embeddings=embedding,
    raise_exceptions=False
)

Let's take a look at our results.

In [None]:
score_df = score.to_pandas()
score_df[['question', 'ground_truth', 'answer', 'contexts', 'context_relevancy']].head()

Several NaN values can be noted. These are the **limits of `ragas` at the moment, especially with open-source LLMs** where there seem to be problems with output parsing. With GPT models, these problems are rare, as the module has mainly been developed from these LLMs.

We could also write LLM-as-a-judge metrics by hand, as we did earlier. For example, we could give our evaluator LLM the following inputs: the question, the generated answer and the expected answer. We could then ask for a score between 1 and 5, detailing what we expect for each score, and ask for the LLM's reasoning.
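Such a hand-written judge prompt could look like the sketch below. The rubric wording, the template name `JUDGE_TEMPLATE` and the helper `build_judge_prompt` are illustrative assumptions, modelled on the critic prompts used earlier; plain `str.format` is used so the example runs without any dependency.

```python
JUDGE_TEMPLATE = """
You will be given a question, a reference answer and a generated answer.
Your task is to provide a 'total rating' scoring how well the generated answer matches the reference answer.
Give your answer on a scale of 1 to 5. Here are some score examples:
- 1 means that the generated answer is wrong or unrelated to the reference answer
- 3 means that the generated answer is partially correct or incomplete
- 5 means that the generated answer is fully consistent with the reference answer

Provide your answer as follows:

Answer:::
Total rating: (your rating, as a number between 1 and 5)
Evaluation: (your rationale for the rating, as a text)

Now here are the question and the two answers.

Question: {question}
Reference answer: {ground_truth}
Generated answer: {answer}
Answer::: """


def build_judge_prompt(question, answer, ground_truth):
    """Fill the judge template for one row of the evaluation dataset."""
    return JUDGE_TEMPLATE.format(
        question=question, answer=answer, ground_truth=ground_truth
    )


# Hypothetical row, written by hand for this example
judge_prompt = build_judge_prompt(
    question="Which class splits documents recursively?",
    answer="RecursiveCharacterTextSplitter",
    ground_truth="RecursiveCharacterTextSplitter",
)
print(judge_prompt)
```

The resulting prompts could be sent in one batch with `llm.generate(prompts=...)`, and the completions parsed in the same way as the critic scores.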

Finally, we can **calculate the average for each of our metrics**.

In [None]:
score_df[["context_relevancy"]].isna().sum()

In [None]:
ax = score_df[["context_relevancy"]].mean(skipna=True).plot(kind="bar")
ax.bar_label(ax.containers[0])
plt.show()
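Because some scores come back as NaN, an average alone can be misleading when only a few rows were actually scored. The sketch below reports the mean together with the number of valid values. It is written in plain Python for illustration; the helper name `summarize_scores` and the sample scores are made up.

```python
import math


def summarize_scores(scores):
    """Return the mean of the non-NaN scores and how many scores were usable."""
    valid = [s for s in scores if s is not None and not math.isnan(s)]
    mean = sum(valid) / len(valid) if valid else float("nan")
    return {"n_total": len(scores), "n_valid": len(valid), "mean": mean}


# Hypothetical scores, not results from the notebook
summary = summarize_scores([0.5, float("nan"), 1.0])
print(summary)
```

With the notebook's dataframe, the same function could be called on `score_df["context_relevancy"].tolist()`.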

Once these results have been obtained, **the idea would be to vary various parameters such as chunk_size, splitter, document retrieval, reranker...** The assessment would then be repeated.
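A parameter sweep of this kind could be organised as in the sketch below. The candidate values are arbitrary examples, not recommendations, and the list `configs` is a hypothetical name. The calls to the notebook's own functions are left as comments, since they need the LLM and the embedding model.

```python
from itertools import product

# Candidate settings (illustrative values only)
splitter_settings = [(200, 20), (500, 50), (1000, 100)]  # (chunk_size, chunk_overlap)
num_retrieved_docs_options = [3, 5]

configs = [
    {"chunk_size": size, "chunk_overlap": overlap, "num_retrieved_docs": k}
    for (size, overlap), k in product(splitter_settings, num_retrieved_docs_options)
]

for config in configs:
    print(config)
    # For each configuration, the steps of this notebook would be repeated:
    #   docs_processed = split_documents(config["chunk_size"], config["chunk_overlap"], docs)
    #   vectordb = create_vdb(docs_processed, embedding, <a fresh directory per configuration>)
    #   dataset_ragas = generate_answer(..., num_retrieved_docs=config["num_retrieved_docs"])
    #   score = evaluate(dataset_ragas, ...)
```

Each configuration needs its own vector database directory, because `create_vdb` raises a `FileExistsError` when the target directory is not empty.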