# Ragas Evaluation

Ragas is a framework for evaluating Retrieval Augmented Generation (RAG) pipelines. It provides several modules that come in handy for this task. The two main ones, and the ones used in this guide, are:

- *TestsetGenerator*: This module generates test sets for evaluating RAG pipelines. It loads a set of documents or text chunks, then uses an LLM to generate candidate questions from them, along with answers to those questions based on the same documents. The generated answers then serve as "ground truth" when the RAG pipeline is evaluated in a subsequent step.

- *evaluate*: This module evaluates RAG pipelines using the generated test sets. It uses an LLM to assess the answers your RAG pipeline gives to the questions in the test set. The LLM is also asked to judge how well the retrieved contexts fit the questions. Finally, the pipeline's answer is compared to the ground truth answer (which is part of the test set) to evaluate the LLM itself. The module provides a variety of evaluation metrics, including answer relevancy, faithfulness, context recall, and context precision.
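
To make the data flow between the two modules concrete, the sketch below builds one record in the column layout used later in this notebook (`question`, `answer`, `contexts`, `ground_truth`) using plain Python. The sample strings and the `validate_record` helper are hypothetical and only illustrate the shape; they are not part of Ragas.

```python
# Illustrative only: the strings below are placeholders, not data from this notebook.
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def validate_record(record: dict) -> bool:
    """Hypothetical helper: check that one evaluation record has the expected shape."""
    if any(column not in record for column in REQUIRED_COLUMNS):
        return False
    # `contexts` holds the text of every retrieved chunk, so it must be a list of strings.
    contexts = record["contexts"]
    return isinstance(contexts, list) and all(isinstance(c, str) for c in contexts)


record = {
    # Produced by the test set generation step
    "question": "Where is France?",
    "ground_truth": "France is in western Europe.",
    # Produced by the RAG pipeline under evaluation
    "contexts": ["France is a country in western Europe."],
    "answer": "France is in western Europe.",
}

print(validate_record(record))
```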

In [1]:
# LIBRARIES IMPORT
import random
import numpy as np
import seaborn as sns
from pandas import DataFrame
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

from ragas import adapt
from ragas import evaluate
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from datasets import Dataset
from langchain_core.documents import Document
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


In [2]:
# MODULES IMPORT
from query_chatbot import load_database
from query_chatbot import generate_answer, search_similarity_in_database

## Tooling


In [3]:
# Function to load the database of vectors
def load_database_from_path(vector_db_path: str) -> Chroma:
    """ Load the database of vectors from a given path """
    # Load the database
    vector_db = load_database(vector_db_path)[0]

    return vector_db

In [4]:
# Function to reconstruct the chunks from the vector database
def reconstruct_chunks(vector_db: Chroma) -> list:
    """Reconstruct the chunks from the vector database.
    Parameters
    ----------
    vector_db : Chroma
        The vector database to reconstruct the chunks.
    Returns
    -------
    chunks : list of Document
        List of text chunks reconstructed from the vector database.
        format : [{"page_content": str, "metadata": dict}, ...]
    """
    chunks = []
    vector_collections = vector_db.get()
    total_chunks = len(vector_collections["ids"])
    for i in range(total_chunks):
        chunk = {
            "page_content": vector_collections["documents"][i],
            "metadata": vector_collections["metadatas"][i],
        }
        # ** used to unpack a dictionary into keyword arguments.
        chunk = Document(**chunk)
        chunks.append(chunk)

    print(
        f"Reconstructed {total_chunks} chunks from the vector database successfully.\n"
    )

    return chunks
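
`reconstruct_chunks` relies on `vector_db.get()` returning parallel lists under the keys `ids`, `documents` and `metadatas`. The sketch below reproduces that pairing on a hand-written dictionary, with plain dicts standing in for `Document` objects. The sample values and the `pair_chunks` helper are made up for illustration.

```python
def pair_chunks(collection: dict) -> list:
    """Pair each document text with its metadata, position by position."""
    return [
        {"page_content": text, "metadata": metadata}
        for text, metadata in zip(collection["documents"], collection["metadatas"])
    ]


# Hand-written stand-in for the dictionary returned by `vector_db.get()`.
fake_collection = {
    "ids": ["a", "b"],
    "documents": ["first chunk", "second chunk"],
    "metadatas": [{"source": "doc1"}, {"source": "doc2"}],
}

chunks_as_dicts = pair_chunks(fake_collection)
print(len(chunks_as_dicts), chunks_as_dicts[0]["metadata"])
```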

In [5]:
# Function to create the testset
def create_testset(chunks: list, generator_llm_name: str, critic_llm_name: str, language: str, nb_query: int = 10) -> DataFrame:
    """Create the testset for the evaluation.
    
    Parameters
    ----------
    chunks : list of Document
        List of text chunks to generate the testset.
    generator_llm_name : str
        The name of the OpenAI model used as the generator LLM.
    critic_llm_name : str
        The name of the OpenAI model used as the critic LLM.
    language : str
        The language of the testset.
    nb_query : int, default 10
        The number of chunks to sample and the number of questions to generate.
    
    Returns
    -------
    testset_df : DataFrame
        The generated testset.
    """
    # Get nb_query random chunks (list() lets any sequence-like input be sampled)
    selected_chunks = random.sample(list(chunks), nb_query)
    
    # Create the generator from openai models
    generator_llm = ChatOpenAI(model=generator_llm_name)
    critic_llm = ChatOpenAI(model=critic_llm_name)
    embeddings = OpenAIEmbeddings()

    # Define the testset generator
    generator = TestsetGenerator.from_langchain(generator_llm, critic_llm, embeddings)
    # Adapt the generator prompts to the requested language
    generator.adapt(language, evolutions=[simple, reasoning, conditional, multi_context])
    # Save the adapted prompts
    generator.save(evolutions=[simple, reasoning, multi_context, conditional])

    # Generate the testset
    testset = generator.generate_with_langchain_docs(
        selected_chunks, # chunks to generate the testset
        test_size=nb_query, # number of samples to generate
        distributions={simple: 0.4, reasoning: 0.2, multi_context: 0.2, conditional: 0.2} # distribution of the testset for each type of question
    )
    # Convert the testset to a dataframe
    testset_df = testset.to_pandas()

    return testset_df
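
The `distributions` argument assigns a share of the test set to each question type, and in this notebook the shares sum to 1. As a rough sketch of what those shares mean for a given `test_size`, the snippet below converts them to approximate counts. The string keys stand in for the Ragas evolution objects, `expected_counts` is a hypothetical helper, and Ragas itself may round or allocate differently.

```python
import math


def expected_counts(distributions: dict, test_size: int) -> dict:
    """Hypothetical helper: approximate number of questions per question type."""
    total = sum(distributions.values())
    # Use a tolerance because floating-point shares rarely sum to exactly 1.0.
    if not math.isclose(total, 1.0):
        raise ValueError(f"Shares must sum to 1, got {total}")
    return {name: round(share * test_size) for name, share in distributions.items()}


shares = {"simple": 0.4, "reasoning": 0.2, "multi_context": 0.2, "conditional": 0.2}
print(expected_counts(shares, test_size=10))
```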

In [6]:
# Function to get the answers from the rag pipeline
def get_answers(testset_df: DataFrame, vector_db: Chroma, llm_name: str) -> Dataset:
    """Get the answers from the RAG pipeline.

    Parameters
    ----------
    testset_df : DataFrame
        The testset dataframe.
    vector_db : Chroma
        The vector database.
    llm_name : str
        The name of the language model to use for the answer generation.

    Returns
    -------
    dataset : Dataset
        The dataset containing the questions, answers, contexts, and ground truth.
    """
    # Create list of questions and ground truth
    questions = testset_df["question"].to_list()
    ground_truth = testset_df["ground_truth"].to_list()

    # Initialize the data dictionary
    data = {"question": [], "answer": [], "contexts": [], "ground_truth": ground_truth}

    # Generate answers for the questions and store query, answer, and relevant contexts in the data dictionary
    for query in questions:
        relevant_chunks = search_similarity_in_database(vector_db, query, 3, logger_flag=False)
        answer = generate_answer(query, "", relevant_chunks, llm_name, logger_flag=False)
        data["question"].append(query)
        data["answer"].append(answer)
        data["contexts"].append([doc.page_content for doc in relevant_chunks])

    # Create a dataset from the data dictionary.
    # It is returned as a Dataset because `evaluate` consumes it in that form;
    # call `to_pandas()` on the result when a DataFrame is needed.
    dataset = Dataset.from_dict(data)

    return dataset

In [7]:
# Function to evaluate the answers
def evaluate_answers(dataset: Dataset, metrics: list):
    """Evaluate the answers using the RAG evaluation metrics.
    
    Parameters
    ----------
    dataset : Dataset
        The dataset containing the questions, answers, contexts, and ground truth.
    metrics : list
        The list of evaluation metrics to use.
    
    Returns
    -------
    evaluation : Ragas evaluation result
        The evaluation results; call `to_pandas()` on it to get the per-question scores.
    """
    # Evaluate the dataset
    evaluation = evaluate(
        dataset=dataset,
        metrics=metrics
    )
    print("For the testset, the evaluation results are as follows:\n")
    print(evaluation)

    return evaluation

In [8]:
# Function to plot the evaluation results
def plot_evaluation_results(evaluation_results: DataFrame, metrics: list):
    """Plot the evaluation results.
    
    Parameters
    ----------
    evaluation_results : DataFrame
        The evaluation results dataframe.
    metrics : list
        The list of evaluation metrics to plot.
    """
    # Create a heatmap of the evaluation results
    heatmap_data = evaluation_results[metrics]
    # Gradient color map
    cmap = LinearSegmentedColormap.from_list('green_red', ['red', 'green'])
    plt.figure(figsize=(10, 8))
    sns.heatmap(heatmap_data, annot=True, fmt=".2f", linewidths=.5, cmap=cmap)

    # Create labels for y-ticks as "chunks {index+1}"
    y_labels = [f'chunks {i+1}' for i in evaluation_results.index]
    # Use these labels for y-ticks; heatmap cells are centered at i + 0.5
    plt.yticks(ticks=np.arange(len(evaluation_results.index)) + 0.5, labels=y_labels, rotation=0)
    # save the plot
    plt.savefig("evaluation_heatmap.png")
    plt.show()

    

In [9]:
# Function to analyze the impact of the language model on the evaluation
def analysis_llm_impact(testset: DataFrame, vector_db: Chroma, llm_models: list, metrics: list) -> dict:
    """Analyze the impact of the language model on the evaluation of the RAG pipeline."""
    evaluation_df = {}
    for llm_name in llm_models:
        answers_dataset = get_answers(testset, vector_db, llm_name)
        print(f"Answers generated using the {llm_name} language model.\n")
        evaluation_results = evaluate_answers(answers_dataset, metrics)
        evaluation_df[llm_name] = evaluation_results
    
    return evaluation_df
    

In [10]:
# Function to plot the analysis of the impact of the language model on the evaluation
def plot_analysis_llm_impact(evaluation_dfs: dict, metrics: list):
    """
    Analyze and plot the distribution of evaluation metrics for multiple LLM models.
    
    Parameters
    ----------
    evaluation_dfs : dict
        A dictionary where keys are model names and values are evaluation results exposing `to_pandas()`.
    metrics : list
        A list of evaluation metrics to plot.
    """
    sns.set_style("whitegrid")
    model_names = list(evaluation_dfs.keys())
    # Convert each evaluation result to a DataFrame and keep only the metric columns
    model_eval_dfs = [result.to_pandas()[metrics] for result in evaluation_dfs.values()]

    _fig, axs = plt.subplots(1, len(metrics), figsize=(20, 5))
   
    # One density plot per metric, with one curve per model
    for i, col in enumerate(metrics):
        sns.kdeplot(data=[df[col].values for df in model_eval_dfs], legend=False, ax=axs[i], fill=True)
        axs[i].set_title(f'{col} scores distribution')
        axs[i].legend(labels=model_names)
    plt.tight_layout()
    # save the plot
    plt.savefig("evaluation_distribution_llms.png")
    plt.show()

In [11]:
# Function to load and create testsets
def load_and_create_testsets(db_paths, num_samples=50, generator_llm_name="gpt-3.5-turbo", critic_llm_name="gpt-4o", language="fr"):
    """Load databases, reconstruct chunks, randomly sample chunks, and create testsets."""
    vector_dbs = []
    testsets_size = []
    
    for db_path in db_paths:
        # Load the database
        vector_db = load_database_from_path(db_path)
        vector_dbs.append(vector_db)
        
        # Reconstruct the chunks
        chunks = reconstruct_chunks(vector_db)
        
        # Randomly sample chunks (random.sample returns a plain list of Document objects)
        random_chunks = random.sample(chunks, num_samples)
        
        # Create the testset
        testset_df = create_testset(
            chunks=random_chunks,
            generator_llm_name=generator_llm_name,
            critic_llm_name=critic_llm_name,
            language=language
        )
        
        # Store the testset in the list
        testsets_size.append(testset_df)
    
    return vector_dbs, testsets_size

In [12]:
# Function to analyze the impact of the vector database on the evaluation
def analysis_size_impact(testsets: list, vectors_db: list, llm_model: str, metrics: list) -> dict:
    """Analyze the impact of the vector database on the evaluation of the RAG pipeline."""
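    # Labels for the databases; they must match the order and number of `testsets`
    size = [200, 500, 1000, 2000, 3000]
    evaluation_size_df = {}
    for i in range(len(testsets)):
        # get_answers takes no logger_flag argument, so none is passed here
        answers_dataset = get_answers(testsets[i], vectors_db[i], llm_model)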
        print(f"Answers generated using the {llm_model} language model.\n")
        evaluation_df = evaluate_answers(answers_dataset, metrics)
        evaluation_size_df[f"db_{size[i]}"] = evaluation_df
    
    return evaluation_size_df

In [13]:
# Function to plot the analysis of the impact of the vector database on the evaluation
def plot_analysis_size_impact(evaluation_dfs: dict, metrics: list):
    """
    Analyze and plot the distribution of evaluation metrics for multiple vector databases.
    
    Parameters
    ----------
    evaluation_dfs : dict
        A dictionary where keys are vector database names and values are evaluation results exposing `to_pandas()`.
    metrics : list
        A list of evaluation metrics to plot.
    """
    sns.set_style("whitegrid")
    db_names = list(evaluation_dfs.keys())
    # Convert each evaluation result to a DataFrame and keep only the metric columns
    db_eval_dfs = [result.to_pandas()[metrics] for result in evaluation_dfs.values()]

    _fig, axs = plt.subplots(1, len(metrics), figsize=(20, 5))
   
    # One density plot per metric, with one curve per database
    for i, col in enumerate(metrics):
        sns.kdeplot(data=[df[col].values for df in db_eval_dfs], legend=False, ax=axs[i], fill=True)
        axs[i].set_title(f'{col} scores distribution')
        axs[i].legend(labels=db_names)
    plt.tight_layout()
    plt.show()

## Exploratory Data Analysis


### Load data

In [14]:
# Load Chroma database
vector_db = load_database_from_path("../chroma_db")

2024-06-04 20:34:24.923 | INFO     | query_chatbot:load_database:166 - Loading the vector database.
2024-06-04 20:34:25.692 | INFO     | query_chatbot:load_database:176 - Chunks in the database: 890
2024-06-04 20:34:25.693 | SUCCESS  | query_chatbot:load_database:178 - Vector database prepared successfully.



In [15]:
# Get the chunks
chunks = reconstruct_chunks(vector_db)

Reconstructed 890 chunks from the vector database successfully.



### Create the test set

In [16]:
testset_df = create_testset(chunks=chunks, generator_llm_name="gpt-4o", critic_llm_name="gpt-4o", language="fr", nb_query=500)


In [None]:
# save the testset
testset_df.to_csv("testset.csv", index=False, encoding='utf-8', sep=',', quoting=1)
testset_df
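
The `quoting=1` argument passed to `to_csv` corresponds to `csv.QUOTE_ALL` from the standard library, which wraps every field in quotes so that commas and line breaks inside generated questions survive a round trip. A small standard-library sketch, using a made-up row:

```python
import csv
import io

assert csv.QUOTE_ALL == 1  # the integer passed as `quoting=1`

# Made-up row whose text contains a comma and a line break.
row = ["What is RAG, exactly?", "Retrieval\nAugmented Generation"]

buffer = io.StringIO()
csv.writer(buffer, quoting=csv.QUOTE_ALL).writerow(row)

# Reading the text back yields the original fields unchanged.
buffer.seek(0)
restored = next(csv.reader(buffer))
print(restored == row)
```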

### Get the answers from the RAG pipeline

In [None]:
answers_dataset = get_answers(testset_df, vector_db, llm_name="gpt-4o")

In [None]:
answers_df = answers_dataset.to_pandas()
# Save the DataFrame to a CSV file with special characters handled
answers_df.to_csv('answers_df.csv', index=False, encoding='utf-8', sep=',', quoting=1)
# Display it
answers_df

### Evaluate the model

In [None]:
metrics = [faithfulness, context_precision, context_recall, answer_relevancy]
evaluation_results = evaluate_answers(answers_dataset, metrics)

- **Faithfulness** measures how factually consistent the generated answer is with the retrieved contexts. The number of statements in the answer that are supported by the given contexts is divided by the total number of statements in the generated answer. This metric uses the question, the contexts and the answer.

- **Context precision** measures the signal-to-noise ratio of the retrieved context. This metric is computed using the question and the contexts.

- **Context recall** measures whether all the relevant information required to answer the question was retrieved. This metric is computed from the ground_truth and the contexts. It relies on ground truth labels, which are usually human-annotated; in this notebook they come from the generated test set.

- **Answer relevancy** measures how relevant the generated answer is to the question. This metric is computed using the question and the answer. For example, the answer “France is in western Europe.” to the question “Where is France and what is its capital?” would achieve a low answer relevancy because it only answers half of the question.

All metrics are scaled to the range [0, 1], with higher values indicating better performance.
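
As a worked example of the faithfulness ratio described above, the sketch below divides the number of statements supported by the contexts by the total number of statements in the answer. In Ragas an LLM extracts and judges the statements; here the verdicts are hand-labeled, made-up values, and `faithfulness_score` is a hypothetical helper.

```python
def faithfulness_score(verdicts: list) -> float:
    """Share of answer statements that are supported by the retrieved contexts."""
    if not verdicts:
        raise ValueError("The answer contains no statements to judge.")
    # True counts as 1 and False as 0, so the sum is the number of supported statements.
    return sum(verdicts) / len(verdicts)


# Hand-labeled verdicts: True means the statement can be inferred from the contexts.
verdicts = [True, True, False, True]
print(faithfulness_score(verdicts))  # 3 supported statements out of 4 -> 0.75
```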

In [None]:
evaluation_df = evaluation_results.to_pandas()
# Save the DataFrame to a CSV file with special characters handled
evaluation_df.to_csv('evaluation_df.csv', index=False, encoding='utf-8', sep=',', quoting=1)
evaluation_df

### Visualize the results


In [None]:
# Define the metrics to plot
metrics_str = ['faithfulness', 'context_precision', 'context_recall', 'answer_relevancy']
plot_evaluation_results(evaluation_df, metrics_str)

## Analysis

Now we want to compare the performance of the RAG pipeline with different OpenAI models. We will use the same test set for all the models and then compare the results.

This will answer the question:
> Is the performance of the RAG pipeline model dependent on the underlying LLM model? And if so, which model performs best?
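
One simple way to approach this question is to average each metric per model and compare the means. The sketch below does that with the standard library on placeholder scores. The model names `model_a` and `model_b`, the numbers, and the `mean_scores` helper are all made up for illustration and are not results from this notebook.

```python
from statistics import mean


def mean_scores(per_question_scores: dict) -> dict:
    """Average every metric over all questions, for each model."""
    return {
        model: {metric: mean(values) for metric, values in metrics.items()}
        for model, metrics in per_question_scores.items()
    }


# Placeholder per-question scores (NOT real results).
scores = {
    "model_a": {"faithfulness": [1.0, 0.5], "answer_relevancy": [0.5, 0.5]},
    "model_b": {"faithfulness": [0.5, 0.5], "answer_relevancy": [1.0, 0.5]},
}

summary = mean_scores(scores)
# Pick the model with the highest mean for one metric of interest.
best_faithfulness = max(summary, key=lambda model: summary[model]["faithfulness"])
print(summary)
print(best_faithfulness)
```

Means hide the spread of the scores, which is why the notebook also plots the full distributions.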

In [None]:
llm_models = ["gpt-3.5-turbo", "gpt-4o", "gpt-4-turbo"]
evaluations_llm_df = analysis_llm_impact(testset_df, vector_db, llm_models, metrics)

In [None]:
print("Evaluation dataframes for different language models:\n")
print("Language model: gpt-3.5-turbo")
eval_df_gpt_3_5_turbo = evaluations_llm_df["gpt-3.5-turbo"].to_pandas()
# Save the DataFrame to a CSV file with special characters handled
eval_df_gpt_3_5_turbo.to_csv('eval_df_gpt_3_5_turbo.csv', index=False, encoding='utf-8', sep=',', quoting=1)
eval_df_gpt_3_5_turbo

In [None]:
print("Language model: gpt-4o")
eval_df_gpt_4o = evaluations_llm_df["gpt-4o"].to_pandas()
# Save the DataFrame to a CSV file with special characters handled
eval_df_gpt_4o.to_csv('eval_df_gpt_4o.csv', index=False, encoding='utf-8', sep=',', quoting=1)
eval_df_gpt_4o

In [None]:
print("Language model: gpt-4-turbo")
eval_df_gpt_4_turbo = evaluations_llm_df["gpt-4-turbo"].to_pandas()
# Save the DataFrame to a CSV file with special characters handled
eval_df_gpt_4_turbo.to_csv('eval_df_gpt_4_turbo.csv', index=False, encoding='utf-8', sep=',', quoting=1)
eval_df_gpt_4_turbo

In [None]:
# Plot the analysis of the impact of the language model on the evaluation
plot_analysis_llm_impact(evaluations_llm_df, metrics_str)

Next, we will compare the performance of the RAG pipeline across several test sets, each generated from a different database. Each database is built with a different chunk size.

This will answer the question:
> Does the chunk size used to build the database and generate the test set affect the performance of the RAG pipeline? And if so, which chunk size performs best?
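
The chunk size controls how many chunks a document produces and how much text each one carries. The sketch below uses a naive fixed-width character splitter to show the effect. The splitter actually used to build these databases is not shown in this notebook, and whether the sizes are counted in characters or tokens is not stated, so `split_fixed` and the placeholder document are only stand-ins.

```python
def split_fixed(text: str, chunk_size: int) -> list:
    """Naive stand-in splitter: cut the text every `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[start:start + chunk_size] for start in range(0, len(text), chunk_size)]


document = "x" * 6000  # placeholder document of 6000 characters

# Larger chunks mean fewer, longer pieces of text to retrieve from.
for chunk_size in (500, 1000, 2000, 3000):
    print(chunk_size, len(split_fixed(document, chunk_size)))
```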

In [None]:
"""# List of database paths
db_paths = [
    "../chroma_db_500",
    "../chroma_db_1000",
    "../chroma_db_2000",
    "../chroma_db_3000"
]

# Create testsets for all databases
vector_dbs, testsets_sizes = load_and_create_testsets(db_paths)"""

In [None]:
"""# Generate answers and evaluate them for all databases
evaluations_size_df = analysis_size_impact(testsets_sizes, vector_dbs, llm_model="gpt-4o", metrics=metrics)"""

In [None]:
"""# Plot the analysis of the impact of the vector database on the evaluation
plot_analysis_size_impact(evaluations_size_df, metrics_str)"""