# RAG System Evaluation

## Evaluation Methodology

This document outlines the methodology used to evaluate the performance of our Retrieval-Augmented Generation (RAG) system. Our goal is to quantitatively measure the quality and reliability of the generated answers.

### Golden Dataset Creation

To establish a ground truth for our evaluation, we created a "golden dataset" of question-and-answer pairs. We built it manually by uploading the four source PDFs directly into the Gemini chat interface. For each document, we asked a series of diverse, relevant questions and collected the resulting answers together with their verbatim source references. A human review of the collected pairs then checks that the dataset is high-quality and accurately reflects the content of the source material.
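
Before a golden dataset file is used, it can help to confirm that every record has the fields the evaluation code reads. The sketch below is a minimal check, assuming each record is a JSON object with `question` and `ground_truth_answer` keys (the two names used later in this notebook). The `source_reference` key and the sample records are hypothetical placeholders, not taken from the real datasets.

```python
import json

# "question" and "ground_truth_answer" are the keys the evaluation code reads.
# "source_reference" below is a hypothetical name for the verbatim source excerpt.
REQUIRED_KEYS = {"question", "ground_truth_answer"}


def validate_golden_records(records: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the records look usable."""
    problems = []
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
        # Flag required keys that are present but blank.
        for key in sorted(REQUIRED_KEYS & record.keys()):
            if not str(record[key]).strip():
                problems.append(f"record {i}: empty value for '{key}'")
    return problems


# Placeholder records for illustration only.
sample = json.loads("""
[
  {"question": "Placeholder question?",
   "ground_truth_answer": "Placeholder answer copied from the source PDF.",
   "source_reference": "Placeholder verbatim excerpt."},
  {"question": "   "}
]
""")

for problem in validate_golden_records(sample):
    print(problem)
```

Running a check like this once per dataset file surfaces malformed records before any API calls are spent on them.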

### Evaluation Approach

Our evaluation uses an **"LLM-as-a-Judge"** framework. The process follows three steps:

1.  **Querying**: Each question from the golden dataset is sent to our RAG system's API to retrieve its generated answer and source references.
2.  **Judging**: A separate LLM (`gpt-4o`) acts as an impartial judge and compares our RAG system's output against the ground truth from the golden dataset.
3.  **Scoring**: The judge scores each response on two key metrics:
    * **Faithfulness**: Assesses if the generated answer is strictly supported by the retrieved context, penalizing any hallucinations.
    * **Answer Relevance**: Measures how well the answer addresses the user's original question.

The aggregated scores from this process provide a quantitative measure of our RAG system's performance.
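
The three steps above can be sketched as a single loop. This is an illustrative outline only: `fake_system` and `fake_judge` are hypothetical stand-ins for the RAG API and the judge LLM, and the function names are not the ones used in the notebook cells that follow.

```python
from statistics import fmean


def run_evaluation(golden, query_system, judge):
    """Query the system for each golden item, judge the response, collect one row per item."""
    rows = []
    for item in golden:
        response = query_system(item["question"])    # 1. Querying
        scores = judge(item, response)               # 2. Judging
        rows.append({**item, **response, **scores})  # 3. Scoring: keep scores with the row
    return rows


def average(rows, key):
    """Mean of a score column, ignoring rows where the judge produced no score."""
    values = [row[key] for row in rows if row[key] is not None]
    return fmean(values) if values else None


# Hypothetical stand-ins so the sketch runs without any network access.
def fake_system(question):
    return {"answer": f"stub answer to: {question}", "references": []}


def fake_judge(item, response):
    return {"faithfulness_score": 1.0, "relevance_score": 0.5}


golden = [
    {"question": "Q1", "ground_truth_answer": "A1"},
    {"question": "Q2", "ground_truth_answer": "A2"},
]
rows = run_evaluation(golden, fake_system, fake_judge)
print(average(rows, "faithfulness_score"), average(rows, "relevance_score"))
```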

### 1. Setup & Imports

First, we'll set up the notebook's environment. This block imports all the necessary Python libraries for file handling (`os`, `json`), data manipulation (`pandas`), and making API calls (`requests`). We also define key configuration variables, like the API endpoint and file paths, in one central place for easy modification.

In [5]:
import os
import json
import requests
import pandas as pd
import time
import uuid
from tqdm.auto import tqdm
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# --- Configuration ---
RAG_API_URL = "http://localhost:8000/api/question"
DATASET_DIR = "evaluation/datasets"
RESULTS_FILE = "evaluation/evaluation_results.csv"

### 2. Configure the LLM-as-a-Judge

This block sets up the core logic for our evaluation: the **"LLM-as-a-Judge."** This automated evaluator will score our RAG system's performance consistently.

First, we define a `JudgeScores` schema using Pydantic. This forces the LLM to provide its feedback in a clean, predictable JSON format, which is crucial for collecting and analyzing the results. We then initialize a powerful model (`gpt-4o`) to act as the judge, with its `temperature` set to `0.0` so that its scores are as deterministic and repeatable as possible.

Finally, we create a detailed prompt that serves as the instruction set for our judge, defining the key metrics: **Faithfulness** (is the answer based on the sources?) and **Answer Relevance** (does it answer the question?). These components are combined into a single `evaluation_chain` that is now ready to receive data and return a structured score.

In [None]:
class JudgeScores(BaseModel):
    """A Pydantic model to define the structured output for the judge LLM."""
    faithfulness_score: float = Field(
        description="The score for faithfulness (0.0 for hallucination, 1.0 for fully faithful)."
    )
    relevance_score: float = Field(
        description="The score for relevance (0.0 for irrelevant, 0.5 for partial, 1.0 for fully relevant)."
    )
    reason: str = Field(
        description="A brief justification for the assigned scores."
    )

llm = ChatOpenAI(model="gpt-4o", temperature=0.0)

prompt_template = ChatPromptTemplate.from_template(
    """You are an impartial judge evaluating the quality of a RAG system's response.
    Your task is to score the system's "Faithfulness" and "Answer Relevance".
    - Faithfulness: Does the answer strictly rely on the provided context? Score 1 if yes, 0 if no (hallucination).
    - Answer Relevance: Does the answer directly and completely address the user's question? Score 1 for a complete answer, 0.5 for a partial answer, and 0 for a non-relevant answer.

    Provide your response as a valid JSON object with "faithfulness_score", "relevance_score", and a brief "reason".

    ---
    User Question: "{question}"
    ---
    Retrieved Context (for Faithfulness check): "{rag_references}"
    ---
    RAG System Answer: "{rag_answer}"
    ---
    Ground Truth Answer (for Relevance check): "{ground_truth_answer}"
    """
)

evaluation_chain = prompt_template | llm.with_structured_output(JudgeScores)
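
The schema declares both scores as plain floats, so a value such as `0.7` would still parse even though the prompt only allows `0`/`1` for Faithfulness and `0`/`0.5`/`1` for Answer Relevance. A minimal sketch of a post-hoc check follows, assuming the judge output has already been converted to a dictionary. The helper name `check_judge_scores` is hypothetical and not part of the notebook.

```python
# Allowed values, taken from the scoring rules stated in the judge prompt.
ALLOWED_FAITHFULNESS = {0.0, 1.0}
ALLOWED_RELEVANCE = {0.0, 0.5, 1.0}


def check_judge_scores(scores: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the scores follow the prompt."""
    issues = []
    if scores.get("faithfulness_score") not in ALLOWED_FAITHFULNESS:
        issues.append(f"unexpected faithfulness_score: {scores.get('faithfulness_score')!r}")
    if scores.get("relevance_score") not in ALLOWED_RELEVANCE:
        issues.append(f"unexpected relevance_score: {scores.get('relevance_score')!r}")
    if not str(scores.get("reason") or "").strip():
        issues.append("reason is empty")
    return issues


print(check_judge_scores(
    {"faithfulness_score": 1.0, "relevance_score": 0.5, "reason": "Partial answer."}
))
print(check_judge_scores(
    {"faithfulness_score": 0.7, "relevance_score": 0.5, "reason": "Partial answer."}
))
```

Rows that fail the check can be reviewed by hand instead of being averaged silently into the final scores.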

### 3. Helper Function: Query RAG API

This helper function serves as the direct interface between our evaluation notebook and the live RAG application. Its purpose is to take a single `question`, send it to the RAG API, and return the system's response.

A new, unique `thread_id` is generated for every call, ensuring each question is treated as an independent, single-turn query without any conversational history. The function pauses for 10 seconds (`time.sleep(10)`) before each request to respect the API rate limits of the underlying services (like the Cohere reranker). Finally, request errors are caught and reported, and the function returns empty `answer` and `references` values so that a single failed API call does not crash the entire evaluation.

In [None]:
def query_rag_api(question: str) -> dict:
    """Sends a question to the RAG API and returns the response."""
    payload = {"question": question, "thread_id": str(uuid.uuid4())}
    try:
        # Pause before each request to comply with the free API key rate limit from Cohere
        time.sleep(10)
        response = requests.post(RAG_API_URL, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as e:
        print(f"API call failed for question '{question}': {e}")
        return {"answer": None, "references": None}
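
The helper above gives up after one failed request. As an alternative that this notebook does not use, transient failures can be retried with exponential backoff. The sketch below is a generic wrapper under that assumption: `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument is injectable so the example runs without actually waiting.

```python
import time


def call_with_retries(call, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `call()`; on failure wait base_delay * 2**attempt seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            # In real use, catch a narrower type such as requests.RequestException.
            if attempt == max_attempts - 1:
                raise  # Out of attempts: let the caller see the last error.
            sleep(base_delay * 2 ** attempt)


# Simulated flaky call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated transient failure")
    return {"answer": "ok", "references": []}


recorded_delays = []  # Collect the delays instead of sleeping.
result = call_with_retries(flaky_call, max_attempts=4, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In the notebook, the wrapped call would be the `requests.post(...)` request, with the fixed pause kept or replaced depending on how strict the rate limit is.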

### 4. Helper Function: LLM Judge

This is the final and most critical helper function, which executes the **"LLM-as-a-Judge"** logic.

It takes all the relevant information for a single data point—the question, the RAG system's answer, its retrieved references, and the ground truth answer—and passes them to the `evaluation_chain` we configured earlier. The function then returns a dictionary containing the judge's scores for **Faithfulness** and **Relevance**, along with a brief reason for its decision. Crucially, the entire call is wrapped in robust error handling to ensure that our evaluation loop can continue running even if a specific API call to the judge model fails.

In [None]:
def get_llm_judge_score(question: str, rag_answer: str, rag_references: list, ground_truth_answer: str) -> dict:
    """
    Uses a LangChain-powered LLM to judge the RAG system's output.
    """
    # Ensure the list of references is formatted as a clean string for the prompt
    references_str = json.dumps(rag_references, indent=2)

    try:
        # Invoke the chain with the required inputs
        result_obj = evaluation_chain.invoke({
            "question": question,
            "rag_answer": rag_answer,
            "rag_references": references_str,
            "ground_truth_answer": ground_truth_answer
        })
        # The result is a Pydantic object, so we convert it to a dictionary
        return result_obj.model_dump()
    except Exception as e:
        # Handle potential API errors or parsing failures
        print(f"An error occurred during LLM judging for question '{question}': {e}")
        return {
            "faithfulness_score": None,
            "relevance_score": None,
            "reason": f"Error: {e}"
        }
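
When the RAG API call fails, `query_rag_api` returns `None` for the answer, and the judge would then be asked to score an empty response. One possible guard, shown here as a sketch, skips the judge call in that case. The name `judge_or_skip` is hypothetical, and `fake_judge` is a stand-in for `get_llm_judge_score` so the example runs offline.

```python
def judge_or_skip(row: dict, judge) -> dict:
    """Call the judge only when the RAG system actually returned an answer."""
    if row.get("rag_answer") is None:
        # Same shape as the error result above, so downstream code is unchanged.
        return {
            "faithfulness_score": None,
            "relevance_score": None,
            "reason": "Skipped: the RAG API returned no answer.",
        }
    return judge(
        row["question"],
        row["rag_answer"],
        row["rag_references"],
        row["ground_truth_answer"],
    )


# Stand-in judge that records how often it is called.
judge_calls = []


def fake_judge(question, rag_answer, rag_references, ground_truth_answer):
    judge_calls.append(question)
    return {"faithfulness_score": 1.0, "relevance_score": 1.0, "reason": "stub"}


ok_row = {"question": "Q1", "rag_answer": "A", "rag_references": [], "ground_truth_answer": "A"}
failed_row = {"question": "Q2", "rag_answer": None, "rag_references": None, "ground_truth_answer": "B"}

print(judge_or_skip(ok_row, fake_judge))
print(judge_or_skip(failed_row, fake_judge))
```

Skipped rows keep `None` scores, which pandas treats as missing values and leaves out of `mean()` by default.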

In [18]:
# Load previously saved results if the file exists; otherwise start with an empty DataFrame
if os.path.exists(RESULTS_FILE):
    print(f"Loading existing results from {RESULTS_FILE}...")
    all_results_df = pd.read_csv(RESULTS_FILE)
else:
    print("Initializing a new DataFrame for evaluation results.")
    all_results_df = pd.DataFrame()

Loading existing results from evaluation/evaluation_results.csv...
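
Because existing results are loaded and new results are appended to them, re-running the evaluation for a document that is already in the CSV would store its questions twice. A minimal sketch of one way to guard against this is shown below, using plain dictionaries so it runs without pandas. The key choice (`source_document` plus `question`) and the helper name are assumptions, and the rows are placeholders.

```python
def dedupe_keep_last(rows, key_fields=("source_document", "question")):
    """Keep one row per key, preferring the most recently appended row."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        latest[key] = row  # A later row overwrites an earlier one with the same key.
    return list(latest.values())


# Placeholder rows: the first question was evaluated twice.
rows = [
    {"source_document": "LB5001.pdf", "question": "Q1", "relevance_score": 0.5},
    {"source_document": "LB5001.pdf", "question": "Q2", "relevance_score": 1.0},
    {"source_document": "LB5001.pdf", "question": "Q1", "relevance_score": 1.0},
]

for row in dedupe_keep_last(rows):
    print(row)
```

With a DataFrame, `all_results_df.drop_duplicates(subset=["source_document", "question"], keep="last")` would apply the same rule.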


### 5. Main Evaluation Function

This is the central function that runs a complete evaluation for a single PDF. It takes the filename of a PDF as input and performs a three-step process:

1.  **Load Data**: It first finds and loads the corresponding "golden" JSON dataset.
2.  **Query RAG**: It then iterates through every question in that dataset, calling the `query_rag_api` function to get the live response from the RAG application.
3.  **Judge Results**: Finally, it sends the question, the RAG's response, and the ground truth answer to the `get_llm_judge_score` function to get a performance score.

The function returns a complete Pandas DataFrame containing all this information, which can then be saved and analyzed.

In [7]:
def evaluate_dataset(dataset_dir: str, pdf_filename: str):
    """
    Loads a golden dataset using a string path, queries the RAG system, 
    and returns the judged results as a DataFrame.
    """
    print(f"--- Starting evaluation for {pdf_filename} ---")
    
    # Construct the path using simple string joining
    dataset_filename = pdf_filename.replace('.pdf', '.json')
    dataset_path = f"{dataset_dir}/{dataset_filename}"
    
    try:
        # Load the golden dataset; read as UTF-8 because the source documents include Portuguese text
        with open(dataset_path, 'r', encoding='utf-8') as f:
            records = json.load(f)
    except FileNotFoundError:
        print(f"Error: Could not find dataset file at path: {dataset_path}")
        return None
        
    batch_df = pd.DataFrame(records)
    batch_df['source_document'] = pdf_filename
    
    # Query RAG API for each question
    tqdm.pandas(desc=f"Querying for {pdf_filename}")
    rag_results = batch_df["question"].progress_apply(query_rag_api)
    batch_df["rag_answer"] = [r.get("answer") for r in rag_results]
    batch_df["rag_references"] = [r.get("references") for r in rag_results]

    # Use LLM as a Judge
    eval_scores = []
    for index, row in tqdm(batch_df.iterrows(), total=len(batch_df), desc=f"Judging for {pdf_filename}"):
        scores = get_llm_judge_score(
            row["question"],
            row["rag_answer"],
            row["rag_references"],
            row["ground_truth_answer"]
        )
        eval_scores.append(scores)
    
    scores_df = pd.DataFrame(eval_scores)
    final_batch_df = pd.concat([batch_df, scores_df], axis=1)
    
    print(f"--- Finished evaluation for {pdf_filename} ---\n")
    return final_batch_df
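
The function derives the dataset filename with `pdf_filename.replace('.pdf', '.json')`, which replaces every occurrence of `.pdf` in the name, not only the extension. A small sketch of an alternative based on `pathlib` follows. The helper name `golden_dataset_path` is hypothetical, and the second filename is a made-up example chosen to show the difference.

```python
from pathlib import Path


def golden_dataset_path(dataset_dir: str, pdf_filename: str) -> Path:
    """Build the golden dataset path by swapping only the final suffix for .json."""
    return Path(dataset_dir) / Path(pdf_filename).with_suffix(".json").name


print(golden_dataset_path("evaluation/datasets", "LB5001.pdf").as_posix())

# Made-up filename with ".pdf" in the middle of the name.
tricky = "manual.pdf.notes.pdf"
print(tricky.replace(".pdf", ".json"))
print(golden_dataset_path("evaluation/datasets", tricky).name)
```

For the four filenames used in this notebook, both approaches produce the same path.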

### 6. Execute Evaluation for a Specific PDF

This is the main action cell for our manual workflow. After we have restarted the RAG application and uploaded a specific PDF, we run this cell to kick off the evaluation for that document.

It calls our main `evaluate_dataset` function with the specific filename we want to evaluate. If the evaluation is successful, it appends the results to our master `all_results_df` DataFrame and immediately saves the progress to the CSV file. This ensures that even if a later step fails, the results from this run are safely stored.

In [16]:
pdf1_results = evaluate_dataset(dataset_dir=DATASET_DIR, pdf_filename="LB5001.pdf")
if pdf1_results is not None:
    all_results_df = pd.concat([all_results_df, pdf1_results], ignore_index=True)
    all_results_df.to_csv(RESULTS_FILE, index=False) # Save progress

--- Starting evaluation for LB5001.pdf ---



--- Finished evaluation for LB5001.pdf ---



In [12]:
pdf2_results = evaluate_dataset(dataset_dir=DATASET_DIR, pdf_filename="MN414_0224.pdf")
if pdf2_results is not None:
    all_results_df = pd.concat([all_results_df, pdf2_results], ignore_index=True)
    all_results_df.to_csv(RESULTS_FILE, index=False)

--- Starting evaluation for MN414_0224.pdf ---



--- Finished evaluation for MN414_0224.pdf ---



In [10]:
pdf3_results = evaluate_dataset(dataset_dir=DATASET_DIR, pdf_filename="WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf")
if pdf3_results is not None:
    all_results_df = pd.concat([all_results_df, pdf3_results], ignore_index=True)
    all_results_df.to_csv(RESULTS_FILE, index=False)

--- Starting evaluation for WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf ---



--- Finished evaluation for WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf ---



In [8]:
pdf4_results = evaluate_dataset(dataset_dir=DATASET_DIR, pdf_filename="WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf")
if pdf4_results is not None:
    all_results_df = pd.concat([all_results_df, pdf4_results], ignore_index=True)
    all_results_df.to_csv(RESULTS_FILE, index=False)

--- Starting evaluation for WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf ---



--- Finished evaluation for WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf ---



### 7. Final Analysis and Results

This is the final step of our evaluation process. After running the evaluation for all individual documents, this block calculates and displays the aggregate results. It provides a high-level summary of the RAG system's performance by calculating the overall average scores for **Faithfulness** and **Relevance** across all questions. It also computes a per-document breakdown, allowing us to see how the system performed on each specific document's context.

In [19]:
print("--- Overall Evaluation Results ---")
print(f"Total questions evaluated: {len(all_results_df)}")

# Overall average scores
avg_faithfulness = all_results_df["faithfulness_score"].mean()
avg_relevance = all_results_df["relevance_score"].mean()

print(f"\nOverall Average Faithfulness Score: {avg_faithfulness:.2f}")
print(f"Overall Average Answer Relevance Score: {avg_relevance:.2f}")

# Per-document average scores
print("\n--- Scores by Document ---")
per_doc_scores = all_results_df.groupby('source_document')[['faithfulness_score', 'relevance_score']].mean()
per_doc_scores

--- Overall Evaluation Results ---
Total questions evaluated: 28

Overall Average Faithfulness Score: 1.00
Overall Average Answer Relevance Score: 0.84

--- Scores by Document ---


| source_document | faithfulness_score | relevance_score |
| --- | --- | --- |
| LB5001.pdf | 1.0 | 0.928571 |
| MN414_0224.pdf | 1.0 | 0.857143 |
| WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf | 1.0 | 0.857143 |
| WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf | 1.0 | 0.714286 |
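
The overall relevance score can be cross-checked against the per-document means reported above. The sketch below assumes the 28 evaluated questions are split evenly, 7 per document. That split is an assumption consistent with the total but not stated in the results table.

```python
# Per-document relevance means, copied from the results above.
per_document_relevance = {
    "LB5001.pdf": 0.928571,
    "MN414_0224.pdf": 0.857143,
    "WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf": 0.857143,
    "WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf": 0.714286,
}
QUESTIONS_PER_DOCUMENT = 7  # Assumption: 28 questions / 4 documents.

# Recover the summed relevance points per document (scores come in steps of 0.5).
points = {
    name: round(mean * QUESTIONS_PER_DOCUMENT, 1)
    for name, mean in per_document_relevance.items()
}
total_questions = QUESTIONS_PER_DOCUMENT * len(per_document_relevance)
overall_relevance = sum(points.values()) / total_questions

print(list(points.values()))
print(total_questions, round(overall_relevance, 2))
```

Under that assumption, the four documents earned 6.5, 6, 6, and 5 relevance points out of 7, which is 23.5 out of 28 and matches the reported overall score of 0.84.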
