## 3. Test-First Framework for RAG Evaluation

In the previous sections, we explored how to load data, create vector embeddings, and use helper libraries like LangChain to build the foundational components of a Retrieval Augmented Generation (RAG) system. We successfully set up a retriever capable of fetching relevant documents from our InterSystems IRIS database.

However, building a RAG system is an iterative process. How do we know if our retriever is fetching the *most* relevant documents? How do we measure the quality of the answers generated by our system? This is where evaluation comes in.

In this notebook, we will introduce the concept of a **test-first framework** for RAG systems. This approach emphasizes creating evaluation mechanisms *before* or *alongside* development, allowing us to continuously measure and improve our system's performance. We will leverage **Ragas**, a powerful open-source library specifically designed for evaluating RAG pipelines.

### Objectives:
1. Understand the importance of evaluation in RAG systems.
2. Learn about the test-first methodology for iterative improvement.
3. Introduce Ragas and its capabilities for RAG evaluation.
4. Generate a synthetic test dataset using Ragas.
5. Implement and understand key Ragas evaluation metrics (e.g., faithfulness, answer relevancy, context precision, context recall).
6. Evaluate our current RAG setup using the generated dataset and Ragas.
7. Discuss how to interpret evaluation results to guide improvements.
8. Walk through an example improvement cycle based on the first results.

### 1. Setting up the Environment and Dependencies

First, let's install Ragas and ensure our environment is ready. We'll also need to import libraries from the previous notebook to re-establish our connection to IRIS and our document retriever.

In [None]:
%pip install ragas==0.1.7 pandas langchain-openai python-dotenv

In [None]:
import os
import getpass
from dotenv import load_dotenv
import pandas as pd

# Load environment variables (ensure your .env file has OPENAI_API_KEY)
load_dotenv(override=True)

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key: ")

Now, let's import the necessary LangChain and IRIS components, similar to Notebook 3, to access our data.

In [None]:
from langchain.docstore.document import Document
from langchain_community.document_loaders import JSONLoader # Community package, matching the embeddings import below
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings # Using FastEmbed as in Notebook 3
from langchain_iris import IRISVector
from langchain_openai import ChatOpenAI # For Ragas generator and evaluation

# Database connection details (same as Notebook 3)
username = '_SYSTEM'
password = 'SYsysS'
hostname = 'localhost' # Assuming IRIS is running locally in the workshop environment
port = 1972
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

COLLECTION_NAME = "case_reports" # Same collection as Notebook 3

# Initialize embeddings
embeddings = FastEmbedEmbeddings()

# Initialize IRISVector store
db = IRISVector(
    embedding_function=embeddings,  # Embedding model used for similarity search
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING
)

# Get the retriever
retriever = db.as_retriever()

print(f"Retriever initialized: {retriever}")

We also need the documents that were loaded in Notebook 3 to generate our test set. If you haven't run Notebook 3 to populate the `case_reports` collection and create chunks, you might need to adapt the following or ensure that data exists.

In [None]:
# Load documents again to be used by Ragas TestsetGenerator
# This assumes the data loading and chunking process from Notebook 3 has been performed
# or that the chunks are accessible/recreatable.

loader = JSONLoader(
    file_path='./data/healthcare/augmented_notes_100.jsonl',
    jq_schema='.note',
    json_lines=True
)
documents_for_testgen = loader.load() # These are whole documents

# The Ragas TestsetGenerator accepts LangChain Document objects directly,
# so no further conversion is needed here.
print(f"Loaded {len(documents_for_testgen)} documents for test set generation.")
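
To make the loader settings concrete: with `json_lines=True` each line of the file is parsed as its own JSON object, and `jq_schema='.note'` selects that object's `note` field as the document text. The sketch below is a rough plain-Python equivalent, not the loader's actual implementation. The two sample records and the `id` field are made up for illustration, and `extract_notes` is a hypothetical helper name.

```python
import json


def extract_notes(jsonl_text: str, field: str = "note") -> list[str]:
    """Return the value of `field` from every non-empty JSON line."""
    notes = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        notes.append(record[field])
    return notes


# Two made-up records standing in for the real file
sample = (
    '{"note": "Patient presented with fever.", "id": 1}\n'
    '{"note": "Follow-up after surgery.", "id": 2}\n'
)
notes = extract_notes(sample)
print(f"{len(notes)} notes extracted")
```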

### 2. Understanding RAG Evaluation & The Test-First Mindset

A RAG system has two main parts: **Retrieval** and **Generation**.
- **Retrieval**: How well does our system find relevant information from the knowledge base (our IRIS database)?
- **Generation**: Given the retrieved information, how well does our system synthesize a coherent, accurate, and relevant answer?

**Why evaluate?**
- **Identify Weaknesses**: Pinpoint whether issues lie in retrieval, generation, or both.
- **Measure Progress**: Quantify improvements as we tune parameters (e.g., chunk size, embedding models, prompts).
- **Prevent Hallucinations**: Ensure generated answers are grounded in the provided context.
- **Ensure Relevance**: Verify that answers directly address the user's query.

The **Test-First Mindset** encourages us to:
1. **Define Success**: What does a "good" RAG response look like for our use case?
2. **Establish Baselines**: Measure the performance of our initial RAG setup.
3. **Iterate and Measure**: As we make changes (e.g., try different embedding models, adjust chunking strategies, refine prompts), re-evaluate to see if performance improves.

This iterative loop of `Develop -> Test -> Analyze -> Refine` is key to building robust and reliable RAG applications.
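
One way to make "Define Success" concrete is to write target scores down before tuning anything, then check each evaluation run against them. The sketch below assumes per-metric average scores are available as a plain dictionary. The target values and the `unmet_targets` helper are illustrative assumptions, not part of Ragas.

```python
# Hypothetical minimum acceptable scores; choose values that fit your use case.
TARGETS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.8,
    "context_precision": 0.7,
    "context_recall": 0.7,
}


def unmet_targets(scores: dict[str, float], targets: dict[str, float]) -> dict[str, float]:
    """Return {metric: shortfall} for every metric that is below its target."""
    shortfalls = {}
    for metric, target in targets.items():
        score = scores.get(metric)
        if score is None:
            shortfalls[metric] = target  # metric was not measured at all
        elif score < target:
            shortfalls[metric] = round(target - score, 4)
    return shortfalls


# Made-up baseline scores for illustration
baseline = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.9,
    "context_precision": 0.75,
    "context_recall": 0.6,
}
print(unmet_targets(baseline, TARGETS))  # only context_recall falls short here
```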

### 3. Introducing Ragas

[Ragas](https://docs.ragas.io/) is a framework that helps you evaluate your RAG pipelines. It provides a set of metrics tailored for RAG systems and can even help generate synthetic test data.

**Key Ragas Metrics we'll explore:**

**Retrieval-focused:**
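- `context_precision`: Measures the signal-to-noise ratio of the retrieved contexts, rewarding retrievers that rank relevant chunks above irrelevant ones. Answers: *Are the retrieved contexts relevant, and do the relevant ones appear first?*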
- `context_recall`: Measures the ability of the retriever to retrieve all necessary information needed to answer the question. Answers: *Did we retrieve all the relevant contexts?*
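
The arithmetic behind these two metrics can be sketched in a few lines. In Ragas the relevance and attribution verdicts come from an LLM judge. Here they are supplied by hand as 0/1 lists so that only the scoring step is shown. The function names are hypothetical, chosen so they do not clash with the Ragas metric objects imported later.

```python
def sketch_context_precision(relevance: list[int]) -> float:
    """relevance[k] is 1 if the chunk at rank k+1 was judged relevant, else 0.

    Averages precision@k over the ranks that hold a relevant chunk,
    so relevant chunks near the top of the list score higher.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def sketch_context_recall(attributable: list[int]) -> float:
    """attributable[i] is 1 if ground-truth sentence i is supported by the retrieved contexts."""
    return sum(attributable) / len(attributable) if attributable else 0.0


# Same two relevant chunks, retrieved at different ranks
print(f"{sketch_context_precision([1, 0, 1, 0]):.2f}")  # 0.83
print(f"{sketch_context_precision([0, 1, 0, 1]):.2f}")  # 0.50

# Two of three ground-truth sentences can be traced to the contexts
print(f"{sketch_context_recall([1, 1, 0]):.2f}")  # 0.67
```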

**Generation-focused (conditioned on retrieved context):**
- `faithfulness`: Measures the factual consistency of the generated answer against the given context. Answers: *Is the answer grounded in the provided context, or is it hallucinating?*
- `answer_relevancy`: Measures how relevant the generated answer is to the input question. Answers: *Does the answer directly address the question?*

All four metrics rely on an LLM (e.g., GPT-3.5/4) acting as a judge, because they assess semantic qualities rather than exact string matches. `answer_relevancy` additionally uses an embedding model to compare questions.
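
The scoring step of the two generation metrics can be sketched in the same spirit. Faithfulness is the share of claims in the answer that the context supports. Answer relevancy is the mean cosine similarity between the original question and questions regenerated from the answer. This is a simplified sketch: the claim verdicts and the toy 2-D "embeddings" are supplied by hand, whereas Ragas obtains them from an LLM and an embedding model. The function names are hypothetical.

```python
import math


def sketch_faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that are supported by the retrieved context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sketch_answer_relevancy(question_vec: list[float], regenerated_vecs: list[list[float]]) -> float:
    """Mean similarity between the question and questions regenerated from the answer."""
    sims = [cosine(question_vec, vec) for vec in regenerated_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Three of four claims in the answer are backed by the context
print(sketch_faithfulness([True, True, False, True]))  # 0.75

# Toy vectors: one regenerated question matches the original, one is unrelated
print(sketch_answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```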

### 4. Generating a Synthetic Test Dataset with Ragas

To evaluate our RAG system, we need a test dataset consisting of questions, ground truth answers, and the contexts that should ideally be retrieved. Ragas can help us generate such a dataset from our existing documents.

The `TestsetGenerator` from Ragas uses an LLM (generator_llm) and a critic LLM (critic_llm) along with an embedding model to create question-context-answer triplets from your documents.

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings as RagasOpenAIEmbeddings # Ragas works well with OpenAI embeddings for generation

# Initialize the LLMs and Embeddings for the TestsetGenerator
# Note: Testset generation can be resource-intensive and may take time.
# For the workshop, we might use a small subset of documents or a pre-generated set if time is a constraint.
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k") # or gpt-4 if available and budget allows
critic_llm = ChatOpenAI(model="gpt-4-turbo-preview") # Critic often benefits from a stronger model
# Ragas testset generator uses its own embeddings, often OpenAI for consistency in generation quality
ragas_embeddings = RagasOpenAIEmbeddings()

test_generator = TestsetGenerator.from_langchain(
    generator_llm=generator_llm,
    critic_llm=critic_llm,
    embeddings=ragas_embeddings
)

# Let's generate a small test set from a few documents to see how it works.
# distributions: how many questions of each type to generate
# For a real scenario, you'd use more documents and generate a larger test set.
num_documents_for_test_generation = 5 # Adjust as needed for speed
if len(documents_for_testgen) > num_documents_for_test_generation:
    sample_documents_for_testgen = documents_for_testgen[:num_documents_for_test_generation]
else:
    sample_documents_for_testgen = documents_for_testgen

print(f"Generating testset from {len(sample_documents_for_testgen)} documents...")

# This step can take a while and will make calls to the OpenAI API.
try:
    testset = test_generator.generate_with_langchain_docs(
        documents=sample_documents_for_testgen, 
        test_size=10, # Number of question/answer pairs to generate
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25}
    )
    print("Testset generation complete.")
    test_df = testset.to_pandas()
    print(test_df.head())
except Exception as e:
    print(f"Error during testset generation: {e}")
    print("This might be due to API limits, network issues, or document content.")
    print("For the workshop, we might proceed with a placeholder or pre-generated dataset if this fails.")
    # Create a dummy dataframe for workshop continuation if generation fails
    test_df = pd.DataFrame({
        'question': ['What is the primary symptom of condition X?', 'How is Y treated?'],
        'contexts': [['Relevant context for X...'], ['Relevant context for Y...']],
        'ground_truth': ['The primary symptom is Z.', 'Y is treated with A and B.'],
        'evolution_type': ['simple', 'simple']
    })

The generated `test_df` DataFrame typically contains columns like:
- `question`: The synthetically generated question.
- `contexts`: The chunk(s) of text from your documents that are relevant to the question.
- `ground_truth`: The synthetically generated 'ideal' answer to the question, based on the contexts.
- `evolution_type`: The method used to generate the question (e.g., simple, reasoning).
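
Before evaluating, it can be worth filtering out generated rows that are unusable, for example rows whose ground truth came back empty or as a missing value. Whether this happens in your run is an assumption to check against your own output. The sketch below works on plain dictionaries and assumes `contexts` is a Python list. `usable_rows` is a hypothetical helper, not a Ragas function.

```python
def usable_rows(rows: list[dict]) -> list[dict]:
    """Keep rows that have a question, at least one context, and a real ground truth."""
    kept = []
    for row in rows:
        question = str(row.get("question") or "").strip()
        ground_truth = str(row.get("ground_truth") or "").strip()
        contexts = row.get("contexts") or []
        if not question or not contexts:
            continue
        # A missing value (float NaN) becomes the string "nan" after str()
        if not ground_truth or ground_truth.lower() == "nan":
            continue
        kept.append(row)
    return kept


# Made-up rows: only the first one is usable
rows = [
    {"question": "How is Y treated?", "contexts": ["Y is treated with A."], "ground_truth": "With A."},
    {"question": "What causes X?", "contexts": ["..."], "ground_truth": float("nan")},
    {"question": "", "contexts": ["..."], "ground_truth": "Something."},
]
print(f"{len(usable_rows(rows))} of {len(rows)} rows kept")
```

In the notebook, the same check could be applied with `usable_rows(test_df.to_dict("records"))`.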

### 5. Evaluating the RAG System with Ragas

Now that we have a test set (even if it's a small or dummy one for now), we can evaluate our RAG system. To do this with Ragas, we need to define how our RAG system retrieves contexts and generates answers.

The Ragas `evaluate` function expects the test data as a Hugging Face `Dataset` with `question`, `contexts`, `answer`, and `ground_truth` columns. Our retriever will provide the `contexts` and our generator (LLM) will provide the `answer`.
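
The expected shape is column-oriented: one list per column, all of the same length, where each `contexts` entry is itself a list of strings. A small check like the following can catch shape mistakes before any API calls are made. It is a sketch under the column names used in this notebook, and `check_eval_data` is a hypothetical helper.

```python
REQUIRED_COLUMNS = ("question", "answer", "contexts", "ground_truth")


def check_eval_data(data: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape looks right."""
    missing = [col for col in REQUIRED_COLUMNS if col not in data]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    lengths = {col: len(data[col]) for col in REQUIRED_COLUMNS}
    if len(set(lengths.values())) != 1:
        problems.append(f"column lengths differ: {lengths}")
    for i, ctx in enumerate(data["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            problems.append(f"row {i}: contexts must be a list of strings")
    for i, truth in enumerate(data["ground_truth"]):
        if not isinstance(truth, str):
            problems.append(f"row {i}: ground_truth must be a string")
    return problems


# Made-up single-row example in the expected shape
example = {
    "question": ["How is Y treated?"],
    "answer": ["Y is treated with A and B."],
    "contexts": [["Y is usually treated with A.", "B is added in severe cases."]],
    "ground_truth": ["Y is treated with A and B."],
}
print(check_eval_data(example))  # [] when everything lines up
```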

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset
import asyncio # Ragas evaluation can be async

# Define our RAG chain components for Ragas
# 1. Retriever: We already have `retriever` from IRISVector
# 2. Generator: An LLM that takes a question and context to produce an answer.

llm_for_ragas_eval = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Prepare the data for the Ragas `evaluate` function.
# Ragas does not call our retriever or LLM itself: we run the pipeline for each
# question below and pass the resulting 'contexts' and 'answer' to Ragas,
# together with the 'question' and 'ground_truth' from the test set.
#
# Note: the 'contexts' column in test_df holds the source chunks used to generate
# each ground truth. For evaluation we replace it with what *our* retriever returns.

if 'test_df' not in locals() or test_df.empty:
    print("Test DataFrame is not available. Skipping evaluation.")
    results_df = pd.DataFrame() # Lets the visualization cell below run without a NameError
else:
    questions = test_df["question"].tolist()
    ground_truths = test_df["ground_truth"].tolist()

    # Simulate RAG pipeline to get answers and retrieved contexts
    answers = []
    retrieved_contexts_list = []

    for question in questions:
        # 1. Retrieve contexts using our IRISVector retriever
        retrieved_docs = retriever.invoke(question) # `invoke` supersedes the deprecated `get_relevant_documents`
        retrieved_contexts = [doc.page_content for doc in retrieved_docs]
        retrieved_contexts_list.append(retrieved_contexts)
        
        # 2. Generate answer using LLM with retrieved_contexts
        context_str = "\n".join(retrieved_contexts)
        prompt = f"Question: {question}\nContext:\n{context_str}\nAnswer:"
        response = llm_for_ragas_eval.invoke(prompt)
        answers.append(response.content)

    # Create a Hugging Face Dataset for Ragas
    eval_data = {
        'question': questions,
        'answer': answers,
        'contexts': retrieved_contexts_list,
        'ground_truth': ground_truths # from the generated testset
    }
    dataset = Dataset.from_dict(eval_data)

    print("Dataset prepared for Ragas evaluation:")
    print(dataset)

    # Define metrics for evaluation
    metrics = [
        faithfulness,  # Requires 'question', 'answer', 'contexts'
        answer_relevancy, # Requires 'question', 'answer', 'contexts'
        context_precision, # Requires 'question', 'ground_truth', 'contexts'
        context_recall,    # Requires 'question', 'ground_truth', 'contexts'
    ]

    print("\nStarting Ragas evaluation...")
    # Note: All four metrics call the LLM, and answer_relevancy also uses the embeddings.
    # Ensure your OpenAI API key is set up and has sufficient quota.
    try:
        # Ragas evaluation may run asynchronously. If you hit an event-loop error
        # inside Jupyter, check the documentation for your Ragas version for a
        # synchronous option (such as the `is_async` argument noted below).
        results = evaluate(
            dataset=dataset,
            metrics=metrics,
            llm=llm_for_ragas_eval, # LLM for metrics that need it
            embeddings=ragas_embeddings # Embeddings for metrics that need it (e.g. answer_relevancy with embedding distance)
            # is_async=False # Check Ragas documentation for synchronous execution if needed
        )
        print("Ragas evaluation complete.")
        results_df = results.to_pandas()
        print(results_df.head())
    except Exception as e:
        print(f"Error during Ragas evaluation: {e}")
        print("This could be due to API calls, data format, or async issues in this environment.")
        results_df = pd.DataFrame() # Empty dataframe if evaluation fails
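
The evaluation loop above builds its prompt as a bare f-string with no instructions. Since `faithfulness` measures whether answers stay within the retrieved context, a more explicit prompt is one of the first things worth trying. The helper below is a hypothetical sketch of such a prompt. The wording and the `build_grounded_prompt` name are assumptions, not something prescribed by Ragas or LangChain.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Build a prompt that numbers each context passage and asks for a grounded answer."""
    numbered = "\n\n".join(
        f"[{i}] {passage.strip()}" for i, passage in enumerate(contexts, start=1)
    )
    return (
        "Answer the question using only the numbered context passages below. "
        "If the passages do not contain the answer, reply \"I don't know.\"\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up passages for illustration
prompt = build_grounded_prompt(
    "How is Y treated?",
    ["Y is usually treated with A. ", " B is added in severe cases."],
)
print(prompt)
```

Swapping this in for the f-string and re-running the evaluation on the same test set would show whether the change actually moves `faithfulness`.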


### 6. Visualizing and Interpreting Results

The `results_df` DataFrame contains the scores for each metric for every question-answer pair. We can calculate average scores to get an overall sense of performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

if not results_df.empty:
    print("Overall Average Scores:")
    # Calculate mean scores, ensuring we only average numeric columns (the metrics)
    metric_columns = [m.name for m in metrics] # Get metric names
    average_scores = results_df[metric_columns].mean().reset_index()
    average_scores.columns = ['Metric', 'Average Score']
    print(average_scores)

    # Plotting the average scores
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Average Score', y='Metric', data=average_scores, palette='viridis')
    plt.title('Average Ragas Evaluation Scores')
    plt.xlabel('Average Score (0 to 1)')
    plt.ylabel('Metric')
    plt.xlim(0, 1) # Scores are typically between 0 and 1
    plt.show()
else:
    print("No evaluation results to display.")

**Interpreting the Scores:**

- **High `faithfulness` (e.g., >0.8):** Good. The answers are factually consistent with the retrieved context.
- **Low `faithfulness`:** Bad. The model might be hallucinating or making up information not present in the context. *Potential Fixes: Improve prompts, use a more capable LLM for generation, ensure retrieved context is sufficient.*

- **High `answer_relevancy` (e.g., >0.8):** Good. The answers are relevant to the questions.
- **Low `answer_relevancy`:** Bad. The answers might be off-topic or not address the user's intent. *Potential Fixes: Better prompts, ensure retrieved context is highly relevant to the question.*

- **High `context_precision` (e.g., >0.8):** Good. Most of the retrieved context is relevant to the question.
- **Low `context_precision`:** Bad. The retriever is fetching a lot of irrelevant information, which can confuse the generator. *Potential Fixes: Tune retriever (e.g., `k` value for number of docs), improve chunking, use better embedding model, refine search query formulation.*

- **High `context_recall` (e.g., >0.8):** Good. The retriever is finding all the necessary pieces of information from the knowledge base.
- **Low `context_recall`:** Bad. The retriever is missing important information, leading to incomplete answers. *Potential Fixes: Improve document coverage, better chunking strategy, ensure all relevant information is indexed, use different retrieval strategies (e.g., hybrid search).*
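
The `k` value mentioned above is a good example of why precision and recall need to be watched together: retrieving more chunks tends to raise recall while diluting precision. The sketch below uses plain set-based precision and recall on made-up relevance labels. It is not the LLM-judged Ragas computation, only an illustration of the trade-off.

```python
def precision_recall_at_k(ranked_relevance: list[int], total_relevant: int, k: int) -> tuple[float, float]:
    """Plain precision and recall over the top-k results.

    ranked_relevance[i] is 1 if the chunk at rank i+1 is relevant, else 0.
    total_relevant is the number of relevant chunks in the whole collection.
    """
    top = ranked_relevance[:k]
    hits = sum(top)
    precision = hits / len(top) if top else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Made-up ranking: 3 relevant chunks exist, found at ranks 1, 2 and 4
ranked = [1, 1, 0, 1, 0, 0]
for k in (2, 4, 6):
    precision, recall = precision_recall_at_k(ranked, total_relevant=3, k=k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

With a LangChain vector store, `k` is typically set when creating the retriever, for example `db.as_retriever(search_kwargs={"k": 4})`.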

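The ideal scores depend on the specific application, and the 0.8 thresholds above are rules of thumb rather than cutoffs defined by Ragas. For critical applications, you'd aim for very high scores across the board. Keep in mind that averages over a test set as small as our 10 questions are noisy, so treat small differences with caution.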

### 7. Using Evaluation Results for Improvement (The Test-First Loop)

This is where the test-first framework shines. The evaluation results are not just a report card; they are a diagnostic tool.
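
Using the results diagnostically usually means looking past the averages at the individual questions that scored worst. A minimal sketch, assuming the per-question results are available as a list of dictionaries. `lowest_scoring` is a hypothetical helper, and the rows are made up.

```python
import math


def lowest_scoring(rows: list[dict], metric: str, n: int = 3) -> list[dict]:
    """Return the n rows with the lowest value for `metric`, skipping rows without a numeric score."""
    scored = [
        row for row in rows
        if isinstance(row.get(metric), (int, float)) and not math.isnan(row[metric])
    ]
    return sorted(scored, key=lambda row: row[metric])[:n]


# Made-up per-question results
rows = [
    {"question": "Q1", "context_recall": 0.9},
    {"question": "Q2", "context_recall": 0.2},
    {"question": "Q3", "context_recall": float("nan")},  # scoring failed for this row
    {"question": "Q4", "context_recall": 0.5},
]
for row in lowest_scoring(rows, "context_recall", n=2):
    print(row["question"], row["context_recall"])
```

In the notebook, `lowest_scoring(results_df.to_dict("records"), "context_recall")` would surface the questions whose retrieved contexts are worth reading by hand.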

**Example Iteration Cycle:**

1.  **Analyze Results:** Suppose our initial `context_recall` is low (e.g., 0.6). This means our retriever isn't finding all the relevant information.
2.  **Formulate Hypothesis:** Perhaps our document chunks are too large, causing specific details to be buried. Or maybe the embedding model isn't capturing the nuances of our domain well.
3.  **Implement Change:** 
    *   Try a smaller `chunk_size` in `RecursiveCharacterTextSplitter` (as explored in Notebook 3).
    *   Experiment with a different embedding model (e.g., switch from `FastEmbedEmbeddings` to `OpenAIEmbeddings` or a domain-specific Hugging Face model for the main RAG pipeline, not just Ragas eval).
4.  **Re-Evaluate:** Run the Ragas evaluation again with the *exact same test set*.
5.  **Compare:** Did `context_recall` improve? Did other metrics change (sometimes improving one metric can slightly degrade another)?
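
The comparison step can be made mechanical. The sketch below takes the average scores from two runs on the same test set and labels each metric as improved, regressed, or unchanged. The scores, the `tolerance` value, and the `compare_runs` helper are illustrative assumptions.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, tuple[float, str]]:
    """Return {metric: (delta, verdict)} for metrics present in both runs.

    Changes within +/- tolerance are treated as noise and reported as unchanged.
    """
    report = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        delta = round(candidate[metric] - baseline[metric], 4)
        if delta > tolerance:
            verdict = "improved"
        elif delta < -tolerance:
            verdict = "regressed"
        else:
            verdict = "unchanged"
        report[metric] = (delta, verdict)
    return report


# Made-up averages: before and after reducing the chunk size
baseline = {"context_recall": 0.60, "context_precision": 0.80, "faithfulness": 0.85}
candidate = {"context_recall": 0.75, "context_precision": 0.74, "faithfulness": 0.86}

for metric, (delta, verdict) in compare_runs(baseline, candidate).items():
    print(f"{metric}: {delta:+.2f} ({verdict})")
```

In this made-up example recall improves while precision regresses, which is exactly the kind of trade-off the comparison step is meant to expose.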

This iterative process of `Evaluate -> Hypothesize -> Change -> Re-evaluate` is central to systematically improving your RAG system. In the upcoming notebooks (like Notebook 4: Connecting Chat to Vectors and beyond), we will be making changes to our RAG pipeline. This evaluation framework will be invaluable for measuring the impact of those changes.

### 8. Conclusion and Next Steps

In this notebook, we've laid the groundwork for a test-first approach to RAG development by:
- Understanding the critical role of evaluation.
- Introducing Ragas as a powerful tool for RAG assessment.
- Generating a synthetic test dataset.
- Applying key Ragas metrics to evaluate our IRISVector retriever and an LLM generator.

This framework provides us with a quantitative way to measure the performance of our RAG system. As we proceed to build more sophisticated chat applications and tune our retrieval and generation strategies in subsequent notebooks, we can continuously refer back to these evaluation techniques to guide our development and ensure we are making tangible improvements.

**Next:** In Notebook 4, "Connecting Chat to Vectors," we will focus on building a more interactive chat application. The evaluation methods learned here will be crucial for assessing how well that chat application performs in terms of retrieval accuracy and response quality.