## 5. Creating a Test-First Framework for RAG Evaluation

In this section, we'll implement a comprehensive evaluation framework using DeepEval, a modern evaluation library specifically designed for LLM applications and RAG systems.

### Why DeepEval for RAG Evaluation?

DeepEval provides several advantages:

1. **RAG-Specific Metrics:** Built-in metrics for answer relevancy, faithfulness, and contextual recall
2. **Synthetic Data Generation:** Automatically generate test cases from your knowledge base
3. **LLM-as-a-Judge:** Uses advanced LLMs to evaluate responses intelligently
4. **Easy Integration:** Simple API that works well with existing RAG pipelines

While this evaluation is not tightly integrated with the app you built in Sections 1-4, it evaluates a retrieval mechanism from the same data set and could be repurposed later in your own use cases for experimentation.

Run the module below to import the required libraries, create the InterSystems IRIS connection, and initialize the data retrieval mechanism.

In [None]:
# Import required libraries
import os
import pandas as pd
from dotenv import load_dotenv

# DeepEval imports
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualRelevancyMetric,
    ContextualRecallMetric
)
from deepeval.test_case import LLMTestCase
from deepeval.synthesizer import Synthesizer

# Langchain imports for our RAG system
from langchain.docstore.document import Document
from langchain.document_loaders import JSONLoader
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_iris import IRISVector
from langchain_openai import ChatOpenAI

# Database connection details
username = '_SYSTEM'
password = 'SYS'
hostname = 'IRIS'
port = 1972
namespace = 'IRISAPP'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
COLLECTION_NAME = "case_reports"

# Initialize components
embeddings = FastEmbedEmbeddings()
db = IRISVector(
    embedding_function=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING
)
retriever = db.as_retriever()

print(f"Retriever initialized: {retriever}")
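The cell above hard-codes the workshop credentials, which is fine for this sandbox. For your own use cases, a minimal sketch of building the same connection string from environment variables might look like the following. The `IRIS_*` variable names and the `build_iris_connection_string` helper are assumptions made for illustration; the defaults simply mirror the values used above.

```python
import os
from urllib.parse import quote_plus


def build_iris_connection_string(env=os.environ) -> str:
    """Assemble an iris:// URL, falling back to the workshop defaults."""
    # The IRIS_* variable names are hypothetical; the defaults mirror the cell above.
    username = env.get("IRIS_USERNAME", "_SYSTEM")
    password = env.get("IRIS_PASSWORD", "SYS")
    hostname = env.get("IRIS_HOSTNAME", "IRIS")
    port = int(env.get("IRIS_PORT", "1972"))
    namespace = env.get("IRIS_NAMESPACE", "IRISAPP")
    # quote_plus keeps characters such as '@' or ':' in a password from breaking the URL.
    return f"iris://{quote_plus(username)}:{quote_plus(password)}@{hostname}:{port}/{namespace}"


# With no overrides, this reproduces the workshop connection string.
print(build_iris_connection_string({}))
```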

Next, run the module below to load an OpenAI API key that has been prepared for this section, as well as set the OpenAI model that this section will use.

In [None]:
# Load environment variables and check API key
load_dotenv(override=True)

# For this segment we will switch to a second API key, if one has been provided
from utils import LLM_MODEL
if os.getenv('OPENAI_API_KEY2'):
    os.environ['OPENAI_API_KEY'] = os.environ['OPENAI_API_KEY2']

if not os.getenv("OPENAI_API_KEY"):
    print("⚠️ Warning: OPENAI_API_KEY not found. Please set your OpenAI API key.")
else:
    print("✅ OpenAI API key found. DeepEval is ready to use.")

# Initialize the LLM
llm = ChatOpenAI(model=LLM_MODEL, temperature=0)

### Step 1: Replicate Your App's RAG Pipeline
Now let's replicate the exact RAG pipeline from the chat application you have built during this workshop. We'll use this RAG pipeline to generate Q&A pairs for our data set. First, we'll create a pre-baked question and use RAG to retrieve the answer from our case reports data.

Inspect the block of code below, then run it to create the first Q&A pair.

In [None]:
# Import our shared RAG module
from rag_module import WorkshopRAG

# Initialize the RAG system (same as used in chat app)
print("🔧 Initializing shared RAG system...")
rag_system = WorkshopRAG(
    collection_name="case_reports",
    llm_model=LLM_MODEL,
    temperature=0.0
)

def workshop_chat_rag_pipeline(question: str) -> tuple[str, list[str]]:
    """
    RAG pipeline that uses the exact same module as the chat application.
    This ensures we're evaluating the identical system users interact with.
    """
    return rag_system.query(question)

# Test the pipeline
test_question = "What are common symptoms of knee problems in adult patients?"
test_answer, test_contexts = workshop_chat_rag_pipeline(test_question)

print(f"Question: {test_question}")
print(f"Answer: {test_answer}")
print(f"Retrieved {len(test_contexts)} contexts")
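The only contract the rest of this section relies on is that the pipeline takes a question and returns an `(answer, contexts)` tuple. If you want to exercise the evaluation plumbing without a database or an API key, a rough stand-in can honor the same contract. The `StubRAG` class below is hypothetical and is not part of `rag_module`; it ranks documents by shared words instead of embeddings and does not call an LLM.

```python
import re


class StubRAG:
    """Hypothetical offline stand-in exposing query(question) -> (answer, contexts)."""

    def __init__(self, documents: list[str], top_k: int = 2):
        self.documents = documents
        self.top_k = top_k

    @staticmethod
    def _tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    def query(self, question: str) -> tuple[str, list[str]]:
        question_tokens = self._tokens(question)
        # Rank documents by shared words; a real retriever would use embeddings.
        ranked = sorted(
            self.documents,
            key=lambda doc: len(question_tokens & self._tokens(doc)),
            reverse=True,
        )
        contexts = ranked[: self.top_k]
        # No LLM here: the "answer" is simply the best-matching document.
        answer = contexts[0] if contexts else ""
        return answer, contexts


stub = StubRAG([
    "Knee pain and swelling were reported after a fall.",
    "The patient presented with abdominal pain and nausea.",
    "A hip fracture was treated with surgical fixation.",
])
answer, contexts = stub.query("What knee symptoms were reported?")
print(f"Answer: {answer}")
print(f"Retrieved {len(contexts)} contexts")
```

Because the return shape matches, a stub like this could be swapped in wherever `workshop_chat_rag_pipeline` is called while you debug the surrounding code.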

### Step 2: Load Chunked Data for Synthetic Test Generation
Next, let's create three manual Q&A pairs and load them into a data structure. Read through the Q&A pairs below; for your own use cases, you might leverage domain experts to curate relevant Q&A pairs that can serve as "gold-standard" examples of the results your retrieval mechanism should yield.

Run the block below to load these three manual pairs.

In [None]:
# Create manual test cases for evaluation
manual_test_cases = [
    {
        "input": "What are common symptoms of knee problems in adult patients?",
        "expected_output": "Common symptoms include pain, swelling, limited range of motion, and difficulty with weight-bearing activities.",
    },
    {
        "input": "How are fractures typically treated in elderly patients?",
        "expected_output": "Treatment often involves surgical fixation, pain management, and careful consideration of the patient's overall health status.",
    },
    {
        "input": "What diagnostic methods are used for abdominal pain?",
        "expected_output": "Common diagnostic methods include physical examination, CT scans, ultrasound, and laboratory tests.",
    }
]

print(f"Created {len(manual_test_cases)} test cases for evaluation")
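Hand-curated pairs are easy to get subtly wrong (an empty field, a question pasted twice). As a sketch, a small check like the one below could be run before spending API calls on evaluation. The `validate_test_cases` helper is an illustrative assumption, not something DeepEval provides.

```python
def validate_test_cases(cases: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means every case looks usable."""
    problems = []
    seen_inputs = set()
    for index, case in enumerate(cases):
        # Both fields must be non-empty strings.
        for key in ("input", "expected_output"):
            value = case.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"case {index}: missing or empty '{key}'")
        # Flag questions that repeat an earlier one, ignoring case and outer whitespace.
        question = case.get("input")
        normalized = question.strip().lower() if isinstance(question, str) else ""
        if normalized:
            if normalized in seen_inputs:
                problems.append(f"case {index}: duplicate input")
            seen_inputs.add(normalized)
    return problems


sample_cases = [
    {"input": "What diagnostic methods are used for abdominal pain?",
     "expected_output": "Physical examination, CT scans, ultrasound, and laboratory tests."},
    {"input": "  ", "expected_output": "An answer without a question."},
]
print(validate_test_cases(sample_cases))
```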

Now let's augment our three manual Q&A pairs with some generated Q&A pairs. To do this, we'll retrieve 20 chunks from our data set to use as the basis for creating three more test questions and answers.

Run the block below to load chunks from our data set.

In [None]:
# Load chunked documents from IRIS database for synthetic data generation
# Using the chunked data that's already in our database (SQLUser."case_reports-chunked")
from sqlalchemy import create_engine, text

# Connect to IRIS and get chunked documents
engine = create_engine(CONNECTION_STRING)

# Query the chunked data table
query = """
SELECT TOP 20 document, metadata 
FROM SQLUser.\"case_reports-chunked\" 
WHERE LENGTH(document) > 100
ORDER BY id
"""

try:
    with engine.connect() as conn:
        result = conn.execute(text(query))
        chunked_data = result.fetchall()

    # Convert to documents format for DeepEval
    chunked_documents = []
    for document, metadata in chunked_data:
        doc = Document(
            page_content=document,
            metadata={"source_metadata": metadata}
        )
        chunked_documents.append(doc)

    print(f"✅ Loaded {len(chunked_documents)} chunked documents from IRIS database")
    if chunked_documents:
        print(f"Sample chunk preview: {chunked_documents[0].page_content[:200]}...")
        print(f"Metadata: {str(chunked_documents[0].metadata['source_metadata'])[:100]}...")
        
except Exception as e:
    print(f"❌ Error loading chunked data: {e}")
    print("Will use manual test cases instead.")
    chunked_documents = []
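The SQL query above keeps only chunks longer than 100 characters and caps the result at 20 rows. If your chunks come from somewhere other than SQL, the same selection can be sketched in plain Python. The `select_chunks` helper is hypothetical, and the de-duplication step is an addition that the query itself does not perform.

```python
def select_chunks(chunks: list[str], min_chars: int = 100, limit: int = 20) -> list[str]:
    """Keep the first `limit` unique chunks longer than `min_chars` characters."""
    selected = []
    seen = set()
    for chunk in chunks:
        # Collapse whitespace so rows that differ only in spacing compare equal.
        text = " ".join(chunk.split())
        if len(text) <= min_chars or text in seen:
            continue
        seen.add(text)
        selected.append(text)
        if len(selected) == limit:
            break
    return selected


rows = ["short note", "A" * 150, "A" * 150, "B" * 120]
print([len(chunk) for chunk in select_chunks(rows)])
```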

### Step 3: Generate Synthetic Test Cases
Instead of writing every test question by hand, we can use AI to automatically generate realistic test cases from our actual medical data. Here's how it works:

1. **Input**: Real medical case chunks from our IRIS database
2. **Process**: DeepEval's Synthesizer uses an LLM (here, the model set in `LLM_MODEL`) to create questions that could realistically be asked about this data
3. **Output**: Question-answer pairs with expected responses

**Why is this useful?**
- Creates test cases that match your actual data
- Saves time compared to manual test creation
- Generates diverse question types you might not think of
- Ensures evaluation covers real scenarios your users will encounter

Let's see this in action:

In [None]:
# Try to generate synthetic test cases using DeepEval's Synthesizer
synthetic_test_cases = []

if chunked_documents and os.getenv("OPENAI_API_KEY"):
    try:
        print(f"\n📊 STEP 2: Preparing our medical data for AI analysis...\n   → We have {len(chunked_documents)} chunks of medical case data")
        
        # Initialize the DeepEval synthesizer
        print(f"   → AI synthesizer ready (powered by {LLM_MODEL})")
        synthesizer = Synthesizer(model=LLM_MODEL)
        
        # Select chunks for test question generation
        print("\n🧠 STEP 3: AI is analyzing medical data and creating test questions...")
        print(f"   (This may take 10-30 seconds as {LLM_MODEL} reads and understands the medical content)")
        contexts_for_synthesis = [[doc.page_content] for doc in chunked_documents[:3]]
        # Use first 3 chunks for synthesis (to manage API costs)
        print(f"   → Selected {len(contexts_for_synthesis)} chunks for test generation (to manage costs)")
        
        # Generate synthetic test cases using correct format: List[List[str]]
        synthetic_test_cases = synthesizer.generate_goldens_from_contexts(
            contexts=contexts_for_synthesis,  # List of lists of strings
            max_goldens_per_context=1  # Generate 1 test case per context
        )
        
        print(f"\n✅ SUCCESS: Generated {len(synthetic_test_cases)} synthetic test cases!")
        print("   → Each test case contains: Question + Expected Answer + Source Context")
        
        # 📋 STEP 4: Show what the AI created
        if synthetic_test_cases:
            print("\n📋 STEP 4: Let's examine what the AI generated...")
            sample = synthetic_test_cases[0]
            print("\n" + "="*60)
            print("🔍 EXAMPLE SYNTHETIC TEST CASE:")
            print("="*60)
            print(f"❓ QUESTION: {sample.input}")
            print(f"\n💡 EXPECTED ANSWER: {sample.expected_output}")
            # sample.context is a list of strings, so join it before truncating for display
            print(f"\n📄 SOURCE CONTEXT: {' '.join(sample.context or [])[:150]}...")
            print("="*60)
            print("\n💭 Notice how the AI:")
            print("   • Created a realistic medical question from the data")
            print("   • Generated an appropriate expected answer")
            print("   • Linked it to the specific source context")
            print("   • This mimics real user questions about medical cases!")
            
    except Exception as e:
        print(f"❌ Error generating synthetic test cases: {e}")
        print("This might be due to API rate limits. Using manual test cases instead.")
        synthetic_test_cases = []
else:
    print("⚠️ Skipping synthetic generation (no chunked data or API key). Using manual test cases.")
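The synthesis cell above passes contexts as a list of lists of strings, one inner list per context, and uses only the first three chunks to manage cost. A minimal sketch of a helper that produces that shape, and optionally groups several chunks into one context, is shown below. `build_contexts` and its parameter names are assumptions made for illustration.

```python
def build_contexts(
    chunks: list[str],
    chunks_per_context: int = 1,
    max_contexts: int = 3,
) -> list[list[str]]:
    """Group chunk texts into the list-of-lists shape used for synthesis."""
    if chunks_per_context < 1:
        raise ValueError("chunks_per_context must be at least 1")
    # Slice the chunks into consecutive groups of the requested size.
    groups = [
        chunks[start:start + chunks_per_context]
        for start in range(0, len(chunks), chunks_per_context)
    ]
    # Cap the number of contexts to keep generation cost predictable.
    return groups[:max_contexts]


chunk_texts = ["chunk a", "chunk b", "chunk c", "chunk d", "chunk e"]
print(build_contexts(chunk_texts))                        # one chunk per context
print(build_contexts(chunk_texts, chunks_per_context=2))  # two chunks per context
```

Grouping neighboring chunks gives the question generator more surrounding detail to draw on, at the price of longer prompts.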

Run the block below to combine our manual Q&A pairs and the synthetically generated ones into a single data structure.

In [None]:
# 🔄 STEP 1: Create our master test suite
print("🔄 Combining AI-generated and manual test cases...")
all_test_cases = []

# Add synthetic test cases if available
if synthetic_test_cases:
    print(f"   → Adding {len(synthetic_test_cases)} AI-generated test cases")
    for case in synthetic_test_cases:
        all_test_cases.append({
            "input": case.input,
            "expected_output": case.expected_output,
            "source": "synthetic"  # Mark as AI-generated
        })
else:
    print("   → No synthetic test cases available (using manual only)")

# 📝 STEP 2: Add manually crafted test cases
print("\n📝 Adding manually crafted test cases...")
manual_test_cases = [
    {
        "input": "What are common symptoms of knee problems in adult patients?",
        "expected_output": "Common symptoms include pain, swelling, limited range of motion, and difficulty with weight-bearing activities.",
        "source": "manual"
    },
    {
        "input": "How are fractures typically treated in elderly patients?",
        "expected_output": "Treatment often involves surgical fixation, pain management, and careful consideration of the patient's overall health status.",
        "source": "manual"
    },
    {
        "input": "What diagnostic methods are used for abdominal pain?",
        "expected_output": "Common diagnostic methods include physical examination, CT scans, ultrasound, and laboratory tests.",
        "source": "manual"
    }
]

all_test_cases.extend(manual_test_cases)

print(f"\n📋 Total test cases: {len(all_test_cases)}")
synthetic_count = len([tc for tc in all_test_cases if tc['source'] == 'synthetic'])
manual_count = len([tc for tc in all_test_cases if tc['source'] == 'manual'])
print(f"  - Synthetic: {synthetic_count}")
print(f"  - Manual: {manual_count}")
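The cell above retypes the manual cases in order to add a `"source"` field. As an alternative sketch, a small helper could tag an existing list instead, so that the questions are defined only once. The `tag_cases` function is hypothetical.

```python
def tag_cases(cases: list[dict], source: str) -> list[dict]:
    """Return copies of the cases with a 'source' label, leaving the originals untouched."""
    return [{**case, "source": source} for case in cases]


untagged = [
    {"input": "How are fractures typically treated in elderly patients?",
     "expected_output": "Treatment often involves surgical fixation and pain management."},
]
tagged = tag_cases(untagged, "manual")
print(tagged[0]["source"])
print("source" in untagged[0])
```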

### Step 4: Run RAG Pipeline on Test Cases

Now we'll test our chat application's RAG pipeline with each test case. This step:

1. **Feeds each question** to our workshop chat RAG pipeline
2. **Collects the actual answers** generated by our system
3. **Captures the retrieved contexts** used for each answer
4. **Prepares data** for DeepEval's evaluation metrics

**What we're testing:** The exact same RAG system users interact with in the chat application!

In [None]:
# 🧪 STEP 1: Run our chat RAG pipeline on each test case
print("🧪 Testing our chat application with all test cases...")
print("   (This will take a few moments as we query the LLM for each test case)")
evaluation_results = []

for i, test_case in enumerate(all_test_cases):
    print(f"\n📝 Processing test case {i+1}/{len(all_test_cases)}...")
    print(f"   Question: {test_case['input'][:80]}...")
    
    question = test_case["input"]
    expected_answer = test_case["expected_output"]
    
    try:
        actual_answer, retrieved_contexts = workshop_chat_rag_pipeline(question)
        
        evaluation_results.append({
            "question": question,
            "expected_answer": expected_answer,
            "actual_answer": actual_answer,
            "retrieved_contexts": retrieved_contexts
        })
        
    except Exception as e:
        print(f"Error processing test case {i+1}: {e}")
        continue

print(f"\n✅ SUCCESS: Processed {len(evaluation_results)} test cases!")
print("   → Each test case now has: Question + Expected Answer + Actual Answer + Retrieved Context")
print("   → Ready for DeepEval metrics evaluation!")
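The loop above skips a test case as soon as the pipeline raises, so a transient failure such as a rate limit silently shrinks the test suite. One possible mitigation, sketched under the assumption that failures are temporary, is to retry with exponential backoff. `call_with_retry` and `flaky_pipeline` are illustrative names, and the injected `sleep` argument exists only so the demo runs instantly.

```python
import time


def call_with_retry(func, *args, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff when it raises."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))  # wait 1s, 2s, 4s, ...


calls = {"count": 0}


def flaky_pipeline(question: str):
    """Simulated pipeline that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "answer", ["context"]


delays = []
result = call_with_retry(flaky_pipeline, "What is X?", sleep=delays.append)
print(result, delays)
```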

### Step 5: Evaluate with DeepEval Metrics
We've now run each test question (up to six, depending on whether synthetic generation succeeded) through the RAG pipeline and captured both the generated answer and the retrieved contexts. Next, we can use DeepEval to score the retrieval and the answers. In the block below, we'll initialize the DeepEval metrics we will measure and wrap each result in an `LLMTestCase`.

These DeepEval metrics are designed to evaluate the quality of responses generated by a language model, particularly in retrieval-augmented generation (RAG) systems.

- **Answer Relevancy** measures how well the generated answer addresses the original question.
- **Faithfulness** assesses whether the answer accurately reflects the retrieved context, ensuring it doesn’t hallucinate or introduce unsupported information.
- **Contextual Relevancy** checks how relevant the retrieved context is to the question.
- **Contextual Recall** evaluates whether the retrieved context contains the key pieces of information needed to produce the expected answer.

Together, these metrics help ensure that the model's answers are accurate, grounded, and contextually appropriate. Run the block below to initialize these.

In [None]:
# Initialize DeepEval metrics
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7, model=LLM_MODEL)
faithfulness_metric = FaithfulnessMetric(threshold=0.7, model=LLM_MODEL)
contextual_relevancy_metric = ContextualRelevancyMetric(threshold=0.7, model=LLM_MODEL)
contextual_recall_metric = ContextualRecallMetric(threshold=0.7, model=LLM_MODEL)

# Create LLMTestCase objects for DeepEval
test_cases_for_evaluation = []

for result in evaluation_results:
    test_case = LLMTestCase(
        input=result["question"],
        actual_output=result["actual_answer"],
        expected_output=result["expected_answer"],
        retrieval_context=result["retrieved_contexts"]
    )
    test_cases_for_evaluation.append(test_case)

print(f"Created {len(test_cases_for_evaluation)} test cases for DeepEval evaluation")
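To build some intuition for what a recall-style metric rewards, here is a deliberately crude proxy: the fraction of sentences in the expected answer whose words mostly appear in the retrieved context. This is only a sketch. DeepEval's Contextual Recall relies on an LLM judge rather than word overlap, and the `toy_contextual_recall` function and its `min_overlap` parameter are assumptions made for illustration.

```python
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_contextual_recall(
    expected_output: str,
    retrieval_context: list[str],
    min_overlap: float = 0.5,
) -> float:
    """Fraction of expected-answer sentences whose words mostly appear in the context."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", expected_output.strip()) if s]
    if not sentences:
        return 0.0
    context_words = _words(" ".join(retrieval_context))
    supported = 0
    for sentence in sentences:
        words = _words(sentence)
        # A sentence counts as supported when enough of its words occur in the context.
        if words and len(words & context_words) / len(words) >= min_overlap:
            supported += 1
    return supported / len(sentences)


expected = "Knee pain and swelling are common. Surgery is rarely needed."
context = ["The patient reported knee pain and swelling after exercise."]
print(toy_contextual_recall(expected, context))  # 0.5: only the first sentence is supported
```

In this toy example, a low score points at the retriever: the context never mentions surgery, so no generator could ground that part of the expected answer.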

Now it's time to run the evaluation. This process may take several minutes; after running the block of code, feel free to grab a cup of coffee while these six test cases are evaluated for our desired metrics using DeepEval.

In [None]:
# Run evaluation with DeepEval
print("Running DeepEval evaluation...")

try:
    evaluation_scores = {
        "answer_relevancy": [],
        "faithfulness": [],
        "contextual_relevancy": [],
        "contextual_recall": []
    }
    
    for i, test_case in enumerate(test_cases_for_evaluation):
        print(f"Evaluating test case {i+1}/{len(test_cases_for_evaluation)}...")
        
        # Evaluate each metric
        answer_relevancy_metric.measure(test_case)
        evaluation_scores["answer_relevancy"].append(answer_relevancy_metric.score)
        
        faithfulness_metric.measure(test_case)
        evaluation_scores["faithfulness"].append(faithfulness_metric.score)
        
        contextual_relevancy_metric.measure(test_case)
        evaluation_scores["contextual_relevancy"].append(contextual_relevancy_metric.score)
        
        contextual_recall_metric.measure(test_case)
        evaluation_scores["contextual_recall"].append(contextual_recall_metric.score)
    
    print("✅ Evaluation completed successfully!")
    
except Exception as e:
    print(f"❌ Error during evaluation: {e}")
    # Create dummy scores for demonstration
    evaluation_scores = {
        "answer_relevancy": [0.8, 0.7, 0.9],
        "faithfulness": [0.85, 0.75, 0.8],
        "contextual_relevancy": [0.7, 0.8, 0.85],
        "contextual_recall": [0.75, 0.7, 0.8]
    }
    print("Using dummy scores for demonstration.")
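Each metric above was created with `threshold=0.7`, but the loop only stores raw scores. A small sketch of turning those scores into per-metric pass rates follows, treating a score at or above the threshold as a pass. The `pass_rates` helper is hypothetical and the numbers are invented for illustration, not results from this workshop.

```python
def pass_rates(scores_by_metric: dict[str, list[float]], threshold: float = 0.7) -> dict[str, float]:
    """Share of test cases per metric whose score meets the threshold."""
    rates = {}
    for metric, scores in scores_by_metric.items():
        valid = [score for score in scores if score is not None]
        passed = sum(score >= threshold for score in valid)
        rates[metric] = passed / len(valid) if valid else 0.0
    return rates


# Illustrative numbers only.
illustrative_scores = {
    "answer_relevancy": [0.9, 0.6, 0.8, 0.75],
    "contextual_relevancy": [0.4, 0.7, 0.2, 0.5],
}
print(pass_rates(illustrative_scores))
```

A pass rate complements the average: a metric can average above the threshold while still failing on a meaningful share of individual questions.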

### Step 6: Analyze and Visualize Results
In addition to calculating the evaluation scores, it can be very helpful to visualize the results to quickly spot strengths and weaknesses in your RAG system's performance. Graphs like bar charts and radar plots make it easier to compare metrics side by side, highlight areas that may need improvement (such as low faithfulness or contextual recall), and communicate findings more effectively to others. Before diving into numerical details, visualizations offer an intuitive overview that supports more informed analysis and debugging.

Run the block below to visualize your evaluation results. This step also may take a few moments to complete.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Calculate average scores
avg_scores = {}
for metric, scores in evaluation_scores.items():
    avg_scores[metric] = np.mean(scores) if scores else 0

print("📊 RAG System Evaluation Results:")
print("=" * 40)
for metric, avg_score in avg_scores.items():
    print(f"{metric.replace('_', ' ').title()}: {avg_score:.3f}")

# Create visualization
fig = plt.figure(figsize=(15, 6))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2, projection='polar')  # the radar chart needs polar axes
fig.suptitle('RAG System Evaluation Results', fontsize=16, fontweight='bold')

# Bar chart
metrics = list(avg_scores.keys())
scores = list(avg_scores.values())
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']

bars = ax1.bar(metrics, scores, color=colors)
ax1.set_title('Average Evaluation Scores')
ax1.set_ylabel('Score')
ax1.set_ylim(0, 1)
ax1.tick_params(axis='x', rotation=45)

for bar, score in zip(bars, scores):
    height = bar.get_height()
    ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

# Radar chart
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False)
scores_radar = list(avg_scores.values())
scores_radar += scores_radar[:1]
angles = np.concatenate((angles, [angles[0]]))

ax2.plot(angles, scores_radar, 'o-', linewidth=2, color='#FF6B6B')
ax2.fill(angles, scores_radar, alpha=0.25, color='#FF6B6B')
ax2.set_xticks(angles[:-1])
ax2.set_xticklabels([m.replace('_', ' ').title() for m in metrics])
ax2.set_ylim(0, 1)
ax2.set_title('RAG Performance Radar')
ax2.grid(True)

plt.tight_layout()
plt.show()
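Averages hide which individual questions drag a metric down. As a sketch, the per-case scores can be searched for the weakest question on each metric, assuming the score lists are in the same order as the questions. `weakest_cases` is a hypothetical helper, and the questions and scores below are made up for illustration.

```python
def weakest_cases(
    scores_by_metric: dict[str, list[float]],
    questions: list[str],
) -> dict[str, tuple[str, float]]:
    """For each metric, return the question with the lowest score and that score."""
    report = {}
    for metric, scores in scores_by_metric.items():
        if not scores:
            continue
        # Index of the lowest score; assumes scores[i] belongs to questions[i].
        worst_index = min(range(len(scores)), key=lambda i: scores[i])
        report[metric] = (questions[worst_index], scores[worst_index])
    return report


# Illustrative data only.
questions = ["Q1 about knees", "Q2 about fractures", "Q3 about abdominal pain"]
illustrative = {"faithfulness": [0.9, 0.5, 0.8], "contextual_recall": [0.3, 0.6, 0.7]}
for metric, (question, score) in weakest_cases(illustrative, questions).items():
    print(f"{metric}: {score:.2f} on '{question}'")
```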

### Step 7: Summary and Recommendations
To wrap up the evaluation, it’s useful to generate a concise performance summary that highlights your system’s strongest and weakest areas. This overview helps prioritize improvements by identifying which metrics are performing well and which may need more attention. By pairing each score with actionable recommendations, you can start to make targeted adjustments—whether that means refining your retrieval process, improving prompt construction, or tweaking how documents are chunked. The summary below provides both a quick snapshot and practical next steps for improving your RAG system’s overall effectiveness.

What do you notice about these results? Consider why a given metric might be low in this scenario. We'll touch on that in the conclusion, after running the code block below to generate a summary and recommendations.

In [None]:
# Performance summary
print("\n🔍 Performance Analysis:")
print("=" * 30)

best_metric = max(avg_scores, key=avg_scores.get)
worst_metric = min(avg_scores, key=avg_scores.get)
overall_avg = np.mean(list(avg_scores.values()))

print(f"🎯 Best Performing Metric: {best_metric.replace('_', ' ').title()}")
print(f"   Score: {avg_scores[best_metric]:.3f}")
print(f"\n🔧 Needs Improvement: {worst_metric.replace('_', ' ').title()}")
print(f"   Score: {avg_scores[worst_metric]:.3f}")
print(f"\n📊 Overall Average: {overall_avg:.3f}")

print("\n💡 Improvement Recommendations:")
print("• Scores > 0.8: Excellent performance")
print("• Scores 0.7-0.8: Good performance")
print("• Scores < 0.7: Needs improvement")

if avg_scores['answer_relevancy'] < 0.7:
    print("\n🔧 Answer Relevancy Tips:")
    print("  - Improve prompt engineering")
    print("  - Add question classification")

if avg_scores['faithfulness'] < 0.7:
    print("\n🔧 Faithfulness Tips:")
    print("  - Improve retrieval quality")
    print("  - Add explicit context adherence instructions")

if avg_scores['contextual_relevancy'] < 0.7:
    print("\n🔧 Contextual Relevancy Tips:")
    print("  - Optimize embedding model")
    print("  - Tune retrieval parameters")

if avg_scores['contextual_recall'] < 0.7:
    print("\n🔧 Contextual Recall Tips:")
    print("  - Increase number of retrieved documents")
    print("  - Improve document chunking strategy")
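The score bands printed above can also be expressed as a small function, which makes it easy to label each metric individually. This is a sketch: the boundaries follow the printed guidance, with a score of exactly 0.8 treated as "good", and `performance_band` is an illustrative name.

```python
def performance_band(score: float) -> str:
    """Map a score in [0, 1] to the bands used in the summary above."""
    if score > 0.8:
        return "excellent"
    if score >= 0.7:
        return "good"
    return "needs improvement"


# Illustrative averages only.
for metric, score in {"answer_relevancy": 0.85, "contextual_relevancy": 0.42}.items():
    print(f"{metric}: {score:.2f} ({performance_band(score)})")
```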

### Conclusion

You may have noticed that our **Contextual Relevancy** score was quite unimpressive in this scenario, and that the overall results need improvement. This reflects the lightweight demo environment we are using: the data set is small (only 100 case reports are stored) and the OpenAI model is a lightweight one, which is unlikely to be enough to answer generalized questions about patient conditions. Evaluation tools like this help highlight the parts of your application that need attention.

This test-first framework using DeepEval provides:

1. **Objective Measurement**: Quantitative metrics for RAG system performance
2. **Systematic Improvement**: Data-driven insights for optimization
3. **Regression Detection**: Ability to catch performance degradation
4. **Comparative Analysis**: Framework for comparing different approaches

Use this evaluation framework throughout your RAG development process to ensure consistent quality and continuous improvement.
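To make the regression-detection idea concrete, one possible approach is to save the average scores from a known-good run and compare later runs against them. The sketch below assumes a JSON baseline file and a tolerance of 0.05, which allows for some run-to-run noise from the LLM judge. The file name, the tolerance, and the `find_regressions` helper are all assumptions, and the scores are invented for illustration.

```python
import json
import tempfile
from pathlib import Path


def find_regressions(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, float]]:
    """Return {metric: (baseline, current)} for metrics that dropped by more than `tolerance`."""
    regressions = {}
    for metric, score in current.items():
        previous = baseline.get(metric)
        if previous is not None and score < previous - tolerance:
            regressions[metric] = (previous, score)
    return regressions


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical baseline file; in practice it would live alongside your project.
    baseline_path = Path(tmp) / "rag_baseline.json"
    baseline_path.write_text(json.dumps({"faithfulness": 0.85, "contextual_recall": 0.80}))
    baseline = json.loads(baseline_path.read_text())

current = {"faithfulness": 0.82, "contextual_recall": 0.60}  # illustrative numbers
regressions = find_regressions(current, baseline)
print(regressions)
```

A check like this could run after each change to chunking, prompts, or retrieval parameters, failing the build whenever the returned dictionary is not empty.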