# Demo #11: RAG System Evaluation and Metrics

## Learning Objectives

In this demo, you will learn:
1. **Why Evaluation Matters**: Understanding that systematic, quantitative evaluation is non-negotiable for production-quality RAG systems
2. **Key RAG Metrics**: Implementing the five core metrics for evaluating both retrieval and generation components
3. **LLM-as-Judge Pattern**: Using Azure OpenAI to automatically score RAG system outputs
4. **Regression Testing**: Creating benchmark datasets to track performance over time
5. **Comparative Analysis**: Evaluating multiple RAG configurations to identify the best architecture

## Theoretical Foundation

### The Need for Rigorous Evaluation

As stated in the curriculum:
> "Systematic, quantitative evaluation is non-negotiable for building production-quality RAG systems and moving beyond anecdotal 'it works on my questions' testing. A robust evaluation framework involves establishing a benchmark dataset, defining clear metrics, and using automated tools to track performance over time."

### Core RAG Evaluation Metrics

RAG evaluation focuses on assessing both the **retrieval** and **generation** components separately:

| Metric | Component | Description |
|--------|-----------|-------------|
| **Context Relevance (Precision)** | Retrieval | Measures the signal-to-noise ratio of retrieved context. Are the retrieved chunks relevant to the query? |
| **Context Sufficiency (Recall)** | Retrieval | Measures whether the retrieved context contains all information needed to answer the query |
| **Answer Relevance** | Generation | Measures whether the final answer is on-topic and directly addresses the user's query |
| **Faithfulness (Hallucination Detection)** | Generation | Measures whether the answer is factually grounded in the provided context |
| **Answer Correctness** | Generation | Measures factual accuracy against a ground truth answer |
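
When the documents that should be retrieved are known in advance, retrieval precision and recall can also be computed directly, without an LLM judge. The following is a minimal sketch under that assumption: `doc_precision_recall` is a hypothetical helper, the file names are illustrative, and it assumes each retrieved chunk can be mapped back to its source file name. It works at the document level, so several chunks from the same file count once.

```python
from typing import List, Tuple


def doc_precision_recall(
    retrieved_docs: List[str], reference_docs: List[str]
) -> Tuple[float, float]:
    """Document-level precision and recall over file names (hypothetical helper)."""
    retrieved = set(retrieved_docs)   # Deduplicate: many chunks may share one file
    reference = set(reference_docs)
    hits = retrieved & reference
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(reference) if reference else 0.0
    return precision, recall


# Three retrieved chunks come from two files; only one of them is a reference document.
precision, recall = doc_precision_recall(
    ["bert_model.md", "bert_model.md", "docker_containers.md"],
    ["bert_model.md", "gpt4_model.md"],
)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # prints precision=0.50, recall=0.50
```

A name-based check like this is cheap and deterministic, but it only says whether the right files were touched, not whether the right passages were retrieved, so it complements the judged metrics rather than replacing them.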

### The "MLOps-ification" of RAG

The maturation of RAG development has led to its **"MLOps-ification"**, where building a RAG system now demands the same discipline as any other machine learning system:
- **Automated testing** with versioned test datasets
- **Continuous monitoring** for drift detection
- **Metric-driven development** with regression tracking
- **A/B testing** of different architectural components
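
As one illustration of metric-driven development with regression tracking, aggregate scores from a previous run can be compared against the current run, flagging any metric that dropped by more than a tolerance. This is a sketch only: `find_regressions`, the `tolerance` default, and the numbers are assumptions made for the example, not part of the demo.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.02,
) -> List[str]:
    """Return the metrics whose score dropped by more than `tolerance`."""
    regressions = []
    for metric, old_score in baseline.items():
        new_score = current.get(metric)
        if new_score is None:
            continue  # Metric not measured in this run; nothing to compare
        if old_score - new_score > tolerance:
            regressions.append(metric)
    return regressions


# Illustrative numbers only, not results from this demo.
stored = {"faithfulness": 0.90, "answer_relevance": 0.95}
latest = {"faithfulness": 0.84, "answer_relevance": 0.95}
print(find_regressions(stored, latest))  # prints ['faithfulness']
```

In a CI job, a non-empty result could fail the build, which turns the benchmark dataset into an automated gate rather than a report that someone has to remember to read.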

## Implementation Approach

We will implement:
1. A **gold standard dataset** with hand-crafted queries and ground truth answers
2. **LLM-based evaluators** for each metric using Azure OpenAI
3. An **automated evaluation pipeline** that scores any RAG system
4. A **comparative analysis** between baseline and advanced RAG configurations

---

## Setup: Dependencies and Azure OpenAI Configuration

In [1]:
# Core dependencies
import os
import json
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple
from dataclasses import dataclass
from sklearn.metrics.pairwise import cosine_similarity

# LlamaIndex core
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.node_parser import SentenceSplitter

# Azure OpenAI
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

print("✓ Dependencies imported successfully")

✓ Dependencies imported successfully


In [2]:
# Azure OpenAI configuration with fallbacks:
# try Azure OpenAI first; if its credentials are missing, fall back to OpenAI, then HuggingFace

azure_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
openai_api_key = os.getenv("OPENAI_API_KEY")

if azure_api_key and azure_endpoint:
    print("Using Azure OpenAI...")
    llm = AzureOpenAI(
        model="gpt-4o",
        deployment_name="gpt-4o",  # Your deployment name
        api_key=azure_api_key,
        azure_endpoint=azure_endpoint,
        api_version="2024-12-01-preview",
        temperature=0.0,
        max_tokens=500
    )
    
    embed_model = AzureOpenAIEmbedding(
        model="text-embedding-ada-002",
        deployment_name="text-embedding-ada-002",
        api_key=azure_api_key,
        azure_endpoint=azure_endpoint,
        api_version="2024-12-01-preview",
    )
elif openai_api_key:
    print("Using OpenAI (standard)...")
    from llama_index.llms.openai import OpenAI
    from llama_index.embeddings.openai import OpenAIEmbedding
    
    llm = OpenAI(
        model="gpt-4o",
        api_key=openai_api_key,
        temperature=0.0,
        max_tokens=500
    )
    
    embed_model = OpenAIEmbedding(
        model="text-embedding-ada-002",
        api_key=openai_api_key,
    )
else:
    print("Using HuggingFace embeddings (free, no API key required)...")
    from llama_index.llms.huggingface import HuggingFaceInferenceAPI
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
    
    # Use a free HuggingFace model
    llm = HuggingFaceInferenceAPI(
        model_name="mistralai/Mistral-7B-Instruct-v0.2",
        token=os.getenv("HF_TOKEN"),  # Optional, but recommended for better rate limits
        temperature=0.0,
        max_tokens=500
    )
    
    embed_model = HuggingFaceEmbedding(
        model_name="BAAI/bge-small-en-v1.5"
    )

# Configure global settings
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
Settings.chunk_overlap = 50

print("✓ LLM and embedding model configured successfully")

Using Azure OpenAI...
✓ LLM and embedding model configured successfully


---

## Step 1: Create Gold Standard Evaluation Dataset

A **gold standard dataset** is a curated collection of test queries with:
- **Ground truth answers** (ideal, correct responses)
- **Reference context** (the specific documents that should be retrieved)
- **Query metadata** (difficulty level, query type, etc.)

This dataset serves as a regression test suite to ensure changes to the RAG system don't degrade performance.

In [3]:
@dataclass
class EvaluationExample:
    """A single test case in our gold standard dataset"""
    query: str
    ground_truth_answer: str
    reference_doc_names: List[str]  # Which documents should be retrieved
    query_type: str  # e.g., "factual", "conceptual", "comparison"
    difficulty: str  # "easy", "medium", "hard"

# Gold Standard Test Dataset
# These queries span different difficulty levels and query types
EVALUATION_DATASET = [
    EvaluationExample(
        query="What is the transformer architecture?",
        ground_truth_answer="The transformer architecture is a neural network architecture introduced in the 'Attention is All You Need' paper. It relies entirely on self-attention mechanisms to process sequences in parallel, replacing recurrent layers. Key components include multi-head attention, positional encoding, and feed-forward networks arranged in encoder-decoder stacks.",
        reference_doc_names=["transformer_architecture.md"],
        query_type="conceptual",
        difficulty="easy"
    ),
    EvaluationExample(
        query="How does BERT differ from GPT-4?",
        ground_truth_answer="BERT and GPT-4 differ in their training objectives and architecture. BERT uses bidirectional encoding with masked language modeling and next sentence prediction, making it ideal for understanding tasks. GPT-4 is an autoregressive decoder-only model trained for next-token prediction, optimized for generation tasks. BERT processes context bidirectionally while GPT-4 processes left-to-right.",
        reference_doc_names=["bert_model.md", "gpt4_model.md"],
        query_type="comparison",
        difficulty="medium"
    ),
    EvaluationExample(
        query="Explain how embeddings work in machine learning",
        ground_truth_answer="Embeddings in machine learning are dense vector representations that map discrete objects (like words, sentences, or entities) into continuous vector spaces. They capture semantic relationships where similar items have vectors close together in the embedding space. Embeddings are learned through neural networks and enable models to perform mathematical operations on semantic concepts.",
        reference_doc_names=["embeddings_ml.md"],
        query_type="conceptual",
        difficulty="easy"
    ),
    EvaluationExample(
        query="What are the key components of a REST API and how do they relate to Docker containers?",
        ground_truth_answer="REST APIs consist of resources identified by URLs, HTTP methods (GET, POST, PUT, DELETE), stateless communication, and standardized response formats like JSON. Docker containers provide an ideal deployment environment for REST APIs by packaging the API application with all its dependencies into isolated, portable containers. This ensures consistent behavior across development and production environments.",
        reference_doc_names=["rest_api.md", "docker_containers.md"],
        query_type="multi-hop",
        difficulty="hard"
    ),
    EvaluationExample(
        query="What is Docker?",
        ground_truth_answer="Docker is a platform for developing, shipping, and running applications in containers. Containers are lightweight, standalone packages that include everything needed to run software: code, runtime, system tools, libraries, and settings. Docker ensures applications run consistently across different computing environments.",
        reference_doc_names=["docker_containers.md"],
        query_type="factual",
        difficulty="easy"
    ),
]

print(f"✓ Created gold standard dataset with {len(EVALUATION_DATASET)} test cases")
print("\nDataset Breakdown:")
print(f"  - Query types: {set(ex.query_type for ex in EVALUATION_DATASET)}")
print(f"  - Difficulty levels: {set(ex.difficulty for ex in EVALUATION_DATASET)}")

✓ Created gold standard dataset with 5 test cases

Dataset Breakdown:
  - Query types: {'factual', 'multi-hop', 'conceptual', 'comparison'}
  - Difficulty levels: {'easy', 'hard', 'medium'}
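
For the dataset to work as a regression suite, it helps to keep it under version control alongside the code. One possible approach, sketched here, is to serialize the examples to JSON. The helper names `dump_dataset` and `load_dataset` are hypothetical, the dataclass is repeated only so the snippet runs on its own, and the ground truth text is shortened for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvaluationExample:
    """Mirrors the dataclass defined above so this sketch is self-contained."""
    query: str
    ground_truth_answer: str
    reference_doc_names: List[str]
    query_type: str
    difficulty: str


def dump_dataset(examples: List[EvaluationExample]) -> str:
    """Serialize examples to a stable JSON string suitable for committing."""
    return json.dumps([asdict(ex) for ex in examples], indent=2, sort_keys=True)


def load_dataset(payload: str) -> List[EvaluationExample]:
    """Rebuild examples from the JSON produced by dump_dataset."""
    return [EvaluationExample(**item) for item in json.loads(payload)]


examples = [
    EvaluationExample(
        query="What is Docker?",
        ground_truth_answer="Docker is a platform for running applications in containers.",
        reference_doc_names=["docker_containers.md"],
        query_type="factual",
        difficulty="easy",
    )
]
# A round trip through JSON should preserve every field.
assert load_dataset(dump_dataset(examples)) == examples
```

Sorting keys and fixing the indentation keeps diffs small when a test case is edited, which makes changes to the benchmark itself reviewable.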


---

## Step 2: Load Documents and Build RAG Systems

We'll create two RAG systems to compare:
1. **Baseline RAG**: Simple vector search with 1024-token chunks, top-3 retrieval, and `compact` response synthesis
2. **Advanced RAG**: Smaller 512-token chunks with more overlap, top-5 retrieval, and `tree_summarize` response synthesis

In [4]:
# Load documents
documents = SimpleDirectoryReader("data/tech_docs").load_data()
print(f"✓ Loaded {len(documents)} documents")

# Display document metadata
for doc in documents:
    filename = doc.metadata.get('file_name', 'unknown')
    print(f"  - {filename}: {len(doc.text)} characters")

✓ Loaded 6 documents
  - bert_model.md: 2358 characters
  - docker_containers.md: 4864 characters
  - embeddings_ml.md: 5520 characters
  - gpt4_model.md: 2814 characters
  - rest_api.md: 3797 characters
  - transformer_architecture.md: 4247 characters


In [5]:
# System 1: Baseline RAG
print("Building Baseline RAG System...")
baseline_splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
baseline_nodes = baseline_splitter.get_nodes_from_documents(documents)
baseline_index = VectorStoreIndex(baseline_nodes)
baseline_query_engine = baseline_index.as_query_engine(
    similarity_top_k=3,
    response_mode="compact"
)
print(f"✓ Baseline RAG built with {len(baseline_nodes)} chunks (chunk_size=1024, top_k=3)")

Building Baseline RAG System...
✓ Baseline RAG built with 8 chunks (chunk_size=1024, top_k=3)


In [6]:
# System 2: Advanced RAG (Optimized)
print("Building Advanced RAG System...")
advanced_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
advanced_nodes = advanced_splitter.get_nodes_from_documents(documents)
advanced_index = VectorStoreIndex(advanced_nodes)
advanced_query_engine = advanced_index.as_query_engine(
    similarity_top_k=5,
    response_mode="tree_summarize"  # Better synthesis
)
print(f"✓ Advanced RAG built with {len(advanced_nodes)} chunks (chunk_size=512, top_k=5)")

Building Advanced RAG System...
✓ Advanced RAG built with 15 chunks (chunk_size=512, top_k=5)


---

## Step 3: Implement Evaluation Metrics

We'll implement five core metrics using the **LLM-as-Judge** pattern, where Azure OpenAI evaluates the quality of RAG outputs.

### Metric 1: Context Relevance (Retrieval Precision)

In [7]:
def evaluate_context_relevance(query: str, retrieved_contexts: List[str]) -> float:
    """
    Measures the signal-to-noise ratio of retrieved context.
    Returns a score from 0.0 to 1.0 representing the average relevance of retrieved chunks.
    
    Args:
        query: The user's query
        retrieved_contexts: List of retrieved text chunks
    
    Returns:
        Average relevance score (0.0 = all irrelevant, 1.0 = all highly relevant)
    """
    if not retrieved_contexts:
        return 0.0
    
    relevance_scores = []
    
    for context in retrieved_contexts:
        prompt = f"""You are an expert evaluator. Rate the relevance of the following context to the query on a scale of 1-5.

Query: {query}

Context: {context}

Relevance Scale:
1 = Completely irrelevant, contains no information related to the query
2 = Minimally relevant, tangentially related but not useful
3 = Moderately relevant, contains some useful information
4 = Very relevant, contains substantial information to answer the query
5 = Perfectly relevant, directly and comprehensively addresses the query

Respond with ONLY a single number (1-5), no explanation."""
        
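        try:
            response = llm.complete(prompt)
            score = int(response.text.strip())
            # Normalize the 1-5 rating to a 0-1 scale
            normalized_score = (score - 1) / 4.0
            relevance_scores.append(normalized_score)
        except ValueError:
            print(f"Warning: Could not parse relevance score: {response.text}")
            relevance_scores.append(0.5)  # Default to neutral
        except Exception as e:
            # API errors (e.g., content filtering) should not abort the whole run
            print(f"Error in evaluate_context_relevance: {str(e)}")
            relevance_scores.append(0.5)
    
    return float(np.mean(relevance_scores))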

print("✓ Context Relevance metric implemented")

✓ Context Relevance metric implemented
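
The metric functions in this step parse the judge's reply with `int(response.text.strip())`, which raises `ValueError` on replies such as `4.` or `Score: 4` and then falls back to 0.5. A more tolerant parser is sketched below. `parse_likert_score` is a hypothetical helper, not part of the notebook, and it simply takes the first standalone digit between 1 and 5, so a reply that restates the scale (for example `1-5: 3`) would still be misread.

```python
import re
from typing import Optional


def parse_likert_score(text: str) -> Optional[float]:
    """Extract a 1-5 rating from a judge reply and normalize it to 0-1.

    Returns None when no standalone digit in the range 1-5 is found, so the
    caller can decide how to treat an unparseable reply.
    """
    # Lookarounds reject digits that are part of a longer number such as "10".
    match = re.search(r"(?<!\d)[1-5](?!\d)", text)
    if match is None:
        return None
    return (int(match.group()) - 1) / 4.0


print(parse_likert_score("4"))          # prints 0.75
print(parse_likert_score("Score: 5."))  # prints 1.0
print(parse_likert_score("n/a"))        # prints None
```

Returning `None` instead of a default keeps parse failures distinguishable from a genuine mid-scale rating.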


### Metric 2: Context Sufficiency (Retrieval Recall)

In [8]:
def evaluate_context_sufficiency(query: str, retrieved_contexts: List[str], ground_truth: str) -> float:
    """
    Measures whether retrieved context contains all information needed to answer the query.
    
    Args:
        query: The user's query
        retrieved_contexts: List of retrieved text chunks
        ground_truth: The ideal answer (to determine what info is needed)
    
    Returns:
        Sufficiency score (0.0 = insufficient, 1.0 = fully sufficient)
    """
    # Truncate contexts if too long
    MAX_CONTEXT_LENGTH = 3000
    combined_context = "\n\n".join(retrieved_contexts)
    if len(combined_context) > MAX_CONTEXT_LENGTH:
        combined_context = combined_context[:MAX_CONTEXT_LENGTH] + "\n...[truncated]"
    
    # Truncate ground truth if too long
    MAX_GT_LENGTH = 500
    truncated_gt = ground_truth[:MAX_GT_LENGTH] if len(ground_truth) > MAX_GT_LENGTH else ground_truth
    
    prompt = f"""You are an expert evaluator. Determine if the provided context contains sufficient information to answer the query.

Query: {query}

Ground Truth Answer: {truncated_gt}

Retrieved Context:
{combined_context}

Question: Does the retrieved context contain all the necessary information to produce an answer similar to the ground truth?

Rate on a scale of 1-5:
1 = Completely insufficient, missing all key information
2 = Mostly insufficient, missing most key information
3 = Partially sufficient, contains some but not all key information
4 = Mostly sufficient, contains most key information with minor gaps
5 = Fully sufficient, contains all necessary information

Respond with ONLY a single number (1-5), no explanation."""
    
    try:
        response = llm.complete(prompt)
        score = int(response.text.strip())
        return (score - 1) / 4.0  # Normalize to 0-1
    except ValueError as e:
        print(f"Warning: Could not parse sufficiency score: {response.text}")
        return 0.5
    except Exception as e:
        print(f"Error in evaluate_context_sufficiency: {str(e)}")
        return 0.5

print("✓ Context Sufficiency metric implemented")

✓ Context Sufficiency metric implemented


### Metric 3: Answer Faithfulness (Hallucination Detection)

In [9]:
def evaluate_faithfulness(answer: str, retrieved_contexts: List[str]) -> float:
    """
    Measures whether the answer is factually grounded in the provided context.
    Detects hallucinations - information in the answer not present in the context.
    
    Args:
        answer: The generated answer
        retrieved_contexts: The context used to generate the answer
    
    Returns:
        Faithfulness score (0.0 = contains hallucinations, 1.0 = fully grounded)
    """
    # Truncate contexts if too long to avoid token limits
    MAX_CONTEXT_LENGTH = 3000
    combined_context = "\n\n".join(retrieved_contexts)
    if len(combined_context) > MAX_CONTEXT_LENGTH:
        combined_context = combined_context[:MAX_CONTEXT_LENGTH] + "\n...[truncated]"
    
    # Truncate answer if too long
    MAX_ANSWER_LENGTH = 1000
    truncated_answer = answer[:MAX_ANSWER_LENGTH] if len(answer) > MAX_ANSWER_LENGTH else answer
    
    prompt = f"""You are an expert fact-checker. Determine if the answer contains information NOT present in the context (hallucinations).

Context:
{combined_context}

Generated Answer:
{truncated_answer}

Task: Identify if the answer contains factual claims that are NOT supported by the context.

Rate faithfulness on a scale of 1-5:
1 = Severe hallucinations, most claims are unsupported
2 = Significant hallucinations, many claims are unsupported
3 = Moderate hallucinations, some claims are unsupported
4 = Minor hallucinations, answer is mostly grounded with small unsupported details
5 = Fully faithful, all claims are directly supported by the context

Respond with ONLY a single number (1-5), no explanation."""
    
    try:
        response = llm.complete(prompt)
        score = int(response.text.strip())
        return (score - 1) / 4.0  # Normalize to 0-1
    except ValueError as e:
        print(f"Warning: Could not parse faithfulness score: {response.text}")
        return 0.5
    except Exception as e:
        print(f"Error in evaluate_faithfulness: {str(e)}")
        return 0.5

print("✓ Answer Faithfulness metric implemented")

✓ Answer Faithfulness metric implemented


### Metric 4: Answer Relevance

In [10]:
def evaluate_answer_relevance(query: str, answer: str) -> float:
    """
    Measures whether the answer is on-topic and directly addresses the query.
    
    Args:
        query: The user's query
        answer: The generated answer
    
    Returns:
        Relevance score (0.0 = off-topic, 1.0 = directly addresses query)
    """
    prompt = f"""You are an expert evaluator. Determine if the answer directly addresses the query.

Query: {query}

Answer:
{answer}

Question: Does the answer directly and relevantly address what the query is asking for?

Rate on a scale of 1-5:
1 = Completely off-topic, does not address the query at all
2 = Minimally relevant, touches on the topic but doesn't answer the query
3 = Moderately relevant, partially addresses the query
4 = Highly relevant, addresses most aspects of the query
5 = Perfectly relevant, directly and comprehensively addresses the query

Respond with ONLY a single number (1-5), no explanation."""
    
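    try:
        response = llm.complete(prompt)
        score = int(response.text.strip())
        return (score - 1) / 4.0  # Normalize to 0-1
    except ValueError:
        print(f"Warning: Could not parse answer relevance score: {response.text}")
        return 0.5
    except Exception as e:
        # API errors (e.g., content filtering) should not abort the whole run
        print(f"Error in evaluate_answer_relevance: {str(e)}")
        return 0.5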

print("✓ Answer Relevance metric implemented")

✓ Answer Relevance metric implemented


### Metric 5: Answer Correctness (Semantic Similarity + LLM Scoring)

In [11]:
def evaluate_answer_correctness(generated_answer: str, ground_truth: str) -> float:
    """
    Measures factual accuracy against ground truth using a hybrid approach:
    1. Semantic similarity (cosine similarity between embeddings)
    2. LLM-based factual correctness scoring
    
    Args:
        generated_answer: The RAG system's answer
        ground_truth: The gold standard answer
    
    Returns:
        Correctness score (0.0 = incorrect, 1.0 = correct)
    """
    try:
        # Component 1: Semantic similarity via embeddings
        gen_embedding = embed_model.get_text_embedding(generated_answer)
        gt_embedding = embed_model.get_text_embedding(ground_truth)
        
        semantic_sim = cosine_similarity(
            [gen_embedding], 
            [gt_embedding]
        )[0][0]
    except Exception as e:
        print(f"Warning: Could not compute semantic similarity: {str(e)}")
        semantic_sim = 0.5
    
    # Component 2: LLM-based factual correctness
    # Truncate if too long
    MAX_LENGTH = 800
    truncated_gen = generated_answer[:MAX_LENGTH] if len(generated_answer) > MAX_LENGTH else generated_answer
    truncated_gt = ground_truth[:MAX_LENGTH] if len(ground_truth) > MAX_LENGTH else ground_truth
    
    prompt = f"""You are an expert evaluator. Compare the generated answer against the ground truth answer.

Ground Truth Answer:
{truncated_gt}

Generated Answer:
{truncated_gen}

Task: Rate the factual correctness of the generated answer compared to the ground truth.

Rate on a scale of 1-5:
1 = Completely incorrect, contains major factual errors
2 = Mostly incorrect, contains significant errors
3 = Partially correct, some facts are right but key information is wrong or missing
4 = Mostly correct, captures main facts with minor inaccuracies or omissions
5 = Fully correct, factually accurate and complete

Respond with ONLY a single number (1-5), no explanation."""
    
    try:
        response = llm.complete(prompt)
        llm_score = int(response.text.strip())
        llm_score_normalized = (llm_score - 1) / 4.0
    except ValueError as e:
        print(f"Warning: Could not parse correctness score: {response.text}")
        llm_score_normalized = 0.5
    except Exception as e:
        print(f"Error in evaluate_answer_correctness: {str(e)}")
        llm_score_normalized = 0.5
    
    # Combine both scores (weighted average)
    final_score = 0.4 * semantic_sim + 0.6 * llm_score_normalized
    
    return final_score

print("✓ Answer Correctness metric implemented")

✓ Answer Correctness metric implemented
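
To make the weighting in `evaluate_answer_correctness` concrete, the sketch below reproduces the arithmetic on toy inputs: a cosine similarity between two small vectors and a judge rating of 4. The helper names and the vectors are assumptions for illustration; real embeddings have far more dimensions.

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def hybrid_correctness(semantic_sim: float, llm_rating: int) -> float:
    """Combine similarity and a 1-5 judge rating with the 0.4 / 0.6 weights used above."""
    llm_score = (llm_rating - 1) / 4.0  # Normalize 1-5 to 0-1
    return 0.4 * semantic_sim + 0.6 * llm_score


sim = cosine([1.0, 0.0], [1.0, 1.0])   # 1 / sqrt(2), about 0.707
score = hybrid_correctness(sim, llm_rating=4)
print(f"similarity={sim:.3f}, correctness={score:.3f}")  # prints similarity=0.707, correctness=0.733
```

Because the similarity term contributes 40% of the score, an answer that is topically close to the ground truth still earns partial credit even when the judge rates it poorly, which is worth keeping in mind when reading the correctness column.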


---

## Step 4: Build Automated Evaluation Pipeline

Now we'll create a pipeline that:
1. Runs queries through a RAG system
2. Extracts retrieved context and generated answers
3. Applies all evaluation metrics
4. Aggregates results into a comprehensive report

In [12]:
@dataclass
class EvaluationResult:
    """Results for a single test case"""
    query: str
    query_type: str
    difficulty: str
    generated_answer: str
    ground_truth: str
    retrieved_contexts: List[str]
    context_relevance: float
    context_sufficiency: float
    answer_faithfulness: float
    answer_relevance: float
    answer_correctness: float
    
    @property
    def overall_score(self) -> float:
        """Compute weighted average of all metrics"""
        return (
            0.15 * self.context_relevance +
            0.15 * self.context_sufficiency +
            0.25 * self.answer_faithfulness +
            0.20 * self.answer_relevance +
            0.25 * self.answer_correctness
        )

print("✓ EvaluationResult dataclass defined")

✓ EvaluationResult dataclass defined
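
The weights in `overall_score` sum to 1.0, so the result stays on the same 0-1 scale as the individual metrics. If the weights are likely to be tuned, one option is to keep them in a single mapping and validate the sum. The sketch below assumes hypothetical names (`METRIC_WEIGHTS`, `weighted_overall`) and uses made-up scores.

```python
import math
from typing import Dict

# Same weights as the overall_score property above.
METRIC_WEIGHTS: Dict[str, float] = {
    "context_relevance": 0.15,
    "context_sufficiency": 0.15,
    "answer_faithfulness": 0.25,
    "answer_relevance": 0.20,
    "answer_correctness": 0.25,
}


def weighted_overall(scores: Dict[str, float], weights: Dict[str, float] = METRIC_WEIGHTS) -> float:
    """Weighted average of metric scores; rejects weights that do not sum to 1."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("Metric weights must sum to 1.0")
    return sum(weights[name] * scores[name] for name in weights)


# Illustrative scores, not taken from an actual run.
example_scores = {
    "context_relevance": 0.5,
    "context_sufficiency": 1.0,
    "answer_faithfulness": 0.5,
    "answer_relevance": 1.0,
    "answer_correctness": 0.8,
}
print(f"{weighted_overall(example_scores):.3f}")  # prints 0.750
```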


In [13]:
def evaluate_rag_system(query_engine, test_dataset: List[EvaluationExample], system_name: str) -> List[EvaluationResult]:
    """
    Automated evaluation pipeline for a RAG system.
    
    Args:
        query_engine: The RAG query engine to evaluate
        test_dataset: List of gold standard test examples
        system_name: Name of the system being evaluated (for logging)
    
    Returns:
        List of EvaluationResult objects, one per test case
    """
    results = []
    
    print(f"\n{'='*80}")
    print(f"Evaluating: {system_name}")
    print(f"{'='*80}\n")
    
    for i, example in enumerate(test_dataset, 1):
        print(f"[{i}/{len(test_dataset)}] Processing: {example.query[:60]}...")
        
        # Step 1: Query the RAG system
        response = query_engine.query(example.query)
        generated_answer = str(response)
        
        # Step 2: Extract retrieved context
        retrieved_contexts = [node.node.text for node in response.source_nodes]
        
        # Step 3: Compute all metrics
        print("  Computing metrics...")
        
        context_rel = evaluate_context_relevance(example.query, retrieved_contexts)
        context_suf = evaluate_context_sufficiency(example.query, retrieved_contexts, example.ground_truth_answer)
        faithfulness = evaluate_faithfulness(generated_answer, retrieved_contexts)
        ans_relevance = evaluate_answer_relevance(example.query, generated_answer)
        correctness = evaluate_answer_correctness(generated_answer, example.ground_truth_answer)
        
        # Step 4: Store results
        result = EvaluationResult(
            query=example.query,
            query_type=example.query_type,
            difficulty=example.difficulty,
            generated_answer=generated_answer,
            ground_truth=example.ground_truth_answer,
            retrieved_contexts=retrieved_contexts,
            context_relevance=context_rel,
            context_sufficiency=context_suf,
            answer_faithfulness=faithfulness,
            answer_relevance=ans_relevance,
            answer_correctness=correctness
        )
        
        results.append(result)
        print(f"  ✓ Overall Score: {result.overall_score:.3f}\n")
    
    return results

print("✓ Automated evaluation pipeline implemented")

✓ Automated evaluation pipeline implemented


---

## Step 5: Run Evaluation on Both Systems

Let's evaluate both our baseline and advanced RAG systems.

In [14]:
# Evaluate Baseline RAG
baseline_results = evaluate_rag_system(
    baseline_query_engine,
    EVALUATION_DATASET,
    "Baseline RAG (chunk_size=1024, top_k=3)"
)


Evaluating: Baseline RAG (chunk_size=1024, top_k=3)

[1/5] Processing: What is the transformer architecture?...


  Computing metrics...
Error in evaluate_faithfulness: Error code: 400 - {'error': {'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766", 'type': None, 'param': 'prompt', 'code': 'content_filter', 'status': 400, 'innererror': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_result': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': True, 'detected': True}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}}}

In [15]:
# Evaluate Advanced RAG
advanced_results = evaluate_rag_system(
    advanced_query_engine,
    EVALUATION_DATASET,
    "Advanced RAG (chunk_size=512, top_k=5, tree_summarize)"
)


Evaluating: Advanced RAG (chunk_size=512, top_k=5, tree_summarize)

[1/5] Processing: What is the transformer architecture?...
  Computing metrics...
Error in evaluate_faithfulness: Error code: 400 - {'error': {'message': "The response was filtered due to the prompt triggering Azure OpenAI's content management policy. Please modify your prompt and retry. To learn more about our content filtering policies please read our documentation: https://go.microsoft.com/fwlink/?linkid=2198766", 'type': None, 'param': 'prompt', 'code': 'content_filter', 'status': 400, 'innererror': {'code': 'ResponsibleAIPolicyViolation', 'content_filter_result': {'hate': {'filtered': False, 'severity': 'safe'}, 'jailbreak': {'filtered': True, 'detected': True}, 'self_harm': {'filtered': False, 'severity': 'safe'}, 'sexual': {'filtered': False, 'severity': 'safe'}, 'violence': {'filtered': False, 'severity': 'safe'}}}}}

---

## Step 6: Comparative Analysis and Visualization

Now let's analyze and compare the results.

In [16]:
def create_summary_table(results: List[EvaluationResult], system_name: str) -> pd.DataFrame:
    """Create a summary DataFrame from evaluation results"""
    data = []
    for r in results:
        data.append({
            'System': system_name,
            'Query': r.query[:50] + '...',
            'Type': r.query_type,
            'Difficulty': r.difficulty,
            'Context Relevance': f"{r.context_relevance:.3f}",
            'Context Sufficiency': f"{r.context_sufficiency:.3f}",
            'Faithfulness': f"{r.answer_faithfulness:.3f}",
            'Answer Relevance': f"{r.answer_relevance:.3f}",
            'Correctness': f"{r.answer_correctness:.3f}",
            'Overall Score': f"{r.overall_score:.3f}"
        })
    return pd.DataFrame(data)

# Create summary tables
baseline_df = create_summary_table(baseline_results, "Baseline")
advanced_df = create_summary_table(advanced_results, "Advanced")

# Combine for comparison
combined_df = pd.concat([baseline_df, advanced_df], ignore_index=True)

print("\n" + "="*100)
print("DETAILED RESULTS BY QUERY")
print("="*100)
print(combined_df.to_string(index=False))


DETAILED RESULTS BY QUERY
  System                                                 Query       Type Difficulty Context Relevance Context Sufficiency Faithfulness Answer Relevance Correctness Overall Score
Baseline              What is the transformer architecture?... conceptual       easy             0.583               1.000        0.500            1.000       0.832         0.771
Baseline                   How does BERT differ from GPT-4?... comparison     medium             0.583               0.500        0.500            1.000       0.686         0.659
Baseline    Explain how embeddings work in machine learning... conceptual       easy             0.667               1.000        0.500            1.000       0.989         0.822
Baseline What are the key components of a REST API and how ...  multi-hop       hard             0.333               0.500        0.500            0.750       0.824         0.606
Baseline                                    What is Docker?...    factual     

In [17]:
def compute_aggregate_metrics(results: List[EvaluationResult]) -> Dict[str, float]:
    """Compute aggregate statistics across all test cases"""
    return {
        'Avg Context Relevance': np.mean([r.context_relevance for r in results]),
        'Avg Context Sufficiency': np.mean([r.context_sufficiency for r in results]),
        'Avg Faithfulness': np.mean([r.answer_faithfulness for r in results]),
        'Avg Answer Relevance': np.mean([r.answer_relevance for r in results]),
        'Avg Correctness': np.mean([r.answer_correctness for r in results]),
        'Avg Overall Score': np.mean([r.overall_score for r in results]),
    }

# Compute aggregates
baseline_agg = compute_aggregate_metrics(baseline_results)
advanced_agg = compute_aggregate_metrics(advanced_results)

# Create comparison DataFrame
comparison_data = {
    'Metric': list(baseline_agg.keys()),
    'Baseline RAG': [f"{v:.3f}" for v in baseline_agg.values()],
    'Advanced RAG': [f"{v:.3f}" for v in advanced_agg.values()],
    'Improvement': [
        f"{((advanced_agg[k] - baseline_agg[k]) / baseline_agg[k] * 100):.1f}%" 
        if baseline_agg[k] > 0 else "N/A"
        for k in baseline_agg.keys()
    ]
}

comparison_df = pd.DataFrame(comparison_data)

print("\n" + "="*100)
print("AGGREGATE PERFORMANCE COMPARISON")
print("="*100)
print(comparison_df.to_string(index=False))
print("\n")


AGGREGATE PERFORMANCE COMPARISON
                 Metric Baseline RAG Advanced RAG Improvement
  Avg Context Relevance        0.517        0.590       14.2%
Avg Context Sufficiency        0.800        0.850        6.2%
       Avg Faithfulness        0.500        0.500        0.0%
   Avg Answer Relevance        0.950        0.950        0.0%
        Avg Correctness        0.864        0.833       -3.6%
      Avg Overall Score        0.729        0.739        1.5%
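
With only five test queries, a change in an average can come from a single query. A paired, per-query view complements the aggregate table. The sketch below is a minimal illustration, assuming both systems were scored on the same queries in the same order. The helper name `paired_deltas` and the example scores are hypothetical placeholders, not values from this notebook.

```python
from statistics import mean
from typing import Dict, List

def paired_deltas(baseline: List[float], candidate: List[float]) -> Dict[str, float]:
    """Summarize per-query score differences (candidate minus baseline)."""
    if len(baseline) != len(candidate):
        raise ValueError("Both systems must be scored on the same queries")
    if not baseline:
        raise ValueError("At least one query is required")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "mean_delta": mean(deltas),             # average per-query change
        "wins": sum(d > 0 for d in deltas),     # queries where the candidate scored higher
        "losses": sum(d < 0 for d in deltas),   # queries where the candidate scored lower
        "ties": sum(d == 0 for d in deltas),    # queries with identical scores
        "worst_delta": min(deltas),             # largest single-query drop
    }

# Illustrative placeholder scores (NOT taken from the notebook output)
example_baseline = [0.70, 0.60, 0.80, 0.50, 0.75]
example_candidate = [0.75, 0.55, 0.80, 0.60, 0.75]
summary = paired_deltas(example_baseline, example_candidate)
print(summary)
```

In this notebook the inputs would be, for example, `[r.overall_score for r in baseline_results]` and `[r.overall_score for r in advanced_results]`. A small mean delta alongside a mix of wins and losses suggests the two configurations are hard to separate on a dataset this small.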




In [18]:
# Performance by query type
def analyze_by_category(results: List[EvaluationResult], category_key: str) -> pd.DataFrame:
    """Analyze performance breakdown by category (type or difficulty)"""
    data = {}
    
    # Get unique categories
    categories = set(getattr(r, category_key) for r in results)
    
    for cat in categories:
        cat_results = [r for r in results if getattr(r, category_key) == cat]
        data[cat] = {
            'Count': len(cat_results),
            'Avg Overall Score': np.mean([r.overall_score for r in cat_results]),
            'Avg Correctness': np.mean([r.answer_correctness for r in cat_results]),
            'Avg Faithfulness': np.mean([r.answer_faithfulness for r in cat_results]),
        }
    
    return pd.DataFrame(data).T

print("\n" + "="*80)
print("BASELINE: Performance by Query Type")
print("="*80)
baseline_by_type = analyze_by_category(baseline_results, 'query_type')
print(baseline_by_type.to_string())

print("\n" + "="*80)
print("ADVANCED: Performance by Query Type")
print("="*80)
advanced_by_type = analyze_by_category(advanced_results, 'query_type')
print(advanced_by_type.to_string())

print("\n" + "="*80)
print("BASELINE: Performance by Difficulty")
print("="*80)
baseline_by_diff = analyze_by_category(baseline_results, 'difficulty')
print(baseline_by_diff.to_string())

print("\n" + "="*80)
print("ADVANCED: Performance by Difficulty")
print("="*80)
advanced_by_diff = analyze_by_category(advanced_results, 'difficulty')
print(advanced_by_diff.to_string())


BASELINE: Performance by Query Type
            Count  Avg Overall Score  Avg Correctness  Avg Faithfulness
factual       1.0           0.784874         0.989497               0.5
multi-hop     1.0           0.606123         0.824491               0.5
conceptual    2.0           0.796375         0.910498               0.5
comparison    1.0           0.658892         0.685569               0.5

ADVANCED: Performance by Query Type
            Count  Avg Overall Score  Avg Correctness  Avg Faithfulness
factual       1.0           0.788601         0.984403               0.5
multi-hop     1.0           0.616385         0.675540               0.5
conceptual    2.0           0.825982         0.908928               0.5
comparison    1.0           0.639692         0.688769               0.5

BASELINE: Performance by Difficulty
        Count  Avg Overall Score  Avg Correctness  Avg Faithfulness
easy      3.0           0.792541         0.936831               0.5
hard      1.0           0.606123 

---

## Step 7: Detailed Example Analysis

Let's examine a specific query in detail to understand the evaluation.

In [19]:
def display_detailed_comparison(baseline_result: EvaluationResult, advanced_result: EvaluationResult):
    """Display side-by-side detailed comparison for a single query"""
    
    print("\n" + "="*100)
    print("DETAILED EXAMPLE COMPARISON")
    print("="*100)
    
    print(f"\nQuery: {baseline_result.query}")
    print(f"Type: {baseline_result.query_type} | Difficulty: {baseline_result.difficulty}")
    
    print("\n" + "-"*100)
    print("GROUND TRUTH ANSWER:")
    print("-"*100)
    print(baseline_result.ground_truth)
    
    print("\n" + "-"*100)
    print("BASELINE RAG ANSWER:")
    print("-"*100)
    print(baseline_result.generated_answer)
    
    print("\n" + "-"*100)
    print("ADVANCED RAG ANSWER:")
    print("-"*100)
    print(advanced_result.generated_answer)
    
    print("\n" + "="*100)
    print("METRIC COMPARISON")
    print("="*100)
    
    metrics = [
        ('Context Relevance', baseline_result.context_relevance, advanced_result.context_relevance),
        ('Context Sufficiency', baseline_result.context_sufficiency, advanced_result.context_sufficiency),
        ('Answer Faithfulness', baseline_result.answer_faithfulness, advanced_result.answer_faithfulness),
        ('Answer Relevance', baseline_result.answer_relevance, advanced_result.answer_relevance),
        ('Answer Correctness', baseline_result.answer_correctness, advanced_result.answer_correctness),
        ('Overall Score', baseline_result.overall_score, advanced_result.overall_score),
    ]
    
    metric_df = pd.DataFrame([
        {
            'Metric': name,
            'Baseline': f"{baseline:.3f}",
            'Advanced': f"{advanced:.3f}",
            'Difference': f"{(advanced - baseline):+.3f}",
            'Winner': '✓ Advanced' if advanced > baseline else ('✓ Baseline' if baseline > advanced else 'Tie')
        }
        for name, baseline, advanced in metrics
    ])
    
    print(metric_df.to_string(index=False))
    
    print("\n" + "-"*100)
    print("RETRIEVED CONTEXT COMPARISON")
    print("-"*100)
    print(f"\nBaseline retrieved {len(baseline_result.retrieved_contexts)} chunks")
    print(f"Advanced retrieved {len(advanced_result.retrieved_contexts)} chunks\n")

# Display detailed comparison for the first query (easy conceptual)
display_detailed_comparison(baseline_results[0], advanced_results[0])


DETAILED EXAMPLE COMPARISON

Query: What is the transformer architecture?
Type: conceptual | Difficulty: easy

----------------------------------------------------------------------------------------------------
GROUND TRUTH ANSWER:
----------------------------------------------------------------------------------------------------
The transformer architecture is a neural network architecture introduced in the 'Attention is All You Need' paper. It relies entirely on self-attention mechanisms to process sequences in parallel, replacing recurrent layers. Key components include multi-head attention, positional encoding, and feed-forward networks arranged in encoder-decoder stacks.

----------------------------------------------------------------------------------------------------
BASELINE RAG ANSWER:
----------------------------------------------------------------------------------------------------
The transformer architecture is a deep learning model that revolutionized natural langua

In [20]:
# Display detailed comparison for a harder query (multi-hop)
display_detailed_comparison(baseline_results[3], advanced_results[3])


DETAILED EXAMPLE COMPARISON

Query: What are the key components of a REST API and how do they relate to Docker containers?
Type: multi-hop | Difficulty: hard

----------------------------------------------------------------------------------------------------
GROUND TRUTH ANSWER:
----------------------------------------------------------------------------------------------------
REST APIs consist of resources identified by URLs, HTTP methods (GET, POST, PUT, DELETE), stateless communication, and standardized response formats like JSON. Docker containers provide an ideal deployment environment for REST APIs by packaging the API application with all its dependencies into isolated, portable containers. This ensures consistent behavior across development and production environments.

----------------------------------------------------------------------------------------------------
BASELINE RAG ANSWER:
--------------------------------------------------------------------------------------

---

## Step 8: Export Results for Regression Testing

Save evaluation results to enable continuous monitoring and regression detection.

In [21]:
def export_evaluation_results(results: List[EvaluationResult], system_name: str, output_file: str):
    """Export evaluation results to JSON for version control and tracking"""
    
    export_data = {
        'system_name': system_name,
        'evaluation_date': pd.Timestamp.now().strftime('%Y-%m-%d'),  # record the actual run date instead of a hardcoded string
        'aggregate_metrics': compute_aggregate_metrics(results),
        'detailed_results': [
            {
                'query': r.query,
                'query_type': r.query_type,
                'difficulty': r.difficulty,
                'metrics': {
                    'context_relevance': r.context_relevance,
                    'context_sufficiency': r.context_sufficiency,
                    'answer_faithfulness': r.answer_faithfulness,
                    'answer_relevance': r.answer_relevance,
                    'answer_correctness': r.answer_correctness,
                    'overall_score': r.overall_score
                },
                'generated_answer': r.generated_answer,
                'num_retrieved_chunks': len(r.retrieved_contexts)
            }
            for r in results
        ]
    }
    
    with open(output_file, 'w') as f:
        json.dump(export_data, f, indent=2)
    
    print(f"✓ Results exported to {output_file}")

# Export both systems
export_evaluation_results(baseline_results, "Baseline RAG", "baseline_eval_results.json")
export_evaluation_results(advanced_results, "Advanced RAG", "advanced_eval_results.json")

print("\n✓ Evaluation results saved for regression testing")

✓ Results exported to baseline_eval_results.json
✓ Results exported to advanced_eval_results.json

✓ Evaluation results saved for regression testing
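
The exported files only help with regression detection if something compares them. The sketch below is one possible check, assuming the JSON layout written by `export_evaluation_results` (an `aggregate_metrics` object keyed by metric name). The helper names `find_regressions` and `load_aggregates` and the absolute tolerance of 0.02 are assumptions for illustration. The metric values are the rounded aggregates from the comparison table above, written to temporary files so the example is self-contained.

```python
import json
import os
import tempfile
from typing import Dict, List

def find_regressions(previous: Dict[str, float], current: Dict[str, float],
                     tolerance: float = 0.02) -> List[str]:
    """Return names of metrics that dropped by more than `tolerance` (absolute)."""
    regressions = []
    for name, old_value in previous.items():
        new_value = current.get(name)
        # Metrics missing from the current run are skipped rather than flagged
        if new_value is not None and (old_value - new_value) > tolerance:
            regressions.append(name)
    return regressions

def load_aggregates(path: str) -> Dict[str, float]:
    """Read the 'aggregate_metrics' section of an exported results file."""
    with open(path, "r") as f:
        return json.load(f)["aggregate_metrics"]

# Rounded aggregates from the comparison table in this demo
baseline_metrics = {
    "Avg Context Relevance": 0.517, "Avg Context Sufficiency": 0.800,
    "Avg Faithfulness": 0.500, "Avg Answer Relevance": 0.950,
    "Avg Correctness": 0.864, "Avg Overall Score": 0.729,
}
advanced_metrics = {
    "Avg Context Relevance": 0.590, "Avg Context Sufficiency": 0.850,
    "Avg Faithfulness": 0.500, "Avg Answer Relevance": 0.950,
    "Avg Correctness": 0.833, "Avg Overall Score": 0.739,
}

with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for label, metrics in (("baseline", baseline_metrics), ("advanced", advanced_metrics)):
        paths[label] = os.path.join(tmp, f"{label}_eval_results.json")
        with open(paths[label], "w") as f:
            json.dump({"system_name": label, "aggregate_metrics": metrics}, f)
    # Treat the baseline export as the previous run and the advanced export as the current one
    flagged = find_regressions(load_aggregates(paths["baseline"]),
                               load_aggregates(paths["advanced"]))

print(flagged)
```

In a CI job, a non-empty result could fail the build. The tolerance should be chosen with the dataset size and judge variability in mind, since very small benchmarks produce noisy averages.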


---

## Key Takeaways and Best Practices

### What We Learned

1. **Evaluation is Multi-Dimensional**
   - RAG systems must be evaluated along multiple axes: retrieval quality (relevance, sufficiency) and generation quality (faithfulness, relevance, correctness)
   - No single metric tells the full story; you need a comprehensive suite

2. **LLM-as-Judge is Powerful**
   - Using a powerful LLM (like GPT-4) as an evaluator enables automated, nuanced assessment of complex qualities like "faithfulness" and "relevance"
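   - This automation makes continuous evaluation feasible, but judge scores still need spot-checking: in this run, faithfulness averaged exactly 0.500 in every category for both systems, which is worth investigating before relying on that metric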

3. **Gold Standard Datasets are Essential**
   - A curated benchmark dataset with ground truth answers is the foundation of rigorous evaluation
   - This enables regression testing: detect when changes degrade performance

4. **Comparative Analysis Drives Optimization**
   - Evaluating multiple configurations side-by-side reveals which architectural choices matter
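   - In our example, smaller chunks + higher top_k improved the retrieval metrics (context relevance +14.2%, context sufficiency +6.2%) but left faithfulness and answer relevance unchanged and lowered correctness by 3.6%, for a net overall gain of 1.5%. With only five test queries, differences of this size should be treated as tentative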

5. **Category-Specific Analysis Reveals Weaknesses**
   - Breaking down performance by query type (factual, conceptual, multi-hop) and difficulty helps identify specific failure modes
   - This guides targeted improvements

### Production Best Practices (from Curriculum)

1. **Establish a Gold Standard Early**: Create your benchmark dataset at the start of development, not as an afterthought

2. **Automate Testing Pipelines**: Integrate RAG evaluation into CI/CD workflows to test every change automatically

3. **Monitor for Drift**: Continuously monitor production metrics to detect degradation as data or models change over time

4. **Map Metrics to Failure Points**: Use evaluation results to diagnose specific failure modes:
   - Low context relevance → FP2: Retrieval failure
   - Low faithfulness → FP4: Generation failure (hallucination)
   - Low correctness but high faithfulness → FP3: Missing information in context

5. **Version Control Your Evaluations**: Store evaluation results in version control alongside your code to track performance evolution
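
The metric-to-failure-point mapping in item 4 can be sketched as a small rule-based function. This is an illustration only: the function name `diagnose` and the `low` and `high` thresholds are assumptions, and the example scores are placeholders. The metric keys follow the `metrics` dictionary used in the exported results.

```python
from typing import Dict, List

def diagnose(metrics: Dict[str, float], low: float = 0.6, high: float = 0.8) -> List[str]:
    """Map one query's metric scores to the failure points listed above.

    `low` and `high` are illustrative thresholds, not values from the curriculum.
    """
    findings = []
    if metrics["context_relevance"] < low:
        findings.append("FP2: Retrieval failure")
    if metrics["answer_faithfulness"] < low:
        findings.append("FP4: Generation failure (hallucination)")
    # Answer is grounded in the context but still wrong: the context lacked the needed facts
    if metrics["answer_correctness"] < low and metrics["answer_faithfulness"] >= high:
        findings.append("FP3: Missing information in context")
    return findings

# Placeholder scores for a single hypothetical query
example = {
    "context_relevance": 0.333,
    "answer_faithfulness": 0.900,
    "answer_correctness": 0.400,
}
findings = diagnose(example)
print(findings)
```

Running this over each entry in `detailed_results` gives a per-query list of suspected failure points, which can then be counted by query type or difficulty to prioritize fixes.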

### The "MLOps-ification" of RAG

As the curriculum emphasizes:
> "The maturation of RAG development has led to its 'MLOps-ification,' where building a RAG system now demands the same discipline as any other machine learning system: automated testing, versioned datasets, continuous monitoring for drift, and metric-driven development."

This demo provides the foundation for treating RAG development with the same rigor as traditional ML engineering.

---

## Further Reading

- **RAG Evaluation Metrics: Best Practices for Evaluating RAG Systems**, Patronus AI (Reference 75)
- **Evaluating retrieval in RAGs: a practical framework**, Tweag (Reference 73)
- **RAG systems: Best practices to master evaluation**, Google Cloud (Reference 74)
- **Seven Failure Points When Engineering a RAG System**, arXiv (Reference 10)

---