# Evaluation Infrastructure

## Systematic Evaluation for AI Systems

This notebook explores evaluation methodology for RAG and agentic systems. We cover the complete infrastructure needed to move from "vibe checks" to systematic, metrics-driven development.

```
    ┌─────────────────────────────────────────────────────────────┐
    │                  Evaluation Infrastructure                  │
    │                                                             │
    │   ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
    │   │ Synthetic│  │  RAGAS   │  │ Quality  │  │ Metrics  │    │
    │   │   Data   │  │  Metrics │  │Validation│  │ Tracking │    │
    │   │Generation│  │          │  │          │  │          │    │
    │   └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘    │
    │        │             │             │             │          │
    │        v             v             v             v          │
    │   Test cases    Faithfulness   Filter bad    Track over     │
    │   at scale      Relevancy      test cases    time           │
    │                 Precision                                   │
    │                 Recall                                      │
    └─────────────────────────────────────────────────────────────┘
```

Topics covered:
- Synthetic data generation for test cases
- RAGAS core metrics (faithfulness, answer relevancy, context precision, context recall)
- RAGAS testset generation from documents
- Difficulty-based and adversarial test case generation
- Data quality validation and coverage verification
- Metrics-driven development practices
- LangSmith integration for dataset management

## Setup and Imports

Make sure you have the required packages installed:

```bash
uv sync
```

Note: This notebook requires RAGAS 0.4.x. Jupyter already runs an asyncio event loop, so the `nest_asyncio` package is needed to let RAGAS run its own async calls inside notebook cells.

In [None]:
import os
import json
import asyncio
from datetime import datetime, UTC
from pathlib import Path
from typing import Optional

from dotenv import load_dotenv
from pydantic import BaseModel
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# RAGAS requires nest_asyncio in Jupyter environments
import nest_asyncio
nest_asyncio.apply()

# Load environment variables
load_dotenv()

# Verify API key
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY not found in environment variables")

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def extract_json(content: str) -> str:
    """Extract JSON from LLM response, handling markdown code blocks."""
    if "```json" in content:
        content = content.split("```json")[1].split("```")[0]
    elif "```" in content:
        content = content.split("```")[1].split("```")[0]
    return content.strip()

print("Setup complete!")

## Synthetic Data Generation

The cold-start problem: systematic evaluation requires test data, but you don't have users yet (or enough users, or users who hit edge cases). Synthetic test data lets you achieve coverage of scenarios you've identified without waiting for production traffic.

The key insight: use an LLM to generate the test cases that will evaluate your system. This works because the generation model and evaluation target can be different, and generated data can be reviewed and filtered by humans.

In [None]:
def generate_test_cases(
    topic: str, 
    num_cases: int,
    difficulty_distribution: dict,
    llm: ChatOpenAI
) -> list[dict]:
    """Generate synthetic test cases for a topic."""
    
    test_cases = []
    
    for difficulty, count in difficulty_distribution.items():
        prompt = f"""Generate {count} question-answer pairs about {topic}.

Difficulty level: {difficulty}
- easy: Basic factual questions with straightforward answers
- medium: Questions requiring explanation or connection of concepts  
- hard: Questions requiring synthesis, edge cases, or subtle distinctions

For each pair, provide:
- question: The question to ask
- expected_answer: A reference answer (doesn't need to be exact)
- key_concepts: List of concepts the answer should mention
- difficulty: {difficulty}

Format as JSON array: [{{"question": "...", "expected_answer": "...", "key_concepts": [...], "difficulty": "{difficulty}"}}]"""

        response = llm.invoke(prompt)
        content = extract_json(response.content)
        
        try:
            cases = json.loads(content)
            test_cases.extend(cases)
        except json.JSONDecodeError:
            print(f"Warning: Could not parse response for {difficulty}")
    
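    # num_cases is the intended total; warn when parse failures or a
    # mismatched difficulty_distribution leave a different count.
    if len(test_cases) != num_cases:
        print(f"Warning: expected {num_cases} cases, got {len(test_cases)}")
    
    return test_cases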


# Generate a small test set
test_data = generate_test_cases(
    topic="retrieval-augmented generation",
    num_cases=6,
    difficulty_distribution={"easy": 2, "medium": 3, "hard": 1},
    llm=llm
)

print(f"Generated {len(test_data)} test cases:")
for tc in test_data[:3]:
    print(f"  [{tc.get('difficulty', 'unknown')}] {tc.get('question', 'N/A')[:60]}...")
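
LLM-generated JSON can parse cleanly and still be structurally wrong, so a cheap schema check before any human review may help. The sketch below is one possible approach, not part of the pipeline above: the field names mirror what the generation prompt asks for, while `REQUIRED_FIELDS`, `ALLOWED_DIFFICULTIES`, and `find_schema_problems` are hypothetical names introduced for illustration.

```python
# Hypothetical helper: field names follow the generation prompt in this section.
REQUIRED_FIELDS = {"question", "expected_answer", "key_concepts", "difficulty"}
ALLOWED_DIFFICULTIES = {"easy", "medium", "hard"}


def find_schema_problems(case: dict) -> list[str]:
    """Return structural problems; an empty list means the case looks usable."""
    problems = []
    missing = REQUIRED_FIELDS - case.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if not str(case.get("question", "")).strip():
        problems.append("empty question")
    if case.get("difficulty") not in ALLOWED_DIFFICULTIES:
        problems.append(f"unexpected difficulty: {case.get('difficulty')!r}")
    if not isinstance(case.get("key_concepts", []), list):
        problems.append("key_concepts is not a list")
    return problems


# Illustrative data, not real model output.
raw_cases = [
    {"question": "What is RAG?", "expected_answer": "Retrieval plus generation.",
     "key_concepts": ["retrieval", "generation"], "difficulty": "easy"},
    {"question": "", "expected_answer": "n/a",
     "key_concepts": "retrieval", "difficulty": "trivial"},
]

for case in raw_cases:
    print(find_schema_problems(case) or "ok")
```

Cases that fail this check can be dropped or regenerated before spending LLM calls on deeper quality validation.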

## The RAGAS Framework

RAGAS (RAG Assessment) provides automated metrics specifically designed for evaluating RAG systems. It measures retrieval quality and generation quality separately, so you can identify exactly where problems occur.

### The Four Core Metrics

| Metric | What It Measures | Good | Acceptable | Needs Work |
|--------|------------------|------|------------|------------|
| **Faithfulness** | Does the answer stick to retrieved context? | > 0.9 | 0.7-0.9 | < 0.7 |
| **Answer Relevancy** | Does the response address what was asked? | > 0.8 | 0.6-0.8 | < 0.6 |
| **Context Precision** | Of retrieved docs, what proportion were relevant? | > 0.8 | 0.6-0.8 | < 0.6 |
| **Context Recall** | Does retrieved context contain needed info? | > 0.8 | 0.6-0.8 | < 0.6 |
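
To make the bands in the table actionable, a small lookup can label each score. This is a minimal sketch that copies the band edges from the table; the metric keys in `BANDS` and the `classify_score` helper are placeholders, and the column names RAGAS actually returns may differ.

```python
# Band edges copied from the table: (good_above, acceptable_from).
# Metric keys are placeholder names, not necessarily RAGAS column names.
BANDS = {
    "faithfulness": (0.9, 0.7),
    "answer_relevancy": (0.8, 0.6),
    "context_precision": (0.8, 0.6),
    "context_recall": (0.8, 0.6),
}


def classify_score(metric: str, score: float) -> str:
    """Map a score to 'good', 'acceptable', or 'needs work' using the table's bands."""
    good_above, acceptable_from = BANDS[metric]
    if score > good_above:
        return "good"
    if score >= acceptable_from:
        return "acceptable"
    return "needs work"


for metric, score in [("faithfulness", 0.93), ("context_recall", 0.55)]:
    print(f"{metric}: {score:.2f} -> {classify_score(metric, score)}")
```

Boundary values (for example, exactly 0.9 for faithfulness) are treated as "acceptable" here, which is one reading of the table's ranges.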

In [None]:
from ragas import evaluate
from ragas.metrics._faithfulness import Faithfulness
from ragas.metrics._answer_relevance import ResponseRelevancy
from ragas.metrics._context_precision import LLMContextPrecisionWithoutReference
from ragas.metrics._context_recall import LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import OpenAIEmbeddings
from datasets import Dataset

# Prepare evaluation data - this simulates running your RAG system on test questions
eval_data = {
    "question": [
        "What is the ReAct pattern?",
        "How do embeddings capture semantic meaning?",
        "When should you use multi-agent systems?"
    ],
    "answer": [
        "ReAct combines reasoning and acting in a loop. The agent thinks about what to do, takes an action, observes the result, and then thinks again. This allows for iterative problem solving.",
        "Embeddings map text to dense vectors in a high-dimensional space where similar meanings are positioned close together. The model learns these representations during training on large text corpora.",
        "Multi-agent systems work best when you have distinct, specialized tasks that can be handled by focused agents. They're useful when tasks decompose naturally into parallel or sequential components."
    ],
    "contexts": [
        ["The ReAct pattern alternates between reasoning steps and action steps. In the reasoning phase, the agent analyzes the situation. In the action phase, it executes a tool or operation."],
        ["Embeddings are dense vector representations that capture semantic relationships. Words with similar meanings have similar embedding vectors.", "Semantic similarity is measured by cosine distance between embedding vectors."],
        ["Multi-agent architectures provide benefits when tasks are complex and decomposable. Each agent can specialize in a specific capability."]
    ],
    "ground_truth": [
        "ReAct (Reasoning + Acting) is a pattern where the agent alternates between thinking and taking actions, allowing iterative refinement.",
        "Embeddings capture meaning by mapping text to vectors where semantically similar content has similar vector representations.",
        "Use multi-agent systems when tasks decompose naturally into specialized subtasks that different agents can handle."
    ]
}

dataset = Dataset.from_dict(eval_data)

# Configure RAGAS using LangChain wrappers
ragas_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
ragas_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Initialize metrics
metrics = [
    Faithfulness(llm=ragas_llm),
    ResponseRelevancy(llm=ragas_llm, embeddings=ragas_embeddings),
    LLMContextPrecisionWithoutReference(llm=ragas_llm),
    LLMContextRecall(llm=ragas_llm),
]

# Run evaluation
print("Running RAGAS evaluation (this may take a minute)...")
results = evaluate(
    dataset,
    metrics=metrics,
)

# RAGAS 0.4.x returns per-row results - convert to DataFrame and compute means
results_df = results.to_pandas()

# Get only numeric columns for metric results (covers any float/int width)
metric_columns = results_df.select_dtypes(include="number").columns
print(f"\nResults (mean across {len(results_df)} samples):")
for col in metric_columns:
    print(f"  {col}: {results_df[col].mean():.3f}")

### Interpreting Results and Diagnosing Issues

Each metric points to different failure modes:

- **Low faithfulness** → Model is hallucinating beyond the context. Fix: stronger grounding prompts, lower temperature, more instruction-following model.
- **Low relevancy** → Model isn't understanding or addressing the question. Fix: check prompt structure, add examples.
- **Low precision** → Retrieval is pulling irrelevant documents. Fix: better embeddings, improved chunking, add reranking.
- **Low recall** → Missing information that should be retrieved. Fix: more documents, better search, query expansion.
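
The mapping above can be turned into a quick triage helper. The sketch below is illustrative only: the metric keys, the single `floor` cutoff, and the `diagnose` function are assumptions, and the suggested fixes are taken from the list above.

```python
# Each metric maps to the pipeline stage it implicates and the fixes listed above.
# Metric keys and the single cutoff are assumptions for this sketch.
DIAGNOSES = {
    "faithfulness": ("generation", "stronger grounding prompts, lower temperature"),
    "answer_relevancy": ("generation", "check prompt structure, add examples"),
    "context_precision": ("retrieval", "better embeddings, improved chunking, reranking"),
    "context_recall": ("retrieval", "more documents, better search, query expansion"),
}


def diagnose(scores: dict[str, float], floor: float = 0.7) -> list[str]:
    """List every metric below `floor` with its stage and suggested fixes."""
    findings = []
    for metric, (stage, fix) in DIAGNOSES.items():
        score = scores.get(metric)
        if score is not None and score < floor:
            findings.append(f"{metric}={score:.2f} ({stage}): {fix}")
    return findings


# Illustrative scores, not results from a real run.
example_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.85,
    "context_precision": 0.55,
    "context_recall": 0.62,
}
for finding in diagnose(example_scores):
    print(finding)
```

In this made-up example both findings point at retrieval, which suggests working on the retriever before touching prompts.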

## RAGAS Testset Generation

RAGAS packages knowledge-graph-based test generation into a convenient API. It extracts entities and relationships from your documents, builds a knowledge graph from them, and uses that graph to generate diverse question types.

Note: RAGAS 0.4.x requires wrapped LangChain models via `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper`.

In [None]:
from ragas.testset import TestsetGenerator
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import OpenAIEmbeddings


def load_reference_documents(docs_path: str = "documents", pattern: str = "ref-*.md") -> list[Document]:
    """Load and chunk reference documents for test generation."""
    
    documents_dir = Path(docs_path)
    guide_files = sorted(documents_dir.glob(pattern))
    
    if not guide_files:
        print(f"No files found matching {pattern} in {docs_path}")
        return []
    
    # Use larger chunks for test generation
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1500,
        chunk_overlap=200,
    )
    
    all_chunks = []
    for filepath in guide_files[:3]:  # Limit to 3 files for demo
        loader = TextLoader(str(filepath))
        documents = loader.load()
        chunks = splitter.split_documents(documents)
        all_chunks.extend(chunks)
        print(f"  Loaded {len(chunks)} chunks from {filepath.name}")
    
    return all_chunks


def create_testset_generator() -> TestsetGenerator:
    """Initialize RAGAS testset generator.
    
    Uses gpt-4o-mini to avoid rate limits - RAGAS makes many LLM calls during
    knowledge graph extraction.
    """
    # Use LangChain wrappers (same approach as assignment notebook)
    generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
    generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
    
    return TestsetGenerator(
        llm=generator_llm,
        embedding_model=generator_embeddings
    )


print("RAGAS testset generator functions defined!")

In [None]:
# Load documents
print("Loading reference documents...")
documents = load_reference_documents("documents", "ref-*.md")
print(f"Total chunks: {len(documents)}")

if documents:
    print("\nGenerating testset (this may take a few minutes)...")
    generator = create_testset_generator()
    
    # Generate a small testset for demo
    testset = generator.generate_with_langchain_docs(
        documents=documents[:10],  # Limit chunks for speed
        testset_size=5,  # Small for demo
    )
    
    # Convert to dataframe
    test_df = testset.to_pandas()
    
    # Rename columns for consistency
    if "user_input" in test_df.columns:
        test_df = test_df.rename(columns={"user_input": "question"})
    if "reference" in test_df.columns:
        test_df = test_df.rename(columns={"reference": "ground_truth"})
    
    print(f"\nGenerated {len(test_df)} test cases:")
    display_cols = ["question", "synthesizer_name"] if "synthesizer_name" in test_df.columns else ["question"]
    print(test_df[display_cols].head())
else:
    print("No documents loaded. Skipping testset generation.")
    test_df = None

## Difficulty-Based Test Generation

Different difficulty levels stress different parts of your system:
- **Easy**: Direct lookup questions where the answer appears almost verbatim
- **Medium**: Requires understanding, paraphrasing, connecting ideas
- **Hard**: Edge cases, exceptions, synthesis across multiple concepts

In [None]:
def generate_difficulty_spectrum(
    topic: str,
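    documents: str | list[Document],
    llm: ChatOpenAI
) -> dict[str, list]:
    """Generate questions across difficulty levels."""
    
    test_cases = {"easy": [], "medium": [], "hard": []}
    # Accept raw text or a list of LangChain Documents; use page_content when present
    if isinstance(documents, str):
        doc_sample = documents[:2000]
    else:
        doc_sample = "\n\n".join(getattr(d, "page_content", str(d)) for d in documents)[:2000]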
    
    # Easy: Direct lookup questions
    easy_prompt = f"""Based on these documents about {topic}, generate 3 questions 
where the answer appears almost verbatim in the text. These should be 
straightforward retrieval tasks.

Documents:
{doc_sample}

Format as JSON: [{{"question": "...", "answer": "...", "source_quote": "..."}}]"""
    
    response = llm.invoke(easy_prompt)
    content = extract_json(response.content)
    try:
        test_cases["easy"] = json.loads(content)
    except json.JSONDecodeError:
        test_cases["easy"] = []
    
    # Medium: Requires understanding and paraphrasing
    medium_prompt = f"""Generate 3 questions about {topic} where answering requires:
- Understanding concepts, not just finding text
- Paraphrasing information from the documents
- Connecting two related ideas

The answer shouldn't be copy-pasteable from the documents.

Documents:
{doc_sample}

Format as JSON: [{{"question": "...", "answer": "...", "reasoning": "..."}}]"""
    
    response = llm.invoke(medium_prompt)
    content = extract_json(response.content)
    try:
        test_cases["medium"] = json.loads(content)
    except json.JSONDecodeError:
        test_cases["medium"] = []
    
    # Hard: Edge cases, exceptions, nuanced understanding
    hard_prompt = f"""Generate 2 challenging questions about {topic}:
- Questions about edge cases or exceptions to general rules
- Questions requiring synthesis across multiple concepts
- Questions where naive answers would be incomplete or wrong

Documents:
{doc_sample}

Format as JSON: [{{"question": "...", "answer": "...", "why_hard": "..."}}]"""
    
    response = llm.invoke(hard_prompt)
    content = extract_json(response.content)
    try:
        test_cases["hard"] = json.loads(content)
    except json.JSONDecodeError:
        test_cases["hard"] = []
    
    return test_cases


# Test difficulty spectrum generation
sample_text = """RAG (Retrieval-Augmented Generation) combines retrieval with generation.
The indexing phase involves: Load, Chunk, Embed, Store.
The query phase involves: Embed query, Search, Retrieve, Generate.
Chunking strategies include fixed-size chunks, semantic chunking, and sentence-based chunking.
Vector databases like Qdrant store embeddings for similarity search."""

difficulty_cases = generate_difficulty_spectrum("RAG systems", sample_text, llm)
print(f"Generated test cases by difficulty:")
for level, cases in difficulty_cases.items():
    print(f"  {level}: {len(cases)} questions")
    for c in cases[:1]:
        print(f"    - {c.get('question', 'N/A')[:60]}...")

## Adversarial Test Cases

Adversarial cases reveal how your system handles the unexpected:
- **Unanswerable**: Questions that seem related but can't be answered from the documents
- **Ambiguous**: Questions that could be interpreted multiple ways
- **Misleading**: Questions with false premises or incorrect assumptions
- **Out-of-scope**: Questions related to the topic but outside document coverage

In [None]:
def generate_adversarial_cases(
    topic: str,
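    documents: str | list[Document],
    llm: ChatOpenAI
) -> list[dict]:
    """Generate test cases designed to find weaknesses."""
    
    # Accept raw text or a list of LangChain Documents; use page_content when present
    if isinstance(documents, str):
        doc_sample = documents[:1000]
    else:
        doc_sample = "\n\n".join(getattr(d, "page_content", str(d)) for d in documents)[:1000]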
    
    prompt = f"""Generate adversarial test cases for a Q&A system about {topic}.

Include:

1. UNANSWERABLE questions (2 examples)
   Questions that seem related but can't actually be answered from the documents.
   The system should recognize it doesn't have the information.

2. AMBIGUOUS questions (2 examples)
   Questions that could be interpreted multiple ways.
   The system should ask for clarification or acknowledge ambiguity.

3. MISLEADING questions (2 examples)
   Questions with false premises or incorrect assumptions.
   The system should correct the assumption, not just answer.

4. OUT_OF_SCOPE questions (2 examples)
   Questions related to {topic} but outside what the documents cover.
   The system should acknowledge limitations.

Documents summary: {doc_sample}

For each case, specify:
- question: The adversarial question
- category: Which type (unanswerable, ambiguous, misleading, out_of_scope)
- expected_behavior: How a good system should respond
- failure_mode: What a bad response would look like

Return as JSON array."""
    
    response = llm.invoke(prompt)
    content = extract_json(response.content)
    
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        print("Warning: Could not parse adversarial cases")
        return []


# Generate adversarial cases
adversarial_cases = generate_adversarial_cases("RAG systems", sample_text, llm)
print(f"Generated {len(adversarial_cases)} adversarial test cases:")
for case in adversarial_cases[:4]:
    print(f"  [{case.get('category', 'unknown')}] {case.get('question', 'N/A')[:50]}...")
    print(f"    Expected: {case.get('expected_behavior', 'N/A')[:60]}...")

## Coverage Verification

After generating test cases, verify you have adequate coverage across topics and question types. Gaps indicate areas where you need additional test cases.

In [None]:
def analyze_coverage(
    test_cases: list[dict],
    expected_topics: list[str],
    expected_types: list[str]
) -> dict:
    """Analyze whether test cases cover expected dimensions."""
    
    # Track what's covered
    topic_coverage = {topic: 0 for topic in expected_topics}
    type_coverage = {t: 0 for t in expected_types}
    
    for case in test_cases:
        # Check topic coverage
        question = case.get("question", "").lower()
        for topic in expected_topics:
            if topic.lower() in question:
                topic_coverage[topic] += 1
        
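        # Check type coverage; cases may label their type under
        # "type", "category" (adversarial), or "difficulty" (generate_test_cases)
        case_type = case.get("type", case.get("category", case.get("difficulty", "unknown")))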
        if case_type in type_coverage:
            type_coverage[case_type] += 1
    
    # Identify gaps
    gaps = {
        "missing_topics": [t for t, c in topic_coverage.items() if c == 0],
        "underrepresented_topics": [t for t, c in topic_coverage.items() if 0 < c < 3],
        "missing_types": [t for t, c in type_coverage.items() if c == 0]
    }
    
    return {
        "topic_coverage": topic_coverage,
        "type_coverage": type_coverage,
        "gaps": gaps,
        "total_cases": len(test_cases)
    }


# Collect all test cases generated so far
all_test_cases = []

# Add adversarial cases
if adversarial_cases:
    all_test_cases.extend(adversarial_cases)

# Add difficulty-based cases
for level in ["easy", "medium", "hard"]:
    for c in difficulty_cases.get(level, []):
        all_test_cases.append({"type": level, **c})

if all_test_cases:
    coverage = analyze_coverage(
        all_test_cases,
        expected_topics=["RAG", "retrieval", "embedding", "chunking", "generation"],
        expected_types=["easy", "medium", "hard", "unanswerable", "ambiguous", "misleading", "out_of_scope"]
    )
    
    print(f"Coverage Analysis ({coverage['total_cases']} total cases):")
    print(f"\nTopic coverage:")
    for topic, count in coverage["topic_coverage"].items():
        print(f"  {topic}: {count}")
    
    print(f"\nType coverage:")
    for qtype, count in coverage["type_coverage"].items():
        print(f"  {qtype}: {count}")
    
    print(f"\nGaps identified:")
    print(f"  Missing topics: {coverage['gaps']['missing_topics']}")
    print(f"  Underrepresented: {coverage['gaps']['underrepresented_topics']}")
    print(f"  Missing types: {coverage['gaps']['missing_types']}")
else:
    print("No test cases available for coverage analysis.")

## Data Quality Validation

Not every generated test case is good. Build a validation pipeline to filter out low-quality cases: questions that don't make sense, incorrect expected answers, or near-duplicates.
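
Near-duplicates can be caught without an LLM call. The sketch below uses the standard library's `difflib.SequenceMatcher` on normalized question text; it is one simple option rather than the notebook's method, and `drop_near_duplicates`, `normalize`, and the 0.85 threshold are assumptions chosen for illustration. Embedding-based similarity would likely catch paraphrases that this character-level comparison misses.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def drop_near_duplicates(cases: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep the first occurrence of each question; drop later near-copies."""
    kept: list[dict] = []
    for case in cases:
        question = normalize(case.get("question", ""))
        is_duplicate = any(
            SequenceMatcher(None, question, normalize(k.get("question", ""))).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(case)
    return kept


# Illustrative data: the first two questions differ only in capitalization.
cases = [
    {"question": "What is retrieval-augmented generation?"},
    {"question": "What is Retrieval-Augmented Generation?"},
    {"question": "How does chunk overlap affect retrieval recall?"},
]
unique = drop_near_duplicates(cases)
print(f"Kept {len(unique)} of {len(cases)} cases")
```

The comparison is quadratic in the number of cases, which is fine for test sets of a few hundred questions but would need a different approach at larger scale.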

In [None]:
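def validate_test_case(case: dict, documents: str | list[Document], llm: ChatOpenAI) -> dict:
    """Validate a generated test case."""
    
    # Accept raw text or a list of LangChain Documents; use page_content when present
    if isinstance(documents, str):
        doc_sample = documents[:2000]
    else:
        doc_sample = "\n\n".join(getattr(d, "page_content", str(d)) for d in documents)[:2000]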
    
    validation_prompt = f"""Evaluate this test case for quality:

Question: {case.get('question', 'N/A')}
Expected Answer: {case.get('answer', case.get('expected_answer', 'N/A'))}

Reference Documents:
{doc_sample}

Evaluate (score 1-5 each):
1. Is the question clear and unambiguous?
2. Is the expected answer actually correct based on the documents?
3. Can the question be reasonably answered from the documents?
4. Is this a useful test case (not trivial, not impossible)?

Return as JSON: {{"scores": {{"clarity": N, "correctness": N, "answerability": N, "usefulness": N}}, "issues": [...], "recommendation": "keep|revise|discard"}}"""
    
    response = llm.invoke(validation_prompt)
    content = extract_json(response.content)
    
    try:
        validation = json.loads(content)
    except json.JSONDecodeError:
        validation = {"scores": {"clarity": 3, "correctness": 3, "answerability": 3, "usefulness": 3}, 
                      "issues": ["Could not parse validation"], "recommendation": "revise"}
    
    scores = validation.get("scores", {})
    quality_score = sum(scores.values()) / len(scores) if scores else 0
    
    return {
        **case,
        "validation": validation,
        "quality_score": quality_score
    }


def filter_test_cases(
    cases: list[dict], 
    min_quality: float = 3.5
) -> list[dict]:
    """Keep only high-quality test cases."""
    return [c for c in cases if c.get("quality_score", 0) >= min_quality]


# Validate a sample test case
if all_test_cases:
    sample_case = all_test_cases[0]
    validated = validate_test_case(sample_case, sample_text, llm)
    print(f"Validation result:")
    print(f"  Question: {validated.get('question', 'N/A')[:50]}...")
    print(f"  Quality score: {validated.get('quality_score', 0):.2f}")
    print(f"  Recommendation: {validated.get('validation', {}).get('recommendation', 'N/A')}")
    print(f"  Issues: {validated.get('validation', {}).get('issues', [])}")
else:
    print("No test cases to validate.")

## Metrics-Driven Development

Evaluation isn't a one-time activity; it uses measurements to drive systematic improvement. The loop:

1. **Measure**: Run your evaluation suite and record scores
2. **Analyze**: Identify what's causing low scores
3. **Hypothesize**: Form a theory about what would improve things
4. **Change**: Implement your proposed improvement
5. **Measure again**: See if it actually helped
6. **Iterate**: Keep going until you hit your targets
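
The "Change" and "Measure again" steps usually end in a keep-or-revert decision. A minimal sketch of such a gate follows, assuming a change is accepted when at least one metric improves by more than a tolerance and none regresses by more than it; `evaluate_change` and the 0.02 tolerance are hypothetical choices, not a recommendation.

```python
def evaluate_change(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict:
    """Compare two metric snapshots and decide whether to keep the change."""
    # Only compare metrics present in both snapshots
    deltas = {m: candidate[m] - baseline[m] for m in baseline if m in candidate}
    regressions = {m: d for m, d in deltas.items() if d < -tolerance}
    improved = [m for m, d in deltas.items() if d > tolerance]
    return {
        "deltas": deltas,
        "regressions": regressions,
        "accept": bool(improved) and not regressions,
    }


# Illustrative numbers only.
baseline = {"faithfulness": 0.78, "precision": 0.71}
decision = evaluate_change(baseline, {"faithfulness": 0.85, "precision": 0.70})
print(f"accept={decision['accept']}, regressions={decision['regressions']}")
```

The tolerance exists because LLM-judged metrics vary from run to run; small movements in either direction are treated as noise.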

In [None]:
class EvaluationTracker:
    """Track evaluation results over time."""
    
    def __init__(self, experiment_name: str):
        self.experiment_name = experiment_name
        self.runs = []
    
    def record_run(
        self, 
        version: str, 
        metrics: dict, 
        config: dict,
        notes: str = ""
    ):
        """Record an evaluation run."""
        self.runs.append({
            "timestamp": datetime.now(UTC).isoformat(),
            "version": version,
            "metrics": metrics,
            "config": config,
            "notes": notes
        })
    
    def compare_versions(self, v1: str, v2: str) -> dict:
        """Compare metrics between two versions."""
        run1 = next((r for r in self.runs if r["version"] == v1), None)
        run2 = next((r for r in self.runs if r["version"] == v2), None)
        
        if not run1 or not run2:
            return {"error": "Version not found"}
        
        comparison = {}
        for metric in run1["metrics"]:
            # Skip metrics that were not recorded in both runs
            if metric not in run2["metrics"]:
                continue
            old = run1["metrics"][metric]
            new = run2["metrics"][metric]
            comparison[metric] = {
                "old": old,
                "new": new,
                "change": new - old,
                "percent_change": (new - old) / old * 100 if old != 0 else float('inf')
            }
        
        return comparison
    
    def get_trend(self, metric: str) -> list:
        """Get trend of a metric over time."""
        return [
            {"version": r["version"], "value": r["metrics"].get(metric)}
            for r in sorted(self.runs, key=lambda x: x["timestamp"])
        ]
    
    def check_regression(self, metric: str, threshold: float = 0.05) -> bool:
        """Check if recent changes caused regression beyond threshold."""
        if len(self.runs) < 2:
            return False
        
        recent = self.runs[-1]["metrics"].get(metric, 0)
        previous = self.runs[-2]["metrics"].get(metric, 0)
        
        return (previous - recent) / previous > threshold if previous > 0 else False


# Demonstrate the tracker
tracker = EvaluationTracker("rag-evaluation")

# Simulate recording some runs
tracker.record_run(
    version="v1.0",
    metrics={"faithfulness": 0.78, "relevancy": 0.82, "precision": 0.71},
    config={"model": "gpt-4o-mini", "k": 3},
    notes="Baseline"
)

tracker.record_run(
    version="v1.1",
    metrics={"faithfulness": 0.85, "relevancy": 0.84, "precision": 0.75},
    config={"model": "gpt-4o-mini", "k": 5},
    notes="Increased retrieval k"
)

tracker.record_run(
    version="v1.2",
    metrics={"faithfulness": 0.88, "relevancy": 0.86, "precision": 0.79},
    config={"model": "gpt-4o-mini", "k": 5, "rerank": True},
    notes="Added reranking"
)

print("Evaluation Tracker Demo:")
print(f"\nRuns recorded: {len(tracker.runs)}")

comparison = tracker.compare_versions("v1.0", "v1.2")
print(f"\nComparison v1.0 -> v1.2:")
for metric, data in comparison.items():
    print(f"  {metric}: {data['old']:.3f} -> {data['new']:.3f} ({data['percent_change']:+.1f}%)")

print(f"\nFaithfulness trend:")
for point in tracker.get_trend("faithfulness"):
    print(f"  {point['version']}: {point['value']:.3f}")

In [None]:
class QualityThresholds:
    """Define quality thresholds for the system."""
    
    # Minimum acceptable scores (system is broken below this)
    MINIMUM = {
        "faithfulness": 0.7,
        "relevance": 0.6,
        "retrieval_precision": 0.5
    }
    
    # Target scores (what we're aiming for)
    TARGET = {
        "faithfulness": 0.9,
        "relevance": 0.85,
        "retrieval_precision": 0.8
    }
    
    # Stretch goals (excellence)
    STRETCH = {
        "faithfulness": 0.95,
        "relevance": 0.92,
        "retrieval_precision": 0.9
    }
    
    @classmethod
    def assess(cls, metrics: dict) -> str:
        """Assess current metrics against thresholds."""
        
        # Check for any below minimum
        for metric, min_val in cls.MINIMUM.items():
            if metrics.get(metric, 0) < min_val:
                return f"CRITICAL: {metric} below minimum threshold ({metrics.get(metric, 0):.3f} < {min_val})"
        
        # Check if meeting targets
        meeting_targets = all(
            metrics.get(m, 0) >= t 
            for m, t in cls.TARGET.items()
        )
        
        if meeting_targets:
            return "GOOD: Meeting all targets"
        else:
            below_target = [
                m for m, t in cls.TARGET.items() 
                if metrics.get(m, 0) < t
            ]
            return f"IMPROVING: Below target on {', '.join(below_target)}"


# Test threshold assessment
test_metrics = {"faithfulness": 0.85, "relevance": 0.88, "retrieval_precision": 0.75}
assessment = QualityThresholds.assess(test_metrics)
print(f"Metrics: {test_metrics}")
print(f"Assessment: {assessment}")

## LangSmith Integration

LangSmith provides infrastructure for managing evaluation datasets, running evaluations at scale, and tracking results over time. This section demonstrates graceful handling when LangSmith is not configured.

In [None]:
def get_langsmith_client():
    """Get LangSmith client with graceful error handling."""
    try:
        from langsmith import Client
        client = Client()
        # Test connection
        list(client.list_datasets(limit=1))
        return client
    except Exception as e:
        print(f"LangSmith not available: {e}")
        print("Skipping LangSmith-specific functionality.")
        print("To use LangSmith, set LANGSMITH_API_KEY in your environment.")
        return None


# Check LangSmith availability
langsmith_client = get_langsmith_client()

In [None]:
def create_evaluation_dataset(client, dataset_name: str, examples: list[dict]):
    """Create a LangSmith dataset for evaluation."""
    if client is None:
        print("LangSmith not available. Returning examples as local dataset.")
        return {"name": dataset_name, "examples": examples, "local": True}
    
    try:
        # Create dataset
        dataset = client.create_dataset(
            dataset_name=dataset_name,
            description="Synthetic test cases for RAG evaluation"
        )
        
        # Add examples
        client.create_examples(
            inputs=[e["inputs"] for e in examples],
            outputs=[e["outputs"] for e in examples],
            dataset_id=dataset.id
        )
        
        return {"name": dataset_name, "id": dataset.id, "local": False}
    except Exception as e:
        print(f"Error creating dataset: {e}")
        return {"name": dataset_name, "examples": examples, "local": True, "error": str(e)}


# Prepare example data
example_data = [
    {
        "inputs": {"question": "What is RAG?"},
        "outputs": {"answer": "RAG (Retrieval-Augmented Generation) combines retrieval with generation..."}
    },
    {
        "inputs": {"question": "How do embeddings work?"},
        "outputs": {"answer": "Embeddings map text to vectors in a space where similar meanings are close together..."}
    }
]

# Create dataset (will use local fallback if LangSmith unavailable)
dataset_result = create_evaluation_dataset(
    langsmith_client,
    f"rag-evaluation-demo-{datetime.now().strftime('%Y%m%d-%H%M%S')}",
    example_data
)

print(f"Dataset created: {dataset_result['name']}")
print(f"Local mode: {dataset_result.get('local', False)}")

In [None]:
def run_langsmith_evaluation(client, dataset_name: str, system_func, evaluators: list):
    """Run evaluation using LangSmith."""
    if client is None:
        print("LangSmith not available. Simulating evaluation locally...")
        # Simulate local evaluation
        return {
            "status": "simulated",
            "message": "LangSmith not configured. In production, this would run against the dataset.",
            "dataset": dataset_name
        }
    
    try:
        from langsmith.evaluation import evaluate
        
        results = evaluate(
            system_func,
            data=dataset_name,
            evaluators=evaluators,
            experiment_prefix="demo"
        )
        
        return {"status": "completed", "results": results}
    except Exception as e:
        print(f"Evaluation error: {e}")
        return {"status": "error", "error": str(e)}


# Define a simple system function for evaluation
def simple_rag_system(inputs: dict) -> dict:
    """Simple RAG system for demonstration."""
    question = inputs.get("question", "")
    # In production, this would call your actual RAG pipeline
    response = llm.invoke(f"Answer briefly: {question}")
    return {"answer": response.content, "contexts": ["[Demo context]"]}


# Run evaluation (will simulate if LangSmith unavailable)
eval_result = run_langsmith_evaluation(
    langsmith_client,
    dataset_result['name'],
    simple_rag_system,
    []  # Evaluators would be defined here
)

print(f"Evaluation status: {eval_result['status']}")
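
The evaluation above runs with an empty evaluator list. As a rough sketch of what an evaluator might look like, the function below scores word overlap between the system answer and the reference answer. It assumes a recent LangSmith SDK that accepts plain functions with arguments named `outputs` and `reference_outputs` and a returned dict with `key` and `score`; check this against your installed version. `word_overlap` is a hypothetical name, and word overlap is a crude stand-in for a real quality metric.

```python
def word_overlap(outputs: dict, reference_outputs: dict) -> dict:
    """Score the share of reference-answer words that appear in the system answer."""
    answer_words = set(outputs.get("answer", "").lower().split())
    reference_words = set(reference_outputs.get("answer", "").lower().split())
    if not reference_words:
        # No reference to compare against
        return {"key": "word_overlap", "score": 0.0}
    score = len(answer_words & reference_words) / len(reference_words)
    return {"key": "word_overlap", "score": score}


# Called directly here; with LangSmith configured it could be passed in the evaluators list.
result = word_overlap(
    {"answer": "RAG combines retrieval with generation"},
    {"answer": "RAG combines retrieval and generation"},
)
print(result)
```

Because the function is plain Python, it can be unit-tested locally before being used in a LangSmith experiment.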

## Bringing It All Together

We've covered the complete evaluation infrastructure:

1. **Synthetic Data Generation** - Creating test cases at scale
2. **RAGAS Framework** - Automated metrics for RAG systems (faithfulness, relevancy, precision, recall)
3. **RAGAS Testset Generation** - Automatic test generation from documents using knowledge graphs
4. **Difficulty Spectrum** - Testing across easy, medium, and hard cases
5. **Adversarial Cases** - Finding system weaknesses
6. **Coverage Analysis** - Ensuring comprehensive testing
7. **Quality Validation** - Filtering low-quality test cases
8. **Metrics-Driven Development** - Using measurements to drive improvement
9. **LangSmith Integration** - Production-grade evaluation infrastructure

The key insight: **evaluation transforms intuition into data**. Instead of "I think it's working," you can say "faithfulness is 0.88 and improving."

In [None]:
print("=" * 60)
print("COMPLETE EVALUATION PIPELINE DEMO")
print("=" * 60)

# 1. Load documents
print("\n1. Loading documents...")
docs = load_reference_documents("documents", "ref-*.md")
print(f"   Loaded {len(docs)} document chunks")

# 2. Generate test cases
print("\n2. Generating test cases...")
if docs:
    # Generate a few test cases
    simple_cases = generate_test_cases(
        topic="AI Engineering",
        num_cases=4,
        difficulty_distribution={"easy": 2, "medium": 2},
        llm=llm
    )
    print(f"   Generated {len(simple_cases)} test cases")
else:
    simple_cases = []
    print("   No documents available")

# 3. Analyze coverage
print("\n3. Analyzing coverage...")
if simple_cases:
    coverage = analyze_coverage(
        simple_cases,
        expected_topics=["RAG", "embedding", "retrieval"],
        expected_types=["easy", "medium"]
    )
    print(f"   Total cases: {coverage['total_cases']}")
    print(f"   Missing topics: {coverage['gaps']['missing_topics']}")

# 4. Track metrics
print("\n4. Tracking metrics...")
demo_tracker = EvaluationTracker("demo-experiment")
demo_tracker.record_run(
    version="demo-v1",
    metrics={"faithfulness": 0.82, "relevancy": 0.85},
    config={"model": "gpt-4o-mini"},
    notes="Demo run"
)
print(f"   Recorded run: {demo_tracker.runs[-1]['version']}")

# 5. Assess against thresholds
print("\n5. Assessing against thresholds...")
assessment = QualityThresholds.assess({
    "faithfulness": 0.82,
    "relevance": 0.85,
    "retrieval_precision": 0.78
})
print(f"   {assessment}")

print("\n" + "=" * 60)
print("Evaluation infrastructure ready for production use!")
print("=" * 60)