# Scenario 07: Evaluation, Observability, and Prompt Evolution

**Estimated Time**: 45 minutes

## Learning Objectives
- Collect and analyze agent performance metrics
- Implement accuracy, latency, cost, and quality evaluators
- Apply prompt tuning with version tracking
- Run A/B tests to compare prompt variants

## Prerequisites
- Completed Scenario 01 (Simple Agent + MCP)
- Familiarity with metrics and observability concepts

In [None]:
# Setup and imports
import sys
from pathlib import Path
import time

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from src.common.evaluation import (
    MetricsCollector,
    MetricType,
    ExactMatchEvaluator,
    ContainsEvaluator,
    SemanticSimilarityEvaluator,
    OpenAICostCalculator,
    create_collector,
    evaluate_response,
    estimate_cost,
)
from src.common.prompt_tuning import (
    PromptTuner,
    PromptAnalyzer,
    PromptTemplate,
    PromptVariable,
    create_tuner,
    analyze_prompt,
    create_template,
)

print("✅ Imports successful")

## Part 1: Metrics Collection

The `MetricsCollector` records latency, accuracy, cost, and quality metrics in one place, so they can be queried and aggregated later.

In [None]:
# Create a metrics collector
collector = create_collector()

print("Created MetricsCollector")
print(f"Default evaluator: {type(collector.evaluator).__name__}")
print(f"Cost calculator: {type(collector.cost_calculator).__name__}")

In [None]:
# Measure latency of an operation
def simulate_llm_call():
    """Simulate an LLM API call."""
    time.sleep(0.1)  # Simulate 100ms latency
    return "Simulated response"

# Use context manager to measure latency
with collector.measure_latency("llm_call"):
    result = simulate_llm_call()

# Check collected metrics
latency_metrics = collector.get_metrics(metric_type=MetricType.LATENCY)
print(f"\nCollected latency metrics: {len(latency_metrics)}")
for m in latency_metrics:
    print(f"  {m.name}: {m.value:.2f}ms")
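
To make the context-manager pattern above concrete, here is a minimal standalone sketch of how such a timer could work. It is not the `MetricsCollector` implementation; the `measure_latency` function and the `recorded_latencies` list below are hypothetical stand-ins built only on the standard library.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory store; the real collector's storage is not shown here.
recorded_latencies = []


@contextmanager
def measure_latency(name):
    """Time the enclosed block and record its duration in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record in `finally` so a failing block is still timed.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        recorded_latencies.append((name, elapsed_ms))


with measure_latency("llm_call"):
    time.sleep(0.05)  # stands in for the LLM call

for name, elapsed_ms in recorded_latencies:
    print(f"{name}: {elapsed_ms:.2f}ms")
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short durations.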

## Part 2: Accuracy Evaluation

Measure how well agent outputs match expected results.

In [None]:
# Test different evaluation methods
test_cases = [
    ("Hello world", "Hello world", "Exact match"),
    ("Paris", "The capital of France is Paris.", "Contains"),
    ("AI is transforming technology", "Artificial intelligence is revolutionizing tech", "Semantic"),
]

print("Evaluation Methods Comparison:\n")

for expected, actual, desc in test_cases:
    print(f"{desc}:")
    print(f"  Expected: '{expected}'")
    print(f"  Actual: '{actual}'")
    
    # Try different evaluators
    exact = evaluate_response(expected, actual, "exact_match")
    contains = evaluate_response(expected, actual, "contains")
    semantic = evaluate_response(expected, actual, "semantic")
    
    print(f"  Exact match: {exact.is_correct} (score: {exact.similarity_score:.2f})")
    print(f"  Contains: {contains.is_correct} (score: {contains.similarity_score:.2f})")
    print(f"  Semantic: {semantic.is_correct} (score: {semantic.similarity_score:.2f})")
    print()
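
The three methods differ mainly in how strict they are. The sketch below illustrates that difference with standard-library code only. The function names are hypothetical, and `lexical_similarity` uses character overlap from `difflib` as a rough stand-in: it is not semantic similarity, which would normally rely on embeddings, and it is not how `SemanticSimilarityEvaluator` is implemented.

```python
from difflib import SequenceMatcher


def exact_match(expected, actual):
    """Strictest check: the strings must be identical after trimming."""
    return expected.strip() == actual.strip()


def contains(expected, actual):
    """Looser check: the expected text appears somewhere in the output."""
    return expected.lower() in actual.lower()


def lexical_similarity(expected, actual):
    """Character-overlap ratio in [0, 1]; a crude proxy, not a semantic score."""
    return SequenceMatcher(None, expected.lower(), actual.lower()).ratio()


expected = "Paris"
actual = "The capital of France is Paris."
print("exact:", exact_match(expected, actual))
print("contains:", contains(expected, actual))
print("lexical:", round(lexical_similarity(expected, actual), 2))
```

A paraphrase with few shared characters will score low on the lexical ratio even when the meaning matches, which is the case a semantic evaluator is meant to cover.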

In [None]:
# Record accuracy metrics
accuracy = collector.record_accuracy(
    expected="The answer is 42",
    actual="The answer is 42",
)

print(f"Accuracy metric:")
print(f"  Is correct: {accuracy.is_correct}")
print(f"  Similarity score: {accuracy.similarity_score}")
print(f"  Method: {accuracy.evaluation_method}")

## Part 3: Cost Estimation

Track and estimate token costs across different models.

In [None]:
# Estimate costs for different models
models = ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"]
input_tokens = 1000
output_tokens = 500

print(f"Cost comparison for {input_tokens} input + {output_tokens} output tokens:\n")

for model in models:
    cost = estimate_cost(input_tokens, output_tokens, model)
    print(f"  {model}: ${cost:.6f}")
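
Token-based pricing is usually a linear formula: input and output tokens are billed at separate per-token rates. The sketch below shows that arithmetic. The model name and prices are placeholders chosen for easy math, not real rates, and `token_cost` is a hypothetical helper rather than the `estimate_cost` implementation.

```python
# Placeholder prices in USD per 1 million tokens: (input, output).
# These are NOT real rates; check your provider's current pricing.
PRICES_PER_MILLION = {
    "example-model": (1.00, 2.00),
}


def token_cost(input_tokens, output_tokens, model):
    """Return the estimated cost in USD for one call."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (
        input_tokens / 1_000_000 * input_price
        + output_tokens / 1_000_000 * output_price
    )


cost = token_cost(1000, 500, "example-model")
print(f"example-model: ${cost:.6f}")
```

Because output tokens are often priced higher than input tokens, limiting response length can reduce cost more than trimming the prompt by the same number of tokens.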

In [None]:
# Record cost metrics
cost = collector.record_cost(
    operation="generate_response",
    input_tokens=500,
    output_tokens=250,
    model="gpt-4o-mini",
)

print(f"Cost metric:")
print(f"  Operation: {cost.operation}")
print(f"  Total tokens: {cost.total_tokens}")
print(f"  Cost USD: ${cost.cost_usd:.6f}")
print(f"  Model: {cost.model}")

## Part 4: Quality Scoring

Assess response quality across multiple dimensions.

In [None]:
# Record quality scores for different dimensions
quality_dimensions = [
    ("relevance", 0.9, "Response directly addresses the question"),
    ("clarity", 0.85, "Well-structured and easy to understand"),
    ("completeness", 0.75, "Covers most aspects but missing some detail"),
    ("accuracy", 0.95, "Factually correct information"),
]

print("Quality Scores:\n")

for dimension, score, explanation in quality_dimensions:
    quality = collector.record_quality(
        dimension=dimension,
        score=score,
        explanation=explanation,
    )
    print(f"  {dimension}: {quality.score:.0%} - {explanation}")

## Part 5: Creating Complete Evaluations

Combine all metrics into a comprehensive evaluation result.

In [None]:
# Create a complete evaluation
from src.common.evaluation import LatencyMetric, AccuracyMetric, CostMetric, QualityScore
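from datetime import datetime, timedelta

# Derive end_time from start_time so the timestamps agree with duration_ms
start_time = datetime.now()
duration_ms = 150.5

evaluation = collector.create_evaluation(
    run_id="run_001",
    agent_name="research_agent",
    latency=LatencyMetric(
        operation="query",
        duration_ms=duration_ms,
        start_time=start_time,
        end_time=start_time + timedelta(milliseconds=duration_ms),
    ),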
    accuracy=AccuracyMetric(
        expected="Expected output",
        actual="Actual output from agent",
        is_correct=True,
        similarity_score=0.85,
    ),
    cost=CostMetric(
        operation="query",
        input_tokens=200,
        output_tokens=150,
        total_tokens=350,
        cost_usd=0.00023,
        model="gpt-4o-mini",
    ),
    quality_scores=[
        QualityScore("relevance", 0.9, "Highly relevant"),
        QualityScore("clarity", 0.85, "Clear response"),
    ],
)

print(f"Evaluation Result:")
print(f"  Run ID: {evaluation.run_id}")
print(f"  Agent: {evaluation.agent_name}")
print(f"  Overall Quality: {evaluation.overall_quality:.0%}")
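
How `overall_quality` is derived from the individual scores is not shown in this notebook. One plausible approach is a plain or weighted mean over the dimensions, sketched below. The function names and the weights are assumptions for illustration only.

```python
def mean_quality(scores):
    """Unweighted mean of per-dimension scores (each in [0, 1])."""
    if not scores:
        raise ValueError("at least one quality score is required")
    return sum(scores.values()) / len(scores)


def weighted_quality(scores, weights):
    """Weighted mean; dimensions missing from `weights` default to 1.0."""
    if not scores:
        raise ValueError("at least one quality score is required")
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    weighted_sum = sum(
        score * weights.get(name, 1.0) for name, score in scores.items()
    )
    return weighted_sum / total_weight


scores = {"relevance": 0.9, "clarity": 0.85}
print(f"Mean: {mean_quality(scores):.3f}")
# Hypothetical weighting that counts relevance twice as much as clarity.
print(f"Weighted: {weighted_quality(scores, {'relevance': 2.0}):.3f}")
```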

## Part 6: Metrics Aggregation

Aggregate metrics to understand trends and patterns.

In [None]:
# Simulate multiple operations to aggregate
import random

for i in range(10):
    # Simulate varying latencies
    latency_ms = 100 + random.random() * 100
    collector.record_custom(
        name="api_latency",
        value=latency_ms,
        metric_type=MetricType.LATENCY,
        tags={"endpoint": "generate"},
    )

# Aggregate the results
aggregated = collector.aggregate("api_latency")

print("Aggregated Latency Metrics:\n")
print(f"  Count: {aggregated.count}")
print(f"  Average: {aggregated.avg_value:.2f}ms")
print(f"  Min: {aggregated.min_value:.2f}ms")
print(f"  Max: {aggregated.max_value:.2f}ms")
print(f"  P50: {aggregated.p50_value:.2f}ms")
print(f"  P95: {aggregated.p95_value:.2f}ms")
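
For reference, the aggregate fields above can be computed with a few lines of standard-library code. This sketch uses the nearest-rank percentile definition. The collector may use a different method, such as linear interpolation, so its P50 and P95 values can differ slightly on small samples. The `percentile` and `aggregate` helpers are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("cannot take a percentile of an empty list")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def aggregate(values):
    """Summarize a list of measurements into count, average, range, and percentiles."""
    return {
        "count": len(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
    }


# Fixed sample latencies in milliseconds, used instead of random values
# so the result is reproducible.
latencies = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
for key, value in aggregate(latencies).items():
    print(f"{key}: {value}")
```

With only ten samples, P95 is determined by the single largest value, so tail percentiles are only meaningful once many more measurements have been collected.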

In [None]:
# Get summary of all metrics
summary = collector.summary()

print("Metrics Summary:\n")
print(f"  Total metrics: {summary['total_metrics']}")
print(f"  Total evaluations: {summary['total_evaluations']}")
print(f"  Metric types: {summary['metric_types']}")

## Part 7: Prompt Analysis

Analyze prompts for quality and get improvement suggestions.

In [None]:
# Analyze a simple prompt
simple_prompt = "Summarize this text."

analysis = analyze_prompt(simple_prompt)

print(f"Analysis of: '{simple_prompt}'\n")
print(f"  Word count: {analysis.word_count}")
print(f"  Has role: {analysis.has_role}")
print(f"  Has examples: {analysis.has_examples}")
print(f"  Has constraints: {analysis.has_constraints}")
print(f"  Has output format: {analysis.has_output_format}")
print(f"\nSuggestions:")
for s in analysis.suggestions:
    print(f"  - [{s.improvement_type.value}] {s.description}")

In [None]:
# Analyze a more complete prompt
complete_prompt = """
You are an expert technical writer.

Summarize the following text in exactly 3 bullet points.
Focus on the key technical concepts.

Important: Do not include opinions, only facts.

Example:
Input: Python is a programming language known for readability.
Output:
- Python is a programming language
- Known for code readability
- Popular for beginners and experts

Format: Return as a markdown bullet list.
"""

analysis = analyze_prompt(complete_prompt)

print(f"Analysis of complete prompt:\n")
print(f"  Word count: {analysis.word_count}")
print(f"  Has role: {analysis.has_role}")
print(f"  Has examples: {analysis.has_examples}")
print(f"  Has constraints: {analysis.has_constraints}")
print(f"  Has output format: {analysis.has_output_format}")
print(f"  Complexity: {analysis.complexity_score:.2f}")
print(f"\nSuggestions: {len(analysis.suggestions)}")
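
Structural checks like `has_role` and `has_examples` can be approximated with simple pattern matching. The sketch below is a hypothetical heuristic, not the `PromptAnalyzer` implementation. The patterns are assumptions and will miss prompts that phrase these elements differently.

```python
import re


def quick_analyze(prompt):
    """Detect common prompt elements with keyword heuristics."""
    text = prompt.lower()
    return {
        "word_count": len(prompt.split()),
        # A role statement such as "You are an expert ..."
        "has_role": bool(re.search(r"\byou are (a|an)\b", text)),
        # A worked example, signalled by "Example" or an "Input:" label
        "has_examples": bool(re.search(r"\bexample\b|\binput:", text)),
        # An explicit output instruction such as "Format:" or "Return as ..."
        "has_output_format": bool(re.search(r"\bformat\b|\breturn (as|only)\b", text)),
    }


print(quick_analyze("Summarize this text."))
print(quick_analyze(
    "You are an expert technical writer.\n"
    "Example:\nInput: some text\n"
    "Format: Return as a markdown bullet list."
))
```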

## Part 8: Prompt Version Tracking

Track prompt versions and iterate improvements.

In [None]:
# Create a prompt tuner
tuner = create_tuner()

# Register initial prompt version
v1 = tuner.create_prompt(
    name="summarize",
    content="Summarize this text.",
    metadata={"author": "team"},
)

print(f"Created prompt:")
print(f"  Name: summarize")
print(f"  Version: {v1.version}")
print(f"  Status: {v1.status.value}")
print(f"  Hash: {v1.hash}")
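
A per-version hash makes it possible to tell whether two versions have identical content. The tuner's actual hashing scheme is not shown here. The sketch below assumes a truncated SHA-256 digest of the prompt text, and `content_hash` is a hypothetical helper.

```python
import hashlib


def content_hash(content, length=8):
    """Return a short, stable fingerprint of the prompt text."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return digest[:length]


v1_text = "Summarize this text."
v2_text = "You are a helpful assistant.\nSummarize the following text concisely."

print("v1:", content_hash(v1_text))
print("v2:", content_hash(v2_text))
# Re-hashing unchanged content yields the same fingerprint.
print("unchanged:", content_hash(v1_text) == content_hash("Summarize this text."))
```

Truncating the digest keeps it readable in logs at the cost of a higher collision chance, which is acceptable for a handful of versions per prompt.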

In [None]:
# Iterate on the prompt
v2 = tuner.iterate(
    name="summarize",
    new_content="""You are a helpful assistant.
Summarize the following text concisely.""",
    changes="Added role definition",
)

v3 = tuner.iterate(
    name="summarize",
    new_content="""You are a helpful assistant.
Summarize the following text in 3 bullet points.
Focus on key points only.""",
    changes="Added structure and constraints",
)

print("Prompt iterations:")
for v in tuner.registry.list_versions("summarize"):
    print(f"  {v.version}: {v.status.value} - {v.changes or 'Initial version'}")

In [None]:
# Get version history
history = tuner.get_history("summarize")

print("Version History:\n")
for entry in history:
    print(f"  {entry['version']}: {entry['status']} ({entry['hash']})")

In [None]:
# Promote a version to active
tuner.promote("summarize", "v1.2")

active = tuner.registry.get_active("summarize")
print(f"Active version: {active.version}")
print(f"Content: {active.content[:100]}...")

## Part 9: Prompt Templates

Create reusable prompt templates with variables.

In [None]:
# Create a template with variables
template = create_template(
    template="""You are a {role}.

{task}

Use a {tone} tone.
Maximum length: {max_length} words.""",
    variables=[
        {"name": "role", "description": "The AI's role", "required": True},
        {"name": "task", "description": "The task to perform", "required": True},
        {"name": "tone", "description": "Communication tone", "default": "professional"},
        {"name": "max_length", "description": "Maximum words", "default": "200"},
    ],
    name="task_template",
)

print(f"Template: {template.name}")
print(f"Variables: {template.get_variable_names()}")

In [None]:
# Render the template
rendered = template.render(
    role="data analyst",
    task="Analyze the sales data and identify trends.",
    tone="casual",
)

print("Rendered prompt:\n")
print(rendered)
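
The rendering step resolves each variable from the supplied values, falls back to its default, and fails when a required variable is missing. A minimal sketch of that logic using `str.format` follows. The `render_template` function is hypothetical and may not match how `PromptTemplate.render` handles errors.

```python
def render_template(template, variables, **values):
    """Fill a format-style template, applying defaults and checking required names."""
    resolved = {}
    for var in variables:
        name = var["name"]
        if name in values:
            resolved[name] = values[name]
        elif "default" in var:
            resolved[name] = var["default"]
        elif var.get("required", False):
            raise ValueError(f"Missing required variable: {name}")
        else:
            # Optional variable with no default renders as empty text.
            resolved[name] = ""
    return template.format(**resolved)


variables = [
    {"name": "role", "required": True},
    {"name": "tone", "default": "professional"},
]
print(render_template("You are a {role}. Use a {tone} tone.", variables, role="data analyst"))
```

Checking required variables before formatting produces a clearer error message than the `KeyError` that `str.format` would raise on its own.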

## Part 10: A/B Testing Setup

Compare prompt versions to find the best performer.

In [None]:
# Set up an A/B test
test_config = tuner.compare(
    name="summarize",
    version_a="v1.0",
    version_b="v1.2",
    sample_size=100,
)

print(f"A/B Test Configuration:")
print(f"  Test name: {test_config.name}")
print(f"  Sample size: {test_config.sample_size}")
print(f"  Metrics: {test_config.metrics}")
print(f"\nVariant A: {test_config.variant_a.version}")
print(f"Variant B: {test_config.variant_b.version}")

In [None]:
# Analyze test results (simulated)
result = tuner.ab_runner.analyze(test_config.name)

print(f"A/B Test Results:")
print(f"\nVariant A metrics: {result.variant_a_metrics}")
print(f"Variant B metrics: {result.variant_b_metrics}")
print(f"\nWinner: {result.winner.value}")
print(f"Confidence: {result.confidence:.0%}")
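
How the runner picks a winner and computes `confidence` is not shown in this notebook. One common way to compare two variants on a pass/fail metric is a two-proportion z-test, sketched below with the standard library. The success counts are made-up illustration data, and `two_proportion_z_test` is a hypothetical helper.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Both variants succeeded (or failed) on every sample: no evidence of a difference.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: variant A passed 70 of 100 samples, variant B passed 85 of 100.
z, p_value = two_proportion_z_test(70, 100, 85, 100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("Significant at 5%:", p_value < 0.05)
```

A small p-value suggests the difference is unlikely to be noise, but it says nothing about whether the improvement is large enough to matter, so compare the raw rates as well.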

## 🎯 Exercise: Apply Evaluation-Driven Improvement

1. Create a prompt and analyze it
2. Apply suggestions to improve it
3. Record metrics for both versions
4. Compare the results

In [None]:
# Your solution here

# 1. Create initial prompt
initial_prompt = "Write code to sort a list."

# 2. Analyze and get suggestions
initial_analysis = analyze_prompt(initial_prompt)
print(f"Initial prompt analysis:")
print(f"  Suggestions: {len(initial_analysis.suggestions)}")
for s in initial_analysis.suggestions[:3]:
    print(f"    - {s.description}")

# 3. Create improved version based on suggestions
improved_prompt = """
You are an expert Python developer.

Write a function to sort a list of integers in ascending order.

Requirements:
- Use Python's built-in sorting capabilities
- Handle empty lists gracefully
- Include type hints

Example:
Input: [3, 1, 4, 1, 5]
Output: [1, 1, 3, 4, 5]

Return only the function code, no explanations.
"""

improved_analysis = analyze_prompt(improved_prompt)
print(f"\nImproved prompt analysis:")
print(f"  Has role: {improved_analysis.has_role}")
print(f"  Has examples: {improved_analysis.has_examples}")
print(f"  Has constraints: {improved_analysis.has_constraints}")
print(f"  Suggestions remaining: {len(improved_analysis.suggestions)}")

## Summary

In this scenario, you learned:

1. **Metrics Collection**: Latency, accuracy, cost, and quality tracking
2. **Evaluation Methods**: Exact match, contains, semantic similarity
3. **Cost Estimation**: Token-based pricing for different models
4. **Prompt Analysis**: Automated suggestions for improvement
5. **Version Tracking**: Managing prompt iterations
6. **Templates**: Reusable prompts with variables
7. **A/B Testing**: Comparing prompt variants

### Key Takeaways

- **Measure everything**: You can't improve what you don't measure
- **Iterate systematically**: Track changes and their impact
- **Test before promoting**: A/B test significant changes
- **Analyze prompts**: Use automated analysis to catch issues early

### Next Steps

- Integrate evaluation into your CI/CD pipeline
- Build dashboards for real-time monitoring
- Set up alerts for quality degradation