# Scenario 07: AI Agent Evaluation with Azure AI Evaluation SDK

**Estimated Time**: 45 minutes

## Learning Objectives
- Use the official `azure-ai-evaluation` SDK for agent evaluation
- Apply quality evaluators: Relevance, Coherence, Fluency, Groundedness
- Use agent-specific evaluators: IntentResolution, TaskAdherence, ToolCallAccuracy
- Run batch evaluations with the `evaluate()` API
- Optionally log results to Azure AI Foundry

## Prerequisites
- Completed Scenario 01 (Simple Agent + MCP)
- Azure OpenAI deployment (gpt-4o recommended)
- Environment variables configured (see `.env.example`)

## Key Concepts

| Evaluator | Purpose | Output Scale |
|-----------|---------|--------------|
| `RelevanceEvaluator` | Is response relevant to query? | 1-5 |
| `CoherenceEvaluator` | Is response logically structured? | 1-5 |
| `FluencyEvaluator` | Is response linguistically sound? | 1-5 |
| `GroundednessEvaluator` | Is response grounded in context? | 1-5 |
| `IntentResolutionEvaluator` | Did agent resolve user intent? | 1-5 + pass/fail |
| `TaskAdherenceEvaluator` | Did agent complete the task? | 1-5 + pass/fail |
| `ToolCallAccuracyEvaluator` | Were tool calls appropriate? | 0-1 |

In [1]:
# Cell 1: Environment Setup
import sys
from pathlib import Path

# Add project root to path
project_root = Path("..").resolve()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Load environment variables
from dotenv import load_dotenv
load_dotenv(project_root / ".env")

print(f"✅ Project root: {project_root}")

✅ Project root: C:\Users\jonasrotter\OneDrive - Microsoft\Desktop\Jonas Privat\MyCodingProjects\agents-workshop


In [2]:
# Cell 2: Import Azure AI Evaluation SDK and helpers
import os
import json
from pprint import pprint

# Reload modules to pick up any changes during development
import importlib
import src.common.config
import src.common.evaluation

# Azure AI Evaluation SDK imports
from azure.ai.evaluation import (
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    IntentResolutionEvaluator,
    TaskAdherenceEvaluator,
    ToolCallAccuracyEvaluator,
    evaluate,
)

# Configuration from centralized config.py (unified with all other notebooks)
from src.common.config import (
    get_settings,
    get_model_config,
    get_azure_ai_project,
    get_config_summary,
)
from src.common.exceptions import ConfigurationError

# Evaluation helpers
from src.common.evaluation import (
    # SDK wrapper functions
    create_relevance_evaluator,
    create_coherence_evaluator,
    create_fluency_evaluator,
    create_groundedness_evaluator,
    create_intent_resolution_evaluator,
    create_task_adherence_evaluator,
    create_tool_call_accuracy_evaluator,
    batch_evaluate,
    # Retained utilities
    MetricsCollector,
    CostMetric,
    OpenAICostCalculator,
    MetricType,
)

print("✅ Azure AI Evaluation SDK imports successful")

✅ Azure AI Evaluation SDK imports successful


## Part 1: SDK Configuration

Configure the Azure OpenAI model for AI-assisted evaluators.
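
As a rough sketch of what such a configuration contains, the snippet below assembles a dictionary with the same keys that appear in the recorded output of Cell 3 (`azure_endpoint`, `azure_deployment`, `api_version`, `api_key`). The helper name `build_model_config` and the environment variable names `AZURE_OPENAI_API_VERSION` and `AZURE_OPENAI_API_KEY` are assumptions made for illustration; the workshop's own `get_model_config()` may read its settings differently.

```python
import os
from typing import Mapping, Optional


def build_model_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Hypothetical helper: build an evaluator model config from environment variables."""
    env = os.environ if env is None else env
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    deployment = env.get("AZURE_OPENAI_DEPLOYMENT")

    # Fail early with a clear message if a required setting is absent.
    required = {"AZURE_OPENAI_ENDPOINT": endpoint, "AZURE_OPENAI_DEPLOYMENT": deployment}
    missing = [name for name, value in required.items() if not value]
    if missing:
        raise ValueError(f"Missing required settings: {', '.join(missing)}")

    config = {
        "azure_endpoint": endpoint,
        "azure_deployment": deployment,
        # Default taken from the api_version shown in the recorded output.
        "api_version": env.get("AZURE_OPENAI_API_VERSION", "2024-10-01-preview"),
    }
    # Only include a key when one is provided (assumed variable name).
    api_key = env.get("AZURE_OPENAI_API_KEY")
    if api_key:
        config["api_key"] = api_key
    return config


example = build_model_config({
    "AZURE_OPENAI_ENDPOINT": "https://your-resource.openai.azure.com",
    "AZURE_OPENAI_DEPLOYMENT": "gpt-4o",
})
print(example)
```

Passing a plain mapping instead of `os.environ` keeps the helper easy to test without touching real credentials.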

In [3]:
# Cell 3: Configure Azure OpenAI for evaluators (unified config.py)
# Clear cache to pick up any .env changes
get_settings.cache_clear()
settings = get_settings()

print(f"✅ Settings loaded from .env via config.py:")
print(f"   Azure OpenAI: {'Configured' if settings.is_azure_configured else 'Not configured'}")
print(f"   Endpoint: {settings.azure_openai_endpoint or '(not set)'}")
print(f"   Deployment: {settings.azure_openai_deployment or '(not set)'}")

# Get model config for SDK evaluators (reads from settings)
try:
    model_config = get_model_config()
    # Use Azure credential auth instead of API key (key auth is disabled on this resource)
    model_config["api_key"] = ""
    print("\n✅ Model configuration for SDK evaluators (using Azure credential auth):")
    pprint(get_config_summary(model_config))
except ConfigurationError as e:
    print(f"\n⚠️ Configuration error: {e}")
    print("\nPlease set the following environment variables in your .env file:")
    print("  AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com")
    print("  AZURE_OPENAI_DEPLOYMENT=gpt-4o")
    model_config = None

✅ Settings loaded from .env via config.py:
   Azure OpenAI: Configured
   Endpoint: https://aistudiojonasr5312406741.cognitiveservices.azure.com/
   Deployment: gpt-4.1-mini

✅ Model configuration for SDK evaluators (using Azure credential auth):
{'api_key': '(not set)',
 'api_version': '2024-10-01-preview',
 'azure_deployment': 'gpt-4.1-mini',
 'azure_endpoint': 'https://aistudiojonasr5312406741.cognitiveservices.azure.com/'}


In [4]:
# Cell 4: Check optional Azure AI Foundry project configuration
azure_ai_project = get_azure_ai_project()
if azure_ai_project:
    print("✅ Azure AI Foundry project configured:")
    pprint(azure_ai_project)
else:
    print("ℹ️ Azure AI Foundry project not configured (optional)")
    print("   Set AZURE_AI_PROJECT_* env vars to enable result logging to Foundry")

✅ Azure AI Foundry project configured:
{'project_name': 'jonasrotter-project',
 'resource_group_name': 'aoai-jonasrotter',
 'subscription_id': '136df6d6-935d-47c7-8ae3-38ebfccf0df3'}


## Part 2: Quality Evaluators

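The SDK provides AI-assisted evaluators that use an LLM to assess response quality.
Each evaluator returns a score from 1 (poor) to 5 (excellent). In the outputs recorded in this notebook, each result also carries a `<metric>_reason` explanation and a `<metric>_result` verdict such as `pass`.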
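
The result dictionaries recorded in this notebook are fairly verbose (token counts, sample inputs, and so on). A minimal sketch of how the essentials could be pulled out is shown below. It assumes the key pattern seen in the recorded outputs (`<metric>`, `<metric>_result`, `<metric>_reason`); the helper `summarize_result` is hypothetical and not part of the SDK, and the sample values are shortened from the relevance output.

```python
from typing import Any, Dict


def summarize_result(result: Dict[str, Any], metric: str) -> Dict[str, Any]:
    """Extract the score, verdict, and explanation for one metric from a result dict."""
    return {
        "metric": metric,
        "score": result.get(metric),
        "passed": result.get(f"{metric}_result") == "pass",
        "reason": result.get(f"{metric}_reason", ""),
    }


# Shortened stand-in for an evaluator result; not a live SDK call.
sample = {
    "relevance": 5.0,
    "relevance_result": "pass",
    "relevance_reason": "The response directly answers the user's question.",
}

summary = summarize_result(sample, "relevance")
print(f"{summary['metric']}: {summary['score']}/5 (passed={summary['passed']})")
```

Using `.get()` throughout means a missing key yields `None` or an empty string instead of raising, which is convenient when evaluator output keys differ between SDK versions.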

In [5]:
# Cell 5: RelevanceEvaluator - Measures response relevance to query
if model_config:
    relevance_eval = create_relevance_evaluator(model_config)
    
    # Test with a sample query/response pair
    result = relevance_eval(
        query="What is the capital of France?",
        response="The capital of France is Paris. It is known for landmarks like the Eiffel Tower."
    )
    
    print("🎯 Relevance Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('relevance', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

🎯 Relevance Evaluation Result:
{'gpt_relevance': 5.0,
 'relevance': 5.0,
 'relevance_completion_tokens': 51,
 'relevance_finish_reason': 'stop',
 'relevance_model': 'gpt-4.1-mini-2025-04-14',
 'relevance_prompt_tokens': 1596,
 'relevance_reason': "The response directly answers the user's question by "
                     'naming Paris as the capital of France and adds relevant '
                     "context about a famous landmark, enhancing the user's "
                     'understanding without deviating from the topic.',
 'relevance_result': 'pass',
 'relevance_sample_input': '[{"role": "user", "content": "{\\"query\\": '
                           '\\"What is the capital of France?\\", '
                           '\\"response\\": \\"The capital of France is Paris. '
                           'It is known for landmarks like the Eiffel '
                           'Tower.\\"}"}]',
 'relevance_sample_output': '[{"role": "assistant", "content": "{\\n  '
                        

In [6]:
# Cell 6: CoherenceEvaluator - Measures logical flow and consistency
if model_config:
    coherence_eval = create_coherence_evaluator(model_config)
    
    result = coherence_eval(
        query="Explain how photosynthesis works",
        response="Photosynthesis converts sunlight into energy. Plants use chlorophyll to capture light. This process produces glucose and oxygen. The glucose provides energy for plant growth."
    )
    
    print("🔗 Coherence Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('coherence', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

🔗 Coherence Evaluation Result:
{'coherence': 4.0,
 'coherence_completion_tokens': 188,
 'coherence_finish_reason': 'stop',
 'coherence_model': 'gpt-4.1-mini-2025-04-14',
 'coherence_prompt_tokens': 1286,
 'coherence_reason': 'The response is coherent because it logically and '
                     'clearly explains the process of photosynthesis in a '
                     'well-organized manner, directly addressing the query '
                     'with connected ideas and smooth flow.',
 'coherence_result': 'pass',
 'coherence_sample_input': '[{"role": "user", "content": "{\\"query\\": '
                           '\\"Explain how photosynthesis works\\", '
                           '\\"response\\": \\"Photosynthesis converts '
                           'sunlight into energy. Plants use chlorophyll to '
                           'capture light. This process produces glucose and '
                           'oxygen. The glucose provides energy for plant '
                         

## Part 3: More Quality Evaluators

Fluency and Groundedness evaluations complete the quality assessment suite.

In [7]:
# Cell 7: FluencyEvaluator - Measures grammatical correctness and readability
if model_config:
    fluency_eval = create_fluency_evaluator(model_config)
    
    result = fluency_eval(
        response="The quick brown fox jumps over the lazy dog. This sentence demonstrates proper grammar and natural flow."
    )
    
    print("📝 Fluency Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('fluency', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

📝 Fluency Evaluation Result:
{'fluency': 3.0,
 'fluency_completion_tokens': 192,
 'fluency_finish_reason': 'stop',
 'fluency_model': 'gpt-4.1-mini-2025-04-14',
 'fluency_prompt_tokens': 931,
 'fluency_reason': 'The response is clear, coherent, and grammatically correct '
                   'with simple sentence structures and vocabulary. It fits '
                   'well within competent fluency but does not show the '
                   'complexity or variety needed for a higher score.',
 'fluency_result': 'pass',
 'fluency_sample_input': '[{"role": "user", "content": "{\\"response\\": '
                         '\\"The quick brown fox jumps over the lazy dog. This '
                         'sentence demonstrates proper grammar and natural '
                         'flow.\\"}"}]',
 'fluency_sample_output': '[{"role": "assistant", "content": "<S0>Let\'s think '
                          'step by step: The response consists of two '
                          'sentences. The first 

In [8]:
# Cell 8: GroundednessEvaluator - Measures whether the response is supported by the supplied context
if model_config:
    groundedness_eval = create_groundedness_evaluator(model_config)
    
    context = """
    Paris is the capital of France. It has a population of about 2.1 million 
    in the city proper. The Eiffel Tower was completed in 1889.
    """
    
    result = groundedness_eval(
        context=context,
        response="Paris is the capital of France with roughly 2 million residents. The famous Eiffel Tower was built in 1889."
    )
    
    print("📚 Groundedness Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('groundedness', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

📚 Groundedness Evaluation Result:
{'gpt_groundedness': 5.0,
 'groundedness': 5.0,
 'groundedness_completion_tokens': 179,
 'groundedness_finish_reason': 'stop',
 'groundedness_model': 'gpt-4.1-mini-2025-04-14',
 'groundedness_prompt_tokens': 1174,
 'groundedness_reason': 'The response accurately reflects the key facts from '
                        'the context, including Paris being the capital, the '
                        "approximate population, and the Eiffel Tower's "
                        'completion year, making it fully grounded and '
                        'complete.',
 'groundedness_result': 'pass',
 'groundedness_sample_input': '[{"role": "user", "content": "{\\"response\\": '
                              '\\"Paris is the capital of France with roughly '
                              '2 million residents. The famous Eiffel Tower '
                              'was built in 1889.\\", \\"context\\": '
                              '\\"\\\\n    Paris is the capital of

## Part 4: Agent-Specific Evaluators

These evaluators are designed specifically for AI agents and agentic workflows.

In [9]:
# Cell 9: IntentResolutionEvaluator - Measures how well the agent understood user intent
if model_config:
    intent_eval = create_intent_resolution_evaluator(model_config)
    
    result = intent_eval(
        query="I need to book a flight to New York for next Tuesday",
        response="I found several flight options to New York for next Tuesday. The earliest departure is at 6:00 AM with Delta, and I can also show you afternoon flights if you prefer a later start."
    )
    
    print("🎯 Intent Resolution Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('intent_resolution', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

Class IntentResolutionEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Conversation history could not be parsed, falling back to original query: I need to book a flight to New York for next Tuesday
Empty agent response extracted, likely due to input schema change. Falling back to using the original response: I found several flight options to New York for next Tuesday. The earliest departure is at 6:00 AM with Delta, and I can also show you afternoon flights if you prefer a later start.


🎯 Intent Resolution Evaluation Result:
{'gpt_intent_resolution': 4.0,
 'intent_resolution': 4.0,
 'intent_resolution_completion_tokens': 64,
 'intent_resolution_finish_reason': 'stop',
 'intent_resolution_model': 'gpt-4.1-mini-2025-04-14',
 'intent_resolution_prompt_tokens': 1931,
 'intent_resolution_reason': 'User wanted to book a flight to New York for '
                             'next Tuesday. The agent provided relevant flight '
                             'options and offered to show more, effectively '
                             'addressing the intent by initiating the booking '
                             'process and prompting further preference, though '
                             'no final booking confirmation was given yet.',
 'intent_resolution_result': 'pass',
 'intent_resolution_sample_input': '[{"role": "user", "content": '
                                   '"{\\"query\\": \\"I need to book a flight '
                                   'to New York for next 

## Part 5: Task Adherence & Tool Call Accuracy

Evaluate how well agents follow instructions and use tools correctly.
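
For tool use, the evaluator needs to know both which tools the agent called and which tools were available. The sketch below shows one plausible shape for that data and a purely local sanity check that every call refers to a defined tool. The field names (`type`, `tool_call_id`, `name`, `arguments`, `description`, `parameters`) are assumptions about the SDK's input format and should be checked against the installed version; `undefined_tool_calls` is a hypothetical helper, and nothing here calls the SDK or a model.

```python
from typing import Dict, List


def undefined_tool_calls(tool_calls: List[Dict], tool_definitions: List[Dict]) -> List[str]:
    """Return the names of called tools that have no matching definition."""
    defined = {definition["name"] for definition in tool_definitions}
    return [call["name"] for call in tool_calls if call["name"] not in defined]


# Illustrative tool definition (assumed schema).
tool_definitions = [
    {
        "name": "search_flights",
        "description": "Search for flights to a destination on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string"},
                "date": {"type": "string"},
            },
        },
    }
]

# Illustrative tool call made by the agent (assumed schema).
tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_001",
        "name": "search_flights",
        "arguments": {"destination": "New York", "date": "next Tuesday"},
    }
]

problems = undefined_tool_calls(tool_calls, tool_definitions)
print("Undefined tools:", problems)
```

A check like this catches malformed test data before any evaluator tokens are spent; it says nothing about whether the call itself was appropriate, which is what the LLM-based evaluator is for.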

In [10]:
# Cell 10: TaskAdherenceEvaluator - Measures if the agent followed instructions
if model_config:
    task_eval = create_task_adherence_evaluator(model_config)
    
    result = task_eval(
        query="Summarize this article in exactly 3 bullet points and include the main conclusion",
        response="""Here are the key points:
• The study found that exercise improves cognitive function by 25%
• Regular physical activity reduces stress hormones significantly  
• The main conclusion is that 30 minutes of daily exercise can enhance mental performance"""
    )
    
    print("✅ Task Adherence Evaluation Result:")
    pprint(result)
    print(f"\n📊 Score Interpretation: {result.get('task_adherence', 0)}/5")
else:
    print("⚠️ Skipping - model_config not available")

Class TaskAdherenceEvaluator: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
Conversation history could not be parsed, falling back to original query: Summarize this article in exactly 3 bullet points and include the main conclusion
Agent response could not be parsed, falling back to original response: Here are the key points:
• The study found that exercise improves cognitive function by 25%
• Regular physical activity reduces stress hormones significantly  
• The main conclusion is that 30 minutes of daily exercise can enhance mental performance


✅ Task Adherence Evaluation Result:
{'task_adherence': 1.0,
 'task_adherence_completion_tokens': 152,
 'task_adherence_details': '',
 'task_adherence_finish_reason': 'stop',
 'task_adherence_model': 'gpt-4.1-mini-2025-04-14',
 'task_adherence_prompt_tokens': 1368,
 'task_adherence_reason': 'The user requested a summary of an article in '
                          'exactly 3 bullet points including the main '
                          'conclusion. The assistant provided exactly 3 '
                          'bullets with key findings and a clear main '
                          "conclusion, fully meeting the user's objective. "
                          'There is no evidence of unrelated or unverifiable '
                          'claims as no tool calls were used or needed for '
                          'this task; the assistant gave a concise summary '
                          'consistent with the request. The assistant '
                          'respected the exact format requ

## Part 6: Batch Evaluation

Run multiple evaluators at once using the SDK's `evaluate()` function.
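
Before writing a data file for a batch run, it can help to confirm that every row carries the fields the chosen evaluators were called with earlier in this notebook (`query` and `response` for relevance and coherence, `context` and `response` for groundedness). The following is a minimal sketch of such a pre-flight check; `REQUIRED_FIELDS` and `find_incomplete_rows` are hypothetical names, not SDK features.

```python
from typing import Dict, Iterable, List

# Fields per evaluator, mirroring the keyword arguments used in the single-call cells.
REQUIRED_FIELDS = {
    "relevance": {"query", "response"},
    "coherence": {"query", "response"},
    "groundedness": {"context", "response"},
}


def find_incomplete_rows(rows: List[Dict], evaluator_names: Iterable[str]) -> Dict[int, List[str]]:
    """Map row index -> sorted list of fields that are missing or empty."""
    needed = set().union(*(REQUIRED_FIELDS[name] for name in evaluator_names))
    problems = {}
    for index, row in enumerate(rows):
        missing = sorted(field for field in needed if not row.get(field))
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"query": "What is machine learning?", "context": "ML is a subset of AI.", "response": "ML learns from data."},
    {"query": "Explain neural networks", "response": "Layered computational models."},  # no context
]

incomplete = find_incomplete_rows(rows, ["relevance", "groundedness"])
print(incomplete)
```

Rows flagged here can be fixed or dropped before the JSONL file is written.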

In [11]:
# Cell 11: Prepare sample data for batch evaluation
sample_data = [
    {
        "query": "What is machine learning?",
        "context": "Machine learning is a subset of AI that enables systems to learn from data.",
        "response": "Machine learning is a branch of artificial intelligence where systems learn patterns from data to make predictions without explicit programming."
    },
    {
        "query": "Explain neural networks",
        "context": "Neural networks are computing systems inspired by biological neural networks in animal brains.",
        "response": "Neural networks are computational models inspired by the human brain, consisting of interconnected nodes that process information in layers."
    },
    {
        "query": "What is deep learning?",
        "context": "Deep learning uses neural networks with many layers to model complex patterns.",
        "response": "Deep learning is a subset of machine learning that uses multi-layered neural networks to analyze complex data patterns."
    }
]

print(f"📋 Prepared {len(sample_data)} samples for batch evaluation")
for i, sample in enumerate(sample_data, 1):
    print(f"   Sample {i}: {sample['query'][:40]}...")

📋 Prepared 3 samples for batch evaluation
   Sample 1: What is machine learning?...
   Sample 2: Explain neural networks...
   Sample 3: What is deep learning?...


In [12]:
# Cell 12: Run batch evaluation with multiple evaluators
import tempfile
import json as json_mod

if model_config:
    # Create evaluators dictionary with the instances we already have
    evaluators = {
        "relevance": relevance_eval,
        "coherence": coherence_eval,
        "groundedness": groundedness_eval,
    }
    
    # The SDK evaluate() function requires data to be a file path
    # Save sample data to a temporary JSONL file
    with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False, encoding='utf-8') as f:
        temp_path = f.name
        for item in sample_data:
            f.write(json_mod.dumps(item) + '\n')
    
    try:
        # Use our batch_evaluate wrapper function
        results = batch_evaluate(
            data=temp_path,
            evaluators=evaluators,
        )
        
        print("📊 Batch Evaluation Results:")
        pprint(results)
    finally:
        # Clean up the temp file (os is already imported in Cell 2)
        os.unlink(temp_path)
else:
    print("⚠️ Skipping batch evaluation - model_config not available")
    print("   Set environment variables and re-run Part 1 to enable")

2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Finished 1 / 3 lines.
2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Average execution time for completed lines: 18.8 seconds. Estimated time for incomplete lines: 37.6 seconds.
2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Finished 2 / 3 lines.
2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Average execution time for completed lines: 9.46 seconds. Estimated time for incomplete lines: 9.46 seconds.
2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-09 14:31:11 +0100   21748 execution.bulk     INFO     Average execution time for completed lines: 6.33 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "relevance_20260109_133052_488649"
Run status: "Completed"
Start time: "2026-01-09 13:30:52.488649+00:00"
Duration: "0:00:19.494566"

2026-01-09 14:31:12 +0100   46484 execution.bulk     INFO     Finished 1 / 3 lines.
2026-01-09 14:31:12 +0100   46484 execution.bulk     INFO     Average execution time for completed lines: 20.27 seconds. Estimated time for incomplete lines: 40.54 seconds.
2026-01-09 14:31:13 +0100   46484 execution.bulk     INFO     Finished 2 / 3 lines.
2026-01-09 14:31:13 +0100   46484 execution.bulk     INFO     Average execution time for completed lines: 10.27 seconds. Estimated time for incomplete lines: 10.27 seconds.
2026-01-09 14:31:13 +0100   46484 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-09 14:31:13 +0100   46484 execution.bulk     INFO     Average execution time for completed lines: 7.03 seconds. Estimated time for incomplete lines: 0.0 seconds.
2026-01-09 14:31:13 +0100   34908 execution.bulk     INFO     Finished 1 / 3 lines.
202

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20260109_133052_490739"
Run status: "Completed"
Start time: "2026-01-09 13:30:52.490739+00:00"
Duration: "0:00:21.469074"

2026-01-09 14:31:15 +0100   34908 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-09 14:31:15 +0100   34908 execution.bulk     INFO     Average execution time for completed lines: 7.62 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "groundedness_20260109_133052_513209"
Run status: "Completed"
Start time: "2026-01-09 13:30:52.513209+00:00"
Duration: "0:00:23.516100"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:19.494566",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:21.469074",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "groundedness": {
        "status": "Completed",
        "duration": "0:00:23.516100",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    }
}


📊 Batch Evaluation Results:
{'metrics': {'coherence.binary_aggregate': 1.0,
             'coherence.coherence': 4.0,
             'c

Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "relevance_20260109_133116_650104"
Run status: "Completed"
Start time: "2026-01-09 13:31:16.650104+00:00"
Duration: "0:00:19.492423"

2026-01-09 14:31:36 +0100   25432 execution.bulk     INFO     Finished 2 / 3 lines.
2026-01-09 14:31:36 +0100   25432 execution.bulk     INFO     Average execution time for completed lines: 9.83 seconds. Estimated time for incomplete lines: 9.83 seconds.
2026-01-09 14:31:36 +0100   25432 execution.bulk     INFO     Finished 3 / 3 lines.
2026-01-09 14:31:36 +0100   25432 execution.bulk     INFO     Average execution time for completed lines: 6.59 seconds. Estimated time for incomplete lines: 0.0 seconds.


Aggregated metrics for evaluator is not a dictionary will not be logged as metrics



Run name: "coherence_20260109_133116_656129"
Run status: "Completed"
Start time: "2026-01-09 13:31:16.656129+00:00"
Duration: "0:00:20.509226"


{
    "relevance": {
        "status": "Completed",
        "duration": "0:00:19.492423",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    },
    "coherence": {
        "status": "Completed",
        "duration": "0:00:20.509226",
        "completed_lines": 3,
        "failed_lines": 0,
        "log_path": null,
        "error_message": null,
        "error_code": null
    }
}


☁️ Evaluation submitted to Azure AI Foundry
   View results at: https://ai.azure.com
🎯 Prompt Tuner Configuration:
   Base prompt: Answer the following question: {question}...
   Latest version: v1.1
   Versions registered: 2
📊 Prompt Evaluation Strategy:
   1. Run each prompt variation through your test dataset
   2. Evaluate responses with SDK quality evaluators


## Part 7: Custom Evaluators (Extensibility)

Create custom evaluators by extending the SDK's evaluator pattern.

In [13]:
# Cell 13: Custom Evaluator Example - Response Length Evaluator
class ResponseLengthEvaluator:
    """Custom evaluator that checks if response meets length requirements."""
    
    def __init__(self, min_words: int = 10, max_words: int = 500):
        self.min_words = min_words
        self.max_words = max_words
    
    def __call__(self, *, response: str, **kwargs) -> dict:
        word_count = len(response.split())
        
        if word_count < self.min_words:
            score = 1  # Too short
            reason = f"Response too short ({word_count} words, minimum {self.min_words})"
        elif word_count > self.max_words:
            score = 2  # Too long
            reason = f"Response too long ({word_count} words, maximum {self.max_words})"
        else:
            score = 5  # Appropriate length
            reason = f"Response length appropriate ({word_count} words)"
        
        return {
            "response_length": score,
            "response_length_reason": reason,
            "word_count": word_count
        }

# Test the custom evaluator
length_eval = ResponseLengthEvaluator(min_words=5, max_words=100)
result = length_eval(response="This is a test response with several words to check the evaluator.")
print("📏 Custom Length Evaluator Result:")
pprint(result)

In [14]:
# Cell 14: Custom Evaluator - Code Block Detector
import re

class CodeBlockEvaluator:
    """Custom evaluator that detects fenced code blocks and inline code in responses."""
    
    def __call__(self, *, response: str, **kwargs) -> dict:
        # Find fenced markdown code blocks
        code_blocks = re.findall(r'```[\s\S]*?```', response)
        # Strip fenced blocks first so their backticks are not miscounted as inline code
        text_without_blocks = re.sub(r'```[\s\S]*?```', '', response)
        inline_code = re.findall(r'`[^`\n]+`', text_without_blocks)
        
        has_code = len(code_blocks) > 0 or len(inline_code) > 0
        
        return {
            "has_code": has_code,
            "code_block_count": len(code_blocks),
            "inline_code_count": len(inline_code),
            "code_detection_score": 5 if has_code else 1
        }

# Test with a response containing code
code_eval = CodeBlockEvaluator()
test_response = """Here's how to print in Python:
```python
print("Hello, World!")
```
You can also use `print()` with variables."""

result = code_eval(response=test_response)
print("💻 Code Block Evaluator Result:")
pprint(result)
result

{'has_code': True,
 'code_block_count': 1,
 'inline_code_count': 1,
 'code_detection_score': 5}
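
Custom evaluators of this kind are plain callables that take keyword arguments and return a dict, so they can be exercised locally without a model. The sketch below applies a small stand-in evaluator to a few rows and averages its numeric outputs. `WordCountEvaluator` and `run_local` are hypothetical names introduced for illustration; this is not the SDK's `evaluate()` and makes no claim about how the SDK aggregates results.

```python
from statistics import mean
from typing import Callable, Dict, List


class WordCountEvaluator:
    """Hypothetical stand-in for a custom evaluator: counts words in the response."""

    def __call__(self, *, response: str, **kwargs) -> dict:
        return {"word_count": len(response.split())}


def run_local(rows: List[Dict], evaluators: Dict[str, Callable[..., dict]]) -> Dict[str, float]:
    """Call each evaluator on each row and average every numeric output."""
    collected: Dict[str, List[float]] = {}
    for row in rows:
        for name, evaluator in evaluators.items():
            for key, value in evaluator(**row).items():
                # Skip booleans and strings; only numeric scores are averaged.
                if isinstance(value, (int, float)) and not isinstance(value, bool):
                    collected.setdefault(f"{name}.{key}", []).append(float(value))
    return {key: mean(values) for key, values in collected.items()}


rows = [
    {"query": "What is machine learning?", "response": "A branch of AI that learns from data."},
    {"query": "Explain neural networks", "response": "Layered models inspired by the brain."},
]

averages = run_local(rows, {"length": WordCountEvaluator()})
print(averages)
```

The `**kwargs` in the evaluator signature is what lets each row be passed whole, even when it contains fields the evaluator does not use.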

## Part 8: Cost Tracking with MetricsCollector

Track API costs alongside SDK quality evaluations using our MetricsCollector.

In [15]:
# Cell 15: Initialize MetricsCollector for cost tracking
from src.common.evaluation import MetricsCollector, estimate_cost

collector = MetricsCollector()

# Estimate costs for different models
models = ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo"]
input_tokens = 1000
output_tokens = 500

print(f"💰 Cost comparison for {input_tokens} input + {output_tokens} output tokens:\n")
for model in models:
    cost = estimate_cost(input_tokens, output_tokens, model)
    print(f"   {model}: ${cost:.6f}")
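
`estimate_cost` is a workshop helper whose implementation is not shown in this notebook. Token-based pricing is usually linear in input and output tokens, so a calculation of this kind can be sketched as follows. The prices in `HYPOTHETICAL_PRICES_PER_1K` are placeholder numbers rather than real rates, the model names are invented, and `estimate_cost_sketch` is an illustrative name.

```python
from typing import Dict, Tuple

# (input price, output price) in USD per 1,000 tokens. Placeholder values only.
HYPOTHETICAL_PRICES_PER_1K: Dict[str, Tuple[float, float]] = {
    "model-a": (0.005, 0.015),
    "model-b": (0.0005, 0.0015),
}


def estimate_cost_sketch(input_tokens: int, output_tokens: int, model: str) -> float:
    """Linear cost model: tokens / 1000 * price, summed over input and output."""
    input_price, output_price = HYPOTHETICAL_PRICES_PER_1K[model]
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price


for model_name in HYPOTHETICAL_PRICES_PER_1K:
    cost = estimate_cost_sketch(1000, 500, model_name)
    print(f"{model_name}: ${cost:.6f}")
```

Because evaluator prompts are long (the recorded runs show over a thousand prompt tokens per call), input tokens tend to dominate evaluation cost under a model like this.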

In [16]:
# Cell 16: Record cost metrics for evaluation runs
from src.common.evaluation import MetricType

cost = collector.record_cost(
    operation="quality_evaluation",
    input_tokens=500,
    output_tokens=250,
    model="gpt-4o",
)

print("📊 Cost Metric Recorded:")
print(f"   Operation: {cost.operation}")
print(f"   Total tokens: {cost.total_tokens}")
print(f"   Cost USD: ${cost.cost_usd:.6f}")
print(f"   Model: {cost.model}")

# Record multiple evaluation costs
for i in range(5):
    collector.record_cost(
        operation=f"batch_eval_{i}",
        input_tokens=200 + i*50,
        output_tokens=100 + i*25,
        model="gpt-4o-mini",
    )

# Get cost metrics using get_metrics() with filter
cost_metrics = collector.get_metrics(metric_type=MetricType.COST)
print(f"\n✅ Recorded {len(cost_metrics)} total cost metrics")

In [17]:
# Cell 17: Get evaluation summary
summary = collector.summary()
print("📈 Evaluation Session Summary:")
print(f"   Total metrics: {summary['total_metrics']}")
print(f"   Total evaluations: {summary['total_evaluations']}")
print(f"   Metric types: {summary['metric_types']}")

In [18]:
# Cell 18: Export metrics to JSON for analysis
export_data = collector.export_json()
print("📤 Exported Metrics (first 500 chars):")
print(export_data[:500] + "...")

## Part 9: Azure AI Foundry Integration (Optional)

Connect to Azure AI Foundry for cloud-based evaluation tracking and analytics.

In [19]:
# Cell 19: Check Azure AI Foundry project configuration
azure_project = get_azure_ai_project()

if azure_project:
    print("☁️ Azure AI Foundry Project Configuration:")
    print(f"   Subscription: {azure_project['subscription_id'][:8]}...")
    print(f"   Resource Group: {azure_project['resource_group_name']}")
    print(f"   Project: {azure_project['project_name']}")
    print("\n   ✅ Ready for cloud-based evaluation tracking!")
    print("   Results can be viewed in Azure AI Foundry portal")
else:
    print("⚠️ Azure AI Foundry not configured")
    print("   To enable cloud tracking, set these environment variables:")
    print("   - AZURE_AI_PROJECT_SUBSCRIPTION_ID")
    print("   - AZURE_AI_PROJECT_RESOURCE_GROUP")
    print("   - AZURE_AI_PROJECT_NAME")

In [20]:
sample_data[0]

{'query': 'What is machine learning?',
 'context': 'Machine learning is a subset of AI that enables systems to learn from data.',
 'response': 'Machine learning is a branch of artificial intelligence where systems learn patterns from data to make predictions without explicit programming.'}

In [21]:
# Cell 20: Example - Running evaluation with Azure AI Foundry tracking
if model_config and azure_project:
    import tempfile
    import json as json_mod
    import os
    from azure.ai.evaluation import evaluate
    
    # SDK evaluate() requires a file path, not in-memory data
    with tempfile.NamedTemporaryFile(mode='w', suffix='.jsonl', delete=False, encoding='utf-8') as f:
        temp_path = f.name
        for item in sample_data:
            f.write(json_mod.dumps(item) + '\n')
    
    try:
        # Run evaluation with cloud tracking
        results = evaluate(
            data=temp_path,  # Must be a file path to JSONL
            evaluators={
                "relevance": relevance_eval,  # Use existing evaluator instances
                "coherence": coherence_eval,
            },
            azure_ai_project=azure_project,  # Enable cloud tracking
            evaluation_name="workshop_demo_evaluation"
        )
        print("☁️ Evaluation submitted to Azure AI Foundry")
        print("   View results at: https://ai.azure.com")
    finally:
        os.unlink(temp_path)
else:
    print("⚠️ Skipping Azure AI Foundry demo")
    print("   Requires both model_config and azure_project to be configured")

## Part 10: Prompt Tuning Based on Evaluation

Use evaluation results to iteratively improve prompts.

In [22]:
# Cell 21: Example prompt tuning workflow
from src.common.prompt_tuning import PromptTuner

# Initialize tuner - uses registry, analyzer, and A/B runner internally
tuner = PromptTuner()

# Create and register prompts using the actual API
base_prompt = tuner.create_prompt(
    name="qa_prompt",
    content="Answer the following question: {question}",
    metadata={"temperature": 0.7, "max_tokens": 500}
)

# Create iterations with the iterate method
detailed_prompt = tuner.iterate(
    name="qa_prompt",
    new_content="Please provide a detailed answer to: {question}\nInclude examples where relevant.",
    changes="Added instruction for detailed response with examples"
)

print("🎯 Prompt Tuner Configuration:")
print(f"   Base prompt: {base_prompt.content[:50]}...")
print(f"   Latest version: {detailed_prompt.version}")
print(f"   Versions registered: {len(tuner.registry.list_versions('qa_prompt'))}")

In [23]:
# Cell 22: Select best prompt variation based on evaluation scores
# In practice, you would run evaluations on each variation and compare scores

print("📊 Prompt Evaluation Strategy:")
print("   1. Run each prompt variation through your test dataset")
print("   2. Evaluate responses with SDK quality evaluators")
print("   3. Compare average scores across variations")
print("   4. Select the variation with highest combined quality score")
print("\n💡 Tip: Use batch_evaluate() to efficiently test multiple prompt variations")
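
Steps 3 and 4 of the strategy above can be sketched in plain Python. This is a minimal illustration, assuming each variation already has per-sample scores on the 1-5 scale; the variation names, the scores, and the unweighted combined score are assumptions, not output of `PromptTuner` or the SDK.

```python
from statistics import mean


def average_scores(rows, metrics):
    """Average each metric over a list of per-sample score dicts."""
    return {m: mean(row[m] for row in rows) for m in metrics}


def select_best_variation(scores_by_variation, metrics):
    """Return (best_name, summary), where summary maps name -> combined score."""
    summary = {}
    for name, rows in scores_by_variation.items():
        averages = average_scores(rows, metrics)
        # Combined score: unweighted mean of the per-metric averages
        summary[name] = mean(averages.values())
    best = max(summary, key=summary.get)
    return best, summary


# Made-up scores for two hypothetical prompt versions
scores_by_variation = {
    "qa_prompt_v1": [
        {"relevance": 4, "coherence": 3},
        {"relevance": 3, "coherence": 4},
    ],
    "qa_prompt_v2": [
        {"relevance": 5, "coherence": 4},
        {"relevance": 4, "coherence": 4},
    ],
}

best, summary = select_best_variation(scores_by_variation, ["relevance", "coherence"])
print(f"Best variation: {best}")
print(f"Combined scores: {summary}")
```

An unweighted mean treats every metric as equally important. If one dimension matters more for your agent, a weighted mean is a straightforward change.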

## 🎓 Exercise: Build Your Evaluation Pipeline

Create a complete evaluation pipeline using the SDK evaluators.

**Task:** Evaluate a set of agent responses across multiple quality dimensions.

In [24]:
# Exercise: Complete this evaluation pipeline
# 
# Step 1: Define your test data
exercise_data = [
    {
        "query": "What are the benefits of cloud computing?",
        "context": "Cloud computing offers scalability, cost savings, and flexibility.",
        "response": "Cloud computing provides several benefits including scalability, reduced infrastructure costs, and the flexibility to access resources from anywhere."
    },
    # TODO: Add 2 more test cases
]

# Step 2: Create evaluators (uncomment when model_config is available)
# if model_config:
#     evaluators = {
#         "relevance": create_relevance_evaluator(model_config),
#         "coherence": create_coherence_evaluator(model_config),
#         "groundedness": create_groundedness_evaluator(model_config),
#     }

# Step 3: Run batch evaluation
# results = batch_evaluate(
#     data=exercise_data,
#     evaluators=["relevance", "coherence", "groundedness"],
#     model_config=model_config
# )

# Step 4: Analyze results
# print("Evaluation Results:")
# for key, value in results.items():
#     print(f"  {key}: {value}")

print("📝 Exercise: Complete the TODO items above to build your evaluation pipeline!")
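
Before running Step 3, it can help to check that every test case carries the three fields used in this notebook's sample data (`query`, `context`, `response`). The helper below is a hypothetical sketch, not part of `src.common.evaluation`.

```python
REQUIRED_FIELDS = ("query", "context", "response")


def find_invalid_records(records, required=REQUIRED_FIELDS):
    """Return a list of (index, missing_fields) for incomplete records."""
    problems = []
    for index, record in enumerate(records):
        # A field counts as missing if it is absent, not a string, or blank
        missing = [
            field
            for field in required
            if not isinstance(record.get(field), str) or not record[field].strip()
        ]
        if missing:
            problems.append((index, missing))
    return problems


# Illustrative records: the second lacks context, the third has a blank response
candidate_data = [
    {"query": "What is Azure?", "context": "Azure is a cloud platform.", "response": "A cloud platform."},
    {"query": "What is an agent?", "response": "A system that acts on goals."},
    {"query": "What is MCP?", "context": "MCP is a protocol.", "response": "   "},
]

for index, missing in find_invalid_records(candidate_data):
    print(f"Record {index} is missing: {', '.join(missing)}")
```

Running this check on `exercise_data` after adding your own test cases should return an empty list.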

## 📚 Summary

In this notebook, you learned:

### Azure AI Evaluation SDK Features
- **Quality Evaluators**: RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, GroundednessEvaluator
- **Agent Evaluators**: IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
- **Batch Evaluation**: Using `evaluate()` for efficient multi-sample evaluation
- **Custom Evaluators**: Extending the SDK with your own evaluation logic
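
As a reminder of the custom evaluator pattern, here is a minimal sketch, assuming a custom evaluator is a callable that accepts keyword arguments and returns a dictionary of results. The class `ResponseLengthEvaluator`, its thresholds, and its output keys are hypothetical.

```python
class ResponseLengthEvaluator:
    """Hypothetical custom evaluator: checks that a response's word count is in range."""

    def __init__(self, min_words=5, max_words=150):
        self.min_words = min_words
        self.max_words = max_words

    def __call__(self, *, response, **kwargs):
        # Extra keyword arguments (query, context, ...) are accepted and ignored
        word_count = len(response.split())
        within_range = self.min_words <= word_count <= self.max_words
        return {
            "word_count": word_count,
            "length_result": "pass" if within_range else "fail",
        }


length_eval = ResponseLengthEvaluator()
long_result = length_eval(
    response="Machine learning is a branch of artificial intelligence "
    "where systems learn patterns from data to make predictions "
    "without explicit programming."
)
short_result = length_eval(response="It depends.", query="What is machine learning?")
print(long_result)
print(short_result)
```

Because the instance is an ordinary callable, it can be tested on its own before being added to an `evaluators` dictionary next to the built-in evaluators.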

### Key Concepts
1. **Model Configuration**: SDK evaluators require Azure OpenAI model config
2. **Column Mapping**: Map your data fields to evaluator parameters
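3. **Scoring**: Quality evaluators use a 1-5 scale; `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` add a pass/fail result, and `ToolCallAccuracyEvaluator` scores from 0 to 1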
4. **Extensibility**: Create custom evaluators matching the SDK pattern
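
Column mapping can be illustrated with a small sketch. The dictionary follows the `${data.<column>}` placeholder form used by the `evaluator_config` argument of `evaluate()`. The dataset columns `question` and `answer` are hypothetical, and `resolve_mapping` is only an illustration of the idea, not the SDK's implementation.

```python
import re

# Maps evaluator parameters (left) to dataset columns (right).
# "question" and "answer" are hypothetical column names.
evaluator_config = {
    "relevance": {
        "column_mapping": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    }
}

_PLACEHOLDER = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping, row):
    """Illustrative only: build evaluator keyword arguments from one data row."""
    kwargs = {}
    for param, template in column_mapping.items():
        match = _PLACEHOLDER.match(template)
        if match is None:
            raise ValueError(f"Unsupported template: {template}")
        kwargs[param] = row[match.group(1)]
    return kwargs


row = {"question": "What is machine learning?", "answer": "A subset of AI."}
resolved = resolve_mapping(evaluator_config["relevance"]["column_mapping"], row)
print(resolved)
```

When your dataset already uses the parameter names the evaluators expect (`query`, `response`, `context`), as `sample_data` does in this notebook, no mapping is needed.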

### Best Practices
- ✅ Use batch evaluation for large datasets
- ✅ Combine multiple evaluators for comprehensive assessment
- ✅ Track costs alongside quality metrics
- ✅ Integrate with Azure AI Foundry for cloud tracking (optional)

### Resources
- [Azure AI Evaluation SDK Documentation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)
- [Evaluator Reference](https://learn.microsoft.com/en-us/python/api/azure-ai-evaluation/)
- [Azure AI Foundry Portal](https://ai.azure.com)

---

**Next Steps:** Apply evaluation to your own agents and use results to guide prompt tuning!