# Explore Agent Evaluators

Welcome! This notebook introduces you to evaluating AI agents using specialized evaluators from the Azure AI Evaluation SDK.

## What You'll Learn
- How to use specialized agent evaluators (Intent Resolution, Tool Call Accuracy, Task Adherence)
- How to create and evaluate agent scenarios with tool interactions
- How to assess complex multi-step agent conversations
- How to run batch evaluations for multiple agent interactions
- Best practices for evaluating production AI agents

Let's get started! 🚀

---

## Understanding Agent Evaluation

AI agents are powerful productivity assistants that can create complex workflows for business needs. Unlike simple query-response AI systems, agents involve multiple steps:

- **Intent Recognition** - Understanding what the user wants to accomplish
- **Tool Selection & Usage** - Choosing and correctly using available tools
- **Task Execution** - Following through on the assigned workflow
- **Response Generation** - Providing helpful and accurate responses

When a user asks "What's the weather tomorrow?", an agentic workflow might involve reasoning through user intents, calling weather APIs, and using retrieval-augmented generation. It's crucial to evaluate each step of the workflow, plus the quality and safety of the final output.

Azure AI Foundry provides specialized **agent evaluators** that assess these unique aspects:

1. **Intent Resolution** - Measures whether the agent correctly identifies the user's intent
2. **Tool Call Accuracy** - Measures whether the agent made the correct function tool calls
3. **Task Adherence** - Measures whether the agent's response adheres to its assigned tasks

## Step 1: Verify Azure AI Evaluation SDK

Let's ensure the Azure AI Evaluation SDK is installed. It provides specialized evaluators for agentic workflows:
- **IntentResolutionEvaluator** - For measuring intent understanding
- **ToolCallAccuracyEvaluator** - For assessing tool usage correctness
- **TaskAdherenceEvaluator** - For evaluating task completion fidelity

These work alongside standard quality and safety evaluators.

In [None]:
!pip list | grep azure-ai-evaluation
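If `grep` isn't available in your shell (for example, on Windows), the same check can be sketched in pure Python with the standard library. The helper name `installed_version` is our own and is not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


sdk_version = installed_version("azure-ai-evaluation")
if sdk_version is None:
    print("azure-ai-evaluation is not installed; run: pip install azure-ai-evaluation")
else:
    print(f"azure-ai-evaluation {sdk_version} is installed")
```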

## Step 2: Import Required Libraries

Let's import the specialized agent evaluators and supporting libraries.

In [None]:
# Import agent-specific evaluators
from azure.ai.evaluation import (
    IntentResolutionEvaluator, 
    ToolCallAccuracyEvaluator, 
    TaskAdherenceEvaluator
)

# Import standard quality evaluators
from azure.ai.evaluation import (
    RelevanceEvaluator, 
    CoherenceEvaluator, 
    FluencyEvaluator
)

# Import supporting libraries
from azure.identity import DefaultAzureCredential
import os
import json

print("✅ Successfully imported evaluation modules!")

## Step 3: Configure Azure AI Project

Let's set up our connection to Azure AI Foundry using environment variables.

In [None]:
# Get Azure AI project configuration from environment variables
azure_ai_foundry_name = os.environ.get("AZURE_AI_FOUNDRY_NAME")
project_name = os.environ.get("AZURE_AI_PROJECT_NAME")

if not azure_ai_foundry_name or not project_name:
    raise ValueError("AZURE_AI_FOUNDRY_NAME or AZURE_AI_PROJECT_NAME environment variable is not set")

# Construct the Azure AI Foundry project URL
azure_ai_project_url = f"https://{azure_ai_foundry_name}.services.ai.azure.com/api/projects/{project_name}"

# Set up model configuration for evaluators
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

print(f"✅ Azure AI project configured: {project_name}")
print(f"✅ Model deployment: {model_config['azure_deployment']}")
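The cell above only validates the two project variables, so a missing `AZURE_OPENAI_*` value would surface later as `None` in `model_config`. Here is a minimal sketch of checking all five variables up front. `REQUIRED_VARS` and `find_missing_vars` are hypothetical helpers introduced for illustration.

```python
import os

# The five environment variables read in the configuration cell
REQUIRED_VARS = [
    "AZURE_AI_FOUNDRY_NAME",
    "AZURE_AI_PROJECT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT",
]


def find_missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set")
```

Reporting every missing name at once saves a round trip compared with failing on the first one.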

## Step 4: Initialize Evaluators

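Before we start evaluating, let's initialize the agent evaluators. This notebook uses three; the first two are created here, and `ToolCallAccuracyEvaluator` is created in Step 6 once the tools are defined: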
- **IntentResolutionEvaluator** - Returns Likert score (1-5) for intent understanding
- **TaskAdherenceEvaluator** - Ensures agents stay within defined scope
- **ToolCallAccuracyEvaluator** - Assesses correct tool selection and usage

In [None]:
# Initialize Azure credential
credential = DefaultAzureCredential()
print("✅ Azure credential created")

In [None]:
# Initialize agent evaluators
intent_evaluator = IntentResolutionEvaluator(model_config=model_config)
task_adherence_evaluator = TaskAdherenceEvaluator(model_config=model_config)

print("✅ Agent evaluators initialized successfully!")

## Step 5: Intent Resolution Evaluation

**Intent Resolution** measures whether an agent correctly identifies and responds to the user's intent. This is fundamental to agent performance.

Let's test with good and poor examples.

In [None]:
# Test 1: Good Intent Resolution
print("📊 Test 1: Good Intent Resolution\n")

query_good = "What are the opening hours of the Eiffel Tower?"
response_good = "The Eiffel Tower is open daily from 9:00 AM to 11:00 PM. During summer months (mid-June to early September), it stays open until midnight."

result_good = intent_evaluator(
    query=query_good,
    response=response_good
)

print(f"Query: {query_good}")
print(f"Response: {response_good}")
print(f"\n✅ Intent Resolution Score: {result_good.get('intent_resolution', 'N/A')}")
print(f"✅ Result: {result_good.get('intent_resolution_result', 'N/A')}")

In [None]:
# Test 2: Poor Intent Resolution
print("📊 Test 2: Poor Intent Resolution\n")

query_poor = "What are the opening hours of the Eiffel Tower?"
response_poor = "Paris is a beautiful city with many historical landmarks and museums."

result_poor = intent_evaluator(
    query=query_poor,
    response=response_poor
)

print(f"Query: {query_poor}")
print(f"Response: {response_poor}")
print(f"\n❌ Intent Resolution Score: {result_poor.get('intent_resolution', 'N/A')}")
print(f"❌ Result: {result_poor.get('intent_resolution_result', 'N/A')}")
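Both tests read the same two keys from the result dictionary, so a small helper can keep later cells shorter. The sketch below uses hand-written sample dictionaries rather than real evaluator output. The fallback `threshold=3` is an assumption for illustration, not a documented SDK default; when the evaluator supplies its own `*_result` value, that value is used as-is.

```python
def summarize(result, metric, threshold=3):
    """Return (score, verdict) for a metric such as 'intent_resolution'.

    If the result has no '<metric>_result' key, fall back to comparing the
    score against an ASSUMED threshold (3 on the 1-5 Likert scale).
    """
    score = result.get(metric)
    verdict = result.get(f"{metric}_result")
    if verdict is None and isinstance(score, (int, float)):
        verdict = "pass" if score >= threshold else "fail"
    return score, verdict


# Hypothetical sample values, not actual evaluator output
sample_good = {"intent_resolution": 5.0, "intent_resolution_result": "pass"}
sample_poor = {"intent_resolution": 1.0}

print(summarize(sample_good, "intent_resolution"))
print(summarize(sample_poor, "intent_resolution"))
```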

## Step 6: Tool Call Accuracy Evaluation

**Tool Call Accuracy** measures whether an agent makes the correct function tool calls for a user's request. This is crucial for agents that interact with external systems.

First, let's define the tools available to our agent.

In [None]:
# Define available tools for our agent
tool_definitions = [
    {
        "name": "get_weather",
        "description": "Fetches current weather information for a specified location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The location to get weather for (city, state/country)."
                },
                "units": {
                    "type": "string",
                    "description": "Temperature units (celsius or fahrenheit).",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    },
    {
        "name": "get_stock_price",
        "description": "Gets the current stock price for a company symbol.",
        "parameters": {
            "type": "object",
            "properties": {
                "symbol": {
                    "type": "string",
                    "description": "Stock symbol (e.g., MSFT, AAPL)."
                }
            },
            "required": ["symbol"]
        }
    }
]

print("✅ Defined tools for our agent:")
for tool in tool_definitions:
    print(f"   - {tool['name']}: {tool['description']}")

In [None]:
# Initialize Tool Call Accuracy Evaluator
tool_call_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
print("✅ Tool Call Accuracy evaluator initialized")

In [None]:
# Test 3: Correct Tool Usage
print("📊 Test 3: Correct Tool Usage\n")

query_weather = "What's the weather like in Seattle?"

correct_tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_001",
        "name": "get_weather",
        "arguments": {
            "location": "Seattle",
            "units": "fahrenheit"
        }
    }
]

result_correct_tool = tool_call_evaluator(
    query=query_weather,
    tool_calls=correct_tool_calls,
    tool_definitions=tool_definitions
)

print(f"Query: {query_weather}")
print(f"Tool Called: {correct_tool_calls[0]['name']}")
print(f"\n✅ Tool Call Accuracy Score: {result_correct_tool.get('tool_call_accuracy', 'N/A')}")
print(f"✅ Result: {result_correct_tool.get('tool_call_accuracy_result', 'N/A')}")

In [None]:
# Test 4: Incorrect Tool Usage
print("📊 Test 4: Incorrect Tool Usage\n")

query_weather2 = "What's the weather like in New York?"

# Incorrect: using stock price tool for weather query
incorrect_tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_002", 
        "name": "get_stock_price",
        "arguments": {
            "symbol": "NYC"
        }
    }
]

result_incorrect_tool = tool_call_evaluator(
    query=query_weather2,
    tool_calls=incorrect_tool_calls,
    tool_definitions=tool_definitions
)

print(f"Query: {query_weather2}")
print(f"Tool Called: {incorrect_tool_calls[0]['name']} (WRONG!)")
print(f"\n❌ Tool Call Accuracy Score: {result_incorrect_tool.get('tool_call_accuracy', 'N/A')}")
print(f"❌ Result: {result_incorrect_tool.get('tool_call_accuracy_result', 'N/A')}")
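Before spending model calls on the evaluator, it can help to run a cheap local check that each tool call at least matches its declared schema. The sketch below is our own helper (`check_tool_call` is not part of the SDK) and covers only required arguments, unknown arguments, and `enum` values. `sample_definitions` is a trimmed copy of the weather tool from the cell above.

```python
def check_tool_call(tool_call, tool_definitions):
    """Return a list of schema problems for one tool call (empty list = none found)."""
    definitions = {d["name"]: d for d in tool_definitions}
    definition = definitions.get(tool_call["name"])
    if definition is None:
        return [f"unknown tool: {tool_call['name']}"]

    schema = definition.get("parameters", {})
    properties = schema.get("properties", {})
    arguments = tool_call.get("arguments", {})
    problems = []

    # Every required parameter must be present
    for name in schema.get("required", []):
        if name not in arguments:
            problems.append(f"missing required argument: {name}")

    # Every supplied argument must be declared and respect any enum
    for name, value in arguments.items():
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
            continue
        allowed = properties[name].get("enum")
        if allowed is not None and value not in allowed:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


sample_definitions = [
    {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "units": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]

bad_call = {"name": "get_weather", "arguments": {"units": "kelvin"}}
print(check_tool_call(bad_call, sample_definitions))
```

Note that a schema check like this would not flag Test 4: calling `get_stock_price` with a `symbol` is structurally valid, just the wrong tool for a weather question. That kind of relevance judgment is what `ToolCallAccuracyEvaluator` is for.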

## Step 7: Task Adherence Evaluation

**Task Adherence** measures whether an agent's response adheres to its assigned tasks and system instructions. This ensures agents stay within their defined scope and capabilities.

In [None]:
# Test 5: Good Task Adherence
print("📊 Test 5: Good Task Adherence\n")

# Define a customer service agent with specific instructions
system_message = "You are a helpful customer service agent for TechCorp. You can only help with product information, order status, and technical support. You cannot process returns or refunds - direct customers to the returns department for that."

customer_query = "Can you tell me about the TechCorp laptop specifications?"
agent_response = "I'd be happy to help with our laptop specifications! The TechCorp Pro laptop features an Intel i7 processor, 16GB RAM, 512GB SSD, and a 15.6-inch display. It's designed for professional use with excellent battery life. Would you like more details about any specific aspect?"

# Format as conversation messages
task_query = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": customer_query}
]

task_response = [
    {"role": "assistant", "content": agent_response}
]

result_good_adherence = task_adherence_evaluator(
    query=task_query,
    response=task_response
)

print(f"Customer Query: {customer_query}")
print(f"Agent Response: {agent_response[:80]}...")
print(f"\n✅ Task Adherence Score: {result_good_adherence.get('task_adherence', 'N/A')}")
print(f"✅ Result: {result_good_adherence.get('task_adherence_result', 'N/A')}")

In [None]:
# Test 6: Poor Task Adherence
print("📊 Test 6: Poor Task Adherence\n")

customer_query_bad = "I want to return my laptop for a full refund."
agent_response_bad = "Absolutely! I can process that refund for you right now. Just provide your order number and I'll get $1,299 refunded to your account immediately."

# Same system message - agent should NOT process refunds
task_query_bad = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": customer_query_bad}
]

task_response_bad = [
    {"role": "assistant", "content": agent_response_bad}
]

result_poor_adherence = task_adherence_evaluator(
    query=task_query_bad,
    response=task_response_bad
)

print(f"Customer Query: {customer_query_bad}")
print(f"Agent Response: {agent_response_bad[:80]}...")
print(f"\n❌ Task Adherence Score: {result_poor_adherence.get('task_adherence', 'N/A')}")
print(f"❌ Result: {result_poor_adherence.get('task_adherence_result', 'N/A')}")
print("⚠️  Agent violated instructions by processing a refund!")
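Tests 5 and 6 build the same message structure by hand. A small helper (our own, named `build_messages` for illustration) can produce the `query` and `response` lists in one call, which keeps additional adherence tests short. The example strings below are placeholders.

```python
def build_messages(system_message, user_query, agent_response):
    """Build the (query, response) message lists used in the task adherence tests."""
    query = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_query},
    ]
    response = [{"role": "assistant", "content": agent_response}]
    return query, response


# Placeholder strings for illustration
query_msgs, response_msgs = build_messages(
    "You are a helpful customer service agent for TechCorp.",
    "Where is my order?",
    "I can check that for you. Could you share your order number?",
)
print([m["role"] for m in query_msgs + response_msgs])
```

The returned lists can be passed as `query=` and `response=` in the same way as `task_query` and `task_response` above.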

## Step 8: Complex Agent Conversations

Real agents often have complex multi-step conversations. Let's evaluate a realistic scenario with tool usage and extended interactions.

In [None]:
print("📊 Complex Agent Scenario: Travel Planning Assistant\n")

In [None]:
# Complex conversation with multiple tool calls
complex_query = [
    {
        "role": "system", 
        "content": "You are a travel planning assistant. You can help with weather information, flight searches, and hotel recommendations. Always provide helpful and accurate travel advice."
    },
    {
        "role": "user", 
        "content": "I'm planning a trip to Tokyo next week. Can you help me with weather information and suggest what to pack?"
    }
]

complex_response = [
    {
        "role": "assistant",
        "content": "I'll help you plan your Tokyo trip! Let me check the weather forecast for next week."
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_tokyo_weather",
                "name": "get_weather",
                "arguments": {
                    "location": "Tokyo, Japan",
                    "units": "celsius"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": "Based on the weather forecast, Tokyo will have mild temperatures around 18-22°C with some rain expected. I recommend packing: light layers for temperature changes, a waterproof jacket or umbrella for rain, comfortable walking shoes, and both casual and slightly formal clothing if you plan to visit restaurants or temples."
    }
]

# Evaluate intent resolution
complex_intent_result = intent_evaluator(
    query=complex_query,
    response=complex_response
)

print("Intent Resolution:")
print(f"  Score: {complex_intent_result.get('intent_resolution', 'N/A')}")
print(f"  Result: {complex_intent_result.get('intent_resolution_result', 'N/A')}")

# Evaluate task adherence
complex_task_result = task_adherence_evaluator(
    query=complex_query,
    response=complex_response
)

print("\nTask Adherence:")
print(f"  Score: {complex_task_result.get('task_adherence', 'N/A')}")
print(f"  Result: {complex_task_result.get('task_adherence_result', 'N/A')}")
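The scenario above is scored for intent resolution and task adherence, but its tool call is embedded inside the response messages. One way to reuse it for a tool-call check is to pull the `tool_call` items out of the message list. This is a minimal sketch with a hypothetical helper `extract_tool_calls` and a shortened sample response.

```python
def extract_tool_calls(messages):
    """Collect every item of type 'tool_call' from assistant messages with list content."""
    tool_calls = []
    for message in messages:
        content = message.get("content")
        if message.get("role") != "assistant" or not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                tool_calls.append(item)
    return tool_calls


# Shortened version of the travel-planning response above
sample_response = [
    {"role": "assistant", "content": "Let me check the weather forecast."},
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_tokyo_weather",
                "name": "get_weather",
                "arguments": {"location": "Tokyo, Japan", "units": "celsius"},
            }
        ],
    },
    {"role": "assistant", "content": "Expect mild temperatures with some rain."},
]

calls = extract_tool_calls(sample_response)
print([call["name"] for call in calls])
```

The extracted items have the same shape as `correct_tool_calls` in Step 6, so they could be passed as `tool_calls=` to `tool_call_evaluator` along with `tool_definitions`.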

## Step 9: Batch Evaluation

In real-world applications, you'll want to evaluate multiple agent interactions at once. Let's create a comprehensive batch evaluation.

In [None]:
# Create multiple evaluation scenarios
evaluation_scenarios = [
    {
        "name": "Customer Support - Product Info",
        "query": "What are the features of your premium subscription?",
        "response": "Our premium subscription includes unlimited storage, priority support, advanced analytics, and collaboration tools for teams up to 50 members.",
        "expected_intent": "product_information"
    },
    {
        "name": "Customer Support - Billing Issue", 
        "query": "I was charged twice this month, can you help?",
        "response": "I understand your concern about the duplicate charge. Let me look into your billing history and I'll make sure to resolve this for you right away.",
        "expected_intent": "billing_support"
    },
    {
        "name": "Travel Assistant - Weather Query",
        "query": "What should I expect for weather in London this weekend?",
        "response": "This weekend in London, expect cloudy skies with temperatures around 15-18°C (59-64°F). There's a 40% chance of light rain on Saturday, so I'd recommend bringing a light jacket and umbrella.",
        "expected_intent": "weather_information"
    },
    {
        "name": "Off-Topic Response",
        "query": "What's the capital of France?",
        "response": "I love cooking pasta! Here's my favorite recipe for spaghetti carbonara...",
        "expected_intent": "geography_question"
    }
]

print(f"✅ Created {len(evaluation_scenarios)} evaluation scenarios")

In [None]:
# Run batch evaluation
print("📊 BATCH AGENT EVALUATION RESULTS\n")
print("=" * 80)

evaluation_results = []

for i, scenario in enumerate(evaluation_scenarios, 1):
    print(f"\nScenario {i}: {scenario['name']}")
    print(f"Query: {scenario['query']}")
    
    # Evaluate intent resolution
    intent_result = intent_evaluator(
        query=scenario['query'],
        response=scenario['response']
    )
    
    result_summary = {
        'scenario': scenario['name'],
        'intent_score': intent_result.get('intent_resolution'),  # None (not 0) when no score is returned, so it won't skew the average
        'intent_result': intent_result.get('intent_resolution_result', 'unknown')
    }
    
    evaluation_results.append(result_summary)
    
    print(f"Intent Score: {result_summary['intent_score']} ({result_summary['intent_result']})")
    print("-" * 80)

print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)

intent_scores = [r['intent_score'] for r in evaluation_results if isinstance(r['intent_score'], (int, float))]
passed_intent = sum(1 for r in evaluation_results if r['intent_result'] == 'pass')

# Guard against empty inputs so the summary never divides by zero
if intent_scores:
    print(f"Average Intent Resolution Score: {sum(intent_scores)/len(intent_scores):.2f}")
else:
    print("Average Intent Resolution Score: N/A (no numeric scores returned)")

if evaluation_results:
    print(f"Intent Resolution Pass Rate: {passed_intent}/{len(evaluation_results)} ({passed_intent/len(evaluation_results)*100:.1f}%)")
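For larger batches, scenarios are often stored as JSON Lines (one JSON object per line) instead of being kept in a Python list. As far as we know, the SDK's batch `evaluate` function reads data in this format, but check its documentation for the exact parameters. The sketch below uses only the standard library. `write_jsonl`, `read_jsonl`, and the file name are our own choices.

```python
import json
import os
import tempfile


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSON Lines file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Two scenarios in the same shape as evaluation_scenarios above
sample_scenarios = [
    {
        "name": "Customer Support - Product Info",
        "query": "What are the features of your premium subscription?",
        "response": "Our premium subscription includes unlimited storage and priority support.",
    },
    {
        "name": "Off-Topic Response",
        "query": "What's the capital of France?",
        "response": "I love cooking pasta!",
    },
]

# Keep only the fields the evaluator consumes
rows = [{"query": s["query"], "response": s["response"]} for s in sample_scenarios]

jsonl_path = os.path.join(tempfile.mkdtemp(), "agent_eval_data.jsonl")
write_jsonl(rows, jsonl_path)
loaded = read_jsonl(jsonl_path)
print(f"Wrote and re-read {len(loaded)} rows")
```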

## Next Steps

You've successfully learned how to evaluate AI agents using Azure AI Foundry's specialized evaluators! 

### What You Accomplished
- Used **IntentResolutionEvaluator** to measure intent understanding
- Used **ToolCallAccuracyEvaluator** to assess correct tool selection
- Applied **TaskAdherenceEvaluator** to ensure agents stay within scope
- Evaluated complex multi-step agent conversations
- Created batch evaluation workflows for multiple scenarios

### Key Takeaways
- Agent evaluators provide specialized metrics for agentic workflows beyond simple query-response
- Scores paired with pass/fail results and detailed reasoning help identify specific improvement areas
- Tool evaluation is crucial for agents that interact with external systems
- Task adherence ensures agents maintain their intended purpose and boundaries
- Combining quality and agent evaluators provides comprehensive assessment

### Production Best Practices
1. **Continuous Evaluation** - Set up automated evaluation pipelines for agent deployments
2. **Threshold Monitoring** - Configure alerts when scores drop below acceptable levels
3. **A/B Testing** - Compare different agent configurations using evaluation metrics
4. **User Feedback Integration** - Combine automated evaluations with human feedback
5. **Tool Coverage Testing** - Ensure all available tools are properly tested
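
Threshold monitoring (item 2) can start as a simple function over the batch results from Step 9. The sketch below assumes result dictionaries shaped like `result_summary`. The limits `min_average=3.0` and `min_pass_rate=0.8` are illustrative values, not recommendations, and `check_thresholds` is a hypothetical helper.

```python
def check_thresholds(results, min_average=3.0, min_pass_rate=0.8):
    """Return alert messages when batch metrics fall below the given limits."""
    scores = [
        r["intent_score"]
        for r in results
        if isinstance(r.get("intent_score"), (int, float))
    ]
    if not results or not scores:
        return ["no scored results to monitor"]

    average = sum(scores) / len(scores)
    pass_rate = sum(1 for r in results if r.get("intent_result") == "pass") / len(results)

    alerts = []
    if average < min_average:
        alerts.append(f"average score {average:.2f} is below {min_average:.2f}")
    if pass_rate < min_pass_rate:
        alerts.append(f"pass rate {pass_rate:.0%} is below {min_pass_rate:.0%}")
    return alerts


# Hypothetical batch results for illustration
sample_results = [
    {"scenario": "A", "intent_score": 5, "intent_result": "pass"},
    {"scenario": "B", "intent_score": 1, "intent_result": "fail"},
]

for alert in check_thresholds(sample_results):
    print(f"ALERT: {alert}")
```

In a pipeline, the returned list could drive a non-zero exit code or a notification; an empty list means all limits were met.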

Great work! 🎉