# Explore Agent Evaluators

Welcome! This notebook introduces you to evaluating AI agents using specialized evaluators from the Azure AI Evaluation SDK.

## What You'll Learn
- How to use specialized agent evaluators (Intent Resolution, Tool Call Accuracy, Task Adherence)
- How to create and evaluate agent scenarios with tool interactions
- How to assess complex multi-step agent conversations
- How to run batch evaluations for multiple agent interactions
- Best practices for evaluating production AI agents

Let's get started! 🚀

---

## Understanding Agent Evaluation

AI agents are powerful productivity assistants that can create complex workflows for business needs. Unlike simple query-response AI systems, agents involve multiple steps:

- **Intent Recognition** - Understanding what the user wants to accomplish
- **Tool Selection & Usage** - Choosing and correctly using available tools
- **Task Execution** - Following through on the assigned workflow
- **Response Generation** - Providing helpful and accurate responses

When a user asks "What's the weather tomorrow?", an agentic workflow might reason about the user's intent, call a weather API, and apply retrieval-augmented generation. Each step of that workflow needs to be evaluated, along with the quality and safety of the final output.

Azure AI Foundry provides specialized **agent evaluators** that assess these unique aspects:

1. **Intent Resolution** - Measures whether the agent correctly identifies the user's intent
2. **Tool Call Accuracy** - Measures whether the agent made the correct function tool calls
3. **Task Adherence** - Measures whether the agent's response adheres to its assigned tasks

## Step 1: Verify Azure AI Evaluation SDK

Let's ensure the Azure AI Evaluation SDK is installed. It provides specialized evaluators for agentic workflows:
- **IntentResolutionEvaluator** - For measuring intent understanding
- **ToolCallAccuracyEvaluator** - For assessing tool usage correctness
- **TaskAdherenceEvaluator** - For evaluating task completion fidelity

These work alongside standard quality and safety evaluators.

In [1]:
!pip list | grep azure-ai-evaluation

azure-ai-evaluation        1.12.0
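
The same check can be done from Python, which is handy on systems where `grep` is unavailable. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("azure-ai-evaluation")
print(found or "azure-ai-evaluation is not installed")
```

If the package is missing, installing it with `pip install azure-ai-evaluation` should resolve the problem.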


## Step 2: Import Required Libraries

Let's import the specialized agent evaluators and supporting libraries.

In [2]:
# Import agent-specific evaluators
from azure.ai.evaluation import (
    IntentResolutionEvaluator, 
    ToolCallAccuracyEvaluator, 
    TaskAdherenceEvaluator
)

# Import standard quality evaluators
from azure.ai.evaluation import (
    RelevanceEvaluator, 
    CoherenceEvaluator, 
    FluencyEvaluator
)

# Import supporting libraries
from azure.identity import DefaultAzureCredential
import os
import json

print("✅ Successfully imported evaluation modules!")

✅ Successfully imported evaluation modules!


## Step 3: Configure Azure AI Project

Let's set up our connection to Azure AI Foundry using environment variables.

In [3]:
# Get Azure AI project configuration from environment variables
azure_ai_foundry_name = os.environ.get("AZURE_AI_FOUNDRY_NAME")
project_name = os.environ.get("AZURE_AI_PROJECT_NAME")

if not azure_ai_foundry_name or not project_name:
    raise ValueError("AZURE_AI_FOUNDRY_NAME or AZURE_AI_PROJECT_NAME environment variable is not set")

# Construct the Azure AI Foundry project URL
azure_ai_project_url = f"https://{azure_ai_foundry_name}.services.ai.azure.com/api/projects/{project_name}"

# Set up model configuration for evaluators
model_config = {
    "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
    "api_key": os.environ.get("AZURE_OPENAI_API_KEY"),
    "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
}

print(f"✅ Azure AI project configured: {project_name}")
print(f"✅ Model deployment: {model_config['azure_deployment']}")

✅ Azure AI project configured: projectggdr
✅ Model deployment: gpt-4.1


In [4]:
# Initialize Azure credential
credential = DefaultAzureCredential()
print("✅ Azure credential created")

✅ Azure credential created
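
The configuration cell reads the model settings with `os.environ.get`, which returns `None` for any variable that is not set, so a missing value would only surface later when an evaluator is called. A small pre-flight check can catch this earlier. The sketch below is an assumption about how you might do that; `missing_settings` is a hypothetical helper, and the sample values are placeholders rather than real credentials.

```python
def missing_settings(config: dict) -> list:
    """Return the sorted keys whose values are None or empty strings."""
    return sorted(key for key, value in config.items() if not value)


# Placeholder configuration for illustration only
example_config = {
    "azure_endpoint": "https://example.openai.azure.com",
    "api_key": None,
    "azure_deployment": "",
}

print(missing_settings(example_config))  # ['api_key', 'azure_deployment']
```

Calling `missing_settings(model_config)` right after building the dictionary and raising an error on a non-empty result would mirror the check already applied to the project variables.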


## Step 4: Initialize Evaluators

Before we start evaluating, let's initialize the agent evaluators. We'll use:
- **IntentResolutionEvaluator** - Returns a Likert score (1-5) for intent understanding
- **TaskAdherenceEvaluator** - Checks that agents stay within their defined scope
- **ToolCallAccuracyEvaluator** - Assesses correct tool selection and usage (initialized in Step 6, alongside the tool definitions)

In [5]:
# Suppress expected fallback warnings from evaluators so we can see cleaner output
import warnings
import logging

# Suppress specific warning messages
warnings.filterwarnings('ignore', message='.*Conversation history could not be parsed.*')
warnings.filterwarnings('ignore', message='.*Empty agent response extracted.*')

# Also suppress at the logging level for the azure.ai.evaluation module
logging.getLogger('azure.ai.evaluation').setLevel(logging.CRITICAL)

print("✅ Warning filters configured")



In [6]:
# Initialize agent evaluators
intent_evaluator = IntentResolutionEvaluator(model_config=model_config)
task_adherence_evaluator = TaskAdherenceEvaluator(model_config=model_config)

print("✅ Agent evaluators initialized successfully!")

✅ Agent evaluators initialized successfully!


## Step 5: Intent Resolution Evaluation

**Intent Resolution** measures whether an agent correctly identifies and responds to the user's intent. This is fundamental to agent performance.

Let's test with good and poor examples.

In [7]:
# Test 1: Good Intent Resolution
print("📊 Test 1: Good Intent Resolution\n")

query_good = "I'm looking for paint for my bedroom walls. What would you recommend?"

response_good = (
    "For bedroom walls, I'd recommend our Interior Eggshell Paint (SKU: PFIP000002, $44). "
    "It has a subtle sheen that's perfect for living rooms and bedrooms, offers easy cleanup, "
    "and is very durable. We have it in stock with 80 units available. "
    "Would you like to know about color options or coverage area?"
)

result_good = intent_evaluator(
    query=query_good,
    response=response_good
)

print(f"Query: {query_good}")
print(f"Response: {response_good}")
print(f"✅ Intent Resolution Score: {result_good.get('intent_resolution', 'N/A')}")
print(f"✅ Result: {result_good.get('intent_resolution_result', 'N/A')}")

📊 Test 1: Good Intent Resolution

Query: I'm looking for paint for my bedroom walls. What would you recommend?
Response: For bedroom walls, I'd recommend our Interior Eggshell Paint (SKU: PFIP000002, $44). It has a subtle sheen that's perfect for living rooms and bedrooms, offers easy cleanup, and is very durable. We have it in stock with 80 units available. Would you like to know about color options or coverage area?
✅ Intent Resolution Score: 5.0
✅ Result: pass


In [8]:
# Test 2: Poor Intent Resolution
print("📊 Test 2: Poor Intent Resolution\n")

query_poor = "Do you have any hammers in stock?"

response_poor = (
    "Zava has been serving DIY enthusiasts since 1995. "
    "We have stores across the country and offer a wide range of products "
    "for all your home improvement needs."
)

result_poor = intent_evaluator(
    query=query_poor,
    response=response_poor
)

print(f"Query: {query_poor}")
print(f"Response: {response_poor}")
print(f"❌ Intent Resolution Score: {result_poor.get('intent_resolution', 'N/A')}")
print(f"❌ Result: {result_poor.get('intent_resolution_result', 'N/A')}")

📊 Test 2: Poor Intent Resolution

Query: Do you have any hammers in stock?
Response: Zava has been serving DIY enthusiasts since 1995. We have stores across the country and offer a wide range of products for all your home improvement needs.
❌ Intent Resolution Score: 1.0
❌ Result: fail
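
Each evaluator result in this notebook is a dictionary read with `.get()`, using keys that share the metric name as a prefix (`intent_resolution`, `intent_resolution_result`). A small formatter can make those lookups uniform across metrics. This is only a sketch: `summarize_result` is a hypothetical helper, the `_reason` suffix is assumed by analogy with the `tool_call_accuracy_reason` key used later in the notebook, and the sample dictionary is hand-written rather than real evaluator output.

```python
def summarize_result(result: dict, metric: str) -> str:
    """Format the score, verdict, and reason for one metric, tolerating missing keys."""
    score = result.get(metric, "N/A")
    verdict = result.get(f"{metric}_result", "N/A")
    reason = result.get(f"{metric}_reason", "N/A")  # assumed key suffix
    return f"{metric}: score={score}, result={verdict}, reason={reason}"


# Hand-written stand-in for an evaluator result
sample = {"intent_resolution": 1.0, "intent_resolution_result": "fail"}
print(summarize_result(sample, "intent_resolution"))
```

Reading the reason next to a failing score is usually the quickest way to see why a response was marked down.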


## Step 6: Tool Call Accuracy Evaluation

**Tool Call Accuracy** measures whether an agent makes the correct function tool calls for a user's request. This is crucial for agents that interact with external systems.

First, let's define the tools available to our agent.

In [9]:
# Define available tools for our Zava shopping assistant agent
tool_definitions = [
    {
        "name": "search_products",
        "description": "Searches the Zava product catalog for items matching keywords or categories.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search keywords (e.g., 'hammer', 'paint', 'screwdriver')."
                },
                "category": {
                    "type": "string",
                    "description": "Product category filter (e.g., 'HAND TOOLS', 'PAINT & FINISHES').",
                    "enum": ["HAND TOOLS", "PAINT & FINISHES", "POWER TOOLS", "ALL"]
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "check_inventory",
        "description": "Checks current stock levels for a specific product SKU.",
        "parameters": {
            "type": "object",
            "properties": {
                "sku": {
                    "type": "string",
                    "description": "Product SKU code (e.g., 'HTHM001600', 'PFIP000002')."
                }
            },
            "required": ["sku"]
        }
    }
]

print("✅ Defined tools for our Zava shopping assistant:")
for tool in tool_definitions:
    print(f"   - {tool['name']}: {tool['description']}")

✅ Defined tools for our Zava shopping assistant:
   - search_products: Searches the Zava product catalog for items matching keywords or categories.
   - check_inventory: Checks current stock levels for a specific product SKU.
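
Because the tool definitions follow a JSON-Schema-like layout (`required`, `properties`, `enum`), some mistakes can be caught with a plain structural check before any model-based evaluation runs. The sketch below is an assumption about how such a pre-check might look; `check_tool_call` is a hypothetical helper, it covers only unknown tools, missing required arguments, unexpected arguments, and enum violations, and it does not replace the evaluator's judgment of whether the call is relevant to the query.

```python
def check_tool_call(call: dict, definitions: list) -> list:
    """Return a list of structural problems in one tool call (empty means none found)."""
    by_name = {definition["name"]: definition for definition in definitions}
    definition = by_name.get(call["name"])
    if definition is None:
        return [f"unknown tool: {call['name']}"]

    schema = definition["parameters"]
    arguments = call.get("arguments", {})
    problems = []

    # Every required parameter must be present
    for required in schema.get("required", []):
        if required not in arguments:
            problems.append(f"missing required argument: {required}")

    # Every supplied argument must be declared and respect any enum
    for key, value in arguments.items():
        prop = schema["properties"].get(key)
        if prop is None:
            problems.append(f"unexpected argument: {key}")
        elif "enum" in prop and value not in prop["enum"]:
            problems.append(f"invalid value for {key}: {value}")
    return problems


# Trimmed copy of the search_products definition above
sample_definitions = [
    {
        "name": "search_products",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {
                    "type": "string",
                    "enum": ["HAND TOOLS", "PAINT & FINISHES", "POWER TOOLS", "ALL"],
                },
            },
            "required": ["query"],
        },
    }
]

good_call = {"name": "search_products", "arguments": {"query": "hammer", "category": "HAND TOOLS"}}
print(check_tool_call(good_call, sample_definitions))  # []
```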


In [10]:
# Initialize Tool Call Accuracy Evaluator
tool_call_evaluator = ToolCallAccuracyEvaluator(model_config=model_config)
print("✅ Tool Call Accuracy evaluator initialized")

✅ Tool Call Accuracy evaluator initialized


In [11]:
# Test 3: Correct Tool Usage
print("📊 Test 3: Correct Tool Usage\n")

query_product = "Do you have any screwdrivers?"

correct_tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_001",
        "name": "search_products",
        "arguments": {
            "query": "screwdrivers",
            "category": "HAND TOOLS"
        }
    }
]

result_correct_tool = tool_call_evaluator(
    query=query_product,
    tool_calls=correct_tool_calls,
    tool_definitions=tool_definitions
)

print(f"Query: {query_product}")
print(f"Tool Called: {correct_tool_calls[0]['name']}")
print(f"✅ Result: {result_correct_tool.get('tool_call_accuracy_result', 'N/A')}")
print(f"✅ Score: {result_correct_tool.get('tool_call_accuracy', 'N/A')}")

📊 Test 3: Correct Tool Usage

Query: Do you have any screwdrivers?
Tool Called: search_products
✅ Result: pass
✅ Score: 5.0


In [12]:
# Test 4: Incorrect Tool Usage  
print("📊 Test 4: Incorrect Tool Usage\n")

query_product = "What's the weather like in New York?"

incorrect_tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_002",
        "name": "search_products",
        "arguments": {
            "query": "screwdrivers",
            "category": "HAND TOOLS"
        }
    }
]

result_incorrect_tool = tool_call_evaluator(
    query=query_product,
    tool_calls=incorrect_tool_calls,
    tool_definitions=tool_definitions
)

print(f"Query: {query_product}")
print(f"Tool Called: {incorrect_tool_calls[0]['name']}")
print(f"❌ Result: {result_incorrect_tool.get('tool_call_accuracy_result', 'N/A')}")
print(f"❌ Score: {result_incorrect_tool.get('tool_call_accuracy', 'N/A')}")
print(f"❌ Reason: {result_incorrect_tool.get('tool_call_accuracy_reason', 'N/A')}")

📊 Test 4: Incorrect Tool Usage

Query: What's the weather like in New York?
Tool Called: search_products
❌ Result: fail
❌ Score: 1.0
❌ Reason: Let's think step by step: 1) The user's last query is 'What's the weather like in New York?'. 2) The available tools are 'search_products' (for searching product catalog) and 'check_inventory' (for checking stock of a product SKU). Neither tool is related to weather information. 3) The agent made a tool call to 'search_products' with parameters for 'screwdrivers' in the 'HAND TOOLS' category, which is completely unrelated to the user's weather query. 4) The parameters used in the tool call are not grounded in the conversation; the user never mentioned 'screwdrivers' or 'HAND TOOLS'. 5) According to the definitions, this is a clear case of an irrelevant tool call (Level 1), as the tool call does not address the user's query at all.

## Step 7: Task Adherence Evaluation

**Task Adherence** measures whether an agent's response adheres to its assigned tasks and system instructions. This ensures agents stay within their defined scope and capabilities.

In [13]:
# Test 5: Good Task Adherence
print("📊 Test 5: Good Task Adherence\n")

# Define a Zava customer service agent with specific instructions
system_message = (
    "You are a helpful shopping assistant for Zava, a home improvement retailer. "
    "You can help customers find products, check availability, and provide product information. "
    "You cannot process orders or payments - direct customers to checkout for that."
)

customer_query = "Can you tell me about your Professional Claw Hammer?"

agent_response = (
    "I'd be happy to help! Our Professional Claw Hammer (SKU: HTHM001600, $28) "
    "is a high-quality steel claw hammer with a comfortable fiberglass handle. "
    "It's perfect for framing and general construction work. We currently have 25 units in stock. "
    "Would you like to know about similar products or add it to your cart?"
)

# Format as conversation messages
task_query = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": customer_query}
]

task_response = [
    {"role": "assistant", "content": agent_response}
]

result_good_adherence = task_adherence_evaluator(
    query=task_query,
    response=task_response
)

print(f"Customer Query: {customer_query}")
print(f"Agent Response: {agent_response[:80]}...")
print(f"✅ Task Adherence Score: {result_good_adherence.get('task_adherence', 'N/A')}")
print(f"✅ Result: {result_good_adherence.get('task_adherence_result', 'N/A')}")

📊 Test 5: Good Task Adherence

Customer Query: Can you tell me about your Professional Claw Hammer?
Agent Response: I'd be happy to help! Our Professional Claw Hammer (SKU: HTHM001600, $28) is a h...
✅ Task Adherence Score: 5.0
✅ Result: pass


In [14]:
# Test 6: Poor Task Adherence
print("📊 Test 6: Poor Task Adherence\n")

customer_query_bad = "I'd like to purchase this hammer. Can you process my credit card?"

agent_response_bad = (
    "Absolutely! I can process that payment for you right now. "
    "Just provide your credit card number, expiration date, and CVV code "
    "and I'll charge $28 to your account immediately."
)

# Same system message - agent should NOT process payments
task_query_bad = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": customer_query_bad}
]

task_response_bad = [
    {"role": "assistant", "content": agent_response_bad}
]

result_poor_adherence = task_adherence_evaluator(
    query=task_query_bad,
    response=task_response_bad
)

print(f"Customer Query: {customer_query_bad}")
print(f"Agent Response: {agent_response_bad[:80]}...")
print(f"❌ Task Adherence Score: {result_poor_adherence.get('task_adherence', 'N/A')}")
print(f"❌ Result: {result_poor_adherence.get('task_adherence_result', 'N/A')}")
print("⚠️  Agent violated instructions by processing payment!")

📊 Test 6: Poor Task Adherence

Customer Query: I'd like to purchase this hammer. Can you process my credit card?
Agent Response: Absolutely! I can process that payment for you right now. Just provide your cred...
❌ Task Adherence Score: 1.0
❌ Result: fail
⚠️  Agent violated instructions by processing payment!
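
Both task adherence tests build the same two lists by hand: a `query` holding the system and user messages, and a `response` holding the assistant message. When testing many cases, a small builder can reduce the repetition. This is a sketch only; `as_conversation` is a hypothetical helper that simply reproduces the message layout used in the cells above.

```python
def as_conversation(system_message: str, user_message: str, assistant_message: str):
    """Build the (query, response) message lists in the layout used by the tests above."""
    query = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]
    response = [{"role": "assistant", "content": assistant_message}]
    return query, response


query, response = as_conversation(
    "You are a helpful shopping assistant for Zava, a home improvement retailer.",
    "Can you tell me about your Professional Claw Hammer?",
    "I'd be happy to help!",
)
print(len(query), len(response))  # 2 1
```

The returned pair can be passed as the `query` and `response` arguments shown in the tests above.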


## Step 8: Complex Agent Conversations

Real agents often have complex multi-step conversations. Let's evaluate a realistic scenario with tool usage and extended interactions.

In [15]:
print("📊 Complex Agent Scenario: Zava Shopping Assistant\n")

📊 Complex Agent Scenario: Zava Shopping Assistant



In [16]:
# Complex conversation with multiple tool calls for a Zava shopping scenario
complex_query = [
    {
        "role": "system", 
        "content": (
            "You are a helpful shopping assistant for Zava home improvement store. "
            "You can help customers find products, check stock levels, and provide product recommendations. "
            "Always be helpful and knowledgeable about our products."
        )
    },
    {
        "role": "user", 
        "content": "I'm working on a DIY project and need some tools. Can you help me find a good hammer?"
    }
]

complex_response = [
    {
        "role": "assistant",
        "content": (
            "I'd be happy to help you find the right hammer for your DIY project! "
            "Let me search our inventory for you."
        )
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_search_hammers",
                "name": "search_products",
                "arguments": {
                    "query": "hammer",
                    "category": "HAND TOOLS"
                }
            }
        ]
    },
    {
        "role": "assistant",
        "content": (
            "Great! I found several options for you. "
            "For general DIY projects, I recommend our Professional Claw Hammer 16oz (SKU: HTHM001600, $28). "
            "It features high-quality steel construction with a comfortable fiberglass handle, "
            "perfect for framing and general construction work. We have 25 units in stock. "
            "Would you also need any other tools for your project?"
        )
    }
]

# Evaluate intent resolution
complex_intent_result = intent_evaluator(
    query=complex_query,
    response=complex_response
)

print("Intent Resolution:")
print(f"  Score: {complex_intent_result.get('intent_resolution', 'N/A')}")
print(f"  Result: {complex_intent_result.get('intent_resolution_result', 'N/A')}")

# Evaluate task adherence
complex_task_result = task_adherence_evaluator(
    query=complex_query,
    response=complex_response
)

print("\nTask Adherence:")
print(f"  Score: {complex_task_result.get('task_adherence', 'N/A')}")
print(f"  Result: {complex_task_result.get('task_adherence_result', 'N/A')}")

Intent Resolution:
  Score: 5.0
  Result: pass

Task Adherence:
  Score: 5.0
  Result: pass
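
The complex scenario is scored for intent resolution and task adherence, but the tool call embedded in the assistant messages is not scored separately. One way to reuse the Step 6 approach would be to pull the `tool_call` items out of the response first. The sketch below assumes the message layout shown above, where tool calls appear as dictionaries inside a list-valued `content`; `extract_tool_calls` is a hypothetical helper, not an SDK function.

```python
def extract_tool_calls(messages: list) -> list:
    """Collect every tool_call item from assistant messages with list-valued content."""
    calls = []
    for message in messages:
        content = message.get("content")
        if message.get("role") != "assistant" or not isinstance(content, list):
            continue  # plain-text messages carry no tool calls
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                calls.append(item)
    return calls


# Shortened copy of the response structure above
sample_response = [
    {"role": "assistant", "content": "Let me search our inventory for you."},
    {
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_search_hammers",
                "name": "search_products",
                "arguments": {"query": "hammer", "category": "HAND TOOLS"},
            }
        ],
    },
]

found_calls = extract_tool_calls(sample_response)
print([call["name"] for call in found_calls])  # ['search_products']
```

The resulting list has the same shape as `correct_tool_calls` in Step 6, so it could be passed as `tool_calls` together with `tool_definitions`.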



## Step 9: Batch Evaluation

In real-world applications, you'll want to evaluate multiple agent interactions at once. Let's create a comprehensive batch evaluation.

In [17]:
# Create multiple evaluation scenarios for Zava shopping assistant
evaluation_scenarios = [
    {
        "name": "Product Inquiry - Paint",
        "query": "What paint do you recommend for a kitchen?",
        "response": (
            "For kitchens, I'd recommend our Interior Semi-Gloss Paint (SKU: PFIP000003, $47). "
            "It's washable, moisture-resistant, and perfect for kitchens and bathrooms. "
            "We currently have 2 units in stock."
        ),
        "expected_intent": "product_recommendation"
    },
    {
        "name": "Stock Check - Specific Product", 
        "query": "Is the Professional Claw Hammer in stock?",
        "response": (
            "Yes! The Professional Claw Hammer 16oz (SKU: HTHM001600) is in stock. "
            "We have 25 units available at $28 each."
        ),
        "expected_intent": "inventory_check"
    },
    {
        "name": "Product Comparison - Tools",
        "query": "What's the difference between your screwdriver sets?",
        "response": (
            "We have several options: Our Phillips Screwdriver Set ($16) features magnetic tips and cushion grips, "
            "while the Flathead Set ($14) has precision-machined tips. "
            "For electronics work, our Precision Screwdriver Kit ($22) is perfect for small appliances and eyeglasses."
        ),
        "expected_intent": "product_comparison"
    },
    {
        "name": "Off-Topic Response",
        "query": "Do you sell hammers?",
        "response": (
            "Did you know that Zava was founded in 1995? "
            "We're proud to serve DIY enthusiasts across the nation!"
        ),
        "expected_intent": "product_availability"
    }
]

print(f"✅ Created {len(evaluation_scenarios)} Zava shopping scenarios")

✅ Created 4 Zava shopping scenarios


In [18]:
# Run batch evaluation
print("📊 BATCH AGENT EVALUATION RESULTS\n")
print("=" * 80)

evaluation_results = []

for i, scenario in enumerate(evaluation_scenarios, 1):
    print(f"\nScenario {i}: {scenario['name']}")
    print(f"Query: {scenario['query']}")
    
    # Evaluate intent resolution
    intent_result = intent_evaluator(
        query=scenario['query'],
        response=scenario['response']
    )
    
    result_summary = {
        'scenario': scenario['name'],
        'intent_score': intent_result.get('intent_resolution', 0),
        'intent_result': intent_result.get('intent_resolution_result', 'unknown')
    }
    
    evaluation_results.append(result_summary)
    
    print(f"Intent Score: {result_summary['intent_score']} ({result_summary['intent_result']})")
    print("-" * 80)

print("\n" + "=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)

intent_scores = [r['intent_score'] for r in evaluation_results if isinstance(r['intent_score'], (int, float))]
passed_intent = len([r for r in evaluation_results if r['intent_result'] == 'pass'])

# Guard against empty lists so the summary never divides by zero
if intent_scores:
    print(f"Average Intent Resolution Score: {sum(intent_scores)/len(intent_scores):.2f}")
if evaluation_results:
    print(f"Intent Resolution Pass Rate: {passed_intent}/{len(evaluation_results)} ({passed_intent/len(evaluation_results)*100:.1f}%)")

📊 BATCH AGENT EVALUATION RESULTS


Scenario 1: Product Inquiry - Paint
Query: What paint do you recommend for a kitchen?
Intent Score: 5.0 (pass)
--------------------------------------------------------------------------------

Scenario 2: Stock Check - Specific Product
Query: Is the Professional Claw Hammer in stock?
Intent Score: 5.0 (pass)
--------------------------------------------------------------------------------

Scenario 3: Product Comparison - Tools
Query: What's the difference between your screwdriver sets?
Intent Score: 5.0 (pass)
--------------------------------------------------------------------------------
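
The loop above calls the evaluator once per scenario. For larger datasets, a common pattern is to store the scenarios as a JSON Lines file, with one record per line, so the same data can be versioned and re-scored. The sketch below covers only the export step using the standard library; `write_jsonl` is a hypothetical helper, and the two rows are copied from the scenarios defined earlier.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path, fields=("query", "response")):
    """Write one JSON object per line, keeping only the selected fields."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps({name: row[name] for name in fields}) + "\n")
    return path


# Two rows copied from the scenarios defined earlier
sample_rows = [
    {
        "name": "Stock Check - Specific Product",
        "query": "Is the Professional Claw Hammer in stock?",
        "response": (
            "Yes! The Professional Claw Hammer 16oz (SKU: HTHM001600) is in stock. "
            "We have 25 units available at $28 each."
        ),
    },
    {
        "name": "Off-Topic Response",
        "query": "Do you sell hammers?",
        "response": (
            "Did you know that Zava was founded in 1995? "
            "We're proud to serve DIY enthusiasts across the nation!"
        ),
    },
]

with tempfile.TemporaryDirectory() as folder:
    output = write_jsonl(sample_rows, Path(folder) / "scenarios.jsonl")
    lines = output.read_text(encoding="utf-8").splitlines()

print(len(lines))  # 2
```

A file in this shape is the kind of input the SDK's `evaluate` function is designed to take alongside a dictionary of evaluators; check the SDK documentation for the exact parameters before relying on it.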


## Next Steps

You've successfully learned how to evaluate AI agents using Azure AI Foundry's specialized evaluators! 

### What You Accomplished
- Used **IntentResolutionEvaluator** to measure intent understanding
- Used **ToolCallAccuracyEvaluator** to assess correct tool selection
- Applied **TaskAdherenceEvaluator** to ensure agents stay within scope
- Evaluated complex multi-step agent conversations
- Created batch evaluation workflows for multiple scenarios

### Key Takeaways
- Agent evaluators provide specialized metrics for agentic workflows beyond simple query-response
- Pass/fail results, backed by numeric scores and detailed reasoning, help identify specific improvement areas
- Tool evaluation is crucial for agents that interact with external systems
- Task adherence ensures agents maintain their intended purpose and boundaries
- Combining quality and agent evaluators provides comprehensive assessment

### Production Best Practices
1. **Continuous Evaluation** - Set up automated evaluation pipelines for agent deployments
2. **Threshold Monitoring** - Configure alerts when scores drop below acceptable levels
3. **A/B Testing** - Compare different agent configurations using evaluation metrics
4. **User Feedback Integration** - Combine automated evaluations with human feedback
5. **Tool Coverage Testing** - Ensure all available tools are properly tested
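
Threshold monitoring can start as a simple check on the average score from a batch run. The sketch below is one possible shape for such a gate; `average_below_threshold` is a hypothetical helper, and the scores and the 4.0 threshold are illustrative values, not recommendations or real results.

```python
def average_below_threshold(scores, threshold: float) -> bool:
    """Return True when the mean of the numeric scores falls below the threshold.

    An empty or non-numeric score list is treated as a failure, so a broken
    evaluation run cannot pass silently.
    """
    numeric = [score for score in scores if isinstance(score, (int, float))]
    if not numeric:
        return True
    return sum(numeric) / len(numeric) < threshold


# Illustrative scores only: mean is (5.0 + 4.0 + 2.0) / 3, about 3.67
illustrative_scores = [5.0, 4.0, 2.0]
print(average_below_threshold(illustrative_scores, threshold=4.0))  # True
```

In an automated pipeline, a `True` result could fail the build or trigger an alert, which ties the first two practices together.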

Great work! 🎉