# Lab 7: Agent Evaluation

This notebook evaluates the quality and performance of deployed Agents.

## Objectives

- Evaluate Agents using the Azure AI Evaluation SDK
- Collect performance, accuracy, and safety metrics
- Analyze and visualize evaluation results

## Prerequisites

1. Completed Notebooks 01-03 (Azure resources and Agent deployment)
2. `config.json` file exists
3. Azure AI Project access permissions

---


## ⚙️ Before You Start

**Select a Python kernel:**

1. Click **"Select Kernel"** in the top right of the notebook
2. Select **"Python Environments..."**
3. Select **`.venv (Python 3.x.x)`** (virtual environment created in project root)

> 💡 **GitHub Codespaces**: In Codespaces, the `.venv` environment is created automatically.  
> If you don't see `.venv`, create it in the terminal with `python -m venv .venv`.
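
To confirm that the selected kernel really runs inside a virtual environment, a quick check like the following can help. This is a minimal sketch: `in_virtualenv` is a hypothetical helper name, and it relies only on the standard `sys.prefix` / `sys.base_prefix` comparison.

```python
import sys

def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

print(f"Interpreter: {sys.executable}")
print(f"Inside a virtual environment: {in_virtualenv()}")
```

If the second line prints `False`, the notebook is probably using a system interpreter rather than `.venv`.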

---

## 1. Environment Setup and Load Config

Load the configuration generated in previous notebooks.

In [None]:
import json
import sys
import subprocess
from pathlib import Path

# Load config.json
config_path = Path("config.json")
if not config_path.exists():
    raise FileNotFoundError("config.json not found. Please run notebooks 01-03 first.")

with open(config_path, 'r') as f:
    config = json.load(f)

# Extract key variables
PROJECT_CONNECTION_STRING = config.get("project_connection_string", "")
simple_project_conn = PROJECT_CONNECTION_STRING.split(';')[0] if PROJECT_CONNECTION_STRING else ""

print("=== Configuration Loaded ===")
print(f"Resource Group: {config.get('resource_group')}")
print(f"Location: {config.get('location')}")
print(f"Project: {simple_project_conn}")
print(f"Model: {config.get('model_deployment_name')}")
print("\n" + "="*50)

## 2. Install Azure AI Evaluation Package

Install packages required for Agent evaluation.

In [None]:
# Install Azure AI Evaluation package
print("=== Installing Azure AI Evaluation Package ===\n")

result = subprocess.run(
    [sys.executable, "-m", "pip", "install", "-q", "azure-ai-evaluation"],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    print("✅ azure-ai-evaluation installed successfully")
else:
    print(f"⚠️  Installation warning: {result.stderr}")
    print("   Proceeding anyway...")

print("\n" + "="*50)

## 3. Evaluation Overview

### Evaluation Types

**Performance Evaluators:**
- **Intent Resolution**: Evaluate if the Agent correctly understood user intent
- **Tool Call Accuracy**: Evaluate if the Agent correctly called the right tools  
- **Task Adherence**: Evaluate if the Agent faithfully followed instructions

**Safety Evaluators:**
- **Content Safety**: Evaluate for inappropriate content (violence, hate, etc.)
- **Indirect Attack**: Detect indirect malicious attack attempts
- **Code Vulnerability**: Evaluate security vulnerabilities in generated code

### Evaluation Process

1. Generate test queries
2. Create Agent and execute queries
3. Collect responses and measure metrics
4. Evaluate quality with Evaluators
5. Save and analyze results
6. Clean up Agent

---

## 4. Generate Test Queries

Generate queries to test various Agent capabilities.
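
Each test query in this lab is a dictionary with a `query` and a `ground-truth` string. Before spending tokens on a run, it can be worth checking that every entry has both. The sketch below assumes that schema; `validate_eval_queries` is a hypothetical helper, not part of any SDK.

```python
def validate_eval_queries(queries):
    """Return a list of problems found; an empty list means the set looks usable."""
    problems = []
    for i, item in enumerate(queries, 1):
        for field in ("query", "ground-truth"):
            value = item.get(field) if isinstance(item, dict) else None
            # Treat missing, non-string, and whitespace-only values as problems.
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{field}'")
    return problems

sample = [
    {"query": "Please recommend beaches where I can surf.",
     "ground-truth": "Should recommend beaches where surfing is possible."},
    {"query": "   ", "ground-truth": ""},
]
print(validate_eval_queries(sample))
```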

In [None]:
# Create evals directory
import os
from pathlib import Path

evals_dir = Path("evals")
evals_dir.mkdir(exist_ok=True)

print("=== Creating Evaluation Test Queries ===\n")

# Define test queries
eval_queries = [
    {
        "query": "Hello. I would like to get travel destination recommendations.",
        "ground-truth": "Agent should respond to greeting and call Research Agent for travel recommendations."
    },
    {
        "query": "Please tell me the current weather in Busan. Include temperature and feels-like temperature.",
        "ground-truth": "Must provide accurate current weather information for Busan, including both temperature and feels-like temperature."
    },
    {
        "query": "What are some travel destinations in Jeju Island that I can enjoy with my family?",
        "ground-truth": "Should search and recommend family-friendly destinations in Jeju Island. Provide information considering natural attractions, experiential activities, accessibility, etc."
    },
    {
        "query": "Please recommend beaches where I can surf.",
        "ground-truth": "Should search and recommend beaches in Korea where surfing is possible. Provide information about surfing spots like Yangyang, Busan, etc."
    },
    {
        "query": "What are good autumn foliage spots to visit in fall?",
        "ground-truth": "Should search and recommend good autumn foliage spots to visit in fall. Provide information about natural attractions like Naejangsan, Seoraksan, etc."
    }
]

# Save as JSON file
eval_queries_path = evals_dir / "eval-queries.json"
with open(eval_queries_path, 'w', encoding='utf-8') as f:
    json.dump(eval_queries, f, indent=2, ensure_ascii=False)

print(f"✅ Created {eval_queries_path}")
print(f"\n📋 Test Queries ({len(eval_queries)} total):\n")

for i, query in enumerate(eval_queries, 1):
    print(f"   {i}. {query['query'][:60]}...")

print("\n💡 Each query tests different Agent capabilities:")
print("   • General conversation and travel intent understanding")
print("   • Weather lookup (Tool functionality)")
print("   • Travel destination knowledge search (RAG functionality)")

print("\n" + "="*60)

## 5. Run Agent Evaluation

Create evaluation Agent, execute test queries, and evaluate quality with Evaluators.

### Notes

- **Execution time**: Approximately 2-3 minutes
- **Region constraints**: Some Safety Evaluators are not supported in the eastus region
- **Permission issues**: Azure AI Project storage permissions may be required
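
Alongside the built-in evaluators, the next cell passes a small callable class (`OperationalMetricsEvaluator`) to `evaluate()`. The same pattern can carry simple custom checks. The sketch below is illustrative only: `LatencyBudgetEvaluator` and its `budget_seconds` parameter are hypothetical, and it assumes the evaluator receives the row's `metrics` dictionary as a keyword argument, as the class in the next cell does.

```python
class LatencyBudgetEvaluator:
    """Hypothetical evaluator: flags runs slower than a chosen time budget."""

    def __init__(self, budget_seconds: float = 10.0):
        self.budget_seconds = budget_seconds

    def __call__(self, *, metrics: dict, **kwargs):
        # Key name matches the operational metrics collected in this lab.
        duration = metrics.get("client-run-duration-in-seconds", 0.0)
        return {
            "within_budget": duration <= self.budget_seconds,
            "seconds_over_budget": max(0.0, duration - self.budget_seconds),
        }

evaluator = LatencyBudgetEvaluator(budget_seconds=5.0)
print(evaluator(metrics={"client-run-duration-in-seconds": 7.5}))
```

Calling the evaluator directly, as in the last two lines, is a convenient way to test it before wiring it into a full evaluation run.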

In [None]:
# Run Agent Evaluation
import time
from pathlib import Path
from urllib.parse import urlparse
from azure.ai.agents.models import RunStatus, MessageRole
from azure.ai.projects import AIProjectClient
from azure.ai.evaluation import (
    AIAgentConverter, evaluate, ToolCallAccuracyEvaluator, IntentResolutionEvaluator, 
    TaskAdherenceEvaluator
)
from azure.identity import DefaultAzureCredential

print("=== Running Agent Evaluation ===\n")

# Set file paths
current_dir = Path(".")
evals_dir = current_dir / "evals"
eval_queries_path = evals_dir / "eval-queries.json"
eval_input_path = evals_dir / "eval-input.jsonl"
eval_output_path = evals_dir / "eval-output.json"

# Reuse the values already loaded from config.json
project_endpoint = simple_project_conn
parsed_endpoint = urlparse(project_endpoint)
model_endpoint = f"{parsed_endpoint.scheme}://{parsed_endpoint.netloc}"
deployment_name = config.get("model_deployment_name", "gpt-4o")

print(f"📋 Configuration:")
print(f"   Project: {project_endpoint}")
print(f"   Model: {deployment_name}")
print(f"   Test Queries: {eval_queries_path}")
print(f"\n")

# Initialize AIProjectClient
print("🔌 Connecting to AI Project...")
credential = DefaultAzureCredential()
ai_project = AIProjectClient(
    credential=credential,
    endpoint=project_endpoint,
    api_version="2025-05-15-preview"  # Evaluations require preview API
)
print("✅ Connected\n")

# Create evaluation Agent
print("🤖 Creating Evaluation Agent...")
eval_agent = ai_project.agents.create_agent(
    model=deployment_name,
    name="Evaluation Agent",
    instructions="""You are a helpful travel and weather assistant.
    
You can help users with:
1. Travel recommendations and destination information
2. Weather information for any city
3. General travel planning advice

Be friendly, informative, and provide detailed responses."""
)
print(f"✅ Agent created: {eval_agent.name} (ID: {eval_agent.id})\n")

# Setup evaluation config
api_version = config.get("api_version", "2024-08-01-preview")
model_config = {
    "azure_deployment": deployment_name,
    "azure_endpoint": model_endpoint,
    "api_version": api_version,
}

thread_data_converter = AIAgentConverter(ai_project)

# Execute test queries and prepare evaluation input
print("="*70)
print("📝 Executing Test Queries\n")

with open(eval_queries_path, "r", encoding="utf-8") as f:
    test_data = json.load(f)

print(f"   Total queries: {len(test_data)}\n")

with open(eval_input_path, "w", encoding="utf-8") as f:
    for idx, row in enumerate(test_data, 1):
        query_text = row.get("query")
        print(f"   [{idx}/{len(test_data)}] {query_text[:50]}...")
        
        # Create new thread (isolate each query)
        thread = ai_project.agents.threads.create()
        
        # Create user query
        ai_project.agents.messages.create(
            thread.id, role=MessageRole.USER, content=query_text
        )
        
        # Run Agent and measure performance
        start_time = time.time()
        run = ai_project.agents.runs.create_and_process(
            thread_id=thread.id, agent_id=eval_agent.id
        )
        end_time = time.time()
        
        if run.status != RunStatus.COMPLETED:
            print(f"      ⚠️  Run failed: {run.last_error or 'Unknown error'}")
            continue
        
        # Collect operational metrics
        operational_metrics = {
            "server-run-duration-in-seconds": (
                run.completed_at - run.created_at
            ).total_seconds(),
            "client-run-duration-in-seconds": end_time - start_time,
            "completion-tokens": run.usage.completion_tokens,
            "prompt-tokens": run.usage.prompt_tokens,
            "ground-truth": row.get("ground-truth", '')
        }
        
        # Add thread data + operational metrics to evaluation input
        evaluation_data = thread_data_converter.prepare_evaluation_data(thread_ids=thread.id)
        eval_item = evaluation_data[0]
        eval_item["metrics"] = operational_metrics
        f.write(json.dumps(eval_item, ensure_ascii=False) + "\n")
        
        print(f"      ✅ Completed in {operational_metrics['client-run-duration-in-seconds']:.1f}s")
        print(f"         Tokens: {operational_metrics['prompt-tokens']} prompt + {operational_metrics['completion-tokens']} completion\n")

print("="*70)
print("\n✅ Query execution finished (failed runs, if any, are reported above and skipped)")
print(f"   Input saved to: {eval_input_path}\n")

# Run Evaluation
print("="*70)
print("🔬 Running Evaluators\n")

print("   Evaluators:")
print("      • ToolCallAccuracyEvaluator")
print("      • IntentResolutionEvaluator")
print("      • TaskAdherenceEvaluator")
print("\n   ⚠️  Note: Some RAI evaluators are not supported in eastus region.")
print("      (CodeVulnerability, ContentSafety, IndirectAttack are excluded)\n")
print("   ⏳ This may take 1-2 minutes...\n")

# Define OperationalMetricsEvaluator
class OperationalMetricsEvaluator:
    """Propagate operational metrics to the final evaluation results"""
    def __init__(self):
        pass
    def __call__(self, *, metrics: dict, **kwargs):
        return metrics

# Run Evaluation (locally)
print("   💡 Saving evaluation results locally.\n")

results = evaluate(
    evaluation_name="foundry-agent-evaluation",
    data=eval_input_path,
    evaluators={
        "operational_metrics": OperationalMetricsEvaluator(),
        "tool_call_accuracy": ToolCallAccuracyEvaluator(model_config=model_config),
        "intent_resolution": IntentResolutionEvaluator(model_config=model_config),
        "task_adherence": TaskAdherenceEvaluator(model_config=model_config),
        # Evaluators not supported in eastus region are excluded
        # "code_vulnerability": CodeVulnerabilityEvaluator(credential=credential, azure_ai_project=project_endpoint),
        # "content_safety": ContentSafetyEvaluator(credential=credential, azure_ai_project=project_endpoint),
        # "indirect_attack": IndirectAttackEvaluator(credential=credential, azure_ai_project=project_endpoint)
    },
    output_path=eval_output_path,
    # azure_ai_project is intentionally omitted, so results are only saved locally
)

print("✅ Evaluation completed!\n")
print(f"📁 Results saved to: {eval_output_path}")
print(f"   Check results in the next cell.\n")
print("="*70)

# Delete Evaluation Agent
print("🧹 Cleaning up...")
ai_project.agents.delete_agent(eval_agent.id)
print(f"✅ Evaluation Agent deleted: {eval_agent.id}\n")

print("="*70)

## 6. Visualize Evaluation Results

Visualize and analyze results with tables and charts.

### How to Check Results

**Method 1: Check in Notebook**
- Run the cell below to view results directly in the notebook.

**Method 2: Check in Terminal**
- Run the following command in terminal to view the same results:
  ```bash
  python3 show_eval_results.py
  ```
- This script is automatically created in the project root.
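
For a quick look before running the full report, the output file can also be read directly. The sketch below assumes the file has top-level `metrics` and `rows` keys, which is how the cell below reads it; `load_average_metrics` is a hypothetical helper name.

```python
import json
from pathlib import Path

def load_average_metrics(path):
    """Return the aggregate 'metrics' mapping and the row count from an output file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data.get("metrics", {}), len(data.get("rows", []))

output_path = Path("evals/eval-output.json")
if output_path.exists():
    metrics, row_count = load_average_metrics(output_path)
    print(f"{row_count} rows evaluated")
    for name, value in sorted(metrics.items()):
        print(f"  {name}: {value}")
else:
    print(f"{output_path} not found; run the evaluation cell first.")
```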

In [None]:
# Detailed visualization of Evaluation results
import json
from pathlib import Path

SEPARATOR = "─" * 100
LINE = "=" * 100

def get_score_color(score, threshold=3.0):
    if score >= 4.5:
        return "\033[92m"
    elif score >= threshold:
        return "\033[93m"
    else:
        return "\033[91m"

def reset_color():
    return "\033[0m"

def get_score_indicator(score, threshold=3.0):
    if score >= 4.5:
        return "✅"
    elif score >= threshold:
        return "⚠️"
    else:
        return "❌"

def extract_query_text(query_input):
    if isinstance(query_input, list):
        for item in query_input:
            if isinstance(item, dict) and item.get("role") == "user":
                content = item.get("content", [])
                if isinstance(content, list) and len(content) > 0:
                    return content[0].get("text", "")
    return ""

def extract_response_text(response):
    if isinstance(response, list):
        for item in response:
            if isinstance(item, dict) and item.get("role") == "assistant":
                content = item.get("content", [])
                if isinstance(content, list) and len(content) > 0:
                    return content[0].get("text", "")
    return ""

eval_output_path = Path("evals/eval-output.json")

print(LINE)
print("📊 AGENT EVALUATION RESULTS - Detailed Analysis Report")
print(LINE, "\n")

if not eval_output_path.exists():
    print("❌ Evaluation results file not found.")
    print(f"   File path: {eval_output_path.absolute()}")
    print("\n   Please run cell 5 of 07_evaluate_agents.ipynb first.\n")
else:
    with open(eval_output_path, "r", encoding="utf-8") as f:
        data = json.load(f)
    
    metrics = data.get("metrics", {})
    rows = data.get("rows", [])
    
    # Section 1: Overall Average Scores
    print("⭐ Overall Average Performance Scores")
    print(LINE)
    scores_config = [
        ("Intent Resolution", "intent_resolution.intent_resolution", "Intent Understanding", 3.0),
        ("Task Adherence", "task_adherence.task_adherence", "Task Fidelity", 3.0),
    ]
    
    for name, key, desc, threshold in scores_config:
        if key in metrics:
            score = metrics[key]
            color = get_score_color(score, threshold)
            reset = reset_color()
            indicator = get_score_indicator(score, threshold)
            stars = "★" * int(score) + "☆" * (5 - int(score))
            bar = "█" * int(score * 4) + "░" * (20 - int(score * 4))
            
            print(f"{indicator} {name:20} {color}{score:.2f}/5.0{reset}  {stars}")
            print(f"     {desc:20} [{bar}]")
            if score < threshold:
                print(f"     {color}⚠️ Below threshold (baseline: {threshold:.1f}){reset}")
            print()
    
    # Section 2: Operational Metrics
    print("\n⚡ Operational Metrics (Average)")
    print(LINE)
    
    operational_keys = [
        ("operational_metrics.server-run-duration-in-seconds", "Server Run Duration", "s"),
        ("operational_metrics.client-run-duration-in-seconds", "Client Run Duration", "s"),
        ("operational_metrics.prompt-tokens", "Prompt Tokens", "tokens"),
        ("operational_metrics.completion-tokens", "Completion Tokens", "tokens"),
    ]
    
    total_tokens = 0
    for key, desc, unit in operational_keys:
        if key in metrics:
            value = metrics[key]
            if unit == "tokens":
                print(f"  {desc:30} {int(value):>10,} {unit}")
                total_tokens += value
            else:
                print(f"  {desc:30} {value:>10.2f} {unit}")
    
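    if total_tokens > 0:
        # This section reports averages, so the sum is the average total per query.
        print(f"  {'Avg Total Tokens per Query':30} {int(total_tokens):>10,} tokens")
        # Rough estimate: one flat per-1K-token rate applied to prompt and completion tokens alike.
        cost = (total_tokens / 1000) * 0.0025
        print(f"  {'Est. Cost per Query (GPT-4o)':30} ${cost:>9.4f}")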
    print()
    
    # Section 3: Individual Query Detailed Results
    if rows:
        print("\n📋 Detailed Results by Query")
        print(LINE)
        
        for idx, row in enumerate(rows, 1):
            query = extract_query_text(row.get("inputs.query", []))
            response = extract_response_text(row.get("inputs.response", []))
            ground_truth = row.get("inputs.metrics.ground-truth", "")
            
            print(f"\n{SEPARATOR}")
            print(f"🔍 Query #{idx}")
            print(SEPARATOR)
            
            print("\n💬 User Question:")
            print(f"   {query}")
            
            if ground_truth:
                print("\n📌 Expected Behavior (Ground Truth):")
                print(f"   {ground_truth}")
            
            if response:
                print("\n🤖 Agent Response (Summary):")
                response_preview = response[:200] if len(response) > 200 else response
                lines_shown = 0
                for line in response_preview.split("\n"):
                    if line.strip() and lines_shown < 3:
                        print(f"   {line.strip()}")
                        lines_shown += 1
                if len(response) > 200:
                    print(f"   ... (total {len(response):,} characters)")
            
            print("\n📊 Evaluation Scores:")
            
            intent = row.get("outputs.intent_resolution.intent_resolution", "N/A")
            task = row.get("outputs.task_adherence.task_adherence", "N/A")
            tool = row.get("outputs.tool_call_accuracy.tool_call_accuracy", "N/A")
            
            intent_threshold = row.get("outputs.intent_resolution.intent_resolution_threshold", 3)
            task_threshold = row.get("outputs.task_adherence.task_adherence_threshold", 3)
            
            if isinstance(intent, (int, float)):
                color = get_score_color(intent, intent_threshold)
                reset = reset_color()
                indicator = get_score_indicator(intent, intent_threshold)
                print(f"   {indicator} Intent Resolution:  {color}{intent:.1f}/5.0{reset} (threshold: {intent_threshold})")
            else:
                print(f"   • Intent Resolution:  {intent}")
            
            if isinstance(task, (int, float)):
                color = get_score_color(task, task_threshold)
                reset = reset_color()
                indicator = get_score_indicator(task, task_threshold)
                print(f"   {indicator} Task Adherence:     {color}{task:.1f}/5.0{reset} (threshold: {task_threshold})")
            else:
                print(f"   • Task Adherence:     {task}")
            
            print(f"   • Tool Call Accuracy: {tool}")
            
            # Evaluation reasoning
            print("\n💡 Evaluation Details:")
            
            intent_reason = row.get("outputs.intent_resolution.intent_resolution_reason", "")
            task_reason = row.get("outputs.task_adherence.task_adherence_reason", "")
            tool_reason = row.get("outputs.tool_call_accuracy.tool_call_accuracy_reason", "")
            
            # Print each evaluator's reasoning as a bullet list, one sentence per bullet
            for label, reason in [
                ("Intent Resolution", intent_reason),
                ("Task Adherence", task_reason),
                ("Tool Call Accuracy", tool_reason),
            ]:
                if reason:
                    print(f"\n   [{label} Reasoning]")
                    for sentence in reason.split(". "):
                        # Drop any trailing period so the last sentence does not end in ".."
                        sentence = sentence.strip().rstrip(".")
                        if sentence:
                            print(f"   • {sentence}.")
            
            duration = row.get("outputs.operational_metrics.client-run-duration-in-seconds", 0)
            prompt_tokens = row.get("outputs.operational_metrics.prompt-tokens", 0)
            completion_tokens = row.get("outputs.operational_metrics.completion-tokens", 0)
            
            print("\n⏱️  Performance Metrics:")
            print(f"   • Execution Time: {duration:.2f}s")
            print(f"   • Token Usage: {prompt_tokens:,} (input) + {completion_tokens:,} (output) = {prompt_tokens + completion_tokens:,} (total)")
            
            issues = []
            if isinstance(intent, (int, float)) and intent < intent_threshold:
                issues.append(f"Low Intent Resolution score ({intent:.1f} < {intent_threshold})")
            if isinstance(task, (int, float)) and task < task_threshold:
                issues.append(f"Low Task Adherence score ({task:.1f} < {task_threshold})")
            
            if issues:
                print(f"\n{get_score_color(1.0, 3.0)}⚠️  Issues Found:{reset_color()}")
                for issue in issues:
                    print(f"   • {issue}")
        
        # Section 4: Statistical Summary
        print(f"\n{SEPARATOR}\n")
        print("\n📈 Statistical Summary and Analysis")
        print(LINE)
        
        intent_scores = []
        task_scores = []
        durations = []
        total_tokens_list = []
        failed_queries = []
        
        for idx, row in enumerate(rows, 1):
            intent = row.get("outputs.intent_resolution.intent_resolution")
            task = row.get("outputs.task_adherence.task_adherence")
            duration = row.get("outputs.operational_metrics.client-run-duration-in-seconds", 0)
            prompt = row.get("outputs.operational_metrics.prompt-tokens", 0)
            completion = row.get("outputs.operational_metrics.completion-tokens", 0)
            
            intent_threshold = row.get("outputs.intent_resolution.intent_resolution_threshold", 3)
            task_threshold = row.get("outputs.task_adherence.task_adherence_threshold", 3)
            
            if isinstance(intent, (int, float)):
                intent_scores.append(intent)
                if intent < intent_threshold:
                    query = extract_query_text(row.get("inputs.query", []))
                    failed_queries.append((idx, "Intent Resolution", intent, query[:50]))
            
            if isinstance(task, (int, float)):
                task_scores.append(task)
                if task < task_threshold:
                    query = extract_query_text(row.get("inputs.query", []))
                    failed_queries.append((idx, "Task Adherence", task, query[:50]))
            
            if duration:
                durations.append(duration)
            total_tokens_list.append(prompt + completion)
        
        if intent_scores:
            avg_intent = sum(intent_scores) / len(intent_scores)
            color = get_score_color(avg_intent, 3.0)
            reset = reset_color()
            pass_count = len([s for s in intent_scores if s >= 3.0])
            
            print("\n📊 Intent Resolution")
            print(f"   Average: {color}{avg_intent:.2f}/5.0{reset}")
            print(f"   Max: {max(intent_scores):.1f}  |  Min: {min(intent_scores):.1f}")
            print(f"   Pass Rate: {pass_count}/{len(intent_scores)} ({pass_count/len(intent_scores)*100:.1f}%)")
        
        if task_scores:
            avg_task = sum(task_scores) / len(task_scores)
            color = get_score_color(avg_task, 3.0)
            reset = reset_color()
            pass_count = len([s for s in task_scores if s >= 3.0])
            
            print("\n📊 Task Adherence")
            print(f"   Average: {color}{avg_task:.2f}/5.0{reset}")
            print(f"   Max: {max(task_scores):.1f}  |  Min: {min(task_scores):.1f}")
            print(f"   Pass Rate: {pass_count}/{len(task_scores)} ({pass_count/len(task_scores)*100:.1f}%)")
        
        if durations:
            print("\n⏱️  Execution Time")
            print(f"   Average: {sum(durations)/len(durations):.2f}s")
            print(f"   Max: {max(durations):.2f}s  |  Min: {min(durations):.2f}s")
        
        if total_tokens_list:
            avg_tokens = sum(total_tokens_list) / len(total_tokens_list)
            total_all_tokens = sum(total_tokens_list)
            
            print("\n💰 Token Usage")
            print(f"   Average: {avg_tokens:,.0f} tokens/query")
            print(f"   Total: {total_all_tokens:,} tokens")
            print(f"   Max: {max(total_tokens_list):,}  |  Min: {min(total_tokens_list):,}")
            print(f"   Estimated Cost (GPT-4o): ${(total_all_tokens / 1000) * 0.0025:.4f}")
        
        if failed_queries:
            print(f"\n{get_score_color(1.0, 3.0)}⚠️  Queries Needing Improvement ({len(failed_queries)} items){reset_color()}")
            print(SEPARATOR)
            
            seen = set()
            for idx, metric, score, query in failed_queries:
                key = (idx, metric)
                if key not in seen:
                    seen.add(key)
                    print(f"   Query #{idx}: {metric} = {score:.1f}")
                    print(f"   └─ {query}...")
                    print()
        else:
            print("\n✅ All queries passed the threshold!")
    
    print(f"\n{LINE}")
    print(f"✅ Evaluated {len(rows)} queries in total")
    print(f"📁 Detailed JSON: {eval_output_path.absolute()}")
    print(f"💡 Can also run in terminal: python3 show_eval_results.py")
    print(f"{LINE}\n")

## 7. Evaluation Metrics Interpretation Guide

### Metric Interpretation

**Operational Metrics:**
- `server-run-duration-in-seconds`: Agent execution time on server
- `client-run-duration-in-seconds`: Total elapsed time measured by client
- `prompt-tokens`: Number of input tokens
- `completion-tokens`: Number of generated tokens
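
Prompt and completion tokens are often billed at different rates, so a cost estimate is more faithful when the two counts are priced separately. The sketch below is one way to do that; `estimate_cost` is a hypothetical helper and the rates shown are placeholders for illustration, not actual pricing.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_1k, completion_rate_per_1k):
    """Estimate cost in the currency of the given per-1K-token rates."""
    prompt_cost = (prompt_tokens / 1000) * prompt_rate_per_1k
    completion_cost = (completion_tokens / 1000) * completion_rate_per_1k
    return prompt_cost + completion_cost

# Placeholder rates for illustration only; substitute your deployment's pricing.
cost = estimate_cost(1200, 300, prompt_rate_per_1k=0.002, completion_rate_per_1k=0.008)
print(f"Estimated cost: ${cost:.4f}")
```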

**Performance Metrics:**
- `tool_call_accuracy.*`: Tool call accuracy (1-5 points)
- `intent_resolution.*`: Intent understanding accuracy (1-5 points)
- `task_adherence.*`: Task adherence level (1-5 points)

**Score Interpretation:**
- 5 points: Perfect
- 4 points: Good
- 3 points: Average
- 2 points: Needs improvement
- 1 point: Very poor
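
The scale above can be turned into a small lookup for reports. This is a minimal sketch: `describe_score` is a hypothetical helper, and rounding fractional averages down to the nearest whole point is an assumption made here, not something the evaluators prescribe.

```python
SCORE_LABELS = {
    5: "Perfect",
    4: "Good",
    3: "Average",
    2: "Needs improvement",
    1: "Very poor",
}

def describe_score(score):
    """Map a 1-5 score to its label; fractional scores are rounded down (assumption)."""
    if not 1 <= score <= 5:
        raise ValueError(f"score must be between 1 and 5, got {score}")
    return SCORE_LABELS[int(score)]

for value in (5, 3.7, 1):
    print(f"{value}: {describe_score(value)}")
```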

---

## 8. Next Steps and Agent Improvement

### Agent Improvement Methods

1. **Analyze Low Scores**
   - Identify queries with low scores
   - Determine which evaluators had issues

2. **Improve Prompts**
   - Write clearer Agent instructions
   - Add examples
   - Specify constraints

3. **Improve Functionality**
   - Add or improve Tools
   - Enhance RAG knowledge base
   - Adjust Multi-Agent configuration

4. **Re-evaluate**
   - Re-evaluate with same queries
   - Compare score changes
   - Continuous improvement
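
Comparing score changes between two runs can be done on the aggregate metrics of each output file. The sketch below assumes both runs produced a flat mapping of metric names to numbers; `compare_metrics` is a hypothetical helper and the sample values are made up for illustration.

```python
def compare_metrics(before, after):
    """Return {metric: (before, after, delta)} for numeric metrics present in both runs."""
    changes = {}
    for name in sorted(before.keys() & after.keys()):
        old, new = before[name], after[name]
        if isinstance(old, (int, float)) and isinstance(new, (int, float)):
            changes[name] = (old, new, new - old)
    return changes

# Illustrative values only, not real evaluation results.
baseline = {"intent_resolution.intent_resolution": 3.5, "task_adherence.task_adherence": 4.0}
rerun = {"intent_resolution.intent_resolution": 4.0, "task_adherence.task_adherence": 3.5}

for name, (old, new, delta) in compare_metrics(baseline, rerun).items():
    print(f"{name}: {old:.2f} -> {new:.2f} ({delta:+.2f})")
```

A positive delta means the re-run scored higher on that metric; keeping the query set unchanged between runs makes the comparison meaningful.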

### References

- [Azure AI Foundry Agent Evaluation](https://learn.microsoft.com/azure/ai-foundry/how-to/develop/agent-evaluate-sdk)
- [Built-in Evaluators](https://learn.microsoft.com/azure/ai-foundry/how-to/develop/evaluate-sdk)
- [Evaluation Best Practices](https://learn.microsoft.com/azure/ai-foundry/concepts/evaluation-approach-gen-ai)

---

## Complete!

Congratulations! You have successfully completed Agent Evaluation. 🎉