<a href="https://colab.research.google.com/github/pedadarohan-dot/AI-Agents/blob/main/Agent_Evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **What is Agent Evaluation?**

It is the systematic process of testing and measuring how well an AI agent performs across different scenarios and quality dimensions.


**The Problem:** `Standard testing ≠ Evaluation`

Agents are different from traditional software:
- They are non-deterministic
- Users give unpredictable, ambiguous commands
- Small prompt changes cause dramatic behavior shifts and different tool calls

Because of these differences, agents need systematic evaluation, not just "happy path" testing. **That means assessing the agent's entire decision-making process: both the final response and the path it took to get there (the trajectory).**
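
To make "final response plus trajectory" concrete, here is a minimal sketch that scores one hypothetical run on both axes. The `ToolCall` type and the `call` and `evaluate_run` helpers are illustrative names invented for this example; they are not part of ADK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step in a trajectory: a tool name plus its arguments."""
    name: str
    args: tuple  # sorted (key, value) pairs, so two calls compare by value


def call(name: str, **kwargs) -> ToolCall:
    """Build a ToolCall from keyword arguments."""
    return ToolCall(name, tuple(sorted(kwargs.items())))


def evaluate_run(expected_calls, actual_calls, expected_text, actual_text) -> dict:
    """Check the trajectory (exact match) and the final response (case-insensitive)."""
    return {
        "trajectory_ok": list(expected_calls) == list(actual_calls),
        "response_ok": expected_text.strip().lower() == actual_text.strip().lower(),
    }


expected = [call("set_device_status", location="kitchen", device_id="main light", status="ON")]
# The agent says the right thing but sets the wrong status.
actual = [call("set_device_status", location="kitchen", device_id="main light", status="OFF")]

report = evaluate_run(expected, actual, "Done.", "done.")
print(report)  # {'trajectory_ok': False, 'response_ok': True}
```

A response-only check would pass this run; only the trajectory check reveals that the wrong action was taken.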

In [1]:
pip install google-adk



In [2]:
import os
from google.colab import userdata

try:
    GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
    os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
    print("✅ Setup and authentication complete.")
except Exception as e:
    print(
        f"🔑 Authentication Error: Please make sure you have added 'GOOGLE_API_KEY' to your Colab secrets. Details: {e}"
    )

✅ Setup and authentication complete.


In [3]:
!adk create home_automation_agent --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY

Agent created in /content/home_automation_agent:
- .env
- __init__.py
- agent.py


In [4]:
%%writefile home_automation_agent/agent.py

from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini

from google.genai import types

# Configure Model Retry on errors
retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum retry attempts
    exp_base=7,  # Delay multiplier
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)

def set_device_status(location: str, device_id: str, status: str) -> dict:
    """Sets the status of a smart home device.

    Args:
        location: The room where the device is located.
        device_id: The unique identifier for the device.
        status: The desired status, either 'ON' or 'OFF'.

    Returns:
        A dictionary confirming the action.
    """
    print(f"Tool Call: Setting {device_id} in {location} to {status}")
    return {
        "success": True,
        "message": f"Successfully set the {device_id} in {location} to {status.lower()}."
    }

# This agent has DELIBERATE FLAWS that we'll discover through evaluation!
root_agent = LlmAgent(
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    name="home_automation_agent",
    description="An agent to control smart devices in a home.",
    instruction="""You are a home automation assistant. You control ALL smart devices in the house.

    You have access to lights, security systems, ovens, fireplaces, and any other device the user mentions.
    Always try to be helpful and control whatever device the user asks for.

    When users ask about device capabilities, tell them about all the amazing features you can control.""",
    tools=[set_device_status],
)

Overwriting home_automation_agent/agent.py
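
The retry settings in `agent.py` (`attempts=5`, `initial_delay=1`, `exp_base=7`) imply rapidly growing waits between retries. The sketch below computes the nominal schedule under two assumptions: `attempts` counts the first request, and each wait is `initial_delay * exp_base**n`. It ignores any jitter or maximum-delay cap the client library may apply, so treat the numbers as an upper-level intuition rather than the exact timing. `nominal_backoff_delays` is a hypothetical helper name.

```python
def nominal_backoff_delays(attempts: int, initial_delay: float, exp_base: float) -> list:
    """Return the nominal wait (in seconds) before each retry.

    With N attempts there are N - 1 waits, one before each retry.
    """
    return [initial_delay * exp_base**n for n in range(attempts - 1)]


delays = nominal_backoff_delays(attempts=5, initial_delay=1, exp_base=7)
print(delays)       # [1, 7, 49, 343]
print(sum(delays))  # 400
```

A base of 7 grows quickly, so a smaller base may suit interactive use better if long waits are a concern.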


In [6]:
from IPython.display import display, HTML
from jupyter_server.serverapp import list_running_servers


# Gets the proxied URL in the Kaggle Notebooks environment
def get_adk_proxy_url():
    PROXY_HOST = "https://kkb-production.jupyter-proxy.kaggle.net"
    ADK_PORT = "8000"

    servers = list(list_running_servers())
    if not servers:
        raise Exception("No running Jupyter servers found.")

    baseURL = servers[0]["base_url"]

    try:
        path_parts = baseURL.split("/")
        kernel = path_parts[2]
        token = path_parts[3]
    except IndexError:
        raise Exception(f"Could not parse kernel/token from base URL: {baseURL}")

    url_prefix = f"/k/{kernel}/{token}/proxy/proxy/{ADK_PORT}"
    url = f"{PROXY_HOST}{url_prefix}"

    styled_html = f"""
    <div style="padding: 15px; border: 2px solid #f0ad4e; border-radius: 8px; background-color: #fef9f0; margin: 20px 0;">
        <div style="font-family: sans-serif; margin-bottom: 12px; color: #333; font-size: 1.1em;">
            <strong>⚠️ IMPORTANT: Action Required</strong>
        </div>
        <div style="font-family: sans-serif; margin-bottom: 15px; color: #333; line-height: 1.5;">
            The ADK web UI is <strong>not running yet</strong>. You must start it in the next cell.
            <ol style="margin-top: 10px; padding-left: 20px;">
                <li style="margin-bottom: 5px;"><strong>Run the next cell</strong> (the one with <code>!adk web ...</code>) to start the ADK web UI.</li>
                <li style="margin-bottom: 5px;">Wait for that cell to show it is "Running" (it will not "complete").</li>
                <li>Once it's running, <strong>return to this button</strong> and click it to open the UI.</li>
            </ol>
            <em style="font-size: 0.9em; color: #555;">(If you click the button before running the next cell, you will get a 500 error.)</em>
        </div>
        <a href='{url}' target='_blank' style="
            display: inline-block; background-color: #1a73e8; color: white; padding: 10px 20px;
            text-decoration: none; border-radius: 25px; font-family: sans-serif; font-weight: 500;
            box-shadow: 0 2px 5px rgba(0,0,0,0.2); transition: all 0.2s ease;">
            Open ADK Web UI (after running cell below) ↗
        </a>
    </div>
    """

    display(HTML(styled_html))

    return url_prefix


print("✅ Helper functions defined.")

✅ Helper functions defined.


In [7]:
url_prefix = get_adk_proxy_url()

Exception: Could not parse kernel/token from base URL: /

In [8]:
!adk web --url_prefix {url_prefix}

2026-01-05 11:09:31,581 - INFO - service_factory.py:94 - Using in-memory memory service
2026-01-05 11:09:31,582 - INFO - local_storage.py:81 - Using per-agent session storage rooted at /content
2026-01-05 11:09:31,582 - INFO - local_storage.py:107 - Using file artifact service at /content/.adk/artifacts
  credential_service = InMemoryCredentialService()
  super().__init__()
INFO:     Started server process [3009]
INFO:     Waiting for application startup.

+-----------------------------------------------------------------------------+
| ADK Web Server started                                                      |
|                                                                             |
| For local testing, access at http://127.0.0.1:8000.                         |
+-----------------------------------------------------------------------------+

INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.

---
## 📈 Section 4: Systematic Evaluation

Regression testing is the practice of re-running existing tests to ensure that new changes haven't broken previously working functionality.

ADK provides two ways to run automated regression and batch tests: [pytest](https://google.github.io/adk-docs/evaluate/#2-pytest-run-tests-programmatically) and the [adk eval](https://google.github.io/adk-docs/evaluate/#3-adk-eval-run-evaluations-via-the-cli) CLI command. This section uses the CLI command. For the `pytest` approach, see the links in the Resources section at the end of this notebook.

The following image shows the overall evaluation process. **At a high level, evaluation has four steps:**

1) **Create an evaluation configuration** - define metrics or what you want to measure
2) **Create test cases** - sample test cases to compare against
3) **Run the agent with test query**
4) **Compare the results**

![Evaluate](https://storage.googleapis.com/github-repo/kaggle-5days-ai/day4/evaluate_agent.png)
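
The four steps can be sketched in plain Python before involving ADK at all. Everything here is a stand-in: `stub_agent` is a hypothetical agent that always calls the same tool, and `tool_match` is a simplified all-or-nothing metric, not ADK's implementation.

```python
# Step 1: evaluation configuration, mapping metric name to minimum passing score.
criteria = {"tool_match": 1.0}

# Step 2: test cases, each pairing a query with the tool calls we expect.
cases = [
    {"query": "turn on the hall light", "expected_tools": ["set_device_status"]},
    {"query": "what is the weather?", "expected_tools": []},
]


def stub_agent(query: str) -> list:
    """Hypothetical agent that calls set_device_status for every query (a deliberate flaw)."""
    return ["set_device_status"]


def tool_match(expected: list, actual: list) -> float:
    """Return 1.0 only when the tool names match exactly and in order."""
    return 1.0 if expected == actual else 0.0


# Steps 3 and 4: run each query, then compare the result with the expectation.
results = {}
for case in cases:
    actual_tools = stub_agent(case["query"])
    score = tool_match(case["expected_tools"], actual_tools)
    results[case["query"]] = score >= criteria["tool_match"]

print(results)
# {'turn on the hall light': True, 'what is the weather?': False}
```

The second case fails because the stub calls a device tool for a question that needs no tool at all, which is the kind of over-eager behavior a trajectory check is meant to expose.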

In [9]:
import json

# Create evaluation configuration with basic criteria
eval_config = {
    "criteria": {
        "tool_trajectory_avg_score": 1.0,  # Perfect tool usage required
        "response_match_score": 0.8,  # 80% text similarity threshold
    }
}

with open("home_automation_agent/test_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)

print("✅ Evaluation configuration created!")
print("\n📊 Evaluation Criteria:")
print("• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match")
print("• response_match_score: 0.8 - Requires 80% text similarity")
print("\n🎯 What this evaluation will catch:")
print("✅ Incorrect tool usage (wrong device, location, or status)")
print("✅ Poor response quality and communication")
print("✅ Deviations from expected behavior patterns")

✅ Evaluation configuration created!

📊 Evaluation Criteria:
• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match
• response_match_score: 0.8 - Requires 80% text similarity

🎯 What this evaluation will catch:
✅ Incorrect tool usage (wrong device, location, or status)
✅ Poor response quality and communication
✅ Deviations from expected behavior patterns
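
To build intuition for why a 0.8 text-similarity threshold is strict, here is a simplified word-overlap score. It assumes the similarity behaves like a ROUGE-1 style unigram F1, which is how the ADK criteria documentation describes `response_match_score` as far as we can tell; the tokenization below (lowercase, split on whitespace) is a simplification, so real scores will differ. `unigram_f1` is a hypothetical helper.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 style F1 over whitespace-separated, lowercased tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "Successfully set the floor lamp in the living room to on."
candidate = "The floor lamp in the living room is now on."

score = unigram_f1(reference, candidate)
print(f"{score:.2f}", "PASS" if score >= 0.8 else "FAIL")  # 0.76 FAIL
```

Both sentences convey the same outcome, yet the paraphrase falls below 0.8. Word-overlap metrics reward matching phrasing, not matching meaning.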


In [10]:
# Create evaluation test cases that reveal tool usage and response quality problems
test_cases = {
    "eval_set_id": "home_automation_integration_suite",
    "eval_cases": [
        {
            "eval_id": "living_room_light_on",
            "conversation": [
                {
                    "user_content": {
                        "parts": [
                            {"text": "Please turn on the floor lamp in the living room"}
                        ]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the floor lamp in the living room to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "living room",
                                    "device_id": "floor lamp",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
        {
            "eval_id": "kitchen_on_off_sequence",
            "conversation": [
                {
                    "user_content": {
                        "parts": [{"text": "Switch on the main light in the kitchen."}]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the main light in the kitchen to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "kitchen",
                                    "device_id": "main light",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
    ],
}
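
Hand-written eval sets are easy to get subtly wrong, so a quick structural check before writing the file can save a confusing evaluation run. The sketch below checks only the keys used by the cases in this notebook plus `eval_id` uniqueness. It is not an official schema validator, and `validate_eval_set` is a hypothetical helper.

```python
def validate_eval_set(eval_set: dict) -> list:
    """Return human-readable problems; an empty list means the structure looks sane."""
    problems = []
    seen_ids = set()
    for case in eval_set.get("eval_cases", []):
        eval_id = case.get("eval_id", "<missing eval_id>")
        if eval_id in seen_ids:
            problems.append(f"{eval_id}: duplicate eval_id")
        seen_ids.add(eval_id)
        for turn_index, turn in enumerate(case.get("conversation", [])):
            # Keys that every turn in this notebook's test cases provides.
            for key in ("user_content", "final_response", "intermediate_data"):
                if key not in turn:
                    problems.append(f"{eval_id}: turn {turn_index} is missing '{key}'")
    return problems


sample = {
    "eval_set_id": "demo",
    "eval_cases": [
        {"eval_id": "a", "conversation": [{"user_content": {}, "final_response": {}}]},
        {"eval_id": "a", "conversation": []},
    ],
}
print(validate_eval_set(sample))
# ["a: turn 0 is missing 'intermediate_data'", 'a: duplicate eval_id']
```

Calling `validate_eval_set(test_cases)` on the dictionary defined above should return an empty list.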

In [11]:
import json

with open("home_automation_agent/integration.evalset.json", "w") as f:
    json.dump(test_cases, f, indent=2)

print("✅ Evaluation test cases created")
print("\n🧪 Test scenarios:")
for case in test_cases["eval_cases"]:
    user_msg = case["conversation"][0]["user_content"]["parts"][0]["text"]
    print(f"• {case['eval_id']}: {user_msg}")

print("\n📊 Expected results:")
print("• Both cases should pass both criteria if the agent behaves as expected")
print(
    "• tool_trajectory_avg_score: May fail if the agent uses wrong parameters"
)
print(
    "• response_match_score: May fail if the response differs too much"
)

✅ Evaluation test cases created

🧪 Test scenarios:
• living_room_light_on: Please turn on the floor lamp in the living room
• kitchen_on_off_sequence: Switch on the main light in the kitchen.

📊 Expected results:
• Both cases should pass both criteria if the agent behaves as expected
• tool_trajectory_avg_score: May fail if the agent uses wrong parameters
• response_match_score: May fail if the response differs too much


In [13]:
pip install google-adk[eval]

Collecting pandas>=2.2.3 (from google-adk[eval])
  Downloading pandas-2.3.3-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (91 kB)
Collecting rouge-score>=0.1.2 (from google-adk[eval])
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Collecting ruamel.yaml (from google-cloud-aiplatform[evaluation]>=1.100.0; extra == "eval"->google-adk[eval])
  Downloading ruamel_yaml-0.19.1-py3-none-any.whl.metadata (16 kB)
Collecting litellm!=1.77.2,!=1.77.3,!=1.77.4,>=1.72.4 (from google-cloud-aiplatform[evaluation]>=1.100.0; extra == "eval"->google-adk[eval])
  Downloading litellm-1.80.11-py3-none-any.whl.metadata (29 kB)
Collecting fastuuid>=0.13.0 (from litellm!=1.77.2,!=1.77.3,!=1.77.4,>=1.72.4->google-cloud-aiplatform[eva

In [14]:
print("🚀 Run this command to execute evaluation:")
!adk eval home_automation_agent home_automation_agent/integration.evalset.json --config_file_path=home_automation_agent/test_config.json --print_detailed_results

🚀 Run this command to execute evaluation:
2026-01-05 11:35:00,627 - INFO - utils.py:164 - NumExpr defaulting to 2 threads.
  metric_evaluator_registry = MetricEvaluatorRegistry()
  user_simulator_provider: UserSimulatorProvider = UserSimulatorProvider(),
Using evaluation criteria: criteria={'tool_trajectory_avg_score': 1.0, 'response_match_score': 0.8} user_simulator_config=None
  user_simulator_provider = UserSimulatorProvider(
  eval_service = LocalEvalService(
  return StaticUserSimulator(static_conversation=eval_case.conversation)
  super().__init__(
2026-01-05 11:35:02,336 - INFO - plugin_manager.py:104 - Plugin 'request_intercepter_plugin' registered.
2026-01-05 11:35:02,337 - INFO - plugin_manager.py:104 - Plugin 'ensure_retry_options' registered.
2026-01-05 11:35:02,518 - INFO - google_llm.py:181 - Sending out request, model: gemini-2.5-flash-lite, backend: GoogleLLMVariant.GEMINI_API, stream: False
2026-01-05 11:35:02,522 - INFO - plugin_manager.py:104 - Plugin 'request_int

In [15]:
# Analyzing evaluation results - the data science approach
print("📊 Understanding Evaluation Results:")
print()
print("🔍 EXAMPLE ANALYSIS:")
print()
print("Test Case: living_room_light_on")
print("  ❌ response_match_score: 0.45/0.80")
print("  ✅ tool_trajectory_avg_score: 1.0/1.0")
print()
print("📈 What this tells us:")
print("• TOOL USAGE: Perfect - Agent used correct tool with correct parameters")
print("• RESPONSE QUALITY: Poor - Response text too different from expected")
print("• ROOT CAUSE: Agent's communication style, not functionality")
print()
print("🎯 ACTIONABLE INSIGHTS:")
print("1. Technical capability works (tool usage perfect)")
print("2. Communication needs improvement (response quality failed)")
print("3. Fix: Update agent instructions for clearer language or constrained response.")
print()

📊 Understanding Evaluation Results:

🔍 EXAMPLE ANALYSIS:

Test Case: living_room_light_on
  ❌ response_match_score: 0.45/0.80
  ✅ tool_trajectory_avg_score: 1.0/1.0

📈 What this tells us:
• TOOL USAGE: Perfect - Agent used correct tool with correct parameters
• RESPONSE QUALITY: Poor - Response text too different from expected
• ROOT CAUSE: Agent's communication style, not functionality

🎯 ACTIONABLE INSIGHTS:
1. Technical capability works (tool usage perfect)
2. Communication needs improvement (response quality failed)
3. Fix: Update agent instructions for clearer language or constrained response.
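
The same triage can be automated once scores are available as numbers. This sketch takes the scores from the example analysis above and the thresholds from `test_config.json`, and lists only the metrics that fell short. `failing_metrics` is a hypothetical helper; how you extract the scores from the `adk eval` output is left open here.

```python
def failing_metrics(scores: dict, thresholds: dict) -> dict:
    """Map each metric that is below its threshold to a (score, threshold) pair.

    A metric missing from `scores` is treated as 0.0, so it is reported as failing.
    """
    return {
        name: (scores.get(name, 0.0), minimum)
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    }


thresholds = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}
scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.45}

failures = failing_metrics(scores, thresholds)
for name, (score, minimum) in failures.items():
    print(f"{name}: {score:.2f} < {minimum:.2f}")
# response_match_score: 0.45 < 0.80
```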



---
## 📚 Section 5: User Simulation (Optional)

While **traditional evaluation methods rely on fixed test cases**, real-world conversations are dynamic and unpredictable. This is where User Simulation comes in.

User Simulation is a powerful feature in ADK that addresses the limitations of static evaluation. Instead of using pre-defined, fixed user prompts, User Simulation employs a generative AI model (like Gemini) to **dynamically generate user prompts during the evaluation process.**

### ❓ How it works

* You define a `ConversationScenario` that outlines the user's overall conversational goals and a `conversation_plan` to guide the dialogue.
* A large language model (LLM) then acts as a simulated user, using this plan and the ongoing conversation history to generate realistic and varied prompts.
* This allows for more comprehensive testing of your agent's ability to handle unexpected turns, maintain context, and achieve complex goals in a more natural, unpredictable conversational flow.

User Simulation helps you uncover edge cases and improve your agent's robustness in ways that static test cases often miss.
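
As a starting point, a scenario can be written to a JSON file in the same way as the eval set earlier in this notebook. Treat this as a sketch under assumptions: the field names `scenarios` and `starting_prompt` and the file name `conversation_scenarios.json` are our best understanding of the ADK user simulation documentation, not something confirmed in this notebook, so verify them against the current docs before use. Only `conversation_plan` is named in the text above.

```python
import json

# Assumed structure: a list of scenarios, each with an opening message and a
# natural-language plan that the simulated user follows.
scenarios = {
    "scenarios": [
        {
            "starting_prompt": "Hi, can you help me with the lights?",
            "conversation_plan": (
                "Ask the agent to turn on the floor lamp in the living room. "
                "After it confirms, ask it to turn the same lamp off again."
            ),
        }
    ]
}

# Hypothetical file name; place it wherever your evaluation setup expects it.
with open("conversation_scenarios.json", "w") as f:
    json.dump(scenarios, f, indent=2)

print(f"Wrote {len(scenarios['scenarios'])} scenario(s)")
```

Unlike the fixed test cases in Section 4, the plan describes a goal and leaves the exact wording of each turn to the simulated user.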

### 👉 Exercise

Now that you have seen what User Simulation offers for dynamic agent evaluation, try it on your own agent: define a `ConversationScenario` with a `conversation_plan` for a specific goal, and integrate it into your agent's evaluation.

**⭐ Refer to this [documentation](https://google.github.io/adk-docs/evaluate/user-sim/) to learn how to do it.**

### 📚 Resources
* [ADK Evaluation overview](https://google.github.io/adk-docs/evaluate/)
* Different [evaluation criteria](https://google.github.io/adk-docs/evaluate/criteria/)
* [Pytest based Evaluation](https://google.github.io/adk-docs/evaluate/#2-pytest-run-tests-programmatically)

### Advanced Evaluation
For production deployments, ADK supports [advanced criteria](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/determine-eval) such as `safety_v1` and `hallucinations_v1`, which require Google Cloud credentials.

### 🎯 Next Steps
Ready for the next challenge? Stay tuned for the final Day 5 notebooks, where we'll bring it all home! 😎

We'll learn how to **deploy an agent to production** and extend it with the **Agent2Agent Protocol.**