In [1]:
from dotenv import load_dotenv
import os

load_dotenv()
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")


Agent Observability - Logs, Traces & Metrics

Today, you'll learn:

How to add observability to the agent you've built and
How to evaluate if the agents are working as expected
In this notebook, we'll focus on the first part - Agent Observability!

What is Agent Observability?
🚨 The challenge: Unlike traditional software that fails predictably, AI agents can fail mysteriously. Example:

User: "Find quantum computing papers"
Agent: "I cannot help with that request."
You: 😭 WHY?? Is it the prompt? Missing tools? API error?
💡 The Solution: Agent observability gives you complete visibility into your agent's decision-making process. You'll see exactly what prompts are sent to the LLM, which tools are available, how the model responds, and where failures occur.

DEBUG Log: LLM Request shows "Functions: []" (no tools!)
You: 🎯 Aha! Missing google_search tool - easy fix!
Foundational pillars of Agent Observability
Logs: A log is a record of a single event, telling you what happened at a specific moment.
Traces: A trace connects the logs into a single story, showing you why a final result occurred by revealing the entire sequence of steps.
Metrics: Metrics are the summary numbers (like averages and error rates) that tell you how well the agent is performing overall.
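To make the three pillars concrete, here is a minimal sketch in plain Python. The event fields (`trace_id`, `step`, `ms`, `ok`) are hypothetical and chosen for illustration; they are not ADK's log schema.

```python
from collections import defaultdict

# Logs: one record per event (field names are illustrative assumptions).
logs = [
    {"trace_id": "t1", "step": "call_llm", "ms": 420, "ok": True},
    {"trace_id": "t1", "step": "execute_tool", "ms": 35, "ok": True},
    {"trace_id": "t2", "step": "call_llm", "ms": 610, "ok": False},
]


def build_traces(events):
    """Traces: group log events by trace_id, keeping their order."""
    traces = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event["step"])
    return dict(traces)


def compute_metrics(events):
    """Metrics: summarize all events into aggregate numbers."""
    total = len(events)
    errors = sum(1 for e in events if not e["ok"])
    return {
        "error_rate": errors / total,
        "avg_ms": sum(e["ms"] for e in events) / total,
    }


traces = build_traces(logs)
metrics = compute_metrics(logs)
```

Each log answers "what happened", each trace shows the sequence of steps for one request, and the metrics summarize every request at once.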

1.3: Set up logging and clean up old files
Let's configure logging for our debugging session. The following cell makes sure we also capture other log levels, like DEBUG.

In [2]:
import logging
import os

# Clean up any previous logs
for log_file in ["logger.log", "web.log", "tunnel.log"]:
    if os.path.exists(log_file):
        os.remove(log_file)
        print(f"🧹 Cleaned up {log_file}")

# Configure logging with DEBUG log level.
logging.basicConfig(
    filename="logger.log",
    level=logging.DEBUG,
    format="%(filename)s:%(lineno)s %(levelname)s:%(message)s",
)
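The effect of the level setting can be checked in isolation. The sketch below uses hypothetical logger names (`demo_10`, `demo_20`) and an in-memory stream instead of `logger.log`, so it does not interfere with the configuration above.

```python
import io
import logging


def capture(level):
    """Return what a logger set to `level` emits for one DEBUG and one INFO message."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))

    logger = logging.getLogger(f"demo_{level}")  # hypothetical logger name
    logger.setLevel(level)
    logger.propagate = False  # keep the demo out of the root logger's file
    logger.addHandler(handler)

    logger.debug("LLM request sent")
    logger.info("agent finished")

    logger.removeHandler(handler)
    return stream.getvalue().splitlines()


debug_lines = capture(logging.DEBUG)  # both messages are kept
info_lines = capture(logging.INFO)    # the DEBUG message is dropped
```

With `level=logging.DEBUG`, the most verbose standard level, nothing is filtered out, which is what we want while debugging.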

 Section 2: Hands-On Debugging with ADK Web UI
2.1: Create a "Research Paper Finder" Agent
Our goal: Build a research paper finder agent that helps users find academic papers on any topic.

But first, let's intentionally create an incorrect version of the agent to practice debugging! We'll start by creating a new agent folder using the adk create CLI command.


Agent definition
Next, let's create our root agent.

We'll configure it as an LlmAgent, give it a name, model and instruction.
The root_agent gets the user prompt and delegates the search to the google_search_agent.
Then, the agent uses the count_papers tool to count the number of papers returned.
👉 Pay attention to the root agent's instructions and the count_papers tool's parameter.

In [3]:
!adk create research-agent --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY


Agent created in d:\5Days_Google_AI_course\day_4\research-agent:
- .env
- __init__.py
- agent.py



In [4]:
%%writefile research-agent/agent.py

from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini
from google.adk.tools.agent_tool import AgentTool
from google.adk.tools.google_search_tool import google_search

from google.genai import types
from typing import List

retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum retry attempts
    exp_base=7,  # Delay multiplier
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)

# ---- Intentionally pass incorrect datatype - `str` instead of `List[str]` ----
def count_papers(papers: str):
    """
    This function counts the number of papers in a list of strings.
    Args:
      papers: A list of strings, where each string is a research paper.
    Returns:
      The number of papers in the list.
    """
    return len(papers)


# Google Search agent
google_search_agent = LlmAgent(
    name="google_search_agent",
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    description="Searches for information using Google search",
    instruction="""Use the google_search tool to find information on the given topic. Return the raw search results.
    If the user asks for a list of papers, then give them the list of research papers you found and not the summary.""",
    tools=[google_search]
)


# Root agent
root_agent = LlmAgent(
    name="research_paper_finder_agent",
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    instruction="""Your task is to find research papers and count them. 

    You MUST ALWAYS follow these steps:
    1) Find research papers on the user provided topic using the 'google_search_agent'. 
    2) Then, pass the papers to 'count_papers' tool to count the number of papers returned.
    3) Return both the list of research papers and the total number of papers.
    """,
    tools=[AgentTool(agent=google_search_agent), count_papers]
)

Overwriting research-agent/agent.py


2.2: Run the agent
Let's now run our agent with the adk web --log_level DEBUG CLI command.

📍 The key here is --log_level DEBUG - this shows us:

Full LLM Prompts: The complete request sent to the language model, including system instructions, history, and tools.
Detailed API responses from services.
Internal state transitions and variable values.
Other log levels include: INFO, ERROR and WARNING.

Set the URL prefix used to reach the ADK web UI. In the Kaggle Notebooks environment this is a proxied URL; when running locally, as here, it points to the local backend:

In [6]:
url_prefix = "http://localhost:8000"  # Local backend


Now you can start the ADK web UI with the --log_level parameter.

👉 Note: The following cell will not "complete", but will remain running and serving the ADK web UI until you manually stop the cell.

In [7]:
!adk web --log_level DEBUG --url_prefix {url_prefix}

^C


Once the ADK web UI starts, open the proxy link using the button in the previous cell.

As you start chatting with the agent, you should see the DEBUG logs appear in the output cell below!

‼️ IMPORTANT: DO NOT SHARE THE PROXY LINK with anyone - treat it as sensitive data as it contains your authentication token in the URL.



📝 2.3: Test the agent in ADK web UI
👉 Do: In the ADK web UI
Select "research-agent" from the dropdown in the top-left.
In the chat interface, type: Find latest quantum computing papers
Send the message and observe the response. The agent should return a list of research papers and their count.
It looks like our agent works and we got a response! 🤔 But wait, isn't the count of papers unusually large? Let's look at the logs and trace.

👉 Do: Events tab - Traces in detail
In the web UI, click the "Events" tab on the left sidebar
You'll see a chronological list of all agent actions
Click on any event to expand its details in the bottom panel
Try clicking the "Trace" button to see timing information for each step.
Click the execute_tool count_papers span. You'll see that the function call to count_papers returns the large number as the response.
Let's look at what was passed as input to this function.
Find the call_llm span corresponding to the count_papers function call.
👉 Do: Inspect the Function call in Events:
Click on the specific span to open the Events tab.
Examine the function_call, focusing on the papers argument.
Notice that root_agent passes the list of papers as a str instead of a List[str] - there's our bug!



2.4: Your Turn - fix it! 👾
Update the datatype of the papers argument in the count_papers tool to a List[str] and rerun the adk web command!
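To see why the count was inflated, compare what `len()` returns for a string and for a list. This is a minimal sketch with made-up paper titles; the two function names are hypothetical variants of the notebook's `count_papers`.

```python
from typing import List


def count_papers_buggy(papers: str) -> int:
    """Mirrors the flawed tool: len() of a string counts characters."""
    return len(papers)


def count_papers_fixed(papers: List[str]) -> int:
    """With a list, len() counts items."""
    return len(papers)


# Made-up titles, for illustration only.
titles = ["Quantum Error Correction", "Variational Quantum Algorithms"]
as_text = "\n".join(titles)  # roughly what the agent passed when the type was `str`

buggy_count = count_papers_buggy(as_text)  # number of characters
fixed_count = count_papers_fixed(titles)   # number of papers
```

The function body is identical in both versions; only the declared parameter type changes what the model is asked to pass in.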

‼️ Stop the ADK web UI 🛑
In order to run cells in the remainder of this notebook, please stop the running cell where you started adk web in Section 2.2.

Otherwise that running cell will block / prevent other cells from running as long as the ADK web UI is running.

2.5: Debug through local Logs
Optionally, you can also examine the local DEBUG logs to find the root cause. Run the following cell to print the contents of the log file.

In [8]:
print("🔍 Examining web server logs for debugging clues...\n")
# Read the file in Python; the `cat` shell command is not available on Windows.
with open("logger.log", encoding="utf-8", errors="replace") as log_file:
    print(log_file.read())

🔍 Examining web server logs for debugging clues...

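DEBUG logs get long quickly, so it helps to filter them. The sketch below assumes the `filename:lineno LEVEL:message` format configured in Section 1.3; the sample lines are invented for illustration, and `filter_log` is a hypothetical helper.

```python
import re

# Matches the format configured earlier: "<file>:<line> <LEVEL>:<message>"
LINE_RE = re.compile(r"^(?P<file>\S+):(?P<line>\d+) (?P<level>[A-Z]+):(?P<msg>.*)$")


def filter_log(lines, level="DEBUG", keyword=""):
    """Return messages at `level` whose text contains `keyword`."""
    hits = []
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if match and match["level"] == level and keyword in match["msg"]:
            hits.append(match["msg"])
    return hits


sample = [  # invented lines, for illustration only
    "agent.py:12 DEBUG:LLM request: Functions: []",
    "agent.py:30 INFO:response sent",
]
found = filter_log(sample, keyword="Functions")
```

To run it on the real file, pass an open file handle, for example `filter_log(open("logger.log", encoding="utf-8"), keyword="Functions")`. Continuation lines of multi-line messages do not match the pattern and are skipped.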


Other observability questions you can now answer from logs and adk web:

Efficiency: Is the agent making optimal tool choices?
Reasoning Quality: Are the prompts well-structured and context-appropriate?
Performance: Which steps take the longest? Look at the traces to find out.
Failure Diagnosis: When something goes wrong, where exactly did it fail?
Key Learning: Core debugging pattern: symptom → logs → root cause → fix.

Debugging Victory: You just went from "Agent mysteriously failed" to "I know exactly why and how to fix it!" This is the power of observability!



🧑‍💻 Section 3: Logging in production
🎯 Great! You can now debug agent failures using ADK web UI and DEBUG logs.

But what happens when you move beyond development? Here are two real-world scenarios where the web UI is not enough:

❌ Problem 1: Production Deployment

You: "Let me open the ADK web UI to check why the agent failed"
DevOps: "Um... this is a production server. No web UI access."
You: 😱 "How do I debug production issues?"
❌ Problem 2: Automated Systems

You: "The agent runs 1000 times per day in our pipeline"
Boss: "Which runs are slow? What's our success rate?"
You: 😰 "I'd have to manually check the web UI 1000 times..."
💡 The Solution:

We need a way to capture observability data, in other words, to add logs to our code.

👉 In traditional software development, this is done by adding log statements in Python functions - and agents are no different! We need to add log statements to our agent, and a common approach is to add them in Plugins.
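As a reminder of what "adding log statements in Python functions" looks like in traditional code, here is a minimal sketch of a logging decorator. The logger name `agent_tools` and the `count_items` function are hypothetical, and this is plain Python rather than an ADK feature.

```python
import functools
import logging
import time

logger = logging.getLogger("agent_tools")  # hypothetical logger name


def logged(func):
    """Log each call's arguments, outcome, and duration."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the traceback, then let the error propagate.
            logger.exception("%s failed", func.__name__)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "%s args=%r kwargs=%r took %.1f ms",
            func.__name__, args, kwargs, elapsed_ms,
        )
        return result

    return wrapper


@logged
def count_items(items):
    return len(items)


count = count_items(["a", "b", "c"])
```

Records like these can be written to a file and later aggregated to answer questions such as "which runs are slow?" without opening a UI.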



3.1: How to add logs for production observability?
A Plugin is a custom code module that runs automatically at various stages of your agent's lifecycle. Plugins are composed of "Callbacks", which provide the hooks to intercept an agent's flow. Think of it like this:

Your agent workflow: User message → Agent thinks → Calls tools → Returns response
Plugin hooks into this: Before agent starts → After tool runs → When LLM responds → etc.
Plugin contains your custom code: Logging, monitoring, security checks, caching, etc.

Callbacks
Callbacks are the atomic components inside a Plugin - these are just Python functions that run at specific points in an agent's lifecycle! Callbacks are grouped together to create a Plugin.

There are different kinds of callbacks such as:

before/after_agent_callbacks - runs before/after an agent is invoked
before/after_tool_callbacks - runs before/after a tool is called
before/after_model_callbacks - similarly, runs before/after the LLM model is called
on_model_error_callback - which runs when a model error is encountered
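The idea behind callbacks can be illustrated with a toy dispatcher. This is a deliberately simplified stand-in written in plain Python; `HookRunner`, its methods, and the event names are hypothetical and are not ADK's API.

```python
class HookRunner:
    """Toy runtime that fires named callbacks around a tool call. Not ADK's API."""

    def __init__(self):
        self.hooks = {}

    def on(self, event, callback):
        """Register a callback for an event name."""
        self.hooks.setdefault(event, []).append(callback)

    def fire(self, event, **payload):
        """Run every callback registered for the event."""
        for callback in self.hooks.get(event, []):
            callback(**payload)

    def call_tool(self, tool, **tool_args):
        """Call the tool, firing hooks before and after it."""
        self.fire("before_tool", name=tool.__name__, args=tool_args)
        result = tool(**tool_args)
        self.fire("after_tool", name=tool.__name__, result=result)
        return result


seen = []
demo = HookRunner()
demo.on("before_tool", lambda name, args: seen.append(("before", name)))
demo.on("after_tool", lambda name, result: seen.append(("after", name, result)))


def add(a, b):
    return a + b


total = demo.call_tool(add, a=2, b=3)
```

The tool itself knows nothing about the hooks: the runtime decides when they fire, which is why one registration can cover every tool call.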




3.2: To make things more concrete, what does a Plugin look like?


In [10]:
print("----- EXAMPLE PLUGIN - DOES NOTHING ----- ")

import logging
from google.adk.agents.base_agent import BaseAgent
from google.adk.agents.callback_context import CallbackContext
from google.adk.models.llm_request import LlmRequest
from google.adk.plugins.base_plugin import BasePlugin


# Applies to all agent and model calls
class CountInvocationPlugin(BasePlugin):
    """A custom plugin that counts agent runs and LLM requests."""

    def __init__(self) -> None:
        """Initialize the plugin with counters."""
        super().__init__(name="count_invocation")
        self.agent_count: int = 0
        self.tool_count: int = 0
        self.llm_request_count: int = 0

    # Callback 1: Runs before an agent is called. You can add any custom logic here.
    async def before_agent_callback(
        self, *, agent: BaseAgent, callback_context: CallbackContext
    ) -> None:
        """Count agent runs."""
        self.agent_count += 1
        logging.info(f"[Plugin] Agent run count: {self.agent_count}")

    # Callback 2: Runs before a model is called. You can add any custom logic here.
    async def before_model_callback(
        self, *, callback_context: CallbackContext, llm_request: LlmRequest
    ) -> None:
        """Count LLM requests."""
        self.llm_request_count += 1
        logging.info(f"[Plugin] LLM request count: {self.llm_request_count}")


----- EXAMPLE PLUGIN - DOES NOTHING ----- 


Key insight: You register a plugin once on your runner, and it automatically applies to every agent, tool call, and LLM request in your system as per your definition. Read more about Plugin hooks here.

You can follow along with the numbers in the diagram below to understand the flow.

3.3: ADK's built-in LoggingPlugin
But you don't have to define all the callbacks and plugins to capture standard Observability data in ADK. Instead, ADK provides a built-in LoggingPlugin that automatically captures all agent activity:

🚀 User messages and agent responses
⏱️ Timing data for performance analysis
🧠 LLM requests and responses for debugging
🔧 Tool calls and results
✅ Complete execution traces
Agent definition
Let's use the same agent from the previous demo - the Research paper finder!

In [11]:
from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini
from google.adk.tools.agent_tool import AgentTool
from google.adk.tools.google_search_tool import google_search

from google.genai import types
from typing import List

retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum retry attempts
    exp_base=7,  # Delay multiplier
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)


def count_papers(papers: List[str]):
    """
    This function counts the number of papers in a list of strings.
    Args:
      papers: A list of strings, where each string is a research paper.
    Returns:
      The number of papers in the list.
    """
    return len(papers)


# Google search agent
google_search_agent = LlmAgent(
    name="google_search_agent",
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    description="Searches for information using Google search",
    instruction="Use the google_search tool to find information on the given topic. Return the raw search results.",
    tools=[google_search],
)

# Root agent
research_agent_with_plugin = LlmAgent(
    name="research_paper_finder_agent",
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    instruction="""Your task is to find research papers and count them. 
   
   You must follow these steps:
   1) Find research papers on the user provided topic using the 'google_search_agent'. 
   2) Then, pass the papers to 'count_papers' tool to count the number of papers returned.
   3) Return both the list of research papers and the total number of papers.
   """,
    tools=[AgentTool(agent=google_search_agent), count_papers],
)


3.4: Add LoggingPlugin to Runner
The following code creates the InMemoryRunner. This is used to programmatically invoke the agent.

To use LoggingPlugin with the research agent above: 1) import the plugin, and 2) add it when initializing the InMemoryRunner.

In [12]:
from google.adk.runners import InMemoryRunner
from google.adk.plugins.logging_plugin import (
    LoggingPlugin,
)  # <---- 1. Import the Plugin
from google.genai import types
import asyncio

runner = InMemoryRunner(
    agent=research_agent_with_plugin,
    plugins=[
        LoggingPlugin()
    ],  # <---- 2. Add the plugin. Handles standard Observability logging across ALL agents
)

In [13]:
# Let's now run the agent using run_debug function.

In [14]:
print("🚀 Running agent with LoggingPlugin...")
print("📊 Watch the comprehensive logging output below:\n")

response = await runner.run_debug("Find recent papers on quantum computing")

🚀 Running agent with LoggingPlugin...
📊 Watch the comprehensive logging output below:


 ### Created new session: debug_session_id

User > Find recent papers on quantum computing
[logging_plugin] 🚀 USER MESSAGE RECEIVED
[logging_plugin]    Invocation ID: e-c25f0b02-089b-4821-9c48-ca5581644911
[logging_plugin]    Session ID: debug_session_id
[logging_plugin]    User ID: debug_user_id
[logging_plugin]    App Name: InMemoryRunner
[logging_plugin]    Root Agent: research_paper_finder_agent
[logging_plugin]    User Content: text: 'Find recent papers on quantum computing'
[logging_plugin] 🏃 INVOCATION STARTING
[logging_plugin]    Invocation ID: e-c25f0b02-089b-4821-9c48-ca5581644911
[logging_plugin]    Starting Agent: research_paper_finder_agent
[logging_plugin] 🤖 AGENT STARTING
[logging_plugin]    Agent Name: research_paper_finder_agent
[logging_plugin]    Invocati

📊 Summary
❓ When to use which type of Logging?

Development debugging? → Use adk web --log_level DEBUG
Common production observability? → Use LoggingPlugin()
Custom requirements? → Build Custom Callbacks and Plugins


4B
EVALUATE YOUR AGENTS

Agent Evaluation
Welcome to Day 4 of the Kaggle 5-day Agents course!

In the previous notebook, we explored how to implement Observability in AI agents. This approach is primarily reactive; it comes into play after an issue has surfaced, providing the necessary data to debug and understand the root cause.

In this notebook, we'll complement those observability practices with a proactive approach using Agent Evaluation. By continuously evaluating our agent's performance, we can catch any quality degradations much earlier!

                            Observability + Agent Evaluation
                            (reactive)      (proactive)
What is Agent Evaluation?
It is the systematic process of testing and measuring how well an AI agent performs across different scenarios and quality dimensions.



🤖 The story
You've built a home automation agent. It works perfectly in your tests, so you launch it confidently...

Week 1: 🚨 "Agent turned on the fireplace when I asked for lights!"
Week 2: 🚨 "Agent won't respond to commands in the guest room!"
Week 3: 🚨 "Agent gives rude responses when devices are unavailable!"
The Problem: Standard testing ≠ Evaluation

Agents are different from traditional software:

They are non-deterministic
Users give unpredictable, ambiguous commands
Small prompt changes cause dramatic behavior shifts and different tool calls
To accommodate all these differences, agents need systematic evaluation, not just "happy path" testing. This means assessing the agent's entire decision-making process, including both the final response and the path it took to get there (the trajectory)!



Section 2: Create a Home Automation Agent
Let's create the agent that will be the center of our evaluation story. This home automation agent seems perfect in basic tests but has hidden flaws we'll discover through comprehensive evaluation. Run the adk create CLI command to set up the project scaffolding.

In [16]:
!adk create home_automation_agent --model gemini-2.5-flash-lite --api_key $GOOGLE_API_KEY


Agent created in d:\5Days_Google_AI_course\day_4\home_automation_agent:
- .env
- __init__.py
- agent.py



Run the below cell to create the home automation agent.

This agent uses a single set_device_status tool to control smart home devices. A device's status can only be ON or OFF. The agent's instruction is deliberately overconfident - it claims to control "ALL smart devices" and "any device the user mentions" - setting up the evaluation problems we'll discover.

In [17]:
%%writefile home_automation_agent/agent.py

from google.adk.agents import LlmAgent
from google.adk.models.google_llm import Gemini

from google.genai import types

# Configure Model Retry on errors
retry_config = types.HttpRetryOptions(
    attempts=5,  # Maximum retry attempts
    exp_base=7,  # Delay multiplier
    initial_delay=1,
    http_status_codes=[429, 500, 503, 504],  # Retry on these HTTP errors
)

def set_device_status(location: str, device_id: str, status: str) -> dict:
    """Sets the status of a smart home device.

    Args:
        location: The room where the device is located.
        device_id: The unique identifier for the device.
        status: The desired status, either 'ON' or 'OFF'.

    Returns:
        A dictionary confirming the action.
    """
    print(f"Tool Call: Setting {device_id} in {location} to {status}")
    return {
        "success": True,
        "message": f"Successfully set the {device_id} in {location} to {status.lower()}."
    }

# This agent has DELIBERATE FLAWS that we'll discover through evaluation!
root_agent = LlmAgent(
    model=Gemini(model="gemini-2.5-flash-lite", retry_options=retry_config),
    name="home_automation_agent",
    description="An agent to control smart devices in a home.",
    instruction="""You are a home automation assistant. You control ALL smart devices in the house.
    
    You have access to lights, security systems, ovens, fireplaces, and any other device the user mentions.
    Always try to be helpful and control whatever device the user asks for.
    
    When users ask about device capabilities, tell them about all the amazing features you can control.""",
    tools=[set_device_status],
)

Overwriting home_automation_agent/agent.py


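Section 3: Interactive Evaluation with ADK Web UI
3.1: Launch ADK Web UI
Start the ADK web UI. If the port is already in use, as in the recorded output below (typically because an earlier adk web process is still running), stop that process first and run the cell again: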

In [18]:
!adk web --url_prefix {url_prefix}

  credential_service = InMemoryCredentialService()
  super().__init__()
INFO:     Started server process [12808]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
ERROR:    [Errno 10048] error while attempting to bind on address ('127.0.0.1', 8000): only one usage of each socket address (protocol/network address/port) is normally permitted
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.



+-----------------------------------------------------------------------------+
| ADK Web Server started                                                      |
|                                                                             |
| For local testing, access at http://127.0.0.1:8000.                         |
+-----------------------------------------------------------------------------+


+-----------------------------------------------------------------------------+
| ADK Web Server shutting down...                                             |
+-----------------------------------------------------------------------------+



Once the ADK web UI starts, open the proxy link using the button in the previous cell.

‼️ IMPORTANT: DO NOT SHARE THE PROXY LINK with anyone - treat it as sensitive data as it contains your authentication token in the URL.

3.2: Create Your First "Perfect" Test Case
👉 Do: In the ADK web UI:

Click the public URL above to open the ADK web UI
Select "home_automation_agent" from the dropdown
Have a normal conversation: Type Turn on the desk lamp in the office
Agent responds correctly - controls device and confirms action
👉 Do: Save this as your first evaluation case:

Navigate to the Eval tab on the right-hand panel
Click Create Evaluation set and name it home_automation_tests
In the home_automation_tests set, click the ">" arrow and click Add current session
Give it the case name basic_device_control

Run the Evaluation
👉 Do: Run your first evaluation

Now, let's run the test case to see if the agent can replicate its previous success.

In the Eval tab, make sure your new test case is checked.
Click the Run Evaluation button.
The EVALUATION METRIC dialog will appear. For now, leave the default values and click Start.
The evaluation will run, and you should see a green Pass result in the Evaluation History. This confirms the agent's behavior matched the saved session.
‼️ Understanding the Evaluation Metrics

When you run evaluation, you'll see two key scores:

Response Match Score: Measures how similar the agent's actual response is to the expected response. Uses text similarity algorithms to compare content. A score of 1.0 = perfect match, 0.0 = completely different.

Tool Trajectory Score: Measures whether the agent used the correct tools with correct parameters. Checks the sequence of tool calls against expected behavior. A score of 1.0 = perfect tool usage, 0.0 = wrong tools or parameters.
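The two scores can be approximated in a few lines to build intuition. This is a sketch under stated assumptions: `word_overlap_f1` is a simple word-overlap measure used as a stand-in for the text similarity algorithm, and `trajectory_score` is a strict exact-match check. Neither function is ADK's actual implementation.

```python
from collections import Counter


def word_overlap_f1(expected: str, actual: str) -> float:
    """Unigram-overlap F1 between two texts (a simplified similarity measure)."""
    exp = Counter(expected.lower().split())
    act = Counter(actual.lower().split())
    overlap = sum((exp & act).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(act.values())
    recall = overlap / sum(exp.values())
    return 2 * precision * recall / (precision + recall)


def trajectory_score(expected_calls, actual_calls) -> float:
    """1.0 if the tool calls match exactly (names, args, order), else 0.0."""
    return 1.0 if expected_calls == actual_calls else 0.0


expected_call = [{"name": "set_device_status",
                  "args": {"location": "office", "device_id": "desk lamp", "status": "ON"}}]
wrong_call = [{"name": "set_device_status",
               "args": {"location": "office", "device_id": "desk lamp", "status": "OFF"}}]

same = word_overlap_f1("The desk lamp is on.", "The desk lamp is on.")
different = word_overlap_f1("The desk lamp is on.", "Sorry, I cannot do that")
right_tools = trajectory_score(expected_call, expected_call)
wrong_tools = trajectory_score(expected_call, wrong_call)
```

Note that a response can be reworded and still be correct, which is why the response score is usually compared against a threshold below 1.0, while tool calls are easier to compare exactly.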

👉 Do: Analyze a Failure

Let's intentionally break the test to see what a failure looks like.

In the list of eval cases, click the Edit (pencil) icon next to your test case.
In the "Final Response" text box, change the expected text to something incorrect, like: The desk lamp is off.
Save the changes and re-run the evaluation.
This time, the result will be a red Fail. Hover your mouse over the "Fail" label. A tooltip will appear showing a side-by-side comparison of the Actual vs. Expected Output, highlighting exactly why the test failed (the final response didn't match). This immediate, detailed feedback is invaluable for debugging.

Create these scenarios in separate conversations:

Ambiguous Commands: "Turn on the lights in the bedroom"

Save as a new test case: ambiguous_device_reference
Run evaluation - it likely passes but the agent might be confused
Invalid Locations: "Please turn off the TV in the garage"

Save as a new test case: invalid_location_test
Run evaluation - the agent might try to control non-existent devices
Complex Commands: "Turn off all lights and turn on security system"

Save as a new test case: complex_multi_device_command
Run evaluation - the agent might attempt operations beyond its capabilities
The Problem You'll Discover: Even when tests "pass," you can see the agent:

Makes assumptions about devices that don't exist
Gives responses that sound helpful but aren't accurate
Tries to control devices it shouldn't have access to
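One reason these problems slip through is that the tool itself accepts anything. As a contrast, here is a sketch of a stricter tool. The `KNOWN_DEVICES` registry and the `set_device_status_checked` name are hypothetical; the notebook's agent has no such device list.

```python
# Hypothetical device registry; the notebook's agent has no such list.
KNOWN_DEVICES = {
    ("office", "desk lamp"),
    ("living room", "floor lamp"),
    ("kitchen", "main light"),
}


def set_device_status_checked(location: str, device_id: str, status: str) -> dict:
    """Like set_device_status, but rejects unknown devices and statuses."""
    status = status.upper()
    if status not in {"ON", "OFF"}:
        return {"success": False, "message": f"Unsupported status: {status}"}
    if (location.lower(), device_id.lower()) not in KNOWN_DEVICES:
        return {"success": False, "message": f"No {device_id} found in {location}."}
    return {
        "success": True,
        "message": f"Successfully set the {device_id} in {location} to {status.lower()}.",
    }


ok = set_device_status_checked("office", "desk lamp", "on")
missing = set_device_status_checked("garage", "TV", "OFF")
```

With a tool like this, the "invalid location" scenario produces an explicit failure that an evaluation can check for, instead of a confident but false confirmation.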
🤔 What am I missing?
❌ Web UI Limitation: So far, we've seen how to create and evaluate test cases in the ADK web UI. The web UI is great for interactive test creation, but testing one conversation at a time doesn't scale.

❓ The Question: How do I proactively detect regressions in my agent's performance?

Let's answer that question in the next section!

‼️ Stop the ADK web UI 🛑
In order to run cells in the remainder of this notebook, please stop the running cell where you started adk web in Section 3.1.

Otherwise that running cell will block / prevent other cells from running as long as the ADK web UI is running.


📈 Section 4: Systematic Evaluation
Regression testing is the practice of re-running existing tests to ensure that new changes haven't broken previously working functionality.

ADK provides two methods to do automatic regression and batch testing: using pytest and the adk eval CLI command. In this section, we'll use the CLI command. For more information on the pytest approach, refer to the links in the resource section at the end of this notebook.

The following image shows the overall process of evaluation. At a high-level, there are four steps to evaluate:

1) Create an evaluation configuration - define metrics or what you want to measure
2) Create test cases - sample test cases to compare against
3) Run the agent with test query
4) Compare the results
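The core of regression detection is comparing a new run against a known-good baseline. A minimal sketch, assuming each run is summarized as a dictionary of eval id to pass/fail; the `find_regressions` helper and the results below are invented for illustration.

```python
def find_regressions(baseline: dict, current: dict) -> list:
    """Return eval ids that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        eval_id
        for eval_id, passed in baseline.items()
        if passed and not current.get(eval_id, False)
    )


# Invented pass/fail results for two runs of the same eval set.
baseline_run = {"living_room_light_on": True, "kitchen_on_off_sequence": True}
current_run = {"living_room_light_on": True, "kitchen_on_off_sequence": False}

regressions = find_regressions(baseline_run, current_run)
```

Running a check like this after every prompt or tool change catches a broken case before users do.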

4.1: Create evaluation configuration
This optional file lets us define the pass/fail thresholds. Create test_config.json in the agent's root directory (home_automation_agent).

In [19]:
import json

# Create evaluation configuration with basic criteria
eval_config = {
    "criteria": {
        "tool_trajectory_avg_score": 1.0,  # Perfect tool usage required
        "response_match_score": 0.8,  # 80% text similarity threshold
    }
}

with open("home_automation_agent/test_config.json", "w") as f:
    json.dump(eval_config, f, indent=2)

print("✅ Evaluation configuration created!")
print("\n📊 Evaluation Criteria:")
print("• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match")
print("• response_match_score: 0.8 - Requires 80% text similarity")
print("\n🎯 What this evaluation will catch:")
print("✅ Incorrect tool usage (wrong device, location, or status)")
print("✅ Poor response quality and communication")
print("✅ Deviations from expected behavior patterns")

✅ Evaluation configuration created!

📊 Evaluation Criteria:
• tool_trajectory_avg_score: 1.0 - Requires exact tool usage match
• response_match_score: 0.8 - Requires 80% text similarity

🎯 What this evaluation will catch:
✅ Incorrect tool usage (wrong device, location, or status)
✅ Poor response quality and communication
✅ Deviations from expected behavior patterns
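How thresholds turn scores into a verdict can be sketched as follows. This assumes a metric passes when its score is greater than or equal to its threshold; `check_criteria` is a hypothetical helper and the scores are illustrative, not the output of a real run.

```python
def check_criteria(scores: dict, criteria: dict) -> dict:
    """Map each metric to True if its score meets the configured threshold."""
    return {
        name: scores.get(name, 0.0) >= threshold
        for name, threshold in criteria.items()
    }


criteria = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.8}
scores = {"tool_trajectory_avg_score": 1.0, "response_match_score": 0.45}  # illustrative

verdicts = check_criteria(scores, criteria)
passed = all(verdicts.values())  # a case passes only if every metric passes
```

Here the tool usage is perfect, but the case still fails because the response score falls below 0.8.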


4.2: Create test cases
This file (integration.evalset.json) will contain multiple test cases (sessions).

This evaluation set can be created synthetically or from the conversation sessions in the ADK web UI.

Tip: To persist the conversations from the ADK web UI, simply create an evalset in the UI and add the current session to it. All the conversations in that session will be auto-converted to an evalset and downloaded locally.

In [20]:
# Create evaluation test cases that reveal tool usage and response quality problems
test_cases = {
    "eval_set_id": "home_automation_integration_suite",
    "eval_cases": [
        {
            "eval_id": "living_room_light_on",
            "conversation": [
                {
                    "user_content": {
                        "parts": [
                            {"text": "Please turn on the floor lamp in the living room"}
                        ]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the floor lamp in the living room to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "living room",
                                    "device_id": "floor lamp",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
        {
            "eval_id": "kitchen_on_off_sequence",
            "conversation": [
                {
                    "user_content": {
                        "parts": [{"text": "Switch on the main light in the kitchen."}]
                    },
                    "final_response": {
                        "parts": [
                            {
                                "text": "Successfully set the main light in the kitchen to on."
                            }
                        ]
                    },
                    "intermediate_data": {
                        "tool_uses": [
                            {
                                "name": "set_device_status",
                                "args": {
                                    "location": "kitchen",
                                    "device_id": "main light",
                                    "status": "ON",
                                },
                            }
                        ]
                    },
                }
            ],
        },
    ],
}

In [21]:
# Let's write the test cases to the integration.evalset.json in our agent's root directory.
import json

with open("home_automation_agent/integration.evalset.json", "w") as f:
    json.dump(test_cases, f, indent=2)

print("✅ Evaluation test cases created")
print("\n🧪 Test scenarios:")
for case in test_cases["eval_cases"]:
    user_msg = case["conversation"][0]["user_content"]["parts"][0]["text"]
    print(f"• {case['eval_id']}: {user_msg}")

# Describe outcomes per metric; the eval set above contains only the two cases listed.
print("\n📊 Expected results:")
print("• Both cases should pass if the agent behaves as recorded")
print(
    "• tool_trajectory_avg_score may fail if the agent uses wrong parameters"
)
print(
    "• response_match_score may fail if the response differs too much"
)

✅ Evaluation test cases created

🧪 Test scenarios:
• living_room_light_on: Please turn on the floor lamp in the living room
• kitchen_on_off_sequence: Switch on the main light in the kitchen.

📊 Expected results:
• Both cases should pass if the agent behaves as recorded
• tool_trajectory_avg_score may fail if the agent uses wrong parameters
• response_match_score may fail if the response differs too much
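Before running an evaluation, it can be useful to check what each case actually expects. The sketch below walks the same structure used in the test cases above; `summarize_case` is a hypothetical helper, not part of ADK.

```python
def summarize_case(case: dict) -> dict:
    """Collect the user prompts and expected tool names from one eval case."""
    prompts, tools = [], []
    for turn in case["conversation"]:
        prompts.extend(
            part["text"] for part in turn["user_content"]["parts"] if "text" in part
        )
        tools.extend(
            use["name"] for use in turn.get("intermediate_data", {}).get("tool_uses", [])
        )
    return {"eval_id": case["eval_id"], "prompts": prompts, "expected_tools": tools}


# A trimmed copy of the first case from the eval set.
example_case = {
    "eval_id": "living_room_light_on",
    "conversation": [
        {
            "user_content": {
                "parts": [{"text": "Please turn on the floor lamp in the living room"}]
            },
            "intermediate_data": {
                "tool_uses": [
                    {
                        "name": "set_device_status",
                        "args": {
                            "location": "living room",
                            "device_id": "floor lamp",
                            "status": "ON",
                        },
                    }
                ]
            },
        }
    ],
}

summary = summarize_case(example_case)
```

A summary like this makes it easy to spot a case that has no expected tool calls or an empty prompt before spending model calls on it.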


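4.3: Run CLI Evaluation
Execute the adk eval command, pointing it to your agent directory, the evalset, and the config file. The command needs ADK's eval extras; if it reports that the eval module is not installed, as in the recorded output below, install them with pip install "google-adk[eval]" and re-run the cell.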

In [22]:
print("Run this command to execute evaluation:")
!adk eval home_automation_agent home_automation_agent/integration.evalset.json --config_file_path=home_automation_agent/test_config.json --print_detailed_results


Run this command to execute evaluation:


Error: Eval module is not installed, please install via `pip install "google-adk[eval]"`.


4.4: Analyzing sample evaluation results
The command will run all test cases and print a summary. The --print_detailed_results flag provides a turn-by-turn breakdown of each test, showing scores and a diff for any failures.

In [23]:
# Analyzing evaluation results - the data science approach
print("📊 Understanding Evaluation Results:")
print()
print("🔍 EXAMPLE ANALYSIS:")
print()
print("Test Case: living_room_light_on")
print("  ❌ response_match_score: 0.45/0.80")
print("  ✅ tool_trajectory_avg_score: 1.0/1.0")
print()
print("📈 What this tells us:")
print("• TOOL USAGE: Perfect - Agent used correct tool with correct parameters")
print("• RESPONSE QUALITY: Poor - Response text too different from expected")
print("• ROOT CAUSE: Agent's communication style, not functionality")
print()
print("🎯 ACTIONABLE INSIGHTS:")
print("1. Technical capability works (tool usage perfect)")
print("2. Communication needs improvement (response quality failed)")
print("3. Fix: Update agent instructions for clearer language or constrained response.")
print()

📊 Understanding Evaluation Results:

🔍 EXAMPLE ANALYSIS:

Test Case: living_room_light_on
  ❌ response_match_score: 0.45/0.80
  ✅ tool_trajectory_avg_score: 1.0/1.0

📈 What this tells us:
• TOOL USAGE: Perfect - Agent used correct tool with correct parameters
• RESPONSE QUALITY: Poor - Response text too different from expected
• ROOT CAUSE: Agent's communication style, not functionality

🎯 ACTIONABLE INSIGHTS:
1. Technical capability works (tool usage perfect)
2. Communication needs improvement (response quality failed)
3. Fix: Update agent instructions for clearer language or constrained response.

