# Testing Your Agent

In this recipe, you will build a lightweight evaluation harness for a function-calling agent. You will learn how to:

1. Define test cases as structured data.
2. Evaluate tool-call trajectories (did the agent call the right tools?).
3. Evaluate final responses (did the agent say the right thing?).
4. Track non-functional metrics like latency and token usage.

> **Estimated time:** 15 minutes  
> **Prerequisites:** Basic Python; familiarity with function-calling agents.

## Why test agents?

Agents are stateful and probabilistic, so small changes to prompts, tools, or models can silently break behavior. A structured test set lets you:

- **Catch regressions early** when you update prompts, tool schemas, or models.
- **Compare alternatives objectively** (e.g., model A vs. model B).
- **Ship with confidence** because you know core use cases still pass.

# Steps

## Step 1: Define mock tools

We use deterministic mock tools so that tests are reproducible. In a real setup, these would call external APIs.

In [1]:
from dataclasses import dataclass, field
from typing import Any, Dict, List
import time


def get_weather(city: str, country: str) -> Dict[str, Any]:
    """Get current weather for a city. Returns deterministic mock data."""
    return {"city": city, "country": country, "temp_f": 72, "condition": "Sunny"}


def find_hotel(city: str, country: str, start_date: str, end_date: str, max_budget_per_night: int) -> Dict[str, Any]:
    """Find a hotel in a city. Returns deterministic mock data."""
    return {"city": city, "country": country, "hotel": f"{city}Hotel", "price": max_budget_per_night}


def retrieve_booking(booking_name: str) -> Dict[str, Any]:
    """Retrieve a booking by name. Returns deterministic mock data."""
    return {"booking_name": booking_name, "receipt_path": f"~/Desktop/{booking_name}.txt"}


def send_email(to: str, subject: str, body: str, attachments: List[str]) -> Dict[str, Any]:
    """Send an email. Returns deterministic mock data."""
    return {"to": to, "subject": subject, "status": "sent", "attachments": attachments}
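
One way to connect mock tools to an agent is a small name-to-function registry, so a recorded tool call can be executed from its name and parameters. The sketch below is illustrative only: `TOOL_REGISTRY` and `dispatch` are hypothetical names that do not appear elsewhere in this recipe, and only one tool is registered to keep it short.

```python
from typing import Any, Callable, Dict


def get_weather(city: str, country: str) -> Dict[str, Any]:
    """Deterministic mock, mirroring the tool defined in this step."""
    return {"city": city, "country": country, "temp_f": 72, "condition": "Sunny"}


# Hypothetical registry: maps a tool name to the callable that implements it.
TOOL_REGISTRY: Dict[str, Callable[..., Dict[str, Any]]] = {
    "get_weather": get_weather,
}


def dispatch(tool_name: str, tool_parameters: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a tool by name and call it with keyword arguments."""
    if tool_name not in TOOL_REGISTRY:
        raise KeyError(f"Unknown tool: {tool_name}")
    return TOOL_REGISTRY[tool_name](**tool_parameters)


weather = dispatch("get_weather", {"city": "Boston", "country": "USA"})
print(weather["condition"])
```

Because the mocks are pure functions, a registry like this makes it easy to swap a mock for a real implementation later without touching the test cases.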

## Step 2: Define evaluation helpers

Before building the agent, we define how to evaluate its outputs. This follows test-driven development (TDD): write your assertions first.

We need two types of evaluation:
- **Trajectory evaluation**: Did the agent call the right tools with the right parameters?
- **Response evaluation**: Did the agent's final response contain the expected content?

In [2]:
@dataclass
class ToolCall:
    """Represents a single tool call made by the agent."""
    tool_name: str
    tool_parameters: Dict[str, Any]


@dataclass
class AgentResult:
    """The output of an agent run, including tool calls and metrics."""
    tool_calls: List[ToolCall] = field(default_factory=list)
    final_response: str = ""
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    response_tokens: int = 0
    total_tokens: int = 0


def trajectory_match(actual: List[ToolCall], expected: List[Dict[str, Any]]) -> bool:
    """Check if actual tool calls exactly match expected.
    
    Use exact match when mistakes are risky (e.g., tools that write or delete data).
    """
    actual_norm = [{"tool_name": c.tool_name, "tool_parameters": c.tool_parameters} for c in actual]
    return actual_norm == expected


def response_match(actual: str, expected_contains: str) -> bool:
    """Check if actual response contains the expected substring.
    
    For deterministic tests, a simple substring check is fast and easy to debug.
    For flexible responses, consider LLM-as-a-judge or semantic similarity.
    """
    return expected_contains.lower() in actual.lower()
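
Exact matching is the strictest option. When extra or reordered tool calls are acceptable, looser matchers may fit better. The following is a minimal sketch of two such variants, operating on the same dictionary shape used for expected tool calls; `in_order_match` and `any_order_match` are hypothetical helper names, not part of the recipe's harness.

```python
from typing import Any, Dict, List

Call = Dict[str, Any]  # {"tool_name": ..., "tool_parameters": {...}}


def in_order_match(actual: List[Call], expected: List[Call]) -> bool:
    """True if every expected call appears in `actual` in the same relative order.

    Extra calls in `actual` are tolerated.
    """
    remaining = iter(actual)
    # Each `any` consumes the iterator up to the first match, so later
    # expected calls can only match later actual calls.
    return all(any(call == exp for call in remaining) for exp in expected)


def any_order_match(actual: List[Call], expected: List[Call]) -> bool:
    """True if `actual` contains exactly the expected calls, in any order."""
    unmatched = list(actual)
    for exp in expected:
        if exp not in unmatched:
            return False
        unmatched.remove(exp)
    return not unmatched


weather = {"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}}
hotel = {"tool_name": "find_hotel", "tool_parameters": {"city": "Boston", "country": "USA"}}

print(in_order_match([weather, hotel], [hotel]))
print(any_order_match([hotel, weather], [weather, hotel]))
```

Which matcher to use is a judgment call per test: read-only lookups can often tolerate extra calls, while anything that writes or sends data is usually safer under exact match.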

## Step 3: Build a minimal function-calling agent

This agent uses simple rule-based tool selection for demonstration. In production, you would use an LLM to decide which tools to call.

We also include a stateful agent that maintains memory across turns for multi-turn testing.

In [3]:
def estimate_tokens(text: str) -> int:
    """Simple heuristic for token count (for demo purposes)."""
    return max(1, len(text) // 4)


class SimpleFCAgent:
    """A minimal function-calling agent using rule-based tool selection."""
    
    def run(self, user_input: str) -> AgentResult:
        start = time.time()
        tool_calls: List[ToolCall] = []
        response_parts: List[str] = []

        # Rule-based tool selection for demo
        if "weather" in user_input.lower():
            tool_calls.append(ToolCall("get_weather", {"city": "Boston", "country": "USA"}))
            weather = get_weather(city="Boston", country="USA")
            response_parts.append(f"It is {weather['temp_f']}°F and {weather['condition']} in {weather['city']}.")

        if "hotel" in user_input.lower():
            tool_calls.append(ToolCall("find_hotel", {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}))
            hotel = find_hotel(city="Boston", country="USA", start_date="12/12/2025", end_date="12/25/2025", max_budget_per_night=200)
            response_parts.append(f"I found {hotel['hotel']} in {hotel['city']} for ${hotel['price']}/night.")

        if not response_parts:
            response_parts.append("I can help with weather and hotel searches.")

        final_response = " ".join(response_parts)
        latency_ms = (time.time() - start) * 1000
        prompt_tokens = estimate_tokens(user_input)
        response_tokens = estimate_tokens(final_response)
        
        return AgentResult(
            tool_calls=tool_calls,
            final_response=final_response,
            latency_ms=latency_ms,
            prompt_tokens=prompt_tokens,
            response_tokens=response_tokens,
            total_tokens=prompt_tokens + response_tokens,
        )


class StatefulAgent:
    """An agent that maintains memory across conversation turns."""
    
    def __init__(self):
        self.memory: Dict[str, Any] = {}

    def run(self, user_input: str) -> AgentResult:
        start = time.time()
        tool_calls: List[ToolCall] = []
        response_parts: List[str] = []
        lower = user_input.lower()

        if "retrieve" in lower and "booking" in lower:
            tool_calls.append(ToolCall("retrieve_booking", {"booking_name": "nyc_trip_oct"}))
            booking = retrieve_booking("nyc_trip_oct")
            self.memory["last_receipt_path"] = booking["receipt_path"]
            response_parts.append(f"Receipt saved to {booking['receipt_path']}.")

        if "email" in lower:
            attachment = self.memory.get("last_receipt_path", "~/Desktop/unknown.txt")
            tool_calls.append(ToolCall("send_email", {
                "to": "manager@example.com",
                "subject": "Travel approval",
                "body": "Please approve this trip.",
                "attachments": [attachment]
            }))
            send_email(to="manager@example.com", subject="Travel approval", body="Please approve this trip.", attachments=[attachment])
            response_parts.append("I emailed the receipt for approval.")

        if not response_parts:
            response_parts.append("I can help with weather, hotel searches, and booking emails.")

        final_response = " ".join(response_parts)
        latency_ms = (time.time() - start) * 1000
        prompt_tokens = estimate_tokens(user_input)
        response_tokens = estimate_tokens(final_response)
        
        return AgentResult(
            tool_calls=tool_calls,
            final_response=final_response,
            latency_ms=latency_ms,
            prompt_tokens=prompt_tokens,
            response_tokens=response_tokens,
            total_tokens=prompt_tokens + response_tokens,
        )
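
Both agents time themselves with `time.time()`. For very short intervals, `time.perf_counter()` is generally the better clock, since it is monotonic and uses the highest resolution available. Below is a sketch of a reusable timing wrapper under that assumption; `timed` and `fake_agent_run` are hypothetical names used only for illustration.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call `fn` and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms


def fake_agent_run(user_input: str) -> str:
    """Hypothetical stand-in for an agent's `run` method."""
    return f"echo: {user_input}"


response, latency_ms = timed(fake_agent_run, "What is the weather in Boston?")
print(response, f"{latency_ms:.4f} ms")
```

Keeping the timing outside the agent also means the same wrapper can measure any agent implementation without changes to its code.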

## Step 4: Define the test set

Each test case includes:
- **Input**: The user query.
- **Expected tool calls**: The tools and parameters the agent should use.
- **Expected response**: A substring the final response should contain.

We cover four key scenarios:
1. No tool needed (general query)
2. Single tool (weather only)
3. Single tool (hotel only)
4. Multiple tools in one turn (weather + hotel)

In [4]:
TESTS = [
    {
        "name": "no_tool",
        "input": "Hello, what can you do?",
        "expected_tool_calls": [],
        "expected_response_contains": "help"
    },
    {
        "name": "weather_only",
        "input": "What is the weather in Boston?",
        "expected_tool_calls": [
            {"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}}
        ],
        "expected_response_contains": "Sunny"
    },
    {
        "name": "hotel_only",
        "input": "Find a hotel in Boston.",
        "expected_tool_calls": [
            {"tool_name": "find_hotel", "tool_parameters": {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}}
        ],
        "expected_response_contains": "BostonHotel"
    },
    {
        "name": "weather_and_hotel",
        "input": "What is the weather in Boston and find me a hotel?",
        "expected_tool_calls": [
            {"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}},
            {"tool_name": "find_hotel", "tool_parameters": {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}}
        ],
        "expected_response_contains": "Sunny"
    },
]
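
Because test cases are plain data, they can also live in a JSON file and be checked for well-formedness before a run. The sketch below assumes the same four keys used in `TESTS`; `REQUIRED_KEYS` and `validate_tests` are hypothetical names, and the second JSON entry is deliberately incomplete to show a reported problem.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = {"name", "input", "expected_tool_calls", "expected_response_contains"}


def validate_tests(tests: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; empty means the set is well-formed."""
    problems: List[str] = []
    seen_names = set()
    for i, test in enumerate(tests):
        missing = REQUIRED_KEYS - set(test)
        if missing:
            problems.append(f"test #{i}: missing keys {sorted(missing)}")
        name = test.get("name")
        if name is not None:
            if name in seen_names:
                problems.append(f"test #{i}: duplicate name {name!r}")
            seen_names.add(name)
        for call in test.get("expected_tool_calls", []):
            if set(call) != {"tool_name", "tool_parameters"}:
                problems.append(f"test #{i}: malformed expected tool call {call!r}")
    return problems


raw = """
[
  {"name": "no_tool", "input": "Hello, what can you do?",
   "expected_tool_calls": [], "expected_response_contains": "help"},
  {"name": "broken", "input": "What is the weather in Boston?"}
]
"""
tests_from_json = json.loads(raw)
for problem in validate_tests(tests_from_json):
    print(problem)
```

Validating the test set up front keeps a typo in a test definition from showing up later as a confusing agent failure.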

## Step 5: Run single-turn tests

We run each test case through the agent and evaluate both the trajectory (tool calls) and the response.

In [5]:
agent = SimpleFCAgent()

results = []
for test in TESTS:
    output = agent.run(test["input"])
    traj_ok = trajectory_match(output.tool_calls, test["expected_tool_calls"])
    resp_ok = response_match(output.final_response, test["expected_response_contains"])
    results.append({
        "name": test["name"],
        "trajectory_ok": traj_ok,
        "response_ok": resp_ok,
        "latency_ms": round(output.latency_ms, 2),
        "prompt_tokens": output.prompt_tokens,
        "response_tokens": output.response_tokens,
        "total_tokens": output.total_tokens,
        "final_response": output.final_response
    })

results

[{'name': 'no_tool',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 5,
  'response_tokens': 10,
  'total_tokens': 15,
  'final_response': 'I can help with weather and hotel searches.'},
 {'name': 'weather_only',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 7,
  'response_tokens': 7,
  'total_tokens': 14,
  'final_response': 'It is 72°F and Sunny in Boston.'},
 {'name': 'hotel_only',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 5,
  'response_tokens': 11,
  'total_tokens': 16,
  'final_response': 'I found BostonHotel in Boston for $200/night.'},
 {'name': 'weather_and_hotel',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 12,
  'response_tokens': 19,
  'total_tokens': 31,
  'final_response': 'It is 72°F and Sunny in Boston. I found BostonHotel in Boston for $200/night.'}]
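
A boolean `trajectory_ok` says that something went wrong but not what. When a test fails, a position-by-position explanation can speed up debugging. This is one possible sketch; `explain_trajectory_mismatch` is a hypothetical helper, and the mismatching call in the example is invented for illustration.

```python
from typing import Any, Dict, List

Call = Dict[str, Any]  # {"tool_name": ..., "tool_parameters": {...}}


def explain_trajectory_mismatch(actual: List[Call], expected: List[Call]) -> List[str]:
    """Describe, position by position, how two tool-call lists differ."""
    notes: List[str] = []
    for i in range(max(len(actual), len(expected))):
        if i >= len(actual):
            notes.append(f"step {i}: missing call to {expected[i]['tool_name']}")
        elif i >= len(expected):
            notes.append(f"step {i}: unexpected call to {actual[i]['tool_name']}")
        elif actual[i]["tool_name"] != expected[i]["tool_name"]:
            notes.append(
                f"step {i}: called {actual[i]['tool_name']}, "
                f"expected {expected[i]['tool_name']}"
            )
        else:
            a, e = actual[i]["tool_parameters"], expected[i]["tool_parameters"]
            for key in sorted(set(a) | set(e)):
                # Report keys present on only one side as well as differing values.
                if (key in a) != (key in e) or a.get(key) != e.get(key):
                    notes.append(
                        f"step {i}: {actual[i]['tool_name']}.{key} = {a.get(key)!r}, "
                        f"expected {e.get(key)!r}"
                    )
    return notes


expected_calls = [{"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}}]
actual_calls = [{"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "US"}}]

for note in explain_trajectory_mismatch(actual_calls, expected_calls):
    print(note)
```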

## Step 6: Define multi-turn tests

Multi-turn tests verify that the agent maintains context across conversation turns. Each turn is checked against its own expected tool calls and response, while the agent's memory carries over from one turn to the next.

In [6]:
MULTI_TURN_TESTS = [
    {
        "name": "multi_turn_basic",
        "turns": [
            {
                "input": "Retrieve my booking for my New York trip; it should be called nyc_trip_oct.",
                "expected_tool_calls": [
                    {"tool_name": "retrieve_booking", "tool_parameters": {"booking_name": "nyc_trip_oct"}}
                ],
                "expected_response_contains": "Receipt saved"
            },
            {
                "input": "Email this receipt to my manager for approval.",
                "expected_tool_calls": [
                    {"tool_name": "send_email", "tool_parameters": {"to": "manager@example.com", "subject": "Travel approval", "body": "Please approve this trip.", "attachments": ["~/Desktop/nyc_trip_oct.txt"]}}
                ],
                "expected_response_contains": "emailed"
            }
        ]
    },
    {
        "name": "long_context",
        "turns": [
            {
                "input": "Retrieve my booking for my New York trip; it should be called nyc_trip_oct.",
                "expected_tool_calls": [
                    {"tool_name": "retrieve_booking", "tool_parameters": {"booking_name": "nyc_trip_oct"}}
                ],
                "expected_response_contains": "Receipt saved"
            },
            {
                "input": "Thanks. Also, what can you do?",
                "expected_tool_calls": [],
                "expected_response_contains": "help"
            },
            {
                "input": "Email the receipt we discussed to my manager.",
                "expected_tool_calls": [
                    {"tool_name": "send_email", "tool_parameters": {"to": "manager@example.com", "subject": "Travel approval", "body": "Please approve this trip.", "attachments": ["~/Desktop/nyc_trip_oct.txt"]}}
                ],
                "expected_response_contains": "emailed"
            }
        ]
    }
]
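
Since a stateful agent carries memory between turns, conversations can influence each other if they share one agent instance. The toy example below illustrates that leakage and how a fresh instance per conversation avoids it. `TinyStatefulAgent` is a hypothetical stand-in, much simpler than the `StatefulAgent` in this recipe.

```python
from typing import Dict, List


class TinyStatefulAgent:
    """Hypothetical stand-in for a stateful agent: remembers the last booking name."""

    def __init__(self) -> None:
        self.memory: Dict[str, str] = {}

    def run(self, user_input: str) -> str:
        if user_input.startswith("retrieve "):
            self.memory["last_booking"] = user_input.split(" ", 1)[1]
            return "retrieved"
        # Any other input reports what the agent currently remembers.
        return self.memory.get("last_booking", "unknown")


conversations: List[List[str]] = [
    ["retrieve nyc_trip_oct", "email it"],  # sets memory, then reads it
    ["email it"],                           # should see no prior booking
]

# Shared instance: the second conversation inherits memory from the first.
shared = TinyStatefulAgent()
shared_final = [[shared.run(turn) for turn in conv][-1] for conv in conversations]

# Fresh instance per conversation: each one starts with empty memory.
fresh_final = []
for conv in conversations:
    agent = TinyStatefulAgent()
    fresh_final.append([agent.run(turn) for turn in conv][-1])

print(shared_final)  # ['nyc_trip_oct', 'nyc_trip_oct']
print(fresh_final)   # ['nyc_trip_oct', 'unknown']
```

With the shared instance, the second conversation "remembers" a booking it never retrieved, which could make a test pass or fail for the wrong reason.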

## Step 7: Run multi-turn tests

For multi-turn tests, we use the `StatefulAgent`, which maintains memory across turns.

In [7]:
def run_multi_turn_tests(tests):
    """Run multi-turn tests with a fresh stateful agent per test."""
    all_results = []
    for test in tests:
        agent = StatefulAgent()  # Fresh agent for each test
        turn_results = []
        for turn in test["turns"]:
            output = agent.run(turn["input"])
            traj_ok = trajectory_match(output.tool_calls, turn["expected_tool_calls"])
            resp_ok = response_match(output.final_response, turn["expected_response_contains"])
            turn_results.append({
                "input": turn["input"],
                "trajectory_ok": traj_ok,
                "response_ok": resp_ok,
                "final_response": output.final_response
            })
        all_results.append({"name": test["name"], "turns": turn_results})
    return all_results


multi_turn_results = run_multi_turn_tests(MULTI_TURN_TESTS)
multi_turn_results

[{'name': 'multi_turn_basic',
  'turns': [{'input': 'Retrieve my booking for my New York trip; it should be called nyc_trip_oct.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'Receipt saved to ~/Desktop/nyc_trip_oct.txt.'},
   {'input': 'Email this receipt to my manager for approval.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I emailed the receipt for approval.'}]},
 {'name': 'long_context',
  'turns': [{'input': 'Retrieve my booking for my New York trip; it should be called nyc_trip_oct.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'Receipt saved to ~/Desktop/nyc_trip_oct.txt.'},
   {'input': 'Thanks. Also, what can you do?',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I can help with weather, hotel searches, and booking emails.'},
   {'input': 'Email the receipt we discussed to my manager.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I emailed the receipt for approval.'}]}]

## Step 8: Compute summary metrics

Aggregate results to get an overall view of agent performance.

In [8]:
# Single-turn summary
passed = sum(1 for r in results if r["trajectory_ok"] and r["response_ok"])
total = len(results)
avg_latency = round(sum(r["latency_ms"] for r in results) / total, 2)
avg_total_tokens = round(sum(r["total_tokens"] for r in results) / total, 2)

single_turn_summary = {
    "passed": passed,
    "total": total,
    "pass_rate": f"{(passed/total)*100:.1f}%",
    "avg_latency_ms": avg_latency,
    "avg_total_tokens": avg_total_tokens
}

print("Single-turn test summary:")
single_turn_summary

Single-turn test summary:


{'passed': 4,
 'total': 4,
 'pass_rate': '100.0%',
 'avg_latency_ms': 0.0,
 'avg_total_tokens': 19.0}
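
An average can hide a single expensive test. As a sketch, the snippet below adds a median and maximum and flags tests that exceed a budget. The token counts are copied from the single-turn results; `summarize_metric`, `over_budget`, and the budget of 25 tokens are hypothetical choices made for illustration.

```python
import statistics
from typing import Dict, List


def summarize_metric(values: List[float]) -> Dict[str, float]:
    """Return mean, median, and maximum for a non-empty list of measurements."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": round(statistics.median(values), 2),
        "max": max(values),
    }


def over_budget(values_by_name: Dict[str, float], budget: float) -> List[str]:
    """Return the names of tests whose metric exceeds the budget."""
    return [name for name, value in values_by_name.items() if value > budget]


# Total tokens per test, taken from the single-turn results.
total_tokens = {
    "no_tool": 15,
    "weather_only": 14,
    "hotel_only": 16,
    "weather_and_hotel": 31,
}

token_summary = summarize_metric(list(total_tokens.values()))
flagged = over_budget(total_tokens, budget=25)  # 25 is an arbitrary, illustrative budget

print(token_summary)
print(flagged)
```

The same two helpers apply unchanged to latency values once the agent makes real model or API calls.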

In [9]:
# Multi-turn summary
multi_turn_passed = 0
multi_turn_total = 0

for test in multi_turn_results:
    for turn in test["turns"]:
        multi_turn_total += 1
        if turn["trajectory_ok"] and turn["response_ok"]:
            multi_turn_passed += 1

multi_turn_summary = {
    "passed": multi_turn_passed,
    "total": multi_turn_total,
    "pass_rate": f"{(multi_turn_passed/multi_turn_total)*100:.1f}%" if multi_turn_total > 0 else "N/A"
}

print("Multi-turn test summary:")
multi_turn_summary

Multi-turn test summary:


{'passed': 5, 'total': 5, 'pass_rate': '100.0%'}
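
The summary above counts individual turns. A stricter view treats a conversation as passed only when every turn in it passes. A minimal sketch of that metric follows; `conversation_pass_rate` is a hypothetical helper, and the sample data, including its failing turn, is invented to show how the two views can differ.

```python
from typing import Any, Dict, List


def conversation_pass_rate(multi_turn_results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count a conversation as passed only if every one of its turns passed."""
    passed = sum(
        1
        for test in multi_turn_results
        if all(t["trajectory_ok"] and t["response_ok"] for t in test["turns"])
    )
    total = len(multi_turn_results)
    return {
        "passed": passed,
        "total": total,
        "pass_rate": f"{passed / total * 100:.1f}%" if total else "N/A",
    }


# Illustrative data in the same shape as `multi_turn_results`; the failure is invented.
sample = [
    {"name": "a", "turns": [{"trajectory_ok": True, "response_ok": True},
                            {"trajectory_ok": True, "response_ok": True}]},
    {"name": "b", "turns": [{"trajectory_ok": True, "response_ok": True},
                            {"trajectory_ok": False, "response_ok": True},
                            {"trajectory_ok": True, "response_ok": True}]},
]

conversation_summary = conversation_pass_rate(sample)
print(conversation_summary)
```

In this sample, four of five turns pass (80%), but only one of two conversations passes (50%), so the conversation-level rate is the more conservative signal.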

## Summary

In this recipe, you learned how to:

1. **Define mock tools** with deterministic outputs for reproducible testing.
2. **Write evaluation helpers** for trajectory and response matching.
3. **Structure test cases** as data (single-turn and multi-turn).
4. **Run tests and collect metrics** including pass rates, latency, and token usage.

### Next steps

- **Expand your test set** with more edge cases and failure scenarios.
- **Swap in a real LLM** (e.g., Granite on watsonx.ai) for the agent's tool selection.
- **Add LLM-as-a-judge** for response evaluation when substring matching isn't enough.
- **Integrate with CI/CD** to run tests automatically on every code change.
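
For the CI/CD step, one simple approach is to turn the results into a process exit code, since most CI systems treat a non-zero exit code as a failed job. The sketch below assumes result dictionaries with the `trajectory_ok` and `response_ok` keys used in this recipe; `ci_exit_code` and the 0.9 threshold are hypothetical.

```python
from typing import Dict, List


def ci_exit_code(results: List[Dict[str, bool]], min_pass_rate: float = 1.0) -> int:
    """Return 0 if the pass rate meets the threshold, 1 otherwise."""
    if not results:
        return 1  # An empty run is treated as a failure rather than a silent pass.
    passed = sum(1 for r in results if r["trajectory_ok"] and r["response_ok"])
    return 0 if passed / len(results) >= min_pass_rate else 1


# Illustrative results; in practice these would come from the test runner.
demo_results = [
    {"trajectory_ok": True, "response_ok": True},
    {"trajectory_ok": True, "response_ok": False},
]

exit_code = ci_exit_code(demo_results, min_pass_rate=0.9)
print(exit_code)
```

In a standalone script, the returned value could be passed to `sys.exit` so the CI job fails when the pass rate drops below the threshold.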

# Test-Driven Agent Development — Minimal End-to-End Demo
This notebook implements a very basic function-calling agent, a tiny toolset, and a test runner that checks tool-call trajectories and final responses.

## 1) Define tools (mock implementations)
We use deterministic tool outputs so tests are reproducible.

In [10]:
from dataclasses import dataclass, field
from typing import Any, Dict, List
import time

def get_weather(city: str, country: str) -> Dict[str, Any]:
    # Deterministic mock
    return {"city": city, "country": country, "temp_f": 72, "condition": "Sunny"}

def find_hotel(city: str, country: str, start_date: str, end_date: str, max_budget_per_night: int) -> Dict[str, Any]:
    # Deterministic mock
    return {"city": city, "country": country, "hotel": f"{city}Hotel", "price": max_budget_per_night}

def retrieve_booking(booking_name: str) -> Dict[str, Any]:
    # Deterministic mock
    return {"booking_name": booking_name, "receipt_path": f"~/Desktop/{booking_name}.txt"}

def send_email(to: str, subject: str, body: str, attachments: List[str]) -> Dict[str, Any]:
    # Deterministic mock
    return {"to": to, "subject": subject, "status": "sent", "attachments": attachments}

## 2) A minimal function-calling agent
This agent selects tools by simple rules and builds a response.

In [11]:
@dataclass
class ToolCall:
    tool_name: str
    tool_parameters: Dict[str, Any]

@dataclass
class AgentResult:
    tool_calls: List[ToolCall] = field(default_factory=list)
    final_response: str = ""
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    response_tokens: int = 0
    total_tokens: int = 0

def estimate_tokens(text: str) -> int:
    # Simple heuristic for demo purposes
    return max(1, len(text) // 4)

class SimpleFCAgent:
    def run(self, user_input: str) -> AgentResult:
        start = time.time()
        tool_calls: List[ToolCall] = []
        response_parts: List[str] = []

        # Rule-based tool selection for demo
        if "weather" in user_input.lower():
            tool_calls.append(ToolCall("get_weather", {"city": "Boston", "country": "USA"}))
            weather = get_weather(city="Boston", country="USA")
            response_parts.append(f"It is {weather['temp_f']}°F and {weather['condition']} in {weather['city']}.")

        if "hotel" in user_input.lower():
            tool_calls.append(ToolCall("find_hotel", {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}))
            hotel = find_hotel(city="Boston", country="USA", start_date="12/12/2025", end_date="12/25/2025", max_budget_per_night=200)
            response_parts.append(f"I found {hotel['hotel']} in {hotel['city']} for ${hotel['price']}/night.")

        if not response_parts:
            response_parts.append("I can help with weather and hotel searches.")

        final_response = " ".join(response_parts)
        latency_ms = (time.time() - start) * 1000
        prompt_tokens = estimate_tokens(user_input)
        response_tokens = estimate_tokens(final_response)
        total_tokens = prompt_tokens + response_tokens
        return AgentResult(
            tool_calls=tool_calls,
            final_response=final_response,
            latency_ms=latency_ms,
            prompt_tokens=prompt_tokens,
            response_tokens=response_tokens,
            total_tokens=total_tokens,
        )

class StatefulAgent:
    def __init__(self):
        self.memory: Dict[str, Any] = {}

    def run(self, user_input: str) -> AgentResult:
        start = time.time()
        tool_calls: List[ToolCall] = []
        response_parts: List[str] = []
        lower = user_input.lower()

        if "retrieve" in lower and "booking" in lower:
            tool_calls.append(ToolCall("retrieve_booking", {"booking_name": "nyc_trip_oct"}))
            booking = retrieve_booking("nyc_trip_oct")
            self.memory["last_receipt_path"] = booking["receipt_path"]
            response_parts.append(f"Receipt saved to {booking['receipt_path']}.")

        if "email" in lower:
            attachment = self.memory.get("last_receipt_path", "~/Desktop/unknown.txt")
            tool_calls.append(ToolCall("send_email", {"to": "manager@example.com", "subject": "Travel approval", "body": "Please approve this trip.", "attachments": [attachment]}))
            send_email(to="manager@example.com", subject="Travel approval", body="Please approve this trip.", attachments=[attachment])
            response_parts.append("I emailed the receipt for approval.")

        if not response_parts:
            response_parts.append("I can help with weather, hotel searches, and booking emails.")

        final_response = " ".join(response_parts)
        latency_ms = (time.time() - start) * 1000
        prompt_tokens = estimate_tokens(user_input)
        response_tokens = estimate_tokens(final_response)
        total_tokens = prompt_tokens + response_tokens
        return AgentResult(
            tool_calls=tool_calls,
            final_response=final_response,
            latency_ms=latency_ms,
            prompt_tokens=prompt_tokens,
            response_tokens=response_tokens,
            total_tokens=total_tokens,
        )

## 3) Define a structured test set
Each test includes the user input, the expected tool calls, and a substring the final response should contain.

In [12]:
TESTS = [
    {
        "name": "no_tool",
        "input": "Hello, what can you do?",
        "expected_tool_calls": [],
        "expected_response_contains": "help"
    },
    {
        "name": "weather_only",
        "input": "What is the weather in Boston?",
        "expected_tool_calls": [
            {"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}}
        ],
        "expected_response_contains": "Sunny"
    },
    {
        "name": "hotel_only",
        "input": "Find a hotel in Boston.",
        "expected_tool_calls": [
            {"tool_name": "find_hotel", "tool_parameters": {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}}
        ],
        "expected_response_contains": "BostonHotel"
    },
    {
        "name": "weather_and_hotel",
        "input": "What is the weather in Boston and find me a hotel?",
        "expected_tool_calls": [
            {"tool_name": "get_weather", "tool_parameters": {"city": "Boston", "country": "USA"}},
            {"tool_name": "find_hotel", "tool_parameters": {"city": "Boston", "country": "USA", "start_date": "12/12/2025", "end_date": "12/25/2025", "max_budget_per_night": 200}}
        ],
        "expected_response_contains": "Sunny"
    },
]

## 3.1) Multi-turn and long-context tests
These tests verify memory across turns and a longer conversation.

In [13]:
MULTI_TURN_TESTS = [
    {
        "name": "multi_turn_basic",
        "turns": [
            {
                "input": "Retrieve my booking for my New York trip; it should be called nyc_trip_oct.",
                "expected_tool_calls": [
                    {"tool_name": "retrieve_booking", "tool_parameters": {"booking_name": "nyc_trip_oct"}}
                ],
                "expected_response_contains": "Receipt saved"
            },
            {
                "input": "Email this receipt to my manager for approval.",
                "expected_tool_calls": [
                    {"tool_name": "send_email", "tool_parameters": {"to": "manager@example.com", "subject": "Travel approval", "body": "Please approve this trip.", "attachments": ["~/Desktop/nyc_trip_oct.txt"]}}
                ],
                "expected_response_contains": "emailed"
            }
        ]
    },
    {
        "name": "long_context",
        "turns": [
            {
                "input": "Retrieve my booking for my New York trip; it should be called nyc_trip_oct.",
                "expected_tool_calls": [
                    {"tool_name": "retrieve_booking", "tool_parameters": {"booking_name": "nyc_trip_oct"}}
                ],
                "expected_response_contains": "Receipt saved"
            },
            {
                "input": "Thanks. Also, what can you do?",
                "expected_tool_calls": [],
                "expected_response_contains": "help"
            },
            {
                "input": "Email the receipt we discussed to my manager.",
                "expected_tool_calls": [
                    {"tool_name": "send_email", "tool_parameters": {"to": "manager@example.com", "subject": "Travel approval", "body": "Please approve this trip.", "attachments": ["~/Desktop/nyc_trip_oct.txt"]}}
                ],
                "expected_response_contains": "emailed"
            }
        ]
    }
]

def run_multi_turn_tests(tests):
    """Run multi-turn tests with a fresh stateful agent per test."""
    results = []
    for test in tests:
        # Create the agent inside the loop so memory from one test
        # cannot leak into the next.
        agent = StatefulAgent()
        turn_results = []
        for turn in test["turns"]:
            output = agent.run(turn["input"])
            traj_ok = trajectory_match(output.tool_calls, turn["expected_tool_calls"])
            resp_ok = response_match(output.final_response, turn["expected_response_contains"])
            turn_results.append({
                "input": turn["input"],
                "trajectory_ok": traj_ok,
                "response_ok": resp_ok,
                "final_response": output.final_response
            })
        results.append({"name": test["name"], "turns": turn_results})
    return results

multi_turn_results = run_multi_turn_tests(MULTI_TURN_TESTS)
multi_turn_results

[{'name': 'multi_turn_basic',
  'turns': [{'input': 'Retrieve my booking for my New York trip; it should be called nyc_trip_oct.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'Receipt saved to ~/Desktop/nyc_trip_oct.txt.'},
   {'input': 'Email this receipt to my manager for approval.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I emailed the receipt for approval.'}]},
 {'name': 'long_context',
  'turns': [{'input': 'Retrieve my booking for my New York trip; it should be called nyc_trip_oct.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'Receipt saved to ~/Desktop/nyc_trip_oct.txt.'},
   {'input': 'Thanks. Also, what can you do?',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I can help with weather, hotel searches, and booking emails.'},
   {'input': 'Email the receipt we discussed to my manager.',
    'trajectory_ok': True,
    'response_ok': True,
    'final_response': 'I emailed the receipt for approval.'}]}]

## 4) Evaluation helpers
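Trajectory evaluation compares the tool calls and their parameters against the expected list. Response evaluation uses a simple case-insensitive substring check for this demo. Section 3.1 already calls these helpers, so run this cell first if you execute the demo on its own.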

In [14]:
def normalize_call(call: ToolCall) -> Dict[str, Any]:
    return {"tool_name": call.tool_name, "tool_parameters": call.tool_parameters}

def trajectory_match(actual: List[ToolCall], expected: List[Dict[str, Any]]) -> bool:
    actual_norm = [normalize_call(c) for c in actual]
    return actual_norm == expected

def response_match(actual: str, expected_contains: str) -> bool:
    return expected_contains.lower() in actual.lower()

## 5) Run tests end-to-end

In [15]:
agent = SimpleFCAgent()

results = []
for test in TESTS:
    output = agent.run(test["input"])
    traj_ok = trajectory_match(output.tool_calls, test["expected_tool_calls"])
    resp_ok = response_match(output.final_response, test["expected_response_contains"])
    results.append({
        "name": test["name"],
        "trajectory_ok": traj_ok,
        "response_ok": resp_ok,
        "latency_ms": round(output.latency_ms, 2),
        "prompt_tokens": output.prompt_tokens,
        "response_tokens": output.response_tokens,
        "total_tokens": output.total_tokens,
        "final_response": output.final_response
    })

results

[{'name': 'no_tool',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 5,
  'response_tokens': 10,
  'total_tokens': 15,
  'final_response': 'I can help with weather and hotel searches.'},
 {'name': 'weather_only',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 7,
  'response_tokens': 7,
  'total_tokens': 14,
  'final_response': 'It is 72°F and Sunny in Boston.'},
 {'name': 'hotel_only',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 5,
  'response_tokens': 11,
  'total_tokens': 16,
  'final_response': 'I found BostonHotel in Boston for $200/night.'},
 {'name': 'weather_and_hotel',
  'trajectory_ok': True,
  'response_ok': True,
  'latency_ms': 0.0,
  'prompt_tokens': 12,
  'response_tokens': 19,
  'total_tokens': 31,
  'final_response': 'It is 72°F and Sunny in Boston. I found BostonHotel in Boston for $200/night.'}]

## 6) Summary metrics

In [16]:
passed = sum(1 for r in results if r["trajectory_ok"] and r["response_ok"])
total = len(results)
avg_latency = round(sum(r["latency_ms"] for r in results) / total, 2)
avg_total_tokens = round(sum(r["total_tokens"] for r in results) / total, 2)
{
    "passed": passed,
    "total": total,
    "pass_rate": f"{(passed/total)*100:.1f}%",
    "avg_latency_ms": avg_latency,
    "avg_total_tokens": avg_total_tokens
}

{'passed': 4,
 'total': 4,
 'pass_rate': '100.0%',
 'avg_latency_ms': 0.0,
 'avg_total_tokens': 19.0}