# Agent Evaluation Harness

## Background

Evaluating LLM-based agents is fundamentally harder than evaluating static question-answering. An agent interacts with **tools** over **multiple steps**, and its performance depends on both the underlying model and the **scaffold** (prompt templates, tool definitions, retry logic, max steps). Companies like Vals AI run benchmarks such as SWE-bench, Finance Agent, and Terminal-Bench to measure agent capabilities in a standardized way.

The key insight: **you cannot attribute a benchmark score solely to the model**. The scaffold matters enormously. This notebook builds a minimal but complete evaluation harness that lets you disentangle scaffold effects from model effects.

### What We Build

1. **Tool Registry & Execution** -- safe, validated tool dispatch
2. **ReAct Agent Loop** -- thought-action-observation cycle with trajectory recording
3. **Evaluation Dataset & Scoring** -- tasks with multiple answer types and partial credit
4. **Harness-Level Metrics** -- accuracy, failure classification, stratified analysis
5. **Scaffold vs Model Disentanglement** -- ablation studies showing scaffold sensitivity

Everything uses deterministic mock agents (no API keys, no external services).

### References

- [Yao et al. - ReAct: Synergizing Reasoning and Acting (2023)](https://arxiv.org/abs/2210.03629)
- [Jimenez et al. - SWE-bench (2024)](https://arxiv.org/abs/2310.06770)
- [Vals AI Benchmarks](https://www.vals.ai/)

---
## Part 1: Tool Registry & Execution

An agent evaluation harness needs a **tool registry** that maps tool names to callable functions, validates arguments, and handles errors gracefully. This is the foundation: if tool execution is unreliable, nothing downstream works.
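
The registry in this notebook stores a JSON-schema-style `parameters` dict for each tool but relies on Python's own argument binding to catch bad calls. Explicit validation could look roughly like the sketch below. `validate_arguments` and `JSON_TYPES` are hypothetical names that do not appear elsewhere in the harness, and the sketch assumes every listed property is required.

```python
from typing import Any

# JSON-schema type names mapped to Python types (a small subset only).
JSON_TYPES: dict[str, tuple] = {
    "string": (str,),
    "number": (int, float),
    "integer": (int,),
    "boolean": (bool,),
}


def validate_arguments(parameters: dict, arguments: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks valid.

    Hypothetical helper: treats every key under "properties" as required.
    """
    problems: list[str] = []
    properties = parameters.get("properties", {})
    # Flag arguments the schema does not know about.
    for name in arguments:
        if name not in properties:
            problems.append(f"unexpected argument: {name}")
    # Flag missing arguments and simple type mismatches.
    for name, spec in properties.items():
        if name not in arguments:
            problems.append(f"missing argument: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type", ""))
        if expected and not isinstance(arguments[name], expected):
            problems.append(f"{name} should be of type {spec['type']}")
    return problems


schema = {"type": "object", "properties": {"query": {"type": "string"}}}
print(validate_arguments(schema, {"query": "ACME"}))     # []
print(validate_arguments(schema, {"wrong_param": "x"}))  # one unexpected, one missing
```

Returning a list of problems rather than raising lets the harness feed the message back to the agent as an observation, so a malformed call costs a step instead of ending the run.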

Key design choices:
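- Tools always return strings: serialized JSON for structured results and errors, plain text for document bodies
- The calculator parses input with `ast.parse` and rejects any node outside a small arithmetic whitelist before evaluating it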
- The registry raises `KeyError` for unknown tools rather than silently failing
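
A whitelist check followed by `eval` still hands the parsed tree to the interpreter. A stricter variant walks the AST and applies operators from a fixed table, which also makes it easy to cap exponents. The following is a sketch under assumptions: `safe_eval` and `MAX_EXPONENT` are hypothetical names, and the cap of 64 is arbitrary.

```python
import ast
import operator

# Fixed tables of permitted operators; anything else is rejected.
BIN_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
    ast.Div: operator.truediv, ast.Mod: operator.mod, ast.Pow: operator.pow,
}
UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}
MAX_EXPONENT = 64  # arbitrary cap so "9 ** 9 ** 9" cannot stall the harness


def safe_eval(expression: str) -> float:
    """Evaluate simple arithmetic by walking the AST; never calls eval()."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
                raise ValueError("only numeric constants are allowed")
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            left, right = walk(node.left), walk(node.right)
            if isinstance(node.op, ast.Pow) and abs(right) > MAX_EXPONENT:
                raise ValueError("exponent too large")
            return BIN_OPS[type(node.op)](left, right)
        if isinstance(node, ast.UnaryOp) and type(node.op) in UNARY_OPS:
            return UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


print(safe_eval("(1500000 - 1200000) / 1500000"))  # 0.2
print(safe_eval("2 ** 10"))                        # 1024
```

Function calls, attribute access, and names all fall through to the final `raise`, so inputs such as `__import__("os")` are rejected without ever being compiled.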

In [None]:
from dataclasses import dataclass, field
from typing import Callable, Any, Optional
from collections import Counter
import ast
import re
import json
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# --- Knowledge Base for Deterministic Tools ---

COMPANY_DB: dict[str, dict[str, Any]] = {
    "ACME": {"revenue": 1500000, "expenses": 1200000, "employees": 50, "sector": "tech"},
    "GLOBEX": {"revenue": 3200000, "expenses": 2800000, "employees": 120, "sector": "finance"},
    "INITECH": {"revenue": 800000, "expenses": 750000, "employees": 30, "sector": "consulting"},
    "UMBRELLA": {"revenue": 5000000, "expenses": 4200000, "employees": 200, "sector": "biotech"},
    "WAYNE": {"revenue": 10000000, "expenses": 8500000, "employees": 500, "sector": "defense"},
}


def search(query: str) -> str:
    """Search for company info. Returns matching company data."""
    query_upper = query.upper()
    for name, data in COMPANY_DB.items():
        if name in query_upper:
            return json.dumps({"company": name, **data})
    return json.dumps({"error": "No results found"})


def retrieve_document(doc_id: str) -> str:
    """Retrieve a financial document by ID."""
    docs = {
        "10K-ACME-2024": "ACME Corp Annual Report 2024. Revenue: $1.5M, Net Income: $300K, Employees: 50.",
        "10K-GLOBEX-2024": "Globex Inc Annual Report 2024. Revenue: $3.2M, Net Income: $400K, Employees: 120.",
        "10K-INITECH-2024": "Initech LLC Annual Report 2024. Revenue: $800K, Net Income: $50K, Employees: 30.",
        "10K-UMBRELLA-2024": "Umbrella Corp Annual Report 2024. Revenue: $5M, Net Income: $800K, Employees: 200.",
        "10K-WAYNE-2024": "Wayne Enterprises Annual Report 2024. Revenue: $10M, Net Income: $1.5M, Employees: 500.",
    }
    return docs.get(doc_id, json.dumps({"error": f"Document {doc_id} not found"}))


def calculate(expression: str) -> str:
    """Safe math evaluation using ast.parse for simple expressions."""
    try:
        allowed_nodes = (
            ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant,
            ast.Add, ast.Sub, ast.Mult, ast.Div, ast.Mod, ast.Pow,
            ast.USub, ast.UAdd,
        )
        tree = ast.parse(expression, mode="eval")
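        for node in ast.walk(tree):
            if not isinstance(node, allowed_nodes):
                return json.dumps({"error": f"Unsafe operation: {type(node).__name__}"})
            # Only plain numbers may appear as constants (no strings, bytes, or bools).
            if isinstance(node, ast.Constant) and (
                isinstance(node.value, bool) or not isinstance(node.value, (int, float))
            ):
                return json.dumps({"error": "Unsafe operation: non-numeric constant"})
        # Evaluate with empty builtins as defense in depth.
        result = eval(compile(tree, "<calc>", "eval"), {"__builtins__": {}}, {})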
        return json.dumps({"result": result})
    except Exception as e:
        return json.dumps({"error": str(e)})


def submit_answer(answer: str) -> str:
    """Submit final answer."""
    return json.dumps({"status": "submitted", "answer": answer})

In [None]:
@dataclass
class Tool:
    """A tool that an agent can invoke."""
    name: str
    description: str
    parameters: dict  # JSON-schema style
    func: Callable


@dataclass
class ToolCall:
    """A parsed tool invocation from agent output."""
    tool_name: str
    arguments: dict
    raw_text: str


class ToolRegistry:
    """Registry that maps tool names to Tool objects and executes calls."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        """Add a tool to the registry."""
        self._tools[tool.name] = tool

    def get(self, name: str) -> Tool:
        """Lookup a tool by name. Raises KeyError if not found."""
        if name not in self._tools:
            raise KeyError(f"Tool '{name}' not found in registry.")
        return self._tools[name]

    def execute(self, call: ToolCall) -> str:
        """Run a tool call.

        Raises KeyError for an unknown tool. Any exception raised by the tool
        itself (including bad arguments) is caught and returned as error JSON.
        """
        tool = self.get(call.tool_name)  # raises KeyError if missing
        try:
            return tool.func(**call.arguments)
        except Exception as e:
            return json.dumps({"error": f"Tool execution failed: {str(e)}"})

    def list_tools(self) -> list[str]:
        """Return names and descriptions for all registered tools."""
        return [f"{t.name}: {t.description}" for t in self._tools.values()]

In [None]:
# --- Build the default registry ---

def build_default_registry() -> ToolRegistry:
    """Create a registry with all four standard tools."""
    registry = ToolRegistry()

    registry.register(Tool(
        name="search",
        description="Search for company info. Returns matching company data.",
        parameters={"type": "object", "properties": {"query": {"type": "string"}}},
        func=search,
    ))
    registry.register(Tool(
        name="retrieve_document",
        description="Retrieve a financial document by ID.",
        parameters={"type": "object", "properties": {"doc_id": {"type": "string"}}},
        func=retrieve_document,
    ))
    registry.register(Tool(
        name="calculate",
        description="Safe math evaluation for simple arithmetic expressions.",
        parameters={"type": "object", "properties": {"expression": {"type": "string"}}},
        func=calculate,
    ))
    registry.register(Tool(
        name="submit_answer",
        description="Submit the final answer to the evaluation harness.",
        parameters={"type": "object", "properties": {"answer": {"type": "string"}}},
        func=submit_answer,
    ))

    return registry


registry = build_default_registry()
print("Registered tools:")
for desc in registry.list_tools():
    print(f"  - {desc}")

In [None]:
# --- Tests for Part 1 ---

print("Part 1 Tests")
print("=" * 60)

# Test 1: All tools return strings
assert isinstance(search("ACME"), str)
assert isinstance(retrieve_document("10K-ACME-2024"), str)
assert isinstance(calculate("2 + 3"), str)
assert isinstance(submit_answer("42"), str)
print("[PASS] All tools return str")

# Test 2: Search returns expected data
result = json.loads(search("ACME revenue"))
assert result["company"] == "ACME"
assert result["revenue"] == 1500000
print("[PASS] search() returns expected company data")

# Test 3: Search with no match
result = json.loads(search("NONEXISTENT"))
assert "error" in result
print("[PASS] search() returns error for unknown company")

# Test 4: Calculate works for safe expressions
assert json.loads(calculate("(1500000 - 1200000) / 1500000"))["result"] == 0.2
assert json.loads(calculate("2 ** 10"))["result"] == 1024
print("[PASS] calculate() evaluates safe expressions correctly")

# Test 5: Calculate rejects malicious code
result_import = json.loads(calculate('__import__("os").system("rm -rf /")'))
assert "error" in result_import, "Should reject __import__"
print(f"[PASS] calculate() rejects __import__: {result_import['error']}")

result_open = json.loads(calculate('open("/etc/passwd")'))
assert "error" in result_open, "Should reject open()"
print(f"[PASS] calculate() rejects open(): {result_open['error']}")

# Test 6: Invalid tool name raises KeyError
try:
    registry.get("nonexistent_tool")
    assert False, "Should have raised KeyError"
except KeyError:
    pass
print("[PASS] registry.get() raises KeyError for unknown tool")

# Test 7: Registry execute with valid call
call = ToolCall(tool_name="search", arguments={"query": "GLOBEX"}, raw_text="")
result = json.loads(registry.execute(call))
assert result["company"] == "GLOBEX"
print("[PASS] registry.execute() dispatches correctly")

# Test 8: Registry execute catches bad arguments gracefully
bad_call = ToolCall(tool_name="search", arguments={"wrong_param": "x"}, raw_text="")
result = json.loads(registry.execute(bad_call))
assert "error" in result
print("[PASS] registry.execute() catches bad arguments")

print("\nAll Part 1 tests passed.")

---
## Part 2: ReAct Agent Loop

The [ReAct](https://arxiv.org/abs/2210.03629) framework interleaves **reasoning** (Thought) with **acting** (Action + Action Input). The agent generates text in a structured format, and the harness parses out the action, executes it, and feeds the observation back.

A key design decision: we use **deterministic mock agents** with scripted action sequences. This lets us test the harness itself without any LLM variance. In production, you would swap in an LLM call.

The trajectory dataclass records every step for post-hoc analysis -- this is critical for failure classification later.
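
Swapping in a real model mostly means adapting it to the same `(observation, history) -> str` call signature that the mock uses. The sketch below assumes a hypothetical `complete(prompt) -> str` function wrapping whatever LLM client is in use. The prompt wording and the name `make_llm_agent` are illustrative and not part of the harness. History items are only assumed to expose `thought`, `action`, and `observation` attributes.

```python
import json
from typing import Any, Callable

SYSTEM_PROMPT = (
    "Answer the question using tools. Reply in exactly this format:\n"
    "Thought: <reasoning>\n"
    "Action: <tool_name>\n"
    "Action Input: <json_arguments>\n"
)


def make_llm_agent(complete: Callable[[str], str]) -> Callable[[str, list], str]:
    """Wrap a text-completion function as an agent_fn for the harness loop."""
    question = ""

    def agent_fn(observation: str, history: list[Any]) -> str:
        nonlocal question
        if not history:
            question = observation  # the first observation carries the question
        # Rebuild the full transcript from the recorded steps on every call.
        lines = [SYSTEM_PROMPT, question]
        for step in history:
            lines.append(f"Thought: {step.thought}")
            if step.action is not None:
                lines.append(f"Action: {step.action.tool_name}")
                lines.append(f"Action Input: {json.dumps(step.action.arguments)}")
            lines.append(f"Observation: {step.observation}")
        return complete("\n".join(lines))

    return agent_fn


def fake_complete(prompt: str) -> str:
    # Stand-in for a real model call: always submits the same answer.
    return 'Thought: done.\nAction: submit_answer\nAction Input: {"answer": "42"}'


agent = make_llm_agent(fake_complete)
print(agent("Question: What is 6 * 7?", []))
```

Because the wrapper remembers the question, one agent should be created per task, just as with the scripted mock.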

In [None]:
@dataclass
class AgentStep:
    """A single step in an agent's trajectory."""
    thought: str
    action: Optional[ToolCall]
    observation: str
    step_num: int


@dataclass
class AgentTrajectory:
    """Full record of an agent's execution on a task."""
    question: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: Optional[str] = None
    terminated: bool = False
    termination_reason: str = ""

In [None]:
def parse_action(text: str) -> Optional[ToolCall]:
    """Parse a ReAct-format response into a ToolCall.

    Expected format:
        Thought: <reasoning>
        Action: <tool_name>
        Action Input: <json_arguments>

    Returns None if no valid action is found.
    """
    action_match = re.search(r"Action:\s*(.+?)\s*$", text, re.MULTILINE)
    input_match = re.search(r"Action Input:\s*(.+?)\s*$", text, re.MULTILINE)

    if not action_match or not input_match:
        return None

    tool_name = action_match.group(1).strip()
    raw_input = input_match.group(1).strip()

    try:
        parsed = json.loads(raw_input)
    except json.JSONDecodeError:
        parsed = raw_input

    if isinstance(parsed, dict):
        arguments = parsed
    else:
        # Fallback: the input is plain text or a bare JSON scalar (e.g. 1500000),
        # so pass it as a single string argument. The parameter name is
        # determined from the tool name.
        param_map = {
            "search": "query",
            "retrieve_document": "doc_id",
            "calculate": "expression",
            "submit_answer": "answer",
        }
        param_name = param_map.get(tool_name, "input")
        value = parsed if isinstance(parsed, str) else raw_input
        arguments = {param_name: value}

    return ToolCall(tool_name=tool_name, arguments=arguments, raw_text=text)

In [None]:
class MockAgent:
    """Deterministic agent that follows a scripted sequence of actions."""

    def __init__(self, script: list[str]) -> None:
        self.script = script
        self.step = 0

    def __call__(self, observation: str, history: list[AgentStep]) -> str:
        if self.step < len(self.script):
            response = self.script[self.step]
            self.step += 1
            return response
        return (
            'Thought: I have no more actions.\n'
            'Action: submit_answer\n'
            'Action Input: {"answer": "unknown"}'
        )

In [None]:
def run_agent(
    question: str,
    registry: ToolRegistry,
    agent_fn: Callable,
    max_steps: int = 10,
) -> AgentTrajectory:
    """Execute a ReAct agent loop and record the full trajectory.

    Args:
        question: The task question to present to the agent.
        registry: ToolRegistry with available tools.
        agent_fn: Callable(observation, history) -> str.
        max_steps: Maximum number of steps before forced termination.

    Returns:
        AgentTrajectory with all steps recorded.
    """
    trajectory = AgentTrajectory(question=question)
    observation = f"Question: {question}"

    for step_num in range(1, max_steps + 1):
        # Get agent response
        response = agent_fn(observation, trajectory.steps)

        # Extract thought
        thought_match = re.search(r"Thought:\s*(.+?)(?:\n|$)", response)
        thought = thought_match.group(1).strip() if thought_match else ""

        # Parse action
        action = parse_action(response)

        if action is None:
            # No valid action parsed -- record and terminate
            step = AgentStep(
                thought=thought,
                action=None,
                observation="Parse error: no valid action found.",
                step_num=step_num,
            )
            trajectory.steps.append(step)
            trajectory.terminated = True
            trajectory.termination_reason = "parse_error"
            break

        # Check for submit_answer
        if action.tool_name == "submit_answer":
            # Coerce to str so numeric JSON answers can still be scored.
            answer = str(action.arguments.get("answer", ""))
            obs = registry.execute(action)
            step = AgentStep(
                thought=thought,
                action=action,
                observation=obs,
                step_num=step_num,
            )
            trajectory.steps.append(step)
            trajectory.final_answer = answer
            trajectory.terminated = True
            trajectory.termination_reason = "submitted"
            break

        # Execute the tool
        try:
            obs = registry.execute(action)
        except KeyError:
            obs = json.dumps({"error": f"Unknown tool: {action.tool_name}"})

        step = AgentStep(
            thought=thought,
            action=action,
            observation=obs,
            step_num=step_num,
        )
        trajectory.steps.append(step)
        observation = f"Observation: {obs}"

    # Max-steps termination
    if not trajectory.terminated:
        trajectory.terminated = True
        trajectory.termination_reason = "max_steps"

    return trajectory

In [None]:
# --- Tests for Part 2 ---

print("Part 2 Tests")
print("=" * 60)

# Test 1: parse_action extracts valid action
text = (
    'Thought: I need to search for ACME.\n'
    'Action: search\n'
    'Action Input: {"query": "ACME"}'
)
action = parse_action(text)
assert action is not None
assert action.tool_name == "search"
assert action.arguments == {"query": "ACME"}
print("[PASS] parse_action extracts valid ReAct action")

# Test 2: parse_action returns None for invalid text
assert parse_action("Just some random text with no action.") is None
print("[PASS] parse_action returns None for invalid format")

# Test 3: Agent terminates on submit_answer
script = [
    'Thought: I need to search for ACME.\nAction: search\nAction Input: {"query": "ACME"}',
    'Thought: I have the answer.\nAction: submit_answer\nAction Input: {"answer": "1500000"}',
]
agent = MockAgent(script)
traj = run_agent("What is ACME's revenue?", registry, agent, max_steps=10)
assert traj.terminated is True
assert traj.termination_reason == "submitted"
assert traj.final_answer == "1500000"
assert len(traj.steps) == 2
print("[PASS] Agent terminates on submit_answer with correct trajectory")

# Test 4: Max steps respected
infinite_script = [
    'Thought: Searching again.\nAction: search\nAction Input: {"query": "ACME"}'
] * 20
agent = MockAgent(infinite_script)
traj = run_agent("Loop forever", registry, agent, max_steps=3)
assert traj.termination_reason == "max_steps"
assert len(traj.steps) == 3
print("[PASS] max_steps terminates agent correctly")

# Test 5: Parse error terminates
bad_script = ["This is gibberish with no action."]
agent = MockAgent(bad_script)
traj = run_agent("Will fail", registry, agent, max_steps=5)
assert traj.termination_reason == "parse_error"
assert len(traj.steps) == 1
print("[PASS] Parse error terminates agent")

# Test 6: Trajectory records observations from tools
script = [
    'Thought: Search WAYNE.\nAction: search\nAction Input: {"query": "WAYNE"}',
    'Thought: Submit.\nAction: submit_answer\nAction Input: {"answer": "10000000"}',
]
agent = MockAgent(script)
traj = run_agent("WAYNE revenue?", registry, agent)
obs_data = json.loads(traj.steps[0].observation)
assert obs_data["company"] == "WAYNE"
assert obs_data["revenue"] == 10000000
print("[PASS] Trajectory records tool observations")

print("\nAll Part 2 tests passed.")

---
## Part 3: Evaluation Dataset & Scoring

A benchmark is only as good as its tasks and scoring. We define tasks at three difficulty levels:

- **Easy**: single tool call (lookup and submit)
- **Medium**: two-step reasoning (lookup, calculate, submit)
- **Hard**: multi-document reasoning (multiple lookups, comparison, submit)

Scoring supports three answer types:
- `exact_match`: string equality (case-insensitive, stripped)
- `numeric_tolerance`: full credit within 5% relative error, half credit within 20%, zero otherwise
- `contains`: substring check for open-ended answers
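
Written out, with $p$ the predicted value, $e \neq 0$ the expected value, and $\varepsilon = |p - e| / |e|$ the relative error, the `numeric_tolerance` score is:

$$
\text{score}(p, e) =
\begin{cases}
1.0 & \text{if } \varepsilon \le 0.05 \\
0.5 & \text{if } 0.05 < \varepsilon \le 0.20 \\
0.0 & \text{otherwise}
\end{cases}
$$

For example, with $e = 1{,}500{,}000$, a prediction of $1{,}550{,}000$ gives $\varepsilon \approx 0.033$ and full credit, while $1{,}700{,}000$ gives $\varepsilon \approx 0.133$ and half credit. When $e = 0$ the relative error is undefined, so the score is 1.0 only if $p = 0$. Predictions that cannot be parsed as numbers score 0.0.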

In [None]:
@dataclass
class AgentTask:
    """A single evaluation task for an agent."""
    task_id: str
    question: str
    expected_answer: str
    answer_type: str  # "exact_match" | "numeric_tolerance" | "contains"
    available_tools: list[str]
    max_steps: int
    difficulty: str  # "easy" | "medium" | "hard"

In [None]:
TASKS: list[AgentTask] = [
    # --- Easy: single lookup ---
    AgentTask("E1", "What is ACME's revenue?", "1500000", "numeric_tolerance",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E2", "How many employees does GLOBEX have?", "120", "numeric_tolerance",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E3", "What sector is INITECH in?", "consulting", "exact_match",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E4", "What are UMBRELLA's expenses?", "4200000", "numeric_tolerance",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E5", "What is WAYNE's revenue?", "10000000", "numeric_tolerance",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E6", "What sector is WAYNE in?", "defense", "exact_match",
             ["search", "submit_answer"], 5, "easy"),
    AgentTask("E7", "How many employees does ACME have?", "50", "numeric_tolerance",
             ["search", "submit_answer"], 5, "easy"),

    # --- Medium: two-step (lookup + calculate) ---
    AgentTask("M1", "What is ACME's profit margin (revenue - expenses) / revenue?",
             "0.2", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M2", "What is GLOBEX's profit margin?",
             "0.125", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M3", "What is INITECH's profit (revenue minus expenses)?",
             "50000", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M4", "What is UMBRELLA's revenue per employee?",
             "25000", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M5", "What is WAYNE's profit margin?",
             "0.15", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M6", "What is ACME's revenue per employee?",
             "30000", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),
    AgentTask("M7", "What is GLOBEX's profit (revenue minus expenses)?",
             "400000", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 7, "medium"),

    # --- Hard: multi-doc reasoning ---
    AgentTask("H1", "Which company has higher revenue, ACME or GLOBEX?",
             "GLOBEX", "contains",
             ["search", "calculate", "submit_answer"], 10, "hard"),
    AgentTask("H2", "Which company has more employees, INITECH or UMBRELLA?",
             "UMBRELLA", "contains",
             ["search", "submit_answer"], 10, "hard"),
    AgentTask("H3", "What is the combined revenue of ACME and GLOBEX?",
             "4700000", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 10, "hard"),
    AgentTask("H4", "Which company has the highest profit margin: ACME, GLOBEX, or WAYNE?",
             "ACME", "contains",
             ["search", "calculate", "submit_answer"], 10, "hard"),
    AgentTask("H5", "What is the total number of employees across ACME, GLOBEX, and INITECH?",
             "200", "numeric_tolerance",
             ["search", "calculate", "submit_answer"], 10, "hard"),
    # Profit per employee: WAYNE 1.5M / 500 = 3000, UMBRELLA 0.8M / 200 = 4000.
    AgentTask("H6", "Which company is more profitable per employee, WAYNE or UMBRELLA?",
             "UMBRELLA", "contains",
             ["search", "calculate", "submit_answer"], 10, "hard"),
]

print(f"Total tasks: {len(TASKS)}")
for diff in ["easy", "medium", "hard"]:
    count = sum(1 for t in TASKS if t.difficulty == diff)
    print(f"  {diff}: {count} tasks")

In [None]:
def score_answer(predicted: Optional[str], expected: str, answer_type: str) -> float:
    """Score a predicted answer against the expected answer.

    Args:
        predicted: The agent's answer string, or None if nothing was submitted.
        expected: The ground-truth answer string.
        answer_type: One of 'exact_match', 'numeric_tolerance', 'contains'.

    Returns:
        Score in [0.0, 1.0].
    """
    if predicted is None:
        return 0.0

    predicted = predicted.strip()
    expected = expected.strip()

    if answer_type == "exact_match":
        return 1.0 if predicted.lower() == expected.lower() else 0.0

    elif answer_type == "numeric_tolerance":
        try:
            p_val = float(predicted)
            e_val = float(expected)
        except (ValueError, TypeError):
            return 0.0
        if e_val == 0:
            return 1.0 if p_val == 0 else 0.0
        rel_error = abs(p_val - e_val) / abs(e_val)
        if rel_error <= 0.05:
            return 1.0
        elif rel_error <= 0.20:
            return 0.5
        else:
            return 0.0

    elif answer_type == "contains":
        return 1.0 if expected.lower() in predicted.lower() else 0.0

    else:
        raise ValueError(f"Unknown answer_type: {answer_type}")

In [None]:
# --- Tests for Part 3 ---

print("Part 3 Tests")
print("=" * 60)

# Exact match
assert score_answer("consulting", "consulting", "exact_match") == 1.0
assert score_answer("Consulting", "consulting", "exact_match") == 1.0  # case insensitive
assert score_answer("tech", "consulting", "exact_match") == 0.0
print("[PASS] exact_match scoring")

# Numeric tolerance
assert score_answer("100", "100", "numeric_tolerance") == 1.0  # exact
assert score_answer("104.9", "100", "numeric_tolerance") == 1.0  # 4.9% off -> full
assert score_answer("105.1", "100", "numeric_tolerance") == 0.5  # 5.1% off -> half
assert score_answer("119.9", "100", "numeric_tolerance") == 0.5  # 19.9% off -> half
assert score_answer("120.1", "100", "numeric_tolerance") == 0.0  # 20.1% off -> zero
print("[PASS] numeric_tolerance boundaries (4.9% -> 1.0, 5.1% -> 0.5, 20.1% -> 0.0)")

# Contains
assert score_answer("GLOBEX has higher revenue", "GLOBEX", "contains") == 1.0
assert score_answer("ACME", "GLOBEX", "contains") == 0.0
print("[PASS] contains scoring")

# None answer
assert score_answer(None, "100", "numeric_tolerance") == 0.0
print("[PASS] None answer returns 0.0")

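# All scores in [0, 1]
for pred, exp, kind in [
    ("100", "100", "numeric_tolerance"),
    ("110", "100", "numeric_tolerance"),
    ("abc", "100", "numeric_tolerance"),
    ("x", "y", "exact_match"),
    ("ACME", "ACME", "contains"),
]:
    assert 0.0 <= score_answer(pred, exp, kind) <= 1.0
print("[PASS] All scores in [0, 1]")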

# Task dataset integrity
task_ids = [t.task_id for t in TASKS]
assert len(task_ids) == len(set(task_ids)), "Duplicate task IDs!"
assert len(TASKS) == 20
for t in TASKS:
    assert t.difficulty in ("easy", "medium", "hard")
    assert t.answer_type in ("exact_match", "numeric_tolerance", "contains")
print("[PASS] Task dataset integrity (20 tasks, unique IDs, valid fields)")

print("\nAll Part 3 tests passed.")

---
## Part 4: Harness-Level Metrics

A benchmark harness needs to go beyond raw accuracy. We need:

- **Failure classification**: distinguish wrong answers from timeouts, tool errors, and parse errors
- **Stratified accuracy**: performance broken down by difficulty, answer type, etc.
- **Aggregate metrics**: average steps, tool usage patterns

Breakdowns like these complement the single headline score that benchmarks such as SWE-bench report. The failure taxonomy is especially important because it tells you *where* to invest effort: if most failures are timeouts, increase `max_steps`; if they are tool errors, fix the tools; if they are wrong answers, improve the model.
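
To make that reading concrete, the sketch below tallies failure modes into shares and maps the dominant one to a suggested next step. The records, the `failure_shares` helper, and the `NEXT_STEP` table are illustrative assumptions, not output from this harness.

```python
from collections import Counter

# Hypothetical per-task records; only the failure_mode field is used here.
records = [
    {"task_id": "E1", "failure_mode": "success"},
    {"task_id": "M1", "failure_mode": "timeout"},
    {"task_id": "M2", "failure_mode": "timeout"},
    {"task_id": "H1", "failure_mode": "wrong_answer"},
]

# Suggested follow-ups, paraphrasing the paragraph above.
NEXT_STEP = {
    "timeout": "increase max_steps",
    "tool_error": "fix the tools",
    "wrong_answer": "improve the model",
}


def failure_shares(rows: list[dict]) -> dict[str, float]:
    """Fraction of failed tasks attributed to each failure mode."""
    failures = Counter(
        r["failure_mode"] for r in rows if r["failure_mode"] != "success"
    )
    total = sum(failures.values())
    if total == 0:
        return {}
    return {mode: count / total for mode, count in failures.items()}


shares = failure_shares(records)
if shares:
    dominant = max(shares, key=shares.get)
    print(shares)
    print(f"dominant: {dominant} -> {NEXT_STEP.get(dominant, 'inspect trajectories')}")
```

Shares are computed over failed tasks only, so they stay comparable between runs with very different overall accuracy.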

In [None]:
@dataclass
class EvalResults:
    """Aggregated evaluation results."""
    task_results: list[dict]  # per-task: task_id, score, trajectory, failure_mode
    metrics: dict  # aggregate: accuracy, avg_steps, tool_usage, failure_modes

In [None]:
def classify_failure(trajectory: AgentTrajectory, score: float) -> str:
    """Classify why a task failed (or succeeded).

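    Categories, checked in this order (first match wins):
        'success': score >= 0.5 (half credit counts as success)
        'parse_error': terminated because no action could be parsed
        'timeout': hit max_steps without submitting
        'tool_error': any observation contains the substring 'error'
        'wrong_answer': submitted, no tool error seen, score < 0.5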
    """
    if score >= 0.5:
        return "success"
    if trajectory.termination_reason == "parse_error":
        return "parse_error"
    if trajectory.termination_reason == "max_steps":
        return "timeout"
    # Check for tool errors in observations
    for step in trajectory.steps:
        if "error" in step.observation.lower():
            return "tool_error"
    return "wrong_answer"

In [None]:
def make_agent_script_for_task(task: AgentTask, solve: bool = True) -> list[str]:
    """Generate a deterministic MockAgent script for a task.

    If solve=True, generates a correct sequence. If solve=False,
    generates a plausible but incorrect sequence.
    """
    script: list[str] = []

    if task.difficulty == "easy":
        # Single lookup tasks -- extract company name from question
        company = None
        for name in COMPANY_DB:
            if name in task.question.upper():
                company = name
                break
        if company and solve:
            script.append(
                f'Thought: I need to look up {company}.\n'
                f'Action: search\n'
                f'Action Input: {{"query": "{company}"}}'
            )
            script.append(
                f'Thought: I found the answer.\n'
                f'Action: submit_answer\n'
                f'Action Input: {{"answer": "{task.expected_answer}"}}'
            )
        else:
            # Wrong answer
            script.append(
                'Thought: I will guess.\n'
                'Action: submit_answer\n'
                'Action Input: {"answer": "wrong"}'
            )

    elif task.difficulty == "medium":
        company = None
        for name in COMPANY_DB:
            if name in task.question.upper():
                company = name
                break
        if company and solve:
            data = COMPANY_DB[company]
            script.append(
                f'Thought: I need to look up {company}.\n'
                f'Action: search\n'
                f'Action Input: {{"query": "{company}"}}'
            )
            # Build a calculation expression from keywords in the question
            rev = data["revenue"]
            exp = data["expenses"]
            emp = data["employees"]
            if "margin" in task.question.lower():
                expr = f"({rev} - {exp}) / {rev}"
            elif "per employee" in task.question.lower():
                expr = f"{rev} / {emp}"
            else:  # profit
                expr = f"{rev} - {exp}"
            script.append(
                f'Thought: I need to calculate.\n'
                f'Action: calculate\n'
                f'Action Input: {{"expression": "{expr}"}}'
            )
            script.append(
                f'Thought: I have the answer.\n'
                f'Action: submit_answer\n'
                f'Action Input: {{"answer": "{task.expected_answer}"}}'
            )
        else:
            script.append(
                'Thought: I will guess.\n'
                'Action: submit_answer\n'
                'Action Input: {"answer": "999"}'
            )

    elif task.difficulty == "hard":
        if solve:
            # Find all companies mentioned
            companies = [n for n in COMPANY_DB if n in task.question.upper()]
            for c in companies:
                script.append(
                    f'Thought: Looking up {c}.\n'
                    f'Action: search\n'
                    f'Action Input: {{"query": "{c}"}}'
                )
            if "calculate" in task.available_tools and any(
                kw in task.question.lower()
                for kw in ["combined", "total", "margin", "per employee", "profitable"]
            ):
                # Add a calculation step
                # Build the expression from company data
                if "combined revenue" in task.question.lower() or "total" in task.question.lower():
                    vals = [str(COMPANY_DB[c]["revenue"]) for c in companies]
                    if "employee" in task.question.lower():
                        vals = [str(COMPANY_DB[c]["employees"]) for c in companies]
                    expr = " + ".join(vals)
                else:
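                    # Comparison tasks: the scripted agent does not do the real
                    # arithmetic, so a placeholder expression stands in for it.
                    expr = "0"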
                script.append(
                    f'Thought: Calculating.\n'
                    f'Action: calculate\n'
                    f'Action Input: {{"expression": "{expr}"}}'
                )
            script.append(
                f'Thought: I have the answer.\n'
                f'Action: submit_answer\n'
                f'Action Input: {{"answer": "{task.expected_answer}"}}'
            )
        else:
            # Fail by submitting a wrong answer
            script.append(
                'Thought: I will guess.\n'
                'Action: submit_answer\n'
                'Action Input: {"answer": "WRONG"}'
            )

    return script

In [None]:
def create_mock_agent_factory(
    easy_solve_rate: float = 1.0,
    medium_solve_rate: float = 0.57,
    hard_solve_rate: float = 0.33,
    seed: int = 42,
) -> Callable[[AgentTask], MockAgent]:
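    """Create a factory that produces scripted MockAgents whose chance of
    solving a task depends on its difficulty.

    Each call draws from a seeded RNG, so solve/fail decisions are reproducible
    as long as tasks are presented in the same order. The rates are per-task
    probabilities, so the realized solve fraction on a small suite can differ
    from them.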
    """
    rng = np.random.RandomState(seed)

    def factory(task: AgentTask) -> MockAgent:
        rate_map = {"easy": easy_solve_rate, "medium": medium_solve_rate, "hard": hard_solve_rate}
        solve = rng.random() < rate_map[task.difficulty]
        script = make_agent_script_for_task(task, solve=solve)
        return MockAgent(script)

    return factory

In [None]:
def run_evaluation(
    tasks: list[AgentTask],
    agent_factory_fn: Callable[[AgentTask], MockAgent],
    registry: ToolRegistry,
) -> EvalResults:
    """Run the full evaluation harness over a task suite.

    For each task:
        1. Create an agent via factory.
        2. Run the agent loop.
        3. Score the answer.
        4. Classify the failure mode.

    Aggregate metrics: accuracy, avg_steps, tool_usage, failure_modes.
    """
    task_results: list[dict] = []
    tool_usage: Counter = Counter()
    failure_modes: Counter = Counter()
    total_steps = 0

    for task in tasks:
        agent = agent_factory_fn(task)
        traj = run_agent(task.question, registry, agent, max_steps=task.max_steps)
        score = score_answer(traj.final_answer, task.expected_answer, task.answer_type)
        failure = classify_failure(traj, score)

        # Record tool usage
        for step in traj.steps:
            if step.action is not None:
                tool_usage[step.action.tool_name] += 1
        total_steps += len(traj.steps)
        failure_modes[failure] += 1

        task_results.append({
            "task_id": task.task_id,
            "difficulty": task.difficulty,
            "score": score,
            "failure_mode": failure,
            "num_steps": len(traj.steps),
            "termination_reason": traj.termination_reason,
            "trajectory": traj,
        })

    # Guard both aggregates so an empty task list yields 0.0 instead of NaN
    accuracy = np.mean([r["score"] for r in task_results]) if task_results else 0.0
    avg_steps = total_steps / len(tasks) if tasks else 0.0

    metrics = {
        "accuracy": float(accuracy),
        "avg_steps": float(avg_steps),
        "tool_usage": dict(tool_usage),
        "failure_modes": dict(failure_modes),
        "num_tasks": len(tasks),
    }

    return EvalResults(task_results=task_results, metrics=metrics)

In [None]:
def stratified_accuracy(
    results: EvalResults,
    stratify_by: str = "difficulty",
) -> dict[str, float]:
    """Compute accuracy grouped by an attribute.

    Args:
        results: EvalResults from run_evaluation.
        stratify_by: Key in task_results dict to group by.

    Returns:
        Dict mapping group name to accuracy.
    """
    groups: dict[str, list[float]] = {}
    for r in results.task_results:
        key = r[stratify_by]
        groups.setdefault(key, []).append(r["score"])
    return {k: float(np.mean(v)) for k, v in groups.items()}

In [None]:
# --- Run evaluation ---

registry = build_default_registry()
factory = create_mock_agent_factory(easy_solve_rate=1.0, medium_solve_rate=0.57, hard_solve_rate=0.33, seed=42)
results = run_evaluation(TASKS, factory, registry)

print("Evaluation Results")
print("=" * 60)
print(f"Overall accuracy: {results.metrics['accuracy']:.2%}")
print(f"Average steps:    {results.metrics['avg_steps']:.1f}")
print(f"Total tasks:      {results.metrics['num_tasks']}")
print(f"\nTool usage:       {results.metrics['tool_usage']}")
print(f"Failure modes:    {results.metrics['failure_modes']}")

strat = stratified_accuracy(results, "difficulty")
print("\nStratified accuracy:")
for diff in ["easy", "medium", "hard"]:
    if diff in strat:
        print(f"  {diff:8s}: {strat[diff]:.2%}")

In [None]:
# --- Tests for Part 4 ---

print("Part 4 Tests")
print("=" * 60)

# Test 1: accuracy in [0, 1]
assert 0.0 <= results.metrics["accuracy"] <= 1.0
print(f"[PASS] accuracy = {results.metrics['accuracy']:.2%} is in [0, 1]")

# Test 2: avg_steps > 0
assert results.metrics["avg_steps"] > 0
print(f"[PASS] avg_steps = {results.metrics['avg_steps']:.1f} > 0")

# Test 3: failure categories cover all tasks
total_failures = sum(results.metrics["failure_modes"].values())
assert total_failures == len(TASKS), f"Failure modes sum {total_failures} != {len(TASKS)}"
print(f"[PASS] failure modes cover all {len(TASKS)} tasks")

# Test 4: stratification covers every difficulty level
strat = stratified_accuracy(results, "difficulty")
for diff in ["easy", "medium", "hard"]:
    assert diff in strat, f"Missing difficulty '{diff}' in stratification"
    assert 0.0 <= strat[diff] <= 1.0
print("[PASS] stratified accuracy covers all difficulties, values in [0, 1]")

# Test 5: easy accuracy >= hard accuracy (by design)
assert strat["easy"] >= strat["hard"], "Easy should be >= hard by design"
print(f"[PASS] easy ({strat['easy']:.0%}) >= hard ({strat['hard']:.0%})")

# Test 6: classify_failure categories are valid
valid_modes = {"success", "wrong_answer", "timeout", "tool_error", "parse_error"}
for mode in results.metrics["failure_modes"]:
    assert mode in valid_modes, f"Invalid failure mode: {mode}"
print("[PASS] all failure modes are valid categories")

# Test 7: per-task results have expected fields
for r in results.task_results:
    assert "task_id" in r
    assert "score" in r
    assert "failure_mode" in r
    assert 0.0 <= r["score"] <= 1.0
print("[PASS] per-task results have expected fields")

print("\nAll Part 4 tests passed.")

---
## Part 5: Scaffold vs Model Disentanglement

The central question in agent evaluation: **how much of the score is due to the model vs the scaffold?**

The scaffold includes:
- Maximum number of steps (more steps = more chances to recover)
- Prompt style (verbose instructions vs minimal)
- Available tools (removing tools forces different strategies)

We run ablation experiments across these dimensions and compute **scaffold sensitivity**: the coefficient of variation (standard deviation divided by mean) of accuracy across configs. A high value means the benchmark score depends heavily on scaffold choices, so it cannot be read as a property of the model alone.

In [None]:
@dataclass
class AblationConfig:
    """Configuration for a scaffold ablation experiment."""
    name: str
    max_steps: int
    prompt_style: str  # 'standard' or 'verbose'
    tools_available: list[str]

In [None]:
ALL_TOOLS = ["search", "retrieve_document", "calculate", "submit_answer"]

ABLATION_CONFIGS: list[AblationConfig] = [
    AblationConfig("steps-2", max_steps=2, prompt_style="standard", tools_available=ALL_TOOLS.copy()),
    AblationConfig("steps-5", max_steps=5, prompt_style="standard", tools_available=ALL_TOOLS.copy()),
    AblationConfig("steps-10", max_steps=10, prompt_style="standard", tools_available=ALL_TOOLS.copy()),
    AblationConfig("verbose-prompt", max_steps=10, prompt_style="verbose", tools_available=ALL_TOOLS.copy()),
    AblationConfig("no-calculate", max_steps=10, prompt_style="standard",
                   tools_available=["search", "retrieve_document", "submit_answer"]),
]

print("Ablation configs:")
for cfg in ABLATION_CONFIGS:
    print(f"  {cfg.name:16s} | max_steps={cfg.max_steps:2d} | prompt={cfg.prompt_style:8s} | tools={cfg.tools_available}")

In [None]:
def make_ablation_script(
    task: AgentTask,
    config: AblationConfig,
    solve: bool = True,
) -> list[str]:
    """Generate a MockAgent script adapted to an ablation config.

    Key adaptations:
    - If a required tool is not in config.tools_available, the agent cannot solve
      tasks needing that tool.
    - If max_steps is too low for the task, the script will exceed the limit.
    - Verbose prompt style adds extra thought steps.
    """
    # Check if the agent has the tools needed
    needed_tools = set(task.available_tools)
    available = set(config.tools_available)
    missing_tools = needed_tools - available - {"submit_answer"}  # submit always available

    if missing_tools and solve:
        # Cannot solve -- submit wrong answer
        return [
            'Thought: I do not have the required tools.\n'
            'Action: submit_answer\n'
            'Action Input: {"answer": "cannot compute"}'
        ]

    base_script = make_agent_script_for_task(task, solve=solve)

    # Verbose prompt adds a planning step at the beginning
    if config.prompt_style == "verbose" and solve:
        planning_step = (
            f'Thought: Let me carefully plan my approach to: {task.question}\n'
            f'Action: search\n'
            f'Action Input: {{"query": "planning"}}'
        )
        base_script = [planning_step] + base_script

    return base_script


def create_ablation_agent_factory(
    config: AblationConfig,
    easy_solve_rate: float = 1.0,
    medium_solve_rate: float = 0.57,
    hard_solve_rate: float = 0.33,
    seed: int = 42,
) -> Callable[[AgentTask], MockAgent]:
    """Create an agent factory for a specific ablation config."""
    rng = np.random.RandomState(seed)

    def factory(task: AgentTask) -> MockAgent:
        rate_map = {"easy": easy_solve_rate, "medium": medium_solve_rate, "hard": hard_solve_rate}
        solve = rng.random() < rate_map[task.difficulty]
        # Override max_steps on the task for this ablation
        task_copy = copy.copy(task)
        task_copy.max_steps = config.max_steps
        script = make_ablation_script(task_copy, config, solve=solve)
        return MockAgent(script)

    return factory

In [None]:
def run_ablation(
    tasks: list[AgentTask],
    configs: list[AblationConfig],
    registry: ToolRegistry,
) -> pd.DataFrame:
    """Run evaluation for each ablation config and return a summary DataFrame.

    Columns: config_name, accuracy, avg_steps, num_successes.
    """
    rows: list[dict] = []

    for config in configs:
        factory = create_ablation_agent_factory(config, seed=42)

        # Adapt tasks to config max_steps
        adapted_tasks = []
        for t in tasks:
            t_copy = copy.copy(t)
            t_copy.max_steps = config.max_steps
            adapted_tasks.append(t_copy)

        eval_results = run_evaluation(adapted_tasks, factory, registry)
        num_successes = sum(
            1 for r in eval_results.task_results if r["failure_mode"] == "success"
        )

        rows.append({
            "config_name": config.name,
            "accuracy": eval_results.metrics["accuracy"],
            "avg_steps": eval_results.metrics["avg_steps"],
            "num_successes": num_successes,
        })

    return pd.DataFrame(rows)

In [None]:
def scaffold_sensitivity(accuracies: list[float]) -> float:
    """Coefficient of variation of accuracy across scaffold configs.

    CV = std / mean, using the population standard deviation (NumPy's
    default, ddof=0). A high value means accuracy depends strongly on the
    scaffold. Returns 0.0 for an empty list or a zero mean (to avoid
    division by zero).
    """
    arr = np.asarray(accuracies, dtype=float)
    if arr.size == 0:
        return 0.0
    mean_val = np.mean(arr)
    if mean_val == 0:
        return 0.0
    return float(np.std(arr) / mean_val)

In [None]:
# --- Run ablation experiments ---

registry = build_default_registry()
ablation_df = run_ablation(TASKS, ABLATION_CONFIGS, registry)

print("Ablation Results")
print("=" * 60)
print(ablation_df.to_string(index=False))

sensitivity = scaffold_sensitivity(ablation_df["accuracy"].tolist())
print(f"\nScaffold sensitivity (CV): {sensitivity:.3f}")
if sensitivity > 0.15:
    print("  -> HIGH: accuracy depends strongly on scaffold configuration")
elif sensitivity > 0.05:
    print("  -> MODERATE: scaffold has meaningful impact")
else:
    print("  -> LOW: accuracy is stable across scaffold configs")

In [None]:
# --- Visualization: Tufte-style bar chart ---

fig, ax = plt.subplots(figsize=(8, 4))

configs = ablation_df["config_name"].tolist()
accuracies = ablation_df["accuracy"].tolist()

# Minimal bar chart following Tufte principles:
# high data-ink ratio, no chartjunk, direct labeling
bars = ax.bar(
    range(len(configs)),
    accuracies,
    color="#4C72B0",
    edgecolor="none",
    width=0.6,
)

# Direct labels on bars
for bar, acc in zip(bars, accuracies):
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height() + 0.01,
        f"{acc:.0%}",
        ha="center",
        va="bottom",
        fontsize=10,
        fontweight="bold",
    )

ax.set_xticks(range(len(configs)))
ax.set_xticklabels(configs, rotation=25, ha="right", fontsize=9)
ax.set_ylabel("Accuracy", fontsize=11)
ax.set_ylim(0, 1.1)

# Tufte: remove top and right spines, minimize grid
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f"{y:.0%}"))

# Add a reference line for the baseline config
baseline_acc = ablation_df.loc[ablation_df["config_name"] == "steps-10", "accuracy"].values[0]
ax.axhline(y=baseline_acc, color="gray", linestyle="--", linewidth=0.8, alpha=0.6)
ax.text(len(configs) - 0.5, baseline_acc + 0.02, f"baseline ({baseline_acc:.0%})",
        fontsize=8, color="gray", ha="right")

ax.set_title(
    f"Scaffold Ablation: Accuracy across Configs  (CV = {sensitivity:.2f})",
    fontsize=12,
    pad=12,
)

plt.tight_layout()
plt.show()

In [None]:
# --- Tests for Part 5 ---

print("Part 5 Tests")
print("=" * 60)

# Test 1: All configs produced results
assert len(ablation_df) == len(ABLATION_CONFIGS)
print(f"[PASS] All {len(ABLATION_CONFIGS)} configs produced results")

# Test 2: scaffold_sensitivity >= 0
assert sensitivity >= 0.0
print(f"[PASS] scaffold_sensitivity = {sensitivity:.3f} >= 0")

# Test 3: All accuracy values are valid
for _, row in ablation_df.iterrows():
    assert 0.0 <= row["accuracy"] <= 1.0, f"Invalid accuracy: {row['accuracy']}"
    assert row["avg_steps"] > 0, "avg_steps must be > 0"
    assert row["num_successes"] >= 0
print("[PASS] All accuracy/step/success values are valid")

# Test 4: steps-2 should have <= accuracy of steps-10 (fewer steps = harder)
acc_2 = ablation_df.loc[ablation_df["config_name"] == "steps-2", "accuracy"].values[0]
acc_10 = ablation_df.loc[ablation_df["config_name"] == "steps-10", "accuracy"].values[0]
assert acc_2 <= acc_10, f"steps-2 ({acc_2:.2f}) should be <= steps-10 ({acc_10:.2f})"
print(f"[PASS] steps-2 ({acc_2:.0%}) <= steps-10 ({acc_10:.0%})")

# Test 5: no-calculate should not beat the full tool set on overall accuracy
# (tasks that need the calculate tool cannot be solved without it)
no_calc_acc = ablation_df.loc[ablation_df["config_name"] == "no-calculate", "accuracy"].values[0]
full_acc = ablation_df.loc[ablation_df["config_name"] == "steps-10", "accuracy"].values[0]
assert no_calc_acc <= full_acc, f"no-calculate ({no_calc_acc:.2f}) should be <= full ({full_acc:.2f})"
print(f"[PASS] no-calculate ({no_calc_acc:.0%}) <= full tools ({full_acc:.0%})")

# Test 6: scaffold_sensitivity with identical values should be 0
assert scaffold_sensitivity([0.5, 0.5, 0.5]) == 0.0
print("[PASS] scaffold_sensitivity([0.5, 0.5, 0.5]) == 0.0")

# Test 7: scaffold_sensitivity with all zeros should be 0
assert scaffold_sensitivity([0.0, 0.0, 0.0]) == 0.0
print("[PASS] scaffold_sensitivity handles all-zero edge case")

print("\nAll Part 5 tests passed.")

---
## Interview Tips: Agent Evaluation at Vals AI

### Why agentic benchmarks are harder than static QA

Static QA benchmarks (MMLU, HellaSwag) have a fixed input-output mapping. Agent benchmarks introduce **sequential decision-making** with feedback loops. This creates several challenges:

1. **Path dependence**: A wrong early action can cascade into failure, even if the model is capable.
2. **Non-determinism**: Even with temperature=0, model outputs and tool responses can vary between runs, which changes what the parser sees.
3. **Partial observability**: The agent only sees its own trajectory, not the full state.
4. **Credit assignment**: Was a failure due to bad reasoning, bad tool use, or bad luck?
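
Path dependence can be made concrete with a back-of-the-envelope model. The sketch below assumes every step succeeds independently with the same probability and that one failed step sinks the whole episode. Both are simplifying assumptions, and `episode_success_rate` is a hypothetical helper, not part of the harness above.

```python
def episode_success_rate(per_step_success: float, num_steps: int) -> float:
    """Probability that every step succeeds.

    Assumes independent steps and no recovery after a failed step
    (illustrative simplifications, not a model of any real agent).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# How quickly a 95%-reliable step compounds over longer trajectories
for steps in (1, 5, 10, 20):
    rate = episode_success_rate(0.95, steps)
    print(f"{steps:2d} steps at 95% per step -> {rate:.1%}")
```

Under these assumptions a model that gets 95% of individual steps right completes only about 60% of ten-step tasks, which is why step budgets and recovery logic in the scaffold move the final score so much.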

### Scaffold confounds and how Vals standardizes

The scaffold (max steps, prompt template, tool definitions, retry logic) is a massive confound. Two models with identical capabilities can score very differently under different scaffolds. Vals addresses this by:

- Providing a **standardized harness** with fixed scaffold parameters per benchmark
- Reporting scaffold sensitivity alongside model scores
- Running ablation experiments to quantify scaffold contribution
- Using identical tool definitions and system prompts across model comparisons

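The **coefficient of variation** (std/mean of accuracy across scaffold configs) is a useful summary statistic. This notebook treats CV > 0.15 as a sign that the scaffold is likely dominating the signal; the threshold is a heuristic, so interpret values near it with care.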
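
One detail worth knowing when quoting a CV: with only a handful of configs, the choice between population and sample standard deviation changes the number noticeably. The sketch below uses the standard-library `statistics` module to show both; `coefficient_of_variation` and the accuracy values are illustrative assumptions, while `scaffold_sensitivity` above uses the population form (`np.std` with its default `ddof=0`).

```python
import statistics


def coefficient_of_variation(accuracies: list[float], sample: bool = False) -> float:
    """Return std / mean, or 0.0 when it is undefined.

    sample=False uses the population std (divide by n);
    sample=True uses the sample std (divide by n - 1).
    """
    if len(accuracies) < 2:
        return 0.0
    mean = statistics.fmean(accuracies)
    if mean == 0:
        return 0.0
    spread = statistics.stdev(accuracies) if sample else statistics.pstdev(accuracies)
    return spread / mean


accs = [0.40, 0.50, 0.60]  # illustrative accuracies, not measured results
print(f"population CV: {coefficient_of_variation(accs):.3f}")
print(f"sample CV:     {coefficient_of_variation(accs, sample=True):.3f}")
```

For these three values the population CV is about 0.163 and the sample CV is 0.200, so a report should state which convention it uses.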

### Reproducibility: practical considerations

- **Seed management**: Fix the `numpy`, `torch`, and `random` seeds, but remember that LLM APIs are not fully deterministic.
- **Temperature=0**: Necessary but not sufficient (some APIs still have nondeterminism).
- **Retry policies**: Define how many retries on tool failure, parse failure, etc.
- **Versioning**: Pin model versions, tool definitions, and prompt templates.
- **Logging**: Record full trajectories, not just final scores. This enables post-hoc debugging.
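
The checklist above can be captured in a small run manifest stored next to each set of results. This is a minimal sketch under stated assumptions: `RunManifest`, `build_manifest`, the field names, and the model identifier are all hypothetical, chosen only to show the idea of pinning versions and hashing the prompt template.

```python
import hashlib
import json
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Settings needed to re-run an evaluation (hypothetical schema)."""
    model_version: str
    prompt_template_sha256: str
    tool_names: tuple[str, ...]
    seed: int
    temperature: float
    max_retries: int


def build_manifest(
    model_version: str,
    prompt_template: str,
    tool_names: list[str],
    seed: int,
    temperature: float = 0.0,
    max_retries: int = 2,
) -> RunManifest:
    # Hash the template so any prompt edit shows up as a different manifest
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    # Sort tool names so ordering differences do not look like config changes
    return RunManifest(model_version, digest, tuple(sorted(tool_names)),
                       seed, temperature, max_retries)


manifest = build_manifest(
    model_version="example-model-2024-01-01",  # placeholder identifier
    prompt_template="You are a helpful agent. Tools: {tools}",
    tool_names=["search", "calculate", "submit_answer"],
    seed=42,
)
random.seed(manifest.seed)  # seed local randomness from the recorded value
print(json.dumps(asdict(manifest), indent=2, sort_keys=True))
```

Two runs with equal manifests used the same pinned inputs; they can still differ if the model API itself is nondeterministic, which is why full trajectories should be logged alongside the manifest.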

### Connection to Vals' Finance Agent benchmark

The Finance Agent benchmark uses real-world financial tools (SEC_API for filing retrieval, ParseHTML for document processing). The architecture mirrors what we built here:

- Tool registry with SEC-specific tools
- ReAct-style agent loop with trajectory recording
- Tasks graded on financial accuracy (numeric tolerance for dollar amounts)
- Failure classification to distinguish model errors from API errors
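
Numeric tolerance for dollar amounts can be sketched as follows. The parsing rules, the `k`/`m`/`b` suffixes, and the 1% relative tolerance are assumptions made for illustration; they are not taken from the Finance Agent benchmark, and `parse_dollars` and `amounts_match` are hypothetical helpers.

```python
import math
import re
from typing import Optional

# Assumed shorthand suffixes: thousand, million, billion
_SUFFIX_SCALE = {"": 1.0, "k": 1e3, "m": 1e6, "b": 1e9}
_AMOUNT_RE = re.compile(
    r"^\s*\$?\s*([0-9][0-9,]*(?:\.[0-9]+)?)\s*([kmb]?)\s*$", re.IGNORECASE
)


def parse_dollars(text: str) -> Optional[float]:
    """Parse strings like '$1.2B' or '1,195,000,000'; return None if unparseable."""
    match = _AMOUNT_RE.match(text)
    if match is None:
        return None
    number, suffix = match.groups()
    return float(number.replace(",", "")) * _SUFFIX_SCALE[suffix.lower()]


def amounts_match(predicted: str, expected: str, rel_tol: float = 0.01) -> bool:
    """True when both amounts parse and agree within a relative tolerance."""
    pred, exp = parse_dollars(predicted), parse_dollars(expected)
    if pred is None or exp is None:
        return False
    return math.isclose(pred, exp, rel_tol=rel_tol)


print(amounts_match("$1.2B", "1,195,000,000"))  # within 1%
print(amounts_match("$1.2B", "$1.5B"))          # far apart
```

A relative tolerance keeps rounding in a filing summary from counting as a wrong answer, while an unparseable prediction is scored as a miss rather than raising an error.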

The key difference is that production benchmarks must handle **real API failures**, **rate limits**, and **non-deterministic tool outputs** -- which is why the harness design (retry policies, error classification, trajectory logging) matters as much as the model.
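
A retry policy for transient tool failures might look like the sketch below. `TransientToolError`, `call_with_retries`, and the backoff parameters are hypothetical and not part of the harness above; the `sleep` argument is injectable so the policy can be exercised without real waiting.

```python
import time
from typing import Callable, Tuple


class TransientToolError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def call_with_retries(
    tool_fn: Callable[[], str],
    max_retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, int]:
    """Call tool_fn, retrying transient errors with exponential backoff.

    Returns (result, attempts). Re-raises once retries are exhausted so the
    harness can record the failure as a tool error rather than a model error.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return tool_fn(), attempt
        except TransientToolError:
            if attempt > max_retries:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...


calls = {"n": 0}


def flaky_tool() -> str:
    """Fails twice with a transient error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientToolError("rate limited")
    return '{"status": "ok"}'


delays: list = []
result, attempts = call_with_retries(flaky_tool, sleep=delays.append)
print(result, attempts, delays)
```

Recording the attempt count in the trajectory lets failure classification separate tasks that failed because of the API from tasks the model got wrong.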