Eval gap between “did the agent call the right tool?” and “did the agent correctly use the tool output?” #388

@Ruthwik-Data

When evaluating tool-using agents, I’m running into a gap between two different questions:

Did the agent call the right tool?

Given that the tool returned a valid response, did the agent correctly use that tool output?

Current tool-related evals do a good job of catching the first class of error (wrong or missing tool calls), but they often miss silent failures in the second class: the agent chooses the appropriate tool, the tool executes successfully and returns valid data, and then the agent misinterprets or misuses that data in its subsequent reasoning or final answer.

Some concrete examples of this second category:

The agent ignores a critical field in the tool response.

The agent flips a boolean or misreads a status.

The agent draws a conclusion that is not supported by the tool output, even though the tool call itself was “correct.”

From an evaluation standpoint, these show up as “green” on tool selection and the raw tool logs look fine, but the overall behavior is still wrong. I usually only catch them by manually inspecting trajectories.
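To make the failure mode concrete, here is a minimal sketch of a trace that passes a tool-selection check yet misreports a boolean from the tool response. The trace layout, the `get_order_status` tool, and both helper functions are hypothetical and only illustrate the gap; they are not taken from any existing eval framework.

```python
# Hypothetical trace: field names and the tool itself are illustrative assumptions.
trace = {
    "expected_tool": "get_order_status",
    "tool_call": {"name": "get_order_status", "args": {"order_id": "A-17"}},
    "tool_response": {"order_id": "A-17", "shipped": False, "status": "backordered"},
    "final_answer": "Your order A-17 has shipped and is on its way.",
}


def tool_selection_ok(trace: dict) -> bool:
    """Stage-1 style check: was the expected tool the one that got called?"""
    return trace["tool_call"]["name"] == trace["expected_tool"]


def answer_claims_shipped(answer: str) -> bool:
    """Crude string heuristic standing in for a real judge (assumption)."""
    text = answer.lower()
    return "has shipped" in text and "not" not in text


# The tool-selection check is green ...
selection_passes = tool_selection_ok(trace)

# ... but the answer contradicts the `shipped` field the tool returned.
usage_passes = (
    answer_claims_shipped(trace["final_answer"])
    == trace["tool_response"]["shipped"]
)

print(selection_passes, usage_passes)
```

An eval that only scores `selection_passes` would report this trace as a success, which is the silent failure described above.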

I’d love a two-stage eval pattern or metric type that explicitly separates:

Tool-choice correctness: did the agent select an appropriate tool (or set of tools) for this task?

Tool-output usage correctness (conditional on a valid tool response): given that the tool executed successfully, did the agent’s next step or final answer correctly reflect the tool output and agree with the ground-truth label?

Concretely, this could look like:

Stage 1: score tool selection.

Stage 2: only run on traces where the tool call succeeded (no API error, structurally valid response), and have a judge (LLM or objective check) compare:

the tool response,

the agent’s subsequent message or final answer,

the expected outcome for the task,

and score whether the agent correctly incorporated the tool output.
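The two stages above could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Trace` fields, the `Judge` callable signature, and `exact_field_judge`, which is a simple objective check that an LLM judge with the same three inputs could replace.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Trace:
    # Hypothetical trace schema; adapt to whatever your harness records.
    expected_tool: str
    called_tool: Optional[str]
    tool_error: Optional[str]       # None when the call succeeded
    tool_response: Optional[dict]   # parsed response, if any
    final_answer: str
    expected_outcome: Any


# A judge sees the tool response, the agent's answer, and the expected outcome.
Judge = Callable[[dict, str, Any], bool]


def stage1_tool_choice(trace: Trace) -> bool:
    """Stage 1: did the agent select the expected tool?"""
    return trace.called_tool == trace.expected_tool


def stage2_output_usage(trace: Trace, judge: Judge) -> Optional[bool]:
    """Stage 2: scored only when the tool call succeeded.

    Returns None when the stage does not apply (API error or a response
    that is not structurally valid), so those traces are excluded rather
    than counted as interpretation failures.
    """
    if trace.tool_error is not None or not isinstance(trace.tool_response, dict):
        return None
    return judge(trace.tool_response, trace.final_answer, trace.expected_outcome)


def exact_field_judge(response: dict, answer: str, expected: Any) -> bool:
    """Objective check: the `status` field matches the label and is stated in the answer."""
    value = str(response.get("status"))
    return value == str(expected) and value.lower() in answer.lower()


good = Trace("get_order_status", "get_order_status", None,
             {"status": "backordered"}, "Order A-17 is backordered.", "backordered")
bad = Trace("get_order_status", "get_order_status", None,
            {"status": "backordered"}, "Order A-17 has shipped.", "backordered")
failed = Trace("get_order_status", "get_order_status", "HTTP 500",
               None, "I could not look that up.", "backordered")

results = [stage2_output_usage(t, exact_field_judge) for t in (good, bad, failed)]
print(results)
```

Returning `None` instead of `False` for the failed call is a deliberate choice in this sketch: it keeps execution errors out of the stage-2 denominator, so the usage score stays conditional on a valid tool response.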

Aggregated metrics or breakdowns that separate:

“Tool selection errors”

“Tool execution / API errors”

“Tool interpretation errors” (right tool, valid response, wrong usage)

would make it much easier to debug and prioritize work on tool-using agents.
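One possible shape for that breakdown, as a sketch only: the record fields (`expected_tool`, `called_tool`, `tool_error`, `usage_correct`) and the category names are hypothetical, and the precedence order (selection, then execution, then interpretation) is an assumption about how overlapping failures should be attributed.

```python
from collections import Counter


def classify(record: dict) -> str:
    """Assign each evaluated trace to exactly one bucket."""
    if record["called_tool"] != record["expected_tool"]:
        return "tool_selection_error"
    if record["tool_error"] is not None:
        return "tool_execution_error"
    if not record["usage_correct"]:
        # Right tool, valid response, wrong usage.
        return "tool_interpretation_error"
    return "ok"


# Illustrative records, not real eval results.
records = [
    {"expected_tool": "search", "called_tool": "search", "tool_error": None, "usage_correct": True},
    {"expected_tool": "search", "called_tool": "lookup", "tool_error": None, "usage_correct": False},
    {"expected_tool": "search", "called_tool": "search", "tool_error": "timeout", "usage_correct": False},
    {"expected_tool": "search", "called_tool": "search", "tool_error": None, "usage_correct": False},
]

breakdown = Counter(classify(r) for r in records)
for category, count in sorted(breakdown.items()):
    print(f"{category}: {count}")
```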
