When evaluating tool-using agents, I’m running into a gap between two different questions:
Did the agent call the right tool?
Given that the tool returned a valid response, did the agent correctly use that tool output?
Current tool-related evals do a good job of catching the first class of error (wrong or missing tool calls), but they often miss silent failures in the second class: the agent chooses the appropriate tool, the tool executes successfully and returns valid data, and then the agent misinterprets or misuses that data in its subsequent reasoning or final answer.
Some concrete examples of this second category:
The agent ignores a critical field in the tool response.
The agent flips a boolean or misreads a status.
The agent draws a conclusion that is not supported by the tool output, even though the tool call itself was “correct.”
From an evaluation standpoint, these show up as “green” on tool selection and the raw tool logs look fine, but the overall behavior is still wrong. I usually only catch them by manually inspecting trajectories.
I’d love a two-stage eval pattern or metric type that explicitly separates:
Tool-choice correctness: did the agent select an appropriate tool (or set of tools) for this task?
Tool-output usage correctness (conditional on a valid tool response): given that the tool executed successfully and returned a valid response, did the agent’s next step or final answer correctly reflect that tool output and the ground-truth label?
Concretely, this could look like:
Stage 1: score tool selection.
Stage 2: only run on traces where the tool call succeeded (no API error, structurally valid response), and have a judge (LLM or objective check) compare:
the tool response,
the agent’s subsequent message or final answer,
the expected outcome for the task,
and score whether the agent correctly incorporated the tool output.
Aggregated metrics or breakdowns that separate:
“Tool selection errors”
“Tool execution / API errors”
“Tool interpretation errors” (right tool, valid response, wrong usage)
would make it much easier to debug and prioritize work on tool-using agents
When evaluating tool-using agents, I’m running into a gap between two different questions:
Did the agent call the right tool?
Given that the tool returned a valid response, did the agent correctly use that tool output?
Current tool-related evals do a good job of catching the first class of error (wrong or missing tool calls), but they often miss silent failures in the second class: the agent chooses the appropriate tool, the tool executes successfully and returns valid data, and then the agent misinterprets or misuses that data in its subsequent reasoning or final answer.
Some concrete examples of this second category:
The agent ignores a critical field in the tool response.
The agent flips a boolean or misreads a status.
The agent draws a conclusion that is not supported by the tool output, even though the tool call itself was “correct.”
From an evaluation standpoint, these show up as “green” on tool selection and the raw tool logs look fine, but the overall behavior is still wrong. I usually only catch them by manually inspecting trajectories.
I’d love a two-stage eval pattern or metric type that explicitly separates:
Tool-choice correctness: did the agent select an appropriate tool (or set of tools) for this task?
Tool-output usage correctness (conditional on a valid tool response): given that the tool executed successfully and returned a valid response, did the agent’s next step or final answer correctly reflect that tool output and the ground-truth label?
Concretely, this could look like:
Stage 1: score tool selection.
Stage 2: only run on traces where the tool call succeeded (no API error, structurally valid response), and have a judge (LLM or objective check) compare:
the tool response,
the agent’s subsequent message or final answer,
the expected outcome for the task,
and score whether the agent correctly incorporated the tool output.
Aggregated metrics or breakdowns that separate:
“Tool selection errors”
“Tool execution / API errors”
“Tool interpretation errors” (right tool, valid response, wrong usage)
would make it much easier to debug and prioritize work on tool-using agents