Summary
RITEway's AI testing currently evaluates the output of a result agent, but has no visibility into how that output was produced — specifically, whether the result agent delegated work to subagents.
Problem
When testing prompts that instruct an agent to delegate to subagents, there is no semantic way to verify that delegation actually happened.
The current flow:
- Result agent executes the user prompt and produces output
- Judge agent evaluates that output against each assertion
- The judge only sees the final text result — it has no access to tool calls, subagent spawning, or any intermediate process metadata
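The flow above can be sketched roughly as follows. All names here (`runEval`, `resultAgent`, `judgeAgent`) are hypothetical and illustrative, not RITEway's actual internals:

```javascript
// Rough sketch of the current flow: the result agent runs, any tool
// calls or subagent spawns happen internally and are discarded, and
// the judge sees only the final text output.
// Hypothetical names throughout; not RITEway's real API.
const runEval = async ({ prompt, assertions, resultAgent, judgeAgent }) => {
  // Step 1: result agent executes the prompt. Intermediate process
  // metadata (tool calls, subagent spawning) never leaves this call.
  const output = await resultAgent(prompt);

  // Step 2: each assertion is judged against the text output alone.
  return Promise.all(
    assertions.map((assertion) => judgeAgent({ output, assertion }))
  );
};
```

The sketch makes the gap concrete: nothing in the judge's input can distinguish an agent that delegated from one that merely claims it did.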
This means an assertion like "Given the spec, should delegate research to a subagent" cannot be meaningfully evaluated, because the judge has no information about whether subagents were used.
Actual
The judge agent receives only the result agent's final text output. There is no mechanism to surface tool use or subagent metadata from the result agent's execution.
Expected
There should be a way to write assertions that verify subagent delegation occurred, including at minimum whether subagents were spawned, and ideally which subagents were used.
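One way to picture the missing information is a small metadata object attached to the result. The field names below are assumptions for illustration; nothing like this exists in RITEway today:

```javascript
// Hypothetical shape for delegation metadata the judge could receive.
// Field names are illustrative only.
const delegationMetadata = {
  subagentsSpawned: true,           // minimum: did delegation happen at all?
  subagentCount: 2,                 // richer: how many spawns?
  subagents: ['research', 'review'] // ideal: which subagents were used?
};

// With that metadata, delegation could be verified directly instead
// of inferred from the agent's prose.
const delegatedTo = (metadata, expected) =>
  metadata.subagentsSpawned &&
  expected.every((name) => metadata.subagents.includes(name));
```

A check like `delegatedTo(metadata, ['research'])` is deterministic, so it could even run before (or instead of) an LLM judge for this class of assertion.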
Real-world examples
Three PRs in paralleldrive/aidd dispatch subagents and attempt to assert on delegation behavior, but are limited to checking the agent's text output rather than verifying actual tool calls. One example: /aidd-fix dispatches prompts to subagents, but its evals cover prompt generation only — no eval for the actual dispatch.
Possible approaches (non-exhaustive, open for discussion)
- Feed the full tool call log to the judge agent alongside the result
- Optionally expose the full thinking/execution chain to the judge
- Detect which assertions are about subagent behavior and selectively provide tool call metadata only for those (recommended)
- Expose only subagent-related tool calls to the judge, stripping other tool use (recommended)
- Expose just the count of subagent tool calls (recommended)
We should probably combine the last three approaches to keep it minimal: detect subagent-related assertions, and for those, expose only the subagent tool calls (or even just the count) to the judge — rather than flooding it with the full execution trace.
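A minimal sketch of that combination, with hypothetical names (the keyword heuristic, the `spawn_subagent` tool name, and the tool call shape are all assumptions, not RITEway's actual implementation):

```javascript
// Heuristic: does an assertion ask about subagent behavior?
// Keyword matching is an assumption; real detection could be smarter.
const isSubagentAssertion = (assertion) =>
  /\b(subagents?|delegates?|delegation|dispatch(es)?)\b/i.test(assertion);

// Keep only subagent-related entries from the tool call log.
// 'spawn_subagent' is a hypothetical tool name.
const subagentCalls = (toolCalls) =>
  toolCalls.filter(({ name }) => name === 'spawn_subagent');

// Build the judge's input: the full trace is never forwarded.
// Subagent calls (and their count) are attached only when the
// assertion actually concerns delegation.
const judgeInput = ({ output, toolCalls, assertion }) => {
  if (!isSubagentAssertion(assertion)) return { output, assertion };
  const calls = subagentCalls(toolCalls);
  return { output, assertion, subagentCalls: calls, subagentCount: calls.length };
};
```

Under this sketch, an output-only assertion keeps exactly the judge input it has today, so existing tests would be unaffected.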
Decisions from discussion
Cost mitigation: Subagent delegation tests are inherently expensive (E2E by design — the agent actually dispatches subagents). To manage cost, use --runs 1 for delegation-heavy test files and separate them into their own npm scripts:
{
  "test:ai-eval-review": "riteway ai ai-evals/review/test.sudo --runs 4 --threshold 75 ...",
  "test:ai-eval-pipeline": "riteway ai ai-evals/pipeline/test.sudo --runs 1 --threshold 75 ...",
  "test:ai-eval": "npm run test:ai-eval-review && npm run test:ai-eval-pipeline"
}
This keeps expensive delegation tests isolated from cheaper output-only tests — no framework changes needed for cost control, just how you organize your package.json scripts.