Summary
RITEway's AI testing currently evaluates the output of a result agent, but has no visibility into how that output was produced — specifically, whether the result agent delegated work to subagents.
Problem
When testing prompts that instruct an agent to delegate to subagents, there is no semantic way to verify that delegation actually happened.
The current flow:
- Result agent executes the user prompt and produces output
- Judge agent evaluates that output against each assertion
- The judge only sees the final text result — it has no access to tool calls, subagent spawning, or any intermediate process metadata
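The flow above can be sketched roughly as follows. All names here (`runEval`, `resultAgent`, `judgeAgent`) are hypothetical and illustrative, not RITEway's actual internals:

```javascript
// Rough sketch of the current flow: the result agent runs, any tool
// calls or subagent spawns happen internally and are discarded, and
// the judge sees only the final text output.
// Hypothetical names throughout; not RITEway's real API.
const runEval = async ({ prompt, assertions, resultAgent, judgeAgent }) => {
  // Step 1: result agent executes the prompt. Intermediate process
  // metadata (tool calls, subagent spawning) never leaves this call.
  const output = await resultAgent(prompt);

  // Step 2: each assertion is judged against the text output alone.
  return Promise.all(
    assertions.map((assertion) => judgeAgent({ output, assertion }))
  );
};
```

The sketch makes the gap concrete: nothing in the judge's input can distinguish an agent that delegated from one that merely claims it did.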
This means an assertion like "Given the spec, should delegate research to a subagent" cannot be meaningfully evaluated, because the judge has no information about whether subagents were used.
Actual
The judge agent receives only the result agent's final text output. There is no mechanism to surface tool use or subagent metadata from the result agent's execution.
Expected
There should be a way to write assertions that verify subagent delegation occurred, including at minimum whether subagents were spawned, and ideally which subagents were used.
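One way to picture the missing information is a small metadata object attached to the result. The field names below are assumptions for illustration; nothing like this exists in RITEway today:

```javascript
// Hypothetical shape for delegation metadata the judge could receive.
// Field names are illustrative only.
const delegationMetadata = {
  subagentsSpawned: true,           // minimum: did delegation happen at all?
  subagentCount: 2,                 // richer: how many spawns?
  subagents: ['research', 'review'] // ideal: which subagents were used?
};

// With that metadata, delegation could be verified directly instead
// of inferred from the agent's prose.
const delegatedTo = (metadata, expected) =>
  metadata.subagentsSpawned &&
  expected.every((name) => metadata.subagents.includes(name));
```

A check like `delegatedTo(metadata, ['research'])` is deterministic, so it could even run before (or instead of) an LLM judge for this class of assertion.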
Real-world examples
Three PRs in paralleldrive/aidd dispatch subagents and attempt to assert on delegation behavior, but are limited to checking the agent's text output rather than verifying actual tool calls. One example: /aidd-fix dispatches prompts to subagents, but its evals cover prompt generation only — no eval for the actual dispatch.
Possible approaches (non-exhaustive, open for discussion)
- Feed the full tool call log to the judge agent alongside the result
- Optionally expose the full thinking/execution chain to the judge
- Detect which assertions are about subagent behavior and selectively provide tool call metadata only for those (recommended)
- Expose only subagent-related tool calls to the judge, stripping other tool use (recommended)
- Expose just the count of subagent tool calls (recommended)
We should probably combine the last three approaches to keep it minimal: detect subagent-related assertions, and for those, expose only the subagent tool calls (or even just the count) to the judge — rather than flooding it with the full execution trace.
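A minimal sketch of that combination, with hypothetical names (the keyword heuristic, the `spawn_subagent` tool name, and the tool call shape are all assumptions, not RITEway's actual implementation):

```javascript
// Heuristic: does an assertion ask about subagent behavior?
// Keyword matching is an assumption; real detection could be smarter.
const isSubagentAssertion = (assertion) =>
  /\b(subagents?|delegates?|delegation|dispatch(es)?)\b/i.test(assertion);

// Keep only subagent-related entries from the tool call log.
// 'spawn_subagent' is a hypothetical tool name.
const subagentCalls = (toolCalls) =>
  toolCalls.filter(({ name }) => name === 'spawn_subagent');

// Build the judge's input: the full trace is never forwarded.
// Subagent calls (and their count) are attached only when the
// assertion actually concerns delegation.
const judgeInput = ({ output, toolCalls, assertion }) => {
  if (!isSubagentAssertion(assertion)) return { output, assertion };
  const calls = subagentCalls(toolCalls);
  return { output, assertion, subagentCalls: calls, subagentCount: calls.length };
};
```

Under this sketch, an output-only assertion keeps exactly the judge input it has today, so existing tests would be unaffected.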
Decisions from discussion
Cost mitigation: Subagent delegation tests are inherently expensive (E2E by design — the agent actually dispatches subagents). To manage cost, use --runs 1 for delegation-heavy test files and separate them into their own npm scripts:
{
  "test:ai-eval-review": "riteway ai ai-evals/review/test.sudo --runs 4 --threshold 75 ...",
  "test:ai-eval-pipeline": "riteway ai ai-evals/pipeline/test.sudo --runs 1 --threshold 75 ...",
  "test:ai-eval": "npm run test:ai-eval-review && npm run test:ai-eval-pipeline"
}
This keeps expensive delegation tests isolated from cheaper output-only tests — no framework changes needed for cost control, just how you organize your package.json scripts.