
fix(subagent): include role, session key, and timing in error payloads #68726

Merged
gumadeiras merged 8 commits into openclaw:main from BKF-Gitty:fix/subagent-error-context
Apr 19, 2026

Conversation

@BKF-Gitty
Contributor

@BKF-Gitty BKF-Gitty commented Apr 18, 2026

Error payloads from sessions_spawn and subagent wait outcomes now carry the context a parent needs to retry or report clearly:

  • sessions-spawn-tool: add role (requested agentId) to early validation errors, to the ACP register-failure payload, and to forwarded error results from both the ACP and subagent spawn paths. childSessionKey and runId are already populated by the inner spawn for the errors that know them; this just plumbs role through alongside.
  • subagent-announce-output: extend SubagentRunOutcome with optional startedAt/endedAt/elapsedMs and populate them in applySubagentWaitOutcome so timeout and error outcomes convey how long the child ran before failing.

Scoped verification: tsgo:core, tsgo:core:test, and 43 targeted tests in src/agents (sessions-spawn-tool, subagent-registry lifecycle retry grace, subagent-announce timeout, subagent-announce, and capture-completion-reply) all green. Repo-wide pnpm check is red on latest origin/main for unrelated extensions/discord and extensions/qa-lab surfaces (missing @buape/carbon and @copilotkit/aimock members); not addressed here.

Summary

  • Problem: When a subagent fails or times out, the payload returned to the parent LLM was {status: "error", error: "<message>"} with no subagent identifier and no duration. The parent cannot re-target the specific child session, correlate failures across concurrent subagents, or distinguish "timed out after real work" from "failed instantly."
  • Why it matters: Parent agents cannot retry intelligently or report failures clearly without that context. It also hides dead subagents behind generic error strings during post-hoc log review.
  • What changed: Additive context on error payloads from sessions_spawn plus timing fields on SubagentRunOutcome so timeout/error outcomes surface how long the child ran. No control-flow changes, no type removals.
  • What did NOT change (scope boundary): No changes to subagent-spawn.ts, run-wait.ts, the gateway protocol, or any liveness / heartbeat mechanism. The known gap where a silently-dead subagent isn't detected until the 120s timeout fires is unchanged.
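
To illustrate how a parent could use the added context, here is a minimal sketch. The interfaces mirror the field names listed above but are simplified assumptions, not the repo's real types; `INSTANT_FAILURE_MS`, `triageOutcome`, and `describeSpawnFailure` are hypothetical names invented for this example.

```typescript
// Hypothetical, simplified shapes based on the fields named in this PR.
interface SpawnErrorPayload {
  status: "error" | "forbidden";
  error: string;
  role?: string; // requested agentId, when one was passed
  childSessionKey?: string; // set by the inner spawn when known
  runId?: string;
}

interface SubagentRunOutcome {
  status: "ok" | "timeout" | "error";
  error?: string;
  startedAt?: number;
  endedAt?: number;
  elapsedMs?: number;
}

// Assumed cutoff for "failed instantly"; not a value taken from the PR.
const INSTANT_FAILURE_MS = 1000;

type Triage = "succeeded" | "failed-instantly" | "failed-after-work" | "unknown-duration";

// Distinguish "timed out after real work" from "failed instantly" using elapsedMs.
function triageOutcome(outcome: SubagentRunOutcome): Triage {
  if (outcome.status === "ok") return "succeeded";
  if (outcome.elapsedMs === undefined) return "unknown-duration";
  return outcome.elapsedMs < INSTANT_FAILURE_MS ? "failed-instantly" : "failed-after-work";
}

// Build a log line that identifies which concurrent child failed.
function describeSpawnFailure(payload: SpawnErrorPayload): string {
  const who = payload.role ?? "unknown role";
  const where = payload.childSessionKey ? ` (session ${payload.childSessionKey})` : "";
  return `${who}${where}: ${payload.status}: ${payload.error}`;
}
```

With these fields a parent can decide whether an immediate retry is worthwhile and can name the failing child in its report, which the bare `{status, error}` payload did not allow.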

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Closes #
  • Related #
  • This PR fixes a bug or regression

Root Cause (if applicable)

  • Root cause: The tool-level error paths in sessions-spawn-tool.ts emitted {status, error} without forwarding the spawn-time context that was already in scope (requestedAgentId). Separately, applySubagentWaitOutcome computed startedAt/endedAt on the wrapper object but never embedded them on the outcome object that downstream consumers read.
  • Missing detection / guardrail: No test asserted that error payloads carry enough identity to distinguish concurrent subagent failures; the outcome type itself had no slot for timing, so the gap was invisible at the type level.
  • Contributing context (if known): The inner spawn (subagent-spawn.ts) already attaches childSessionKey/runId on most error branches, which masked the missing role plumbing at the tool layer.
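
The timing half of the root cause can be sketched roughly as follows. This is not the actual `applySubagentWaitOutcome` code: `WaitResult`, `buildTiming`, and `sketchApplyWaitOutcome` are hypothetical names, and populating each timestamp independently is an assumption. The `Math.max(0, …)` clamp matches what the automated review describes.

```typescript
// Hypothetical, simplified shapes; the real types carry more fields.
interface Timing {
  startedAt?: number;
  endedAt?: number;
  elapsedMs?: number;
}

interface SubagentRunOutcome extends Timing {
  status: "ok" | "timeout" | "error";
  error?: string;
}

interface WaitResult {
  status: "ok" | "timeout" | "error";
  error?: string;
  startedAt?: number;
  endedAt?: number;
}

// Collect only the timing fields that are actually known.
function buildTiming(startedAt?: number, endedAt?: number): Timing {
  const timing: Timing = {};
  if (typeof startedAt === "number") timing.startedAt = startedAt;
  if (typeof endedAt === "number") timing.endedAt = endedAt;
  if (typeof startedAt === "number" && typeof endedAt === "number") {
    // Clamp so that clock skew never produces a negative duration.
    timing.elapsedMs = Math.max(0, endedAt - startedAt);
  }
  return timing;
}

// Embed timing on the outcome object itself, which downstream consumers read.
function sketchApplyWaitOutcome(wait: WaitResult): SubagentRunOutcome {
  const outcome: SubagentRunOutcome = {
    status: wait.status,
    ...buildTiming(wait.startedAt, wait.endedAt),
  };
  if (wait.error !== undefined) outcome.error = wait.error;
  return outcome;
}
```

When neither timestamp is known, the spread contributes nothing and the outcome keeps its previous shape.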

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/agents/tools/sessions-spawn-tool.test.ts and (follow-up) a focused test for applySubagentWaitOutcome in src/agents/subagent-announce-output.
  • Scenario the test should lock in: on an error/timeout wait outcome, role is present when agentId was requested, and startedAt/endedAt/elapsedMs are present when timestamps are known.
  • Why this is the smallest reliable guardrail: both changes are pure payload-shape additions; a unit-level assertion on the returned object catches regressions without a full spawn runtime.
  • Existing test that already covers this (if any): sessions-spawn-tool.test.ts and subagent-registry.lifecycle-retry-grace.e2e.test.ts exercise the success and retry paths; no existing test asserts the error payload shape added here.
  • If no new test is added, why not: this PR is scoped to plumbing; happy to add a targeted unit test if reviewers want it before merge.

User-visible / Behavior Changes

Additive only. Error payloads emitted by the sessions_spawn tool may now include a role field (attached to error and forbidden results when an agentId was requested), and SubagentRunOutcome values may now include startedAt, endedAt, and elapsedMs. Timing fields are populated on all three outcome branches (ok, timeout, error) when the timestamps are known — not only failures — since duration is also useful for success-case stats. No existing field is removed or renamed; no status codes change.

Diagram (if applicable)

Before (error payload to parent LLM):
{ status: "error", error: "<message>" }            # ACP early validation
{ status: "error", error, childSessionKey, runId } # inner spawn failure

After:
{ status: "error", error, role? }                                    # early validation
{ status: "error", error, childSessionKey, runId, role? }            # register/forwarded

Before (SubagentRunOutcome on timeout):
{ status: "timeout" }

After:
{ status: "timeout", startedAt?, endedAt?, elapsedMs? }

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No (payload shape only)
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A. The added role field is the same agentId the caller passed in as input; no new data crosses a trust boundary.

Repro + Verification

Environment

  • OS: macOS (Darwin 25.4.0)
  • Runtime/container: Node 22+ local dev
  • Model/provider: N/A (type/structure change, model-agnostic)
  • Integration/channel (if any): N/A
  • Relevant config (redacted): N/A

Steps

  1. Check out fix/subagent-error-context.
  2. Run pnpm tsgo:core and pnpm tsgo:core:test.
  3. Run pnpm test src/agents/tools/sessions-spawn-tool.test.ts src/agents/subagent-registry.lifecycle-retry-grace.e2e.test.ts src/agents/subagent-announce.timeout.test.ts src/agents/subagent-announce.test.ts src/agents/subagent-announce.capture-completion-reply.test.ts.

Expected

  • All scoped typechecks and scoped tests green.

Actual

  • All green: 19 tests across sessions-spawn-tool + lifecycle retry grace, plus 24 tests across the announce family (43 total).

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Relevant scoped-test output (trimmed):

Test Files  1 passed (1)
Tests  12 passed (12)          # sessions-spawn-tool
Test Files  1 passed (1)
Tests  7 passed (7)            # subagent-registry lifecycle retry grace
Test Files  3 passed (3)
Tests  24 passed (24)          # subagent-announce family

Repo-wide pnpm check was red on latest origin/main before this branch existed, entirely in extensions/discord/** and extensions/qa-lab/** (missing @buape/carbon exports and @copilotkit/aimock module). Not in scope here.

Human Verification (required)

  • Verified scenarios: scoped typecheck (tsgo:core, tsgo:core:test) clean on the two touched files; targeted unit + e2e tests listed above all green against the edits.
  • Edge cases checked: requestedAgentId undefined → roleContext spread is a no-op; startedAt/endedAt missing → timing object is empty and not spread onto outcome; result.status !== "error" in the forwarded jsonResult path → original result is returned unmodified.
  • What you did NOT verify: live end-to-end run of a real subagent failure against the gateway; CI on the unrelated extensions/discord + extensions/qa-lab failures that are already red on main.
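
The first edge case above (an undefined `requestedAgentId` makes the `roleContext` spread a no-op) can be sketched like this. Only `role`, `roleContext`, and `requestedAgentId` come from the PR text; `ToolErrorResult`, `roleContextFor`, and `earlyValidationError` are hypothetical names for illustration.

```typescript
// Hypothetical result shape for an early validation failure.
interface ToolErrorResult {
  status: "error";
  error: string;
  role?: string;
}

// Returns an empty object when no agentId was requested, so spreading it adds no key.
// Assumption: an empty-string agentId is treated the same as a missing one.
function roleContextFor(requestedAgentId?: string): { role?: string } {
  return requestedAgentId ? { role: requestedAgentId } : {};
}

function earlyValidationError(message: string, requestedAgentId?: string): ToolErrorResult {
  return { status: "error", error: message, ...roleContextFor(requestedAgentId) };
}
```

Because the fallback is an empty object rather than `{ role: undefined }`, the serialized payload has no `role` key at all when no agentId was requested.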

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A. All added fields are optional.

Risks and Mitigations

  • Risk: Consumers that serialize SubagentRunOutcome and strictly reject unknown keys could fail on the new timing fields.
    • Mitigation: Fields are optional and only emitted when timestamps are known; no existing persistence/serialization path examined rejects unknown keys. Happy to gate behind a feature flag if reviewers prefer.
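
For illustration only, this is the kind of strict consumer the risk describes. No such validator is confirmed to exist in the repo; `KNOWN_OUTCOME_KEYS`, `unknownOutcomeKeys`, and `parseOutcomeStrict` are hypothetical names.

```typescript
// Hypothetical allowlist; a strict consumer would need the three timing keys added to it.
const KNOWN_OUTCOME_KEYS = new Set(["status", "error", "startedAt", "endedAt", "elapsedMs"]);

// List any keys the consumer does not recognize.
function unknownOutcomeKeys(value: Record<string, unknown>): string[] {
  return Object.keys(value).filter((key) => !KNOWN_OUTCOME_KEYS.has(key));
}

// Parse a serialized outcome and reject anything carrying unrecognized keys.
function parseOutcomeStrict(json: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(json);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("outcome must be a JSON object");
  }
  const record = parsed as Record<string, unknown>;
  const extras = unknownOutcomeKeys(record);
  if (extras.length > 0) {
    throw new Error(`unknown outcome keys: ${extras.join(", ")}`);
  }
  return record;
}
```

A consumer written this way before the change would have rejected the timing fields until its allowlist was updated; a consumer that ignores unknown keys is unaffected.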

🤖 Generated with Claude Code

@openclaw-barnacle openclaw-barnacle Bot added agents Agent runtime and tooling size: XS labels Apr 18, 2026
@BKF-Gitty BKF-Gitty marked this pull request as ready for review April 18, 2026 23:54
@greptile-apps
Contributor

greptile-apps Bot commented Apr 18, 2026

Greptile Summary

This PR enriches error payloads from sessions_spawn with the requested agentId as a role field, and extends SubagentRunOutcome with optional startedAt/endedAt/elapsedMs timing fields populated by applySubagentWaitOutcome. All changes are strictly additive — no fields removed, no control-flow altered — and the implementation is clean and well-guarded (empty-object spread when requestedAgentId is absent, Math.max(0, …) on elapsed time).

Confidence Score: 5/5

Safe to merge; changes are purely additive with no control-flow modifications.

Both findings are P2 style observations. Neither represents a defect or behavioral regression. The implementation is correct and well-guarded.

No files require special attention.

Prompt To Fix All With AI
This is a comment left during a code review.
Path: src/agents/subagent-announce-output.ts
Line: 324-325

Comment:
**Timing also applied to "ok" outcomes (undocumented scope)**

The PR description and root-cause section say timing fields are added to "timeout and error outcomes," but the implementation spreads `timing` into all three branches, including `status: "ok"`. This is strictly additive and arguably more useful, but it's worth noting the actual scope is broader than the description to avoid surprises for callers that pattern-match on status to decide whether to inspect timing fields.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: src/agents/tools/sessions-spawn-tool.ts
Line: 316-320

Comment:
**`role` added to any non-`"accepted"` status, not just errors**

`result.status !== "accepted"` also covers any other non-error terminal statuses (`"queued"`, `"pending"`, etc.) that `spawnAcpDirect` might return now or in the future. If the intent is specifically to enrich failure payloads, a tighter condition (`result.status === "error"`) would be more explicit and resilient to new status codes being introduced upstream. The same pattern appears on the subagent path at line 358.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/subagent-er..."

Comment thread src/agents/subagent-announce-output.ts Outdated
Comment thread src/agents/tools/sessions-spawn-tool.ts Outdated
@BKF-Gitty
Contributor Author

Thanks @greptile — addressed both:

Re: timing on ok outcomes (subagent-announce-output.ts:324-325) — the scope widening is intentional and additive. Timing on ok is useful for success-case stats/duration reporting (and already consumed upstream via startedAt/endedAt for the parent-visible stats line). No behavior change for callers that only inspect status; strictly more data otherwise. Updating the PR body to call this out rather than narrowing the code.

Re: status !== "accepted" (sessions-spawn-tool.ts:316-320 and :358): good point. Switched both forward sites to an explicit enumeration of error | forbidden in 3b471a0 so new upstream statuses must opt in deliberately rather than being silently enrolled. This also still covers the forbidden paths Codex flagged earlier (ACP disabled, runtime policy, allowlist denial, depth/children limits, requireAgentId, sandbox policy).

@greptile-apps
Contributor

greptile-apps Bot commented Apr 19, 2026

Both look good.

The explicit status === "error" || status === "forbidden" guard is confirmed in the updated file at both the ACP and subagent forward sites — new upstream statuses now have to opt in deliberately. That's the right call.
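
A minimal sketch of the guard discussed here, assuming a simplified result shape. `SpawnResult`, `isFailureStatus`, and `withRoleOnFailure` are hypothetical names; the merged code in sessions-spawn-tool.ts may be structured differently.

```typescript
// Hypothetical forwarded-result shape; status is a plain string so future statuses type-check.
interface SpawnResult {
  status: string;
  error?: string;
  childSessionKey?: string;
  runId?: string;
  role?: string;
}

// Explicit enumeration: only these statuses are enriched with role.
function isFailureStatus(status: string): status is "error" | "forbidden" {
  return status === "error" || status === "forbidden";
}

// Attach role to failure results; return every other result unmodified.
function withRoleOnFailure(result: SpawnResult, requestedAgentId?: string): SpawnResult {
  if (!requestedAgentId || !isFailureStatus(result.status)) return result;
  return { ...result, role: requestedAgentId };
}
```

Under this guard a hypothetical new status such as "queued" passes through untouched, whereas the earlier `status !== "accepted"` check would have attached `role` to it.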

On timing for ok outcomes: agreed, additive-only and useful for success-path duration stats. No objection to leaving it as-is with the PR body updated for clarity.

Changes LGTM.


@gumadeiras gumadeiras self-assigned this Apr 19, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92c4c5dc0c


Comment thread src/agents/subagent-registry-completion.ts
BKF-Gitty and others added 8 commits April 19, 2026 18:23
Error payloads from sessions_spawn and subagent wait outcomes now carry the
context a parent needs to retry or report clearly:

- sessions-spawn-tool: add role (requested agentId) to early validation
  errors, to the ACP register-failure payload, and to forwarded error
  results from both the ACP and subagent spawn paths. childSessionKey and
  runId are already populated by the inner spawn for the errors that know
  them; this just plumbs role through alongside.
- subagent-announce-output: extend SubagentRunOutcome with optional
  startedAt/endedAt/elapsedMs and populate them in applySubagentWaitOutcome
  so timeout and error outcomes convey how long the child ran before
  failing.

Scoped verification: tsgo:core, tsgo:core:test, and 43 targeted tests in
src/agents (sessions-spawn-tool, subagent-registry lifecycle retry grace,
subagent-announce timeout, subagent-announce, and capture-completion-reply)
all green. Repo-wide pnpm check is red on latest origin/main for unrelated
extensions/discord and extensions/qa-lab surfaces (missing @buape/carbon
and @copilotkit/aimock members); not addressed here.

Codex review on PR 68726 noted the role context was only merged when
status === "error", but both ACP and subagent spawns return
status: "forbidden" for real failures the parent most needs to identify
in parallel spawns: ACP disabled, runtime policy, agent allowlist denial,
depth limit, max children, requireAgentId, and sandbox policy. Switch both
forwarded result sites to "status !== accepted" so forbidden also carries
the requested role.

…ing accepted

Greptile review on PR 68726 flagged that result.status !== "accepted" would
silently enroll any future non-accepted status (e.g. queued/pending) into the
role-augmentation path. Switch both the ACP and subagent forward sites to an
explicit enumeration of (error | forbidden) so new upstream statuses must
opt in deliberately. Behavior unchanged for today's closed union.
@gumadeiras gumadeiras force-pushed the fix/subagent-error-context branch from 3dcfc05 to 55c7561 Compare April 19, 2026 22:31
@gumadeiras gumadeiras merged commit 4277078 into openclaw:main Apr 19, 2026
10 checks passed
@gumadeiras
Member

Merged via squash.

Thanks @BKF-Gitty!

omarshahine added a commit that referenced this pull request Apr 20, 2026
Aligns the steer-restart killed-run test with the timing fields added to
subagent run outcomes in #68726. The production code now returns
startedAt/endedAt/elapsedMs alongside status and error on the error
outcome, but this test's toEqual still asserted only status+error, so it
has been failing on main since #68726 landed. Uses the same expect.any(Number)
matcher already in use a few lines below for the ended hook payload.
omarshahine added a commit that referenced this pull request Apr 20, 2026
…and #65986 (#69173)

* test(agents): expect timing fields in killed-run outcome

Aligns the steer-restart killed-run test with the timing fields added to
subagent run outcomes in #68726. The production code now returns
startedAt/endedAt/elapsedMs alongside status and error on the error
outcome, but this test's toEqual still asserted only status+error, so it
has been failing on main since #68726 landed. Uses the same expect.any(Number)
matcher already in use a few lines below for the ended hook payload.

* test(gateway): register ops agent in sessions.create task-start test

The "sessions.create can start the first agent turn from an initial task"
test triggers the auto chat.send path by passing `task:`. After #65986
added a deleted-agent guard to chat.send, an unregistered `ops` agent
triggers the reject path and the auto-started run never happens, so
runStarted comes back false.

Register `ops` via testState.agentsConfig (matching the pattern already
used by other ops-agent tests in this file) so the guard lets chat.send
through and the first turn starts as expected.

---------

Co-authored-by: Omar Shahine <10343873+omarshahine@users.noreply.github.com>

Labels

agents Agent runtime and tooling size: M


2 participants