fix(subagent): include role, session key, and timing in error payloads by BKF-Gitty · Pull Request #68726 · openclaw/openclaw

BKF-Gitty · 2026-04-18T23:45:47Z

Error payloads from sessions_spawn and subagent wait outcomes now carry the context a parent needs to retry or report clearly:

- sessions-spawn-tool: add role (the requested agentId) to early validation errors, to the ACP register-failure payload, and to forwarded error results from both the ACP and subagent spawn paths. childSessionKey and runId are already populated by the inner spawn for the errors that know them; this change just plumbs role through alongside them.
- subagent-announce-output: extend SubagentRunOutcome with optional startedAt/endedAt/elapsedMs and populate them in applySubagentWaitOutcome whenever the timestamps are known, so timeout and error outcomes convey how long the child ran before failing (ok outcomes carry the same fields).
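A minimal sketch of the role plumbing, assuming a hypothetical `roleContext` helper and a simplified `SpawnErrorPayload` type (neither name is taken from sessions-spawn-tool.ts). The empty-object spread means an undefined requestedAgentId adds no key at all, rather than `role: undefined`.

```typescript
// Hypothetical, simplified payload shape; the real tool result type may differ.
type SpawnErrorPayload = {
  status: "error";
  error: string;
  childSessionKey?: string;
  runId?: string;
  role?: string;
};

// Hypothetical helper: returns { role } only when an agentId was requested,
// otherwise an empty object, so spreading it never adds a `role: undefined` key.
function roleContext(requestedAgentId?: string): { role?: string } {
  return requestedAgentId ? { role: requestedAgentId } : {};
}

// Early validation failure: no child session exists yet, so only role is added.
// The error message used in the usage note is illustrative.
function validationError(message: string, requestedAgentId?: string): SpawnErrorPayload {
  return { status: "error", error: message, ...roleContext(requestedAgentId) };
}
```

For example, `validationError("task is required", "researcher")` carries `role: "researcher"`, while `validationError("task is required")` has no role key.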

Scoped verification: tsgo:core, tsgo:core:test, and 43 targeted tests in src/agents (sessions-spawn-tool, subagent-registry lifecycle retry grace, subagent-announce timeout, subagent-announce, and capture-completion-reply) all green. Repo-wide pnpm check is red on latest origin/main for unrelated extensions/discord and extensions/qa-lab surfaces (missing @buape/carbon and @copilotkit/aimock members); not addressed here.

Summary

Problem: When a subagent fails or times out, the payload returned to the parent LLM was {status: "error", error: "<message>"} with no subagent identifier and no duration. The parent cannot re-target the specific child session, correlate failures across concurrent subagents, or distinguish "timed out after real work" from "failed instantly."
Why it matters: Parent agents cannot retry intelligently or report failures clearly without that context. It also hides dead subagents behind generic error strings during post-hoc log review.
What changed: Additive context on error payloads from sessions_spawn plus timing fields on SubagentRunOutcome so timeout/error outcomes surface how long the child ran. No control-flow changes, no type removals.
What did NOT change (scope boundary): No changes to subagent-spawn.ts, run-wait.ts, the gateway protocol, or any liveness / heartbeat mechanism. The known gap where a silently-dead subagent isn't detected until the 120s timeout fires is unchanged.
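To make the problem concrete, here is a hypothetical parent-side sketch of what the added fields enable: attributing a failure to a specific child by role and childSessionKey, and separating instant failures from timeouts after real work. The `ChildFailure` type merges fields that in practice arrive through different paths, and `describeFailure` and the 1000 ms threshold are illustrative assumptions, not part of this PR.

```typescript
// Illustrative shape combining the enriched spawn error payload and wait-outcome
// timing; in the real system these arrive through different code paths.
type ChildFailure = {
  status: "error" | "forbidden" | "timeout";
  error?: string;
  role?: string;
  childSessionKey?: string;
  elapsedMs?: number;
};

// Hypothetical threshold: failures shorter than this count as "instant".
const INSTANT_FAILURE_MS = 1_000;

function describeFailure(f: ChildFailure): string {
  const who = f.role ?? "unknown role";
  const where = f.childSessionKey ? ` (session ${f.childSessionKey})` : "";
  // Without timing (the pre-PR situation) the parent cannot tell the cases apart.
  if (f.elapsedMs === undefined) {
    return `${who}${where}: ${f.status}, duration unknown`;
  }
  const kind =
    f.status === "timeout"
      ? `timed out after ${f.elapsedMs} ms`
      : f.elapsedMs < INSTANT_FAILURE_MS
        ? `failed instantly (${f.elapsedMs} ms)`
        : `failed after ${f.elapsedMs} ms of work`;
  return `${who}${where}: ${kind}`;
}
```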

Change Type (select all)

Scope (select all touched areas)

Linked Issue/PR

Closes #
Related #
This PR fixes a bug or regression

Root Cause (if applicable)

Root cause: The tool-level error paths in sessions-spawn-tool.ts emitted {status, error} without forwarding the spawn-time context that was already in scope (requestedAgentId). Separately, applySubagentWaitOutcome computed startedAt/endedAt on the wrapper object but never embedded them on the outcome object that downstream consumers read.
Missing detection / guardrail: No test asserted that error payloads carry enough identity to distinguish concurrent subagent failures; the outcome type itself had no slot for timing, so the gap was invisible at the type level.
Contributing context (if known): The inner spawn (subagent-spawn.ts) already attaches childSessionKey/runId on most error branches, which masked the missing role plumbing at the tool layer.
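A minimal sketch of the timing fix described in the root cause, assuming simplified types and hypothetical `buildTiming`/`withTiming` helpers (the real `applySubagentWaitOutcome` signature is not shown here). Timing is built as a separate object and spread onto the outcome itself, with elapsed time clamped at zero as the Greptile summary notes. Deriving elapsedMs only when both timestamps exist is an assumption of this sketch.

```typescript
// Simplified outcome type for illustration; the real SubagentRunOutcome may carry more fields.
type SubagentRunOutcome = {
  status: "ok" | "timeout" | "error";
  error?: string;
  startedAt?: number; // epoch ms
  endedAt?: number; // epoch ms
  elapsedMs?: number;
};

type Timing = Pick<SubagentRunOutcome, "startedAt" | "endedAt" | "elapsedMs">;

// Hypothetical helper: include only the timestamps that are known, and derive
// elapsedMs only when both ends are present. Math.max(0, ...) prevents a
// negative duration if the timestamps are out of order.
function buildTiming(startedAt?: number, endedAt?: number): Timing {
  const timing: Timing = {};
  if (typeof startedAt === "number") timing.startedAt = startedAt;
  if (typeof endedAt === "number") timing.endedAt = endedAt;
  if (typeof startedAt === "number" && typeof endedAt === "number") {
    timing.elapsedMs = Math.max(0, endedAt - startedAt);
  }
  return timing;
}

// Embed timing on the outcome object that downstream consumers read, rather than
// only on a wrapper. With no timestamps the spread is a no-op.
function withTiming(
  outcome: SubagentRunOutcome,
  startedAt?: number,
  endedAt?: number,
): SubagentRunOutcome {
  return { ...outcome, ...buildTiming(startedAt, endedAt) };
}
```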

Regression Test Plan (if applicable)

Coverage level that should have caught this:
- Unit test
- Seam / integration test
- End-to-end test
- Existing coverage already sufficient
Target test or file: src/agents/tools/sessions-spawn-tool.test.ts and (follow-up) a focused test for applySubagentWaitOutcome in src/agents/subagent-announce-output.
Scenario the test should lock in: on an error or forbidden sessions_spawn result, role is present when an agentId was requested; on a timeout or error wait outcome, startedAt/endedAt/elapsedMs are present when the timestamps are known.
Why this is the smallest reliable guardrail: both changes are pure payload-shape additions; a unit-level assertion on the returned object catches regressions without a full spawn runtime.
Existing test that already covers this (if any): sessions-spawn-tool.test.ts and subagent-registry.lifecycle-retry-grace.e2e.test.ts exercise the success and retry paths; no existing test asserts the error payload shape added here.
If no new test is added, why not: this PR is scoped to plumbing; happy to add a targeted unit test if reviewers want it before merge.
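If the follow-up test is added, its core could be as small as the framework-free sketch below, which targets the gap named in the root cause: concurrent failures with identical error text must still be distinguishable. `failSpawn`, `check`, and the error message are hypothetical stand-ins; the repo's own tests use an expect-style API, where `check` would become a matcher.

```typescript
// Hypothetical stand-in for the tool's failure path; the real result type may differ.
type SpawnResult = {
  status: "accepted" | "forbidden" | "error";
  error?: string;
  role?: string;
};

function failSpawn(status: "forbidden" | "error", error: string, requestedAgentId?: string): SpawnResult {
  return { status, error, ...(requestedAgentId ? { role: requestedAgentId } : {}) };
}

// Minimal assertion helper so the sketch runs without a test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(`assertion failed: ${message}`);
}

// Two concurrent spawns hit the same limit and get identical error strings.
const results = ["researcher", "coder"].map((agentId) =>
  failSpawn("forbidden", "max children reached", agentId),
);

check(results.every((r) => r.error === "max children reached"), "errors are identical");
check(new Set(results.map((r) => r.role)).size === 2, "roles distinguish the failures");
check(!("role" in failSpawn("error", "boom")), "no role key when agentId was not requested");
```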

User-visible / Behavior Changes

Additive only. Error payloads emitted by the sessions_spawn tool may now include a role field (attached to error and forbidden results when an agentId was requested), and SubagentRunOutcome values may now include startedAt, endedAt, and elapsedMs. Timing fields are populated on all three outcome branches (ok, timeout, error) when the timestamps are known — not only failures — since duration is also useful for success-case stats. No existing field is removed or renamed; no status codes change.
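Why "additive only" holds at the type level can be sketched with simplified before/after shapes (both types and `summarize` are illustrative, not the repo's definitions): new outcomes still flow into consumers written against the old shape, and old-style literals still satisfy the new type because every added field is optional.

```typescript
// Shape before this PR (simplified for illustration).
type OutcomeBefore = { status: "ok" | "timeout" | "error"; error?: string };

// Shape after this PR: the same fields plus optional timing.
type OutcomeAfter = OutcomeBefore & {
  startedAt?: number;
  endedAt?: number;
  elapsedMs?: number;
};

// An existing (hypothetical) consumer written against the old shape.
function summarize(outcome: OutcomeBefore): string {
  return outcome.error ? `${outcome.status}: ${outcome.error}` : outcome.status;
}

// Enriched values still type-check as the old shape; the extra fields are ignored.
const enriched: OutcomeAfter = { status: "ok", startedAt: 1_000, endedAt: 2_500, elapsedMs: 1_500 };
const legacyView = summarize(enriched);

// Old-style literals still satisfy the new type.
const legacyLiteral: OutcomeAfter = { status: "timeout" };
```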

Diagram (if applicable)

Before (error payload to parent LLM):
{ status: "error", error: "<message>" }            # ACP early validation
{ status: "error", error, childSessionKey, runId } # inner spawn failure

After:
{ status: "error", error, role? }                                    # early validation
{ status: "error", error, childSessionKey, runId, role? }            # register/forwarded

Before (SubagentRunOutcome on timeout):
{ status: "timeout" }

After:
{ status: "timeout", startedAt?, endedAt?, elapsedMs? }
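Because timing is populated on all three branches (ok, timeout, error) when the timestamps are known, a consumer should read it whenever present instead of pattern-matching on failure statuses first. A hypothetical stats-line formatter under that assumption (`formatStatsLine` and its output format are illustrative):

```typescript
type Outcome = {
  status: "ok" | "timeout" | "error";
  error?: string;
  startedAt?: number;
  endedAt?: number;
  elapsedMs?: number;
};

// Reads elapsedMs whenever it is present, regardless of status, and degrades
// to the bare status when timing is unknown.
function formatStatsLine(outcome: Outcome): string {
  const duration =
    outcome.elapsedMs !== undefined ? ` in ${(outcome.elapsedMs / 1000).toFixed(1)}s` : "";
  return `${outcome.status}${duration}`;
}
```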

Security Impact (required)

New permissions/capabilities? No
Secrets/tokens handling changed? No
New/changed network calls? No
Command/tool execution surface changed? No (payload shape only)
Data access scope changed? No
If any Yes, explain risk + mitigation: N/A. The added role field is the same agentId the caller passed in as input; no new data crosses a trust boundary.

Repro + Verification

Environment

OS: macOS 25.4.0 (Darwin)
Runtime/container: Node 22+ local dev
Model/provider: N/A (type/structure change, model-agnostic)
Integration/channel (if any): N/A
Relevant config (redacted): N/A

Steps

Check out fix/subagent-error-context.
Run pnpm tsgo:core and pnpm tsgo:core:test.
Run pnpm test src/agents/tools/sessions-spawn-tool.test.ts src/agents/subagent-registry.lifecycle-retry-grace.e2e.test.ts src/agents/subagent-announce.timeout.test.ts src/agents/subagent-announce.test.ts src/agents/subagent-announce.capture-completion-reply.test.ts.

Expected

All scoped typechecks and scoped tests green.

Actual

All green: 19 tests across sessions-spawn-tool + lifecycle retry grace, plus 24 tests across the announce family (43 total).

Evidence

Failing test/log before + passing after
Trace/log snippets
Screenshot/recording
Perf numbers (if relevant)

Relevant scoped-test output (trimmed):

Test Files  1 passed (1)
Tests  12 passed (12)          # sessions-spawn-tool
Test Files  1 passed (1)
Tests  7 passed (7)            # subagent-registry lifecycle retry grace
Test Files  3 passed (3)
Tests  24 passed (24)          # subagent-announce family

Repo-wide pnpm check was red on latest origin/main before this branch existed, entirely in extensions/discord/** and extensions/qa-lab/** (missing @buape/carbon exports and @copilotkit/aimock module). Not in scope here.

Human Verification (required)

Verified scenarios: scoped typecheck (tsgo:core, tsgo:core:test) clean on the two touched files; targeted unit + e2e tests listed above all green against the edits.
Edge cases checked: requestedAgentId undefined → roleContext spread is a no-op; startedAt/endedAt missing → timing object is empty and not spread onto outcome; result.status neither "error" nor "forbidden" in the forwarded jsonResult path → original result is returned unmodified.
What you did NOT verify: live end-to-end run of a real subagent failure against the gateway; CI on the unrelated extensions/discord + extensions/qa-lab failures that are already red on main.

Review Conversations

I replied to or resolved every bot review conversation I addressed in this PR.
I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? No
Migration needed? No
If yes, exact upgrade steps: N/A. All added fields are optional.

Risks and Mitigations

Risk: Consumers that serialize SubagentRunOutcome and strictly reject unknown keys could fail on the new timing fields.
- Mitigation: Fields are optional and only emitted when timestamps are known; no existing persistence/serialization path examined rejects unknown keys. Happy to gate behind a feature flag if reviewers prefer.
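The risk is easiest to see with a hypothetical strict validator that rejects unknown keys next to a lenient one. Neither function is from the codebase, and the PR states that no examined serialization path behaves like `parseStrict`.

```typescript
type OutcomeStatus = "ok" | "timeout" | "error";
type LegacyOutcome = { status: OutcomeStatus; error?: string };

// Keys a consumer written before this PR would have known about (illustrative).
const KNOWN_KEYS = new Set(["status", "error"]);

// Strict: any key outside the known set is treated as a schema violation,
// so the new timing fields would be rejected.
function parseStrict(json: string): LegacyOutcome {
  const value = JSON.parse(json) as Record<string, unknown>;
  const extraKeys = Object.keys(value).filter((k) => !KNOWN_KEYS.has(k));
  if (extraKeys.length > 0) throw new Error(`unknown keys: ${extraKeys.join(", ")}`);
  return value as LegacyOutcome;
}

// Lenient: picks the fields it needs and ignores the rest, so optional
// additions like startedAt/endedAt/elapsedMs pass through harmlessly.
function parseLenient(json: string): LegacyOutcome {
  const value = JSON.parse(json) as Record<string, unknown>;
  const rawError = value.error;
  const error = typeof rawError === "string" ? rawError : undefined;
  return { status: value.status as OutcomeStatus, ...(error !== undefined ? { error } : {}) };
}

const serialized = JSON.stringify({
  status: "timeout",
  startedAt: 1_000,
  endedAt: 121_000,
  elapsedMs: 120_000,
});
```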

🤖 Generated with Claude Code

greptile-apps · 2026-04-18T23:56:53Z

Greptile Summary

This PR enriches error payloads from sessions_spawn with the requested agentId as a role field, and extends SubagentRunOutcome with optional startedAt/endedAt/elapsedMs timing fields populated by applySubagentWaitOutcome. All changes are strictly additive — no fields removed, no control-flow altered — and the implementation is clean and well-guarded (empty-object spread when requestedAgentId is absent, Math.max(0, …) on elapsed time).

Confidence Score: 5/5

Safe to merge; changes are purely additive with no control-flow modifications.

Both findings are P2 style observations. Neither represents a defect or behavioral regression. The implementation is correct and well-guarded.

No files require special attention.

Path: src/agents/subagent-announce-output.ts
Line: 324-325

Comment:
**Timing also applied to "ok" outcomes (undocumented scope)**

The PR description and root-cause section say timing fields are added to "timeout and error outcomes," but the implementation spreads `timing` into all three branches, including `status: "ok"`. This is strictly additive and arguably more useful, but it's worth noting the actual scope is broader than the description to avoid surprises for callers that pattern-match on status to decide whether to inspect timing fields.


---

Path: src/agents/tools/sessions-spawn-tool.ts
Line: 316-320

Comment:
**`role` added to any non-`"accepted"` status, not just errors**

`result.status !== "accepted"` also covers any other non-error terminal statuses (`"queued"`, `"pending"`, etc.) that `spawnAcpDirect` might return now or in the future. If the intent is specifically to enrich failure payloads, a tighter condition (`result.status === "error"`) would be more explicit and resilient to new status codes being introduced upstream. The same pattern appears on the subagent path at line 358.
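The difference between the two guards can be sketched with a simplified status union and a hypothetical `forwardWithRole` helper. The explicit check mirrors the `status === "error" || status === "forbidden"` guard that the follow-up commit adopted; everything else here is illustrative.

```typescript
// Simplified status union for illustration; the real result types may differ.
type SpawnStatus = "accepted" | "forbidden" | "error";
type SpawnResult = {
  status: SpawnStatus;
  error?: string;
  childSessionKey?: string;
  role?: string;
};

// A loose guard such as `status !== "accepted"` would also enroll any status
// added later (e.g. a hypothetical "queued"). This explicit check only enrolls
// statuses known to be failures, so new statuses must opt in deliberately.
function isFailure(status: SpawnStatus): boolean {
  return status === "error" || status === "forbidden";
}

// Hypothetical forwarding helper: enrich failures with role, and return every
// other result (or any result without a requested agentId) unmodified.
function forwardWithRole(result: SpawnResult, requestedAgentId?: string): SpawnResult {
  if (!isFailure(result.status) || !requestedAgentId) return result;
  return { ...result, role: requestedAgentId };
}
```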


Reviews (1): Last reviewed commit: "Merge branch 'main' into fix/subagent-er..."

BKF-Gitty · 2026-04-19T00:02:07Z

Thanks @greptile — addressed both:

Re: timing on ok outcomes (subagent-announce-output.ts:324-325) — the scope widening is intentional and additive. Timing on ok is useful for success-case stats/duration reporting (and already consumed upstream via startedAt/endedAt for the parent-visible stats line). No behavior change for callers that only inspect status; strictly more data otherwise. Updating the PR body to call this out rather than narrowing the code.

Re: status !== "accepted" (sessions-spawn-tool.ts:316-320 and :358): good point. Switched both forward sites to an explicit enumeration of error | forbidden in 3b471a0 so new upstream statuses must opt in deliberately rather than being silently enrolled. This also still covers the forbidden paths Codex flagged earlier (ACP disabled, runtime policy, allowlist denial, depth/children limits, requireAgentId, sandbox policy).

greptile-apps · 2026-04-19T00:02:41Z

Both look good.

The explicit status === "error" || status === "forbidden" guard is confirmed in the updated file at both the ACP and subagent forward sites — new upstream statuses now have to opt in deliberately. That's the right call.

On timing for ok outcomes: agreed, additive-only and useful for success-path duration stats. No objection to leaving it as-is with the PR body updated for clarity.

Changes LGTM.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 92c4c5dc0c



Codex review on PR 68726 noted the role context was only merged when status === "error", but both ACP and subagent spawns return status: "forbidden" for real failures the parent most needs to identify in parallel spawns: ACP disabled, runtime policy, agent allowlist denial, depth limit, max children, requireAgentId, and sandbox policy. Switch both forwarded result sites to "status !== accepted" so forbidden also carries the requested role.

…ing accepted

Greptile review on PR 68726 flagged that result.status !== "accepted" would silently enroll any future non-accepted status (e.g. queued/pending) into the role-augmentation path. Switch both the ACP and subagent forward sites to an explicit enumeration of (error | forbidden) so new upstream statuses must opt in deliberately. Behavior unchanged for today's closed union.

gumadeiras · 2026-04-19T22:31:35Z

Merged via squash.

Prepared head SHA: 55c756142f29ec707f80e186d2f6b98f9b8a4665
Merge commit: 4277078bc515914c48190721897a8fba92ed37c7

Thanks @BKF-Gitty!


…and #65986 (#69173)

* test(agents): expect timing fields in killed-run outcome

  Aligns the steer-restart killed-run test with the timing fields added to subagent run outcomes in #68726. The production code now returns startedAt/endedAt/elapsedMs alongside status and error on the error outcome, but this test's toEqual still asserted only status+error, so it has been failing on main since #68726 landed. Uses the same expect.any(Number) matcher already in use a few lines below for the ended hook payload.

* test(gateway): register ops agent in sessions.create task-start test

  The "sessions.create can start the first agent turn from an initial task" test triggers the auto chat.send path by passing `task:`. After #65986 added a deleted-agent guard to chat.send, an unregistered `ops` agent triggers the reject path and the auto-started run never happens, so runStarted comes back false. Register `ops` via testState.agentsConfig (matching the pattern already used by other ops-agent tests in this file) so the guard lets chat.send through and the first turn starts as expected.

---------

Co-authored-by: Omar Shahine <10343873+omarshahine@users.noreply.github.com>
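In framework-free terms, the updated expectation checks status and error exactly while only requiring the timing fields to be numbers, which is what `expect.any(Number)` expresses inside `toEqual`. This sketch uses a hypothetical `matchesKilledRunOutcome` helper and an illustrative error string; the actual test keeps using `toEqual`.

```typescript
type Outcome = {
  status: string;
  error?: string;
  startedAt?: number;
  endedAt?: number;
  elapsedMs?: number;
};

// Exact match on the fields the old assertion covered; type-only match on the
// timing fields, whose concrete values depend on the clock during the test.
function matchesKilledRunOutcome(actual: Outcome, expectedError: string): boolean {
  return (
    actual.status === "error" &&
    actual.error === expectedError &&
    typeof actual.startedAt === "number" &&
    typeof actual.endedAt === "number" &&
    typeof actual.elapsedMs === "number"
  );
}
```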

openclaw-barnacle (bot) added the agents (Agent runtime and tooling) and size: XS labels on Apr 18, 2026

BKF-Gitty marked this pull request as ready for review April 18, 2026 23:54

greptile-apps Bot reviewed Apr 18, 2026


Comment thread src/agents/subagent-announce-output.ts Outdated

Comment thread src/agents/tools/sessions-spawn-tool.ts Outdated

gumadeiras self-assigned this Apr 19, 2026

openclaw-barnacle (bot) added the size: M label and removed the size: XS label on Apr 19, 2026

chatgpt-codex-connector Bot reviewed Apr 19, 2026


Comment thread src/agents/subagent-registry-completion.ts

BKF-Gitty and others added 8 commits April 19, 2026 18:23

fix: preserve subagent failure timing context

98e00b1

refactor: simplify subagent failure context helpers

bd20428

fix: stabilize subagent outcome timing equality

40c71bd

refactor: name subagent outcome update decision

3abc92d

fix: enrich direct subagent terminal outcomes

55c7561

gumadeiras force-pushed the fix/subagent-error-context branch from 3dcfc05 to 55c7561 on April 19, 2026 22:31

gumadeiras merged commit 4277078 into openclaw:main Apr 19, 2026
10 checks passed

omarshahine mentioned this pull request Apr 20, 2026

bluebubbles: fall back unsupported reactions to like #64693

Merged


omarshahine mentioned this pull request Apr 20, 2026

test(agents,gateway): fix two main-baseline test breakages from #68726 and #65986 #69173

Merged


gumadeiras merged 8 commits into openclaw:main from BKF-Gitty:fix/subagent-error-context