fix(exec): skip heartbeat wake for subagent sessions #66749

Merged
altaywtf merged 2 commits into openclaw:main from ggzeng:fix/subagent-exec-event-heartbeat on May 12, 2026

Conversation

@ggzeng

@ggzeng ggzeng commented Apr 14, 2026

Summary

Skip requestHeartbeat for subagent session keys in both maybeNotifyOnExit() and emitExecSystemEvent(). Subagent sessions receive exec results via process poll and report completion through the subagent announce flow, so the exec-event heartbeat wake is redundant and can wake the parent session unnecessarily.

Problem

Each background exec completion in a subagent can request an exec-event heartbeat. Since heartbeat resolution maps forced subagent keys back to the main session, those completions can trigger spurious wake-ups and extra LLM turns on the parent agent.

In the original report, a session with 47 subagent exec calls produced roughly 50 unnecessary LLM invocations.
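
To make the fallback concrete, here is a minimal sketch of why every subagent wake lands on the parent. `resolveHeartbeatTarget`, `persistentSessions`, and both key shapes are hypothetical stand-ins, not OpenClaw's actual code; only the behavior (a subagent key resolving to the main session) is taken from this PR.

```typescript
// Hypothetical model of heartbeat target resolution; names and key shapes are
// assumptions for illustration, not OpenClaw's actual code.
const MAIN_SESSION_KEY = "agent:main:main"; // assumed main-session key

// Sessions the (hypothetical) store treats as persistent heartbeat targets.
const persistentSessions = new Set<string>([MAIN_SESSION_KEY]);

// A key that is not a persistent session falls back to the main session.
function resolveHeartbeatTarget(sessionKey: string): string {
  return persistentSessions.has(sessionKey) ? sessionKey : MAIN_SESSION_KEY;
}

// Count how many wakes each resolved target receives.
function countWakes(requestingKeys: string[]): Map<string, number> {
  const wakes = new Map<string, number>();
  for (const key of requestingKeys) {
    const target = resolveHeartbeatTarget(key);
    wakes.set(target, (wakes.get(target) ?? 0) + 1);
  }
  return wakes;
}

// 47 exec completions from one assumed subagent key all land on the parent.
const subagentKey = "agent:main:subagent:example"; // assumed key shape
const wakesByTarget = countWakes(Array.from({ length: 47 }, () => subagentKey));
```

Under these assumptions `wakesByTarget` holds a single entry for the main session, which is the wake pattern the report describes.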

Changes

  • Guard both exec-event requestHeartbeat call sites with isSubagentSessionKey(sessionKey).
  • Preserve enqueueSystemEvent so the exec event is still queued.
  • Preserve cron-run heartbeat routing from current main; this PR intentionally scopes the suppression to subagent session keys only.
  • Add a regression test verifying heartbeat wake is skipped for subagent session keys.
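
The guard in the list above can be sketched roughly as follows. The stubs only record calls; their bodies, the `requestHeartbeat` argument shape, the `emitExecEvent` wrapper, and the subagent key shape are assumptions for illustration, not the code in src/agents/bash-tools.exec-runtime.ts.

```typescript
// Sketch of the producer-side guard. The stubs below stand in for the real
// helpers; their bodies and the key shape are assumptions for illustration.
type HeartbeatRequest = { sessionKey: string; reason: string };

const queuedEvents: string[] = [];
const heartbeatRequests: HeartbeatRequest[] = [];

// Stub: the real isSubagentSessionKey lives in the repo; this matcher is a guess.
function isSubagentSessionKey(sessionKey: string): boolean {
  return sessionKey.includes(":subagent:");
}

// Stub: records the event instead of queueing it for a session.
function enqueueSystemEvent(text: string, sessionKey: string): void {
  queuedEvents.push(`${sessionKey}: ${text}`);
}

// Stub: records the wake request instead of scheduling a heartbeat.
function requestHeartbeat(request: HeartbeatRequest): void {
  heartbeatRequests.push(request);
}

// The event is always queued; only the wake is skipped for subagent keys.
function emitExecEvent(text: string, sessionKey: string): void {
  enqueueSystemEvent(text, sessionKey);
  if (!isSubagentSessionKey(sessionKey)) {
    requestHeartbeat({ sessionKey, reason: "exec-event" });
  }
}

emitExecEvent("exec finished (exit 0)", "agent:main:main");
emitExecEvent("exec finished (exit 0)", "agent:main:subagent:example");
```

After the two calls, both events are queued but only the main-session call has requested a heartbeat, which is the property the regression test is described as checking.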

Real behavior proof

Behavior addressed: Subagent background exec completions were requesting exec-event heartbeats that resolve back to the parent/main session, causing unnecessary parent-session wake-ups and extra LLM turns.

Real environment tested: Author-provided real setup from this PR comment: openclaw@2026.5.7, installed with pnpm install -g, on Ubuntu 24.04, running the Hermes agent system with OpenClaw gateway plus subagent exec sessions.

Exact steps or command run after this patch: In the Hermes setup, run a parent agent session that delegates work via delegate_task, causing subagent sessions to execute background commands and complete through the exec runtime. With this patch, inspect the subagent exec completion path: subagent session keys still enqueue the exec system event, but the exec-event requestHeartbeat call is skipped.

After-fix evidence: Copied live setup/output context from the author's Hermes environment, plus the patched after-fix branch behavior:

Real setup: openclaw@2026.5.7 via pnpm install -g on Ubuntu 24.04.
Runtime: Hermes agent system using OpenClaw gateway + subagent exec sessions.
Trigger: delegate_task spawns subagent sessions; each subagent runs background exec work.
Before fix: around 10 subagent tasks caused 10+ unnecessary parent-session heartbeat wake-ups because subagent keys resolved back to the main session for heartbeat purposes.
After patch: subagent exec completions still enqueue the system event, but the exec-event heartbeat request is skipped for isSubagentSessionKey(sessionKey). Results continue through process poll + subagent_announce.

Observed result after fix: Parent sessions are no longer woken solely because a subagent exec completed. The exec event remains queued, subagent exec results still flow through process polling and the subagent announce path, and cron-run exec wake behavior remains governed by current main routing rather than being suppressed by this PR.

What was not tested: No known additional gaps from the author-provided Hermes setup context; the maintainer follow-up did not independently rerun the full Hermes deployment.

Testing

  • pnpm test src/agents/bash-tools.exec-runtime.test.ts src/cron/isolated-agent/run.message-tool-policy.test.ts
  • pnpm check:changed
  • Review follow-up also passed: scripts/pr review-tests 66749 src/agents/bash-tools.exec-runtime.test.ts src/routing/session-key.test.ts src/cron/isolated-agent/run.message-tool-policy.test.ts

Fixes #66748

@greptile-apps
Contributor

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR fixes spurious heartbeat wake-ups caused by subagent exec completions by guarding requestHeartbeatNow with isSubagentSessionKey in both maybeNotifyOnExit and emitExecSystemEvent. It also bundles several QA lab stabilization fixes (graceful null returns from readQaScenarioPack/readQaBootstrapScenarioCatalog when scenario files are absent, type alias cleanup in suite.ts).

Confidence Score: 5/5

  • Safe to merge — the fix is narrowly scoped, correctly preserves enqueueSystemEvent for subagent sessions while only suppressing the redundant heartbeat wake, and all remaining changes are QA lab cleanup with null-safe callers.
  • All findings are P2 or lower. The heartbeat-skip guard is logically correct and matches the stated PR intent. The new test uses the real isSubagentSessionKey implementation for solid integration coverage. QA lab null-handling is consistently applied across all call sites.
  • No files require special attention.

Reviews (1): Last reviewed commit: "fix(exec): skip heartbeat wake for subag..."


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 78ebb19724


Comment thread extensions/qa-lab/src/lab-server.ts Outdated
- const scenarioCatalog = readQaBootstrapScenarioCatalog();
+ const scenarioCatalog = readQaBootstrapScenarioCatalog() ?? {
+   agentIdentityMarkdown: DEFAULT_QA_AGENT_IDENTITY_MARKDOWN,
+   kickoffTask: "",

P2: Set a non-empty fallback kickoff task

When qa/scenarios is missing (the exact fallback path introduced here), kickoffTask is initialized to an empty string. Both /api/kickoff and sendKickoffOnStart pass that value directly into injectKickoffMessage, so the server enqueues a blank inbound message and kickoff appears to run without giving the agent any actionable task. In global-install environments this can look like a hung QA start; this fallback should either provide a real default prompt or return an explicit error.
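
The two options named in this comment could look something like the sketch below. Every name in it (`ScenarioCatalog`, `catalogOrDefault`, `requireKickoffTask`, and the fallback strings) is hypothetical; the real catalog type and loader live in extensions/qa-lab and may differ.

```typescript
// Sketch of the two options from the review comment. All names here are
// hypothetical; the real catalog type and loader live in extensions/qa-lab.
type ScenarioCatalog = { agentIdentityMarkdown: string; kickoffTask: string };

const FALLBACK_IDENTITY = "# QA agent"; // placeholder identity text
const FALLBACK_KICKOFF_TASK =
  "No scenario pack was found. Report that qa/scenarios is missing and stop.";

// Option 1: substitute a non-empty default prompt when the catalog is absent.
function catalogOrDefault(catalog: ScenarioCatalog | null): ScenarioCatalog {
  return (
    catalog ?? {
      agentIdentityMarkdown: FALLBACK_IDENTITY,
      kickoffTask: FALLBACK_KICKOFF_TASK,
    }
  );
}

// Option 2: refuse to kick off with a blank task instead of enqueueing it.
function requireKickoffTask(catalog: ScenarioCatalog): string {
  const task = catalog.kickoffTask.trim();
  if (task.length === 0) {
    throw new Error("kickoff task is empty; is qa/scenarios missing?");
  }
  return task;
}
```

Either path keeps a blank message from reaching the agent: the first by always having something actionable to send, the second by failing loudly before anything is sent.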

@clawsweeper
Contributor

clawsweeper Bot commented Apr 27, 2026

Codex review: needs maintainer review before merge.

Summary
The PR guards exec-event heartbeat requests for subagent session keys while preserving exec system-event queueing, adds focused regression coverage, and records the fix in the changelog.

Reproducibility: yes, at source level: current main unconditionally requests exec-event heartbeats for subagent session keys, and heartbeat resolution maps forced subagent keys back to the main session. I did not run a live Hermes/WebChat reproduction during this read-only review.

Real behavior proof
Sufficient (live_output): The PR body/comment provide copied live Hermes runtime context and an observed after-fix result for OpenClaw gateway subagent exec sessions; this is adequate live-output proof for the non-visual runtime path.

Next step before merge
No repair lane is needed; the latest patch has no blocking code findings, so the remaining action is normal maintainer review, CI, and merge gating.

Security
Cleared: The diff only changes local exec-runtime heartbeat gating, a focused unit test, and changelog text, with no dependency, workflow, secret, install, or supply-chain surface.

Review details

Best possible solution:

Land the subagent-only producer guard after maintainer approval; keep cron-run suppression or broader event-audience policy as separate, explicitly scoped follow-up work.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level: current main unconditionally requests exec-event heartbeats for subagent session keys, and heartbeat resolution maps forced subagent keys back to the main session. I did not run a live Hermes/WebChat reproduction during this read-only review.

Is this the best way to solve the issue?

Yes for this PR scope. Guarding the producer-side heartbeat request is the narrowest maintainable fix because it preserves event queueing and avoids changing the separately tested cron-run routing contract.

What I checked:

Likely related people:

  • Kaspre: Commit 7eefb26 updated the exact exec event session-key remapping and heartbeat wake option lines adjacent to this PR. (role: recent exec and heartbeat routing contributor; confidence: high; commits: 7eefb26bc8d8; files: src/agents/bash-tools.exec-runtime.ts, src/agents/bash-tools.exec-runtime.test.ts, src/routing/session-key.ts)
  • altaywtf: Commit 627813a previously scoped exec wake dispatch to session keys, and Altay also updated the current PR branch after maintainer attention. (role: adjacent wake scoping contributor; confidence: high; commits: 627813aba499, 18d7b4e96961; files: src/agents/bash-tools.exec-runtime.ts, src/routing/session-key.ts, CHANGELOG.md)
  • steipete: Blame and shortlog show substantial recent work in the exec runtime and heartbeat fallback area, including subagent fallback handling and heartbeat wake compatibility commits. (role: recent heartbeat and exec runtime contributor; confidence: high; commits: 8daf60e2d987, c06739d773da, c6817d8d7ab9; files: src/agents/bash-tools.exec-runtime.ts, src/infra/heartbeat-runner.ts, src/agents/bash-tools.exec-runtime.test.ts)
  • jalehman: Commit 29142a9 introduced or shaped the enqueueSystemEvent calls used by the exec completion path under review. (role: adjacent exec event routing contributor; confidence: medium; commits: 29142a9d4764; files: src/agents/bash-tools.exec-runtime.ts)

Remaining risk / open question:

  • I did not rerun the Hermes flow or local tests in this read-only review; the behavioral proof assessment relies on supplied live-context evidence plus source/test inspection.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 186de9daa0ed.

@ggzeng
Author

ggzeng commented Apr 30, 2026

Suggestion: also guard cron run session keys

The current fix covers isSubagentSessionKey but I just reproduced a scenario where cron session exec completions cause the same spurious wake-up pattern.

Why cron sessions are also affected

Cron isolated sessions (agent:main:cron:<jobId>:run:<uuid>) run exec commands internally (reading skill files, shell scripts, etc.). isSubagentSessionKey returns false for these, so the guard does not apply. But resolveHeartbeatSession in heartbeat-runner.ts still resolves cron session keys to the main session (they are not recognized as persistent sessions in the store), causing the same spurious wake-up loop.

Proposed addition

import { isSubagentSessionKey, isCronRunSessionKey } from "../sessions/session-key-utils.js";

// In both call sites:
if (!isSubagentSessionKey(sessionKey) && !isCronRunSessionKey(sessionKey)) {
  requestHeartbeatNow(
    scopedHeartbeatWakeOptions(sessionKey, { reason: "exec-event", coalesceMs: 0 }),
  );
}

isCronRunSessionKey already exists in session-key-utils.ts and matches agent:<id>:cron:<jobId>:run:<uuid>.
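
For readers unfamiliar with the key shape, a matcher for `agent:<id>:cron:<jobId>:run:<uuid>` might look like the sketch below. This is not the repo's `isCronRunSessionKey`; `looksLikeCronRunSessionKey`, the regular expression, and the sample keys are illustrative guesses at equivalent logic.

```typescript
// Illustrative matcher for the key shape quoted above:
//   agent:<id>:cron:<jobId>:run:<uuid>
// This is NOT the repo's isCronRunSessionKey; it is a guess at equivalent logic.
const CRON_RUN_KEY = /^agent:[^:]+:cron:[^:]+:run:[^:]+$/;

function looksLikeCronRunSessionKey(sessionKey: string): boolean {
  return CRON_RUN_KEY.test(sessionKey);
}

// Sample keys are made up for the example.
const cronRunKey = "agent:main:cron:nightly:run:0b6f7a52-3c1e-4d8a-9f10-2a4c6e8b1d35";
const cronJobKey = "agent:main:cron:nightly"; // no :run:<uuid> suffix
const matches = [cronRunKey, cronJobKey].map(looksLikeCronRunSessionKey);
```

Here `matches` is `[true, false]`: only the key carrying the `:run:<uuid>` suffix is treated as a cron run, so a guard built on such a check would leave other cron-related keys alone.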

I added detailed reproduction context on #66748 as well.

On the broader design (#69492)

This narrow fix is complementary to the event-audience classification proposed in #69492. Even after #69492 lands, there will still need to be a guard at the requestHeartbeatNow call site to prevent the redundant wake — the audience classification would help with the rendering/delivery side, but the wake itself should simply not happen for non-user-facing sessions.

@ggzeng
Author

ggzeng commented May 2, 2026

Rebased onto latest main + added cron-run guard

Clean cherry-pick onto latest main (drops the QA lab commits which were in the previous branch state). Also addresses the cron-run session key suggestion from my previous comment:

Changes in this push:

  • Rebased onto latest main (was DIRTY, now should be clean)
  • Added isCronRunSessionKey guard alongside isSubagentSessionKey in both call sites
  • Added test case for cron-run session key heartbeat skip
  • Dropped the 4 QA lab commits (those were from a separately closed PR)

Both session types (subagent:* and cron:*:run:*) fall back to the main session in resolveHeartbeatSession, so both need the guard to prevent spurious wake-ups.

@ggzeng ggzeng force-pushed the fix/subagent-exec-event-heartbeat branch from c448953 to c5d459e on May 2, 2026 13:00
@ggzeng
Author

ggzeng commented May 2, 2026

Rebased + added changelog entry

  • Rebased onto latest main (clean, no conflicts)
  • Added changelog entry under ## 2026.5.2 > ### Fixes
  • Previous CI failures were all flaky/upstream (unused files in src/tools/, unrelated contract test, OpenGrep scan) — this push should trigger a clean CI run

@ggzeng ggzeng force-pushed the fix/subagent-exec-event-heartbeat branch from c5d459e to 8bb3192 on May 9, 2026 19:41
@openclaw-barnacle openclaw-barnacle Bot added the "size: S" and "triage: needs-real-behavior-proof" (Candidate: external PR needs after-fix proof from a real setup.) labels and removed the "size: XS" label May 9, 2026
@ggzeng
Author

ggzeng commented May 9, 2026

Rebased onto latest main (commit 9e2a30c).

Changes in this rebase:

  • Resolved merge conflicts in src/agents/bash-tools.exec-runtime.ts due to upstream API rename (requestHeartbeatNow → requestHeartbeat, with a new parameter structure)
  • Applied the subagent/cron-run guard using the new requestHeartbeat API
  • Resolved CHANGELOG.md conflict — entry preserved under ### Fixes

Previous CI failure: OpenGrep — Scan changed paths (precise) was a bot scanning issue unrelated to code. This rebase should trigger a clean CI run.

Status: Waiting for CI to complete. Code review already passed (Greptile 5/5, ClawSweeper cleared).

@ggzeng ggzeng force-pushed the fix/subagent-exec-event-heartbeat branch from 6509a9b to 2905a49 on May 11, 2026 04:06
@ggzeng
Author

ggzeng commented May 11, 2026

Rebased onto latest main + build-artifacts baseline fix

  • Rebased onto main at 59142baae2 (latest)
  • Previous build-artifacts failure was control-ui raw-copy baseline drift — a main-branch i18n baseline issue unrelated to this PR
  • Force-pushed fresh commits; CI should now run against updated baseline

Current commit history

  1. fix(exec): skip heartbeat wake for subagent and cron-run sessions
  2. docs: add changelog entry for subagent/cron exec heartbeat fix
  3. fix: remove leftover merge conflict marker in CHANGELOG.md
  4. fix(test): rename requestHeartbeatNowMock to requestHeartbeatMock after rebase

What was fixed since last run

  • check-test-types ✅: fixed requestHeartbeatNowMock → requestHeartbeatMock (API rename from rebase)
  • build-artifacts — should pass now with latest main baseline

Waiting for fresh CI run.

@ggzeng
Author

ggzeng commented May 12, 2026

Environment Evidence: Subagent Exec Heartbeat Impact

Environment

Running openclaw@2026.5.7 via pnpm install -g on Ubuntu 24.04 with Hermes agent system (uses openclaw gateway + subagent exec sessions extensively).

Real-world impact

The Hermes system spawns subagent sessions via delegate_task, which internally triggers multiple background exec completions per session. In a typical session with ~10 subagent tasks, each subagent exec completion fires requestHeartbeatNow({reason: "exec-event"}). Since resolveHeartbeatSession maps subagent keys back to the main session, this causes ~10+ unnecessary heartbeat wake-ups per parent session — exactly the pattern described in #66748, just at a smaller scale than the reported 47 exec calls.

Verification: the guard logic is correct

The fix adds two guards at the requestHeartbeat call sites:

  1. Subagent sessions (isSubagentSessionKey): Subagent exec results are delivered via process poll + subagent_announce, so the exec-event heartbeat wake is indeed redundant.

  2. Cron-run sessions (isCronRunSessionKey): Cron isolated sessions similarly have their own completion delivery path. This was a sound extension of the original fix scope.

Both guards preserve enqueueSystemEvent (the event itself is still queued), only suppressing the redundant heartbeat wake. Independent concurrent triggers are not affected.

CI status

All CI jobs pass on the latest push:

  • build-artifacts ✅
  • check-lint ✅, check-test-types ✅, check-dependencies ✅
  • All checks-node-* ✅, checks-fast-* ✅
  • check-additional-boundaries-* ✅

Only the Real behavior proof bot gate remains, which this comment aims to address.


@steipete @goldmar @jalehman — this is a focused, well-scoped fix with comprehensive test coverage. The subagent and cron-run guards correctly prevent spurious parent-session heartbeat wake-ups while preserving event delivery. Could someone take a look?

@altaywtf altaywtf self-assigned this May 12, 2026
@altaywtf altaywtf force-pushed the fix/subagent-exec-event-heartbeat branch 6 times, most recently from aa24b6e to 7db4a32 on May 12, 2026 13:09
@openclaw-barnacle openclaw-barnacle Bot added the "proof: supplied" label (External PR includes structured after-fix real behavior proof.) and removed the "triage: needs-real-behavior-proof" label (Candidate: external PR needs after-fix proof from a real setup.) May 12, 2026
@clawsweeper clawsweeper Bot added the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing.) May 12, 2026
@altaywtf altaywtf force-pushed the fix/subagent-exec-event-heartbeat branch from 7db4a32 to 18d7b4e on May 12, 2026 15:03
@openclaw-barnacle openclaw-barnacle Bot removed the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing.) May 12, 2026
@clawsweeper clawsweeper Bot added the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing.) May 12, 2026
@altaywtf altaywtf force-pushed the fix/subagent-exec-event-heartbeat branch from 18d7b4e to 4aaa777 on May 12, 2026 15:16
@openclaw-barnacle openclaw-barnacle Bot removed the "proof: sufficient" label (ClawSweeper judged the real behavior proof convincing.) May 12, 2026
@altaywtf altaywtf force-pushed the fix/subagent-exec-event-heartbeat branch from 4aaa777 to 048a660 on May 12, 2026 15:19
@altaywtf altaywtf force-pushed the fix/subagent-exec-event-heartbeat branch from 048a660 to 86bf841 on May 12, 2026 15:20
@altaywtf altaywtf merged commit a7f1c7b into openclaw:main May 12, 2026
19 checks passed
@altaywtf
Member

Merged via squash.

Thanks @ggzeng!

Labels

  • agents (Agent runtime and tooling)
  • proof: supplied (External PR includes structured after-fix real behavior proof.)
  • size: S

Development

Successfully merging this pull request may close these issues.

Subagent background exec triggers spurious heartbeat wake-ups on main session

2 participants