Skip to content

fix: prune replay control messages#82242

Merged
steipete merged 1 commit into
mainfrom
fix/replay-prune-control-messages-76629
May 15, 2026
Merged

fix: prune replay control messages#82242
steipete merged 1 commit into
mainfrom
fix/replay-prune-control-messages-76629

Conversation

@steipete
Copy link
Copy Markdown
Contributor

@steipete steipete commented May 15, 2026

Summary

  • Problem: restart and heartbeat replay paths could feed persisted internal runtime metadata or silent control payloads back into provider context.
  • Why it matters: replaying NO_REPLY, copied runtime context, or inbound metadata can change the next model turn and leak control text into user-visible recovery.
  • What changed: sanitize pending final-delivery text at capture and replay boundaries, strip copied runtime/inbound metadata from assistant replay, and drop standalone silent assistant replay turns.
  • What did NOT change (scope boundary): heartbeat ack-size classification still owns HEARTBEAT_OK decisions; provider/tool replay repair behavior outside display/control text pruning is unchanged.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Real behavior proof

Behavior addressed: restart/heartbeat/provider replay no longer preserves internal runtime context, inbound metadata, or standalone silent control replies in model-visible replay text.
Real environment tested: local OpenClaw source checkout on macOS, Node/Vitest repo test lanes, plus a direct local OpenClaw source command invoking the landed sanitizer.
Exact steps or command run after this patch: pnpm exec tsx -e ...sanitizePendingFinalDeliveryText(...); supplemental regression suite: pnpm test src/auto-reply/reply/pending-final-delivery.test.ts src/agents/pi-embedded-runner/replay-history.test.ts src/agents/main-session-restart-recovery.test.ts src/auto-reply/reply/get-reply.fast-path.test.ts src/gateway/chat-sanitize.test.ts -- --reporter=verbose
Evidence after fix: terminal output from local OpenClaw source command:

$ pnpm exec tsx -e 'import { sanitizePendingFinalDeliveryText } from "./src/auto-reply/reply/pending-final-delivery.ts"; ...'
dirty -> "Visible reply"
mixed -> "The user is saying hello"
heartbeat -> "HEARTBEAT_OK short"

Supplemental regression output: 4 Vitest shards passed; 62 tests passed after rebase onto current origin/main.
Observed result after fix: direct terminal capture shows runtime-context text stripped to Visible reply, glued NO_REPLY stripped to user-visible text, and HEARTBEAT_OK short preserved for the ack-aware classifier; regression tests also passed for pending final-delivery recovery, provider replay, restart recovery, heartbeat replay, and gateway display sanitization.
What was not tested: live provider restart/heartbeat replay against an external channel; coverage is focused regression tests plus local sanitizer/replay seams.

Root Cause

  • Root cause: pending final-delivery and provider replay boundaries reused persisted text that normal display/delivery paths sanitize elsewhere.
  • Missing detection / guardrail: no regression coverage for stale pending-final-delivery text containing runtime context or silent control payloads.
  • Contributing context: gateway display sanitization, provider replay sanitization, and pending delivery recovery duplicated overlapping text cleanup responsibilities.

Regression Test Plan

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/auto-reply/reply/pending-final-delivery.test.ts, src/agents/pi-embedded-runner/replay-history.test.ts, src/auto-reply/reply/get-reply.fast-path.test.ts, src/agents/main-session-restart-recovery.test.ts, src/gateway/chat-sanitize.test.ts
  • Scenario the test should lock in: stale pending/replay text strips internal metadata and silent sentinels without breaking heartbeat ack classification.
  • Why this is the smallest reliable guardrail: the bug is at text-normalization and replay boundaries, so focused seam tests cover the branchy behavior without live provider flake.
  • Existing test that already covers this (if any): gateway inbound metadata stripping had partial coverage; pending delivery and provider replay control text did not.
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

Restart and heartbeat recovery no longer feed OpenClaw internal runtime metadata or NO_REPLY control text back into the next model turn.

Diagram

Before:
[persisted pending/replay text] -> [provider/recovery prompt] -> [internal metadata/control text visible to model]

After:
[persisted pending/replay text] -> [boundary sanitizer] -> [provider/recovery prompt with user-visible text only]

Security Impact

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: local Node 22 OpenClaw checkout
  • Model/provider: mocked/unit replay paths
  • Integration/channel (if any): gateway/auto-reply/replay seams
  • Relevant config (redacted): default test configs

Steps

  1. Seed pending final-delivery or assistant replay text with runtime-context delimiters, inbound metadata, and/or NO_REPLY.
  2. Run pending delivery, heartbeat replay, restart recovery, and provider replay sanitizers.
  3. Assert provider/recovery text contains only user-visible content and heartbeat ack classification remains ack-aware.

Expected

  • Internal metadata and standalone silent control text are removed before replay.
  • Heartbeat ack text is not stripped before ackMaxChars classification.

Actual

  • Matches expected after this patch.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification

  • Verified scenarios: focused replay, pending final-delivery, heartbeat pending replay, restart recovery, and gateway display sanitizer tests.
  • Edge cases checked: exact NO_REPLY, JSON NO_REPLY, mixed leading/trailing NO_REPLY, copied internal runtime context, inbound metadata, and HEARTBEAT_OK short ack preservation.
  • What you did not verify: live external channel restart replay.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: over-stripping heartbeat acknowledgements before replay.
    • Mitigation: preserve HEARTBEAT_OK in the generic sanitizer and keep ack-size stripping in the existing heartbeat classifier; regression test covers HEARTBEAT_OK short.

@openclaw-barnacle openclaw-barnacle Bot added gateway Gateway runtime agents Agent runtime and tooling size: M maintainer Maintainer-authored PR labels May 15, 2026
@steipete
Copy link
Copy Markdown
Contributor Author

Verification for landing:

  • Local focused tests: pnpm test src/auto-reply/reply/pending-final-delivery.test.ts src/agents/pi-embedded-runner/replay-history.test.ts src/agents/main-session-restart-recovery.test.ts src/auto-reply/reply/get-reply.fast-path.test.ts src/gateway/chat-sanitize.test.ts -- --reporter=verbose
  • Result: passed after rebase onto current origin/main; 4 Vitest shards, 62 tests.
  • Diff hygiene: git diff --check origin/main...HEAD && node scripts/check-changelog-attributions.mjs
  • Result: passed.
  • Review: /Users/steipete/Projects/agent-scripts/skills/codex-review/scripts/codex-review
  • Result: clean; no accepted/actionable findings.

Known proof gap: no live external-channel restart replay; covered with focused pending-delivery, heartbeat replay, restart-recovery, provider replay, and gateway sanitizer regression tests.

@steipete steipete merged commit daef8e7 into main May 15, 2026
94 of 101 checks passed
@steipete steipete deleted the fix/replay-prune-control-messages-76629 branch May 15, 2026 17:35
@steipete steipete added the proof: override Maintainer override for the external PR real behavior proof gate. label May 15, 2026
@steipete
Copy link
Copy Markdown
Contributor Author

Landed via rebase onto main.

  • Source commit: 99cc779
  • Landed commit: daef8e7
  • Gate: focused replay/pending-delivery/gateway tests passed locally; diff hygiene and changelog attribution passed; codex-review clean.
  • Real behavior proof gate: passed after PR body terminal-capture proof update.

Thanks @steipete!

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 99cc7795b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

import { stripInternalMetadataForDisplay } from "./display-text-sanitize.js";

export function sanitizePendingFinalDeliveryText(text: string): string {
let stripped = stripInternalMetadataForDisplay(text).trim();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid stripping user-visible timestamp prefixes in pending replay text

Calling stripInternalMetadataForDisplay on every pending-final-delivery string can silently alter legitimate assistant output, because that sanitizer removes leading timestamp-like prefixes (via the inbound-metadata timestamp regex) even when the text is not OpenClaw metadata. If a reply starts with content like [Mon 2026-05-15 09:30 UTC] ... and delivery is retried/replayed, the recovered message no longer matches what the model produced, which corrupts user-visible replay behavior.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling gateway Gateway runtime maintainer Maintainer-authored PR proof: override Maintainer override for the external PR real behavior proof gate. size: M

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] replay should prune internal/no-op control messages before provider context

1 participant