fix(outbound): persist recovery attempt before replay delivery #40374

Closed

KimGLee wants to merge 7 commits into openclaw:main from KimGLee:fix/35959-recovery-replay-loop-r2

Conversation

@KimGLee
Contributor

@KimGLee KimGLee commented Mar 9, 2026

Summary

  • fix replay-loop risk in outbound delivery recovery by recording a recovery attempt before trying to re-deliver queued entries
  • ensure failed recovery updates error text without double-incrementing retry count

Is #35959 real?

Yes. This is a real reliability bug.

Root cause

During startup recovery (recoverPendingDeliveries), queued entries were delivered first, and retry metadata was updated only on an explicit catch/failure.

That means if the process/provider dies mid-delivery (e.g. a stale-socket restart) after the send's side effects have happened but before the queue ack/fail update is persisted, the entry remains in the queue with:

  • retryCount = 0
  • no lastAttemptAt

On the next restart, the same entry is immediately eligible again, causing repeated replays and potential duplicate sends (loop behavior); the vulnerable ordering is sketched below.
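A minimal reconstruction of that ordering, for illustration only (the helper names and signatures here are assumptions, not the literal source):

```
// Hypothetical sketch of the pre-fix recovery step in
// src/infra/outbound/delivery-queue.ts. Helper shapes are assumed.
type Entry = { id: string; retryCount: number; lastAttemptAt?: number };

async function recoverOnePreFix(
  entry: Entry,
  deliver: (e: Entry) => Promise<void>,
  ack: (id: string) => Promise<void>,
  fail: (id: string, err: string) => Promise<void>,
): Promise<void> {
  try {
    await deliver(entry); // send side effects happen here
    await ack(entry.id); // a hard crash before this line loses the ack…
  } catch (err) {
    await fail(entry.id, String(err)); // …and never reaches this handler,
    // so retryCount stays 0 and lastAttemptAt stays unset on disk.
  }
}
```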

Fix details

In src/infra/outbound/delivery-queue.ts:

  1. Added markDeliveryAttempt(id, stateDir):
    • increments retryCount
    • sets lastAttemptAt = Date.now()
  2. In recoverPendingDeliveries, call markDeliveryAttempt(...) before deliver(...).
  3. Updated failDelivery(...) to support opts.incrementRetry (default true).
    • recovery path now calls failDelivery(..., { incrementRetry: false }) to avoid a double increment, since the attempt was already recorded.

This guarantees that a crash/restart after a replay attempt leaves persisted retry/backoff state behind, preventing immediate infinite replay loops; a sketch of the new helpers follows below.
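A minimal sketch of the two persistence helpers, assuming one JSON file per queue entry under stateDir (the on-disk layout is an assumption; only the fields described above are shown):

```
// Sketch only; assumed layout: <stateDir>/queue/<id>.json per entry.
import { promises as fs } from "node:fs";
import * as path from "node:path";

type Entry = { id: string; retryCount: number; lastAttemptAt?: number; lastError?: string };

async function updateEntry(id: string, stateDir: string, mutate: (e: Entry) => void): Promise<void> {
  const file = path.join(stateDir, "queue", `${id}.json`);
  const entry: Entry = JSON.parse(await fs.readFile(file, "utf8"));
  mutate(entry);
  const tmp = `${file}.tmp`; // write-then-rename so a crash mid-write cannot corrupt the entry
  await fs.writeFile(tmp, JSON.stringify(entry));
  await fs.rename(tmp, file);
}

export async function markDeliveryAttempt(id: string, stateDir: string): Promise<void> {
  await updateEntry(id, stateDir, (e) => {
    e.retryCount += 1; // recorded BEFORE deliver(...) runs
    e.lastAttemptAt = Date.now();
  });
}

export async function failDelivery(
  id: string,
  errMsg: string,
  stateDir: string,
  opts: { incrementRetry?: boolean } = {},
): Promise<void> {
  await updateEntry(id, stateDir, (e) => {
    e.lastError = errMsg;
    if (opts.incrementRetry !== false) e.retryCount += 1; // default: true
  });
}
```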

Local validation

  • Ran full outbound recovery tests:
    • pnpm exec vitest run src/infra/outbound/outbound.test.ts
  • Result: 58/58 passed.
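Complementing that suite, a hypothetical vitest sketch (not taken from outbound.test.ts) of the property the recovery tests need to pin down: the attempt must be persisted even when delivery throws mid-flight.

```
import { describe, expect, it } from "vitest";

type Entry = { id: string; retryCount: number; lastAttemptAt?: number };

// Fixed ordering: persist the attempt first, then deliver (which may crash).
async function recoverOne(
  entry: Entry,
  mark: (e: Entry) => Promise<void>,
  deliver: (e: Entry) => Promise<void>,
): Promise<void> {
  await mark(entry);
  await deliver(entry);
}

describe("recovery ordering", () => {
  it("records the attempt even when delivery crashes", async () => {
    const entry: Entry = { id: "e1", retryCount: 0 };
    const persisted: Entry[] = [];
    await expect(
      recoverOne(
        entry,
        async (e) => {
          e.retryCount += 1;
          e.lastAttemptAt = Date.now();
          persisted.push({ ...e }); // stands in for the on-disk write
        },
        async () => {
          throw new Error("socket died mid-send"); // simulated crash
        },
      ),
    ).rejects.toThrow("socket died mid-send");
    expect(persisted[0]?.retryCount).toBe(1);
    expect(persisted[0]?.lastAttemptAt).toBeDefined();
  });
});
```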

Fixes #35959.

@openclaw-barnacle openclaw-barnacle Bot added the scripts (Repository scripts) and size: S labels Mar 9, 2026
@KimGLee KimGLee closed this Mar 9, 2026
@KimGLee KimGLee deleted the fix/35959-recovery-replay-loop-r2 branch March 9, 2026 00:04
@greptile-apps
Contributor

greptile-apps Bot commented Mar 9, 2026

Greptile Summary

This PR fixes a real crash-recovery replay-loop bug in src/infra/outbound/delivery-queue.ts by introducing markDeliveryAttempt, which pre-persists incremented retry metadata to disk before the delivery attempt is made. This ensures that a process crash mid-delivery leaves the queue entry with an incremented retryCount and a lastAttemptAt timestamp, so the entry is subject to normal backoff on the next restart instead of immediately replaying. The failDelivery function is updated to accept { incrementRetry: false } to avoid double-counting when the pre-persist already recorded the attempt. The >= to > boundary change for MAX_RETRIES is correct given the new pre-increment semantics. CI gains an artifact-hygiene gate.

  • Logic bug in failure path: When markDeliveryAttempt silently fails (filesystem error) and the subsequent deliver call also throws, failDelivery is called with { incrementRetry: false }. Because the pre-persist did not succeed, retryCount never advances on disk. If the filesystem issue persists across restarts, the entry can loop indefinitely without progressing toward the MAX_RETRIES eviction boundary. The fix is to track whether markDeliveryAttempt succeeded and pass { incrementRetry: !attemptMarked } to failDelivery.
  • CI docs_only guard removal: The changed-scope job and the macos job both lose their needs.docs-scope.outputs.docs_only != 'true' condition. This is safe only if ci-changed-scope.mjs reliably outputs run_macos = false for documentation-only changes; worth a smoke-check if docs-only PRs start triggering macOS runners unexpectedly.
  • The new scripts/check-no-binary-artifacts.mjs script and the artifact-hygiene CI job are clean and straightforward.

Confidence Score: 3/5

  • The core approach is sound but contains a logic gap: if markDeliveryAttempt silently fails and delivery also fails, failDelivery is invoked with incrementRetry: false, leaving retryCount permanently stalled and potentially causing an indefinite retry loop.
  • Score of 3 reflects that the PR's primary intent (prevent crash-replay loops) is correct and well-tested, but the introduced logic gap around markDeliveryAttempt failure + delivery failure creates a new reliability issue in an edge case that the existing tests do not cover.
  • src/infra/outbound/delivery-queue.ts — specifically the recoverPendingDeliveries function's inner try/catch block for markDeliveryAttempt and the downstream failDelivery call.

Last reviewed commit: 2690b0c

Comment on lines 354 to 392
@@ -360,7 +386,7 @@ export async function recoverPendingDeliveries(opts: {
       continue;
     }
     try {
-      await failDelivery(entry.id, errMsg, opts.stateDir);
+      await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });
     } catch {
       // Best-effort update.
     }

incrementRetry: false applied even when markDeliveryAttempt failed

When markDeliveryAttempt throws and its error is silently swallowed, retryCount on disk was never incremented. If deliver then also throws, failDelivery is called with { incrementRetry: false }, leaving retryCount unchanged. On the next restart the entry re-enters the loop with the same count, backoff is computed against the same number, and — if the filesystem keeps failing — the entry can loop indefinitely without ever progressing toward the MAX_RETRIES guard.

To make the incrementRetry decision conditional on whether the pre-persist actually succeeded:

```
      let attemptMarked = false;
      try {
        await markDeliveryAttempt(entry.id, opts.stateDir);
        attemptMarked = true;
      } catch {
        // Best-effort pre-attempt persistence.
      }
      await opts.deliver({
        cfg: opts.cfg,
        channel: entry.channel,
        to: entry.to,
        accountId: entry.accountId,
        payloads: entry.payloads,
        threadId: entry.threadId,
        replyToId: entry.replyToId,
        bestEffort: entry.bestEffort,
        gifPlayback: entry.gifPlayback,
        silent: entry.silent,
        mirror: entry.mirror,
        skipQueue: true, // Prevent re-enqueueing during recovery
      });
      await ackDelivery(entry.id, opts.stateDir);
      recovered += 1;
      opts.log.info(`Recovered delivery ${entry.id} to ${entry.channel}:${entry.to}`);
    } catch (err) {
      const errMsg = err instanceof Error ? err.message : String(err);
      if (isPermanentDeliveryError(errMsg)) {
        opts.log.warn(`Delivery ${entry.id} hit permanent error — moving to failed/: ${errMsg}`);
        try {
          await moveToFailed(entry.id, opts.stateDir);
        } catch (moveErr) {
          opts.log.error(`Failed to move entry ${entry.id} to failed/: ${String(moveErr)}`);
        }
        failed += 1;
        continue;
      }
      try {
        await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: !attemptMarked });
      } catch {
        // Best-effort update.
      }
```

This ensures retryCount always advances when a delivery attempt fails, regardless of whether the pre-persist succeeded.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2690b0cb1c


     }
     try {
-      await failDelivery(entry.id, errMsg, opts.stateDir);
+      await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });

P2: Increment retries when pre-attempt persistence fails

recoverPendingDeliveries swallows errors from markDeliveryAttempt, but the failure path always calls failDelivery(..., { incrementRetry: false }). If the pre-attempt write fails (for example a transient filesystem write/rename error) and deliver then throws, retryCount is not incremented at all, so the entry can be retried immediately on the next restart and reintroduce the replay-loop behavior this change is trying to prevent. The retry increment should only be disabled when the pre-attempt mark actually succeeded.



Labels

scripts (Repository scripts), size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram provider stale-socket restart causes stuck message replay loop

1 participant