fix(outbound): persist recovery attempt before replay delivery #40374

Closed

KimGLee wants to merge 7 commits into openclaw:main from KimGLee:fix/35959-recovery-replay-loop-r2

Conversation

@KimGLee
Contributor

@KimGLee KimGLee commented Mar 9, 2026

Summary

  • fix replay-loop risk in outbound delivery recovery by recording a recovery attempt before trying to re-deliver queued entries
  • ensure failed recovery updates error text without double-incrementing retry count

Is #35959 real?

Yes. This is a real reliability bug.

Root cause

During startup recovery (recoverPendingDeliveries), queued entries were delivered first, and retry metadata was updated only on an explicit catch/failure.

That means if the process/provider dies mid-delivery (e.g. a stale-socket restart) after the send's side effects have happened but before the queue ack/fail update is persisted, the entry remains in the queue with:

  • retryCount = 0
  • no lastAttemptAt

On the next restart, the same entry is immediately eligible again, causing repeated replays and potential duplicate sends (loop behavior); the vulnerable ordering is sketched below.
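A minimal reconstruction of that ordering, for illustration only (the helper names and signatures here are assumptions, not the literal source):

```
// Hypothetical sketch of the pre-fix recovery step in
// src/infra/outbound/delivery-queue.ts. Helper shapes are assumed.
type Entry = { id: string; retryCount: number; lastAttemptAt?: number };

async function recoverOnePreFix(
  entry: Entry,
  deliver: (e: Entry) => Promise<void>,
  ack: (id: string) => Promise<void>,
  fail: (id: string, err: string) => Promise<void>,
): Promise<void> {
  try {
    await deliver(entry); // send side effects happen here
    await ack(entry.id); // a hard crash before this line loses the ack…
  } catch (err) {
    await fail(entry.id, String(err)); // …and never reaches this handler,
    // so retryCount stays 0 and lastAttemptAt stays unset on disk.
  }
}
```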

Fix details

In src/infra/outbound/delivery-queue.ts:

  1. Added markDeliveryAttempt(id, stateDir):
    • increments retryCount
    • sets lastAttemptAt = Date.now()
  2. In recoverPendingDeliveries, call markDeliveryAttempt(...) before deliver(...).
  3. Updated failDelivery(...) to support opts.incrementRetry (default true).
    • recovery path now calls failDelivery(..., { incrementRetry: false }) to avoid a double increment, since the attempt was already recorded.

This guarantees that a crash/restart after a replay attempt leaves persisted retry/backoff state behind, preventing immediate infinite replay loops; a sketch of the new helpers follows below.
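A minimal sketch of the two persistence helpers, assuming one JSON file per queue entry under stateDir (the on-disk layout is an assumption; only the fields described above are shown):

```
// Sketch only; assumed layout: <stateDir>/queue/<id>.json per entry.
import { promises as fs } from "node:fs";
import * as path from "node:path";

type Entry = { id: string; retryCount: number; lastAttemptAt?: number; lastError?: string };

async function updateEntry(id: string, stateDir: string, mutate: (e: Entry) => void): Promise<void> {
  const file = path.join(stateDir, "queue", `${id}.json`);
  const entry: Entry = JSON.parse(await fs.readFile(file, "utf8"));
  mutate(entry);
  const tmp = `${file}.tmp`; // write-then-rename so a crash mid-write cannot corrupt the entry
  await fs.writeFile(tmp, JSON.stringify(entry));
  await fs.rename(tmp, file);
}

export async function markDeliveryAttempt(id: string, stateDir: string): Promise<void> {
  await updateEntry(id, stateDir, (e) => {
    e.retryCount += 1; // recorded BEFORE deliver(...) runs
    e.lastAttemptAt = Date.now();
  });
}

export async function failDelivery(
  id: string,
  errMsg: string,
  stateDir: string,
  opts: { incrementRetry?: boolean } = {},
): Promise<void> {
  await updateEntry(id, stateDir, (e) => {
    e.lastError = errMsg;
    if (opts.incrementRetry !== false) e.retryCount += 1; // default: true
  });
}
```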

Local validation

  • Ran full outbound recovery tests:
    • pnpm exec vitest run src/infra/outbound/outbound.test.ts
  • Result: 58/58 passed.
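Complementing that suite, a hypothetical vitest sketch (not taken from outbound.test.ts) of the property the recovery tests need to pin down: the attempt must be persisted even when delivery throws mid-flight.

```
import { describe, expect, it } from "vitest";

type Entry = { id: string; retryCount: number; lastAttemptAt?: number };

// Fixed ordering: persist the attempt first, then deliver (which may crash).
async function recoverOne(
  entry: Entry,
  mark: (e: Entry) => Promise<void>,
  deliver: (e: Entry) => Promise<void>,
): Promise<void> {
  await mark(entry);
  await deliver(entry);
}

describe("recovery ordering", () => {
  it("records the attempt even when delivery crashes", async () => {
    const entry: Entry = { id: "e1", retryCount: 0 };
    const persisted: Entry[] = [];
    await expect(
      recoverOne(
        entry,
        async (e) => {
          e.retryCount += 1;
          e.lastAttemptAt = Date.now();
          persisted.push({ ...e }); // stands in for the on-disk write
        },
        async () => {
          throw new Error("socket died mid-send"); // simulated crash
        },
      ),
    ).rejects.toThrow("socket died mid-send");
    expect(persisted[0]?.retryCount).toBe(1);
    expect(persisted[0]?.lastAttemptAt).toBeDefined();
  });
});
```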

Fixes #35959.

@openclaw-barnacle openclaw-barnacle Bot added the scripts (Repository scripts) and size: S labels Mar 9, 2026
@KimGLee KimGLee closed this Mar 9, 2026
@KimGLee KimGLee deleted the fix/35959-recovery-replay-loop-r2 branch March 9, 2026 00:04
@greptile-apps
Contributor

greptile-apps Bot commented Mar 9, 2026

Greptile Summary

This PR fixes a real crash-recovery replay-loop bug in src/infra/outbound/delivery-queue.ts by introducing markDeliveryAttempt, which pre-persists incremented retry metadata to disk before the delivery attempt is made. This ensures that a process crash mid-delivery leaves the queue entry with an incremented retryCount and a lastAttemptAt timestamp, so the entry is subject to normal backoff on the next restart instead of immediately replaying. The failDelivery function is updated to accept { incrementRetry: false } to avoid double-counting when the pre-persist already recorded the attempt. The >= to > boundary change for MAX_RETRIES is correct given the new pre-increment semantics. CI gains an artifact-hygiene gate.

  • Logic bug in failure path: When markDeliveryAttempt silently fails (filesystem error) and the subsequent deliver call also throws, failDelivery is called with { incrementRetry: false }. Because the pre-persist did not succeed, retryCount never advances on disk. If the filesystem issue persists across restarts, the entry can loop indefinitely without progressing toward the MAX_RETRIES eviction boundary. The fix is to track whether markDeliveryAttempt succeeded and pass { incrementRetry: !attemptMarked } to failDelivery.
  • CI docs_only guard removal: The changed-scope job and the macos job both lose their needs.docs-scope.outputs.docs_only != 'true' condition. This is safe only if ci-changed-scope.mjs reliably outputs run_macos = false for documentation-only changes; worth a smoke-check if docs-only PRs start triggering macOS runners unexpectedly.
  • The new scripts/check-no-binary-artifacts.mjs script and the artifact-hygiene CI job are clean and straightforward.

Confidence Score: 3/5

  • The core approach is sound but contains a logic gap: if markDeliveryAttempt silently fails and delivery also fails, failDelivery is invoked with incrementRetry: false, leaving retryCount permanently stalled and potentially causing an indefinite retry loop.
  • Score of 3 reflects that the PR's primary intent (prevent crash-replay loops) is correct and well-tested, but the introduced logic gap around markDeliveryAttempt failure + delivery failure creates a new reliability issue in an edge case that the existing tests do not cover.
  • src/infra/outbound/delivery-queue.ts — specifically the recoverPendingDeliveries function's inner try/catch block for markDeliveryAttempt and the downstream failDelivery call.

Last reviewed commit: 2690b0c

Comment on lines 354 to 392
@@ -360,7 +386,7 @@ export async function recoverPendingDeliveries(opts: {
       continue;
     }
     try {
-      await failDelivery(entry.id, errMsg, opts.stateDir);
+      await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });
     } catch {
       // Best-effort update.
     }

incrementRetry: false applied even when markDeliveryAttempt failed

When markDeliveryAttempt throws and its error is silently swallowed, retryCount on disk was never incremented. If deliver then also throws, failDelivery is called with { incrementRetry: false }, leaving retryCount unchanged. On the next restart the entry re-enters the loop with the same count, backoff is computed against the same number, and — if the filesystem keeps failing — the entry can loop indefinitely without ever progressing toward the MAX_RETRIES guard.

To make the incrementRetry decision conditional on whether the pre-persist actually succeeded:

```
      let attemptMarked = false;
      try {
        await markDeliveryAttempt(entry.id, opts.stateDir);
        attemptMarked = true;
      } catch {
        // Best-effort pre-attempt persistence.
      }
      await opts.deliver({
        cfg: opts.cfg,
        channel: entry.channel,
        to: entry.to,
        accountId: entry.accountId,
        payloads: entry.payloads,
        threadId: entry.threadId,
        replyToId: entry.replyToId,
        bestEffort: entry.bestEffort,
        gifPlayback: entry.gifPlayback,
        silent: entry.silent,
        mirror: entry.mirror,
        skipQueue: true, // Prevent re-enqueueing during recovery
      });
      await ackDelivery(entry.id, opts.stateDir);
      recovered += 1;
      opts.log.info(`Recovered delivery ${entry.id} to ${entry.channel}:${entry.to}`);
    } catch (err) {
      const errMsg = err instanceof Error ? err.message : String(err);
      if (isPermanentDeliveryError(errMsg)) {
        opts.log.warn(`Delivery ${entry.id} hit permanent error — moving to failed/: ${errMsg}`);
        try {
          await moveToFailed(entry.id, opts.stateDir);
        } catch (moveErr) {
          opts.log.error(`Failed to move entry ${entry.id} to failed/: ${String(moveErr)}`);
        }
        failed += 1;
        continue;
      }
      try {
        await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: !attemptMarked });
      } catch {
        // Best-effort update.
      }
```

This ensures retryCount always advances when a delivery attempt fails, regardless of whether the pre-persist succeeded.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2690b0cb1c


     }
     try {
-      await failDelivery(entry.id, errMsg, opts.stateDir);
+      await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });

P2: Increment retries when pre-attempt persistence fails

recoverPendingDeliveries swallows errors from markDeliveryAttempt, but the failure path always calls failDelivery(..., { incrementRetry: false }). If the pre-attempt write fails (for example a transient filesystem write/rename error) and deliver then throws, retryCount is not incremented at all, so the entry can be retried immediately on the next restart and reintroduce the replay-loop behavior this change is trying to prevent. The retry increment should only be disabled when the pre-attempt mark actually succeeded.



Labels

scripts (Repository scripts), size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Telegram provider stale-socket restart causes stuck message replay loop

1 participant