fix(outbound): persist recovery attempt before replay delivery #40374

KimGLee wants to merge 7 commits into openclaw:main from
Conversation
Greptile Summary

This PR fixes a real crash-recovery replay-loop bug in `recoverPendingDeliveries`.
Confidence Score: 3/5
Last reviewed commit: 2690b0c
```diff
@@ -360,7 +386,7 @@ export async function recoverPendingDeliveries(opts: {
         continue;
       }
       try {
-        await failDelivery(entry.id, errMsg, opts.stateDir);
+        await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });
       } catch {
         // Best-effort update.
       }
```
**`incrementRetry: false` applied even when `markDeliveryAttempt` failed**

When `markDeliveryAttempt` throws and its error is silently swallowed, `retryCount` on disk is never incremented. If `deliver` then also throws, `failDelivery` is called with `{ incrementRetry: false }`, leaving `retryCount` unchanged. On the next restart the entry re-enters the loop with the same count, backoff is computed against the same number, and, if the filesystem keeps failing, the entry can loop indefinitely without ever progressing toward the `MAX_RETRIES` guard.
To fix this, make the `incrementRetry` decision conditional on whether the pre-persist actually succeeded:

```ts
let attemptMarked = false;
try {
  await markDeliveryAttempt(entry.id, opts.stateDir);
  attemptMarked = true;
} catch {
  // Best-effort pre-attempt persistence.
}
try {
  await opts.deliver({
    cfg: opts.cfg,
    channel: entry.channel,
    to: entry.to,
    accountId: entry.accountId,
    payloads: entry.payloads,
    threadId: entry.threadId,
    replyToId: entry.replyToId,
    bestEffort: entry.bestEffort,
    gifPlayback: entry.gifPlayback,
    silent: entry.silent,
    mirror: entry.mirror,
    skipQueue: true, // Prevent re-enqueueing during recovery
  });
  await ackDelivery(entry.id, opts.stateDir);
  recovered += 1;
  opts.log.info(`Recovered delivery ${entry.id} to ${entry.channel}:${entry.to}`);
} catch (err) {
  const errMsg = err instanceof Error ? err.message : String(err);
  if (isPermanentDeliveryError(errMsg)) {
    opts.log.warn(`Delivery ${entry.id} hit permanent error — moving to failed/: ${errMsg}`);
    try {
      await moveToFailed(entry.id, opts.stateDir);
    } catch (moveErr) {
      opts.log.error(`Failed to move entry ${entry.id} to failed/: ${String(moveErr)}`);
    }
    failed += 1;
    continue;
  }
  try {
    await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: !attemptMarked });
  } catch {
    // Best-effort update.
  }
}
```

This ensures `retryCount` always advances when a delivery attempt fails, regardless of whether the pre-persist succeeded.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2690b0cb1c
```diff
       }
       try {
-        await failDelivery(entry.id, errMsg, opts.stateDir);
+        await failDelivery(entry.id, errMsg, opts.stateDir, { incrementRetry: false });
```
**Increment retries when pre-attempt persistence fails**

`recoverPendingDeliveries` swallows errors from `markDeliveryAttempt`, but the failure path always calls `failDelivery(..., { incrementRetry: false })`. If the pre-attempt write fails (for example, a transient filesystem write/rename error) and `deliver` then throws, `retryCount` is not incremented at all, so the entry can be retried immediately on the next restart and reintroduce the replay-loop behavior this change is trying to prevent. The retry increment should only be disabled when the pre-attempt mark actually succeeded.
Summary
Is #35959 real?
Yes. This is a real reliability bug.
Root cause
During startup recovery (`recoverPendingDeliveries`), queued entries were delivered first, and retry metadata was only updated on an explicit catch/failure. That means if the process/provider dies mid-delivery (e.g. a stale-socket restart) after send side effects but before the queue ack/fail update, the entry remains in the queue with:

- `retryCount = 0`
- an unchanged `lastAttemptAt`

On the next restart, the same entry is immediately eligible again, causing repeated replays and potential duplicate sends (loop behavior).
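To make the crash window concrete, here is a reconstructed sketch of the pre-fix ordering (simplified, not the actual source; it reuses the names from the diff above, and the surrounding recovery loop is implied):

```ts
// Pre-fix: nothing is persisted before deliver(), so a crash inside
// deliver() leaves retryCount/lastAttemptAt on disk exactly as they were.
try {
  // Crash window: send side effects can occur here before any state update.
  await opts.deliver({ /* entry fields as in the snippet above */ skipQueue: true });
  await ackDelivery(entry.id, opts.stateDir); // never reached after a crash
} catch (err) {
  const errMsg = err instanceof Error ? err.message : String(err);
  await failDelivery(entry.id, errMsg, opts.stateDir); // also never reached after a crash
}
```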
Fix details
In `src/infra/outbound/delivery-queue.ts`:

- Added `markDeliveryAttempt(id, stateDir)`: increments `retryCount` and sets `lastAttemptAt = Date.now()`.
- In `recoverPendingDeliveries`, `markDeliveryAttempt(...)` is now called before `deliver(...)`.
- Extended `failDelivery(...)` to support `opts.incrementRetry` (default `true`).
- The recovery failure path calls `failDelivery(..., { incrementRetry: false })` to avoid a double increment, since the attempt was already recorded.

This guarantees that a crash/restart after a replay attempt leaves persisted retry/backoff state, preventing immediate infinite replay loops. A sketch of the two helpers follows.
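For orientation, a minimal sketch of what the two persistence helpers might look like. `readQueueEntry`, `writeQueueEntry`, and the `QueueEntry` fields beyond `retryCount`/`lastAttemptAt` are illustrative assumptions, not the actual source in `src/infra/outbound/delivery-queue.ts`:

```ts
interface QueueEntry {
  retryCount: number;
  lastAttemptAt?: number;
  lastError?: string;
}

// Assumed storage helpers, stand-ins for the module's real persistence layer.
declare function readQueueEntry(id: string, stateDir: string): Promise<QueueEntry>;
declare function writeQueueEntry(id: string, entry: QueueEntry, stateDir: string): Promise<void>;

// Persist the attempt *before* delivery, so a mid-delivery crash still
// advances retry/backoff state on the next restart.
export async function markDeliveryAttempt(id: string, stateDir: string): Promise<void> {
  const entry = await readQueueEntry(id, stateDir);
  entry.retryCount += 1;
  entry.lastAttemptAt = Date.now();
  await writeQueueEntry(id, entry, stateDir);
}

export async function failDelivery(
  id: string,
  errMsg: string,
  stateDir: string,
  opts: { incrementRetry?: boolean } = {},
): Promise<void> {
  const { incrementRetry = true } = opts; // default preserves the old behavior
  const entry = await readQueueEntry(id, stateDir);
  if (incrementRetry) entry.retryCount += 1; // skipped when the attempt was pre-persisted
  entry.lastError = errMsg;
  entry.lastAttemptAt = Date.now();
  await writeQueueEntry(id, entry, stateDir);
}
```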
Local validation
```
pnpm exec vitest run src/infra/outbound/outbound.test.ts
```

Fixes #35959.