
fix: increase scheduler error tolerance for transient Redis outages#881

Merged
alexluong merged 2 commits into main from fix/scheduler-retry-resilience on Apr 27, 2026

Conversation

@alexluong
Collaborator

@alexluong alexluong commented Apr 25, 2026

Summary

  • Increase maxConsecutiveErrors from 5 to 10 and maxErrorBackoff from 5s to 15s
  • Previously, ~3s of Redis/Dragonfly downtime permanently killed the retrymq worker with no recovery path; the scheduler now tolerates roughly a minute of downtime
  • go-redis handles connection pool reconnection internally, so retrying ReceiveMessage is sufficient to recover from transient outages (see the sketch below)
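Below is a minimal sketch of the receive loop this change describes, assuming hypothetical names throughout (Scheduler, Queue, Run, and the two constants) rather than the actual Outpost code; it counts consecutive receive errors, backs off exponentially up to the cap, and relies on go-redis reconnecting its connection pool under the hood.

```go
// Illustrative sketch only — Scheduler, Queue, and the constant names are
// assumptions, not the actual Outpost types.
package scheduler

import (
	"context"
	"log"
	"time"
)

const (
	maxConsecutiveErrors = 10               // raised from 5
	maxErrorBackoff      = 15 * time.Second // raised from 5s
)

// Queue stands in for the Redis-backed retry queue client.
type Queue interface {
	ReceiveMessage(ctx context.Context) (string, error)
}

type Scheduler struct {
	queue Queue
}

// Run polls the queue. go-redis reconnects its connection pool internally,
// so retrying ReceiveMessage is enough to ride out a transient outage.
func (s *Scheduler) Run(ctx context.Context, handle func(string)) error {
	consecutiveErrors := 0
	backoff := time.Second
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		msg, err := s.queue.ReceiveMessage(ctx)
		if err != nil {
			consecutiveErrors++
			if consecutiveErrors >= maxConsecutiveErrors {
				return err // give up only after the full tolerance window
			}
			log.Printf("scheduler: receive error (%d/%d), retrying in %s: %v",
				consecutiveErrors, maxConsecutiveErrors, backoff, err)
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoff):
			}
			// Exponential backoff, capped at maxErrorBackoff.
			backoff *= 2
			if backoff > maxErrorBackoff {
				backoff = maxErrorBackoff
			}
			continue
		}
		consecutiveErrors = 0
		backoff = time.Second
		handle(msg)
	}
}
```

Resetting the counter and backoff after a successful receive is what turns a brief outage into a non-event rather than a permanently dead worker.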

Test plan

  • Existing scheduler_test.go passes
  • Verify backoff timing in logs during a simulated Redis disconnect (an illustrative test sketch follows below)
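
One illustrative way to exercise that without a live Redis is a fake queue that fails a few times and then recovers; this sketch reuses the hypothetical Scheduler/Queue types from above and is not the actual scheduler_test.go:

```go
package scheduler

import (
	"context"
	"errors"
	"testing"
)

// flakyQueue fails a few times and then recovers, simulating a brief outage.
type flakyQueue struct{ failures int }

func (q *flakyQueue) ReceiveMessage(ctx context.Context) (string, error) {
	if q.failures > 0 {
		q.failures--
		return "", errors.New("connection refused") // simulated Redis outage
	}
	return "msg-1", nil
}

func TestRunRecoversFromTransientErrors(t *testing.T) {
	s := &Scheduler{queue: &flakyQueue{failures: 3}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var received []string
	err := s.Run(ctx, func(msg string) {
		received = append(received, msg)
		cancel() // stop the loop after the first successful receive
	})
	if !errors.Is(err, context.Canceled) {
		t.Fatalf("expected context.Canceled, got %v", err)
	}
	if len(received) != 1 || received[0] != "msg-1" {
		t.Fatalf("unexpected messages: %v", received)
	}
}
```

With three injected failures the loop waits roughly 1s + 2s + 4s before recovering, which is the backoff progression the test plan suggests checking in the logs.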

🤖 Generated with Claude Code

alexluong and others added 2 commits April 24, 2026 22:07
The logmq consumer was incorrectly using DeliveryMaxConcurrency instead
of LogMaxConcurrency, causing LOG_MAX_CONCURRENCY to be ignored and
Pub/Sub to cap unacked messages at the delivery limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
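
For illustration only, the fix described in this commit amounts to reading the log-specific setting when sizing the logmq consumer; the field and function names below are assumptions, not the actual Outpost configuration API:

```go
// Hypothetical config shape; only the DeliveryMaxConcurrency /
// LogMaxConcurrency distinction mirrors the commit message above.
package outpostcfg

type Config struct {
	DeliveryMaxConcurrency int // DELIVERY_MAX_CONCURRENCY
	LogMaxConcurrency      int // LOG_MAX_CONCURRENCY
}

// logConsumerConcurrency returns the unacked-message cap for the logmq
// Pub/Sub subscriber.
func logConsumerConcurrency(cfg Config) int {
	// Before the fix, the delivery limit was used here, so
	// LOG_MAX_CONCURRENCY was silently ignored:
	//   return cfg.DeliveryMaxConcurrency
	return cfg.LogMaxConcurrency
}
```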
… outages

Previously the scheduler gave up after 5 consecutive errors (~3s), permanently
killing the retrymq worker with no recovery path. A brief Dragonfly Cloud restart
would take down retries across all outpost-cloud deployments until containers
were manually restarted.

Increase to 15 errors with 60s backoff cap (~7 min tolerance window).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexluong alexluong force-pushed the fix/scheduler-retry-resilience branch from ee47766 to e847146 on April 25, 2026 16:48
@alexluong alexluong merged commit 90827bc into main Apr 27, 2026
2 checks passed
@alexluong alexluong deleted the fix/scheduler-retry-resilience branch April 27, 2026 08:57
alexluong added a commit that referenced this pull request May 13, 2026

Previously the consumer gave up after 5 consecutive receive errors with
a 5s backoff cap (~3s total tolerance), permanently killing the worker
with no recovery path. A brief broker hiccup (e.g. GCP OAuth/DNS blip,
managed broker restart) was enough to take down logmq/deliverymq workers
across deployments until containers were manually restarted.

Mirrors the same fix applied to the retrymq scheduler in #881. Increase
to 10 errors with 15s backoff cap (~1 min tolerance window).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>