
fix: increase scheduler error tolerance for transient Redis outages#881

Merged
alexluong merged 2 commits into main from fix/scheduler-retry-resilience on Apr 27, 2026

Conversation

@alexluong
Collaborator

@alexluong alexluong commented Apr 25, 2026

Summary

  • Increase maxConsecutiveErrors from 5 to 10 and maxErrorBackoff from 5s to 15s
  • Previously, ~3s of Redis/Dragonfly downtime permanently killed the retrymq worker with no recovery path; the scheduler now tolerates roughly a minute of downtime
  • go-redis handles connection pool reconnection internally, so retrying ReceiveMessage is sufficient to recover from transient outages (see the sketch below)
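Below is a minimal sketch of the receive loop this change describes, assuming hypothetical names throughout (Scheduler, Queue, Run, and the two constants) rather than the actual Outpost code; it counts consecutive receive errors, backs off exponentially up to the cap, and relies on go-redis reconnecting its connection pool under the hood.

```go
// Illustrative sketch only — Scheduler, Queue, and the constant names are
// assumptions, not the actual Outpost types.
package scheduler

import (
	"context"
	"log"
	"time"
)

const (
	maxConsecutiveErrors = 10               // raised from 5
	maxErrorBackoff      = 15 * time.Second // raised from 5s
)

// Queue stands in for the Redis-backed retry queue client.
type Queue interface {
	ReceiveMessage(ctx context.Context) (string, error)
}

type Scheduler struct {
	queue Queue
}

// Run polls the queue. go-redis reconnects its connection pool internally,
// so retrying ReceiveMessage is enough to ride out a transient outage.
func (s *Scheduler) Run(ctx context.Context, handle func(string)) error {
	consecutiveErrors := 0
	backoff := time.Second
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		msg, err := s.queue.ReceiveMessage(ctx)
		if err != nil {
			consecutiveErrors++
			if consecutiveErrors >= maxConsecutiveErrors {
				return err // give up only after the full tolerance window
			}
			log.Printf("scheduler: receive error (%d/%d), retrying in %s: %v",
				consecutiveErrors, maxConsecutiveErrors, backoff, err)
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoff):
			}
			// Exponential backoff, capped at maxErrorBackoff.
			backoff *= 2
			if backoff > maxErrorBackoff {
				backoff = maxErrorBackoff
			}
			continue
		}
		consecutiveErrors = 0
		backoff = time.Second
		handle(msg)
	}
}
```

Resetting the counter and backoff after a successful receive is what turns a brief outage into a non-event rather than a permanently dead worker.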

Test plan

  • Existing scheduler_test.go passes
  • Verify backoff timing in logs during a simulated Redis disconnect (an illustrative test sketch follows below)
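
One illustrative way to exercise that without a live Redis is a fake queue that fails a few times and then recovers; this sketch reuses the hypothetical Scheduler/Queue types from above and is not the actual scheduler_test.go:

```go
package scheduler

import (
	"context"
	"errors"
	"testing"
)

// flakyQueue fails a few times and then recovers, simulating a brief outage.
type flakyQueue struct{ failures int }

func (q *flakyQueue) ReceiveMessage(ctx context.Context) (string, error) {
	if q.failures > 0 {
		q.failures--
		return "", errors.New("connection refused") // simulated Redis outage
	}
	return "msg-1", nil
}

func TestRunRecoversFromTransientErrors(t *testing.T) {
	s := &Scheduler{queue: &flakyQueue{failures: 3}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var received []string
	err := s.Run(ctx, func(msg string) {
		received = append(received, msg)
		cancel() // stop the loop after the first successful receive
	})
	if !errors.Is(err, context.Canceled) {
		t.Fatalf("expected context.Canceled, got %v", err)
	}
	if len(received) != 1 || received[0] != "msg-1" {
		t.Fatalf("unexpected messages: %v", received)
	}
}
```

With three injected failures the loop waits roughly 1s + 2s + 4s before recovering, which is the backoff progression the test plan suggests checking in the logs.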

🤖 Generated with Claude Code

alexluong and others added 2 commits April 24, 2026 22:07
The logmq consumer was incorrectly using DeliveryMaxConcurrency instead
of LogMaxConcurrency, causing LOG_MAX_CONCURRENCY to be ignored and
Pub/Sub to cap unacked messages at the delivery limit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
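
For illustration only, the fix described in this commit amounts to reading the log-specific setting when sizing the logmq consumer; the field and function names below are assumptions, not the actual Outpost configuration API:

```go
// Hypothetical config shape; only the DeliveryMaxConcurrency /
// LogMaxConcurrency distinction mirrors the commit message above.
package outpostcfg

type Config struct {
	DeliveryMaxConcurrency int // DELIVERY_MAX_CONCURRENCY
	LogMaxConcurrency      int // LOG_MAX_CONCURRENCY
}

// logConsumerConcurrency returns the unacked-message cap for the logmq
// Pub/Sub subscriber.
func logConsumerConcurrency(cfg Config) int {
	// Before the fix, the delivery limit was used here, so
	// LOG_MAX_CONCURRENCY was silently ignored:
	//   return cfg.DeliveryMaxConcurrency
	return cfg.LogMaxConcurrency
}
```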
… outages

Previously the scheduler gave up after 5 consecutive errors (~3s), permanently
killing the retrymq worker with no recovery path. A brief Dragonfly Cloud restart
would take down retries across all outpost-cloud deployments until containers
were manually restarted.

Increase to 15 errors with 60s backoff cap (~7 min tolerance window).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexluong alexluong force-pushed the fix/scheduler-retry-resilience branch from ee47766 to e847146 on April 25, 2026 16:48
@alexluong alexluong merged commit 90827bc into main Apr 27, 2026
2 checks passed
@alexluong alexluong deleted the fix/scheduler-retry-resilience branch April 27, 2026 08:57
alexluong added a commit that referenced this pull request May 13, 2026

Previously the consumer gave up after 5 consecutive receive errors with
a 5s backoff cap (~3s total tolerance), permanently killing the worker
with no recovery path. A brief broker hiccup (e.g. GCP OAuth/DNS blip,
managed broker restart) was enough to take down logmq/deliverymq workers
across deployments until containers were manually restarted.

Mirrors the same fix applied to the retrymq scheduler in #881. Increase
to 10 errors with 15s backoff cap (~1 min tolerance window).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>