fix: increase consumer error tolerance for transient infra outages#900

Merged
alexluong merged 1 commit into main from fix/consumer-error-tolerance on May 13, 2026

Conversation

@alexluong
Collaborator

Summary

Bump consumer retry defaults from 5 errors / 5s cap (~3s tolerance) to 10 errors / 15s cap (~1 min tolerance).

The consumer (internal/consumer/consumer.go) drives logmq-consumer, deliverymq-consumer, and other Pub/Sub-style workers. On a subscription.Receive() error it retries with exponential backoff, then permanently kills the worker once maxConsecutiveErrors is reached — the supervisor does not restart it.

With the previous defaults, a ~3-second blip (e.g. transient GCP OAuth token fetch / DNS timeout, brief managed broker restart) was enough to take a worker down until the container was manually restarted. Example real failure mode:

oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token":
dial tcp ...:443: i/o timeout
... max consecutive receive errors reached (5) ...

This mirrors the exact fix applied to the retrymq scheduler in #881, where a brief Dragonfly Cloud restart was killing retry workers across all deployments. Same pattern, different file — the consumer side never got the matching bump.

New retry schedule:

| Error | Backoff   | Cumulative |
|-------|-----------|------------|
| 1     | 200ms     | 0.2s       |
| 2     | 400ms     | 0.6s       |
| 3     | 800ms     | 1.4s       |
| 4     | 1.6s      | 3.0s       |
| 5     | 3.2s      | 6.2s       |
| 6     | 6.4s      | 12.6s      |
| 7     | 12.8s     | 25.4s      |
| 8     | 15s (cap) | 40.4s      |
| 9     | 15s (cap) | 55.4s      |
| 10    | 15s (cap) | 70.4s ← worker dies (~1 min total) |

Defaults remain overridable via WithMaxConsecutiveErrors / WithInitialBackoff / WithMaxBackoff.

Test plan

  • go build ./internal/consumer/...
  • go test ./internal/consumer/... — passes (existing tests pass explicit values via WithMaxConsecutiveErrors, unaffected by default change)

🤖 Generated with Claude Code

Previously the consumer gave up after 5 consecutive receive errors with
a 5s backoff cap (~3s total tolerance), permanently killing the worker
with no recovery path. A brief broker hiccup (e.g. GCP OAuth/DNS blip,
managed broker restart) was enough to take down logmq/deliverymq workers
across deployments until containers were manually restarted.

Mirrors the same fix applied to the retrymq scheduler in #881. Increase
to 10 errors with 15s backoff cap (~1 min tolerance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexluong alexluong merged commit 2aa4cec into main May 13, 2026
2 checks passed
@alexluong alexluong deleted the fix/consumer-error-tolerance branch May 13, 2026 17:04