fix: increase consumer error tolerance for transient infra outages#900

Merged
alexluong merged 1 commit into main from fix/consumer-error-tolerance on May 13, 2026

Conversation

@alexluong
Collaborator

Summary

Bump consumer retry defaults from 5 errors / 5s cap (~3s tolerance) to 10 errors / 15s cap (~1 min tolerance).

The consumer (internal/consumer/consumer.go) drives logmq-consumer, deliverymq-consumer, and other Pub/Sub-style workers. On a subscription.Receive() error it retries with exponential backoff, then permanently kills the worker once maxConsecutiveErrors is reached — the supervisor does not restart it.

With the previous defaults, a ~3-second blip (e.g. transient GCP OAuth token fetch / DNS timeout, brief managed broker restart) was enough to take a worker down until the container was manually restarted. Example real failure mode:

oauth2: cannot fetch token: Post "https://oauth2.googleapis.com/token":
dial tcp ...:443: i/o timeout
... max consecutive receive errors reached (5) ...

This mirrors the exact fix applied to the retrymq scheduler in #881, where a brief Dragonfly Cloud restart was killing retry workers across all deployments. Same pattern, different file — the consumer side never got the matching bump.

New retry schedule:

| Error | Backoff   | Cumulative |
|-------|-----------|------------|
| 1     | 200ms     | 0.2s       |
| 2     | 400ms     | 0.6s       |
| 3     | 800ms     | 1.4s       |
| 4     | 1.6s      | 3.0s       |
| 5     | 3.2s      | 6.2s       |
| 6     | 6.4s      | 12.6s      |
| 7     | 12.8s     | 25.4s      |
| 8     | 15s (cap) | 40.4s      |
| 9     | 15s (cap) | 55.4s      |
| 10    | 15s (cap) | 70.4s ← worker dies (~1 min total) |

Defaults remain overridable via WithMaxConsecutiveErrors / WithInitialBackoff / WithMaxBackoff.

Test plan

  • go build ./internal/consumer/...
  • go test ./internal/consumer/... — passes (existing tests pass explicit values via WithMaxConsecutiveErrors, unaffected by default change)

🤖 Generated with Claude Code

Previously the consumer gave up after 5 consecutive receive errors with
a 5s backoff cap (~3s total tolerance), permanently killing the worker
with no recovery path. A brief broker hiccup (e.g. GCP OAuth/DNS blip,
managed broker restart) was enough to take down logmq/deliverymq workers
across deployments until containers were manually restarted.

Mirrors the same fix applied to the retrymq scheduler in #881. Increase
to 10 errors with 15s backoff cap (~1 min tolerance window).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexluong alexluong merged commit 2aa4cec into main May 13, 2026
2 checks passed
@alexluong alexluong deleted the fix/consumer-error-tolerance branch May 13, 2026 17:04