docs(webhooks): document retry policy (exponential backoff + jitter + configurable cap) #55
… configurable cap)

Updates Delivery and retries to reflect the formula-driven retry schedule landing in photon-hq/spectrum-webhook: exponential backoff with ±50% jitter, four operator-tunable knobs (initial delay, growth factor, per-attempt cap, total attempts), and an explicit, honest statement that there is no DLQ today.

Customer-facing changes:
- Contract bullet now mentions jitter and the ~9.3s worst-case sleep budget alongside the existing ~6.2s expected case.
- Retry-policy table grows an "Actual jittered range" column so customers know to expect [100ms, 300ms), [500ms, 1500ms), etc.
- New "Why jitter matters" section explains the thundering-herd failure mode that motivates the design.
- New "Tunable on our side" section lists the four env knobs as operator-only, with a pointer for customers who need different retry behaviour (issue / Discord).
- Dedupe-key advice now points explicitly at X-Spectrum-Webhook-Id plus payload.message.id rather than naming the variables locally.

PR body intentionally references the webhook-side PR placeholder; the cross-link gets filled in after both PRs are open.

Co-authored-by: Cursor <cursoragent@cursor.com>
Caution: Review failed. The pull request is closed.
Walkthrough

The PR updates Spectrum's webhook delivery contract documentation to specify 6 delivery attempts (increased from 4) with exponential backoff and jitter, clarifies which HTTP status codes and network conditions trigger retries, refines idempotency deduplication guidance, and updates failure-mode scenarios to reflect the new timing model.

Changes

Webhook Delivery Contract Update
Schedule example and budget numbers updated to match the new defaults shipping in spectrum-webhook#36. Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Documents the new exponential-backoff-plus-jitter retry policy for webhook delivery, adds an explicit "no DLQ today" statement, introduces an operator-tunable knobs section, and clarifies the idempotency/dedupe-key surface. Doc-only change paired with photon-hq/spectrum-webhook#36.
Changes:
- Rewrites the contract bullets and retry table for the new schedule (now documented as 6 attempts with jitter, ~39s worst-case sleep budget) and adds an "Actual jittered range" column.
- Adds two new sections, Why jitter matters and Tunable on our side, and annotates the sequence-diagram wait edges with `±50%`.
- Updates idempotency prose to call out `X-Spectrum-Webhook-Id + payload.message.id` explicitly and reframes the TTL guidance and failure-modes row to reflect the new retry window.
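
As an illustration of the dedupe-key surface named above, a receiver could track the `X-Spectrum-Webhook-Id` header together with `payload.message.id` in a seen-set that expires after a TTL. This is a minimal sketch, not the documented guidance: `DedupeStore`, `first_time`, the in-memory dict, and the 600-second TTL are all hypothetical, and the TTL is only assumed to be longer than the retry window.

```python
from typing import Dict, Optional, Tuple
import time


class DedupeStore:
    """In-memory seen-set with a TTL (hypothetical helper; a real receiver
    would more likely use a shared store such as a database or cache)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._seen: Dict[Tuple[str, str], float] = {}

    def first_time(self, webhook_id: str, message_id: str,
                   now: Optional[float] = None) -> bool:
        """Return True the first time a (webhook id, message id) pair is seen
        within the TTL, and False for a redelivery of the same pair."""
        now = time.monotonic() if now is None else now
        # Drop keys whose TTL has elapsed so the set does not grow forever.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = (webhook_id, message_id)
        if key in self._seen:
            return False
        self._seen[key] = now + self.ttl_seconds
        return True


def dedupe_key(headers: Dict[str, str], payload: dict) -> Tuple[str, str]:
    """Build the composite key from the header and the body field."""
    return headers["X-Spectrum-Webhook-Id"], payload["message"]["id"]


# Assumed TTL of 600 seconds, chosen only to exceed the retry window.
store = DedupeStore(ttl_seconds=600)
key = dedupe_key({"X-Spectrum-Webhook-Id": "wh_example"},
                 {"message": {"id": "msg_example"}})
if store.first_time(*key):
    pass  # process the event; a retried delivery of the same pair is skipped
```

A handler built this way can safely return `2xx` for a duplicate, since the event has already been processed.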
| - **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side. | ||
| - **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok. | ||
| - **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed. | ||
| - **Bounded budget.** 30-second per-attempt timeout, with ~6.2 seconds of backoff sleeps between attempts. If your server is still down after the final attempt, the event is logged and the worker moves on. | ||
| - **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today. |
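
The contract bullets above imply a three-way decision after each attempt. The sketch below restates that rule in code; it is not the worker's actual implementation. `classify_response` is a hypothetical name, `None` is an assumed stand-in for a network error or worker-side timeout, and the handling of `1xx`/`3xx` is a guess because the text does not cover those codes.

```python
from typing import Optional


def classify_response(status: Optional[int]) -> str:
    """Map one delivery attempt's outcome to 'delivered', 'retry', or 'fatal'.

    status is the HTTP status code, or None when the attempt failed with a
    network error or hit the per-attempt timeout (assumed convention).
    """
    if status is None:
        return "retry"        # network errors and timeouts are retried
    if 200 <= status < 300:
        return "delivered"    # any 2xx ends the delivery
    if 500 <= status < 600 or status in (408, 429):
        return "retry"        # 5xx, 408, and 429 are retried with backoff
    if 400 <= status < 500:
        return "fatal"        # other 4xx codes fail fast
    return "fatal"            # 1xx/3xx: not specified in the text; assumption


for code in (200, 503, 429, 404, None):
    print(code, classify_response(code))
```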
| ### Tunable on our side | ||
| The retry schedule is operator-configurable. The Photon team can adjust these knobs per environment to trade latency for durability — useful, for example, if a regulated workload needs to tolerate a longer outage than the default ~30s budget covers. The full set: |
| Y-->>W: 200 OK | ||
| Note over W: ✓ delivered after retry | ||
| ``` | ||
| ``` | ||
| The backoff *sleeps* total ~6.2 seconds (200ms + 1s + 5s). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless. | ||
| The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless. |
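
The budget figures in the paragraph above can be checked with a few lines of arithmetic. The sketch assumes a 200ms initial delay and a growth factor of 5, both inferred from the 200ms, 1s, 5s progression rather than stated in this excerpt; the 10-second cap, 6 attempts, 30-second timeout, and ±50% jitter come from the text. All constant and function names are illustrative.

```python
INITIAL_DELAY_MS = 200          # inferred from the schedule, not stated here
GROWTH_FACTOR = 5               # inferred from 200ms -> 1s -> 5s
PER_ATTEMPT_CAP_MS = 10_000     # "Per-attempt cap: 10 seconds"
TOTAL_ATTEMPTS = 6              # initial attempt + 5 retries
PER_ATTEMPT_TIMEOUT_MS = 30_000 # 30-second per-attempt timeout


def base_delays_ms() -> list:
    """Un-jittered sleep before each retry, capped at the per-attempt cap."""
    return [
        min(INITIAL_DELAY_MS * GROWTH_FACTOR ** i, PER_ATTEMPT_CAP_MS)
        for i in range(TOTAL_ATTEMPTS - 1)  # one sleep between attempts
    ]


delays = base_delays_ms()
expected_sleep_ms = sum(delays)             # jitter averages out to 1.0x
worst_sleep_ms = expected_sleep_ms * 3 // 2 # jitter ceiling is 1.5x
# Worst case: every attempt hangs to the timeout, plus the longest sleeps.
worst_wall_clock_ms = TOTAL_ATTEMPTS * PER_ATTEMPT_TIMEOUT_MS + worst_sleep_ms

print(delays)
print(expected_sleep_ms / 1000, "s expected sleep")
print(worst_sleep_ms / 1000, "s worst-case sleep")
print(round(worst_wall_clock_ms / 60_000, 2), "min worst-case wall clock")
```

Under these assumptions the sleeps come to 26.2 seconds on average and 39.3 seconds at the jitter ceiling, and the wall-clock worst case lands in the same range as the roughly 3.5 minutes quoted above.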
| | Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. | | ||
| | Total attempts | 6 (initial + 5 retries) | Higher values trade wall-clock latency for more retries against a flaky endpoint. | | ||
| These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA). |
Summary
Updates `webhooks/delivery.mdx` to describe the formula-driven retry schedule landing in photon-hq/spectrum-webhook#36: exponential backoff with ±50% jitter, four operator-tunable knobs, and an explicit, honest statement that there is no DLQ today.

This is a doc-only PR; the behavior change lives in the linked webhook PR.

What changed (customer-facing)

- … `[200ms, 1s, 5s]` to the millisecond; now they see `[100ms, 300ms)`, `[500ms, 1500ms)`, `[2.5s, 7.5s)` and learn to expect a window, not a point.
- … `~200ms (±50%)` and `~1s (±50%)` annotations on the wait edges so the picture matches the prose.
- … `X-Spectrum-Webhook-Id` (header) + `payload.message.id` (body) as the dedupe-key surface, and rewords the TTL guidance.

The page's existing tone, structure, and `<Note>`/`<Tip>`/`<Warning>` component usage are preserved.

Where in the docs repo

`webhooks/delivery.mdx`: the page rendered at https://photon.codes/docs/webhooks/delivery. This is a hand-written `.mdx` file at the repo root (the only ones that go through vellum's `.mdx.vel` template path are `docs-src/webhooks/events.mdx.vel`); the page was last edited as part of the URL-guard documentation work in #53.

Related PRs
Reviewer notes

- … `min(initial * base^i, maxDelay) * (0.5 + rand)`) but it's not what customers need: they need to know what range to expect for the next retry, not how the worker arrives at it. The AWS jitter post link is enough for the curious reader.
- … `X-Spectrum-Event-Id` header once that lands; for now it points at the existing `X-Spectrum-Webhook-Id + message.id` composite because that's what's actually shipping.
- … `pnpm lint` + `pnpm typecheck:docs` + `pnpm docs:generate` cleanly. No new lints, no rendered `.mdx` artifacts staged (they're gitignored on `main`).

Test plan

- `pnpm lint`: clean
- `pnpm typecheck:docs`: clean (all 130 code blocks across 3 typecheck configs)
- `pnpm docs:generate` (vellum build): clean, 32 templates rendered
- `mint dev` local preview: visual smoke check of the new section and table (not run; reviewer can spot-check during review)

Made with Cursor
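
For reviewers who want to see how the formula quoted in the reviewer notes, `min(initial * base^i, maxDelay) * (0.5 + rand)`, produces the jittered windows listed for customers, here is a minimal sketch. It is not the worker's code: the function name, parameter names, and default values (200ms initial delay, growth factor 5, 10-second cap) are assumptions chosen to match the schedule described in this PR, and `rand` is taken to be uniform on [0, 1).

```python
import random
from typing import Callable


def jittered_delay_ms(attempt_index: int,
                      initial_ms: float = 200,
                      base: float = 5,
                      max_delay_ms: float = 10_000,
                      rng: Callable[[], float] = random.random) -> float:
    """Sleep before retry number attempt_index (0-based), in milliseconds.

    The exponential delay is capped first, then scaled by a factor drawn
    from [0.5, 1.5), which is the +/-50% jitter.
    """
    capped = min(initial_ms * base ** attempt_index, max_delay_ms)
    return capped * (0.5 + rng())


# Window for each retry: [0.5 * capped, 1.5 * capped)
for i in range(5):
    low = jittered_delay_ms(i, rng=lambda: 0.0)
    centre = jittered_delay_ms(i, rng=lambda: 0.5)
    print(f"retry {i + 1}: from {low:.0f} ms, centred on {centre:.0f} ms")
```

Passing `rng` explicitly keeps the function deterministic in tests while defaulting to `random.random` in normal use.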