docs(webhooks): document retry policy (exponential backoff + jitter + configurable cap) by yanxue06 · Pull Request #55 · photon-hq/docs

Yan Xue (yanxue06) · 2026-05-27T01:01:59Z

Summary

Updates webhooks/delivery.mdx to describe the formula-driven retry schedule landing in photon-hq/spectrum-webhook#36 — exponential backoff with ±50% jitter, four operator-tunable knobs, and an explicit honest statement that there is no DLQ today.

This is a doc-only PR; the behavior change lives in the linked webhook PR.

What changed (customer-facing)

- Contract at a glance. "Up to 4 attempts" is now "up to 4 attempts by default, with exponential backoff plus jitter," and the bounded-budget bullet is updated to the new ~9.3s worst-case sleep total. Also adds an explicit "no dead-letter queue today" sentence so customers can plan accordingly.
- Retry-policy table grows a new "Actual jittered range" column. Customers used to look at our docs and see [200ms, 1s, 5s] to the millisecond; now they see [100ms, 300ms), [500ms, 1500ms), [2.5s, 7.5s) and learn to expect a window, not a point.
- New "Why jitter matters" section. A short paragraph on the thundering-herd failure mode the formula prevents, which is important context for a customer trying to understand why their retry timing isn't deterministic anymore.
- New "Tunable on our side" section. Lists the four env knobs (initial, base, max-delay, max-attempts) with their defaults and effects, framed as Photon-team-only with a pointer for customers who need different retry behavior (file an issue, ping Discord). This sets expectations without committing to per-customer overrides.
- Sequence diagram gets ~200ms (±50%) and ~1s (±50%) annotations on the wait edges so the picture matches the prose.
- "Be idempotent" code block is unchanged, but the surrounding prose now points explicitly at X-Spectrum-Webhook-Id (header) + payload.message.id (body) as the dedupe-key surface, and rewords the TTL guidance.
- "Failure modes" table row "Endpoint down for >6 seconds" → "Endpoint down for the full retry window (~6.5s default, more if you've requested tuning)", which is clearer about what the default is and that there's a tuning path.
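To make the "window, not a point" idea concrete, here is a minimal sketch of how a ±50% jitter turns each nominal delay into a half-open range. The function name `jitter_window` and its `spread` parameter are illustrative only and do not come from the spectrum-webhook code.

```python
def jitter_window(nominal_ms: float, spread: float = 0.5) -> tuple[float, float]:
    """Return the half-open [low, high) window for a nominal delay with +/-spread jitter."""
    return (nominal_ms * (1 - spread), nominal_ms * (1 + spread))


# Nominal delays of the three-retry schedule described in the bullets above.
for nominal in (200, 1000, 5000):
    low, high = jitter_window(nominal)
    print(f"{nominal} ms -> [{low:g} ms, {high:g} ms)")
```

The three windows match the ranges quoted for the "Actual jittered range" column.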

The page's existing tone, structure, and <Note> / <Tip> / <Warning> component usage are preserved.

Where in the docs repo

webhooks/delivery.mdx is the page rendered at https://photon.codes/docs/webhooks/delivery. It is a hand-written .mdx file at the repo root (the only file that goes through vellum's .mdx.vel template path is docs-src/webhooks/events.mdx.vel); the page was last edited as part of the URL-guard documentation work in #53.

Related PRs

Code change this documents: photon-hq/spectrum-webhook#36 (open, ready for review). The two PRs land together.

Reviewer notes

- The "Tunable on our side" section is the most opinionated piece: it intentionally documents the env vars by effect rather than by name, because customers can't set them directly. The names show up in the spectrum-webhook README (where Photon operators look) and in the webhook PR body, not here.
- I considered moving the formula into the page too (`min(initial * base^i, maxDelay) * (0.5 + rand)`), but it's not what customers need: they need to know what range to expect for the next retry, not how the worker arrives at it. The AWS jitter post link is enough for the curious reader.
- The "Be idempotent" prose could call out the forthcoming X-Spectrum-Event-Id header from PR #32 (Add Spectrum intro page and reorder nav tabs) once that lands; for now it points at the existing X-Spectrum-Webhook-Id + message.id composite because that's what's actually shipping.
- Pre-commit hook ran pnpm lint + pnpm typecheck:docs + pnpm docs:generate cleanly. No new lints, no rendered .mdx artifacts staged (they're gitignored on main).
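For reviewers who want to sanity-check the schedule against the formula quoted above, here is a minimal sketch. It is not the worker's implementation: the function name and parameter names are hypothetical, the 10-second cap comes from the "Per-attempt cap" row, and the 200 ms initial delay and growth factor of 5 are inferred from the documented 200ms → 1s → 5s progression rather than read from the env vars.

```python
import random


def retry_delay_ms(i, initial_ms=200.0, base=5.0, max_delay_ms=10_000.0, rand=random.random):
    """Delay before retry i (0-based): min(initial * base**i, max_delay) * (0.5 + rand).

    `rand` must return a float in [0, 1); it is injectable so the result can be
    checked deterministically. Defaults are assumptions, see the note above.
    """
    capped = min(initial_ms * base ** i, max_delay_ms)  # cap is applied before jitter
    return capped * (0.5 + rand())


# With rand fixed at 0.5 the jitter factor is exactly 1.0, which gives the nominal schedule.
nominal = [retry_delay_ms(i, rand=lambda: 0.5) for i in range(5)]
print(nominal)
```

Fixing `rand` at 0.0 gives the lower edge of each window; the upper edge is never reached because `rand` stays below 1.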

Test plan

- pnpm lint: clean
- pnpm typecheck:docs: clean (all 130 code blocks across 3 typecheck configs)
- pnpm docs:generate (vellum build): clean, 32 templates rendered
- mint dev local preview: visual smoke check of the new section and table (not run; reviewer can spot-check during review)

Made with Cursor

Summary by CodeRabbit

Documentation
- Webhook delivery attempts increased from 4 to 6 by default
- Added exponential backoff with jitter for retry logic
- Clarified status code handling for retried requests
- Updated per-attempt timeout to 30 seconds
- Refined idempotency deduplication guidance
- Updated failure mode and timing descriptions

… configurable cap)

Updates Delivery and retries to reflect the formula-driven retry schedule landing in photon-hq/spectrum-webhook: exponential backoff with ±50% jitter, four operator-tunable knobs (initial delay, growth factor, per-attempt cap, total attempts), and an explicit honest statement that there is no DLQ today.

Customer-facing changes:
- Contract bullet now mentions jitter and the ~9.3s worst-case sleep budget alongside the existing ~6.2s expected case.
- Retry-policy table grows an "Actual jittered range" column so customers know to expect [100ms, 300ms), [500ms, 1500ms), etc.
- New "Why jitter matters" section explains the thundering-herd failure mode that motivates the design.
- New "Tunable on our side" section lists the four env knobs as operator-only, with a pointer for customers who need different retry behaviour (issue / Discord).
- Dedupe-key advice now points explicitly at X-Spectrum-Webhook-Id plus payload.message.id rather than naming the variables locally.

PR body intentionally references the webhook-side PR placeholder; the cross-link gets filled in after both PRs are open.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai · 2026-05-27T01:02:06Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between fb979b0 and 0bb287a.

📒 Files selected for processing (1)

webhooks/delivery.mdx

Walkthrough

The PR updates Spectrum's webhook delivery contract documentation to specify 6 delivery attempts (increased from 4) with exponential backoff and jitter, clarifies which HTTP status codes and network conditions trigger retries, refines idempotency deduplication guidance, and updates failure-mode scenarios to reflect the new timing model.

Changes

Webhook Delivery Contract Update

Layer / File(s)	Summary
Retry policy and timing model `webhooks/delivery.mdx`	The contract-at-a-glance section increases default attempts to 6 with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and timeouts. The detailed retry section replaces earlier timing totals with full-jitter framing, adds a per-attempt delay/jitter table for attempts 2–6, revises per-attempt timeout guidance (30s default), documents operator-tunable knobs (initial delay, growth factor, per-attempt cap, total attempts), and updates status-code classification with `5xx` retriable and most `4xx` fatal.
Idempotency and deduplication `webhooks/delivery.mdx`	Idempotency guidance specifies the dedupe key as the combination of the `X-Spectrum-Webhook-Id` header plus an event-scoped payload identifier (example: `payload.message.id`). The dedupe-table TTL section justifies a 24–48 hour retention based on the retry budget being bounded to only a few minutes with jitter and per-attempt timeouts.
Failure scenarios documentation `webhooks/delivery.mdx`	The failure-modes table is updated to match the new contract: `503` recovery within ~30s, timeout-then-success scenarios marked as possibly processed twice, and endpoints down through the full retry window being dropped with updated timing wording (~30s default with "more if tuning requested").
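The dedupe guidance summarized in the idempotency row can be sketched on the receiver side roughly as follows. This is an in-memory illustration only, not code from the docs page: the class name `DedupeTable`, the `first_time` method, and the plain-dict shapes for headers and payload are assumptions, and a real handler would more likely back this with a database or cache and normalize header-name casing first.

```python
import time


class DedupeTable:
    """In-memory dedupe table keyed on (X-Spectrum-Webhook-Id, payload.message.id)."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds      # 24 hours, the low end of the 24-48 hour guidance
        self._clock = clock          # injectable so expiry can be tested
        self._seen = {}              # key -> expiry timestamp

    def first_time(self, headers, payload):
        """Return True the first time a delivery is seen, False for a redelivery."""
        now = self._clock()
        # Drop entries whose TTL has elapsed.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = (headers["X-Spectrum-Webhook-Id"], payload["message"]["id"])
        if key in self._seen:
            return False
        self._seen[key] = now + self._ttl
        return True
```

A handler would call `first_time(...)` before doing any side-effecting work and acknowledge with a `2xx` either way, so a retried delivery is accepted but not processed twice.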

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

photon-hq/docs#24: Both PRs update webhooks/delivery.mdx to define Spectrum's webhook delivery/retry contract (attempt counts, per-attempt timeout/backoff, and related failure/idempotency semantics).

Poem

🐰 Webhooks and Retries
Six attempts now, with jitter in tow,
Backoff exponential, a retry glow,
Idempotent safe with dedupe so keen,
The finest webhook contract you've seen! ✨

Copilot

Pull request overview

Documents the new exponential-backoff-plus-jitter retry policy for webhook delivery, adds an explicit "no DLQ today" statement, introduces an operator-tunable knobs section, and clarifies the idempotency/dedupe-key surface. Doc-only change paired with photon-hq/spectrum-webhook#36.

Changes:

- Rewrites the contract bullets and retry table for the new schedule (now documented as 6 attempts with jitter, ~39s worst-case sleep budget) and adds an "Actual jittered range" column.
- Adds two new sections, Why jitter matters and Tunable on our side, and annotates the sequence-diagram wait edges with ±50%.
- Updates idempotency prose to call out X-Spectrum-Webhook-Id + payload.message.id explicitly and reframes the TTL guidance and failure-modes row to reflect the new retry window.

+- **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side.
 - **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok.
 - **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed.
-- **Bounded budget.** 30-second per-attempt timeout, with ~6.2 seconds of backoff sleeps between attempts. If your server is still down after the final attempt, the event is logged and the worker moves on.
+- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
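
As a reading aid for the contract bullets above, the status-code handling can be sketched as a small classifier. This is an illustration of the documented behaviour, not the worker's code: the function name and return labels are hypothetical, and the "unspecified" branch covers codes (such as `1xx` and `3xx`) that the bullets do not mention.

```python
def classify_status(status: int) -> str:
    """Map an HTTP status code to the outcome described in the contract bullets."""
    if 200 <= status < 300:
        return "delivered"   # any 2xx ends the delivery
    if status in (408, 429) or 500 <= status < 600:
        return "retry"       # retried with backoff plus jitter
    if 400 <= status < 500:
        return "fatal"       # other 4xx: no further attempts
    return "unspecified"     # not covered by the documented contract


for code in (200, 503, 429, 404):
    print(code, classify_status(code))
```

Network errors and worker-side timeouts are also retried, but they never produce a status code, so they fall outside this function.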


+
+### Tunable on our side
+
+The retry schedule is operator-configurable. The Photon team can adjust these knobs per environment to trade latency for durability — useful, for example, if a regulated workload needs to tolerate a longer outage than the default ~30s budget covers. The full set:


  Y-->>W: 200 OK
  Note over W: ✓ delivered after retry
 ```



 ```

-The backoff *sleeps* total ~6.2 seconds (200ms + 1s + 5s). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
+The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
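
The budget figures in the updated paragraph can be re-derived with a few lines of arithmetic. The sketch below assumes the nominal schedule and the 30-second timeout quoted above; the constant and function names are illustrative only.

```python
# Nominal (pre-jitter) sleeps between the 6 attempts, in milliseconds.
NOMINAL_SLEEPS_MS = [200, 1_000, 5_000, 10_000, 10_000]
PER_ATTEMPT_TIMEOUT_MS = 30_000
JITTER_CEILING = 1.5  # +50%; an exclusive upper bound, so it is never quite reached


def sleep_budget_ms(sleeps=NOMINAL_SLEEPS_MS):
    """Return (average, upper bound) of the total backoff sleep in milliseconds."""
    average = sum(sleeps)  # jitter is centred on 1.0, so the mean equals the nominal sum
    return average, average * JITTER_CEILING


def worst_case_wall_clock_ms(sleeps=NOMINAL_SLEEPS_MS, timeout_ms=PER_ATTEMPT_TIMEOUT_MS):
    """Every attempt hangs to the timeout and every sleep hits the jitter ceiling."""
    attempts = len(sleeps) + 1
    return attempts * timeout_ms + sleep_budget_ms(sleeps)[1]


print(sleep_budget_ms())             # (26200, 39300.0)
print(worst_case_wall_clock_ms())    # 219300.0
```

That gives 26.2 s average and 39.3 s upper-bound sleep, and about 219 s (roughly 3.7 minutes) of worst-case wall-clock time, in the same range as the ~3.5 minutes quoted. Passing the old three-retry schedule `[200, 1_000, 5_000]` reproduces the earlier ~6.2 s and ~9.3 s figures.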


+| Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
+| Total attempts | 6 (initial + 5 retries) | Higher values trade wall-clock latency for more retries against a flaky endpoint. |
+
+These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA).


docs(webhooks): bump default retry attempts 4 → 6 (sync with #36)

0bb287a

Schedule example and budget numbers updated to match the new defaults shipping in spectrum-webhook#36. Co-authored-by: Cursor <cursoragent@cursor.com>

Yan Xue (yanxue06) marked this pull request as ready for review May 27, 2026 06:50

Copilot AI review requested due to automatic review settings May 27, 2026 06:50

Yan Xue (yanxue06) merged commit 2285f4c into main May 27, 2026
4 checks passed

Copilot AI reviewed May 27, 2026

coderabbitai Bot mentioned this pull request May 28, 2026

docs(webhooks): sync wire format + verification docs with shipped/about-to-ship behaviour #60

Merged

6 tasks

Yan Xue (yanxue06) merged 2 commits into main from docs/webhook-retry-policy
