
docs(webhooks): document retry policy (exponential backoff + jitter + configurable cap) #55

Merged
Yan Xue (yanxue06) merged 2 commits into main from docs/webhook-retry-policy on May 27, 2026


@yanxue06 Yan Xue (yanxue06) commented May 27, 2026

Summary

Updates webhooks/delivery.mdx to describe the formula-driven retry schedule landing in photon-hq/spectrum-webhook#36 — exponential backoff with ±50% jitter, four operator-tunable knobs, and an explicit honest statement that there is no DLQ today.

This is a doc-only PR; the behavior change lives in the linked webhook PR.

What changed (customer-facing)

  • Contract at a glance. "Up to 4 attempts" is now "up to 4 attempts by default, with exponential backoff plus jitter," and the bounded-budget bullet is updated to the new ~9.3s worst-case sleep total. Also adds an explicit "no dead-letter queue today" sentence so customers can plan accordingly.
  • Retry-policy table grows a new "Actual jittered range" column. Customers used to look at our docs and see [200ms, 1s, 5s] to the millisecond — now they see [100ms, 300ms), [500ms, 1500ms), [2.5s, 7.5s) and learn to expect a window, not a point.
  • New "Why jitter matters" section. A short paragraph on the thundering-herd failure mode the formula prevents — important context for the customer trying to understand why their retry timing isn't deterministic anymore.
  • New "Tunable on our side" section. Lists the four env knobs (initial, base, max-delay, max-attempts) with their defaults and effects, framed as Photon-team-only with a pointer for customers who need different retry behavior (file an issue, ping Discord). This sets expectations without committing to per-customer overrides.
  • Sequence diagram gets ~200ms (±50%) and ~1s (±50%) annotations on the wait edges so the picture matches the prose.
  • "Be idempotent" code block is unchanged but the surrounding prose now points explicitly at X-Spectrum-Webhook-Id (header) + payload.message.id (body) as the dedupe-key surface, and rewords the TTL guidance.
  • "Failure modes" table row "Endpoint down for >6 seconds" → "Endpoint down for the full retry window (~6.5s default, more if you've requested tuning)" — clearer about what the default is and that there's a tuning path.
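
The jittered windows and the ~9.3s worst-case sleep total quoted in the list above can be reproduced with a little arithmetic. This is a minimal sketch, assuming the nominal delays are 200ms, 1s, and 5s (the schedule this PR body describes) and that ±50% jitter means a half-open window from 0.5x to 1.5x of the nominal value; the names `NOMINAL_DELAYS_MS` and `jitter_window` are illustrative and do not come from the worker.

```python
# Nominal backoff delays (milliseconds) for the three retries described in
# the PR body. These values are taken from the text, not from the worker code.
NOMINAL_DELAYS_MS = [200, 1000, 5000]
JITTER = 0.5  # +/-50% jitter, as stated in the PR summary


def jitter_window(nominal_ms, jitter=JITTER):
    """Return the half-open [low, high) window for one backoff sleep."""
    return (nominal_ms * (1 - jitter), nominal_ms * (1 + jitter))


# One window per retry: [100, 300), [500, 1500), [2500, 7500) milliseconds.
windows = [jitter_window(d) for d in NOMINAL_DELAYS_MS]

# The worst-case sleep total is the sum of the window ceilings, in seconds.
worst_case_s = sum(high for _, high in windows) / 1000

for (low, high), nominal in zip(windows, NOMINAL_DELAYS_MS):
    print(f"nominal {nominal} ms -> [{low:.0f} ms, {high:.0f} ms)")
print(f"worst-case sleep total: {worst_case_s:.1f} s")
```

The ceilings 0.3s + 1.5s + 7.5s add up to the 9.3s figure, which is why the contract bullet quotes a worst case rather than a single fixed total.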

The page's existing tone, structure, and <Note> / <Tip> / <Warning> component usage are preserved.

Where in the docs repo

webhooks/delivery.mdx, the page rendered at https://photon.codes/docs/webhooks/delivery. This is a hand-written .mdx file at the repo root (the only file that goes through vellum's .mdx.vel template path is docs-src/webhooks/events.mdx.vel); the page was last edited as part of the URL-guard documentation work in #53.

Related PRs

  • Code change this documents: photon-hq/spectrum-webhook#36 (open, ready for review). The two PRs land together.

Reviewer notes

  • The "Tunable on our side" section is the most opinionated piece — it intentionally documents the env vars by effect rather than by name, because customers can't set them directly. The names show up in the spectrum-webhook README (where Photon operators look) and in the webhook PR body, not here.
  • I considered moving the formula into the page too (min(initial * base^i, maxDelay) * (0.5 + rand)) but it's not what customers need — they need to know what range to expect for the next retry, not how the worker arrives at it. The AWS jitter post link is enough for the curious reader.
  • The "Be idempotent" prose could call out the forthcoming X-Spectrum-Event-Id header from PR #32 (Add Spectrum intro page and reorder nav tabs) once that lands; for now it points at the existing X-Spectrum-Webhook-Id + message.id composite because that's what's actually shipping.
  • Pre-commit hook ran pnpm lint + pnpm typecheck:docs + pnpm docs:generate cleanly. No new lints, no rendered .mdx artifacts staged (they're gitignored on main).
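
For readers who do want to see the formula from the reviewer notes in action, here is a minimal sketch of `min(initial * base^i, maxDelay) * (0.5 + rand)`. The function name `retry_delay_ms` and its parameter names are hypothetical, and the default values (200ms initial delay, growth factor 5, 10s cap) are assumptions chosen to match the schedule quoted elsewhere in this PR; the worker's actual code may differ.

```python
import random


def retry_delay_ms(retry_index, initial_ms=200.0, base=5.0,
                   max_delay_ms=10_000.0, rand=random.random):
    """Jittered sleep before retry number `retry_index` (0-based).

    Implements min(initial * base^i, maxDelay) * (0.5 + rand), where rand()
    is uniform in [0, 1), so the result lies in [0.5x, 1.5x) of the capped
    nominal delay. All default values are assumptions for illustration.
    """
    capped = min(initial_ms * base ** retry_index, max_delay_ms)
    return capped * (0.5 + rand())


# Example: one sampled delay per retry of a 6-attempt schedule (5 retries).
for i in range(5):
    print(f"retry {i + 1}: sleep {retry_delay_ms(i):.0f} ms")
```

Passing a fixed `rand` (for example `lambda: 0.5`) returns the nominal, un-jittered delay, which is convenient when checking the schedule by hand.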

Test plan

  • pnpm lint — clean
  • pnpm typecheck:docs — clean (all 130 code blocks across 3 typecheck configs)
  • pnpm docs:generate (vellum build) — clean, 32 templates rendered
  • mint dev local preview — visual smoke check of the new section and table (not run; reviewer can spot-check during review)

Made with Cursor

Summary by CodeRabbit

  • Documentation
    • Webhook delivery attempts increased from 4 to 6 by default
    • Added exponential backoff with jitter for retry logic
    • Clarified status code handling for retried requests
    • Updated per-attempt timeout to 30 seconds
    • Refined idempotency deduplication guidance
    • Updated failure mode and timing descriptions


… configurable cap)

Updates Delivery and retries to reflect the formula-driven retry
schedule landing in photon-hq/spectrum-webhook — exponential backoff
with ±50% jitter, four operator-tunable knobs (initial delay, growth
factor, per-attempt cap, total attempts), and an explicit honest
statement that there is no DLQ today.

Customer-facing changes:
- Contract bullet now mentions jitter and the ~9.3s worst-case sleep
  budget alongside the existing ~6.2s expected case.
- Retry-policy table grows an "Actual jittered range" column so
  customers know to expect [100ms, 300ms), [500ms, 1500ms), etc.
- New "Why jitter matters" section explains the thundering-herd
  failure mode that motivates the design.
- New "Tunable on our side" section lists the four env knobs as
  operator-only, with a pointer for customers who need different
  retry behaviour (issue / Discord).
- Dedupe-key advice now points explicitly at X-Spectrum-Webhook-Id
  plus payload.message.id rather than naming the variables locally.

PR body intentionally references the webhook-side PR placeholder; the
cross-link gets filled in after both PRs are open.

Co-authored-by: Cursor <cursoragent@cursor.com>

coderabbitai Bot commented May 27, 2026

Caution

Review failed

The pull request is closed.


📥 Commits

Reviewing files that changed from the base of the PR and between fb979b0 and 0bb287a.

📒 Files selected for processing (1)
  • webhooks/delivery.mdx



Walkthrough

The PR updates Spectrum's webhook delivery contract documentation to specify 6 delivery attempts (increased from 4) with exponential backoff and jitter, clarifies which HTTP status codes and network conditions trigger retries, refines idempotency deduplication guidance, and updates failure-mode scenarios to reflect the new timing model.

Changes

Webhook Delivery Contract Update

| Layer / File(s) | Summary |
| --- | --- |
| Retry policy and timing model (webhooks/delivery.mdx) | The contract-at-a-glance section increases default attempts to 6 with exponential backoff plus jitter on 5xx, 408, 429, network errors, and timeouts. The detailed retry section replaces earlier timing totals with full-jitter framing, adds a per-attempt delay/jitter table for attempts 2–6, revises per-attempt timeout guidance (30s default), documents operator-tunable knobs (initial delay, growth factor, per-attempt cap, total attempts), and updates status-code classification with 5xx retriable and most 4xx fatal. |
| Idempotency and deduplication (webhooks/delivery.mdx) | Idempotency guidance specifies the dedupe key as the combination of the X-Spectrum-Webhook-Id header plus an event-scoped payload identifier (example: payload.message.id). The dedupe-table TTL section justifies a 24–48 hour retention based on the retry budget being bounded to only a few minutes with jitter and per-attempt timeouts. |
| Failure scenarios documentation (webhooks/delivery.mdx) | The failure-modes table is updated to match the new contract: 503 recovery within ~30s, timeout-then-success scenarios marked as possibly processed twice, and endpoints down through the full retry window being dropped with updated timing wording (~30s default with "more if tuning requested"). |
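
The dedupe guidance summarized above (a composite key of the X-Spectrum-Webhook-Id header plus payload.message.id, retained for a bounded TTL) could look roughly like this on the receiving side. This is a sketch only: the `DedupeTable` class, its in-memory dictionary, and the 24-hour default are illustrative assumptions, and a real receiver would more likely keep the keys in a shared store so that all of its instances see the same table.

```python
import time


class DedupeTable:
    """In-memory record of already-processed (webhook id, message id) pairs."""

    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._seen = {}      # key -> expiry timestamp

    def first_time(self, webhook_id, message_id):
        """Return True if this delivery has not been seen within the TTL."""
        now = self._clock()
        # Drop entries whose retention window has elapsed.
        self._seen = {k: exp for k, exp in self._seen.items() if exp > now}
        key = (webhook_id, message_id)
        if key in self._seen:
            return False  # duplicate delivery, e.g. a retry after a timeout
        self._seen[key] = now + self._ttl
        return True


# Hypothetical handler usage: `headers` and `payload` stand in for whatever
# the receiving web framework provides.
table = DedupeTable()
headers = {"X-Spectrum-Webhook-Id": "wh_123"}
payload = {"message": {"id": "msg_456"}}
if table.first_time(headers["X-Spectrum-Webhook-Id"], payload["message"]["id"]):
    print("process event")
else:
    print("duplicate, acknowledge with 2xx and skip")
```

Acknowledging a duplicate with a 2xx is the natural choice here, since any 2xx ends the retry sequence.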

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • photon-hq/docs#24: Both PRs update webhooks/delivery.mdx to define Spectrum's webhook delivery/retry contract (attempt counts, per-attempt timeout/backoff, and related failure/idempotency semantics).



Schedule example and budget numbers updated to match the new defaults
shipping in spectrum-webhook#36.

Co-authored-by: Cursor <cursoragent@cursor.com>
@yanxue06 Yan Xue (yanxue06) marked this pull request as ready for review May 27, 2026 06:50
Copilot AI review requested due to automatic review settings May 27, 2026 06:50
@yanxue06 Yan Xue (yanxue06) merged commit 2285f4c into main May 27, 2026
4 checks passed

Copilot AI left a comment


Pull request overview

Documents the new exponential-backoff-plus-jitter retry policy for webhook delivery, adds an explicit "no DLQ today" statement, introduces an operator-tunable knobs section, and clarifies the idempotency/dedupe-key surface. Doc-only change paired with photon-hq/spectrum-webhook#36.

Changes:

  • Rewrites the contract bullets and retry table for the new schedule (now documented as 6 attempts with jitter, ~39s worst-case sleep budget) and adds an "Actual jittered range" column.
  • Adds two new sections — Why jitter matters and Tunable on our side — and annotates the sequence-diagram wait edges with ±50%.
  • Updates idempotency prose to call out X-Spectrum-Webhook-Id + payload.message.id explicitly and reframes the TTL guidance and failure-modes row to reflect the new retry window.


Comment thread webhooks/delivery.mdx
Comment on lines +10 to +13
- **Strong retry behaviour.** Up to 6 attempts per event by default, with exponential backoff plus jitter on `5xx`, `408`, `429`, network errors, and worker-side timeouts. The vast majority of deliveries land on attempt 1; the retries are there for the occasional bad minute on your side.
- **Fast acknowledgement.** Any `2xx` ends it — the worker stops as soon as your server says ok.
- **Fast permanent failure.** Other `4xx` codes (`400`/`401`/`404`/etc.) are treated as fatal — we don't waste your retry budget when the request will never succeed.
- **Bounded budget.** 30-second per-attempt timeout, with ~6.2 seconds of backoff sleeps between attempts. If your server is still down after the final attempt, the event is logged and the worker moves on.
- **Bounded budget.** 30-second per-attempt timeout, with up to ~39 seconds of backoff sleeps between attempts (jittered). If your server is still down after the final attempt, the event is logged and the worker moves on — there is no dead-letter queue today.
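
The status-code rules in the bullets above can be summarized as a small classifier. This is a sketch of the documented contract, not the worker's code: the function name `classify_status` and the returned labels are hypothetical, and codes the excerpt does not mention (1xx, 3xx) are deliberately reported as unspecified rather than guessed at.

```python
def classify_status(status):
    """Map an HTTP status code to the outcome described in the contract."""
    if 200 <= status <= 299:
        return "delivered"    # any 2xx ends the retry sequence
    if status in (408, 429) or 500 <= status <= 599:
        return "retry"        # retriable: 5xx plus 408 and 429
    if 400 <= status <= 499:
        return "fatal"        # other 4xx codes: no further attempts
    return "unspecified"      # not covered by the excerpt quoted here


for code in (200, 204, 400, 404, 408, 429, 500, 503):
    print(code, classify_status(code))
```

Network errors and worker-side timeouts are also retried according to the contract, but they carry no status code, so they fall outside this function.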
Comment thread webhooks/delivery.mdx

### Tunable on our side

The retry schedule is operator-configurable. The Photon team can adjust these knobs per environment to trade latency for durability — useful, for example, if a regulated workload needs to tolerate a longer outage than the default ~30s budget covers. The full set:
Comment thread webhooks/delivery.mdx
Y-->>W: 200 OK
Note over W: ✓ delivered after retry
```

Comment thread webhooks/delivery.mdx
```

The backoff *sleeps* total ~6.2 seconds (200ms + 1s + 5s). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~2 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
The backoff *sleeps* sum to ~26.2 seconds in the average case (200ms + 1s + 5s + 10s + 10s) and ~39.3 seconds in the worst case (jitter ceiling). Wall-clock time also includes per-attempt network time, bounded by the 30-second per-attempt timeout: a healthy delivery finishes in milliseconds, while a worst case where every attempt hangs to the timeout can run up to ~3.5 minutes before the worker gives up. It stops as soon as it gets a 2xx or determines further retries are pointless.
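
The budget figures in the updated paragraph can be checked with a short calculation. This sketch assumes the five nominal sleeps listed in the text (200ms, 1s, 5s, 10s, 10s), a jitter ceiling of 1.5x, and the 30-second per-attempt timeout; the variable names are illustrative.

```python
# Nominal sleeps between the six attempts, in seconds (from the text above).
NOMINAL_SLEEPS_S = [0.2, 1.0, 5.0, 10.0, 10.0]
PER_ATTEMPT_TIMEOUT_S = 30.0
JITTER_CEILING = 1.5  # +/-50% jitter tops out just below 1.5x nominal

attempts = len(NOMINAL_SLEEPS_S) + 1  # initial attempt plus five retries

average_sleep_s = sum(NOMINAL_SLEEPS_S)            # about 26.2 s
worst_sleep_s = JITTER_CEILING * average_sleep_s   # about 39.3 s

# Worst case: every attempt hangs until the per-attempt timeout fires.
timeout_total_s = attempts * PER_ATTEMPT_TIMEOUT_S
wall_clock_avg_s = timeout_total_s + average_sleep_s
wall_clock_worst_s = timeout_total_s + worst_sleep_s

print(f"sleeps: average {average_sleep_s:.1f} s, worst {worst_sleep_s:.1f} s")
print(f"wall clock if every attempt times out: "
      f"{wall_clock_avg_s / 60:.1f} to {wall_clock_worst_s / 60:.1f} minutes")
```

Under these assumptions the all-timeouts case lands between roughly 3.4 and 3.7 minutes, in line with the ~3.5 minutes quoted in the paragraph.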
Comment thread webhooks/delivery.mdx
| Per-attempt cap | 10 seconds | Ceiling applied to every computed delay before jitter, so the curve can't run away. |
| Total attempts | 6 (initial + 5 retries) | Higher values trade wall-clock latency for more retries against a flaky endpoint. |

These are *internal* env vars on the spectrum-webhook worker — customers can't set them per-webhook today. If you have a use case that needs different retry behaviour (more retries, longer ceiling), reach out and we'll discuss tuning the deployment-wide defaults or adding a per-project override. Open an issue on the [docs repo](https://github.com/photon-hq/docs) or message us in the [Discord](https://discord.gg/4c3VJzDfNA).