fix(auth): harden email verification pipeline so silent failures fail loud#329
Conversation
… loud The verification email pipeline previously swallowed both boot-time misconfiguration (missing RESEND_API_KEY / sandbox EMAIL_FROM) and runtime send failures into console.error. This let a regression go unnoticed for weeks: signups completed in DB but no verification email was ever sent in production. This change keeps the existing behavior on the happy path and makes failure observable + recoverable across instances. Changes - /api/health: returns 503 in production-like environments when Resend is unconfigured or still on the resend.dev sandbox, so platform monitors and deploy gates fail closed. - lib/monitoring.ts: structured error reporter that emits a tagged `[ALERT]` JSON line and forwards to Sentry when `@sentry/nextjs` is installed and `SENTRY_DSN` is set. Dependency stays optional via dynamic import; no-op when absent. - lib/email.ts: send failures and boot-time misconfig now flow through reportError (verification + password-reset paths). `validateEmailConfig` is now pure (safe for /api/health to call repeatedly); a new `logEmailConfigWarnings` hook runs once on module load. - lib/auth/resend-cooldown.ts: replaces the in-process `Map` used to throttle resend with a Redis-backed (@naap/cache) cooldown keyed by hashed email + purpose, so the cooldown holds across Vercel Fluid Compute instances. Falls back to in-memory when Redis is unavailable. - Tests: 14 new unit tests (monitoring, email config, cooldown memory + Redis paths) + Playwright @Pre-release smoke that asserts /api/health reports email configured and register/resend-verification do not 5xx. - Docs: env.local.example clarifies that RESEND_API_KEY and EMAIL_FROM are required in production and that the sandbox sender only delivers to the Resend account owner. Co-authored-by: Cursor <cursoragent@cursor.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
|
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
📝 WalkthroughWalkthroughThis PR introduces structured error monitoring, cross-instance email resend throttling, and email configuration health checks. It adds a monitoring module with ChangesEmail monitoring, cooldown, and health integration
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@apps/web-next/src/lib/__tests__/monitoring.test.ts`:
- Around line 11-19: The tests mutate process.env.SENTRY_DSN and don't restore
it, risking cross-test contamination; modify the setup/teardown around
__resetMonitoringForTests() so beforeEach captures the current
process.env.SENTRY_DSN (e.g., save to a local variable) and afterEach restores
it (reassign or delete if originally undefined), and ensure consoleErrorSpy is
still restored in afterEach; update the beforeEach/afterEach surrounding the
existing __resetMonitoringForTests, process.env.SENTRY_DSN mutation, and
consoleErrorSpy usage accordingly.
In `@apps/web-next/src/lib/auth/resend-cooldown.ts`:
- Around line 94-103: tryAcquireCooldown currently does a non-atomic
cache.cacheGet followed by cache.cacheSet which allows race conditions; change
it to perform an atomic acquire (SET NX with TTL) instead of GET+SETEX: use the
cache client's atomic "set if not exists" with expiration (Redis SET ... NX EX)
or reuse the project's distributed lock/SWR lock utility to attempt to create
the key once and only set TTL when acquired (referencing tryAcquireCooldown,
cache.cacheGet, cache.cacheSet, PREFIX, ttlSeconds, ttlMs). If the atomic set
succeeds return true, otherwise return false; preserve existing TTL calculation
and error handling while removing the non-atomic check-then-set path.
In `@apps/web-next/tests/auth-email-smoke.spec.ts`:
- Around line 23-26: The skip guard currently uses baseURL.includes('localhost')
and misses other local addresses; update the test.skip checks (the three
occurrences using test.skip(!!baseURL && baseURL.includes('localhost'), ...)) to
detect local environments by parsing baseURL (new URL(baseURL).hostname) and
skipping if the hostname is 'localhost', '127.0.0.1' or '::1' (or otherwise
matches a local host check), so all three occurrences consistently skip for
those hostnames.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f6f1a1cf-d453-4fe1-9785-20d0ce7e4205
📒 Files selected for processing (10)
apps/web-next/.env.local.exampleapps/web-next/src/app/api/health/route.tsapps/web-next/src/lib/__tests__/email.test.tsapps/web-next/src/lib/__tests__/monitoring.test.tsapps/web-next/src/lib/api/auth.tsapps/web-next/src/lib/auth/__tests__/resend-cooldown.test.tsapps/web-next/src/lib/auth/resend-cooldown.tsapps/web-next/src/lib/email.tsapps/web-next/src/lib/monitoring.tsapps/web-next/tests/auth-email-smoke.spec.ts
Three valid findings: 1. resend-cooldown: switch from non-atomic cacheGet+cacheSet to an atomic Redis SET NX PX round-trip via @naap/cache's getRedis(). Removes the tiny TOCTOU window where two concurrent acquirers on different instances could both think the slot was free. 2. monitoring.test.ts: save and restore process.env.SENTRY_DSN in beforeEach/afterEach so the SENTRY_DSN mutation doesn't leak into sibling tests in the same vitest worker. 3. auth-email-smoke: broaden the local-environment skip guard to cover 127.0.0.1, 0.0.0.0, and [::1] in addition to "localhost", parsing baseURL via URL.hostname instead of substring match. Also updates the cooldown unit tests to exercise the new SET NX PX path plus the throw-then-fallback-to-memory path. Co-authored-by: Cursor <cursoragent@cursor.com>
Review cycle statusCode review (CodeRabbit):
Copilot:
CI: 18 success / 5 skipped / 0 failures (incl. Lint & TypeCheck, Build, Quality Gates, CodeQL, Vercel preview deploy). Tests added: 15 vitest unit tests + 3 Playwright Local verification (run on
Merge gate: needs |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Why
We just shipped operational fixes for a missing-Resend-key incident on
operator.livepeer.org: user signups completed in DB but verification emails were never sent in production. Diagnosing it took a while because every failure mode along the email path was silently swallowed intoconsole.error. This PR makes those same failure modes loud, and replaces a Vercel Fluid Compute foot-gun (per-instance cooldownMap) with a Redis-backed implementation.Summary
VERCEL_ENV=productionorDEPLOY_ENV=production) whenRESEND_API_KEYis missing orEMAIL_FROMis still on theresend.devsandbox sender. Catches the exact regression we just lived through, on the next deploy. (apps/web-next/src/app/api/health/route.ts)lib/monitoring.tsemits a tagged[ALERT] {...json...}log line — pickable up by Vercel Log Drains today (Datadog/Logflare pattern-match), and additionally forwards toSentry.captureExceptionwhen@sentry/nextjsis installed andSENTRY_DSNis set. The Sentry SDK stays optional via dynamic import; no-op when absent. (apps/web-next/src/lib/monitoring.ts)console.error. Both verification and password-reset paths inlib/email.ts, plus the silent.catchinregister(), route throughreportErrorwith tagged area + kind so they're trivially alertable. (apps/web-next/src/lib/email.ts,apps/web-next/src/lib/api/auth.ts)validateEmailConfigis now pure so/api/healthcan call it per-request without spamming logs. Boot-time warnings moved to a newlogEmailConfigWarnings()hook that runs once on module load and reports the production-critical case viareportError. (apps/web-next/src/lib/email.ts)Map. Newlib/auth/resend-cooldown.tsuses@naap/cache(Redis) with hashed-email keys + per-purpose namespacing, so the cooldown holds across serverless instances on Fluid Compute. Falls back to bounded in-memory when Redis is unavailable (local dev parity). (apps/web-next/src/lib/auth/resend-cooldown.ts,apps/web-next/src/lib/api/auth.ts)@pre-releasesmoke (tests/auth-email-smoke.spec.ts) that asserts/api/healthreports email configured andregister/resend-verificationdo not 5xx. Picked up by the nightly e2e-ga workflow against production / preview..env.local.examplenow spells out thatRESEND_API_KEY+ a verified-domainEMAIL_FROMare required in production, and that the sandbox sender only delivers to the Resend account owner.Why these four, why now
Background: the missing-email incident root-caused to no
RESEND_API_KEYin the Vercel project. The code was wired correctly; the failure modes were just invisible. Each item below removes one specific way that the failure could hide again:/api/healthreturns 503[ALERT]log + Sentry hookconsole.errorMapreset on every cold startTest plan
Already run locally:
npx tsc --noEmit— no new errors from these files (pre-existing errors onmainfor@naap/crypto,@naap/plugin-sdk, etc. are unchanged)npx vitest run— 662/663 pass; the single failure (integration.test.tsfrom PR feat(leaderboard): union membership strategy + improved data sources UI #325) is a pre-existing flaky external-network test, not touched herenpx next lintover changed files — clean@pre-releasesmoke against a preview deploy — will run in CI; locally skipped because it requires a deployed envManual production smoke we just ran end-to-end as part of the incident response (independent of this PR):
operator.livepeer.orgverified for sendingvercel env add RESEND_API_KEY+EMAIL_FROMset on production + previewnaap-platform-741o2zvczlive onoperator.livepeer.org"last_event": "delivered")Risk / rollback
@sentry/nextjsis optional via dynamic import — adding it is a separate PR.tryAcquireCooldownis called (existing-email registration flow). Failure to reach Redis falls back to in-memory — strictly no-worse than today.Follow-ups (not in this PR)
@sentry/nextjsand addinstrumentation.tsso the dynamic-import path inlib/monitoring.tslights up. Tracking separately so this PR stays scoped./api/health/servicessimilarly so it surfaces Redis + Resend reachability, not just upstream service/healthz.Made with Cursor
Summary by CodeRabbit
New Features
Documentation
Tests