
fix(auth): harden email verification pipeline so silent failures fail loud #329

Merged
seanhanca merged 2 commits into main from fix/auth-email-hardening on May 21, 2026

Conversation


@seanhanca seanhanca commented May 21, 2026

Why

We just shipped operational fixes for a missing-Resend-key incident on operator.livepeer.org: user signups completed in DB but verification emails were never sent in production. Diagnosing it took a while because every failure mode along the email path was silently swallowed into console.error. This PR makes those same failure modes loud, and replaces a Vercel Fluid Compute foot-gun (per-instance cooldown Map) with a Redis-backed implementation.

Summary

  • /api/health fails closed on broken email config. Returns 503 in production-like environments (VERCEL_ENV=production or DEPLOY_ENV=production) when RESEND_API_KEY is missing or EMAIL_FROM is still on the resend.dev sandbox sender. Catches the exact regression we just lived through, on the next deploy. (apps/web-next/src/app/api/health/route.ts)
  • Structured error reporter ready for Sentry. New lib/monitoring.ts emits a tagged [ALERT] {...json...} log line that Vercel Log Drains can pick up today (Datadog/Logflare pattern-match), and additionally forwards to Sentry.captureException when @sentry/nextjs is installed and SENTRY_DSN is set. The Sentry SDK stays optional via dynamic import; no-op when absent. (apps/web-next/src/lib/monitoring.ts)
  • Email failures now go through the reporter, not console.error. Both verification and password-reset paths in lib/email.ts, plus the silent .catch in register(), route through reportError with tagged area + kind so they're trivially alertable. (apps/web-next/src/lib/email.ts, apps/web-next/src/lib/api/auth.ts)
  • validateEmailConfig is now pure so /api/health can call it per-request without spamming logs. Boot-time warnings moved to a new logEmailConfigWarnings() hook that runs once on module load and reports the production-critical case via reportError. (apps/web-next/src/lib/email.ts)
  • Resend cooldown moved off the in-process Map. New lib/auth/resend-cooldown.ts uses @naap/cache (Redis) with hashed-email keys + per-purpose namespacing, so the cooldown holds across serverless instances on Fluid Compute. Falls back to bounded in-memory when Redis is unavailable (local dev parity). (apps/web-next/src/lib/auth/resend-cooldown.ts, apps/web-next/src/lib/api/auth.ts)
  • Tests:
    • 14 new unit tests covering monitoring (structured payload, log-injection guard, Sentry no-op), email config validation (pure, sandbox detection), and cooldown (memory fallback, Redis path, case/whitespace normalization, purpose isolation).
    • New Playwright @pre-release smoke (tests/auth-email-smoke.spec.ts) that asserts /api/health reports email configured and register / resend-verification do not 5xx. Picked up by the nightly e2e-ga workflow against production / preview.
  • Docs. .env.local.example now spells out that RESEND_API_KEY + a verified-domain EMAIL_FROM are required in production, and that the sandbox sender only delivers to the Resend account owner.
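The fail-closed rule from the first bullet can be sketched as a pure function of the environment. This is a minimal illustration, not the code in route.ts: `evaluateEmailHealth` and `isProductionLike` are hypothetical names, the response fields follow the `configured` / `warnings` / `criticalInThisEnv` shape mentioned in the review walkthrough, and detecting the sandbox by an `@resend.dev` substring is an assumption.

```typescript
// Hypothetical sketch of the /api/health email check; the real logic may differ.
type Env = Record<string, string | undefined>;

interface EmailHealth {
  configured: boolean;
  warnings: string[];
  criticalInThisEnv: boolean;
}

// "Production-like" as described in the PR: either variable set to "production".
function isProductionLike(env: Env): boolean {
  return env.VERCEL_ENV === "production" || env.DEPLOY_ENV === "production";
}

function evaluateEmailHealth(env: Env): { status: 200 | 503; email: EmailHealth } {
  const warnings: string[] = [];
  if (!env.RESEND_API_KEY) {
    warnings.push("RESEND_API_KEY is missing");
  }
  // Assumption: the sandbox sender is recognised by its resend.dev domain.
  if ((env.EMAIL_FROM ?? "").toLowerCase().includes("@resend.dev")) {
    warnings.push("EMAIL_FROM still uses the resend.dev sandbox sender");
  }
  const configured = warnings.length === 0;
  const criticalInThisEnv = isProductionLike(env);
  // Fail closed only where a broken email path actually hurts users.
  const status = !configured && criticalInThisEnv ? 503 : 200;
  return { status, email: { configured, warnings, criticalInThisEnv } };
}
```

Because the function only reads its argument and never logs, it is safe to call on every health request, which is the property the `validateEmailConfig` bullet relies on.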

Why these four, why now

Background: the missing-email incident root-caused to no RESEND_API_KEY in the Vercel project. The code was wired correctly; the failure modes were just invisible. Each item below removes one specific way that the failure could hide again:

| Hardening | What it would have done last time |
| --- | --- |
| /api/health returns 503 | Deploy gate / uptime monitor would have flagged the broken state on the deploy that first lost the key |
| Structured [ALERT] log + Sentry hook | First failed send would have paged on-call instead of being lost in console.error |
| Redis cooldown | Existing 15-min throttle on duplicate signups now actually works on Vercel; today's Map reset on every cold start |
| Playwright smoke | Nightly e2e-ga catches a misconfigured deployment within 24h regardless of monitoring coverage |
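For the structured [ALERT] row, a rough sketch of what a single-line, injection-safe log record could look like. `reportError` and `ErrorContext` are names from this PR, but the signature, the payload fields, the `formatAlertLine` and `sanitize` helpers, and the example `kind` value are all assumptions made for illustration.

```typescript
// Hypothetical sketch; the payload shape in lib/monitoring.ts may differ.
interface ErrorContext {
  area: string; // e.g. "email"
  kind: string; // e.g. "verification_send_failed" (illustrative value)
}

// Collapse control characters (including CR/LF) and cap the length so that
// attacker-influenced text cannot start a forged log line.
function sanitize(value: string, maxLength = 200): string {
  return value.replace(/[\u0000-\u001f\u007f]+/g, " ").slice(0, maxLength);
}

function formatAlertLine(error: unknown, context: ErrorContext, now: Date = new Date()): string {
  const message = error instanceof Error ? error.message : String(error);
  const payload = {
    ts: now.toISOString(),
    area: sanitize(context.area),
    kind: sanitize(context.kind),
    message: sanitize(message, 500),
  };
  // One tagged line per failure, so a log drain can match on the "[ALERT] " prefix.
  return `[ALERT] ${JSON.stringify(payload)}`;
}

function reportError(error: unknown, context: ErrorContext): void {
  console.error(formatAlertLine(error, context));
}
```

Keeping the formatter separate from the `console.error` call makes the payload easy to assert on in unit tests without spying on the console.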

Test plan

Already run locally:

  • npx tsc --noEmit — no new errors from these files (pre-existing errors on main for @naap/crypto, @naap/plugin-sdk, etc. are unchanged)
  • npx vitest run — 662/663 pass; the single failure (integration.test.ts from PR feat(leaderboard): union membership strategy + improved data sources UI #325) is a pre-existing flaky external-network test, not touched here
  • npx next lint over changed files — clean
  • New unit tests pass: 14/14
  • Playwright @pre-release smoke against a preview deploy — will run in CI; locally skipped because it requires a deployed env

Manual production smoke we just ran end-to-end as part of the incident response (independent of this PR):

  • Resend domain operator.livepeer.org verified for sending
  • vercel env add RESEND_API_KEY + EMAIL_FROM set on production + preview
  • New deployment naap-platform-741o2zvcz live on operator.livepeer.org
  • Resend API smoke send delivered ("last_event": "delivered")

Risk / rollback

  • No new runtime dependencies. @sentry/nextjs is optional via dynamic import — adding it is a separate PR.
  • No behavior change on the happy path. When configured correctly, all changed paths return the same values they did before.
  • Redis is only consulted when tryAcquireCooldown is called (existing-email registration flow). Failure to reach Redis falls back to in-memory — strictly no-worse than today.
  • Rollback: revert this single commit.
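The "bounded in-memory" fallback mentioned in the Redis bullet could look roughly like the class below. This is a sketch under assumptions: `MemoryCooldown`, its default cap of 1000 entries, and the oldest-first eviction policy are hypothetical, and the real module may bound memory differently.

```typescript
// Hypothetical sketch of a bounded in-memory cooldown (per-instance only).
class MemoryCooldown {
  private readonly expiries = new Map<string, number>();
  private readonly maxEntries: number;

  constructor(maxEntries = 1000) {
    this.maxEntries = maxEntries;
  }

  get size(): number {
    return this.expiries.size;
  }

  // Returns true if the caller acquired the slot, false if still cooling down.
  tryAcquire(key: string, ttlMs: number, now: number = Date.now()): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > now) return false;
    // Delete before set so the key moves to the end of the Map's insertion order.
    this.expiries.delete(key);
    this.expiries.set(key, now + ttlMs);
    if (this.expiries.size > this.maxEntries) this.evict(now);
    return true;
  }

  // Soft cap: drop expired entries first, then the oldest live ones.
  private evict(now: number): void {
    for (const [key, expiresAt] of this.expiries) {
      if (expiresAt <= now) this.expiries.delete(key);
    }
    for (const key of this.expiries.keys()) {
      if (this.expiries.size <= this.maxEntries) break;
      this.expiries.delete(key);
    }
  }
}
```

With the 15-minute throttle described above, a caller would pass `ttlMs = 15 * 60 * 1000`. Evicting a live entry early only loosens the throttle for that one key, which matches the "no worse than today" framing of the fallback.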

Follow-ups (not in this PR)

  • Actually install @sentry/nextjs and add instrumentation.ts so the dynamic-import path in lib/monitoring.ts lights up. Tracking separately so this PR stays scoped.
  • Consider extending /api/health/services similarly so it surfaces Redis + Resend reachability, not just upstream service /healthz.

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Health check now reports email configuration status.
    • Lightweight structured error reporting and monitoring for failures.
    • Cross-instance rate limiting for email resend requests.
  • Documentation

    • Expanded environment configuration guidance for email (production behavior and sandbox limits).
  • Tests

    • Added unit and integration tests for email validation, monitoring, resend-cooldown, and an auth-email smoke test.


… loud

The verification email pipeline previously swallowed both boot-time
misconfiguration (missing RESEND_API_KEY / sandbox EMAIL_FROM) and
runtime send failures into console.error. This let a regression go
unnoticed for weeks: signups completed in DB but no verification email
was ever sent in production.

This change keeps the existing behavior on the happy path and makes
failure observable + recoverable across instances.

Changes
- /api/health: returns 503 in production-like environments when Resend
  is unconfigured or still on the resend.dev sandbox, so platform
  monitors and deploy gates fail closed.
- lib/monitoring.ts: structured error reporter that emits a tagged
  `[ALERT]` JSON line and forwards to Sentry when `@sentry/nextjs` is
  installed and `SENTRY_DSN` is set. Dependency stays optional via
  dynamic import; no-op when absent.
- lib/email.ts: send failures and boot-time misconfig now flow through
  reportError (verification + password-reset paths). `validateEmailConfig`
  is now pure (safe for /api/health to call repeatedly); a new
  `logEmailConfigWarnings` hook runs once on module load.
- lib/auth/resend-cooldown.ts: replaces the in-process `Map` used to
  throttle resend with a Redis-backed (@naap/cache) cooldown keyed by
  hashed email + purpose, so the cooldown holds across Vercel Fluid
  Compute instances. Falls back to in-memory when Redis is unavailable.
- Tests: 14 new unit tests (monitoring, email config, cooldown memory +
  Redis paths) + Playwright @pre-release smoke that asserts /api/health
  reports email configured and register/resend-verification do not 5xx.
- Docs: .env.local.example clarifies that RESEND_API_KEY and EMAIL_FROM
  are required in production and that the sandbox sender only delivers
  to the Resend account owner.

Co-authored-by: Cursor <cursoragent@cursor.com>
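The optional Sentry forwarding described for lib/monitoring.ts can be sketched as a guarded dynamic import. `Sentry.captureException` is the real SDK call named in the PR; `loadSentry`, `forwardToSentry`, and the injectable `Importer` parameter are hypothetical names added here so the behavior can be exercised without the package installed.

```typescript
// Hypothetical sketch of "optional via dynamic import; no-op when absent".
interface SentryLike {
  captureException(error: unknown): unknown;
}

type Env = Record<string, string | undefined>;
type Importer = (specifier: string) => Promise<unknown>;

// Passing the specifier as a variable means the type checker does not need
// the package to be installed.
const dynamicImport: Importer = (specifier) => import(specifier);

async function loadSentry(env: Env, importer: Importer = dynamicImport): Promise<SentryLike | null> {
  if (!env.SENTRY_DSN) return null; // not configured: skip the import entirely
  try {
    const mod = (await importer("@sentry/nextjs")) as Partial<SentryLike> | null;
    return mod && typeof mod.captureException === "function" ? (mod as SentryLike) : null;
  } catch {
    return null; // package not installed: stay a no-op
  }
}

// Returns true only when the error was actually handed to Sentry.
async function forwardToSentry(error: unknown, env: Env, importer?: Importer): Promise<boolean> {
  const sentry = await loadSentry(env, importer);
  if (!sentry) return false;
  sentry.captureException(error);
  return true;
}
```

In this sketch the structured log line stays the primary signal and the Sentry call is strictly additive, so a missing package or DSN never changes what gets logged.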

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| naap-platform | Ready | Preview, Comment | May 21, 2026 6:44pm |


@github-actions github-actions Bot added the size/XL Extra large PR (500+ lines) label May 21, 2026
github-actions Bot commented May 21, 2026

⚠️ This PR is very large (783 lines changed). Please split it into smaller, focused PRs if possible.

@github-actions github-actions Bot added the scope/shell Shell app changes label May 21, 2026
coderabbitai Bot commented May 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6176daaf-cd84-4794-b3ba-d4cc8472b6c4

📥 Commits

Reviewing files that changed from the base of the PR and between 45191b5 and c2f2b12.

📒 Files selected for processing (4)
  • apps/web-next/src/lib/__tests__/monitoring.test.ts
  • apps/web-next/src/lib/auth/__tests__/resend-cooldown.test.ts
  • apps/web-next/src/lib/auth/resend-cooldown.ts
  • apps/web-next/tests/auth-email-smoke.spec.ts

📝 Walkthrough

Walkthrough

This PR introduces structured error monitoring, cross-instance email resend throttling, and email configuration health checks. It adds a monitoring module with reportError() for JSON logging and optional Sentry forwarding, a cooldown utility using Redis-backed throttling with in-memory fallback, refactors email configuration to separate validation and logging concerns, integrates both into auth registration, enhances the health endpoint with email status, and validates the integration with smoke tests.

Changes

Email monitoring, cooldown, and health integration

| Layer / File(s) | Summary |
| --- | --- |
| Monitoring infrastructure<br>`apps/web-next/src/lib/monitoring.ts`, `apps/web-next/src/lib/__tests__/monitoring.test.ts` | New module with ErrorContext type, reportError() for structured JSON logging with [ALERT] prefix and optional Sentry forwarding, input sanitization to prevent log injection, and test helpers for state reset. |
| Resend cooldown utility<br>`apps/web-next/src/lib/auth/resend-cooldown.ts`, `apps/web-next/src/lib/auth/__tests__/resend-cooldown.test.ts` | New module introducing tryAcquireCooldown() for cross-instance email throttling via Redis cache with per-purpose tracking and in-memory fallback when cache unavailable, including TTL enforcement and soft entry cap for memory. |
| Email configuration and error reporting<br>`apps/web-next/src/lib/email.ts`, `apps/web-next/src/lib/__tests__/email.test.ts` | Refactored email module to split concerns: pure validateEmailConfig() checker, new logEmailConfigWarnings() for cold-start warnings, and both sendVerificationEmail() and sendPasswordResetEmail() updated to use reportError() instead of console logging for missing config and send failures. |
| Auth registration with cooldown and monitoring<br>`apps/web-next/src/lib/api/auth.ts` | Updated register function to replace in-memory resend throttling with tryAcquireCooldown() and swap console logging for reportError() with structured tags in resend and post-registration verification-email error paths. |
| Health endpoint with email config status<br>`apps/web-next/src/app/api/health/route.ts` | Extended health check to include email configuration status (configured, warnings, criticalInThisEnv), production-like environment detection, and HTTP 503 response when email is critical but unconfigured. |
| Integration tests and documentation<br>`apps/web-next/tests/auth-email-smoke.spec.ts`, `apps/web-next/.env.local.example` | Added Playwright smoke tests verifying /api/health email status, endpoint availability, and signup success, plus updated .env.local.example with production guidance for RESEND_API_KEY configuration. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • livepeer/naap#149: Refactors email configuration telemetry and Resend verification flow in apps/web-next/src/lib/email.ts and apps/web-next/src/lib/api/auth.ts, directly overlapping with the cooldown and monitoring wiring in this PR.
  • livepeer/naap#195: Also adjusts auth resend/verification throttling behavior and may overlap with the tryAcquireCooldown design and integration.
  • livepeer/naap#181: Modifies email configuration validation and startup logging for RESEND_API_KEY and sandbox sender handling, overlapping with the refactored email startup checks here.

Suggested labels

scope/backend, size/M

Suggested reviewers

  • eliteprox
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly captures the main intent: hardening the email verification pipeline to make silent failures observable through structured error reporting and health checks. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@apps/web-next/src/lib/__tests__/monitoring.test.ts`:
- Around line 11-19: The tests mutate process.env.SENTRY_DSN and don't restore
it, risking cross-test contamination; modify the setup/teardown around
__resetMonitoringForTests() so beforeEach captures the current
process.env.SENTRY_DSN (e.g., save to a local variable) and afterEach restores
it (reassign or delete if originally undefined), and ensure consoleErrorSpy is
still restored in afterEach; update the beforeEach/afterEach surrounding the
existing __resetMonitoringForTests, process.env.SENTRY_DSN mutation, and
consoleErrorSpy usage accordingly.

In `@apps/web-next/src/lib/auth/resend-cooldown.ts`:
- Around line 94-103: tryAcquireCooldown currently does a non-atomic
cache.cacheGet followed by cache.cacheSet which allows race conditions; change
it to perform an atomic acquire (SET NX with TTL) instead of GET+SETEX: use the
cache client's atomic "set if not exists" with expiration (Redis SET ... NX EX)
or reuse the project's distributed lock/SWR lock utility to attempt to create
the key once and only set TTL when acquired (referencing tryAcquireCooldown,
cache.cacheGet, cache.cacheSet, PREFIX, ttlSeconds, ttlMs). If the atomic set
succeeds return true, otherwise return false; preserve existing TTL calculation
and error handling while removing the non-atomic check-then-set path.

In `@apps/web-next/tests/auth-email-smoke.spec.ts`:
- Around line 23-26: The skip guard currently uses baseURL.includes('localhost')
and misses other local addresses; update the test.skip checks (the three
occurrences using test.skip(!!baseURL && baseURL.includes('localhost'), ...)) to
detect local environments by parsing baseURL (new URL(baseURL).hostname) and
skipping if the hostname is 'localhost', '127.0.0.1' or '::1' (or otherwise
matches a local host check), so all three occurrences consistently skip for
those hostnames.
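The skip-guard finding for the smoke spec can be illustrated with a small helper that parses the hostname instead of substring-matching. `isLocalBaseURL` and `LOCAL_HOSTNAMES` are hypothetical names, and treating an unparsable `baseURL` as "not local" is an assumption of this sketch.

```typescript
// Hypothetical helper for the Playwright skip guard.
// Note: URL.hostname keeps the brackets for IPv6 literals, hence "[::1]".
const LOCAL_HOSTNAMES = new Set(["localhost", "127.0.0.1", "0.0.0.0", "[::1]"]);

function isLocalBaseURL(baseURL: string | undefined): boolean {
  if (!baseURL) return false;
  try {
    return LOCAL_HOSTNAMES.has(new URL(baseURL).hostname);
  } catch {
    // Assumption: an unparsable baseURL is not treated as local.
    return false;
  }
}
```

A spec would then call something like `test.skip(isLocalBaseURL(baseURL), 'requires a deployed environment')`. Unlike `baseURL.includes('localhost')`, this also avoids skipping a deployed host such as `localhost.example.com`.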

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f6f1a1cf-d453-4fe1-9785-20d0ce7e4205

📥 Commits

Reviewing files that changed from the base of the PR and between 3d95a5d and 45191b5.

📒 Files selected for processing (10)
  • apps/web-next/.env.local.example
  • apps/web-next/src/app/api/health/route.ts
  • apps/web-next/src/lib/__tests__/email.test.ts
  • apps/web-next/src/lib/__tests__/monitoring.test.ts
  • apps/web-next/src/lib/api/auth.ts
  • apps/web-next/src/lib/auth/__tests__/resend-cooldown.test.ts
  • apps/web-next/src/lib/auth/resend-cooldown.ts
  • apps/web-next/src/lib/email.ts
  • apps/web-next/src/lib/monitoring.ts
  • apps/web-next/tests/auth-email-smoke.spec.ts

Comment thread apps/web-next/src/lib/__tests__/monitoring.test.ts
Comment thread apps/web-next/src/lib/auth/resend-cooldown.ts
Comment thread apps/web-next/tests/auth-email-smoke.spec.ts
Three valid findings:

1. resend-cooldown: switch from non-atomic cacheGet+cacheSet to an atomic
   Redis SET NX PX round-trip via @naap/cache's getRedis(). Removes the
   tiny TOCTOU window where two concurrent acquirers on different
   instances could both think the slot was free.
2. monitoring.test.ts: save and restore process.env.SENTRY_DSN in
   beforeEach/afterEach so the SENTRY_DSN mutation doesn't leak into
   sibling tests in the same vitest worker.
3. auth-email-smoke: broaden the local-environment skip guard to cover
   127.0.0.1, 0.0.0.0, and [::1] in addition to "localhost", parsing
   baseURL via URL.hostname instead of substring match.

Also updates the cooldown unit tests to exercise the new SET NX PX path
plus the throw-then-fallback-to-memory path.

Co-authored-by: Cursor <cursoragent@cursor.com>
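The atomic acquire from finding 1 can be sketched as a single `SET key value PX ttl NX` round-trip. The `RedisLike` interface mirrors the argument order of the Redis command as exposed by clients such as ioredis; whether the client returned by `getRedis()` has exactly this shape is an assumption, as are the `cooldown:` key prefix, the SHA-256 digest, and the parameter list of `tryAcquireCooldown` shown here.

```typescript
import { createHash } from "node:crypto";

// Minimal slice of a Redis client (assumed shape): resolves to "OK" when the
// key was created, or null when NX refused because the key already exists.
interface RedisLike {
  set(key: string, value: string, px: "PX", ttlMs: number, nx: "NX"): Promise<"OK" | null>;
}

// Hash the normalized email so raw addresses never appear in Redis keys,
// and namespace by purpose so verification and reset cooldowns stay separate.
function cooldownKey(purpose: string, email: string): string {
  const normalized = email.trim().toLowerCase();
  const digest = createHash("sha256").update(normalized).digest("hex");
  return `cooldown:${purpose}:${digest}`;
}

// One round-trip: the key is created with its TTL only if it does not exist,
// so two concurrent callers cannot both observe a free slot.
async function tryAcquireCooldown(
  redis: RedisLike,
  purpose: string,
  email: string,
  ttlMs: number,
): Promise<boolean> {
  const reply = await redis.set(cooldownKey(purpose, email), "1", "PX", ttlMs, "NX");
  return reply === "OK";
}
```

The earlier GET-then-SET version had a window between the two commands; here the check and the write are the same command, so the server serializes competing acquirers. The throw-then-fallback-to-memory path mentioned above is left out of this sketch.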
@qianghan
Contributor

Review cycle status

Code review (CodeRabbit):

  • All 3 inline issues addressed in commit c2f2b120:
    1. monitoring.test.ts: env vars now saved + restored across tests
    2. resend-cooldown.ts: atomic Redis SET NX PX round-trip (replaces non-atomic GET+SETEX); falls back to in-memory when Redis absent
    3. auth-email-smoke.spec.ts: skip guard now parses URL.hostname and matches localhost / 127.0.0.1 / 0.0.0.0 / [::1]
  • All 3 review threads auto-resolved by CodeRabbit (isResolved: true).
  • CodeRabbit is rate-limited from posting a follow-up review for ~51 min so the stale CHANGES_REQUESTED state is still attached — it will clear automatically, or an admin can dismiss it.
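For item 1 above, saving and restoring environment variables around a test might look like the helper below. `snapshotEnv` is a hypothetical name; the sketch takes the environment object as a parameter so it works on `process.env` or on a plain object.

```typescript
// Hypothetical test helper: capture selected env vars and return a restore function.
type Env = Record<string, string | undefined>;

function snapshotEnv(env: Env, keys: string[]): () => void {
  const saved = keys.map((key) => [key, env[key]] as const);
  return () => {
    for (const [key, value] of saved) {
      // Delete rather than assign undefined: on process.env an assignment
      // would store the string "undefined".
      if (value === undefined) delete env[key];
      else env[key] = value;
    }
  };
}
```

In a vitest suite, `beforeEach` would store the result of `snapshotEnv(process.env, ['SENTRY_DSN'])` and `afterEach` would call it, so a DSN set by one test cannot leak into sibling tests in the same worker.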

Copilot:

  • Tried gh pr edit --add-reviewer copilot-pull-request-reviewer and the REST/GraphQL equivalents: GitHub returns 422 "not a collaborator" — Copilot can only be triggered from the GitHub UI in this repo. Worth a UI click if you want a second AI pass before merge.

CI: 18 success / 5 skipped / 0 failures (incl. Lint & TypeCheck, Build, Quality Gates, CodeQL, Vercel preview deploy).

Tests added: 15 vitest unit tests + 3 Playwright @pre-release smoke tests.

Local verification (run on c2f2b120):

  • npx tsc --noEmit — clean for changed files
  • npx vitest run — 663/664 (1 pre-existing flaky network test unrelated)
  • New unit suites — 15/15

Merge gate: needs @livepeer/core CODEOWNERS approval.

@seanhanca seanhanca enabled auto-merge (squash) May 21, 2026 18:51
@seanhanca
Contributor Author

@coderabbitai review


coderabbitai Bot commented May 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@seanhanca seanhanca merged commit 326a04d into main May 21, 2026
24 checks passed
@seanhanca seanhanca deleted the fix/auth-email-hardening branch May 21, 2026 19:46

Labels

scope/shell Shell app changes size/XL Extra large PR (500+ lines)


2 participants