Skip to content

[Refactor] Change Retry Logic in Email Sending#1191

Merged
nams1570 merged 12 commits intodevfrom
spike-email-stuck-sending
Feb 16, 2026
Merged

[Refactor] Change Retry Logic in Email Sending#1191
nams1570 merged 12 commits intodevfrom
spike-email-stuck-sending

Conversation

@nams1570
Copy link
Copy Markdown
Collaborator

@nams1570 nams1570 commented Feb 12, 2026

Context

Some of our users' emails were getting stuck in sending. The long delays in processing the retries caused a vercel function timeout.

Summary of Changes

We refactor the low level email sending functions to remove the retry logic there. We kick it up to the email queue step. Additionally, we flag emails to be retried when they encounter issues but leave it for a future iteration to actually perform the retry. We perform an exponential backoff with a random component to decide when they have to be retried. We also make some small adjustments to the queuing function to not queue skipped emails.

When an email fails to send during the sending function, we check to see if it is a retryable error or not. Some errors are transient and trying again may succeed while others indicate deeper issues. If it is retryable, and the max number of retry attempts hasn't been reached, we set nextSendRetryAt to a time determined by an exponential backoff calculation function. When the queuing function looks for emails to queue, it doesn't just pick up the SCHEDULED. emails whose scheduledAt time <= NOW(), but also those emails whose nextSendRetryAt time <= NOW(). What this means in practice is that one iteration of the email-queue-step will mark emails as retryable while another iteration will perform the retry. This should be cleaner and prevent long delays in the email-queue-step process due to retries. This also makes it easier to scale up the number of retries if need be.

Summary by CodeRabbit

  • New Features

    • Email retry flow with scheduled exponential backoff, exhaustion handling, and per-attempt error metadata visible in APIs and admin UI.
    • Outbox items now include retry count, next retry time, and per-attempt error history; new AdminSendAttemptError type added.
  • Tests

    • Expanded end-to-end coverage for retryable vs non-retryable failures, scheduling, exhaustion, and state transitions.
  • Chores

    • Migrations and a concurrent index added to support new retry fields and optimize queue queries.

Copilot AI review requested due to automatic review settings February 12, 2026 03:13
@vercel
Copy link
Copy Markdown

vercel Bot commented Feb 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
stack-backend Ready Ready Preview, Comment Feb 16, 2026 8:23pm
stack-dashboard Ready Ready Preview, Comment Feb 16, 2026 8:23pm
stack-demo Ready Ready Preview, Comment Feb 16, 2026 8:23pm
stack-docs Ready Ready Preview, Comment Feb 16, 2026 8:23pm

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Feb 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds deferred email retry support: DB columns and CHECK constraints (staged validation), Prisma schema/index, API/read-model mappings, queue processor retry/backoff and per-attempt error tracking, simplified low-level send path, admin types, and extended E2E tests for retry behaviors.

Changes

Cohort / File(s) Summary
Database Schema & Migrations
apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql, apps/backend/prisma/migrations/20260210000001_deferred_email_retry_validate/migration.sql, apps/backend/prisma/migrations/20260213004424_email_outbox_is_queued_index/migration.sql, apps/backend/prisma/schema.prisma
Adds sendRetries (INT default 0), nextSendRetryAt (TIMESTAMP/DateTime?), sendAttemptErrors (JSONB/Json?), adds three CHECK constraints (marked NOT VALID) to enforce retry semantics/no-retry-after-finished, validates those constraints in a follow-up migration, and creates concurrent index EmailOutbox_isQueued_idx.
Backend CRUD & API Layer
apps/backend/src/app/api/latest/emails/outbox/crud.tsx
Maps new retry fields into read model (send_retries, next_send_retry_at_millis, send_attempt_errors), parses them from JSON, and clears/resets retry state on rerender/cancel flows.
Email Queue & Send Logic
apps/backend/src/lib/email-queue-step.tsx, apps/backend/src/lib/emails-low-level.tsx
Implements per-attempt error tracking, exponential backoff and scheduling, retry/unclaim/requeue logic, exhaustion/permanent-failure handling, and consolidates low-level send path to a non-retry primitive.
End-to-End Tests & Test Helpers
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts, apps/e2e/tests/backend/endpoints/api/v1/emails/outbox-api.test.ts, apps/e2e/tests/backend/backend-helpers.ts
Adds extensive E2E tests with SMTP stubs for retryable vs non-retryable errors, polling helpers for new fields, assertions for retry lifecycle and exhaustion, and exposes OutboxEmail type for tests.
Shared Types & Admin SDK
packages/stack-shared/src/interface/crud/email-outbox.ts, packages/template/src/lib/stack-app/email/index.ts, packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts, packages/template/src/lib/stack-app/index.ts
Adds retry-related fields to shared CRUD schema and admin types: send_retries, next_send_retry_at_millis (mapped to Date
Docs / Guidance
AGENTS.md
Adds migration guidance recommending staged constraint validation (use NOT VALID then VALIDATE in separate migration files).

Sequence Diagram(s)

sequenceDiagram
    participant Q as Email Queue
    participant P as Send Processor
    participant DB as Database
    participant SMTP as SMTP Provider

    Q->>P: Fetch next email (respect nextSendRetryAt)
    P->>DB: Claim email / read retry metadata
    P->>SMTP: Attempt send via lowLevelSendEmailDirectWithoutRetries
    alt Success
        SMTP-->>P: 250 OK
        P->>DB: Mark finishedSendingAt, clear retry fields
    else Temporary/Retryable error
        SMTP-->>P: Temp error
        P->>P: increment sendRetries, compute backoff
        P->>DB: Append attempt to sendAttemptErrors, set nextSendRetryAt
        P->>Q: Unclaim/requeue for future
    else Permanent error or max attempts reached
        SMTP-->>P: Permanent error / exhausted
        P->>DB: Mark finishedSendingAt with final error summary
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • N2D4
  • BilalG1

"A Rabbit's Retry Ode" 🐇✨

When mails fall short and servers sigh,
I note each hop and every try.
Backoff ticks and errors stacked,
I nudge them on till one sticks back.
Hooray — the post finds sky.

🚥 Pre-merge checks | ✅ 2 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.18% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (13 files):

⚔️ AGENTS.md (content)
⚔️ apps/backend/prisma/schema.prisma (content)
⚔️ apps/backend/src/app/api/latest/emails/outbox/crud.tsx (content)
⚔️ apps/backend/src/lib/email-queue-step.tsx (content)
⚔️ apps/backend/src/lib/emails-low-level.tsx (content)
⚔️ apps/e2e/tests/backend/backend-helpers.ts (content)
⚔️ apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (content)
⚔️ apps/e2e/tests/backend/endpoints/api/v1/emails/outbox-api.test.ts (content)
⚔️ apps/e2e/tests/js/email.test.ts (content)
⚔️ packages/stack-shared/src/interface/crud/email-outbox.ts (content)
⚔️ packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts (content)
⚔️ packages/template/src/lib/stack-app/email/index.ts (content)
⚔️ packages/template/src/lib/stack-app/index.ts (content)

These conflicts must be resolved before merging into dev.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title '[Refactor] Change Retry Logic in Email Sending' accurately summarizes the main change: refactoring email retry logic by moving it from low-level functions to the queue step.
Description check ✅ Passed The pull request description provides clear context, comprehensive summary of changes, and detailed explanation of the refactoring approach.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch spike-email-stuck-sending

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch from 0144e55 to bd4d1a4 Compare February 12, 2026 03:15
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Feb 12, 2026

Greptile Overview

Greptile Summary

This PR refactors email retry logic to prevent Vercel function timeouts by moving retry handling from the low-level email sender to the queue processing layer. Instead of blocking queue iterations with inline retries, emails with retryable errors are unclaimed and scheduled for future queue iterations using exponential backoff with jitter.

Key changes:

  • Database schema: Added failedSendAttemptCount, nextSendRetryAt, and sendAttemptErrors fields to track retry state with appropriate CHECK constraints
  • Email queue: Implements deferred retry with exponential backoff (max 3 attempts), unclaiming failed emails and setting nextSendRetryAt for future processing
  • Low-level sender: Removed retry logic from lowLevelSendEmailDirectViaProvider, renamed to lowLevelSendEmailDirectWithoutRetries for clarity
  • API & types: Exposed retry fields through CRUD endpoints and TypeScript types for debugging visibility
  • Testing: Comprehensive tests covering non-retryable errors, retryable failures, and retry exhaustion scenarios

Minor issue:

  • Prisma schema comments should document the new CHECK constraints following the existing pattern (see line 864 for reference)

Confidence Score: 4/5

  • This PR is safe to merge with high confidence - the refactoring is well-designed and thoroughly tested
  • The implementation is solid with comprehensive test coverage, proper error handling, and clean separation of concerns. The deferred retry approach correctly prevents timeout issues. Minor documentation improvement suggested for schema comments, but doesn't affect functionality.
  • apps/backend/prisma/schema.prisma could benefit from documenting the CHECK constraints in comments, but this is a minor documentation issue

Important Files Changed

Filename Overview
apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql Adds retry fields (failedSendAttemptCount, nextSendRetryAt, sendAttemptErrors) with appropriate CHECK constraints for data integrity
apps/backend/prisma/schema.prisma Adds retry tracking fields to EmailOutbox model, but missing CHECK constraints defined in the migration
apps/backend/src/lib/email-queue-step.tsx Implements deferred retry logic with exponential backoff, unclaiming emails for future iterations instead of blocking current processing
apps/backend/src/lib/emails-low-level.tsx Removes retry logic from low-level email functions, simplifying to single-attempt sending
apps/backend/src/app/api/latest/emails/outbox/crud.tsx Adds retry fields to CRUD API, properly resets retry state when email content is edited

Sequence Diagram

sequenceDiagram
    participant Queue as Email Queue
    participant DB as EmailOutbox DB
    participant LowLevel as Low-Level Sender
    participant Provider as Email Provider
    
    Note over Queue,Provider: First Send Attempt
    Queue->>DB: Claim email (set startedSendingAt, clear nextSendRetryAt)
    Queue->>LowLevel: lowLevelSendEmailDirectWithoutRetries()
    LowLevel->>Provider: Send email (no retries)
    Provider-->>LowLevel: Error (retryable, e.g., 450)
    LowLevel-->>Queue: Result.error(canRetry: true)
    
    Note over Queue,DB: Schedule Deferred Retry
    Queue->>DB: Update: failedSendAttemptCount++, set nextSendRetryAt (exp backoff), unclaim (startedSendingAt=null, isQueued=false)
    
    Note over Queue,Provider: Future Queue Iteration
    Queue->>DB: Find emails with nextSendRetryAt <= NOW()
    Queue->>DB: Claim email for retry (set startedSendingAt, clear nextSendRetryAt)
    Queue->>LowLevel: lowLevelSendEmailDirectWithoutRetries()
    LowLevel->>Provider: Send email (no retries)
    
    alt Success
        Provider-->>LowLevel: Success
        LowLevel-->>Queue: Result.ok()
        Queue->>DB: Mark complete (finishedSendingAt)
    else Retry Exhausted (attempt 3)
        Provider-->>LowLevel: Error
        LowLevel-->>Queue: Result.error()
        Queue->>DB: Mark as permanent failure (finishedSendingAt, sendServerError*)
    else Retryable (attempt < 3)
        Provider-->>LowLevel: Error (retryable)
        LowLevel-->>Queue: Result.error(canRetry: true)
        Queue->>DB: Schedule another retry (failedSendAttemptCount++, nextSendRetryAt)
    end
Loading

Copy link
Copy Markdown
Contributor

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment thread apps/backend/prisma/schema.prisma Outdated
@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch from bd4d1a4 to 1eea2e9 Compare February 12, 2026 03:18
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Refactors email sending retry behavior to avoid long inline retry delays that can cause Vercel function timeouts, by moving retry scheduling/tracking into the email queue step and persisting retry metadata on outbox records.

Changes:

  • Removed inline retry logic from low-level email sending, returning a single-attempt result to the caller.
  • Added deferred retry tracking fields (failedSendAttemptCount, nextSendRetryAt, sendAttemptErrors) to the EmailOutbox model and exposed them through CRUD + template/admin SDK types.
  • Implemented exponential backoff (with jitter) retry scheduling in the queue worker and added E2E coverage for retryable and non-retryable SMTP failures.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
packages/template/src/lib/stack-app/index.ts Re-exports new admin retry error type for SDK consumers.
packages/template/src/lib/stack-app/email/index.ts Adds admin-facing retry fields/types to outbox models.
packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts Maps new retry fields from CRUD responses into admin SDK types.
packages/stack-shared/src/interface/crud/email-outbox.ts Extends shared CRUD schema to include retry metadata + attempt errors.
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts Updates expected outbox payload shape and adds deferred-retry E2E scenarios with a local SMTP stub.
apps/backend/src/lib/emails-low-level.tsx Removes retry wrapper logic; logs and returns single-attempt send result.
apps/backend/src/lib/email-queue-step.tsx Implements deferred retry scheduling + error tracking; adjusts queue selection/claiming to consider retries.
apps/backend/src/app/api/latest/emails/outbox/crud.tsx Serializes retry fields into CRUD responses; resets retry fields when content changes.
apps/backend/prisma/schema.prisma Adds retry fields to EmailOutbox model.
apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql Creates DB columns + constraints for deferred retry fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread apps/backend/src/lib/email-queue-step.tsx
Comment thread apps/backend/src/lib/email-queue-step.tsx
@nams1570 nams1570 requested a review from N2D4 February 12, 2026 03:22
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql`:
- Around line 10-25: The three CHECK constraints on "EmailOutbox"
(EmailOutbox_nextSendRetryAt_requires_failure,
EmailOutbox_sendAttemptErrors_requires_failure,
EmailOutbox_no_retry_after_finished) must be added using the safer two-step
pattern: first ALTER TABLE "EmailOutbox" ADD CONSTRAINT ... CHECK(...) NOT VALID
for each constraint (brief exclusive lock), then run ALTER TABLE "EmailOutbox"
VALIDATE CONSTRAINT <constraint_name> for each (non-blocking SHARE UPDATE
EXCLUSIVE). Update the migration to add each constraint with NOT VALID and then
validate them in separate statements in the same migration.

In `@apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts`:
- Around line 2026-2033: getOutboxEmailById currently returns response.body
unconditionally which can return error payloads for 4xx/5xx responses; update
getOutboxEmailById to check the HTTP status (e.g., response.ok or
response.status) after the niceBackendFetch call and throw or return a clear
error when the response is not successful, including the status and response
body in the error message so callers (like the polling helpers) can detect and
handle "not found yet" vs real failures; reference the function name
getOutboxEmailById and the use of niceBackendFetch to locate where to add the
check and error handling.
🧹 Nitpick comments (9)
packages/template/src/lib/stack-app/email/index.ts (1)

39-47: timestamp field is string while sibling date fields use Date.

Other date-like fields in AdminEmailOutboxBase (e.g., createdAt, scheduledAt, nextSendRetryAt) are typed as Date, but AdminSendAttemptError.timestamp is a raw string. If this is intentional because it comes directly from new Date().toISOString() stored in JSON and you don't want to deserialize it, consider adding a brief doc comment noting the format (ISO 8601) so consumers know what to expect.

apps/backend/prisma/schema.prisma (1)

847-853: Consider adding an index on nextSendRetryAt for the retry query path.

The prepareSendPlan and claimEmailsForSending functions query WHERE "nextSendRetryAt" IS NOT NULL AND "nextSendRetryAt" <= NOW(). The existing EmailOutbox_ordering_idx (Line 886) covers the normal queuing path but does not cover the retry path. With a potentially large EmailOutbox table (>1M rows per learnings), a partial index on nextSendRetryAt could significantly improve retry query performance.

Suggested partial index
   @@index([tenancyId, finishedSendingAt(sort: Desc), scheduledAtIfNotYetQueued(sort: Desc), priority, id], map: "EmailOutbox_ordering_idx")
   @@index([tenancyId, simpleStatus], map: "EmailOutbox_simple_status_tenancy_idx")
   @@index([tenancyId, status], map: "EmailOutbox_status_tenancy_idx")

In the migration SQL, add:

CREATE INDEX CONCURRENTLY "EmailOutbox_nextSendRetryAt_idx"
  ON "EmailOutbox" ("nextSendRetryAt")
  WHERE "nextSendRetryAt" IS NOT NULL;

Based on learnings: "When writing database migration files, assume that we have >1,000,000 rows in every table."

apps/backend/src/lib/email-queue-step.tsx (3)

44-50: Minor formatting: missing space before =.

-const appendSendAttemptError =(
+const appendSendAttemptError = (

709-709: Type cast on row.sendAttemptErrors — necessary but fragile.

row.sendAttemptErrors as SendAttemptError[] | null is needed because Prisma's raw query returns Json type. Since this field is populated only by this same code path (via appendSendAttemptError), the cast is safe in practice. However, if the JSON structure were ever corrupted or changed, this would silently produce incorrect data.

Consider adding a defensive runtime check or using a validation/parsing step when reading from the raw query result, at least in development mode.

As per coding guidelines: "Any assumption you make should either be validated through the type system (preferred), assertions, or tests."


795-813: Unexpected errors bypass retry tracking.

The outer catch block marks the email as server-error immediately, without updating failedSendAttemptCount or sendAttemptErrors. This means an unexpected error (e.g., transient DB issue during the deliverability check) causes permanent failure even if the error is transient.

This may be acceptable for truly unexpected errors, but consider whether some of these should also use the retry path. At minimum, the sendServerErrorInternalDetails should include the current failedSendAttemptCount for debugging:

Suggested improvement
       data: {
         finishedSendingAt: new Date(),
         canHaveDeliveryInfo: false,
+        failedSendAttemptCount: row.failedSendAttemptCount + 1,
         sendServerErrorExternalMessage: "An error occurred while sending the email. If you are the admin of this project, please check the email configuration and try again.",
         sendServerErrorExternalDetails: {},
         sendServerErrorInternalMessage: errorToNiceString(error),
-        sendServerErrorInternalDetails: {},
+        sendServerErrorInternalDetails: { previousAttemptCount: row.failedSendAttemptCount },
       },
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (3)

1941-1944: Non-null assertion on tempFailPort.

Line 1943 uses tempFailPort! which violates the coding guideline preferring ?? throwErr(...).

Suggested fix
   if (tempFailServer) {
-    return tempFailPort!;
+    return tempFailPort ?? throwErr("tempFailPort should be set when tempFailServer exists");
   }

As per coding guidelines: "Prefer ?? throwErr(...) over non-null assertions and fallback values, with good error messages explicitly stating assumptions."


2036-2049: Polling helpers use Date.now() for elapsed time measurement.

waitForOutboxEmail (Line 2037-2038) and waitForAttemptCount (Line 2054-2055) both use Date.now() for timeout tracking. The same pattern appears in the inline polling loops at Lines 2112-2114 and 2234-2237.

As per coding guidelines: "Don't use Date.now() for measuring elapsed (real) time, instead use performance.now()."

Also applies to: 2053-2071


1926-1935: brokenSmtpConfig uses raw port 25 — should this use withPortPrefix for consistency?

The existing testEmailConfig (Line 52) uses withPortPrefix("29") for port configuration, suggesting test ports are prefixed to avoid conflicts. The brokenSmtpConfig uses port 25 directly. Since this config intentionally targets a non-existent host, the port doesn't functionally matter, but it's a minor inconsistency.

packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts (1)

696-697: Consider narrowing the crud parameter type instead of any.

The function currently uses any for the entire crud parameter (with a valid eslint-disable). This predates this PR, but since you're extending the surface area with three new fields, it's worth noting that a more precise type (e.g., EmailOutboxCrud["Admin"]["Read"] or similar) would catch mismatches at compile time — such as a typo in crud.failed_send_attempt_count. Low priority since the existing pattern is consistent throughout.

Comment thread apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql Outdated
Comment thread apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/app/api/latest/emails/outbox/crud.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx Outdated
Comment thread apps/backend/src/lib/email-queue-step.tsx
@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch from 34b272d to 5194f5d Compare February 13, 2026 01:28
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts`:
- Around line 719-736: The code is using ad-hoc "as" casts for retry fields on
the crud object; update the function/signature that receives the crud parameter
to type it as EmailOutboxCrud (import from
`@stackframe/stack-shared/dist/interface/crud/email-outbox`), remove the
unnecessary "as" casts around send_retries, next_send_retry_at_millis and
send_attempt_errors, and let the typed properties be used directly when building
sendRetries, nextSendRetryAt and sendAttemptErrors (mapping the existing field
names to camelCase remains the same).
🧹 Nitpick comments (5)
apps/backend/prisma/migrations/20260213004424_email_outbox_is_queued_index/migration.sql (1)

1-6: Consider a partial index for the FALSE filter.
Since the hot path is isQueued = FALSE, a partial index can stay smaller and faster to maintain.

🔧 Suggested migration tweak
-CREATE INDEX CONCURRENTLY IF NOT EXISTS "EmailOutbox_isQueued_idx" ON /* SCHEMA_NAME_SENTINEL */."EmailOutbox" ("isQueued");
+CREATE INDEX CONCURRENTLY IF NOT EXISTS "EmailOutbox_isQueued_false_idx"
+  ON /* SCHEMA_NAME_SENTINEL */."EmailOutbox" ("isQueued")
+  WHERE "isQueued" = FALSE;
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (4)

1872-1917: Replace non-null assertions with ?? throwErr(...) pattern.

Lines 1874 and 1906 use ! non-null assertions. The coding guidelines prefer ?? throwErr(...) with explicit error messages.

Suggested fix
 async function startTempFailSmtpServer(): Promise<number> {
   if (tempFailServer) {
-    return tempFailPort!;
+    return tempFailPort ?? throwErr("tempFailServer exists but tempFailPort is null");
   }
-      const address = tempFailServer!.address();
+      const address = (tempFailServer ?? throwErr("tempFailServer is null after creation")).address();

As per coding guidelines: "Prefer ?? throwErr(...) over non-null assertions and fallback values, with good error messages explicitly stating assumptions."


1969-2005: Use performance.now() instead of Date.now() for elapsed-time measurement in polling helpers.

waitForOutboxEmail (Line 1971) and waitForAttemptCount (Line 1988) both use Date.now() to measure elapsed wall-clock time for timeout purposes. The same pattern is repeated in inline polling loops at lines 2046 and 2176.

Suggested fix for waitForOutboxEmail
 async function waitForOutboxEmail(subject: string, timeoutMs = 30000): Promise<OutboxEmailWithRetryFields> {
-  const startTime = Date.now();
-  while (Date.now() - startTime < timeoutMs) {
+  const startTime = performance.now();
+  while (performance.now() - startTime < timeoutMs) {
     const emails = await getOutboxEmails({ subject });
Suggested fix for waitForAttemptCount
 async function waitForAttemptCount(emailId: string, attemptCount: number, timeoutMs = 60000): Promise<OutboxEmailWithRetryFields> {
-  const startTime = Date.now();
-  while (Date.now() - startTime < timeoutMs) {
+  const startTime = performance.now();
+  while (performance.now() - startTime < timeoutMs) {

Apply the same change to the inline polling loops at lines 2046–2048 and 2176–2178.

As per coding guidelines: "Don't use Date.now() for measuring elapsed (real) time, instead use performance.now()."


1900-1902: Silently swallowing socket errors may hide issues during test debugging.

The socket.on('error', () => {}) handler discards all socket errors. Consider at least logging to logIfTestFails or stderr so unexpected connection issues are visible when tests fail.


2136-2192: Hardcoded retry limit should reference the actual backend constant to prevent drift.

The test hardcodes MAX_SEND_ATTEMPTS = 5 in assertions at lines 2187–2188, which currently match the backend constant in apps/backend/src/lib/email-queue-step.tsx:22. Since the constant is not exported, consider either:

  1. Exporting MAX_SEND_ATTEMPTS from the backend module and importing it into the test for a single source of truth, or
  2. Adding a more explicit comment explaining why the value is hardcoded and where it should be updated if the backend constant changes.

The existing comments reference the constant name but don't fully document the maintenance burden.

@nams1570 nams1570 requested a review from N2D4 February 13, 2026 01:40
@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch from 5194f5d to 8647187 Compare February 13, 2026 17:14
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@apps/backend/prisma/schema.prisma`:
- Around line 836-837: The comment for the Prisma field isQueued is outdated
because the retry/reschedule flow resets it to false; update the inline comment
on the isQueued Boolean field (in the Prisma schema where isQueued is declared)
to reflect that it indicates the email is currently queued but may be set back
to false on retry/reschedule (replace “Once queued, this stays TRUE” with
wording such as “Indicates the email is currently queued; may be reset to false
when retrying/rescheduling”).

In `@apps/backend/src/app/api/latest/emails/outbox/crud.tsx`:
- Around line 61-78: The code unsafely casts prismaModel.sendAttemptErrors to a
typed array and maps it; instead add a runtime guard that checks
Array.isArray(prismaModel.sendAttemptErrors) and if false calls throwErr(...)
with a descriptive message, then map the validated array to the snake_case
shape; replace the "as Array<...>" cast and ensure the variable name
sendAttemptErrors is set to null when prismaModel.sendAttemptErrors is
null/undefined, following the defensive pattern used in email-queue-step.tsx.

In `@apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts`:
- Around line 1971-1983: In the polling loops (functions waitForOutboxEmail,
waitForAttemptCount, the email status polling loop, and the retry-exhaustion
polling loop) replace uses of Date.now() for elapsed-time measurement with
performance.now(), i.e. set startTime = performance.now() and compare Date.now()
- startTime to timeoutMs using performance.now() - startTime instead; ensure the
same timeoutMs semantics are preserved and that performance is available in the
test runtime (import { performance } from 'perf_hooks' if Node doesn't provide
it globally).
- Around line 1872-1913: The startTempFailSmtpServer function uses non-null
assertions on tempFailPort and tempFailServer; replace them with defensive
checks: in the early-return branch verify tempFailPort is defined and return it
or throw a clear Error (referencing tempFailPort and tempFailServer), and inside
the listen callback check that tempFailServer.address() is non-null before
accessing .port (do not use tempFailServer!); if address is missing or not an
object, reject with a descriptive Error. Update any socket callbacks to guard
against tempFailServer being undefined if referenced elsewhere.
🧹 Nitpick comments (1)
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (1)

1959-1962: Use encodeURIComponent (or urlString) for dynamic URL segments.
This helper interpolates emailId into a URL directly; prefer encoding (and apply the same pattern to other new URL constructions in this file).

✅ Example adjustment
-  const response = await niceBackendFetch(`/api/v1/emails/outbox/${emailId}`, {
+  const response = await niceBackendFetch(`/api/v1/emails/outbox/${encodeURIComponent(emailId)}`, {
As per coding guidelines, “Use urlString`` or encodeURIComponent() instead of normal string interpolation for URLs, for consistency.”

Comment thread apps/backend/prisma/schema.prisma Outdated
Comment thread apps/backend/src/app/api/latest/emails/outbox/crud.tsx
Comment thread apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts
Comment thread apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts Outdated
Comment thread apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql Outdated
Comment thread apps/backend/prisma/schema.prisma Outdated
@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch 3 times, most recently from 4da2012 to 876c950 Compare February 13, 2026 22:14
@nams1570 nams1570 enabled auto-merge (squash) February 13, 2026 22:15
chose to name it failedSendAttemptCount as the nextSendRetryAt can only be set when a send attempt fails.
nextSendRetryAt controls the retry logic.
sendAttemptErrors is named as such because it tracks all of the send errors not just retry errors.
We also put in api schema updates and type updates.
Spending too much time on retries in the low level email sending clogs the email-queue-step.
This can lead to cascading delays if there is an outage with our sender provider.
This will cause a vercel runtime failure.
Removing the retry logic also obviates the need for separate funcs.
The comment indicates this isn't used anywhere else.
There are no uses of it in our codebase.
Now we mark emails to be retried in one iteration,
 and retry them in the next iteration.
Before, retryable emails went from sending -> scheduled -> sending
Now, we move through the queuing logic.
This simplifies the state transition functions and lifts the retry checking logic to only one func
 queuing.
We also add some checks to queuing to have it be robust.
Our queuing logic will be slow if we dont use an index.
isQueued is most of the time true, since emailoutbox table accounts for historical data and all sent emails have isQueued set to true.
So the index will actually be quite selective.
We can't just use the partial index on isQueued because its build on tenancyId, and the queuing query doesnt check tenancy id.
Rather than using an OR which may/may not use the index correctly,
we use two queries.
Without the not valid, the migration tries to create the constraint and validate it at the same time which can lock the entire table.
We decouple the validation and the constraint creation to avoid transaction timeouts
@nams1570 nams1570 force-pushed the spike-email-stuck-sending branch from 876c950 to 091616a Compare February 16, 2026 20:16
@nams1570 nams1570 merged commit fa360ab into dev Feb 16, 2026
26 of 28 checks passed
@nams1570 nams1570 deleted the spike-email-stuck-sending branch February 16, 2026 21:07
N2D4 pushed a commit that referenced this pull request Feb 17, 2026
### Context
Some of our users' emails were getting stuck in sending. The long delays
in processing the retries caused a vercel function timeout.

### Summary of Changes
We refactor the low level email sending functions to remove the retry
logic there. We kick it up to the email queue step. Additionally, we
flag emails to be retried when they encounter issues but leave it for a
future iteration to actually perform the retry. We perform an
exponential backoff with a random component to decide when they have to
be retried. We also make some small adjustments to the queuing function
to not queue skipped emails.

When an email fails to send during the sending function, we check to see
if it is a retryable error or not. Some errors are transient and trying
again may succeed while others indicate deeper issues. If it is
retryable, and the max number of retry attempts hasn't been reached, we
set `nextSendRetryAt` to a time determined by an exponential backoff
calculation function. When the queuing function looks for emails to
queue, it doesn't just pick up the `SCHEDULED`. emails whose
`scheduledAt` time <= `NOW()`, but also those emails whose
`nextSendRetryAt` time <= `NOW()`. What this means in practice is that
one iteration of the `email-queue-step` will mark emails as retryable
while another iteration will perform the retry. This should be cleaner
and prevent long delays in the `email-queue-step` process due to
retries. This also makes it easier to scale up the number of retries if
need be.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants