[Refactor] Change Retry Logic in Email Sending#1191
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds deferred email retry support: DB columns and CHECK constraints (staged validation), Prisma schema/index, API/read-model mappings, queue processor retry/backoff and per-attempt error tracking, simplified low-level send path, admin types, and extended E2E tests for retry behaviors. Changes
Sequence Diagram(s)sequenceDiagram
participant Q as Email Queue
participant P as Send Processor
participant DB as Database
participant SMTP as SMTP Provider
Q->>P: Fetch next email (respect nextSendRetryAt)
P->>DB: Claim email / read retry metadata
P->>SMTP: Attempt send via lowLevelSendEmailDirectWithoutRetries
alt Success
SMTP-->>P: 250 OK
P->>DB: Mark finishedSendingAt, clear retry fields
else Temporary/Retryable error
SMTP-->>P: Temp error
P->>P: increment sendRetries, compute backoff
P->>DB: Append attempt to sendAttemptErrors, set nextSendRetryAt
P->>Q: Unclaim/requeue for future
else Permanent error or max attempts reached
SMTP-->>P: Permanent error / exhausted
P->>DB: Mark finishedSendingAt with final error summary
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
0144e55 to
bd4d1a4
Compare
Greptile OverviewGreptile SummaryThis PR refactors email retry logic to prevent Vercel function timeouts by moving retry handling from the low-level email sender to the queue processing layer. Instead of blocking queue iterations with inline retries, emails with retryable errors are unclaimed and scheduled for future queue iterations using exponential backoff with jitter. Key changes:
Minor issue:
Confidence Score: 4/5
Important Files Changed
Sequence DiagramsequenceDiagram
participant Queue as Email Queue
participant DB as EmailOutbox DB
participant LowLevel as Low-Level Sender
participant Provider as Email Provider
Note over Queue,Provider: First Send Attempt
Queue->>DB: Claim email (set startedSendingAt, clear nextSendRetryAt)
Queue->>LowLevel: lowLevelSendEmailDirectWithoutRetries()
LowLevel->>Provider: Send email (no retries)
Provider-->>LowLevel: Error (retryable, e.g., 450)
LowLevel-->>Queue: Result.error(canRetry: true)
Note over Queue,DB: Schedule Deferred Retry
Queue->>DB: Update: failedSendAttemptCount++, set nextSendRetryAt (exp backoff), unclaim (startedSendingAt=null, isQueued=false)
Note over Queue,Provider: Future Queue Iteration
Queue->>DB: Find emails with nextSendRetryAt <= NOW()
Queue->>DB: Claim email for retry (set startedSendingAt, clear nextSendRetryAt)
Queue->>LowLevel: lowLevelSendEmailDirectWithoutRetries()
LowLevel->>Provider: Send email (no retries)
alt Success
Provider-->>LowLevel: Success
LowLevel-->>Queue: Result.ok()
Queue->>DB: Mark complete (finishedSendingAt)
else Retry Exhausted (attempt 3)
Provider-->>LowLevel: Error
LowLevel-->>Queue: Result.error()
Queue->>DB: Mark as permanent failure (finishedSendingAt, sendServerError*)
else Retryable (attempt < 3)
Provider-->>LowLevel: Error (retryable)
LowLevel-->>Queue: Result.error(canRetry: true)
Queue->>DB: Schedule another retry (failedSendAttemptCount++, nextSendRetryAt)
end
|
bd4d1a4 to
1eea2e9
Compare
There was a problem hiding this comment.
Pull request overview
Refactors email sending retry behavior to avoid long inline retry delays that can cause Vercel function timeouts, by moving retry scheduling/tracking into the email queue step and persisting retry metadata on outbox records.
Changes:
- Removed inline retry logic from low-level email sending, returning a single-attempt result to the caller.
- Added deferred retry tracking fields (
failedSendAttemptCount,nextSendRetryAt,sendAttemptErrors) to the EmailOutbox model and exposed them through CRUD + template/admin SDK types. - Implemented exponential backoff (with jitter) retry scheduling in the queue worker and added E2E coverage for retryable and non-retryable SMTP failures.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/template/src/lib/stack-app/index.ts | Re-exports new admin retry error type for SDK consumers. |
| packages/template/src/lib/stack-app/email/index.ts | Adds admin-facing retry fields/types to outbox models. |
| packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts | Maps new retry fields from CRUD responses into admin SDK types. |
| packages/stack-shared/src/interface/crud/email-outbox.ts | Extends shared CRUD schema to include retry metadata + attempt errors. |
| apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts | Updates expected outbox payload shape and adds deferred-retry E2E scenarios with a local SMTP stub. |
| apps/backend/src/lib/emails-low-level.tsx | Removes retry wrapper logic; logs and returns single-attempt send result. |
| apps/backend/src/lib/email-queue-step.tsx | Implements deferred retry scheduling + error tracking; adjusts queue selection/claiming to consider retries. |
| apps/backend/src/app/api/latest/emails/outbox/crud.tsx | Serializes retry fields into CRUD responses; resets retry fields when content changes. |
| apps/backend/prisma/schema.prisma | Adds retry fields to EmailOutbox model. |
| apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql | Creates DB columns + constraints for deferred retry fields. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@apps/backend/prisma/migrations/20260210000000_deferred_email_retry/migration.sql`:
- Around line 10-25: The three CHECK constraints on "EmailOutbox"
(EmailOutbox_nextSendRetryAt_requires_failure,
EmailOutbox_sendAttemptErrors_requires_failure,
EmailOutbox_no_retry_after_finished) must be added using the safer two-step
pattern: first ALTER TABLE "EmailOutbox" ADD CONSTRAINT ... CHECK(...) NOT VALID
for each constraint (brief exclusive lock), then run ALTER TABLE "EmailOutbox"
VALIDATE CONSTRAINT <constraint_name> for each (non-blocking SHARE UPDATE
EXCLUSIVE). Update the migration to add each constraint with NOT VALID and then
validate them in separate statements in the same migration.
In `@apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts`:
- Around line 2026-2033: getOutboxEmailById currently returns response.body
unconditionally which can return error payloads for 4xx/5xx responses; update
getOutboxEmailById to check the HTTP status (e.g., response.ok or
response.status) after the niceBackendFetch call and throw or return a clear
error when the response is not successful, including the status and response
body in the error message so callers (like the polling helpers) can detect and
handle "not found yet" vs real failures; reference the function name
getOutboxEmailById and the use of niceBackendFetch to locate where to add the
check and error handling.
🧹 Nitpick comments (9)
packages/template/src/lib/stack-app/email/index.ts (1)
39-47:timestampfield isstringwhile sibling date fields useDate.Other date-like fields in
AdminEmailOutboxBase(e.g.,createdAt,scheduledAt,nextSendRetryAt) are typed asDate, butAdminSendAttemptError.timestampis a rawstring. If this is intentional because it comes directly fromnew Date().toISOString()stored in JSON and you don't want to deserialize it, consider adding a brief doc comment noting the format (ISO 8601) so consumers know what to expect.apps/backend/prisma/schema.prisma (1)
847-853: Consider adding an index onnextSendRetryAtfor the retry query path.The
prepareSendPlanandclaimEmailsForSendingfunctions queryWHERE "nextSendRetryAt" IS NOT NULL AND "nextSendRetryAt" <= NOW(). The existingEmailOutbox_ordering_idx(Line 886) covers the normal queuing path but does not cover the retry path. With a potentially largeEmailOutboxtable (>1M rows per learnings), a partial index onnextSendRetryAtcould significantly improve retry query performance.Suggested partial index
@@index([tenancyId, finishedSendingAt(sort: Desc), scheduledAtIfNotYetQueued(sort: Desc), priority, id], map: "EmailOutbox_ordering_idx") @@index([tenancyId, simpleStatus], map: "EmailOutbox_simple_status_tenancy_idx") @@index([tenancyId, status], map: "EmailOutbox_status_tenancy_idx")In the migration SQL, add:
CREATE INDEX CONCURRENTLY "EmailOutbox_nextSendRetryAt_idx" ON "EmailOutbox" ("nextSendRetryAt") WHERE "nextSendRetryAt" IS NOT NULL;Based on learnings: "When writing database migration files, assume that we have >1,000,000 rows in every table."
apps/backend/src/lib/email-queue-step.tsx (3)
44-50: Minor formatting: missing space before=.-const appendSendAttemptError =( +const appendSendAttemptError = (
709-709: Type cast onrow.sendAttemptErrors— necessary but fragile.
row.sendAttemptErrors as SendAttemptError[] | nullis needed because Prisma's raw query returnsJsontype. Since this field is populated only by this same code path (viaappendSendAttemptError), the cast is safe in practice. However, if the JSON structure were ever corrupted or changed, this would silently produce incorrect data.Consider adding a defensive runtime check or using a validation/parsing step when reading from the raw query result, at least in development mode.
As per coding guidelines: "Any assumption you make should either be validated through the type system (preferred), assertions, or tests."
795-813: Unexpected errors bypass retry tracking.The outer
catchblock marks the email asserver-errorimmediately, without updatingfailedSendAttemptCountorsendAttemptErrors. This means an unexpected error (e.g., transient DB issue during the deliverability check) causes permanent failure even if the error is transient.This may be acceptable for truly unexpected errors, but consider whether some of these should also use the retry path. At minimum, the
sendServerErrorInternalDetailsshould include the currentfailedSendAttemptCountfor debugging:Suggested improvement
data: { finishedSendingAt: new Date(), canHaveDeliveryInfo: false, + failedSendAttemptCount: row.failedSendAttemptCount + 1, sendServerErrorExternalMessage: "An error occurred while sending the email. If you are the admin of this project, please check the email configuration and try again.", sendServerErrorExternalDetails: {}, sendServerErrorInternalMessage: errorToNiceString(error), - sendServerErrorInternalDetails: {}, + sendServerErrorInternalDetails: { previousAttemptCount: row.failedSendAttemptCount }, },apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (3)
1941-1944: Non-null assertion ontempFailPort.Line 1943 uses
tempFailPort!which violates the coding guideline preferring?? throwErr(...).Suggested fix
if (tempFailServer) { - return tempFailPort!; + return tempFailPort ?? throwErr("tempFailPort should be set when tempFailServer exists"); }As per coding guidelines: "Prefer
?? throwErr(...)over non-null assertions and fallback values, with good error messages explicitly stating assumptions."
2036-2049: Polling helpers useDate.now()for elapsed time measurement.
waitForOutboxEmail(Line 2037-2038) andwaitForAttemptCount(Line 2054-2055) both useDate.now()for timeout tracking. The same pattern appears in the inline polling loops at Lines 2112-2114 and 2234-2237.As per coding guidelines: "Don't use Date.now() for measuring elapsed (real) time, instead use
performance.now()."Also applies to: 2053-2071
1926-1935:brokenSmtpConfiguses raw port25— should this usewithPortPrefixfor consistency?The existing
testEmailConfig(Line 52) useswithPortPrefix("29")for port configuration, suggesting test ports are prefixed to avoid conflicts. ThebrokenSmtpConfiguses port25directly. Since this config intentionally targets a non-existent host, the port doesn't functionally matter, but it's a minor inconsistency.packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts (1)
696-697: Consider narrowing thecrudparameter type instead ofany.The function currently uses
anyfor the entirecrudparameter (with a valid eslint-disable). This predates this PR, but since you're extending the surface area with three new fields, it's worth noting that a more precise type (e.g.,EmailOutboxCrud["Admin"]["Read"]or similar) would catch mismatches at compile time — such as a typo incrud.failed_send_attempt_count. Low priority since the existing pattern is consistent throughout.
34b272d to
5194f5d
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@packages/template/src/lib/stack-app/apps/implementations/admin-app-impl.ts`:
- Around line 719-736: The code is using ad-hoc "as" casts for retry fields on
the crud object; update the function/signature that receives the crud parameter
to type it as EmailOutboxCrud (import from
`@stackframe/stack-shared/dist/interface/crud/email-outbox`), remove the
unnecessary "as" casts around send_retries, next_send_retry_at_millis and
send_attempt_errors, and let the typed properties be used directly when building
sendRetries, nextSendRetryAt and sendAttemptErrors (mapping the existing field
names to camelCase remains the same).
🧹 Nitpick comments (5)
apps/backend/prisma/migrations/20260213004424_email_outbox_is_queued_index/migration.sql (1)
1-6: Consider a partial index for the FALSE filter.
Since the hot path isisQueued = FALSE, a partial index can stay smaller and faster to maintain.🔧 Suggested migration tweak
-CREATE INDEX CONCURRENTLY IF NOT EXISTS "EmailOutbox_isQueued_idx" ON /* SCHEMA_NAME_SENTINEL */."EmailOutbox" ("isQueued"); +CREATE INDEX CONCURRENTLY IF NOT EXISTS "EmailOutbox_isQueued_false_idx" + ON /* SCHEMA_NAME_SENTINEL */."EmailOutbox" ("isQueued") + WHERE "isQueued" = FALSE;apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (4)
1872-1917: Replace non-null assertions with?? throwErr(...)pattern.Lines 1874 and 1906 use
!non-null assertions. The coding guidelines prefer?? throwErr(...)with explicit error messages.Suggested fix
async function startTempFailSmtpServer(): Promise<number> { if (tempFailServer) { - return tempFailPort!; + return tempFailPort ?? throwErr("tempFailServer exists but tempFailPort is null"); }- const address = tempFailServer!.address(); + const address = (tempFailServer ?? throwErr("tempFailServer is null after creation")).address();As per coding guidelines: "Prefer
?? throwErr(...)over non-null assertions and fallback values, with good error messages explicitly stating assumptions."
1969-2005: Useperformance.now()instead ofDate.now()for elapsed-time measurement in polling helpers.
waitForOutboxEmail(Line 1971) andwaitForAttemptCount(Line 1988) both useDate.now()to measure elapsed wall-clock time for timeout purposes. The same pattern is repeated in inline polling loops at lines 2046 and 2176.Suggested fix for waitForOutboxEmail
async function waitForOutboxEmail(subject: string, timeoutMs = 30000): Promise<OutboxEmailWithRetryFields> { - const startTime = Date.now(); - while (Date.now() - startTime < timeoutMs) { + const startTime = performance.now(); + while (performance.now() - startTime < timeoutMs) { const emails = await getOutboxEmails({ subject });Suggested fix for waitForAttemptCount
async function waitForAttemptCount(emailId: string, attemptCount: number, timeoutMs = 60000): Promise<OutboxEmailWithRetryFields> { - const startTime = Date.now(); - while (Date.now() - startTime < timeoutMs) { + const startTime = performance.now(); + while (performance.now() - startTime < timeoutMs) {Apply the same change to the inline polling loops at lines 2046–2048 and 2176–2178.
As per coding guidelines: "Don't use Date.now() for measuring elapsed (real) time, instead use
performance.now()."
1900-1902: Silently swallowing socket errors may hide issues during test debugging.The
socket.on('error', () => {})handler discards all socket errors. Consider at least logging tologIfTestFailsor stderr so unexpected connection issues are visible when tests fail.
2136-2192: Hardcoded retry limit should reference the actual backend constant to prevent drift.The test hardcodes
MAX_SEND_ATTEMPTS = 5in assertions at lines 2187–2188, which currently match the backend constant inapps/backend/src/lib/email-queue-step.tsx:22. Since the constant is not exported, consider either:
- Exporting
MAX_SEND_ATTEMPTSfrom the backend module and importing it into the test for a single source of truth, or- Adding a more explicit comment explaining why the value is hardcoded and where it should be updated if the backend constant changes.
The existing comments reference the constant name but don't fully document the maintenance burden.
5194f5d to
8647187
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@apps/backend/prisma/schema.prisma`:
- Around line 836-837: The comment for the Prisma field isQueued is outdated
because the retry/reschedule flow resets it to false; update the inline comment
on the isQueued Boolean field (in the Prisma schema where isQueued is declared)
to reflect that it indicates the email is currently queued but may be set back
to false on retry/reschedule (replace “Once queued, this stays TRUE” with
wording such as “Indicates the email is currently queued; may be reset to false
when retrying/rescheduling”).
In `@apps/backend/src/app/api/latest/emails/outbox/crud.tsx`:
- Around line 61-78: The code unsafely casts prismaModel.sendAttemptErrors to a
typed array and maps it; instead add a runtime guard that checks
Array.isArray(prismaModel.sendAttemptErrors) and if false calls throwErr(...)
with a descriptive message, then map the validated array to the snake_case
shape; replace the "as Array<...>" cast and ensure the variable name
sendAttemptErrors is set to null when prismaModel.sendAttemptErrors is
null/undefined, following the defensive pattern used in email-queue-step.tsx.
In `@apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts`:
- Around line 1971-1983: In the polling loops (functions waitForOutboxEmail,
waitForAttemptCount, the email status polling loop, and the retry-exhaustion
polling loop) replace uses of Date.now() for elapsed-time measurement with
performance.now(), i.e. set startTime = performance.now() and compare Date.now()
- startTime to timeoutMs using performance.now() - startTime instead; ensure the
same timeoutMs semantics are preserved and that performance is available in the
test runtime (import { performance } from 'perf_hooks' if Node doesn't provide
it globally).
- Around line 1872-1913: The startTempFailSmtpServer function uses non-null
assertions on tempFailPort and tempFailServer; replace them with defensive
checks: in the early-return branch verify tempFailPort is defined and return it
or throw a clear Error (referencing tempFailPort and tempFailServer), and inside
the listen callback check that tempFailServer.address() is non-null before
accessing .port (do not use tempFailServer!); if address is missing or not an
object, reject with a descriptive Error. Update any socket callbacks to guard
against tempFailServer being undefined if referenced elsewhere.
🧹 Nitpick comments (1)
apps/e2e/tests/backend/endpoints/api/v1/emails/email-queue.test.ts (1)
1959-1962: UseencodeURIComponent(orurlString) for dynamic URL segments.
This helper interpolatesemailIdinto a URL directly; prefer encoding (and apply the same pattern to other new URL constructions in this file).As per coding guidelines, “Use urlString`` or encodeURIComponent() instead of normal string interpolation for URLs, for consistency.”✅ Example adjustment
- const response = await niceBackendFetch(`/api/v1/emails/outbox/${emailId}`, { + const response = await niceBackendFetch(`/api/v1/emails/outbox/${encodeURIComponent(emailId)}`, {
4da2012 to
876c950
Compare
chose to name it failedSendAttemptCount as the nextSendRetryAt can only be set when a send attempt fails. nextSendRetryAt controls the retry logic. sendAttemptErrors is named as such because it tracks all of the send errors not just retry errors. We also put in api schema updates and type updates.
Spending too much time on retries in the low level email sending clogs the email-queue-step. This can lead to cascading delays if there is an outage with our sender provider. This will cause a vercel runtime failure. Removing the retry logic also obviates the need for separate funcs.
The comment indicates this isn't used anywhere else. There are no uses of it in our codebase.
Now we mark emails to be retried in one iteration, and retry them in the next iteration.
Before, retryable emails went from sending -> scheduled -> sending Now, we move through the queuing logic. This simplifies the state transition functions and lifts the retry checking logic to only one func queuing. We also add some checks to queuing to have it be robust.
Our queuing logic will be slow if we dont use an index. isQueued is most of the time true, since emailoutbox table accounts for historical data and all sent emails have isQueued set to true. So the index will actually be quite selective. We can't just use the partial index on isQueued because its build on tenancyId, and the queuing query doesnt check tenancy id.
Rather than using an OR which may/may not use the index correctly, we use two queries.
Without the not valid, the migration tries to create the constraint and validate it at the same time which can lock the entire table. We decouple the validation and the constraint creation to avoid transaction timeouts
876c950 to
091616a
Compare
### Context Some of our users' emails were getting stuck in sending. The long delays in processing the retries caused a vercel function timeout. ### Summary of Changes We refactor the low level email sending functions to remove the retry logic there. We kick it up to the email queue step. Additionally, we flag emails to be retried when they encounter issues but leave it for a future iteration to actually perform the retry. We perform an exponential backoff with a random component to decide when they have to be retried. We also make some small adjustments to the queuing function to not queue skipped emails. When an email fails to send during the sending function, we check to see if it is a retryable error or not. Some errors are transient and trying again may succeed while others indicate deeper issues. If it is retryable, and the max number of retry attempts hasn't been reached, we set `nextSendRetryAt` to a time determined by an exponential backoff calculation function. When the queuing function looks for emails to queue, it doesn't just pick up the `SCHEDULED`. emails whose `scheduledAt` time <= `NOW()`, but also those emails whose `nextSendRetryAt` time <= `NOW()`. What this means in practice is that one iteration of the `email-queue-step` will mark emails as retryable while another iteration will perform the retry. This should be cleaner and prevent long delays in the `email-queue-step` process due to retries. This also makes it easier to scale up the number of retries if need be.
Context
Some of our users' emails were getting stuck in sending. The long delays in processing the retries caused a vercel function timeout.
Summary of Changes
We refactor the low level email sending functions to remove the retry logic there. We kick it up to the email queue step. Additionally, we flag emails to be retried when they encounter issues but leave it for a future iteration to actually perform the retry. We perform an exponential backoff with a random component to decide when they have to be retried. We also make some small adjustments to the queuing function to not queue skipped emails.
When an email fails to send during the sending function, we check to see if it is a retryable error or not. Some errors are transient and trying again may succeed while others indicate deeper issues. If it is retryable, and the max number of retry attempts hasn't been reached, we set
nextSendRetryAtto a time determined by an exponential backoff calculation function. When the queuing function looks for emails to queue, it doesn't just pick up theSCHEDULED. emails whosescheduledAttime <=NOW(), but also those emails whosenextSendRetryAttime <=NOW(). What this means in practice is that one iteration of theemail-queue-stepwill mark emails as retryable while another iteration will perform the retry. This should be cleaner and prevent long delays in theemail-queue-stepprocess due to retries. This also makes it easier to scale up the number of retries if need be.Summary by CodeRabbit
New Features
Tests
Chores