@evanhutnik
Contributor

Summary

Add a separate retry queue for Gmail inbox-update Pub/Sub processing.

Problem

When a user performs a bulk operation (e.g., archiving 10,000 emails), we have to make a separate Gmail API call for each updated message. This exhausts the user's rate limit quota, which in turn causes webhook processing to retry repeatedly. Because all of these messages are processed by the same workers, the resulting backlog delays inbox updates for all users, not just the one performing the bulk operation.

Solution

Implement a two-tier webhook processing system:

  • Primary queue: Processes normal webhook operations. When rate-limited, offloads the operation to the retry queue and moves on.
  • Retry queue: Handles rate-limited operations separately with its own worker pool, allowing the primary queue to maintain low latency.

This architecture prevents head-of-line blocking: the primary queue continues processing fresh updates for all users while rate-limited operations are handled independently.
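
A minimal sketch of the primary-tier routing described above, using an in-memory `VecDeque` in place of the SQS retry queue. All names here (`GmailOutcome`, `Disposition`, `WebhookMessage`, `handle_primary`) are hypothetical and are not taken from the PR's code; the point is only that a rate-limited message is handed off immediately rather than retried in place.

```rust
use std::collections::VecDeque;

/// Simplified result of one Gmail API attempt (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GmailOutcome {
    Ok,
    RateLimited,
}

/// What the primary worker did with a message after a single attempt.
#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    Done,
    MovedToRetryQueue,
}

/// Hypothetical webhook message, reduced to the user it belongs to.
#[derive(Debug, Clone, PartialEq, Eq)]
struct WebhookMessage {
    user: String,
}

/// Primary-tier handling: try once, and on a rate limit push the message
/// onto the retry queue instead of blocking other users' updates.
fn handle_primary(
    msg: WebhookMessage,
    call_gmail: impl Fn(&WebhookMessage) -> GmailOutcome,
    retry_queue: &mut VecDeque<WebhookMessage>,
) -> Disposition {
    match call_gmail(&msg) {
        GmailOutcome::Ok => Disposition::Done,
        GmailOutcome::RateLimited => {
            // Stand-in for "enqueue to the SQS retry queue and mark as processed".
            retry_queue.push_back(msg);
            Disposition::MovedToRetryQueue
        }
    }
}
```

In either case the primary worker is finished with the message after one attempt, so a user who has exhausted their quota cannot hold up messages for other users.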

Changes

  • Rate-limited operations in the primary worker are enqueued to a new SQS retry queue and marked as processed
  • A separate pool of retry workers consumes from this queue
  • Retry workers treat rate limits as retryable errors (with a high maxReceiveCount to reduce the chance that a message ends up in the dead-letter queue)
  • Added check_gmail_rate_limit_webhook() helper to deduplicate rate limit handling logic
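
To illustrate why a high maxReceiveCount matters for the retry workers, here is a simplified model of SQS redelivery, not the PR's implementation. It assumes the standard SQS behavior that a message which is received but not deleted becomes visible again, and is moved to the dead-letter queue once its receive count would exceed maxReceiveCount. `FinalState` and `drain_with_redelivery` are hypothetical names.

```rust
/// Where a single retry-queue message ends up (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum FinalState {
    Processed { receives: u32 },
    DeadLettered,
}

/// Simulates redelivery of one message. `attempt` is called with the
/// 1-based receive count and returns true on success, false when the
/// Gmail API is still rate-limiting the user.
fn drain_with_redelivery(
    max_receive_count: u32,
    mut attempt: impl FnMut(u32) -> bool,
) -> FinalState {
    for receive in 1..=max_receive_count {
        if attempt(receive) {
            // Success: the worker deletes the message from the queue.
            return FinalState::Processed { receives: receive };
        }
        // Failure: the worker returns a retryable error, the message is
        // not deleted, and it is redelivered after the visibility timeout.
    }
    // Receive budget exhausted: the message goes to the dead-letter queue.
    FinalState::DeadLettered
}
```

For example, if the quota only recovers by the fourth receive, `drain_with_redelivery(10, |n| n >= 4)` ends in `Processed { receives: 4 }`, while the same message with a budget of 3 is dead-lettered.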

Testing

Verified in dev environment:

  • Primary queue continues processing during bulk operations
  • Rate-limited operations successfully migrate to retry queue
  • Retry workers properly handle backoff and eventual success

@evanhutnik evanhutnik requested review from a team as code owners December 1, 2025 22:01
linear bot commented Dec 1, 2025

);

// separate queue for retries to avoid backups for large inbox updates that hit gmail api rate limit
for worker in webhook_retry_workers {
Member

I think it's time to move out the workers into their own container that runs isolated from email service.

Member

please make a ticket for that and implement when you have the bandwidth, medium priority

Contributor Author

agreed, will do

Comment on lines 37 to 39
Err(anyhow::anyhow!(
"gmail_webhook_retry_queue is not configured"
))
Member

Prefer anyhow::bail! here

Member

@whutchinson98 whutchinson98 left a comment

small nit but overall seems fine.

we really should be running these workers in a separate container to the main email service though.

@evanhutnik evanhutnik merged commit b655dbf into main Dec 2, 2025
34 checks passed
@evanhutnik evanhutnik deleted the evan/ema-42-improve-rate-limit-handling-for-webhook-pubsub-queue branch December 2, 2025 16:10
3 participants