
Add a dead letter queue and re-add more bare-boned metrics #488

Merged
jon-bell merged 8 commits into staging from dead-letter-queue on Nov 24, 2025

Conversation

@jon-bell
Contributor

@jon-bell jon-bell commented Nov 13, 2025

Summary by CodeRabbit

  • New Features

    • Dead-letter queue and tracking for failed async background messages
    • New APIs exposing async queue sizes and circuit-breaker statuses
  • Improvements

    • Simplified metrics generation with lighter Prometheus output
    • Reduced sync concurrency when using the Upstash-backed limiter


@coderabbitai
Contributor

coderabbitai bot commented Nov 13, 2025

Walkthrough

Adds a dead-letter queue (DLQ) and tracking table, RPCs for queue and circuit-breaker metrics, DLQ routing and enriched error logging in the GitHub async worker, a metrics function refactor to use two RPCs, a small Upstash concurrency tweak, and a Supabase client version bump.

Changes

Cohort / File(s) Summary
Type definitions
supabase/functions/_shared/SupabaseTypes.d.ts, utils/supabase/SupabaseTypes.d.ts
Added async_worker_dlq_messages table types (Row/Insert/Update/Relationships) and function signatures get_async_queue_sizes() and get_circuit_breaker_statuses().
DB migration / DLQ
supabase/migrations/20251113141300_create-async-worker-dlq.sql
Creates async_calls_dlq queue, public.async_worker_dlq_messages table (columns, indexes, FK to classes), RLS policies, and RPCs get_async_queue_sizes() and get_circuit_breaker_statuses() with service_role permissions.
Async worker DLQ handling
supabase/functions/github-async-worker/index.ts
Adds DLQ routing and tracking: send to DLQ and insert DLQ record on retry/circuit/rate-limit/error branches (including an env-driven DLQ-failure injection), archive original message, and enrich Sentry context.
Metrics simplification
supabase/functions/metrics/index.ts
Removed prior ClassMetrics and generator helper; switched to two RPCs (get_async_queue_sizes, get_circuit_breaker_statuses) and produce compact Prometheus output for queue sizes and circuit-breakers.
Concurrency tweak
supabase/functions/_shared/GitHubSyncHelpers.ts
Reduced Upstash-backed global sync limiter parameters (reservoir, maxConcurrent, reservoirRefreshAmount) from 30 → 20.
Dependency bump
package.json
Bumped supabase dependency and devDependency from ^2.23.4 to ^2.58.5.

Sequence Diagram(s)

sequenceDiagram
    actor Worker as Async Worker
    participant Queue as pgmq q_async_calls
    participant Handler as Message Handler
    participant DLQ as pgmq q_async_calls_dlq
    participant DB as async_worker_dlq_messages
    participant Sentry

    Worker->>Queue: consume message
    Handler->>Handler: process message (circuit/rate-limit checks)
    alt route to DLQ (retry_count ≥ 5 or circuit/rate-limit/error + injection)
        Handler->>DLQ: enqueue envelope
        Handler->>DB: insert DLQ tracking row (envelope, error, retry_count, class_id, debug_id, log_id)
        Handler->>Sentry: capture error with DLQ context
        Note right of DB: archive original message
    else normal retry/requeue or archive
        Handler->>Queue: requeue / delay / archive per logic
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Pay special attention to DLQ routing branches in supabase/functions/github-async-worker/index.ts (multiple conditional paths, Sentry enrichment, archive vs requeue decisions).
  • Review migration SQL (20251113141300_create-async-worker-dlq.sql) for schema, indexes, RLS policies, FK behavior, and RPC security.
  • Verify SupabaseTypes updates match migration and RPC return shapes.
  • Check metrics output correctness and label escaping in supabase/functions/metrics/index.ts.
  • Confirm package bump compatibility and Upstash limiter change impacts.

Possibly related PRs

Poem

A message tired of spinning round,
Is gently sent to quieter ground,
Counts and breakers now in view,
Logs more tidy, errors too,
New guards hum while metrics sound.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.57%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes: adding a dead letter queue system and implementing simplified metrics functionality.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (8)
supabase/functions/github-async-worker/index.ts (4)

252-344: DLQ helper is well-scoped; consider surfacing failures to callers.

The DLQ helper cleanly encapsulates queue send + tracking-table insert and Sentry context. However, all failures are swallowed (logged only) and callers always proceed to archive the original message. That means if both the DLQ send and the tracking-table insert fail, the message is permanently lost even though the DLQ path failed.

If you want stricter guarantees, consider propagating a failure signal (e.g., boolean/Result or thrown error) so callers can decide not to archive when DLQ persistence fails, or at least differentiate between “successfully dead-lettered” and “failed to dead-letter.”
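The propagation idea above can be sketched as follows (all names here are illustrative, not the PR's actual code): the DLQ helper reports whether either persistence path succeeded, and the caller archives the original message only when at least one did.

```typescript
// Sketch only: stand-ins for the real pgmq send, tracking-table insert, and
// archive calls are passed in as thunks so the decision logic is isolated.
type DlqOutcome = { queued: boolean; tracked: boolean };

async function sendToDlq(
  enqueue: () => Promise<void>,
  insertTrackingRow: () => Promise<void>
): Promise<DlqOutcome> {
  const outcome: DlqOutcome = { queued: false, tracked: false };
  try {
    await enqueue();
    outcome.queued = true;
  } catch (err) {
    console.error("DLQ enqueue failed", err);
  }
  try {
    await insertTrackingRow();
    outcome.tracked = true;
  } catch (err) {
    console.error("DLQ tracking insert failed", err);
  }
  return outcome;
}

// Caller archives only if the message survived somewhere; otherwise it is
// left unarchived so the queue can redeliver it.
async function deadLetterThenMaybeArchive(
  enqueue: () => Promise<void>,
  insertTrackingRow: () => Promise<void>,
  archive: () => Promise<void>
): Promise<boolean> {
  const { queued, tracked } = await sendToDlq(enqueue, insertTrackingRow);
  if (queued || tracked) {
    await archive();
    return true;
  }
  return false;
}
```

The boolean return is the minimal version of the "failure signal" suggested above; a richer Result type could also distinguish "queued but not tracked" from "tracked but not queued".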


479-486: Org-level circuit breaker DLQ threshold looks correct.

The guard to send to DLQ when retry_count >= 5 under an open org-level circuit is consistent with the requeue strategy: you get the initial attempt plus up to five retries before dead-lettering, and the original message is archived in the same branch.

If you find yourself adding more similar blocks, it might be worth extracting a small helper like if (shouldDeadLetter(envelope)) … to avoid repeating the same pattern.


1116-1122: Rate-limit branch DLQ cutoff is placed in a sensible position.

Checking retry_count and routing to DLQ before further circuit-breaker logic ensures rate-limited messages don’t requeue indefinitely. The archive call after sendToDeadLetterQueue matches the rest of the worker’s pattern.

Same note as above: if this pattern grows further, a shared helper to encapsulate the DLQ + archive decision would reduce duplication.


1214-1220: Fallback error path correctly participates in DLQ policy.

The final error path (where org is unknown) now also honors the retry_count >= 5 rule before requeueing. This ensures all error scenarios eventually converge on DLQ, not just those with a recognized org/method.

No functional problems; just be aware the same retry logic is now duplicated in three places, so a small helper could make future changes less error-prone.

supabase/functions/metrics/index.ts (1)

37-45: Label escaping helper and circuit-breaker metrics look good; consider a small null-guard.

The escapeLabel helper correctly handles backslashes, quotes, and control characters, and the per-breaker metric lines follow Prometheus label syntax. Assuming scope, key, and state are always non-null strings from the RPC, this is solid.

If there’s any chance of scope/key/state being null or undefined, a small defensive cast (e.g., escapeLabel(cb.scope ?? "unknown")) would prevent runtime errors.

Also applies to: 63-71
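A minimal sketch of the escaping plus the suggested null-guard (the character set follows the review's description of the PR's escapeLabel; the labelValue wrapper name is hypothetical):

```typescript
// Escape Prometheus label values: backslash must be handled first so the
// later escapes aren't doubled. The review notes the PR's helper also
// covers carriage returns and tabs, so they are included here.
function escapeLabel(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n")
    .replace(/\r/g, "\\r")
    .replace(/\t/g, "\\t");
}

// Defensive wrapper for possibly-null RPC fields, as suggested above.
function labelValue(value: string | null | undefined): string {
  return escapeLabel(value ?? "unknown");
}
```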

supabase/functions/_shared/SupabaseTypes.d.ts (3)

9568-9571: Batch gradebook recalc RPC is okay; consider documenting the expected row shape for p_rows.

enqueue_gradebook_row_recalculation_batch with p_rows: Json[] is flexible but untyped. If this is going to be used from multiple places, it may be worth standardizing a TS interface (in app code) that documents the expected keys in each row so you don’t lose type‑safety end‑to‑end.


9998-10005: Gradebook column update‑with‑recalc RPCs are consistent with existing patterns.

update_gradebook_column_student_with_recalc (p_id, p_updates: Json) and update_gradebook_column_students_batch_with_recalc (p_updates: Json[], returns Json) follow the established “updates as Json payload” pattern. From a types standpoint this is fine; if you find yourself reusing these payload shapes heavily, you could introduce shared TS types on the client to avoid ad‑hoc Json everywhere.


10017-10020: update_gradebook_rows_batch RPC signature is reasonable for bulk updates.

The p_batch_updates: Json[] argument with a Json return is in line with the other batch gradebook APIs. No obvious type issues here; just keep the payload contract documented somewhere outside this generated file so callers know exactly what to send and what to expect back.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee56d52 and cdc68c4.

📒 Files selected for processing (5)
  • supabase/functions/_shared/SupabaseTypes.d.ts (9 hunks)
  • supabase/functions/github-async-worker/index.ts (6 hunks)
  • supabase/functions/metrics/index.ts (1 hunks)
  • supabase/migrations/20251113141300_create-async-worker-dlq.sql (1 hunks)
  • utils/supabase/SupabaseTypes.d.ts (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
supabase/functions/github-async-worker/index.ts (2)
supabase/functions/_shared/SupabaseTypes.d.ts (2)
  • Database (3-10099)
  • Json (1-1)
supabase/functions/_shared/GitHubAsyncTypes.ts (1)
  • GitHubAsyncEnvelope (99-107)
utils/supabase/SupabaseTypes.d.ts (1)
supabase/functions/_shared/SupabaseTypes.d.ts (1)
  • Json (1-1)
supabase/functions/_shared/SupabaseTypes.d.ts (2)
utils/supabase/SupabaseTypes.d.ts (2)
  • Json (1-1)
  • Database (3-10099)
utils/supabase/DatabaseTypes.d.ts (1)
  • Json (3-3)
⏰ Context from checks skipped due to timeout of 180000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: deploy
🔇 Additional comments (11)
supabase/functions/github-async-worker/index.ts (2)

505-512: Method-level circuit breaker DLQ behavior mirrors org-level logic.

The method-specific circuit breaker correctly reuses the same retry_count >= 5 threshold and archive flow, with a more specific error message keyed on the method. This keeps behavior consistent between org-scoped and org+method-scoped breakers.

No functional issues spotted here.


1171-1177: Non-rate-limit (org-known) error DLQ handling is consistent.

For non-rate-limit errors where org is known, the new retry_count >= 5 guard sends messages to DLQ and archives them instead of requeuing, which aligns with the rate-limit path and prevents endless retries on persistent failures.

Behavior looks correct and keeps the immediate circuit-breaker semantics intact.

supabase/functions/metrics/index.ts (1)

17-33: RPC-based metrics fetch is straightforward and robust.

The switch to two focused RPCs with immediate error checks (queueError / circuitBreakerError) simplifies the metrics path and keeps failure modes explicit. Using the first row of queueSizes as the source of async/DLQ counts is reasonable given the RPC contract.

No issues from a control-flow perspective.

utils/supabase/SupabaseTypes.d.ts (3)

858-910: DLQ table typing looks consistent with usage patterns

The async_worker_dlq_messages table definition (column nullability, use of Json, and github_async_method enum, plus class_id FK) matches the patterns used elsewhere (e.g. api_gateway_calls) and looks correct for DLQ usage. No issues from a typing perspective.

Please just confirm that this matches the actual SQL migration (column names, nullability, and enum type) so codegen and database stay in sync.


4989-5038: review_assignments.hard_deadline field shape looks right; check downstream handling

hard_deadline is non-nullable on Row and optional on Insert/Update, which is the expected pattern for a NOT NULL column with a default (likely false). Typing looks fine and should be backwards compatible for inserts.

I’d recommend:

  • Verifying the migration sets an explicit default for hard_deadline.
  • Double-checking any UI/API code that reasons about review deadlines now considers this flag where appropriate.

9301-9317: New RPC type signatures are coherent; ensure they match the SQL functions

The added function typings (check_can_add_to_help_request, check_can_remove_from_help_request, the gradebook with_recalc helpers, and the async queue / circuit-breaker metrics RPCs) all have sensible argument names and return types, and they’re aligned with existing conventions (e.g. Returns: undefined for void, Json / Json[] for opaque payloads).

Please confirm against the SQL functions that:

  • Parameter names and orders match exactly (especially for the batch Json[] arguments).
  • Result column names in get_async_queue_sizes and get_circuit_breaker_statuses (async_queue_size, dlq_queue_size, is_open, key, etc.) are spelled identically, so row mapping stays type-safe.

Also applies to: 9568-9571, 9599-9615, 9998-10005, 10017-10020
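For illustration, the client-side row shapes implied by the column names mentioned above might look like the following; the field lists are assumptions that must be verified against the SQL function definitions.

```typescript
// Assumed row shapes for the two metrics RPCs; column names are taken from
// the review comment, everything else is illustrative.
type AsyncQueueSizesRow = {
  async_queue_size: number;
  dlq_queue_size: number;
};

type CircuitBreakerStatusRow = {
  scope: string; // e.g. "org" vs "org+method" (assumed)
  key: string;
  is_open: boolean;
};

// Both RPCs return arrays of rows. get_async_queue_sizes is described as
// single-row, so callers take row [0] with a defensive default.
function readQueueSizes(rows: AsyncQueueSizesRow[]): AsyncQueueSizesRow {
  return rows[0] ?? { async_queue_size: 0, dlq_queue_size: 0 };
}
```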

supabase/migrations/20251113141300_create-async-worker-dlq.sql (1)

1-60: DLQ table, indexes, and RLS policies look good.

The table design with comprehensive error tracking fields, appropriate indexes on frequently-queried columns, foreign key with cascade delete, and restrictive RLS policies are all solid. The pgmq queue creation with exception handling is also a good pattern.

supabase/functions/_shared/SupabaseTypes.d.ts (4)

858-910: DLQ table schema looks consistent; confirm envelope contents are appropriately sanitized.

The async_worker_dlq_messages table shape and enum usage (github_async_method) look aligned with the rest of the async/GitHub telemetry. The only thing to double‑check is that envelope and last_error_context don’t carry raw secrets (tokens, headers, etc.) before being stored, since the type is a generic Json and can’t enforce that at compile time.


4988-5038: hard_deadline field wiring looks correct; ensure DB nullability/defaults match the type.

Adding hard_deadline: boolean on Row with optional flags on Insert/Update matches how other booleans are modeled here. Just make sure the underlying column is NOT NULL with a sensible default (probably false), otherwise Supabase could hand you null at runtime where the type says plain boolean.


9301-9317: Help‑request permission RPC signatures are clear and consistent.

The new check_can_add_to_help_request / check_can_remove_from_help_request RPC types look fine. The only thing to keep straight in calling code is that p_user_id is an auth/user id while p_profile_id_to_remove is a profile id; the type system can’t distinguish those, so it’s worth being deliberate at call sites.


9599-9615: Async queue and circuit‑breaker metrics RPC types look good.

The result shapes for get_async_queue_sizes and get_circuit_breaker_statuses are straightforward and match the intended use for lightweight metrics RPCs. Just ensure the callers treat these as arrays of rows (as modeled here) and not a single object.

coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 17, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
supabase/functions/metrics/index.ts (2)

17-19: Consider parallelizing independent RPC calls.

The two RPC calls are independent and could be executed concurrently using Promise.all to reduce latency.

Apply this diff to parallelize the calls:

-    // Query queue sizes and circuit breaker statuses using RPC functions
-    const { data: queueSizes, error: queueError } = await supabase.rpc("get_async_queue_sizes");
-    const { data: circuitBreakers, error: circuitBreakerError } = await supabase.rpc("get_circuit_breaker_statuses");
+    // Query queue sizes and circuit breaker statuses using RPC functions
+    const [
+      { data: queueSizes, error: queueError },
+      { data: circuitBreakers, error: circuitBreakerError }
+    ] = await Promise.all([
+      supabase.rpc("get_async_queue_sizes"),
+      supabase.rpc("get_circuit_breaker_statuses")
+    ]);

47-61: Metrics format looks correct with proper timestamps.

The Prometheus text exposition format is correct, and the timestamp is now properly in milliseconds as required.

Optional: Consider making the version configurable.

The version "1.0.0" is hardcoded in line 49. You might want to make this configurable via environment variable or import it from a version file for easier maintenance.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cdc68c4 and 28af7ca.

📒 Files selected for processing (2)
  • supabase/functions/metrics/index.ts (1 hunks)
  • supabase/migrations/20251113141300_create-async-worker-dlq.sql (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • supabase/migrations/20251113141300_create-async-worker-dlq.sql
⏰ Context from checks skipped due to timeout of 180000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: deploy
🔇 Additional comments (5)
supabase/functions/metrics/index.ts (5)

21-29: LGTM!

Error handling is correct with appropriate logging before throwing, ensuring errors are captured and reported properly.


35-35: Timestamp units issue resolved.

The previous review correctly identified that the timestamp should be in milliseconds for Prometheus. This has been fixed—Date.now() returns milliseconds directly.


37-45: LGTM!

The label escaping function correctly handles all required Prometheus special characters (backslash, quotes, newline, carriage return, tab).


63-72: LGTM!

The circuit breaker metrics generation correctly handles empty arrays, properly escapes label values, and follows the Prometheus gauge format with appropriate timestamp usage.


31-32: No issues found—code correctly matches RPC contract.

The get_async_queue_sizes() RPC is designed to return exactly one row with aggregated queue counts. Accessing [0] is appropriate, and the optional chaining with default values provides proper defensive handling.

coderabbitai[bot]
coderabbitai bot previously approved these changes Nov 17, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
supabase/functions/_shared/SupabaseTypes.d.ts (2)

870-922: DLQ table typings look consistent; consider whether retry_count should be required on insert

async_worker_dlq_messages matches the usual Row/Insert/Update pattern and reuses existing types (Json, github_async_method, class_id FK) consistently. The only question is whether Insert["retry_count"] should be required: if the DB column has a default (e.g., 0 for first insertion), you may want to make it optional in Insert like github_circuit_breakers.trip_count to keep callers from having to thread an explicit 0 everywhere. If there is no default by design, current typing is fine.


9585-9603: Arg-less RPC typings work, but Args shape is inconsistent with other no-arg functions

These new RPCs’ return types look good and line up with their intended metrics payloads. For Args, you’ve used Record<PropertyKey, never>, whereas most other arg-less functions here use Args: never. Both can work with supabase-js, but mixing conventions is a bit surprising; if these types weren’t generated, consider standardizing on whichever form your codegen or client version expects (likely never) for easier mental parsing and to avoid subtle rpc call-site friction.

supabase/functions/github-async-worker/index.ts (2)

479-486: Clarify and centralize retry threshold semantics used for DLQ routing.

All DLQ routing branches gate on envelope.retry_count ?? 0 >= 5, which in combination with requeueWithDelay means a message can be processed up to 6 times (initial attempt + 5 requeues) before it is finally dead-lettered.

This is totally reasonable, but it’s subtle and duplicated several times (circuit-open pre-checks and all error paths). I’d suggest:

  • Extracting a small helper (e.g. hasExceededRetryThreshold(envelope, limit = 5)) and reusing it in all locations.
  • Double-checking whether the intention is “5 retries” (i.e. 6 total attempts) or “5 attempts total”; if the latter, the comparison should be > 4 after increment, rather than >= 5 on the pre-increment value.

This will make the behavior easier to reason about and harder to accidentally diverge across branches.

Also applies to: 505-512, 1133-1139, 1188-1194, 1231-1237
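A sketch of the proposed helper (the name and envelope shape are illustrative). Checked against the pre-increment retry_count, a limit of 5 means the initial attempt plus five retries, i.e. six processings in total, before dead-lettering:

```typescript
// Minimal envelope shape for this sketch; the real GitHubAsyncEnvelope
// carries more fields.
interface EnvelopeLike {
  retry_count?: number | null;
}

// True once the message has been retried `limit` times, matching the
// duplicated `(envelope.retry_count ?? 0) >= 5` guards in the worker.
function hasExceededRetryThreshold(envelope: EnvelopeLike, limit = 5): boolean {
  return (envelope.retry_count ?? 0) >= limit;
}
```

Centralizing the comparison in one place makes the "5 retries vs 5 total attempts" decision explicit and keeps the branches from drifting apart.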


528-543: DLQ failure injection is powerful—consider stricter scoping to non‑prod environments.

The INJECT_DLQ_FAILURE gate plus a 50% random failure on first attempt will reliably drive messages to the DLQ threshold, but if this env var is ever enabled in production it will:

  • Artificially inflate error and circuit‑breaker metrics, and
  • Potentially trip long‑lived breakers.

To reduce blast radius, consider also checking an environment flag for the deployment stage (or a dedicated INJECT_DLQ_FAILURE_ENV=staging-style guard), or tightening the condition to only act when retry_count >= 1 (no random failures on the first attempt) so it doesn’t penalize fresh messages.

Example tweak:

-  if (injectDlqFailure === "true" || injectDlqFailure === "1") {
-    if (retryCount >= 1 || Math.random() < 0.5) {
+  if (injectDlqFailure === "true" || injectDlqFailure === "1") {
+    if (retryCount >= 1) {
       const error = new Error(`DLQ test failure injection - simulating processing error (retry ${retryCount})`);
       scope.setTag("dlq_test_injection", "true");
       scope.setContext("dlq_test_injection", {
         retry_count: retryCount,
         method: envelope.method
       });
       throw error;
     }
   }

or guard it further with a non‑prod env check.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 28af7ca and 93d1f67.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (7)
  • package.json (1 hunks)
  • supabase/functions/_shared/GitHubSyncHelpers.ts (1 hunks)
  • supabase/functions/_shared/SupabaseTypes.d.ts (2 hunks)
  • supabase/functions/github-async-worker/index.ts (7 hunks)
  • supabase/functions/metrics/index.ts (1 hunks)
  • supabase/migrations/20251113141300_create-async-worker-dlq.sql (1 hunks)
  • utils/supabase/SupabaseTypes.d.ts (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • utils/supabase/SupabaseTypes.d.ts
🧰 Additional context used
🧬 Code graph analysis (3)
supabase/functions/metrics/index.ts (1)
tests/e2e/TestingUtils.ts (1)
  • supabase (12-12)
supabase/functions/github-async-worker/index.ts (2)
supabase/functions/_shared/SupabaseTypes.d.ts (2)
  • Database (3-10118)
  • Json (1-1)
supabase/functions/_shared/GitHubAsyncTypes.ts (1)
  • GitHubAsyncEnvelope (99-107)
supabase/functions/_shared/SupabaseTypes.d.ts (1)
utils/supabase/SupabaseTypes.d.ts (1)
  • Json (1-1)
⏰ Context from checks skipped due to timeout of 180000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: deploy
🔇 Additional comments (3)
supabase/functions/_shared/GitHubSyncHelpers.ts (1)

71-73: Clarify the rationale for reducing concurrency limits from 30 to 20.

The Upstash-backed rate limiter's reservoir, maxConcurrent, and reservoirRefreshAmount are all being reduced from 30 to 20. This represents a ~33% reduction in concurrent sync operations, which could impact throughput during bulk sync operations across multiple repositories.

While tighter concurrency controls may improve stability and reduce GitHub API rate limit issues (especially with the new DLQ system handling failures), it would be helpful to understand:

  1. What drove the decision to reduce to 20 specifically?
  2. Has there been testing or monitoring to validate this won't create bottlenecks?
  3. Are there metrics in place (perhaps via the new get_async_queue_sizes RPC mentioned in the AI summary) to monitor queue depth and latency?

Consider documenting the rationale in a comment near line 58, and ensure the new metrics endpoints can track if this concurrency limit becomes a bottleneck.

package.json (1)

178-178: Verify Supabase CLI v2.58.5 compatibility with existing scripts and workflows.

The changelog for Supabase CLI v2.56.0 through v2.58.5 reveals no explicit breaking changes in the visible releases—mostly bug fixes and feature additions (JWT issuer support, auth notification templates, type-generation improvements). Type-generation improvements require recent CLI versions (example: supabase-js v2.75.1 + CLI > 2.53.1), which v2.58.5 satisfies.

However, the full changelog from v2.23.4 through v2.55.x could not be fully retrieved. Before merging:

  1. Test type generation: Run the script at lines 21–22 (npx supabase gen types) locally to confirm output matches expectations
  2. Verify migrations: Ensure existing migrations in supabase/migrations/ still apply without errors
  3. Check DLQ/RPC features: Confirm the new DLQ table and RPC definitions work as expected with v2.58.5
supabase/functions/metrics/index.ts (1)

17-37: Metrics path looks sane; RPC usage and Prometheus formatting are consistent.

The simplified metrics endpoint using get_async_queue_sizes and get_circuit_breaker_statuses is straightforward, and the Prometheus output now uses millisecond timestamps (Date.now()), which aligns with the text exposition format expectations. Label escaping for circuit‑breaker dimensions is also a good touch.

Only thing to keep an eye on is that queueSizes?.[0] assumes a single-row result from get_async_queue_sizes(); that matches the current function definition, but if it ever changes to multiple rows you’ll want to aggregate explicitly.

Overall this looks good to ship as-is.

Also applies to: 48-63, 68-75

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
supabase/migrations/20251113141300_create-async-worker-dlq.sql (1)

61-61: Update comment to reflect that this RPC only returns queue sizes.

The comment incorrectly states this function returns "async queue sizes and circuit breaker statuses," but get_async_queue_sizes() only returns queue sizes. Circuit breaker statuses come from the separate get_circuit_breaker_statuses() function below.

🧹 Nitpick comments (2)
supabase/functions/github-async-worker/index.ts (2)

252-450: Consider extracting duplicated Sentry logging into a helper function.

The Sentry context and message logging is repeated identically in four error branches (lines 276-300, 309-333, 364-388, 397-421) and again in the success path (lines 425-447). This duplication makes maintenance harder and increases the risk of inconsistency.

Extract the shared logging into a helper:

function logDlqAttempt(
  scope: Sentry.Scope,
  envelope: GitHubAsyncEnvelope,
  meta: { msg_id: number; enqueued_at: string },
  errorMessage: string,
  errorType: string,
  retryCount: number,
  success: boolean
) {
  scope.setTag("dlq", "true");
  scope.setTag("retry_count", String(retryCount));
  scope.setContext("dead_letter_queue", {
    original_msg_id: meta.msg_id,
    method: envelope.method,
    retry_count: retryCount,
    error_message: errorMessage,
    error_type: errorType,
    enqueued_at: meta.enqueued_at,
    class_id: envelope.class_id,
    debug_id: envelope.debug_id,
    log_id: envelope.log_id
  });
  
  const action = success ? "sent to" : "Failed to send message to";
  Sentry.captureMessage(
    `${action} dead letter queue after ${retryCount} retries: ${envelope.method}`,
    {
      level: "error",
      tags: {
        dlq: "true",
        method: envelope.method,
        retry_count: String(retryCount)
      }
    }
  );
}

Then replace all five duplicated blocks with a single call to this helper.


585-603: Consider extracting the DLQ routing pattern into a helper function.

The circuit breaker DLQ routing logic is duplicated at lines 585-603 and 622-640, differing only in the error message. This pattern is repeated two more times in the rate limit and error handlers (lines 1263-1280, 1329-1346, 1383-1400).

A helper function would reduce duplication:

async function routeToDlqOrLeaveUnarchived(
  adminSupabase: SupabaseClient<Database>,
  envelope: GitHubAsyncEnvelope,
  meta: { msg_id: number; enqueued_at: string },
  error: Error,
  scope: Sentry.Scope
): Promise<boolean> {
  const dlqSuccess = await sendToDeadLetterQueue(adminSupabase, envelope, meta, error, scope);
  if (dlqSuccess) {
    await archiveMessage(adminSupabase, meta.msg_id, scope);
    return true;
  } else {
    console.error(`Failed to send message ${meta.msg_id} to DLQ, leaving unarchived`);
    scope.setContext("dlq_archive_skipped", {
      msg_id: meta.msg_id,
      reason: "DLQ send failed"
    });
    Sentry.captureMessage(`Message ${meta.msg_id} not archived due to DLQ failure`, {
      level: "error"
    });
    return false;
  }
}

This would consolidate the five nearly-identical DLQ routing blocks scattered throughout the error handling paths.

Also applies to: 622-640

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93d1f67 and 5d44f63.

📒 Files selected for processing (2)
  • supabase/functions/github-async-worker/index.ts (7 hunks)
  • supabase/migrations/20251113141300_create-async-worker-dlq.sql (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
supabase/functions/github-async-worker/index.ts (2)
supabase/functions/_shared/SupabaseTypes.d.ts (2)
  • Database (3-10118)
  • Json (1-1)
supabase/functions/_shared/GitHubAsyncTypes.ts (1)
  • GitHubAsyncEnvelope (99-107)
⏰ Context from checks skipped due to timeout of 180000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: deploy
🔇 Additional comments (3)
supabase/migrations/20251113141300_create-async-worker-dlq.sql (1)

1-59: LGTM! DLQ infrastructure is well-structured.

The queue, tracking table, indexes, and RLS policies are properly configured. The foreign key cascade behavior and index choices support efficient querying patterns.

supabase/functions/github-async-worker/index.ts (2)

656-673: Consider clarifying the test injection logic or making it more predictable.

The DLQ test injection condition if (retryCount >= 1 || Math.random() < 0.5) has mixed behavior:

  • First attempt (retryCount = 0): 50% chance to fail
  • Subsequent attempts: Always fails

This ensures messages eventually reach the 5-retry DLQ threshold, but the 50% randomness on the first try may be unexpected. Consider either:

  1. Making it deterministic (always fail when enabled), or
  2. Adding a comment explaining why the random initial failure is desirable for test coverage.

252-450: Past review concerns have been addressed.

The function now returns a boolean indicating success, and all call sites (lines 589-602, 626-639, 1266-1279, 1332-1346, 1386-1400) properly check the result before archiving. The INJECT_DLQ_FAILURE test injection is also protected with environment checks. Well done on implementing the suggested improvements!

@argos-ci

argos-ci bot commented Nov 23, 2025

The latest updates on your projects.

Build default: 👍 Changes approved. Details: 55 changed, 1 added, 1 removed, 2 failures. Updated (UTC): Nov 23, 2025, 2:05 AM

@jon-bell jon-bell merged commit af337ec into staging Nov 24, 2025
8 of 9 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Dec 10, 2025
