fix(dispatcher): false-positive rate-limit alerts and number-unavailable hot loop #779

Merged: swaroopvarma1 merged 1 commit into `release` from `fix/dispatcher-rate-limit-false-block-and-number-fail-fast` on May 21, 2026

swaroopvarma1 commented May 21, 2026

Summary

Two regressions in the new event-driven dispatcher (PR #772, deployed 2026-05-18), surfaced by 9 "Outbound Rate Limit Exceeded (BLOCKED)" alerts in 16 seconds on 2026-05-21 16:50 IST.

  • Spurious rate-limit alerts (no harassment, but real customer delay). The rate-limit ZADD ran for every dispatcher attempt that passed peek, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead's 1-3s retries self-filled its own per-phone bucket in ~10 seconds, then the 8th attempt fired a Slack alert and pushed the lead out by 3600s. Verified in OpenObserve: zero `make_call` events on the 9 alerted phones — no customer was actually called. 9 legitimate first-time order-confirmation leads delayed 1 hour each.
  • `_get_available_number()` None → 10s defer-and-retry forever. Permanent misconfigurations (template missing `outbound_number_id`, deleted/disabled number, empty fallback pool) burned DB+Redis cycles indefinitely with no alert and no termination.
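The self-fill mechanism in the first bullet can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not the dispatcher's code: the cap of 7 is inferred from "the 8th attempt" above, and `OldGate`, `simulate_channel_exhaustion`, and the fixed 1.5s retry interval are hypothetical.

```python
from collections import deque

WINDOW_S = 3600  # the lead was pushed out by 3600s, taken here as the window length
LIMIT = 7        # assumption: inferred from "the 8th attempt fired a Slack alert"


class OldGate:
    """In-memory stand-in for the pre-fix gate: count check and ZADD in one step."""

    def __init__(self) -> None:
        self.attempts: deque[float] = deque()  # attempt timestamps, oldest first

    def check_and_record(self, now: float) -> bool:
        # Trim entries that have fallen out of the sliding window.
        while self.attempts and self.attempts[0] <= now - WINDOW_S:
            self.attempts.popleft()
        if len(self.attempts) >= LIMIT:
            return False  # BLOCKED: alert plus hour-long defer
        self.attempts.append(now)  # recorded even though no call is placed
        return True


def simulate_channel_exhaustion(retry_interval_s: float = 1.5) -> tuple[int, float]:
    """Retry one lead while no channel token is free; return (attempt, time) of the block."""
    gate = OldGate()
    now = 0.0
    attempt = 0
    while True:
        attempt += 1
        if not gate.check_and_record(now):
            return attempt, now
        # The rate limit passed but the channel gate did not: short defer, then retry.
        now += retry_interval_s


blocked_attempt, blocked_at = simulate_channel_exhaustion()
print(blocked_attempt, blocked_at)  # 8 10.5
```

With a 1.5s retry the lead blocks itself on its 8th attempt after about 10 seconds, without a single dial.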

Fix

1. Two-stage rate limit with strict-cap enforcement

The rate limit is now a peek + atomic check-and-record pair:

  • `peek_outbound_rate_limit_and_alert` (new `OUTBOUND_RATE_LIMIT_PEEK_LUA`): read-only `ZCARD` after expiry trim. Runs BEFORE the channel-token gate, fires the Slack alert if the bucket is already at limit. Fast-fail so we don't hold a channel for a doomed lead.
  • `record_outbound_call_attempt` (new `OUTBOUND_RATE_LIMIT_RECORD_LUA`): single atomic Lua doing trim + `ZCARD` + (conditional `ZADD` under `count < limit`). Returns `(allow, defer_seconds)` — the authoritative cap. Runs AFTER `acquire_channel_token` + `_acquire_number` and BEFORE `provider.make_call`.
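The two operations above can be modelled in plain Python to make the difference concrete. This is a minimal sketch, not the Lua scripts themselves: `OutboundWindow` is a hypothetical name, the list stands in for one per-phone sorted set, and returning the window length as `defer_seconds` is an assumption based on the one-hour defer described in this PR.

```python
class OutboundWindow:
    """In-memory model of one per-phone sorted set (score = attempt timestamp)."""

    def __init__(self, limit: int, window_s: int) -> None:
        self.limit = limit
        self.window_s = window_s
        self.scores: list[float] = []

    def _trim(self, now: float) -> None:
        # Plays the role of ZREMRANGEBYSCORE key -inf (now - window).
        cutoff = now - self.window_s
        self.scores = [s for s in self.scores if s > cutoff]

    def peek(self, now: float) -> tuple[bool, int]:
        """Trim expired entries, then count (ZCARD). Never adds an entry."""
        self._trim(now)
        allowed = len(self.scores) < self.limit
        return allowed, (0 if allowed else self.window_s)

    def record(self, now: float) -> tuple[bool, int]:
        """Trim, count, and add (ZADD) only while count < limit."""
        self._trim(now)
        if len(self.scores) < self.limit:
            self.scores.append(now)
            return True, 0
        return False, self.window_s


window = OutboundWindow(limit=2, window_s=3600)
print(window.peek(now=0.0), window.peek(now=1.0))      # peeks never grow the bucket
print(window.record(now=2.0), window.record(now=3.0))  # two attempts recorded
print(window.record(now=4.0))                          # third attempt rejected
```

In Redis the record step is atomic because the server runs one Lua script at a time; the model is single-threaded, so it shows only the semantics, not the concurrency guarantee.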

Placement is constrained by three requirements that uniquely pin the spot:

  1. After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug (the original alert storm).
  2. Before make_call — otherwise the call is on the wire before the race is detected; strict cap is meaningless once the customer's phone rings.
  3. Atomic with the count check — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers can all pass peek under-limit, all dial, then all record (each only resolving the conflict after the dial).
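The third requirement can be demonstrated with threads. The sketch below forces the worst-case interleaving (every worker peeks before any worker records) and compares a peek-then-dial-then-record flow with a locked check-and-record. All names and the cap of 7 are hypothetical; the lock stands in for Redis executing one Lua script at a time.

```python
import threading

LIMIT = 7  # hypothetical per-phone cap


def run_workers(n: int, start_count: int, atomic: bool) -> int:
    """Return how many of n concurrent same-phone workers end up dialing."""
    bucket = [0.0] * start_count       # attempts already inside the window
    lock = threading.Lock()            # stands in for single-script execution in Redis
    all_peeked = threading.Barrier(n)  # every worker peeks before anyone records
    dialed = []

    def worker() -> None:
        under_limit = len(bucket) < LIMIT  # peek
        all_peeked.wait()
        if not under_limit:
            return
        if atomic:
            with lock:  # count check and record happen as one step
                if len(bucket) < LIMIT:
                    bucket.append(1.0)
                    dialed.append(True)
        else:
            dialed.append(True)  # dial first...
            with lock:
                bucket.append(1.0)  # ...then record, whatever the count is now

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(dialed)


print(run_workers(4, start_count=5, atomic=False))  # 4: two dials beyond the cap
print(run_workers(4, start_count=5, atomic=True))   # 2: the cap holds
```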

If the atomic record returns rejected (a true cross-lead race; another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title regardless of which gate caught it.
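The resulting order of gates, including the release on rejection, can be sketched as follows. This is an illustration of the ordering only: `FakeDeps` and `dispatch` are hypothetical, the real worker is not shown in this PR description, and the event strings are labels rather than actual function calls.

```python
from dataclasses import dataclass, field

RATE_LIMIT_WINDOW_S = 3600


@dataclass
class FakeDeps:
    """Hypothetical stand-ins for the limiter, channel semaphore, number pool and provider."""
    peek_allows: bool = True
    record_allows: bool = True
    tokens: int = 1
    events: list = field(default_factory=list)


def dispatch(deps: FakeDeps) -> str:
    # 1. Peek before taking a channel: fast-fail a lead that is already at its limit.
    deps.events.append("peek")
    if not deps.peek_allows:
        return f"deferred {RATE_LIMIT_WINDOW_S}s (peek)"

    # 2. Channel token. Exhaustion defers on the short channel-wait backoff; nothing is recorded.
    if deps.tokens == 0:
        return "deferred (channel wait)"
    deps.tokens -= 1
    deps.events.append("acquire_channel_token")
    deps.events.append("acquire_number")

    # 3. The atomic record is the authoritative cap and runs before the call goes out.
    deps.events.append("record")
    if not deps.record_allows:
        # Lost a cross-lead race: return what we hold, then defer by the window.
        deps.events.append("release_number")
        deps.tokens += 1
        deps.events.append("release_channel_token")
        return f"deferred {RATE_LIMIT_WINDOW_S}s (record)"

    deps.events.append("make_call")
    return "dialed"


raced = FakeDeps(record_allows=False)
print(dispatch(raced), raced.events)
```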

Failure mode accepted: if `make_call` raises or returns no SID after the atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. This matches pre-PR semantics exactly (pre-PR also counted attempts that didn't reach make_call) and only triggers on rare provider-side failures.

2. Fail-fast on permanent `_get_available_number` None

Mark the lead FINISHED with outcome `NUMBER_UNAVAILABLE` and fire a throttled P1 `raise_no_outbound_number` alert (throttled per reseller+template). The Exotel `channels < maximum_channels` check inside `_get_available_number` was redundant with the Redis semaphore and was dropped — None now unambiguously means "structural failure."
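A minimal sketch of the fail-fast path with a per-(reseller, template) throttle follows. The function names, the dict-shaped lead, and the 300s throttle interval are assumptions for illustration; the PR states only that the alert is throttled per reseller+template and that the lead is finished with `NUMBER_UNAVAILABLE`.

```python
import time
from typing import Optional

ALERT_THROTTLE_S = 300  # assumption: the PR does not state the throttle interval
_last_alert_at: dict[tuple[str, str], float] = {}


def alert_no_outbound_number(reseller_id: str, template: str,
                             now: Optional[float] = None) -> bool:
    """Send at most one alert per (reseller_id, template) per throttle interval."""
    now = time.monotonic() if now is None else now
    key = (reseller_id, template)
    last = _last_alert_at.get(key)
    if last is not None and now - last < ALERT_THROTTLE_S:
        return False  # throttled: an alert for this pair went out recently
    _last_alert_at[key] = now
    print(f"[P1] no outbound number: reseller={reseller_id} template={template}")
    return True


def handle_number_lookup(lead: dict, number: Optional[str]) -> dict:
    """Treat None as terminal: finish the lead instead of deferring 10s and retrying."""
    if number is None:
        alert_no_outbound_number(lead["reseller_id"], lead["template"])
        lead.update(status="FINISHED", outcome="NUMBER_UNAVAILABLE")
    return lead


lead = handle_number_lookup({"reseller_id": "r-demo", "template": "t-demo"}, None)
print(lead["status"], lead["outcome"])  # FINISHED NUMBER_UNAVAILABLE
```

Throttling on the pair rather than on the lead means that a misconfigured template with many queued leads produces one page per interval, not one per lead.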

Behavior matrix (vs. pre-PR and intermediate peek-only)

| Scenario | Pre-PR | Peek-only (intermediate) | This PR |
|---|---|---|---|
| Single lead, count < max, happy path | dial + count | dial + count | dial + count |
| Single lead, count ≥ max | BLOCKED + 1h defer | BLOCKED + 1h defer | BLOCKED + 1h defer |
| Single lead, channel-token exhausted | self-fill → false BLOCKED | peek-only retry (no count growth) | peek-only retry (no count growth) |
| N concurrent same-phone, count + N ≤ max | all dial, count grows by N | all dial, count grows by N | all dial, count grows by N |
| N concurrent same-phone, count + N > max | strict cap (atomic Lua) | over-shoots by N - (max - count) | strict cap (atomic Lua) restored |
| make_call fails after record | counted | not counted | counted (matches pre-PR) |
| Redis down at any rate-limit point | fail-open | fail-open | fail-open |
| `_get_available_number` returns None | 10s defer forever | FINISHED + alert | FINISHED + alert |
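The "Redis down" row means a limiter outage lets the attempt through rather than blocking dispatch. One way to express that is a decorator, sketched below. The names are hypothetical, and the built-in `ConnectionError` is only a stand-in for whatever exception the Redis client in use raises.

```python
import functools
import logging
from typing import Callable, Tuple

logger = logging.getLogger("rate_limit_sketch")

Decision = Tuple[bool, int]  # (allow, defer_seconds)


def fail_open(check: Callable[..., Decision]) -> Callable[..., Decision]:
    """Allow the attempt when the limiter backend is unreachable."""

    @functools.wraps(check)
    def wrapper(*args, **kwargs) -> Decision:
        try:
            return check(*args, **kwargs)
        except ConnectionError as exc:  # stand-in for the Redis client's connection error
            logger.warning("rate limiter unavailable, failing open: %s", exc)
            return True, 0

    return wrapper


@fail_open
def peek_when_redis_is_down(phone: str) -> Decision:
    raise ConnectionError("redis unreachable")


@fail_open
def peek_at_limit(phone: str) -> Decision:
    return False, 3600


print(peek_when_redis_is_down("demo"))  # (True, 0)
print(peek_at_limit("demo"))            # (False, 3600)
```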

Test plan

  • `tests/breeze_buddy/dispatch/` — 98 passed, 1 xfailed (cluster-mode, pre-existing).
  • `uv run pyrefly check` — 0 errors, 15 suppressed (no new suppressions).
  • `uv run black --check . && uv run isort --check --profile black . && uv run autoflake --check ... -r app/` — all clean.
  • `test_channel_exhaustion_does_not_record_rate_limit_attempt` — pins the 2026-05-21 root cause: zero-token semaphore → defer with channel-wait jitter → no ZADD ever.
  • `test_atomic_record_rejection_releases_resources_and_defers` — pins the cross-lead race: peek allows, atomic record rejects, channel + number released, lead deferred 1h, make_call never runs.
  • `test_get_available_number_returns_none_marks_lead_finished` — pins fail-fast behavior, throttled alert capture, untouched channel pool.
  • `test_rate_limit_blocks_before_channel_acquire` — pins peek runs but record does NOT on peek-side block.
  • `test_full_round_trip_happy_path` — pins one peek + one record per successful dial.

Out of scope

  • Twilio's status==IN_USE binary-gate semantics (caps Twilio concurrency at 1/number regardless of `maximum_channels`). Existing behavior, separate concern.
  • Tuning `BB_CHANNEL_WAIT_BACKOFF_MAX_S`. With this fix the 1-3s retries are benign (no ZADD on retry).
  • ZREM-undo on make_call failure (would eliminate the case-6 inflation; complexity-for-correctness trade we don't need today).
  • Re-pushing the 9 affected leads from 2026-05-21 16:50 — operational, not code.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Fixed race condition in concurrent outbound call attempts that could cause duplicate calls.
    • Improved handling when outbound numbers are unavailable; leads now properly finalize with clear status.
  • Refactor

    • Optimized call rate-limiting verification for improved reliability and concurrency safety.
  • Tests

    • Added comprehensive test coverage for edge cases in call queuing and rate limiting.

Copilot AI review requested due to automatic review settings May 21, 2026 16:30

coderabbitai Bot commented May 21, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 74169449-d885-4ff2-a44c-1d2f07a8f257

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The PR refactors Breeze Buddy's outbound call dispatch to split rate-limiting into a non-mutating peek and atomic record pattern to eliminate race conditions, removes provider-specific capacity checks from number selection, adds terminal failure handling with P1 alerts when numbers are unavailable, and updates end-to-end tests to verify the new flow.

Changes

Breeze Buddy outbound rate-limit and availability flow

| Layer | File(s) | Summary |
|---|---|---|
| Rate-limit architecture: peek and record Lua scripts | `app/ai/voice/agents/breeze_buddy/services/call_limiter.py` | Single Lua gate replaced with non-mutating peek (returns current count) and atomic record (trims, checks limit, conditionally adds entry) scripts. SHA-256-based Redis key generation added. Shared Slack alert helper added. Exported functions `check_outbound_rate_limit_and_alert` and `check_outbound_limit_for_number` removed; `peek_outbound_rate_limit_and_alert` and `record_outbound_call_attempt` added. |
| Number availability gating and terminal failure | `app/ai/voice/agents/breeze_buddy/managers/calls.py` | `_get_available_number` docstring clarified to treat None as terminal resolution failure, not capacity exhaustion. Template path now requires AVAILABLE status; provider-specific channel/max-channel capacity checks removed from fallback pool selection. |
| P1 alert for missing outbound numbers | `app/ai/voice/agents/breeze_buddy/dispatch/alerts.py` | New `raise_no_outbound_number` alert function added, throttled by (reseller_id, template), logs merchant_id as n/a when missing, sends structured Slack message via existing `_send` helper. |
| Worker dispatch flow: peek/record and terminal number handling | `app/ai/voice/agents/breeze_buddy/dispatch/worker.py` | Rate-limit check split into pre-channel peek and post-token atomic record phases. When no outbound number is available, the worker calls the `raise_no_outbound_number` alert (best-effort) and marks the lead FINISHED with NUMBER_UNAVAILABLE outcome instead of deferring. On atomic record rejection, releases channel token and number, defers by returned interval. |
| Test harness: peek/record and alert tracking | `tests/breeze_buddy/dispatch/conftest.py` | DispatchHarness adds rate-limit tracking lists (`rate_limit_peeks`, `rate_limit_records`), record acceptance/defer toggles, and `no_outbound_number_alerts` capture. Monkeypatch wiring updated to inject `peek_outbound_rate_limit_and_alert`, `record_outbound_call_attempt`, and `raise_no_outbound_number` into the worker module. |
| End-to-end tests: rate-limit and number unavailability scenarios | `tests/breeze_buddy/dispatch/test_end_to_end.py` | Happy-path test asserts exactly one peek and one record. Two new regression tests added: atomic record rejection releases resources and defers; channel exhaustion blocks record (peek only). Rate-limit test strengthened with peek/record split assertions. Number-unavailable test renamed and updated to assert terminal FINISHED outcome with P1 alert instead of defer; rate-limit records remain empty. |
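The summary above mentions SHA-256-based Redis key generation without showing the key format. A minimal sketch of what such a helper could look like, assuming the digest is taken over the phone number; the prefix, the function name, and the choice of input are hypothetical and not confirmed by this PR.

```python
import hashlib

KEY_PREFIX = "bb:outbound_rate_limit"  # hypothetical prefix


def rate_limit_key(phone_number: str) -> str:
    """Build a per-phone Redis key from a SHA-256 digest of the number."""
    digest = hashlib.sha256(phone_number.encode("utf-8")).hexdigest()
    return f"{KEY_PREFIX}:{digest}"


key = rate_limit_key("+10000000000")
print(len(key.split(":")[-1]))  # 64 hex characters
```

A digest gives fixed-length keys and keeps the raw input out of key listings. Whether that is the motivation in `call_limiter.py` is not stated here.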

Sequence Diagram(s)

sequenceDiagram
  participant Worker
  participant call_limiter
  participant Redis
  Worker->>call_limiter: peek_outbound_rate_limit_and_alert
  call_limiter->>Redis: EVALSHA peek Lua script
  Redis-->>call_limiter: current count
  call_limiter-->>Worker: (allow, defer_seconds)
  Worker->>call_limiter: record_outbound_call_attempt
  call_limiter->>Redis: EVALSHA record Lua script
  Redis-->>call_limiter: (accepted, count_pre)
  call_limiter-->>Worker: (allow, defer_seconds)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • juspay/clairvoyance#770: Extends the Breeze Buddy dispatch stack by integrating the raise_no_outbound_number alert and refactoring the rate-limiter to a peek/atomic record flow to eliminate race conditions.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Linked Issues check | ❓ Inconclusive | The linked issue #2 is a system prompt data handling instruction unrelated to coding requirements, so it cannot be meaningfully assessed against the code changes. | Clarify the relevant coding requirements and linked issue(s) that this PR should address, as the current linked issue does not pertain to the dispatcher fixes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fixes: addressing false-positive rate-limit alerts and the number-unavailable retry hot loop by splitting the rate-limit check into peek/record and failing fast on missing numbers. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the dispatcher regressions: rate-limit alerting logic, number availability handling, race condition mitigation, and supporting test updates. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |

Copilot AI left a comment

Pull request overview

Fixes two dispatcher regressions in Breeze Buddy’s event-driven telephony dispatch path: (1) false-positive outbound rate-limit alerts caused by counting dispatcher retries that never dialed, and (2) an infinite defer/retry hot loop when no outbound number can be resolved.

Changes:

  • Split outbound rate limiting into a non-writing peek + alert step and a post-dial record step, so only real dials are recorded.
  • Treat _get_available_number() returning None as terminal: mark lead FINISHED with outcome NUMBER_UNAVAILABLE and emit a throttled P1 alert.
  • Add/extend end-to-end tests to pin the new invariants (peek-before-channel, record-after-dial, and fail-fast on missing outbound number).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| `tests/breeze_buddy/dispatch/test_end_to_end.py` | Adds/updates regression tests for the peek/record split and the NUMBER_UNAVAILABLE fail-fast path. |
| `tests/breeze_buddy/dispatch/conftest.py` | Extends the dispatch harness to track peek vs record calls and capture the new alert. |
| `app/ai/voice/agents/breeze_buddy/services/call_limiter.py` | Splits the Lua/scripted limiter into `peek_outbound_rate_limit_and_alert` and `record_outbound_call_attempt`. |
| `app/ai/voice/agents/breeze_buddy/managers/calls.py` | Updates outbound-number selection logic and clarifies (via doc/logging) the semantics of returning None. |
| `app/ai/voice/agents/breeze_buddy/dispatch/worker.py` | Moves rate-limit accounting to post-make_call success; adds fail-fast + alert for “no outbound number”. |
| `app/ai/voice/agents/breeze_buddy/dispatch/alerts.py` | Adds throttled `raise_no_outbound_number` P1 alert helper. |

Comment on lines +144 to +145
# Defer is the channel-wait backoff (random in [1, BB_CHANNEL_WAIT_BACKOFF_MAX_S]).
assert 1 <= defer_seconds <= 60
Comment on lines 265 to 270
@@ -253,18 +270,7 @@ async def _get_available_number(
]
swaroopvarma1 force-pushed the fix/dispatcher-rate-limit-false-block-and-number-fail-fast branch 3 times, most recently from 05e5307 to 7af8a47 Compare May 21, 2026 17:53
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swaroopvarma1 force-pushed the fix/dispatcher-rate-limit-false-block-and-number-fail-fast branch from 7af8a47 to cfb67c7 Compare May 21, 2026 18:14
swaroopvarma1 merged commit 72db4e6 into release May 21, 2026
7 checks passed