fix(dispatcher): false-positive rate-limit alerts and number-unavailable hot loop by swaroopvarma1 · Pull Request #779 · juspay/clairvoyance

swaroopvarma1 · 2026-05-21T16:30:52Z

Summary

Two regressions in the new event-driven dispatcher (PR #772, deployed 2026-05-18), surfaced by 9 "Outbound Rate Limit Exceeded (BLOCKED)" alerts in 16 seconds on 2026-05-21 16:50 IST.

1. Spurious rate-limit alerts (no harassment, but real customer delay). The rate-limit ZADD ran for every dispatcher attempt that passed peek, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead's 1-3s retries self-filled its own per-phone bucket in ~10 seconds; the 8th attempt then fired a Slack alert and pushed the lead out by 3600s. Verified in OpenObserve: zero `make_call` events on the 9 alerted phones, so no customer was actually called. Nine legitimate first-time order-confirmation leads were each delayed by 1 hour.
2. `_get_available_number()` None → 10s defer-and-retry forever. Permanent misconfigurations (template missing `outbound_number_id`, deleted/disabled number, empty fallback pool) burned DB+Redis cycles indefinitely with no alert and no termination.
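
The self-fill mechanism behind the first regression can be reproduced with a small in-memory simulation. This is only a sketch, not the dispatcher's code: the limit of 7 and the 3600s window are assumptions inferred from "the 8th attempt" and the 3600s defer, and `attempts_until_blocked` is a hypothetical helper.

```python
import random
from typing import List, Optional

LIMIT = 7          # assumed: inferred from "the 8th attempt fired a Slack alert"
WINDOW_S = 3600.0  # assumed: inferred from the 3600s defer


def attempts_until_blocked(record_before_channel_gate: bool,
                           max_attempts: int = 20,
                           seed: int = 0) -> Optional[int]:
    """Simulate one lead retrying while channel tokens stay exhausted.

    Returns the 1-based attempt that hits the rate limit, or None if the
    lead is never blocked within max_attempts.
    """
    rng = random.Random(seed)
    bucket: List[float] = []  # timestamps standing in for the Redis sorted set
    now = 0.0
    for attempt in range(1, max_attempts + 1):
        # Expiry trim, then count check.
        bucket = [t for t in bucket if t > now - WINDOW_S]
        if len(bucket) >= LIMIT:
            return attempt
        if record_before_channel_gate:
            # Old placement: the attempt is recorded before the channel gate.
            bucket.append(now)
        # Channel tokens exhausted: no dial, retry after 1-3s.
        now += rng.uniform(1.0, 3.0)
    return None
```

With recording ahead of the channel gate, the lead blocks itself on attempt 8 without a single dial. With recording moved behind the gate, the same retries never grow the bucket.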

Fix

1. Two-stage rate limit with strict-cap enforcement

The rate limit is now a peek + atomic check-and-record pair:

- `peek_outbound_rate_limit_and_alert` (new `OUTBOUND_RATE_LIMIT_PEEK_LUA`): read-only `ZCARD` after expiry trim. Runs BEFORE the channel-token gate and fires the Slack alert if the bucket is already at limit. Fast-fail so we don't hold a channel for a doomed lead.
- `record_outbound_call_attempt` (new `OUTBOUND_RATE_LIMIT_RECORD_LUA`): single atomic Lua doing trim + `ZCARD` + conditional `ZADD` (only when `count < limit`). Returns `(allow, defer_seconds)` and is the authoritative cap. Runs AFTER `acquire_channel_token` + `_acquire_number` and BEFORE `provider.make_call`.
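
The semantics of the two stages can be modelled in memory. The sketch below is not the Lua used in `call_limiter.py`; it assumes a per-phone list of timestamps in place of the Redis sorted set, and the class name, constructor parameters, and the choice to return the full window as `defer_seconds` are illustrative assumptions.

```python
from typing import Dict, List, Tuple


class SlidingWindowLimiter:
    """In-memory model of a peek + atomic check-and-record rate limiter."""

    def __init__(self, limit: int, window_s: float) -> None:
        self.limit = limit
        self.window_s = window_s
        self._buckets: Dict[str, List[float]] = {}

    def _trim(self, phone: str, now: float) -> List[float]:
        # Drop entries that have aged out of the sliding window.
        cutoff = now - self.window_s
        bucket = [t for t in self._buckets.get(phone, []) if t > cutoff]
        self._buckets[phone] = bucket
        return bucket

    def peek(self, phone: str, now: float) -> Tuple[bool, float]:
        """Count only: never adds an entry, so retries cannot self-fill."""
        if len(self._trim(phone, now)) >= self.limit:
            return False, self.window_s
        return True, 0.0

    def record(self, phone: str, now: float) -> Tuple[bool, float]:
        """Trim, check, and conditionally add in one step (the strict cap)."""
        bucket = self._trim(phone, now)
        if len(bucket) >= self.limit:
            return False, self.window_s
        bucket.append(now)
        return True, 0.0
```

In Redis the `record` step has to be a single script for the check and the add to be atomic; in this single-threaded model that property holds trivially.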

Placement is constrained by three requirements that uniquely pin the spot:

- After channel-token acquire: otherwise channel-wait retries re-introduce the self-fill bug (the original alert storm).
- Before make_call: otherwise the call is on the wire before the race is detected; strict cap is meaningless once the customer's phone rings.
- Atomic with the count check: anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers can all pass peek under-limit, all dial, then all record (each only resolving the conflict after the dial).

If the atomic record returns rejected (a true cross-lead race; another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title regardless of which gate caught it.
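
The ordering and the release-on-rejection path can be sketched as a single function. This is a simplified, synchronous illustration under stated assumptions: `Gates`, `dispatch_once`, the outcome strings, and `CHANNEL_WAIT_S` are hypothetical names, the real worker is presumably async, and number resolution via `_get_available_number` is left out.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Gates:
    """Hypothetical stand-ins for the dispatcher's dependencies."""
    peek: Callable[[str], Tuple[bool, int]]
    acquire_channel: Callable[[], bool]
    release_channel: Callable[[], None]
    acquire_number: Callable[[], None]
    release_number: Callable[[], None]
    record: Callable[[str], Tuple[bool, int]]
    make_call: Callable[[str], None]


CHANNEL_WAIT_S = 2  # placeholder for the jittered channel-wait backoff


def dispatch_once(phone: str, g: Gates) -> Tuple[str, int]:
    """Return (outcome, defer_seconds) for one dispatch attempt."""
    # Stage 1: fast-fail before any capacity is held.
    allowed, defer_s = g.peek(phone)
    if not allowed:
        return "deferred_rate_limit", defer_s

    # Channel gate: a failed acquire records nothing in the bucket.
    if not g.acquire_channel():
        return "deferred_channel_wait", CHANNEL_WAIT_S
    g.acquire_number()

    # Stage 2: authoritative cap, after acquire and before the dial.
    allowed, defer_s = g.record(phone)
    if not allowed:
        # Lost a cross-lead race: give back everything we hold.
        g.release_number()
        g.release_channel()
        return "deferred_rate_limit", defer_s

    g.make_call(phone)
    return "dialed", 0
```

The release order in the rejection branch is arbitrary here; the description only states that both the channel token and the number are released.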

Failure mode accepted: if `make_call` raises or returns no SID after the atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. This matches pre-PR semantics exactly (pre-PR also counted attempts that didn't reach make_call) and only triggers on rare provider-side failures.

2. Fail-fast on permanent `_get_available_number` None

Mark the lead FINISHED with outcome `NUMBER_UNAVAILABLE` and fire a throttled P1 `raise_no_outbound_number` alert (throttled per reseller+template). The Exotel `channels < maximum_channels` check inside `_get_available_number` was redundant with the Redis semaphore and was dropped — None now unambiguously means "structural failure."
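
Throttling per reseller+template can be sketched as a cooldown keyed on the pair. This is an assumption-laden illustration, not the code in `dispatch/alerts.py`: the class name, the in-process dictionary, the injected `send` callable, and the 300s default cooldown are all hypothetical, and the real helper may keep its throttle state elsewhere (for example in Redis).

```python
import time
from typing import Callable, Dict, Tuple


class NoNumberAlertThrottle:
    """Send at most one alert per (reseller_id, template) per cooldown."""

    def __init__(self,
                 send: Callable[[str], None],
                 cooldown_s: float = 300.0,  # assumed value
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._send = send
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def raise_alert(self, reseller_id: str, template: str, message: str) -> bool:
        """Return True if the alert was sent, False if it was throttled."""
        key = (reseller_id, template)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown_s:
            return False
        self._last_sent[key] = now
        self._send(message)
        return True
```

Keying on the pair means one misconfigured template produces a single alert per cooldown, no matter how many of its leads hit the terminal path.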

Behavior matrix (vs. pre-PR and intermediate peek-only)

| Scenario | Pre-PR | Peek-only (intermediate) | This PR |
| --- | --- | --- | --- |
| Single lead, count < max, happy path | dial + count | dial + count | dial + count |
| Single lead, count ≥ max | BLOCKED + 1h defer | BLOCKED + 1h defer | BLOCKED + 1h defer |
| Single lead, channel-token exhausted | self-fill → false BLOCKED | peek-only retry (no count growth) | peek-only retry (no count growth) |
| N concurrent same-phone, count + N ≤ max | all dial, count grows by N | all dial, count grows by N | all dial, count grows by N |
| N concurrent same-phone, count + N > max | strict cap (atomic Lua) | over-shoots by N - (max - count) | strict cap (atomic Lua) restored |
| make_call fails after record | counted | not counted | counted (matches pre-PR) |
| Redis down at any rate-limit point | fail-open | fail-open | fail-open |
| `_get_available_number` returns None | 10s defer forever | FINISHED + alert | FINISHED + alert |
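
The over-shoot figure in the "count + N > max" row can be checked with a few lines of arithmetic. The sketch below assumes the worst case for the peek-only variant (all N workers peek before any of them records); the function names are hypothetical.

```python
def dials_allowed(count: int, n: int, limit: int, atomic: bool) -> int:
    """How many of n concurrent same-phone workers get to dial."""
    if atomic:
        # Check-and-record: only the remaining headroom is admitted.
        return max(0, min(n, limit - count))
    # Peek-only worst case: every worker sees the same under-limit count.
    return n if count < limit else 0


def overshoot(count: int, n: int, limit: int, atomic: bool) -> int:
    """Entries recorded beyond the limit after all workers finish."""
    return max(0, count + dials_allowed(count, n, limit, atomic) - limit)
```

For example, with `count=5`, `n=4`, `limit=7`, the peek-only variant lets all four workers dial and over-shoots by 4 - (7 - 5) = 2, while the atomic variant admits two and rejects two.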

Test plan

- `tests/breeze_buddy/dispatch/`: 98 passed, 1 xfailed (cluster-mode, pre-existing).
- `uv run pyrefly check`: 0 errors, 15 suppressed (no new suppressions).
- `uv run black --check . && uv run isort --check --profile black . && uv run autoflake --check ... -r app/`: all clean.
- `test_channel_exhaustion_does_not_record_rate_limit_attempt`: pins the 2026-05-21 root cause: zero-token semaphore → defer with channel-wait jitter → no ZADD ever.
- `test_atomic_record_rejection_releases_resources_and_defers`: pins the cross-lead race: peek allows, atomic record rejects, channel + number released, lead deferred 1h, make_call never runs.
- `test_get_available_number_returns_none_marks_lead_finished`: pins fail-fast behavior, throttled alert capture, untouched channel pool.
- `test_rate_limit_blocks_before_channel_acquire`: pins that peek runs but record does NOT on a peek-side block.
- `test_full_round_trip_happy_path`: pins one peek + one record per successful dial.

Out of scope

- Twilio's status==IN_USE binary-gate semantics (caps Twilio concurrency at 1/number regardless of `maximum_channels`). Existing behavior, separate concern.
- Tuning `BB_CHANNEL_WAIT_BACKOFF_MAX_S`. With this fix the 1-3s retries are benign (no ZADD on retry).
- ZREM-undo on make_call failure (would eliminate the case-6 inflation; complexity-for-correctness trade we don't need today).
- Re-pushing the 9 affected leads from 2026-05-21 16:50: operational, not code.

🤖 Generated with Claude Code

Summary by CodeRabbit

Bug Fixes
- Fixed race condition in concurrent outbound call attempts that could cause duplicate calls.
- Improved handling when outbound numbers are unavailable; leads now properly finalize with clear status.
Refactor
- Optimized call rate-limiting verification for improved reliability and concurrency safety.
Tests
- Added comprehensive test coverage for edge cases in call queuing and rate limiting.

coderabbitai · 2026-05-21T16:31:01Z

Walkthrough

The PR refactors Breeze Buddy's outbound call dispatch to split rate-limiting into a non-mutating peek and atomic record pattern to eliminate race conditions, removes provider-specific capacity checks from number selection, adds terminal failure handling with P1 alerts when numbers are unavailable, and updates end-to-end tests to verify the new flow.

Changes

Breeze Buddy outbound rate-limit and availability flow

| Layer / File(s) | Summary |
| --- | --- |
| Rate-limit architecture: peek and record Lua scripts `app/ai/voice/agents/breeze_buddy/services/call_limiter.py` | Single Lua gate replaced with non-mutating peek (returns current count) and atomic record (trims, checks limit, conditionally adds entry) scripts. SHA-256-based Redis key generation added. Shared Slack alert helper added. Exported functions `check_outbound_rate_limit_and_alert` and `check_outbound_limit_for_number` removed; `peek_outbound_rate_limit_and_alert` and `record_outbound_call_attempt` added. |
| Number availability gating and terminal failure `app/ai/voice/agents/breeze_buddy/managers/calls.py` | `_get_available_number` docstring clarified to treat None as terminal resolution failure, not capacity exhaustion. Template path now requires `AVAILABLE` status; provider-specific channel/max-channel capacity checks removed from fallback pool selection. |
| P1 alert for missing outbound numbers `app/ai/voice/agents/breeze_buddy/dispatch/alerts.py` | New `raise_no_outbound_number` alert function added, throttled by `(reseller_id, template)`, logs `merchant_id` as `n/a` when missing, sends structured Slack message via existing `_send` helper. |
| Worker dispatch flow: peek/record and terminal number handling `app/ai/voice/agents/breeze_buddy/dispatch/worker.py` | Rate-limit check split into pre-channel peek and post-token atomic record phases. When no outbound number available, worker calls `raise_no_outbound_number` alert (best-effort) and marks lead `FINISHED` with `NUMBER_UNAVAILABLE` outcome instead of deferring. On atomic record rejection, releases channel token and number, defers by returned interval. |
| Test harness: peek/record and alert tracking `tests/breeze_buddy/dispatch/conftest.py` | `DispatchHarness` adds rate-limit tracking lists (`rate_limit_peeks`, `rate_limit_records`), record acceptance/defer toggles, and `no_outbound_number_alerts` capture. Monkeypatch wiring updated to inject `peek_outbound_rate_limit_and_alert`, `record_outbound_call_attempt`, and `raise_no_outbound_number` into worker module. |
| End-to-end tests: rate-limit and number unavailability scenarios `tests/breeze_buddy/dispatch/test_end_to_end.py` | Happy-path test asserts exactly one peek and one record. Two new regression tests added: atomic record rejection releases resources and defers; channel exhaustion blocks record (peek only). Rate-limit test strengthened with peek/record split assertions. Number-unavailable test renamed and updated to assert terminal `FINISHED` outcome with P1 alert instead of defer; rate-limit records remain empty. |

Sequence Diagram(s)

sequenceDiagram
  participant Worker
  participant call_limiter
  participant Redis
  Worker->>call_limiter: peek_outbound_rate_limit_and_alert
  call_limiter->>Redis: EVALSHA peek Lua script
  Redis-->>call_limiter: current count
  call_limiter-->>Worker: (allow, defer_seconds)
  Worker->>call_limiter: record_outbound_call_attempt
  call_limiter->>Redis: EVALSHA record Lua script
  Redis-->>call_limiter: (accepted, count_pre)
  call_limiter-->>Worker: (allow, defer_seconds)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

juspay/clairvoyance#770: Extends the Breeze Buddy dispatch stack by integrating the raise_no_outbound_number alert and refactoring the rate-limiter to a peek/atomic record flow to eliminate race conditions.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ❓ Inconclusive | The linked issue `#2` is a system prompt data handling instruction unrelated to coding requirements, so it cannot be meaningfully assessed against the code changes. | Clarify the relevant coding requirements and linked issue(s) that this PR should address, as the current linked issue does not pertain to the dispatcher fixes. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fixes: addressing false-positive rate-limit alerts and the number-unavailable retry hot loop by splitting the rate-limit check into peek/record and failing fast on missing numbers. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the dispatcher regressions: rate-limit alerting logic, number availability handling, race condition mitigation, and supporting test updates. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%. |

Copilot

Pull request overview

Fixes two dispatcher regressions in Breeze Buddy’s event-driven telephony dispatch path: (1) false-positive outbound rate-limit alerts caused by counting dispatcher retries that never dialed, and (2) an infinite defer/retry hot loop when no outbound number can be resolved.

Changes:

Split outbound rate limiting into a non-writing peek + alert step and a post-dial record step, so only real dials are recorded.
Treat _get_available_number() returning None as terminal: mark lead FINISHED with outcome NUMBER_UNAVAILABLE and emit a throttled P1 alert.
Add/extend end-to-end tests to pin the new invariants (peek-before-channel, record-after-dial, and fail-fast on missing outbound number).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `tests/breeze_buddy/dispatch/test_end_to_end.py` | Adds/updates regression tests for the peek/record split and the `NUMBER_UNAVAILABLE` fail-fast path. |
| `tests/breeze_buddy/dispatch/conftest.py` | Extends the dispatch harness to track peek vs record calls and capture the new alert. |
| `app/ai/voice/agents/breeze_buddy/services/call_limiter.py` | Splits the Lua/scripted limiter into `peek_outbound_rate_limit_and_alert` and `record_outbound_call_attempt`. |
| `app/ai/voice/agents/breeze_buddy/managers/calls.py` | Updates outbound-number selection logic and clarifies (via doc/logging) the semantics of returning `None`. |
| `app/ai/voice/agents/breeze_buddy/dispatch/worker.py` | Moves rate-limit accounting to post-`make_call` success; adds fail-fast + alert for “no outbound number”. |
| `app/ai/voice/agents/breeze_buddy/dispatch/alerts.py` | Adds throttled `raise_no_outbound_number` P1 alert helper. |

+    # Defer is the channel-wait backoff (random in [1, BB_CHANNEL_WAIT_BACKOFF_MAX_S]).
+    assert 1 <= defer_seconds <= 60


…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s, without ever dialing the customer.

On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block).

Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap.

Placement constraints:
- After channel-token acquire: otherwise channel-wait retries re-introduce the self-fill bug.
- Before make_call: otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings.
- Single Lua check-and-set: anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split).

On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title.

Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

The four reasons gate-1 returned None (missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full) all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely.

The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure": mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-lead race guard: atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 21, 2026 16:30

Copilot started reviewing on behalf of swaroopvarma1 May 21, 2026 16:31 View session

Copilot AI reviewed May 21, 2026

swaroopvarma1 force-pushed the fix/dispatcher-rate-limit-false-block-and-number-fail-fast branch 3 times, most recently from 05e5307 to 7af8a47 Compare May 21, 2026 17:53

swaroopvarma1 force-pushed the fix/dispatcher-rate-limit-false-block-and-number-fail-fast branch from 7af8a47 to cfb67c7 Compare May 21, 2026 18:14

swaroopvarma1 merged commit 72db4e6 into release May 21, 2026
7 checks passed

swaroopvarma1 merged 1 commit into release from fix/dispatcher-rate-limit-false-block-and-number-fail-fast
