fix(dispatcher): false-positive rate-limit alerts and number-unavailable hot loop#779
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThe PR refactors Breeze Buddy's outbound call dispatch to split rate-limiting into a non-mutating peek and atomic record pattern to eliminate race conditions, removes provider-specific capacity checks from number selection, adds terminal failure handling with P1 alerts when numbers are unavailable, and updates end-to-end tests to verify the new flow. ChangesBreeze Buddy outbound rate-limit and availability flow
Sequence Diagram(s)sequenceDiagram
participant Worker
participant call_limiter
participant Redis
Worker->>call_limiter: peek_outbound_rate_limit_and_alert
call_limiter->>Redis: EVALSHA peek Lua script
Redis-->>call_limiter: current count
call_limiter-->>Worker: (allow, defer_seconds)
Worker->>call_limiter: record_outbound_call_attempt
call_limiter->>Redis: EVALSHA record Lua script
Redis-->>call_limiter: (accepted, count_pre)
call_limiter-->>Worker: (allow, defer_seconds)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Fixes two dispatcher regressions in Breeze Buddy’s event-driven telephony dispatch path: (1) false-positive outbound rate-limit alerts caused by counting dispatcher retries that never dialed, and (2) an infinite defer/retry hot loop when no outbound number can be resolved.
Changes:
- Split outbound rate limiting into a non-writing peek + alert step and a post-dial record step, so only real dials are recorded.
- Treat
_get_available_number()returningNoneas terminal: mark leadFINISHEDwith outcomeNUMBER_UNAVAILABLEand emit a throttled P1 alert. - Add/extend end-to-end tests to pin the new invariants (peek-before-channel, record-after-dial, and fail-fast on missing outbound number).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
tests/breeze_buddy/dispatch/test_end_to_end.py |
Adds/updates regression tests for the peek/record split and the NUMBER_UNAVAILABLE fail-fast path. |
tests/breeze_buddy/dispatch/conftest.py |
Extends the dispatch harness to track peek vs record calls and capture the new alert. |
app/ai/voice/agents/breeze_buddy/services/call_limiter.py |
Splits the Lua/scripted limiter into peek_outbound_rate_limit_and_alert and record_outbound_call_attempt. |
app/ai/voice/agents/breeze_buddy/managers/calls.py |
Updates outbound-number selection logic and clarifies (via doc/logging) the semantics of returning None. |
app/ai/voice/agents/breeze_buddy/dispatch/worker.py |
Moves rate-limit accounting to post-make_call success; adds fail-fast + alert for “no outbound number”. |
app/ai/voice/agents/breeze_buddy/dispatch/alerts.py |
Adds throttled raise_no_outbound_number P1 alert helper. |
| # Defer is the channel-wait backoff (random in [1, BB_CHANNEL_WAIT_BACKOFF_MAX_S]). | ||
| assert 1 <= defer_seconds <= 60 |
| @@ -253,18 +270,7 @@ async def _get_available_number( | |||
| ] | |||
05e5307 to
7af8a47
Compare
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7af8a47 to
cfb67c7
Compare
Summary
Two regressions in the new event-driven dispatcher (PR #772, deployed 2026-05-18), surfaced by 9 "Outbound Rate Limit Exceeded (BLOCKED)" alerts in 16 seconds on 2026-05-21 16:50 IST.
Fix
1. Two-stage rate limit with strict-cap enforcement
The rate limit is now a peek + atomic check-and-record pair:
Placement is constrained by three requirements that uniquely pin the spot:
If the atomic record returns rejected (a true cross-lead race; another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title regardless of which gate caught it.
Failure mode accepted: if `make_call` raises or returns no SID after the atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. This matches pre-PR semantics exactly (pre-PR also counted attempts that didn't reach make_call) and only triggers on rare provider-side failures.
2. Fail-fast on permanent `_get_available_number` None
Mark the lead FINISHED with outcome `NUMBER_UNAVAILABLE` and fire a throttled P1 `raise_no_outbound_number` alert (throttled per reseller+template). The Exotel `channels < maximum_channels` check inside `_get_available_number` was redundant with the Redis semaphore and was dropped — None now unambiguously means "structural failure."
Behavior matrix (vs. pre-PR and intermediate peek-only)
Test plan
Out of scope
🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Refactor
Tests