feat: Add support for template orchestrator#772
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR adds orchestrator-mode support to the Breeze Buddy voice agent, enabling mid-call routing to pre-configured agent templates. The implementation preloads TTS services for all agents, provides a new ChangesOrchestrator-Mode Call Routing
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
ad68fd5 to
33ed5fb
Compare
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 923-927: The current except block in the orchestrator bootstrap
(the try that builds the TTS switcher) only logs and falls back to plain TTS,
which can leave routing enabled but without required state (tts_map /
agent_templates); change that handler in __init__.py so it fails fast: after
logging the error (logger.error(...)), re-raise the exception (or raise a new
RuntimeError with context) instead of continuing, ensuring the orchestrator
bootstrap in the TTS switcher build path aborts rather than degrading to plain
TTS.
In `@app/ai/voice/agents/breeze_buddy/agent/pipeline.py`:
- Around line 249-256: The new block calling FlowConfigLoader().load_template
(the variables rendered_template and agent_vars) is not Black-formatted; run the
Black formatter with line-length=88 over the file (or at least this
function/pipeline block) to fix spacing/line breaks and ensure conformity with
the project's Black rules, then re-run CI; focus on formatting around the
FlowConfigLoader instantiation and the await flow_loader.load_template call to
resolve the CI failure.
- Around line 233-240: The loop that preloads TTS silently continues when
get_template_by_id(agent_route.template_id) returns None
(orch_config.agent_templates / agent_route.template_id in the for-loop), leaving
a advertised route unusable; instead, fail fast during preload by replacing the
continue branch: log the error with full context and raise an explicit exception
(e.g., RuntimeError/ValueError) that includes agent_route.template_id and
agent_route.name so startup/preload fails and the misconfiguration is surfaced
immediately.
In `@app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py`:
- Around line 34-51: The route_to_agent function has a wrong return annotation
and docstring: it currently declares Dict[str, Any] but on success returns a
tuple (dict, dict) while error paths return a single dict; update the function
signature and docstring to reflect the union return type (e.g., Tuple[Dict[str,
Any], Dict[str, Any]] | Dict[str, Any] or Optional/Union equivalent used in the
codebase) so static checkers are satisfied, and adjust the docstring description
of the return value to explain both the success tuple (routing result and
edge-function payload) and the error dict; modify the annotation on
route_to_agent and the Returns section in its docstring accordingly.
In `@app/ai/voice/agents/breeze_buddy/template/types.py`:
- Around line 941-953: The schema currently documents but does not enforce that
when is_orchestrator is True an orchestrator_config must be provided and that
that config contains at least one agent template; add validation to the Pydantic
model holding is_orchestrator and orchestrator_config (e.g., a `@root_validator`
or `@validator` on orchestrator_config) to: 1) raise a ValidationError if
is_orchestrator is True and orchestrator_config is None, 2) validate
orchestrator_config.agent_templates is not empty, and 3) optionally ensure that
if is_orchestrator is False then orchestrator_config is None or empty; reference
the is_orchestrator and orchestrator_config fields and the OrchestratorConfig
type when adding the validator so invalid templates fail at validation time
rather than runtime.
In `@app/database/accessor/breeze_buddy/lead_call_tracker.py`:
- Around line 728-729: The two-line call to update_lead_template_query and
run_parameterized_query violates Black formatting (line length/spacing);
reformat the file using Black with line-length=88 to fix it — run `black .
--line-length 88` (or your repo’s Black command) on the file containing
update_lead_template_query and run_parameterized_query so the call is
wrapped/indented to satisfy Black and CI.
In `@app/database/accessor/breeze_buddy/template.py`:
- Around line 112-118: The uniqueness check currently calls
validate_single_orchestrator_per_number whenever
configurations.get("is_orchestrator") is true, blocking saving of inactive
drafts; change the guard to only run this validation when the orchestrator is
active by also checking configurations.get("is_active") (e.g., if configurations
and configurations.get("is_orchestrator") and configurations.get("is_active")
and outbound_number_id: await
validate_single_orchestrator_per_number(outbound_number_id)). Apply the same
change to the other similar block that calls
validate_single_orchestrator_per_number (the one around the second occurrence
handling updates) so only active orchestrators trigger the uniqueness
validation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 76dc8643-f396-4b8d-bd7d-3b0156c11571
📒 Files selected for processing (16)
app/ai/voice/agents/breeze_buddy/agent/__init__.pyapp/ai/voice/agents/breeze_buddy/agent/flow.pyapp/ai/voice/agents/breeze_buddy/agent/pipeline.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/__init__.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.pyapp/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.pyapp/ai/voice/agents/breeze_buddy/managers/calls.pyapp/ai/voice/agents/breeze_buddy/template/builder.pyapp/ai/voice/agents/breeze_buddy/template/context.pyapp/ai/voice/agents/breeze_buddy/template/types.pyapp/api/routers/breeze_buddy/telephony/answer/handlers.pyapp/database/accessor/breeze_buddy/lead_call_tracker.pyapp/database/accessor/breeze_buddy/template.pyapp/database/queries/breeze_buddy/lead_call_tracker.pyapp/database/queries/breeze_buddy/template.py
| flow_loader = FlowConfigLoader() | ||
| rendered_template, agent_vars = await flow_loader.load_template( | ||
| reseller_id=raw_template.reseller_id, | ||
| template=raw_template.name, | ||
| merchant_id=raw_template.merchant_id, | ||
| call_payload=call_payload or {}, | ||
| template_id=raw_template.id, | ||
| ) |
There was a problem hiding this comment.
Black formatting check is failing in this new block.
The CI pipeline already reports this file as not Black-compliant; please run formatter on this section before merge.
As per coding guidelines, "Use Black for code formatting with line-length=88".
🧰 Tools
🪛 GitHub Actions: PR Build Check / 0_build.txt
[error] 252-252: Black formatting check failed (uv run black --check --diff .). This file would be reformatted.
🪛 GitHub Actions: PR Build Check / build
[error] 252-252: Black formatting check failed (uv run black --check --diff .). File would be reformatted: /home/runner/work/clairvoyance/clairvoyance/app/ai/voice/agents/breeze_buddy/agent/pipeline.py
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/ai/voice/agents/breeze_buddy/agent/pipeline.py` around lines 249 - 256,
The new block calling FlowConfigLoader().load_template (the variables
rendered_template and agent_vars) is not Black-formatted; run the Black
formatter with line-length=88 over the file (or at least this function/pipeline
block) to fix spacing/line breaks and ensure conformity with the project's Black
rules, then re-run CI; focus on formatting around the FlowConfigLoader
instantiation and the await flow_loader.load_template call to resolve the CI
failure.
There was a problem hiding this comment.
Pull request overview
This PR adds “template orchestrator” support to Breeze Buddy voice calls: an orchestrator template can act as the inbound entrypoint, perform intent detection, and then route mid-call to a chosen agent template while preserving call metadata/transcription and optionally hot-swapping TTS voices.
Changes:
- Added an
is_orchestratortemplate configuration with anorchestrator_configrouting table and aroute_to_agentbuiltin handler for mid-call template switching. - Implemented preloading of agent templates + agent TTS services at pipeline startup (ServiceSwitcher +
tts_map) to enable low-latency voice switching on route. - Improved call completion persistence (don’t gate FINISHED write on config lookup) and enhanced transcription/node traversal metadata to reflect cross-template routing.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| app/database/queries/breeze_buddy/template.py | Adds SQL query to fetch the active orchestrator template for an outbound number. |
| app/database/queries/breeze_buddy/lead_call_tracker.py | Adds SQL update to change a lead’s template/template_id mid-call. |
| app/database/accessor/breeze_buddy/template.py | Enforces “single orchestrator per outbound number” and provides orchestrator lookup helpers. |
| app/database/accessor/breeze_buddy/lead_call_tracker.py | Adds accessor to persist the routed-to agent template on the lead record. |
| app/api/routers/breeze_buddy/telephony/answer/handlers.py | Bypasses IVR selection when an orchestrator template exists for the called number. |
| app/ai/voice/agents/breeze_buddy/template/types.py | Introduces orchestrator configuration schema (agent routing list). |
| app/ai/voice/agents/breeze_buddy/template/context.py | Adds template switching + tags node traversal entries with active template_id. |
| app/ai/voice/agents/breeze_buddy/template/builder.py | Injects an orchestrator routing table into direct-mode system prompts. |
| app/ai/voice/agents/breeze_buddy/managers/calls.py | Ensures call completion details persist even if config lookup fails. |
| app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py | New handler implementing mid-call routing, metadata snapshotting, TTS swap, and edge-function transition. |
| app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py | Appends transcription to preserve pre-routing history. |
| app/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.py | Registers route_to_agent and adjusts traversal recording for edge-function transitions. |
| app/ai/voice/agents/breeze_buddy/handlers/internal/init.py | Exports route_to_agent handler. |
| app/ai/voice/agents/breeze_buddy/agent/pipeline.py | Adds orchestrator TTS/template preloading helper (ServiceSwitcher + rendered templates map). |
| app/ai/voice/agents/breeze_buddy/agent/flow.py | Stores tts_map + pre-rendered agent templates on the bot for handler access. |
| app/ai/voice/agents/breeze_buddy/agent/init.py | Wires orchestrator TTS switcher into agent startup and passes maps into flow manager setup. |
Comments suppressed due to low confidence (1)
app/database/accessor/breeze_buddy/template.py:396
- Same as create_template: a ValueError from validate_single_orchestrator_per_number will be swallowed by the broad exception handler and converted into
None, which the REST handler maps to a 500. Consider re-raising ValueError so callers can treat it as a 400 validation failure.
# Validate orchestrator uniqueness before writing
if (
configurations
and configurations.get("is_orchestrator")
and outbound_number_id
| # We store the orchestrator phase as a flat list under | ||
| # metaData["orchestrator_transcription"] | ||
| # and also initialise metaData["transcription"] with those messages so | ||
| # end_conversation can safely append the agent phase on top of it. |
| existing = await get_orchestrator_template_by_outbound_number_id(outbound_number_id) | ||
| if existing and ( | ||
| exclude_template_id is None or str(existing.id) != exclude_template_id | ||
| ): | ||
| raise ValueError( |
| # Validate orchestrator uniqueness before writing | ||
| if ( | ||
| configurations | ||
| and configurations.get("is_orchestrator") | ||
| and outbound_number_id |
| "agent templates listed in orchestrator_config. Only one template per outbound number " | ||
| "may have is_orchestrator=True (enforced at save time by the accessor).", | ||
| ) | ||
| orchestrator_config: Optional[OrchestratorConfig] = Field( | ||
| None, |
| for agent_route in orch_config.agent_templates: | ||
| raw_template = await get_template_by_id(agent_route.template_id) | ||
| if not raw_template: |
| including the orchestrator's own TTS under key ``"__orchestrator__"``. | ||
| - ``agent_templates_map``: dict mapping agent template_id → rendered | ||
| ``TemplateModel`` (ready for ``FlowConfigBuilder``). | ||
|
|
| context.lead.metaData["transcription"] = ( | ||
| context.lead.metaData.get("transcription") or [] | ||
| ) + transcription | ||
| logger.info( | ||
| f"Collected {len(transcription)} total messages " | ||
| f"({len(filtered_transcript)} user/assistant) for call {context.call_sid}" |
| async def route_to_agent( | ||
| context: TemplateContext, | ||
| args: Dict[str, Any], | ||
| transition_to: Optional[str] = None, | ||
| ) -> Dict[str, Any]: |
| flow_manager._global_functions = agent_global_functions | ||
| logger.debug( | ||
| f"[Orchestrator] Rebuilt {len(agent_global_functions)} global " | ||
| f"function(s) from agent template '{agent_template.name}' " | ||
| f"(call {context.call_sid})" | ||
| ) | ||
| except Exception as exc: | ||
| # Non-fatal: clear globals to at least prevent route_to_agent | ||
| # hallucination even if we can't register agent globals. | ||
| flow_manager._global_functions = [] | ||
| logger.warning( | ||
| f"[Orchestrator] Failed to rebuild agent global functions, " | ||
| f"cleared instead (call {context.call_sid}): {exc}" |
| # Skip recording if the handler returned a next_node — that means it was | ||
| # a routing/transition action (e.g. route_to_agent) which already manages | ||
| # its own node traversal bookkeeping internally. | ||
| if next_node is None: |
20a2097 to
0db3701
Compare
|
@coderabbitai fixed all the comments. Review the latest commit now. |
|
(ᵔ◡ᵔ)🐇 ✅ Actions performedReview triggered.
|
0db3701 to
ce67d07
Compare
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ble hot loop Two related regressions surfaced by PR #772 (event-driven dispatch). Both share a root cause: the new dispatcher retries on transient failures at 1-3s cadence instead of the old cron's 30s, exposing latent bugs in the downstream gates. (1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts. The rate-limit Lua ZADDed every dispatch attempt that passed the peek-count check, BEFORE the channel-token gate. When channel tokens were exhausted, a single lead retried every 1-3s, ZADDing its own phone into the sliding window each time. After 7 retries (~10-15s), the 8th attempt for the SAME lead tripped its own limit, fired a Slack alert, and deferred 3600s — without ever dialing the customer. On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9 distinct Amir-and-Sons COD-confirmation leads. Verified in OpenObserve: zero make_call events, zero PROCESSING updates, zero actual dials reached any of those 9 phones. Each lead had exactly 1 create_lead_call_tracker (no duplicate ingestion) and exactly 8 defer-and-release cycles in ~2 minutes (7 short channel-wait + 1 hour-long rate-block). Fix: split call_limiter into peek (read-only ZCARD via new OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim + ZCARD + conditional ZADD in a single Lua). Peek runs before the channel-token gate (fast-fail without burning capacity); atomic record runs AFTER acquire_channel_token + _acquire_number and BEFORE provider.make_call. Atomic record is the authoritative cap. Placement constraints: - After channel-token acquire — otherwise channel-wait retries re-introduce the self-fill bug. - Before make_call — otherwise the call is already on the wire when the race is detected; strict cap is meaningless once the customer's phone rings. - Single Lua check-and-set — anything non-atomic leaves a TOCTOU window where N concurrent same-phone workers could all pass peek under-limit, all dial, then all record (case 5 in design notes; pre-PR strict cap silently weakened by the initial split). On race-rejection (another worker on the same phone filled the bucket between our peek and our record), the worker releases the channel token + DB number and defers the lead by the rate-limit window. Same Slack alert as the peek path so on-call sees one consistent title. Failure mode accepted: if make_call raises or returns no SID after atomic record runs, the bucket is inflated by 1 for a call that didn't reach the customer. Matches pre-PR semantics exactly and only triggers on rare provider-side failures. (2) _get_available_number returning None → 10s defer-and-retry forever. The four reasons gate-1 returned None — missing template config, deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter full — all collapsed into a single None and got the same 10s retry. For permanent misconfigurations this was a hot loop with no alert and no termination cap, burning DB/Redis cycles indefinitely. The Exotel `channels < maximum_channels` check is redundant with the Redis channel-token semaphore (the reconciler keeps them in sync; semaphore is authoritative per dispatch/channel_semaphore.py). Drop it. With that removed, a None return means "structural failure" — mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1 alert (raise_no_outbound_number, throttled per reseller/template), stop retrying. Tests: - test_rate_limit_blocks_before_channel_acquire: pins peek runs, record does NOT, on peek-side block. - test_channel_exhaustion_does_not_record_rate_limit_attempt: regression guard for the 2026-05-21 alert storm. - test_atomic_record_rejection_releases_resources_and_defers: cross- lead race guard — atomic record rejects, channel token + DB number released, lead deferred 1h, make_call never runs. - test_full_round_trip_happy_path: pins exactly one peek and one record per successful dial. - test_get_available_number_returns_none_marks_lead_finished: fail-fast on permanent number unavailability. 98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
66af8eb to
19c233c
Compare
- Added support to merge two different templates to the flow manager during call - This helps us in orchestrating among multiple agents while main agent being as an intent detector
19c233c to
ac05679
Compare
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Other