Skip to content

feat: Add support for template orchestrator#772

Open
manas-narra wants to merge 1 commit into
juspay:releasefrom
manas-narra:add-support-for-template-orchestrator
Open

feat: Add support for template orchestrator#772
manas-narra wants to merge 1 commit into
juspay:releasefrom
manas-narra:add-support-for-template-orchestrator

Conversation

@manas-narra
Copy link
Copy Markdown
Collaborator

@manas-narra manas-narra commented May 20, 2026

  • Added support to merge two different templates to the flow manager during call
  • This helps us in orchestrating among multiple agents while main agent being as an intent detector

Summary by CodeRabbit

Release Notes

  • New Features

    • Added orchestrator mode enabling dynamic routing between multiple agent templates during active conversations.
    • Introduced agent routing with LLM-guided template selection using agent metadata and descriptions.
  • Bug Fixes

    • Fixed transcription persistence to append new entries instead of overwriting existing data.
  • Other

    • Enhanced call tracking with template switching records and improved context management for template transitions.

Review Change Stack

Copilot AI review requested due to automatic review settings May 20, 2026 05:51
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c2d33325-e121-4adc-bd10-235448d0bfc5

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR adds orchestrator-mode support to the Breeze Buddy voice agent, enabling mid-call routing to pre-configured agent templates. The implementation preloads TTS services for all agents, provides a new route_to_agent handler that snapshots context and switches templates, tracks active templates throughout the call, and persists routing decisions to the database.

Changes

Orchestrator-Mode Call Routing

Layer / File(s) Summary
Configuration contracts for orchestrator templates
app/ai/voice/agents/breeze_buddy/template/types.py
New AgentRouteConfig and OrchestratorConfig Pydantic models define orchestrator routing targets; ConfigurationModel extended with is_orchestrator flag and optional orchestrator_config field.
Database query and accessor layer
app/database/queries/breeze_buddy/template.py, app/database/queries/breeze_buddy/lead_call_tracker.py, app/database/accessor/breeze_buddy/template.py, app/database/accessor/breeze_buddy/lead_call_tracker.py
Added query and accessor functions to fetch active orchestrator templates per outbound number, validate single orchestrator uniqueness during creation/replacement, and update lead template records mid-call without blocking on config lookup.
Orchestrator TTS preloading
app/ai/voice/agents/breeze_buddy/agent/pipeline.py, app/ai/voice/agents/breeze_buddy/agent/__init__.py
New build_orchestrator_tts_switcher() function preloads TTS services for orchestrator and all routed agent templates by rendering template variables and resolving voice configs; integrated into Agent.run (non-stream mode) to build switcher, collect maps, and fall back gracefully on errors.
Flow manager setup and template tracking
app/ai/voice/agents/breeze_buddy/agent/flow.py, app/ai/voice/agents/breeze_buddy/template/context.py
setup_flow_manager accepts and stores tts_map and agent_templates on bot instance; TemplateContext gains switch_template() method to record routing decisions and record_node_entry() tags each node traversal with currently active template_id.
Route handler core implementation
app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py
New async route_to_agent handler validates template_id, snapshots orchestrator LLM context to transcription, resolves pre-rendered agent template, builds agent flow config with context reset, switches TTS, persists routing, records node transitions, and rebuilds global functions to prevent tool hallucinations.
Handler registration and dispatch logic
app/ai/voice/agents/breeze_buddy/handlers/internal/__init__.py, app/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.py
route_to_agent exported from internal handlers and registered in BUILTIN_HANDLERS; dispatcher updated to record global function calls only when handler does not return next_node, avoiding duplicate recording for transition handlers.
Flow builder integration
app/ai/voice/agents/breeze_buddy/template/builder.py
route_to_agent registered in FlowConfigBuilder.handler_map; during direct-mode flow building, available agent templates (template_id, name, description) appended to system prompt when orchestrator mode is active.
Call resolution and transcription handling
app/api/routers/breeze_buddy/telephony/answer/handlers.py, app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py, app/ai/voice/agents/breeze_buddy/managers/calls.py
Inbound call resolution detects orchestrator templates and bypasses IVR; transcription now appended to existing history instead of replaced; call completion persists lead records unconditionally and makes config lookup optional for retry scheduling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit builds a crossroads fair,
Where calls may route with utmost care—
Voices pre-loaded, templates aligned,
Flow switches seamless through the mind.
Mid-conversation, paths converge,
As orchestrator and agents merge! 🎭

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main feature being added—orchestrator support for templates—which is directly reflected in the substantial changes across the codebase.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@manas-narra manas-narra force-pushed the add-support-for-template-orchestrator branch from ad68fd5 to 33ed5fb Compare May 20, 2026 05:54
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/ai/voice/agents/breeze_buddy/agent/__init__.py`:
- Around line 923-927: The current except block in the orchestrator bootstrap
(the try that builds the TTS switcher) only logs and falls back to plain TTS,
which can leave routing enabled but without required state (tts_map /
agent_templates); change that handler in __init__.py so it fails fast: after
logging the error (logger.error(...)), re-raise the exception (or raise a new
RuntimeError with context) instead of continuing, ensuring the orchestrator
bootstrap in the TTS switcher build path aborts rather than degrading to plain
TTS.

In `@app/ai/voice/agents/breeze_buddy/agent/pipeline.py`:
- Around line 249-256: The new block calling FlowConfigLoader().load_template
(the variables rendered_template and agent_vars) is not Black-formatted; run the
Black formatter with line-length=88 over the file (or at least this
function/pipeline block) to fix spacing/line breaks and ensure conformity with
the project's Black rules, then re-run CI; focus on formatting around the
FlowConfigLoader instantiation and the await flow_loader.load_template call to
resolve the CI failure.
- Around line 233-240: The loop that preloads TTS silently continues when
get_template_by_id(agent_route.template_id) returns None
(orch_config.agent_templates / agent_route.template_id in the for-loop), leaving
a advertised route unusable; instead, fail fast during preload by replacing the
continue branch: log the error with full context and raise an explicit exception
(e.g., RuntimeError/ValueError) that includes agent_route.template_id and
agent_route.name so startup/preload fails and the misconfiguration is surfaced
immediately.

In `@app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py`:
- Around line 34-51: The route_to_agent function has a wrong return annotation
and docstring: it currently declares Dict[str, Any] but on success returns a
tuple (dict, dict) while error paths return a single dict; update the function
signature and docstring to reflect the union return type (e.g., Tuple[Dict[str,
Any], Dict[str, Any]] | Dict[str, Any] or Optional/Union equivalent used in the
codebase) so static checkers are satisfied, and adjust the docstring description
of the return value to explain both the success tuple (routing result and
edge-function payload) and the error dict; modify the annotation on
route_to_agent and the Returns section in its docstring accordingly.

In `@app/ai/voice/agents/breeze_buddy/template/types.py`:
- Around line 941-953: The schema currently documents but does not enforce that
when is_orchestrator is True an orchestrator_config must be provided and that
that config contains at least one agent template; add validation to the Pydantic
model holding is_orchestrator and orchestrator_config (e.g., a `@root_validator`
or `@validator` on orchestrator_config) to: 1) raise a ValidationError if
is_orchestrator is True and orchestrator_config is None, 2) validate
orchestrator_config.agent_templates is not empty, and 3) optionally ensure that
if is_orchestrator is False then orchestrator_config is None or empty; reference
the is_orchestrator and orchestrator_config fields and the OrchestratorConfig
type when adding the validator so invalid templates fail at validation time
rather than runtime.

In `@app/database/accessor/breeze_buddy/lead_call_tracker.py`:
- Around line 728-729: The two-line call to update_lead_template_query and
run_parameterized_query violates Black formatting (line length/spacing);
reformat the file using Black with line-length=88 to fix it — run `black .
--line-length 88` (or your repo’s Black command) on the file containing
update_lead_template_query and run_parameterized_query so the call is
wrapped/indented to satisfy Black and CI.

In `@app/database/accessor/breeze_buddy/template.py`:
- Around line 112-118: The uniqueness check currently calls
validate_single_orchestrator_per_number whenever
configurations.get("is_orchestrator") is true, blocking saving of inactive
drafts; change the guard to only run this validation when the orchestrator is
active by also checking configurations.get("is_active") (e.g., if configurations
and configurations.get("is_orchestrator") and configurations.get("is_active")
and outbound_number_id: await
validate_single_orchestrator_per_number(outbound_number_id)). Apply the same
change to the other similar block that calls
validate_single_orchestrator_per_number (the one around the second occurrence
handling updates) so only active orchestrators trigger the uniqueness
validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 76dc8643-f396-4b8d-bd7d-3b0156c11571

📥 Commits

Reviewing files that changed from the base of the PR and between 72116d9 and ad68fd5.

📒 Files selected for processing (16)
  • app/ai/voice/agents/breeze_buddy/agent/__init__.py
  • app/ai/voice/agents/breeze_buddy/agent/flow.py
  • app/ai/voice/agents/breeze_buddy/agent/pipeline.py
  • app/ai/voice/agents/breeze_buddy/handlers/internal/__init__.py
  • app/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.py
  • app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py
  • app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py
  • app/ai/voice/agents/breeze_buddy/managers/calls.py
  • app/ai/voice/agents/breeze_buddy/template/builder.py
  • app/ai/voice/agents/breeze_buddy/template/context.py
  • app/ai/voice/agents/breeze_buddy/template/types.py
  • app/api/routers/breeze_buddy/telephony/answer/handlers.py
  • app/database/accessor/breeze_buddy/lead_call_tracker.py
  • app/database/accessor/breeze_buddy/template.py
  • app/database/queries/breeze_buddy/lead_call_tracker.py
  • app/database/queries/breeze_buddy/template.py

Comment thread app/ai/voice/agents/breeze_buddy/agent/__init__.py
Comment thread app/ai/voice/agents/breeze_buddy/agent/pipeline.py Outdated
Comment on lines +249 to +256
flow_loader = FlowConfigLoader()
rendered_template, agent_vars = await flow_loader.load_template(
reseller_id=raw_template.reseller_id,
template=raw_template.name,
merchant_id=raw_template.merchant_id,
call_payload=call_payload or {},
template_id=raw_template.id,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Black formatting check is failing in this new block.

The CI pipeline already reports this file as not Black-compliant; please run formatter on this section before merge.

As per coding guidelines, "Use Black for code formatting with line-length=88".

🧰 Tools
🪛 GitHub Actions: PR Build Check / 0_build.txt

[error] 252-252: Black formatting check failed (uv run black --check --diff .). This file would be reformatted.

🪛 GitHub Actions: PR Build Check / build

[error] 252-252: Black formatting check failed (uv run black --check --diff .). File would be reformatted: /home/runner/work/clairvoyance/clairvoyance/app/ai/voice/agents/breeze_buddy/agent/pipeline.py

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/ai/voice/agents/breeze_buddy/agent/pipeline.py` around lines 249 - 256,
The new block calling FlowConfigLoader().load_template (the variables
rendered_template and agent_vars) is not Black-formatted; run the Black
formatter with line-length=88 over the file (or at least this function/pipeline
block) to fix spacing/line breaks and ensure conformity with the project's Black
rules, then re-run CI; focus on formatting around the FlowConfigLoader
instantiation and the await flow_loader.load_template call to resolve the CI
failure.

Comment thread app/ai/voice/agents/breeze_buddy/template/types.py
Comment thread app/database/accessor/breeze_buddy/lead_call_tracker.py Outdated
Comment thread app/database/accessor/breeze_buddy/template.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds “template orchestrator” support to Breeze Buddy voice calls: an orchestrator template can act as the inbound entrypoint, perform intent detection, and then route mid-call to a chosen agent template while preserving call metadata/transcription and optionally hot-swapping TTS voices.

Changes:

  • Added an is_orchestrator template configuration with an orchestrator_config routing table and a route_to_agent builtin handler for mid-call template switching.
  • Implemented preloading of agent templates + agent TTS services at pipeline startup (ServiceSwitcher + tts_map) to enable low-latency voice switching on route.
  • Improved call completion persistence (don’t gate FINISHED write on config lookup) and enhanced transcription/node traversal metadata to reflect cross-template routing.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
app/database/queries/breeze_buddy/template.py Adds SQL query to fetch the active orchestrator template for an outbound number.
app/database/queries/breeze_buddy/lead_call_tracker.py Adds SQL update to change a lead’s template/template_id mid-call.
app/database/accessor/breeze_buddy/template.py Enforces “single orchestrator per outbound number” and provides orchestrator lookup helpers.
app/database/accessor/breeze_buddy/lead_call_tracker.py Adds accessor to persist the routed-to agent template on the lead record.
app/api/routers/breeze_buddy/telephony/answer/handlers.py Bypasses IVR selection when an orchestrator template exists for the called number.
app/ai/voice/agents/breeze_buddy/template/types.py Introduces orchestrator configuration schema (agent routing list).
app/ai/voice/agents/breeze_buddy/template/context.py Adds template switching + tags node traversal entries with active template_id.
app/ai/voice/agents/breeze_buddy/template/builder.py Injects an orchestrator routing table into direct-mode system prompts.
app/ai/voice/agents/breeze_buddy/managers/calls.py Ensures call completion details persist even if config lookup fails.
app/ai/voice/agents/breeze_buddy/handlers/internal/route_to_agent.py New handler implementing mid-call routing, metadata snapshotting, TTS swap, and edge-function transition.
app/ai/voice/agents/breeze_buddy/handlers/internal/end_conversation.py Appends transcription to preserve pre-routing history.
app/ai/voice/agents/breeze_buddy/handlers/internal/builtin_dispatcher.py Registers route_to_agent and adjusts traversal recording for edge-function transitions.
app/ai/voice/agents/breeze_buddy/handlers/internal/init.py Exports route_to_agent handler.
app/ai/voice/agents/breeze_buddy/agent/pipeline.py Adds orchestrator TTS/template preloading helper (ServiceSwitcher + rendered templates map).
app/ai/voice/agents/breeze_buddy/agent/flow.py Stores tts_map + pre-rendered agent templates on the bot for handler access.
app/ai/voice/agents/breeze_buddy/agent/init.py Wires orchestrator TTS switcher into agent startup and passes maps into flow manager setup.
Comments suppressed due to low confidence (1)

app/database/accessor/breeze_buddy/template.py:396

  • Same as create_template: a ValueError from validate_single_orchestrator_per_number will be swallowed by the broad exception handler and converted into None, which the REST handler maps to a 500. Consider re-raising ValueError so callers can treat it as a 400 validation failure.
        # Validate orchestrator uniqueness before writing
        if (
            configurations
            and configurations.get("is_orchestrator")
            and outbound_number_id

Comment on lines +73 to +76
# We store the orchestrator phase as a flat list under
# metaData["orchestrator_transcription"]
# and also initialise metaData["transcription"] with those messages so
# end_conversation can safely append the agent phase on top of it.
Comment on lines +663 to +667
existing = await get_orchestrator_template_by_outbound_number_id(outbound_number_id)
if existing and (
exclude_template_id is None or str(existing.id) != exclude_template_id
):
raise ValueError(
Comment on lines +112 to +116
# Validate orchestrator uniqueness before writing
if (
configurations
and configurations.get("is_orchestrator")
and outbound_number_id
Comment on lines +946 to +950
"agent templates listed in orchestrator_config. Only one template per outbound number "
"may have is_orchestrator=True (enforced at save time by the accessor).",
)
orchestrator_config: Optional[OrchestratorConfig] = Field(
None,
Comment on lines +233 to +235
for agent_route in orch_config.agent_templates:
raw_template = await get_template_by_id(agent_route.template_id)
if not raw_template:
Comment on lines +203 to +206
including the orchestrator's own TTS under key ``"__orchestrator__"``.
- ``agent_templates_map``: dict mapping agent template_id → rendered
``TemplateModel`` (ready for ``FlowConfigBuilder``).

Comment on lines 106 to 111
context.lead.metaData["transcription"] = (
context.lead.metaData.get("transcription") or []
) + transcription
logger.info(
f"Collected {len(transcription)} total messages "
f"({len(filtered_transcript)} user/assistant) for call {context.call_sid}"
Comment on lines +34 to +38
async def route_to_agent(
context: TemplateContext,
args: Dict[str, Any],
transition_to: Optional[str] = None,
) -> Dict[str, Any]:
Comment on lines +285 to +297
flow_manager._global_functions = agent_global_functions
logger.debug(
f"[Orchestrator] Rebuilt {len(agent_global_functions)} global "
f"function(s) from agent template '{agent_template.name}' "
f"(call {context.call_sid})"
)
except Exception as exc:
# Non-fatal: clear globals to at least prevent route_to_agent
# hallucination even if we can't register agent globals.
flow_manager._global_functions = []
logger.warning(
f"[Orchestrator] Failed to rebuild agent global functions, "
f"cleared instead (call {context.call_sid}): {exc}"
Comment on lines +175 to +178
# Skip recording if the handler returned a next_node — that means it was
# a routing/transition action (e.g. route_to_agent) which already manages
# its own node traversal bookkeeping internally.
if next_node is None:
@manas-narra manas-narra force-pushed the add-support-for-template-orchestrator branch 2 times, most recently from 20a2097 to 0db3701 Compare May 20, 2026 06:36
@manas-narra
Copy link
Copy Markdown
Collaborator Author

@coderabbitai fixed all the comments. Review the latest commit now.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 20, 2026

@manas-narra Sure, I'll review the latest commit now!

(ᵔ◡ᵔ)🐇

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@manas-narra manas-narra force-pushed the add-support-for-template-orchestrator branch from 0db3701 to ce67d07 Compare May 20, 2026 13:18
swaroopvarma1 pushed a commit that referenced this pull request May 21, 2026
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swaroopvarma1 pushed a commit that referenced this pull request May 21, 2026
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swaroopvarma1 pushed a commit that referenced this pull request May 21, 2026
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swaroopvarma1 pushed a commit that referenced this pull request May 21, 2026
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swaroopvarma1 pushed a commit that referenced this pull request May 21, 2026
…ble hot loop

Two related regressions surfaced by PR #772 (event-driven dispatch). Both
share a root cause: the new dispatcher retries on transient failures at
1-3s cadence instead of the old cron's 30s, exposing latent bugs in the
downstream gates.

(1) Spurious "Outbound Rate Limit Exceeded (BLOCKED)" alerts.

   The rate-limit Lua ZADDed every dispatch attempt that passed the
   peek-count check, BEFORE the channel-token gate. When channel tokens
   were exhausted, a single lead retried every 1-3s, ZADDing its own
   phone into the sliding window each time. After 7 retries (~10-15s),
   the 8th attempt for the SAME lead tripped its own limit, fired a
   Slack alert, and deferred 3600s — without ever dialing the customer.

   On 2026-05-21 16:50 IST, 9 such alerts fired in 16 seconds against 9
   distinct Amir-and-Sons COD-confirmation leads. Verified in
   OpenObserve: zero make_call events, zero PROCESSING updates, zero
   actual dials reached any of those 9 phones. Each lead had exactly 1
   create_lead_call_tracker (no duplicate ingestion) and exactly 8
   defer-and-release cycles in ~2 minutes (7 short channel-wait + 1
   hour-long rate-block).

   Fix: split call_limiter into peek (read-only ZCARD via new
   OUTBOUND_RATE_LIMIT_PEEK_LUA, fires alert, never writes) and an
   atomic check-and-record (OUTBOUND_RATE_LIMIT_RECORD_LUA: trim +
   ZCARD + conditional ZADD in a single Lua). Peek runs before the
   channel-token gate (fast-fail without burning capacity); atomic
   record runs AFTER acquire_channel_token + _acquire_number and
   BEFORE provider.make_call.

   Atomic record is the authoritative cap. Placement constraints:
     - After channel-token acquire — otherwise channel-wait retries
       re-introduce the self-fill bug.
     - Before make_call — otherwise the call is already on the wire
       when the race is detected; strict cap is meaningless once the
       customer's phone rings.
     - Single Lua check-and-set — anything non-atomic leaves a TOCTOU
       window where N concurrent same-phone workers could all pass
       peek under-limit, all dial, then all record (case 5 in design
       notes; pre-PR strict cap silently weakened by the initial split).

   On race-rejection (another worker on the same phone filled the
   bucket between our peek and our record), the worker releases the
   channel token + DB number and defers the lead by the rate-limit
   window. Same Slack alert as the peek path so on-call sees one
   consistent title.

   Failure mode accepted: if make_call raises or returns no SID after
   atomic record runs, the bucket is inflated by 1 for a call that
   didn't reach the customer. Matches pre-PR semantics exactly and only
   triggers on rare provider-side failures.

(2) _get_available_number returning None → 10s defer-and-retry forever.

   The four reasons gate-1 returned None — missing template config,
   deleted/disabled number, Twilio status==IN_USE, Exotel DB-counter
   full — all collapsed into a single None and got the same 10s retry.
   For permanent misconfigurations this was a hot loop with no alert
   and no termination cap, burning DB/Redis cycles indefinitely.

   The Exotel `channels < maximum_channels` check is redundant with the
   Redis channel-token semaphore (the reconciler keeps them in sync;
   semaphore is authoritative per dispatch/channel_semaphore.py). Drop
   it. With that removed, a None return means "structural failure" —
   mark FINISHED with outcome NUMBER_UNAVAILABLE, fire a throttled P1
   alert (raise_no_outbound_number, throttled per reseller/template),
   stop retrying.

Tests:
- test_rate_limit_blocks_before_channel_acquire: pins peek runs, record
  does NOT, on peek-side block.
- test_channel_exhaustion_does_not_record_rate_limit_attempt:
  regression guard for the 2026-05-21 alert storm.
- test_atomic_record_rejection_releases_resources_and_defers: cross-
  lead race guard — atomic record rejects, channel token + DB number
  released, lead deferred 1h, make_call never runs.
- test_full_round_trip_happy_path: pins exactly one peek and one
  record per successful dial.
- test_get_available_number_returns_none_marks_lead_finished:
  fail-fast on permanent number unavailability.

98 passed, 1 xfailed; pyrefly 0 errors; black/isort/autoflake clean on app/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@manas-narra manas-narra force-pushed the add-support-for-template-orchestrator branch 3 times, most recently from 66af8eb to 19c233c Compare May 25, 2026 04:41
  - Added support to merge two different templates to the flow manager during call
  - This helps us in orchestrating among multiple agents while main agent being as an intent detector
@manas-narra manas-narra force-pushed the add-support-for-template-orchestrator branch from 19c233c to ac05679 Compare May 26, 2026 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants