feat: new Scenario API ✨ by rogeriochaves · Pull Request #2 · langwatch/scenario

rogeriochaves · 2025-06-12T12:51:59Z

@pytest.mark.agent_test
@pytest.mark.asyncio
async def test_ai_assistant_agent():
    scenario = Scenario(
        name="false assumptions",
        description="""
            The agent makes false assumption about being an ATM bank, and user corrects it
        """,
        agent=AiAssistantAgentAdapter,
        criteria=[
            "user should get good recommendations on river crossing",
            "agent should NOT follow up about ATM recommendation after user has corrected them they are just hiking",
        ],
        max_turns=5,
    )

    result = await scenario.script(
        [
            scenario.user("how do I safely approach a bank?"),
            scenario.agent(),
            scenario.user(),
            scenario.agent(),
            scenario.judge(),
        ]
    ).run()

    assert result.success

…onfig

…ew tests, getting things started

…g and returning messages as they wish, simplify the executor code, add stronger and more robust validations and conversions

…eded anymore

… pending agents, treat all agents the same, just keep their roles proceeding, get ready for scripted runs

…ned, as users will be able to manually evaluate it later

…the testing agent

….message

Copilot

Pull Request Overview

This PR introduces a new Scenario API with improvements in agent adapter configuration, scenario scripting, and enhanced type safety across several modules. Key changes include:

Conversion of agent functions to subclass-based adapters for both testing and production.
Refactoring of configuration merging and scenario scripting for better consistency.
Updates to tests and examples to support the new workflow and API signatures.

Reviewed Changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

File	Description
tests/*	Updated tests to use new adapter subclass pattern
scenario/*	Refactored Scenario initialization, configuration, and scripting
examples/*	Updated examples to align with new Scenario API design
setup.py, pyproject.toml, Makefile	Version updates and dependency additions

Comments suppressed due to low confidence (2)

scenario/config.py:32

[nitpick] Instead of creating a custom items() helper to merge configuration values, consider using Pydantic’s model_dump(exclude_none=True) directly to improve clarity and consistency.

def merge(self, other: "ScenarioConfig") -> "ScenarioConfig":

scenario/scenario.py:100

Filtering None values out of the agents list before type and subclass checks would prevent potential unclear ValueErrors; consider using a list comprehension to include only non-null agents.

agents = agents or [kwargs.get("testing_agent"), agent,  # type: ignore]
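A minimal sketch of the filtering suggested above. The `collect_agents` helper and its parameter names are hypothetical, chosen only for illustration; this is not the code in `scenario/scenario.py`.

```python
from typing import Any, List, Optional


def collect_agents(
    agents: Optional[List[Any]] = None,
    agent: Optional[Any] = None,
    testing_agent: Optional[Any] = None,
) -> List[Any]:
    """Return the configured agents with unset (None) entries removed."""
    candidates = agents or [testing_agent, agent]
    # Drop None entries first so later type/subclass checks only see real agents.
    return [candidate for candidate in candidates if candidate is not None]
```

With this shape, a caller that passes only `agent` gets a one-element list rather than `[None, agent]`, so any later validation error can point at a real object.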

Copilot · 2025-06-12T13:37:01Z

+    def __init__(self, input: AgentInput):
+        super().__init__(input)
+
+        if not self.model:


[nitpick] The check for an empty model value may be ambiguous if self.model is an empty string. Consider a more explicit validation or providing a sensible default in TestingAgent.with_config.

Suggested change

if not self.model:

if not isinstance(self.model, str) or not self.model.strip():
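To make the difference between the two checks concrete, here is a small sketch. `has_usable_model` is a hypothetical helper name and the model string is only an illustrative value; neither comes from the PR.

```python
from typing import Any


def has_usable_model(model: Any) -> bool:
    """True only for a string that contains at least one non-whitespace character."""
    return isinstance(model, str) and bool(model.strip())


# A whitespace-only string is truthy, so the plain `if not self.model` check lets it through.
whitespace_only = "   "
passes_truthiness_check = bool(whitespace_only)
passes_explicit_check = has_usable_model(whitespace_only)
```

The explicit check also rejects non-string values, which the truthiness check would accept whenever they happen to be truthy.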

…h don't play well with json; also, make sure return types can always be converted into a dict

…ging

Concerns resolved from the second review pass:

- #1 Drain a pending wait=False agent turn at the top of _script_call_agent plus proceed/succeed/fail so judge(), succeed(), fail(), proceed() see the completed agent message. Guard against self-await when the drain enters on the background task itself.
- #2 voice_style no longer injects "[style] text" inline, because every registered provider would have spoken the bracketed word aloud. Emit a one-shot UserWarning and synthesise without modification until per-provider instructions channels land.
- #5 Replace blanket "except Exception: pass" in hook fire helpers with logger.warning(..., exc_info=True) so callback bugs are visible.
- #6 Bound TTS cache to 64 LRU entries: at ~14 MB per 5-min clip worst case, this caps the cache at ~900 MB even for long utterances. Prevents unbounded growth in long-lived processes.
- #7 background_noise path fallback now requires a separator or .wav suffix before treating the argument as a filesystem path, avoiding the cwd footgun where a typo'd preset name matches a stray local file.
- #9 Replace module-global _WARNED_ADAPTERS with WebRTCVadFallback.reset_warnings() classmethod so tests don't need to reach into private module state. Update tests accordingly.
- #10 Rewrite PendingTransportError hint: remind subclass authors that the inherited AdapterCapabilities ClassVar must be re-audited, so a subclass claiming streaming_transcripts=True without a real transcript stream does not silently break after_words interruption.
- #11 Defensively trim a trailing odd byte from OpenAI TTS PCM responses and pin the PCM16 @ 24kHz mono expectation in the docstring. Invariant asserted at AudioChunk boundary (see #14).
- #13 OpenAI Realtime user-role text routing: when the user-role agent is an OpenAIRealtimeAgent, scripted user("text") now invokes send_text on the realtime session instead of TTS. Explicit AC from §7.2 L1164-1171.
- #14 AudioChunk.__post_init__ raises ValueError on odd-byte PCM16 data, catching partial-frame bugs at the canonical boundary instead of letting them silently drift through np.frombuffer / duration_seconds.

Deferred to follow-ups (noted in PR body, not blocking #350):

- #3 stub adapters transport wire-up
- #4 narrow public surface for executor/sim state
- #8 rename noise presets to match synthetic content
- #12 pytest-bdd wiring for the 83 Gherkin scenarios

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
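Items #11 and #14 describe a trim-then-assert pattern for PCM16 audio. The sketch below illustrates that pattern under stated assumptions: `AudioChunkSketch`, `trim_odd_byte`, and the two constants are hypothetical stand-ins, not the library's actual `AudioChunk` class, and the 24 kHz mono rate is taken from the commit message.

```python
from dataclasses import dataclass

PCM16_SAMPLE_WIDTH_BYTES = 2  # one 16-bit sample (illustrative constant)
SAMPLE_RATE_HZ = 24_000       # PCM16 @ 24 kHz mono, as stated in the commit message


def trim_odd_byte(pcm: bytes) -> bytes:
    """Drop a trailing half-sample so the payload holds only whole PCM16 samples."""
    return pcm[:-1] if len(pcm) % PCM16_SAMPLE_WIDTH_BYTES else pcm


@dataclass(frozen=True)
class AudioChunkSketch:
    """Hypothetical stand-in for AudioChunk; only the odd-byte invariant is modelled."""

    data: bytes

    def __post_init__(self) -> None:
        # Fail loudly at the boundary instead of letting a partial sample drift downstream.
        if len(self.data) % PCM16_SAMPLE_WIDTH_BYTES != 0:
            raise ValueError(
                f"PCM16 data must have an even byte length, got {len(self.data)}"
            )

    @property
    def duration_seconds(self) -> float:
        return len(self.data) / (PCM16_SAMPLE_WIDTH_BYTES * SAMPLE_RATE_HZ)


# A provider response with one stray byte is trimmed before it reaches the chunk.
chunk = AudioChunkSketch(trim_odd_byte(b"\x00" * 48_001))
```

The split keeps the lenient step (trimming) at the provider edge and the strict step (raising) at the shared data type, so every other producer still gets the check.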

…tion

Closes gaps surfaced during AC-7/AC-8 happy-path doc writing:

- README gains a "Voice Agents" section with a minimal ElevenLabs example, adapter inventory, feature surface summary, and pointers to the two happy-path docs. Fills gap #1 (OpenAI key also needed for ElevenLabs tests) and gap #11 (CI example missing).
- scripts/provision_elevenlabs_agent.py header clarifies the script is for the SDK's own CI, NOT for SDK users; closes gap #3 (user confusion about when to run it).

Remaining gaps (#2, #4, #5, #7, #8, #9, #10) were already covered by the happy-path docs themselves. Gap #6 (verify scripted user("text") against live ElevenLabs) is verified by the AC-6 suite run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…g-safe compare, coverage

Addresses 8 of the 13 actionable items from the /review fanout:

Security:

- twilio-server.ts: cap webhook body at 1 MB via streaming guard; reject with HTTP 413 instead of accumulating into memory (concern #7).
- twilio-shared.ts: replace hand-rolled XOR signature compare with `crypto.timingSafeEqual` on decoded base64 buffers: a Node-stdlib primitive, no DIY constant-time math (concern #10).
- twilio-tunnel.ts: drop `(0, eval)("(name) => import(name)")` indirect; use bare dynamic `import()` in try/catch on ERR_MODULE_NOT_FOUND so bundlers and security scanners can analyze the path (concern #8).

Coverage (the highest-risk port-only LOC was untested):

- twilio.test.ts: codec round-trip: 100 ms 440 Hz sine wave through pcm16/24k → mulaw/8k → pcm16/24k, average abs sample-diff < 2000 (under 10 % of peak). Plus empty-input case.
- twilio.test.ts: `verifyTwilioSignature` valid-signature accept, wrong-token reject, wrong-URL reject, missing-signature reject.
- twilio.test.ts: `validateE164` + `validateDtmf` accept/reject + the TwiML-injection payload the docstring warns about.
- twilio.test.ts: `onDtmf` callback fires on `dtmf` frame, `allowedCallers` filter rejects + records, stop-frame flush enqueues a final AudioChunk.

Observability + boy-scout:

- twilio-logger.ts (new): minimal `[twilio] …` console wrapper mirroring Python's `logging.getLogger("scenario.voice.twilio")`. Same log sites as the Python parity: body-cap violation, signature rejection, disallowed-caller reject, DTMF receipt, onDtmf callback error (concerns #1 + #14).
- twilio-shared.ts: drop duplicate `PCM16_SAMPLE_WIDTH = 2`; import the canonical `PCM16_SAMPLE_WIDTH_BYTES` from `../audio-chunk` and rename call sites (concern #3).
- twilio.ts: drop dead `UnsupportedCapabilityError` import + the `export type` re-export that papered over its unused state; base class re-exports via voice/index.ts already (concern #12).
- twilio-tunnel.test.ts: wrap cucumber binding in `if (TUNNEL_ENABLED)`; on CI fall back to `describe.skip(...)` with a single placeholder `it` so the runner reports one skipped block instead of five vacuous greens (concern #5).

Deferred (documented as follow-ups, not addressed here):

- Refactor adapter↔server coupling into a `MediaStreamSession` value object (concern #2). Bigger architectural change; PR3+ executor wiring will exercise the seam first.
- Migrate `makeDeferred` to `Promise.withResolvers()` (concern #9).
- Replace `rejectedCount` instance field with `getStats()` snapshot (concern #11); depends on the logger module's contract solidifying.
- `call()` Liskov tension (concern #13); same PR3+ wiring scope.

Test surface: 33 passed + 1 skipped (was 27); full suite 409 passed + 1 skipped, build + typecheck green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
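The constant-time signature check from concern #10 can be sketched in Python, where `hmac.compare_digest` plays the role that `crypto.timingSafeEqual` plays in Node. This is an illustrative analogue, not the repository's code: the function names are hypothetical, the signing scheme follows Twilio's documented request validation (HMAC-SHA1 over the URL plus sorted form parameters, base64-encoded), and the sketch compares the encoded signatures rather than the decoded buffers the commit mentions.

```python
import base64
import hashlib
import hmac
from typing import Mapping, Optional


def compute_twilio_signature(auth_token: str, url: str, params: Mapping[str, str]) -> str:
    """Base64 of HMAC-SHA1(auth_token, url + each sorted key followed by its value)."""
    payload = url + "".join(key + params[key] for key in sorted(params))
    digest = hmac.new(auth_token.encode("utf-8"), payload.encode("utf-8"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_twilio_signature(
    auth_token: str, url: str, params: Mapping[str, str], signature: Optional[str]
) -> bool:
    """Accept only when the header matches the expected signature."""
    if not signature:
        return False  # a missing header is rejected outright
    expected = compute_twilio_signature(auth_token, url, params)
    # compare_digest avoids leaking, through timing, how many leading bytes matched.
    return hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8"))
```

The four cases listed for `verifyTwilioSignature` (valid signature, wrong token, wrong URL, missing signature) map directly onto calls to this function with one input changed at a time.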

…cases

Addresses 5 review concerns (review #540 synthesizer pass):

- #1 perf: receive-side mulaw buffer now stores Uint8Array slices, not number[]; bufferMulaw is O(1) per call instead of O(n) per byte.
- #2 docs: coerceFrameToText's 0x7b/0x5b heuristic is now documented as a known rare-collision risk (binary µ-law with first byte == { or [ would mis-route to JSON parser and silently drop).
- #4 test pyramid: round-trip scenario re-tagged @Unit (FakeWebSocket = no network); real-WSS @integration demo deferred behind env-gated bot endpoint per /browser-qa note.
- #5 coverage: 2 new edge-case tests for partial-buffer flush on bot-sent `stop` event and on socket-close.

Not addressed in this PR (filed as follow-up considerations):

- #3 vestigial audioFormat/sampleRate fields (inherited from Python parity)
- #6 DTMF/E.164 validation regex port (pre-requisite for PR11 Twilio)
- #8 extract TwilioMediaStreamsTransport helper (PR11 prep)
- #9 JSON-frame size cap (no regression vs main; same constraint as Python)
- #10 FakeWebSocket vs node:events (cosmetic)
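The collision risk documented under #2 is easy to reproduce. The sketch below is a Python illustration of a first-byte heuristic of the kind described; `looks_like_json` is a hypothetical name, and the real `coerceFrameToText` may differ in its details.

```python
JSON_OPENERS = (0x7B, 0x5B)  # ASCII "{" and "["


def looks_like_json(frame: bytes) -> bool:
    """Guess that a frame is JSON text purely from its first byte."""
    return len(frame) > 0 and frame[0] in JSON_OPENERS


control_frame = b'{"event": "stop"}'    # genuine JSON control message
audio_frame = bytes([0xFF, 0x7F, 0x12])  # binary audio, classified correctly
unlucky_audio = bytes([0x7B, 0x00, 0x41])  # binary audio whose first byte happens to be "{"
```

`unlucky_audio` is routed to the JSON path even though it is audio, which is the mis-route the commit documents as a known, rare limitation.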

The bundled Pipecat bot emits a canned greeting on the `connected` event. The old user-first script let that greeting collide with the user's opener, so the first barge-in cut off the GREETING (not a substantive reply) and the bot answered a stale topic — caught by listening to the recording. Open with scenario.agent() to capture the greeting as its own turn (same shape as angry_customer / basic_greeting / random_interruptions), then drive a 2FA walk-through interrupted mid-reply to pivot to a password reset (barge-in #1), then interrupted again to ask for brevity (barge-in #2). Each barge-in now cuts off a SUBSTANTIVE reply and the conversation coheres. Criteria updated to encode the new promise; CODE assertions (transcriptTruncated, fired_after_speech, ratio<0.8, recovery-after-interrupt) unchanged. Intentionally diverges from the user-first Python twin — documented in the file docstring. Regenerated recording (real bot, real OpenAI keys): 9 segments, greeting first, 2 truncated substantive replies, both fired_after_speech, byte-accurate manifest, full.wav 859758 bytes (<1MB). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two /review must-fixes:

1. transcribe.test.ts had `void transcribeSegments(...).then(expect...)` inside a synchronous Then callback. The promise resolved after the step completed, so any assertion failure was silently swallowed by vitest. Made the Then async and awaited the call directly.
2. Doc-comment headers in stt/tts/transcribe.test.ts incorrectly cited `@ts-bound`. Updated to cite each file's actual tag (`@ts-stt`, `@ts-tts`, `@ts-transcribe`) so the next reader doesn't get misled. Note: transcribe.test.ts header already said `@ts-transcribe` correctly; only stt.test.ts and tts.test.ts needed updating.

Reviewer convergence (3x on #1, 2x on #2): test + principles + hygiene + principles. Refs #516, #517, #513.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rotocol tests

Review pass on PR #536 surfaced four actionable concerns. Addressed:

- **#1 (blocking): `connect()` left WS without `error`/`close` handlers after `onOpen` called `removeAllListeners()`.** An unhandled `error` on a Node EventEmitter crashes the process. Re-attach `message` + `error` + `close` listeners atomically post-open. The new `error` handler nulls `this.ws` so subsequent `sendAudio`/`receiveAudio` fail fast instead of writing to a dead socket. Pending receivers drain to empty `AudioChunk` so the executor unwinds rather than hanging.
- **#2 (blocking): `onMessage` branches were untested.** Added 14 wire-protocol unit tests (plain vitest, not cucumber-bound) covering: base64 PCM16 decode, odd-byte trim invariant, audio queue/waiter FIFO, ping → pong with `event_id`, ping defensive (no `event_id` skip), `user_transcript` capture, `agent_response` capture, `agent_response_correction` override, format-drift warning, interruption + unknown event swallow, non-JSON frames ignored, post-open socket error drain, socket close drain, and `receiveAudio` timeout.
- **#3: Default LLM identifier was inlined in `eleven-labs-voice-agent.ts`, violating `voice-models.ts`'s self-declared single-source-of-truth contract.** Hoisted `COMPOSABLE_VOICE_LLM_MODEL` + `ELEVENLABS_DEFAULT_VOICE_ID` + `ELEVENLABS_TTS_MODEL` + `ELEVENLABS_STT_MODEL` into `voice-models.ts` (Python parity: `python/scenario/config/voice_models.py`). Adapters now import from there.
- **#6: `receiveAudio` referenced `waiter` from inside the timer body before its `const` declaration.** Worked by event-loop ordering; fragile to refactor. Forward-declared `let timer` and put `waiter` ahead of the `setTimeout` so the dependency graph is explicit.

Tests: 411 / 22 files passing (previously 397 / 22; +14 wire-protocol tests). Build: tsup CJS + ESM + DTS clean.

Deferred (intentional, tracked in PR body):

- #4/#5: inline `pcm16ToWavBytes` + `synthesize` helpers; duplicate-by-design with PR2 (#513); merge-order constraint.
- #7: `turnOutputEmitted` latch contract with PR3 executor; surface in PR3 review.
- #8: distinguish natural end-of-turn from socket close; design-level, needs PR3 design conversation.
- #9: `featurePath()` helper; extract once a 3rd test file would duplicate the climb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


…ig (keystone)

New voice/config.ts (EDR §0.1 Tier 1 + ADR-002). The keystone of the per-run state model; replaces both the STT module-global (Gap #1) and configure({stt}) (Gap #2):

- VoiceConfig { stt?: STTProvider | SttConfig; tts?: TtsConfig; defaultAudioFormat?; audioPlayback?; include{Audio,Timeline,Traces}? }
- SttConfig { model; language?; apiKey? }, TtsConfig { voice; format?; apiKey? }
- ResolvedVoiceConfig: stt always a concrete provider; the resolved per-run object
- resolveVoiceConfig(optionLevel, scenarioLevel, defaults?): two-tier merge with the RunOptions.voice override in front of ScenarioConfig.voice, then pure defaults; `stt` resolves `options?.voice?.stt ?? cfg.voice?.stt ?? new OpenAISTTProvider()` (the default provider constructed per-run: pure default, not shared state).
- DEFAULT_STT_MODEL, DEFAULT_AUDIO_FORMAT ("pcm16", the AI-SDK file part per §4.2).

stt accepts an STTProvider instance (BYO) or an SttConfig descriptor (routed via resolveSttProvider). AudioFormat is a string union (nothing consumes a richer record yet; AudioChunk fixes 24kHz mono).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
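The precedence described for `resolveVoiceConfig` (run-level override, then scenario-level value, then a freshly constructed default) can be sketched in a language-neutral way. The Python below is an illustration only: every class and function name is a hypothetical stand-in for the TypeScript types named in the commit, and the placeholder model string is not the real `DEFAULT_STT_MODEL`.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class SttProviderSketch:
    model: str = "placeholder-stt-model"  # illustrative value only


@dataclass
class VoiceConfigSketch:
    stt: Optional[SttProviderSketch] = None
    default_audio_format: Optional[str] = None


def first_set(*candidates: Any) -> Any:
    """Return the first candidate that is not None (mirrors the `??` chain)."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return None


def resolve_voice_config(
    run_level: Optional[VoiceConfigSketch],
    scenario_level: Optional[VoiceConfigSketch],
) -> VoiceConfigSketch:
    run = run_level or VoiceConfigSketch()
    scenario = scenario_level or VoiceConfigSketch()
    stt = first_set(run.stt, scenario.stt)
    if stt is None:
        # Build the default per call so no provider instance is shared between runs.
        stt = SttProviderSketch()
    audio_format = first_set(run.default_audio_format, scenario.default_audio_format, "pcm16")
    return VoiceConfigSketch(stt=stt, default_audio_format=audio_format)
```

Constructing the default inside the function, rather than storing one at module level, is what keeps the fallback a pure default instead of shared state.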

@ts-expect-error

…gure() for global exec

Per EDR §0.1 + ADR-002 + PRD §4.7:

- config/configure.ts: removed the invented `configure({ stt })` provider knob (present in no other PR, not in Python). `configure()` now carries only global *execution* settings: `audioPlayback` (PRD §4.7: stream conversation audio to local speakers). Stored in a module record read by the runner; getGlobalSettings() exposes it. (audioPlayback is a genuine global UX toggle, not per-run provider state; the ADR-001 concern is provider/model state flowing into call(), which this is not.)
- configure.test.ts: rewritten to test the audioPlayback surface + a @ts-expect-error asserting `stt` is no longer accepted.
- index.ts: updated the stale `configure({ stt })` comment; configure export stays.

Provider config is per-run via run({ voice: { stt, tts } }), not global.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… cascades Gaps #1/#2/#3/#7 + host wiring done; #4 verified intact. Final tsc/test state, remaining 29 SALVAGE markers, Tier B/C cascades (twilio-shared as critical-path blocker, composable de-dup now owed), and intentional EDR deviations recorded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
