feat(typescript-sdk/#372): voice TTS + STT plumbing (PR2 of N) by drewdrewthis · Pull Request #513 · langwatch/scenario

drewdrewthis · 2026-05-21T14:43:37Z

Summary

PR2 of N for #372 — ports python/scenario/voice/{tts,stt,_transcribe}.py to TypeScript and exposes scenario.configure({ stt }) for swapping the default STT provider.

Builds on PR1 (#511 — types-only contract surface). No adapter runtime, no VAD, no simulator/judge wiring — those land in PR3+.

Scope

javascript/src/voice/tts.ts — synthesize(text, voice, effectFn?) with LRU(64) keyed on sha256(text)+voice. Effects apply after cache hit per the locked decision; raw text never reaches the cache payload. registerTtsProvider({ prefix, synth }) for custom backends. Default OpenAI provider lazy-imports the SDK so users with custom providers don't need an OPENAI_API_KEY.
javascript/src/voice/stt.ts — STTProvider interface (transcribe(audio: AudioChunk): Promise<string> only), OpenAISTTProvider (default = gpt-4o-transcribe) with 25-minute chunking, ElevenLabsSTTProvider, setSttProvider / getSttProvider. Pure-TS pcm16ToWav encoder — no audioop/ffmpeg dep for transcription.
javascript/src/voice/transcribe.ts — transcribeSegments, post-hoc, idempotent per-segment, degrades gracefully when no provider is configured.
javascript/src/config/configure.ts — scenario.configure({ stt }) entry point. Wired into the top-level scenario object so scenario.configure({ stt: new MyProvider() }) works as in Python.
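The minimal provider interface and the 25-minute chunking rule described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the `AudioChunk` shape, the `splitAudio` helper, the `ChunkingSTTProvider` wrapper, and joining sub-transcripts with a single space are all hypothetical.

```typescript
// Hypothetical AudioChunk shape; the PR's real type is not shown on this page.
interface AudioChunk {
  pcm: Int16Array; // mono PCM16 samples
  sampleRate: number; // samples per second
}

// The interface from the PR description: a single transcribe method.
interface STTProvider {
  transcribe(audio: AudioChunk): Promise<string>;
}

const MAX_CHUNK_SECONDS = 25 * 60;

// Split audio into consecutive pieces no longer than maxSeconds each.
function splitAudio(audio: AudioChunk, maxSeconds: number = MAX_CHUNK_SECONDS): AudioChunk[] {
  const maxSamples = Math.floor(maxSeconds * audio.sampleRate);
  if (maxSamples <= 0) throw new RangeError("maxSeconds is too small for this sample rate");
  if (audio.pcm.length <= maxSamples) return [audio];
  const parts: AudioChunk[] = [];
  for (let start = 0; start < audio.pcm.length; start += maxSamples) {
    // subarray creates a view, so no sample data is copied.
    parts.push({ pcm: audio.pcm.subarray(start, start + maxSamples), sampleRate: audio.sampleRate });
  }
  return parts;
}

// Hypothetical wrapper: transcribes each piece in order and concatenates the text.
class ChunkingSTTProvider implements STTProvider {
  private readonly inner: STTProvider;
  private readonly maxSeconds: number;

  constructor(inner: STTProvider, maxSeconds: number = MAX_CHUNK_SECONDS) {
    this.inner = inner;
    this.maxSeconds = maxSeconds;
  }

  async transcribe(audio: AudioChunk): Promise<string> {
    const texts: string[] = [];
    for (const part of splitAudio(audio, this.maxSeconds)) {
      texts.push(await this.inner.transcribe(part));
    }
    // Joining with a space is an assumption; the PR only says transcripts are concatenated.
    return texts.filter((t) => t.length > 0).join(" ");
  }
}
```

Because the interface exposes only `transcribe`, any backend can be wrapped this way without leaking vendor types to callers.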

Acceptance checks

pnpm -C javascript build green — commit 05d549d, dist outputs CJS+ESM+DTS without errors (228 KB CJS bundle, 9.5s DTS).
pnpm -C javascript test green — 397 tests passed across 25 files (3.35s). Voice + config subset: 37 tests across 5 files (502 ms).
Cache key is sha256(text)+voice; effects-after-cache invariant proven by test — tts.test.ts "applies effectFn AFTER cache hit" runs synthesis once and re-reads from cache for two different effects, then asserts the third effect output equals original.reverse() (NOT boosted.reverse()), proving effects never bake into stored audio.
scenario.configure({ stt }) accepts a custom provider and getSttProvider() returns it — configure.test.ts covers swap, null-clear, and no-op behavior.
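The effects-after-cache invariant checked above can be illustrated with a small sketch. It assumes a `Map`-based LRU and takes the synthesis backend as an injected `synth` parameter, which differs from the PR's `synthesize(text, voice, effectFn?)` signature and provider registry; the names `cacheKey`, `SynthFn`, and `CACHE_LIMIT` are hypothetical.

```typescript
import { createHash } from "node:crypto";

type Pcm = Int16Array;
type EffectFn = (pcm: Pcm) => Pcm;
// Hypothetical backend signature, injected here to keep the sketch self-contained.
type SynthFn = (text: string, voice: string) => Promise<Pcm>;

const CACHE_LIMIT = 64;
// A Map iterates in insertion order, so the first key is the least recently used.
const cache = new Map<string, Pcm>();

// Only the hash of the text enters the key, so raw text is never stored.
function cacheKey(text: string, voice: string): string {
  const digest = createHash("sha256").update(text, "utf8").digest("hex");
  return `${digest}:${voice}`;
}

async function synthesize(text: string, voice: string, synth: SynthFn, effectFn?: EffectFn): Promise<Pcm> {
  const key = cacheKey(text, voice);
  let raw = cache.get(key);
  if (raw !== undefined) {
    cache.delete(key); // re-inserted below to mark it most recently used
  } else {
    raw = await synth(text, voice);
    if (cache.size >= CACHE_LIMIT) {
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
  }
  cache.set(key, raw);
  // Hand out a copy so in-place effects (e.g. reverse) cannot alter the cached audio.
  const copy = raw.slice();
  return effectFn ? effectFn(copy) : copy;
}
```

Applying the effect to a copy after the lookup is what keeps the stored audio effect-free: a later call with a different effect starts from the original samples.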

Bound feature scenarios (`python/specs/voice-agents.feature`)

| Scenario | Lines | Test file |
| --- | --- | --- |
| TTS cache key is (text, voice) only and effects apply after cache hit | 172-187 | `voice/__tests__/tts.test.ts` |
| Default STT provider is OpenAI gpt-4o-transcribe | 720-726 | `voice/__tests__/stt.test.ts` |
| Users swap STT providers via scenario.configure | 727-733 | `voice/__tests__/stt.test.ts` + `config/__tests__/configure.test.ts` |
| STT provider interface is minimal and provider-agnostic | 734-739 | `voice/__tests__/stt.test.ts` |
| Transcription chunks audio longer than 25 minutes | 740-748 | `voice/__tests__/stt.test.ts` |
| transcribe_segments fills missing transcripts in place | 857-865 | `voice/__tests__/transcribe.test.ts` |
| missing STT provider degrades gracefully | 872-878 | `voice/__tests__/transcribe.test.ts` |

Hard rules respected

No python/ changes.
No adapter runtime, VAD, transports, simulator/judge wiring, effects module, recording behavior, or script steps. Those are PR3+.
Draft until lead flips after verification.

Test plan

Local pnpm -C javascript test green (397/397).
Local pnpm -C javascript build green (CJS + ESM + DTS).
CI green on javascript-ci.

🤖 Generated with Claude Code

…trofit PR-A) (#517) * feat(test): bind PR #511 voice scenarios to specs/voice-agents.feature via vitest-cucumber Retrofits voice-contract-surface.test.ts so the 5 scenarios it claims to test are actually loaded from specs/voice-agents.feature and executed by the test runner, rather than only paraphrased in describe() names. Adds @amiceli/vitest-cucumber@^6.5.0 as a dev dep. Peer-dep matches the repo's pinned vitest ^4.0.14. Refs #516 (spec-binding retrofit for TS voice parity slice plan #372). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(test): rename @pr1-binding → @ts-bound for durability Tag was encoding PR ordinals; once the slice plan completes the @prN-binding tags would persist with no remaining semantics. @ts-bound describes the invariant — this scenario has a TypeScript test binding — and survives the slice plan. PR-B (#513) and PR-C (#515) will pick up the same tag rather than @pr2-binding / @pr3-binding. Reviewer convergence: hygiene, principles, synthesizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ports python/scenario/voice/{tts,stt,_transcribe}.py to TypeScript and exposes scenario.configure({ stt }) for swapping the default STT provider.

- voice/tts.ts: synthesize(text, voice, effectFn?) + LRU(64) keyed on sha256(text)+voice. Effects apply AFTER cache hit per the locked decision; raw text never reaches the cache payload.
- voice/stt.ts: STTProvider interface, OpenAISTTProvider default (gpt-4o-transcribe) with 25-minute chunking, ElevenLabsSTTProvider, setSttProvider / getSttProvider for swap. Pure-TS pcm16-to-wav encoder (no transcription-only ffmpeg dep).
- voice/transcribe.ts: transcribeSegments is post-hoc, idempotent per-segment, and degrades gracefully when no provider is configured.
- config/configure.ts: scenario.configure({ stt }) entry point.

Tests in follow-up commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
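A pure-TypeScript PCM16-to-WAV encoder of the kind mentioned above needs only a 44-byte RIFF header in front of the samples. The sketch below is a generic encoder following the standard WAV layout, not the PR's `pcm16ToWav`; its signature and the mono default are assumptions.

```typescript
// Wraps little-endian PCM16 samples in a canonical 44-byte WAV (RIFF) header.
function pcm16ToWav(samples: Int16Array, sampleRate: number, channels: number = 1): Uint8Array {
  const bytesPerSample = 2;
  const dataSize = samples.length * bytesPerSample;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk body
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Samples are written explicitly as little-endian, independent of host byte order.
  for (let i = 0; i < samples.length; i++) {
    view.setInt16(44 + i * bytesPerSample, samples[i], true);
  }
  return new Uint8Array(buffer);
}
```

Since the output is a plain `Uint8Array`, it can be handed to an HTTP upload without shelling out to ffmpeg.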

- tts.test.ts: cache key is (sha256(text), voice); effects apply AFTER cache hit (third call with new effect reads ORIGINAL cached PCM, not effect-baked bytes).
- stt.test.ts: default model = gpt-4o-transcribe; provider swap via setSttProvider; STTProvider interface minimal (no OpenAI types leak); >25-min audio splits into sub-chunks with concatenated transcripts.
- transcribe.test.ts: transcribeSegments fills missing transcripts in place, skips already-filled segments; missing STT degrades gracefully with a warning and never raises.
- configure.test.ts: scenario.configure({ stt }) round-trips a custom provider; null clears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
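The fill-in-place and graceful-degradation behavior that transcribe.test.ts covers might look roughly like the sketch below. The `Segment` shape, the explicit `provider` and `warn` parameters, and the warning text are assumptions made for illustration; the PR's `transcribeSegments` may resolve its provider differently.

```typescript
// Hypothetical segment shape: audio plus an optional transcript.
interface Segment {
  audio: Int16Array;
  transcript?: string;
}

interface STTProvider {
  transcribe(audio: Int16Array): Promise<string>;
}

async function transcribeSegments(
  segments: Segment[],
  provider: STTProvider | null,
  warn: (message: string) => void = console.warn,
): Promise<Segment[]> {
  if (provider === null) {
    // Degrade gracefully: warn once and leave the segments untouched.
    warn("No STT provider configured; skipping transcription.");
    return segments;
  }
  for (const segment of segments) {
    // Idempotent per segment: anything already transcribed is skipped.
    if (segment.transcript !== undefined) continue;
    segment.transcript = await provider.transcribe(segment.audio);
  }
  return segments;
}
```

Running it twice over the same array calls the provider only for segments that were still empty, which is what makes the post-hoc pass safe to repeat.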

…cucumber

Retrofits PR #513's hand-rolled tests so the 7 scenarios they claim to cover actually load and execute against specs/voice-agents.feature via @amiceli/vitest-cucumber, matching the pattern landed by #517.

Scenarios tagged @ts-tts, @ts-stt, @ts-transcribe (domain-specific sub-tags alongside @Unit) so each test file's includeTags filter targets exactly the scenarios it owns without disturbing voice-contract-surface.test.ts (which uses @ts-bound for the original 5 scenarios from PR1).

- tts.test.ts: loadFeature + describeFeature({ includeTags: ["ts-tts"] }) binding "TTS cache key is (text, voice) only and effects apply after cache hit"
- stt.test.ts: loadFeature + describeFeature({ includeTags: ["ts-stt"] }) binding 4 STT scenarios: default gpt-4o-transcribe, provider swap, minimal interface, >25-min chunking
- transcribe.test.ts: loadFeature + describeFeature({ includeTags: ["ts-transcribe"] }) binding transcribe_segments fills-in-place + missing STT degrades gracefully

Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #372 (slice plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drewdrewthis · 2026-05-21T18:08:04Z

No description provided.

Two /review must-fixes:

1. transcribe.test.ts had `void transcribeSegments(...).then(expect...)` inside a synchronous Then callback. The promise resolved after the step completed, so any assertion failure was silently swallowed by vitest. Made the Then async and awaited the call directly.
2. Doc-comment headers in stt/tts/transcribe.test.ts incorrectly cited `@ts-bound`. Updated to cite each file's actual tag (`@ts-stt`, `@ts-tts`, `@ts-transcribe`) so the next reader doesn't get misled. Note: transcribe.test.ts header already said `@ts-transcribe` correctly; only stt.test.ts and tts.test.ts needed updating.

Reviewer convergence (3x on #1, 2x on #2): test + principles + hygiene + principles.

Refs #516, #517, #513.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
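The first must-fix is easier to see with a toy step runner. The sketch below is not vitest or vitest-cucumber; `runStep`, `failingAssertion`, and `lateFailures` are hypothetical names used only to show why a floating promise hides a failure from whoever is awaiting the step.

```typescript
type Step = () => void | Promise<void>;

// Toy runner: a step passes unless it throws or returns a rejecting promise.
async function runStep(step: Step): Promise<"passed" | "failed"> {
  try {
    await step();
    return "passed";
  } catch {
    return "failed";
  }
}

// Stands in for an assertion that fails only after some async work.
async function failingAssertion(): Promise<void> {
  await Promise.resolve();
  throw new Error("expected transcript to be filled");
}

// Failures that surface after the runner has already moved on.
const lateFailures: unknown[] = [];

// Anti-pattern: the promise is detached, so the runner never observes the rejection.
const floatingStep: Step = () => {
  void failingAssertion().catch((err) => {
    lateFailures.push(err);
  });
};

// Fix: make the step async and await the call so the rejection reaches the runner.
const awaitedStep: Step = async () => {
  await failingAssertion();
};
```

With these definitions, `runStep(floatingStep)` resolves to `"passed"` even though the assertion later fails, while `runStep(awaitedStep)` resolves to `"failed"`.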

drewdrewthis · 2026-05-21T18:13:27Z

No description provided.

drewdrewthis · 2026-05-21T18:13:29Z

No description provided.

… vitest-cucumber

Retrofits PR #515's hand-rolled tests for adapter lifecycle, hooks, and VAD fallback to actually load and execute specs/voice-agents.feature via @amiceli/vitest-cucumber, matching the pattern landed by #517 and #513.

Tags by test file (per-file tagging needed because vitest-cucumber v6 fails the suite for scenarios that match a file's includeTags but aren't bound in that file):

- @ts-adapter: connect/disconnect fires per-scenario
- @ts-hooks: on_audio_chunk and on_voice_event fire
- @ts-vad: VAD fallback / native-VAD does not trigger / one-shot warning

Key implementation note: vitest-cucumber v6 runs each Given/When/Then step as a separate vitest it(). Module-level beforeEach/afterEach hooks fire around each step, not around the whole scenario. For scenarios that need to assert on console.warn calls across step boundaries, the spy is installed locally within the When step and captured warn messages are carried via closure-scoped variables into Then/And, avoiding the floating-promise and spy-reset antipatterns.

Refs #516 (spec-binding retrofit), #517 (PR-A, merged), #513 (PR-B, ready for review), #372 (slice plan).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three /review must-fixes:

1. vad-fallback.test.ts: replaced the closure-capture spy pattern with the library's BeforeEachScenario/AfterEachScenario hooks. The coder's earlier workaround was based on the false belief that vitest-cucumber lacked scenario-level lifecycle hooks. The hooks exist (verified at @amiceli/vitest-cucumber 6.5.0 describe-feature.js:311-322). BeforeEachScenario fires via beforeAll inside the scenario describe block: once per scenario, not per step. Spy is shared; capturedWarnCalls accumulates across steps within the same scenario. Removed ~28 lines of SPY STRATEGY prose comments.
2. hooks.test.ts: extracted the "throwing hook doesn't break scenario" check from inside the on_voice_event scenario's When step. It was asserting behavior the bound feature scenario didn't claim. Now a plain it() block outside describeFeature. Option (a) chosen: no spec scenario exists for this behavior in voice-agents.feature.
3. adapter-lifecycle.test.ts: split 5 sub-cases out of one packed And step. Kept only the happy-path disconnect assertion in the bound And step (disconnect fires once on success). Lifted fail/throw/multi-adapter/disconnect-swallow to 4 plain it() blocks. Option (b) chosen: specs/voice-agents.feature line 143 names the And step as a single AC ("regardless of pass/fail/exception"); the 4 sub-cases are implementation-level guarantees not individually specced.

Reviewer convergence: principles + test (3x).

Refs #516, #517, #513, #515.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

drewdrewthis · 2026-05-26T11:20:33Z

+}
+
+/** The global STT provider — defaults to {@link OpenAISTTProvider}. */
+let provider: STTProvider | null = new OpenAISTTProvider();


We shouldn't have a global provider like this; it seems like a mistake. Don't we have per-scenario state? What happens if two scenarios with two different providers run in parallel? Global state and global objects are a smell, in my opinion.
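One way to address this concern is to carry the provider on each run's options instead of in a module-level variable. The sketch below is hypothetical: the `run` function, its parameters, and the `RunOptions` shape are invented for illustration and are not the SDK's actual `scenario.run` signature.

```typescript
interface STTProvider {
  transcribe(audio: Int16Array): Promise<string>;
}

// Hypothetical per-run options: each run carries its own provider.
interface RunOptions {
  voice?: { stt?: STTProvider };
}

// Hypothetical runner; it reads the provider from its own options only.
async function run(name: string, options: RunOptions, audio: Int16Array): Promise<string> {
  const stt = options.voice?.stt;
  if (stt === undefined) return `${name}: (no transcript)`;
  return `${name}: ${await stt.transcribe(audio)}`;
}

// Two stand-in providers that would conflict if stored in a single global slot.
const providerA: STTProvider = { transcribe: async () => "from A" };
const providerB: STTProvider = { transcribe: async () => "from B" };

// Parallel runs stay isolated because nothing is shared between them.
const parallelResults = Promise.all([
  run("scenario-1", { voice: { stt: providerA } }, new Int16Array(0)),
  run("scenario-2", { voice: { stt: providerB } }, new Int16Array(0)),
]);
```

With a module-global slot, whichever scenario configured its provider last would win for both runs; passing the provider per run removes that ordering dependency.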

drewdrewthis · 2026-05-27T08:49:15Z

Superseded by the consolidated TypeScript voice stack: #561 (voice/372-refactor → main).

Per the EDR (#560), the TS voice work was sliced into flat-sibling PRs that each forked one point off the integration branch, so no slice saw the others' contracts — producing the drift #560 documents (3 adapter.ts forks, divergent STTProvider/synthesize, a module-global STT provider violating ADR-001, an invented configure({stt}), a live createAudioMessage format mismatch). We rebuilt one clean stack against main. This PR's TTS + STT plumbing code was salvaged into #561 — reviewed and carried forward, not discarded. See #560 §0.1 and the epic #370.

…udio messages (PR4 of N) Ports the python voice path for simulator and judge to TypeScript: - javascript/src/voice/messages.ts: createAudioMessage/extractAudio/ messageHasAudio helpers using the local AudioMessageParam type. No openai package import — uses messages.types.ts (Decision 2(b)). - javascript/src/agents/user-simulator-agent.ts: voice config triggers audio-message emission; per-step voice + per-step audio_effects + persona composition. stripAudioContent keeps LLM calls text-only. - javascript/src/agents/judge/judge-agent.ts: JudgeAgent exported as class with static conversationHasAudio; effectiveIncludeAudio/Timeline/Traces helpers; auto-detect multimodal model via model name substrings; include_audio=false escape hatch. 13 scenarios bound to specs/voice-agents.feature via vitest-cucumber: - 5 simulator scenarios (@ts-simulator) - 7 judge scenarios (@ts-judge) - 1 assistant-role scenario (@ts-assistant-role) Tag convention: per-subject (@ts-simulator / @ts-judge / @ts-assistant-role) instead of @ts-bound to avoid colliding with PR1's voice-contract-surface test (which uses includeTags: ["ts-bound"] and would over-match new scenarios). Per-file tagging is established by #513/#515; tag-convention decision tracked at #523. Refs #372 (slice plan), #517 (PR1 infra, merged), #513 (PR2, ready), Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…anded (PR7 of N) PR7 of issue #372 — the first real voice transport. Ports three Python adapters to TS and binds 7 scenarios in `specs/voice-agents.feature`. What lands: - `javascript/src/voice/adapters/elevenlabs.ts` — `ElevenLabsAgentAdapter`, the hosted ConvAI adapter. Connects to `wss://api.elevenlabs.io/v1/convai/conversation` via the `ws` package; PCM16/24kHz base64-over-JSON; full event handling (audio, ping, transcript, correction, init-metadata, interruption). Mirrors `python/scenario/voice/adapters/elevenlabs.py`. - `javascript/src/voice/adapters/composable.ts` — `ComposableVoiceAgent` + `STTProvider` interface + `ElevenLabsSTTProvider` + inline `synthesize` helper (elevenlabs/ provider only — PR2 #513 supplies the rest). LLM is any ai-sdk `LanguageModel`. Mirrors `python/scenario/voice/adapters/composable.py`. - `javascript/src/voice/adapters/eleven-labs-voice-agent.ts` — `ElevenLabsVoiceAgent`, the branded preset. Provider-typed options; defaults to `ElevenLabsSTTProvider` + `openai("gpt-5.4-mini")` + `elevenlabs/EXAVITQu4vr4xnSDxMaL` (Sarah — free-tier premade); each piece independently overridable. `eleven_v3` TTS model hardcoded for paralinguistic-marker support (per Python tts.py:107 comment). Tests: - `javascript/src/voice/adapters/__tests__/elevenlabs.test.ts` — 5 unit scenarios bound via `describeFeature(..., { includeTags: [["unit", "ts-elevenlabs"]] })`. - `javascript/examples/vitest/tests/voice/elevenlabs-hosted.test.ts` — 2 e2e scenarios env-gated on `ELEVENLABS_API_KEY` (+ `ELEVENLABS_AGENT_ID` for the hosted demo). Without keys, the suite cleanly skips. Tag convention: `@ts-elevenlabs` (per-subject) rather than `@ts-bound` — per the precedent from PRs #517 / #528 (`@ts-simulator`, `@ts-judge`, `@ts-assistant-role`), per-subject tags avoid the `checkUncalledScenario` collision with PR1's contract-surface test. See #523 for the tag-convention decision. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
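The base64-over-JSON PCM16 transport mentioned above implies a decode step on every audio event. The sketch below shows one plausible decoder; it assumes little-endian samples and interprets "odd-byte trim" as dropping a trailing unpaired byte. The function name `decodeBase64Pcm16` is hypothetical and the adapter's real implementation is not shown on this page.

```typescript
// Decodes a base64 string into signed 16-bit samples (little-endian assumed).
function decodeBase64Pcm16(b64: string): Int16Array {
  // atob yields a "binary string" with one character per byte.
  const binary = atob(b64);
  // A PCM16 sample needs two bytes, so a trailing odd byte is dropped.
  const usableBytes = binary.length - (binary.length % 2);
  const samples = new Int16Array(usableBytes / 2);
  for (let i = 0; i < samples.length; i++) {
    const lo = binary.charCodeAt(2 * i);
    const hi = binary.charCodeAt(2 * i + 1);
    const unsigned = (hi << 8) | lo;
    // Convert from unsigned 16-bit to signed two's complement.
    samples[i] = unsigned >= 0x8000 ? unsigned - 0x10000 : unsigned;
  }
  return samples;
}
```

Trimming rather than throwing keeps a single malformed frame from aborting the whole conversation, at the cost of silently losing one byte.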

…rotocol tests

Review pass on PR #536 surfaced four actionable concerns. Addressed:

- **#1 (blocking): `connect()` left WS without `error`/`close` handlers after `onOpen` called `removeAllListeners()`.** An unhandled `error` on a Node EventEmitter crashes the process. Re-attach `message` + `error` + `close` listeners atomically post-open. The new `error` handler nulls `this.ws` so subsequent `sendAudio`/`receiveAudio` fail fast instead of writing to a dead socket. Pending receivers drain to empty `AudioChunk` so the executor unwinds rather than hanging.
- **#2 (blocking): `onMessage` branches were untested.** Added 14 wire-protocol unit tests (plain vitest, not cucumber-bound) covering: base64 PCM16 decode, odd-byte trim invariant, audio queue/waiter FIFO, ping → pong with `event_id`, ping defensive (no `event_id` skip), `user_transcript` capture, `agent_response` capture, `agent_response_correction` override, format-drift warning, interruption + unknown event swallow, non-JSON frames ignored, post-open socket error drain, socket close drain, and `receiveAudio` timeout.
- **#3: Default LLM identifier was inlined in `eleven-labs-voice-agent.ts`, violating `voice-models.ts`'s self-declared single-source-of-truth contract.** Hoisted `COMPOSABLE_VOICE_LLM_MODEL` + `ELEVENLABS_DEFAULT_VOICE_ID` + `ELEVENLABS_TTS_MODEL` + `ELEVENLABS_STT_MODEL` into `voice-models.ts` (Python parity: `python/scenario/config/voice_models.py`). Adapters now import from there.
- **#6: `receiveAudio` referenced `waiter` from inside the timer body before its `const` declaration.** Worked by event-loop ordering; fragile to refactor. Forward-declared `let timer` and put `waiter` ahead of the `setTimeout` so the dependency graph is explicit.

Tests: 411 / 22 files passing (previously 397 / 22; +14 wire-protocol tests). Build: tsup CJS + ESM + DTS clean.

Deferred (intentional, tracked in PR body):

- #4/#5: inline `pcm16ToWavBytes` + `synthesize` helpers are duplicate-by-design with PR2 (#513); merge-order constraint.
- #7: `turnOutputEmitted` latch contract with PR3 executor; surface in PR3 review.
- #8: distinguish natural end-of-turn from socket close; design-level, needs PR3 design conversation.
- #9: `featurePath()` helper; extract once a 3rd test file would duplicate the climb.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
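The ordering fix in #6 and the drain behavior in #1 can be sketched together as a small queue with waiters. This is an illustration only: the `AudioInbox` class, resolving a timed-out receive with an empty chunk, and the method names are assumptions, not the adapter's actual code.

```typescript
type Chunk = Int16Array;

// Hypothetical inbox: chunks queue up until a receiver asks for them.
class AudioInbox {
  private queue: Chunk[] = [];
  private waiters: Array<(chunk: Chunk) => void> = [];

  // Deliver to the oldest waiting receiver, or queue the chunk (FIFO either way).
  push(chunk: Chunk): void {
    const waiter = this.waiters.shift();
    if (waiter !== undefined) waiter(chunk);
    else this.queue.push(chunk);
  }

  // On socket error or close: release every pending receiver with an empty chunk.
  drain(): void {
    for (const waiter of this.waiters.splice(0)) waiter(new Int16Array(0));
  }

  receive(timeoutMs: number): Promise<Chunk> {
    const queued = this.queue.shift();
    if (queued !== undefined) return Promise.resolve(queued);
    return new Promise<Chunk>((resolve) => {
      // Forward-declare the timer so the waiter can refer to it safely.
      let timer: ReturnType<typeof setTimeout> | undefined;
      const waiter = (chunk: Chunk): void => {
        if (timer !== undefined) clearTimeout(timer);
        resolve(chunk);
      };
      this.waiters.push(waiter);
      // The waiter is declared before the timer body that removes it.
      timer = setTimeout(() => {
        const index = this.waiters.indexOf(waiter);
        if (index >= 0) this.waiters.splice(index, 1);
        resolve(new Int16Array(0)); // assumption: a timeout yields an empty chunk
      }, timeoutMs);
    });
  }
}
```

Declaring `waiter` before the `setTimeout` call means the code no longer depends on the timer firing after the declaration has run, which is the fragility the review pointed out.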

@todo

- Add @ts-elevenlabs to the bare-@Unit 'ElevenLabsAgentAdapter connects to conversational AI endpoint' scenario so the AND-filter in elevenlabs.test.ts binds it (was skipped: 4 steps). EL suite now 35/0 skipped.
- Rewrite 'the judge requests a transcript' → 'the audio is auto-transcribed and the judge receives text' (§7.3: no such judge tool; STT is upstream).
- Rewrite scenario.configure(stt=...) step strings → run({ voice: { stt } }) (§7.5: the removed invented API; per-run carrier per ADR-002). Updated the matching elevenlabs.test.ts step binding string.
- Strip 'PR2 of #372' / 'PR2 / #513' PR-reference comments from transcribe/tts/user-simulator-voice test headers + the spec @todo (§7.6). Refreshed the voiceStyle @todo to note the plumbing is now wired (audible effect pending).

SALVAGE markers stay at 0. All affected suites green (no broken bindings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
