feat(typescript-sdk): voice agent testing — consolidated clean stack by drewdrewthis · Pull Request #561 · langwatch/scenario

drewdrewthis · 2026-05-27T08:20:05Z

Why

Closes #372 (epic #370). The TypeScript voice-agent-testing port had fragmented into 10 flat-sibling PRs (#513/#515/#528/#534–540) carrying pre-EDR design drift. This replaces all of them with one clean stack rebuilt against main, conforming to the EDR (#560 / ADR-002) and the decided public API (PRD). Python is the reference implementation; this brings TS to parity.

Acceptance Criteria

The behavioral contract is specs/voice-agents.feature — 127 scenarios ported from the Python source-of-truth (79 @unit, 14 @integration, 39 @e2e, 2 @todo). Collapsed to the headline ACs below, each ticked only with evidence on this branch. "Done" = (1) the ci-checks gate (units + build + non-bot @ts-e2e) green, (2) the committed demo recordings present, and (3) the app-side ACs below (AC13–18) verified — a voice run correctly ingested, queryable, and rendered in the langwatch app. SDK-green alone is necessary but not sufficient.

Status legend: ✅ = evidence verified in-tree/observed · ⏳ = implemented + author-reported green locally, not yet confirmed by CI · ◻️ = open. (2026-06-04: the unit suite now runs in CI — ci-checks (24.x) green on run 26960103187 after the rebase + lockfile fix; formerly-⏳ unit-backed rows are CI-confirmed.)

Binding status: 29/127 scenarios carry @ts-bound (wired to an executable TS test). The remaining 98 are contract-level (ported from the Python spec), covered indirectly rather than by 1:1 bindings. The full suite is CI-confirmed at 796 pass / 1 skip on HEAD 390e52c (run 26960103187) — the former lockfile abort is fixed, so unit-backed ACs below are ✅ on real CI evidence. Demo-recording-backed ACs are ✅ (the 16 recordings are committed and present in-tree — verified).

#	Acceptance criterion	Evidence	Status
AC1	Same entrypoint — voice uses `scenario.run()`, text-only scenarios unaffected by voice deps	`voice_text_parity` recording (in-tree ✅) + unit "Existing text-only scenarios unaffected" (✅ CI)	✅
AC2	Per-run provider state (ADR-002) — voice config on `ScenarioConfig.voice`, no module-global `configure()`	`composable_stt_swap` recording (in-tree ✅); STT-swap units (✅ CI)	✅
AC3	Barge-in primitive — `agent({ wait:false })` + `interrupt()` cuts off a reply mid-utterance, marked truncated	`interruption_recovery` recording, judge hard-gated (in-tree ✅)	✅
AC4	Real server-VAD barge-in on Gemini Live + ElevenLabs (mid-stream cut-off)	`gemini_live_interruption` + `elevenlabs_interruption` recordings (in-tree ✅)	✅
AC5	Adapter parity — OpenAI Realtime (agent+user), ElevenLabs (hosted/branded/composable), Gemini Live, Pipecat	`openai_realtime_{agent,user}`, `elevenlabs_{hosted,branded}`, `gemini_live`, `pipecat_{scenario,ws}` recordings (in-tree ✅)	✅
AC6	Audio effects + tonal realism — noise floor, distortion, audible anger	`angry_customer` recording, noise-floor≫silence assertion (in-tree ✅)	✅
AC7	Voice-aware judge — auto-detects audio, transcript fallback for non-multimodal, structured timeline	`@ts-judge` 7-scenario unit suite — CI-confirmed (run 26960103187)	✅
AC8	Capability matrix is a contract — every adapter publishes `capabilities`; `dtmf()` raises `UnsupportedCapabilityError` off-telephony; matrix in docs	`@ts-contract-surface` units (✅ CI) + `adapters/*.mdx` tables (in-tree ✅)	✅
AC9	PCM16 @ 24kHz mono internal format; pluggable STT (default OpenAI `gpt-4o-transcribe`); SDK-side VAD fallback with one-shot warning	`@ts-contract-surface` / `@ts-stt` / `@ts-vad` units — CI-confirmed (same run)	✅
AC10	CI merge gate green — `ci-checks` (units + build + non-bot `@ts-e2e`) passes on HEAD	✅ `ci-checks (24.x)` pass (6m2s, suite executed) + `javascript-complete`/`docs-complete`/`python-complete`/`evaluate` all pass on HEAD `390e52c` — run 26960103187. Fixed by rebase onto main + recording the `@ungap/structured-clone` override in `pnpm-lock.yaml` (`f7273e7`) + 2 stale doc-contract test assertions (`f2cdf58`). Local: 796 pass / 1 skip.	✅
AC11	Telephony transport (Twilio) — TwiML endpoint, signature rejection, tunnel harness, clear-buffer interrupt	`@ts-twilio-proto` + `@ts-twilio-server` integration scenarios bound — unit/integration layers CI-confirmed (same run)	✅
AC12	Remaining platform adapters (LiveKit, Vapi, generic WebRTC/WebSocket) raise `PendingTransportError`	stub adapters + units (✅ CI); full transports deferred to #371	✅

Net (honest read, updated 2026-06-04): all SDK-side ACs (AC1–12) are now ✅. AC1–6 were demo-recording-verified; AC7–9 + AC11 flipped from ⏳ to ✅ when the unit suite executed in CI for the first time on this PR — ci-checks (24.x) pass in 6m2s on HEAD 390e52c (run 26960103187), 796 pass / 1 skip. AC10's lockfile blocker is fixed (rebase onto main + @ungap/structured-clone override recorded in the lockfile + 2 stale doc-contract test assertions aligned).
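
The capability contract in AC8 can be pictured with a small sketch. This is not the SDK's code: `capabilities`, `dtmf()`, `UnsupportedCapabilityError`, and the `interruption` flag are named in this PR, while the `dtmf` flag name, the error's constructor signature, and the `SketchAdapter` class are assumptions made purely for illustration.

```typescript
// Hypothetical capability shape; only `interruption` is named in the PR text.
interface AdapterCapabilities {
  interruption: boolean;
  dtmf: boolean; // assumed field name
}

class UnsupportedCapabilityError extends Error {
  constructor(adapter: string, capability: string) {
    super(`${adapter} does not support ${capability}`);
    this.name = "UnsupportedCapabilityError";
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Hypothetical adapter, standing in for a real transport.
class SketchAdapter {
  readonly name: string;
  readonly capabilities: AdapterCapabilities;

  constructor(name: string, capabilities: AdapterCapabilities) {
    this.name = name;
    this.capabilities = capabilities;
  }

  // Guard first, so an unsupported call fails loudly instead of being dropped.
  dtmf(digits: string): string {
    if (!this.capabilities.dtmf) {
      throw new UnsupportedCapabilityError(this.name, "dtmf");
    }
    return `sent:${digits}`;
  }
}

const telephony = new SketchAdapter("telephony-sketch", { interruption: true, dtmf: true });
const realtime = new SketchAdapter("realtime-sketch", { interruption: true, dtmf: false });
```

Under these assumptions `telephony.dtmf("123")` succeeds and `realtime.dtmf("1")` throws, which is the off-telephony behaviour the AC describes.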

App-side ACs — langwatch ingestion + rendering (SDK-green is necessary but NOT sufficient)

Parity is not done when the SDK suite passes — it's done when a voice scenario.run() is correctly ingested, queryable, and rendered in the langwatch app. These require a live end-to-end run (SDK → langwatch ingest → API/UI), not a unit test. Surface mapped against langwatch/langwatch:

#	App-side acceptance criterion	Concrete check	Status
AC13	Traces received — a voice run's OTel trace (incl. `input_audio` content parts) lands in langwatch	`POST /api/collector` ingests the span; trace appears for the project	✅ verified 2026-06-04 — live TS run `scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA` (`openai_realtime_agent` demo) landed 2 traces in `project_bZspxwkhCD4POvqmIgOr2` with `scenario.run_id` metadata (`sdk: langwatch-observability-sdk typescript 0.16.1`); judge spans carry the audio as file parts: `{"type":"file","mediaType":"audio/pcm16",…~230KB}`
AC14	Traces queryable via API — the received trace is retrievable through the public API	`POST /api/traces/search` `{projectId, filters}` returns the voice run's trace	✅ `POST /api/traces/search` (startDate/endDate window) returned both: `c50549165a184a73d5fb509525230755`, `7c45a3b4edf40ad1466e33afd176b2f5`; `GET /api/trace/:id` returns full span detail (5 spans incl. `OpenAIRealtimeAgentAdapter.call`, `_JudgeAgent.call`)
AC15	Scenario events ingested — `RUN_STARTED` / `MESSAGE_SNAPSHOT` / `RUN_FINISHED` persisted, each message carrying optional `trace_id`	`scenario_events` ES index holds the run's events; queryable via tRPC	✅ run persisted with 4 messages (`MESSAGE_SNAPSHOT`) carrying 8 `input_audio` content parts; the 2 assistant messages carry `trace_id` refs matching the two traces above (cross-linkage proven); `status: SUCCESS` + `results.verdict: success` (3/3 criteria) = `RUN_FINISHED` recorded
AC16	Scenario run visible in the app — the voice run shows in the simulations UI + REST	`GET /api/simulation-runs/:scenarioRunId` returns it; `/[project]/simulations` renders it	✅ `GET /api/simulation-runs/scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA` → HTTP 200 (`demo_openai_realtime_agent`, 41.8s, cost $0.0018); `platformUrl: app.langwatch.ai/scenario-tracing-bZspxw/simulations` (API-verified; in-app visual spot-check needs an authed browser — one click)
AC17	Audio messages rendered — `input_audio` content renders as an inline player, not a raw JSON blob	`<ScenarioMessageRenderer>` → `<MediaPart>` emits `<audio controls>` for the message	✅ shipped on `langwatch@main` via #4058 (the old #3781 is stale/superseded — not a blocker). `MediaPart.tsx` emits `<audio data-testid="media-part-audio" controls>` with `onLoadedData` probing; `visit-content-part.ts` handles `input_audio` parts; integration-tested (`MediaPart.integration.test.tsx`). The verified run's messages carry exactly the shape it renders: `{"type":"input_audio","input_audio":{"mimeType":"audio/pcm16","url":"/api/files/so_…"}}`
AC18	Audio itself verified — the rendered audio actually decodes + plays the correct content	base64 WAV decodes in `<audio>` (`onLoadedData` → ok); externalized blobs resolve via `/api/files/:id`	✅ verified byte-level 2026-06-04 on run `scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA`: `GET /api/files/so_000000000002CR60L0eAWXH3ILtke` → HTTP 200, 249,600 bytes; decodes as valid 5.20s pcm16 @ 24kHz mono (the AC9 contract format); OpenAI STT transcribes it to "Of course! Where would you like to go, or what kind of activities are you interested in?" — the coherent reply to the run's user turn "Hello, can you help me plan a weekend trip?". The user-side part (the SDK's other audio path — user-sim TTS) round-trips verbatim: 3.60s pcm16 → STT → "Hello, can you help me plan a weekend trip?", the exact scripted turn. Decode ✓ format ✓ correct-content ✓ both-paths ✓ serving-route ✓; the in-browser `<audio>` element rendering these same URLs is covered by the integration tests above. Follow-up hardening on-branch: `885d294` WAV-wraps raw pcm16 at the langwatch-bound converter (supersedes `5db36c8`; e2e-verified in its commit: `/api/files` now serves `audio/wav` with a RIFF header, ffprobe-decodable) + `390e52c` aligns the conversion test
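
The duration quoted in AC18 follows directly from the AC9 contract format (PCM16 at 24 kHz mono, so 2 bytes per sample and 48,000 bytes per second). A minimal worked check, where the helper name `pcm16DurationSeconds` is hypothetical and not an SDK function:

```typescript
// AC9 contract format: 16-bit samples, 24 kHz, one channel.
const SAMPLE_RATE_HZ = 24000;
const BYTES_PER_SAMPLE = 2;
const CHANNELS = 1;

// Hypothetical helper: duration of a raw (headerless) PCM16 payload.
function pcm16DurationSeconds(byteLength: number): number {
  return byteLength / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS);
}

// 249,600 bytes is the size reported for the agent reply in AC18.
const agentReplySeconds = pcm16DurationSeconds(249600);
console.log(agentReplySeconds.toFixed(2)); // prints "5.20"
```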

Verification matrix (2026-06-04, all live against production app.langwatch.ai):

Every run below links to the live prod app — open it (project members) and press play on any audio message to hear the run the table describes.

Family / variant	Run	Ingest (AC13–15)	REST (AC16)	Audio bytes (AC18)
OpenAI Realtime (agent)	`scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA`	✅ 2 traces x-linked, 4 msgs / 8 parts	✅ 200, SUCCESS 3/3	✅ agent reply STT-coherent + user TTS verbatim round-trip
Gemini Live + interruption	`scenariorun_3Efeze6fAvYmZd8gQeyVxz8SoOm`	✅ 3 traces, 5 msgs / 5 parts; FAILED verdict persists correctly (2 met / 1 unmet). The unmet criterion is a judge semantic trap, not an adapter fault: the criterion says "over a real Gemini Live session", and the judge read the framework's own `origin: simulation` trace metadata as disqualifying, even though its reasoning confirms the live mid-utterance VAD cut-off. Vitest mechanics assertions: 3/3 pass. Suggested post-merge polish: drop the word "real" from that criterion	✅ 200	✅ truncated segment is its own cut-off transcript (`"I am a large"` / STT `"I am a lar"`, 0.96s)
ElevenLabs hosted	`scenariorun_3EffH3ID8Q7HU5S5anb2c8u4hhy`	✅ 3 msgs / 3 parts	✅ 200, SUCCESS	✅ 1.56s → "Hello, how can I help you today?"
ElevenLabs branded + audioEffects	`scenariorun_3EffKAJl4dXbP1IU41rxCzYyiwT`	✅ 4 msgs / 4 parts	✅ 200, SUCCESS	✅ effects-processed audio STT-clean
Pipecat (mulaw@8000 source)	`scenariorun_3EffZC024s9IrYTOaIoKbQawMpq`	✅ 5 msgs / 5 parts, all normalized `audio/pcm16` — AC9 contract held for a telephony source format	✅ 200, SUCCESS	✅ 3.80s → "Hello, thank you for calling. How can I help you today?"
Twilio (PSTN)	—	not runnable in this environment (needs live telephony + public tunnel); SDK layer CI-confirmed per AC11	—	—

App-side net (final, 2026-06-04): all app-side ACs (AC13–18) are ✅ verified against production across the adapter matrix — 5 live runs / 4 adapter families (matrix above), covering happy-path, interruption/truncation, failed-verdict persistence, audio effects, and a telephony source format. Every leg: SDK → collector → traces/search → simulation-runs REST → /api/files audio bytes, with message↔trace cross-linkage intact and served audio STT-verified as the correct conversational content at the contract format (pcm16/24kHz). The inline player shipped via #4058 (#3781 is stale, not a blocker). Parity board AC1–18: complete. Optional formality: eyeball any run at the simulations page — every layer beneath that click is independently verified.

What changed (decisions)

Per-run provider state, not module-global (ADR-002): voice config rides on ScenarioConfig.voice and reaches the adapters per-run — no global configure(stt=). One AI-SDK file audio format end-to-end; STT/TTS are one-file-per-provider.
agent({ wait: false }) is the barge-in primitive (PRD §4.4): the executor starts the agent's reply without blocking so a user() / interrupt() lands mid-utterance; the transport's native cancel fires and the cut-off segment is marked truncated.
Demo promises are encoded as gates, split by what's verifiable: an LLM judge reading a transcript can't see audio properties, so judge criteria assert the conversational half (empathy, acknowledging the specific request, recovery) and deterministic code assertions assert the audio half (transcriptTruncated + shorter segment for cut-off; noise-floor ≫ silence for mixed ambience). A hollow demo now fails.
Real-key voice demos run via a manual voice-integration workflow, not PR CI — mirroring Python's voice-integration.yml. They cost real API money and can flake, so they never gate a merge; ci-checks (units + build + the non-bot @ts-e2e gate) is the merge gate.
Pre-step interrupt scheduling (maybeScheduleInterruptedAgentTurn, daa357d): ports Python's pre-step pattern; the executor decides to interrupt before the next agent turn begins rather than patching it up post-step. Eliminates the prior hollow post-step path.
Inline-TTS barge-in (fa84c8f): voiceProceed fires the interrupt via voiceifyText while the AGENT is still in-flight, bypassing the user-sim LLM to win the race against fast-streaming bots. delayRange is honoured in this path (d20a49c).
Pipecat adapter buffer-clear on interrupt (557cac2): sets a discardingInboundAudio flag so late bot frames don't contaminate the next agent turn.
VoiceEvent discriminated union (4a49585): five variant interfaces (AgentSpeakingEvent, UserSpeakingEvent, etc.) replace the prior type:string — the compiler now narrows on event.type without casts.
RealtimeUserAgent + VoiceUserSimulator structural interfaces + type guards (3f234f4): kills as unknown as casts in the executor; adapter conformance is checked structurally at compile time.
interruptRng / interruptWaitForSpeechMs renamed to drop leading underscore (50cf3f2): test-seam fields are now @internal JSDoc-tagged rather than visually private — consistent with the project's naming convention.
interruption_recovery judge promoted to hard gate (5bce766): was informative-only; now the scenario fails if the judge says the agent didn't recover. Recording regenerated at be600de.
Gemini Live spurious-pair handling proven by 3 new adapter unit tests (4bd40c6): receiveAudio() deduplicate logic is exercised deterministically without a live connection.
Docs caught up to the shipped API (4 commits, 8eb7f55…c9c0f9a):
- recipes/interrupt.mdx now documents voiceProceed({ interruptions: new InterruptionConfig({...}) }) — the PR's primary random-barge-in API — alongside the explicit agent({ wait: false }) + interrupt() primitive
- adapters/pipecat.mdx capability table corrected (drops opus that the TS adapter doesn't support; source: pipecat.ts:139-140)
- adapters/elevenlabs.mdx adds systemPromptOverride / firstMessageOverride per-session options
- recipes/effects.mdx drops the stale "on the roadmap" callout (Python script.py:user() already accepts voice_style + audio_effects)
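
To illustrate the VoiceEvent decision above, here is a minimal sketch of how a discriminated union lets the compiler narrow on `event.type` without casts. The variant names `AgentSpeakingEvent` and `UserSpeakingEvent` come from this PR; the `type` tag strings, the payload fields, and the third variant are assumptions for illustration (the PR describes five variants, only three are sketched here).

```typescript
// Variant names come from the PR; the `type` tags and payload fields are assumed.
interface AgentSpeakingEvent {
  type: "agent_speaking";
  atMs: number;
}
interface UserSpeakingEvent {
  type: "user_speaking";
  atMs: number;
}
// Hypothetical third variant, added only to show a differing payload.
interface AgentInterruptedEvent {
  type: "agent_interrupted";
  truncatedAtByte: number;
}

type VoiceEvent = AgentSpeakingEvent | UserSpeakingEvent | AgentInterruptedEvent;

function describeEvent(event: VoiceEvent): string {
  switch (event.type) {
    case "agent_speaking":
      return `agent speaking at ${event.atMs}ms`;
    case "user_speaking":
      return `user speaking at ${event.atMs}ms`;
    case "agent_interrupted":
      // Narrowed: `truncatedAtByte` exists only on this variant.
      return `agent cut off at byte ${event.truncatedAtByte}`;
    default: {
      // Exhaustiveness check: stops type-checking if a variant is left unhandled.
      const unreachable: never = event;
      return unreachable;
    }
  }
}

console.log(describeEvent({ type: "agent_interrupted", truncatedAtByte: 46080 }));
// prints "agent cut off at byte 46080"
```

With a plain `type: string` field, each branch would need a cast to reach its payload; with the union, adding a new variant without handling it becomes a compile error at the `never` assignment.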

How it works

Module tree: 45 created + 18 modified source modules (+44 test files) under javascript/src/, each with a single responsibility, followed by how they function together

Created

javascript/src/
├── config/configure.ts ················ global scenario configuration entry point (voice config rides ScenarioConfig.voice per ADR-002 — no module-global)
├── domain/agents/agent-shapes.ts ······ narrow structural interfaces for duck-typed user-simulator capabilities
├── script/voice-steps.ts ·············· voice script steps (sleep/silence/audio/dtmf/interrupt) + voiceProceed pre-step scheduling
└── voice/
    ├── config.ts ······················ per-run VoiceConfig + resolveVoiceConfig — keystone of the voice state model
    ├── adapter.runtime.ts ············· executor-side adapter wiring: connected-state gate, defaultVoiceCall, response draining
    ├── messages.ts ···················· the SOLE AudioChunk ↔ ModelMessage gateway (unified audio-message producer)
    ├── factories.ts ··················· lowercase adapter factories (PRD §9 idiom): openAIRealtimeAgent(), pipecatAgent(), …
    ├── interruption.ts ················ InterruptionConfig for proceed({ interruptions }) — probabilistic barge-in
    ├── vad.ts ························· SDK-side VAD fallback for adapters without native VAD (one-shot warning)
    ├── judge-stt.ts ··················· judge STT pre-pass — auto-transcription seam for non-multimodal judges
    ├── transcribe.ts ·················· post-hoc STT over a VoiceRecording (fills segment transcripts)
    ├── recording.runtime.ts ··········· WAV/MP3 save + segment directory + byte-accurate JSON manifest
    ├── playback.ts ···················· local-speaker playback sink (configure({ audioPlayback }))
    ├── segment-utils.ts ··············· pure segment/timeline post-processing (byte-cursor model)
    ├── ffmpeg.ts ······················ ffmpeg binary resolution (bundled via ffmpeg-static — no system dep)
    ├── utils.ts ······················· shared voice-layer utilities
    ├── agent-shapes.ts ················ @deprecated re-export shim → domain/agents/agent-shapes
    ├── adapters/ ······················ one transport per file
    │   ├── openai-realtime.ts ········· OpenAI Realtime (agent + user roles — the model IS the agent)
    │   ├── gemini-live.ts ············· Gemini Live native-audio (real server-VAD barge-in)
    │   ├── elevenlabs.ts ·············· ElevenLabs hosted ConvAI + branded transports
    │   ├── pipecat.ts ················· WebSocket client to a user-run Pipecat bot (pcm16/mulaw/opus)
    │   ├── twilio.ts ·················· real-phone transport via Twilio Media Streams
    │   ├── twilio-server.ts ··········· TwiML webhook + WS server impersonating Twilio's edge locally
    │   ├── twilio-shared.ts ··········· shared Media-Streams primitives (one canonical copy — Gap #6)
    │   ├── twilio-tunnel.ts ··········· public-tunnel harness so Twilio can reach the local server
    │   ├── twilio-logger.ts ··········· minimal parity logger for the Twilio adapter
    │   ├── composable.ts ·············· ComposableVoiceAgent: local STT → LLM → TTS (voice-test a text agent without a transport)
    │   ├── pending-transport-error.ts · PendingTransportError for stub transports (LiveKit/Vapi/WebRTC/WS — full transports → #371)
    │   └── index.ts ··················· adapter barrel
    ├── stt/ ··························· pluggable speech-to-text
    │   ├── stt-provider.ts ············ provider contract + "provider/model" router
    │   ├── openai-stt.ts ·············· default leaf (gpt-4o-transcribe)
    │   ├── elevenlabs-stt.ts ·········· ElevenLabs Scribe leaf
    │   ├── wav.ts ····················· minimal RIFF/WAV encoder for the STT upload edge
    │   └── index.ts ··················· barrel + registration site
    ├── tts/ ··························· pluggable text-to-speech
    │   ├── tts.ts ····················· router + cache core
    │   ├── openai-tts.ts ·············· default openai/<voice> leaf
    │   ├── elevenlabs-tts.ts ·········· elevenlabs/<voiceId> leaf (Gap #10)
    │   └── index.ts ··················· barrel + registration site
    ├── effects/ ······················· user-simulator audio-effects pipeline (§4.5)
    │   ├── index.ts ··················· pipeline + public surface
    │   ├── common.ts ·················· PCM16 @ 24kHz mono bytes ↔ Int16Array helpers
    │   ├── noise.ts ··················· backgroundNoise, static, multipleVoices
    │   ├── quality.ts ················· phoneQuality, lowQuality, packetLoss, echo, robotic, breakingUp
    │   ├── prosody.ts ················· volume scaling, time-stretching
    │   └── custom.ts ·················· arbitrary user Uint8Array → Uint8Array effects
    └── assets/noise/ ·················· 5 bundled noise WAVs (airport/babble/cafe/office/street) + LICENSES.md
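
The "provider/model" router listed under stt/ can be sketched as a registry keyed by the provider prefix. This is a guess at the general shape, not the code in voice/stt/stt-provider.ts: `SttProvider`, `registerSttProvider`, `parseSttSpec`, and `resolveStt` are hypothetical names, and the registered leaf is a stand-in that never calls a real API.

```typescript
// Hypothetical provider contract; the real one lives in voice/stt/stt-provider.ts.
interface SttProvider {
  transcribe(pcm16: Uint8Array, model: string): Promise<string>;
}

const sttRegistry = new Map<string, SttProvider>();

function registerSttProvider(name: string, provider: SttProvider): void {
  sttRegistry.set(name, provider);
}

// Split on the first slash only, so a model id may itself contain slashes.
function parseSttSpec(spec: string): { provider: string; model: string } {
  const slash = spec.indexOf("/");
  if (slash <= 0 || slash === spec.length - 1) {
    throw new Error(`Expected "provider/model", got "${spec}"`);
  }
  return { provider: spec.slice(0, slash), model: spec.slice(slash + 1) };
}

function resolveStt(spec: string): { provider: SttProvider; model: string } {
  const { provider, model } = parseSttSpec(spec);
  const impl = sttRegistry.get(provider);
  if (impl === undefined) {
    throw new Error(`No STT provider registered for "${provider}"`);
  }
  return { provider: impl, model };
}

// Stand-in leaf that echoes its inputs instead of calling a transcription API.
registerSttProvider("openai", {
  transcribe: async (pcm16, model) => `${model}:${pcm16.length} bytes`,
});

const resolved = resolveStt("openai/gpt-4o-transcribe");
```

The one-file-per-provider layout in the tree maps naturally onto this: each leaf registers itself under its prefix, and swapping STT becomes a change to the spec string rather than to the caller.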

Modified (existing modules the voice layer hooks into)

javascript/src/
├── index.ts ··························· public surface: voice namespace + lowercase factories exported
├── runner/run.ts ······················ scenario.run() host — resolves per-run voice config, threads it to the executor
├── execution/scenario-execution.ts ···· executor core — voice turn loop, pre-step interrupted-agent scheduling, interruptOverrides test seam
├── execution/scenario-execution-state.ts  run state — carries the voice executor reference across resets
├── agents/user-simulator-agent.ts ····· voice-aware user sim — per-run TTS voice, persona, audioEffects
├── agents/judge/judge-agent.ts ········ voice-aware judge — audio auto-detect, transcript fallback, structured timeline
├── config/index.ts ···················· config barrel — ScenarioConfig.voice wiring
├── domain/agents/index.ts ············· domain barrel — UserSimulatorAgentWithVoice
├── domain/core/execution.ts ··········· core types — agent({ wait:false }) non-blocking primitive
├── domain/scenarios/index.ts ·········· scenario domain types — voice fields
├── script/index.ts ···················· script barrel — voice steps surfaced
├── utils/convert-core-messages-to-agui-messages.ts  message conversion — audio file parts → input_audio; raw pcm16 is RIFF/WAV-wrapped at this langwatch-bound boundary so the app player decodes it (`885d294`)
└── voice/ (pre-existing PR1 bases, extended)
    ├── adapter.ts ····················· VoiceAgentAdapter base + AdapterCapabilities contract (+ isConnected gate)
    ├── index.ts ······················· voice namespace barrel
    ├── messages.types.ts ·············· audio message type surface
    ├── recording.types.ts ············· VoiceRecording/segment type surface
    ├── voice-executor-state.ts ········ runtime voice state — agent-speaking event, byte cursor
    └── voice-models.ts ················ canonical model ids (OPENAI_REALTIME_MODEL, …)
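
The RIFF/WAV wrapping mentioned for the langwatch-bound converter amounts to prepending the standard 44-byte header to raw PCM16 so a browser player can decode it. A minimal sketch, assuming the AC9 contract format as defaults; `wrapPcm16AsWav` is a hypothetical name and not the function added in `885d294`.

```typescript
// Hypothetical helper: prepend a canonical 44-byte RIFF/WAVE header to raw PCM16.
function wrapPcm16AsWav(pcm: Uint8Array, sampleRate = 24000, channels = 1): Uint8Array {
  const bytesPerSample = 2;
  const header = new ArrayBuffer(44);
  const view = new DataView(header);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size = file size minus 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * channels * bytesPerSample, true); // byte rate
  view.setUint16(32, channels * bytesPerSample, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // payload size in bytes

  const wav = new Uint8Array(44 + pcm.length);
  wav.set(new Uint8Array(header), 0);
  wav.set(pcm, 44);
  return wav;
}

const wrapped = wrapPcm16AsWav(new Uint8Array(4800)); // 0.1 s of silence
```

All multi-byte fields are little-endian, which is why every `DataView` setter passes `true`.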

How it functions together

scenario.run({ voice }) (runner/run.ts) resolves a per-run VoiceConfig (voice/config.ts — no module-global state, ADR-002) and hands it to the executor (execution/scenario-execution.ts), which connects the chosen transport (voice/adapters/*) through the uniform connected-state gate (voice/adapter.runtime.ts). On each user turn the simulator (agents/user-simulator-agent.ts) synthesizes speech via the TTS router (voice/tts/) and layers realism through the effects pipeline (voice/effects/ + bundled assets/noise/); audio streams over the transport and the agent's reply is drained on tail-silence windows (adapter.runtime.ts). Every chunk crosses a single gateway (voice/messages.ts) into ModelMessages carrying input_audio file parts — the exact shape langwatch ingests (utils/convert-core-messages-to-agui-messages.ts). Barge-ins come from script steps (script/voice-steps.ts) or probabilistic config (voice/interruption.ts): native server-VAD where the adapter supports it, SDK fallback otherwise (voice/vad.ts), with truncation marked on the byte cursor (voice/segment-utils.ts, voice-executor-state.ts). At judgment time a pre-pass (voice/judge-stt.ts → voice/stt/) transcribes audio for non-multimodal judges (agents/judge/judge-agent.ts). The run's audio persists as full.wav + per-segment files + a byte-accurate manifest (voice/recording.runtime.ts), optionally monitored live (voice/playback.ts), and rides ScenarioResult alongside timeline + latency extensions.

Asset parity: the 5 noise WAVs are byte-identical between javascript/src/voice/assets/noise/ and python/scenario/voice/assets/noise/ (md5-verified 2026-06-04, 144,044 B each), produced by the single deterministic generator javascript/scripts/generate-noise-samples.mjs; both sides carry LICENSES.md.

File-org corrections (from audit)

Three PR-internal file-org corrections landed late in the PR after a structure audit:

scripts/generate-noise-samples.mjs → javascript/scripts/ (TS-only generator belongs in package scripts, not repo-root cross-language scripts)
docs/voice/internal-design.md → docs/adr/003-voice-internal-design.md (it's an ADR by description; lives with ADR-001/002)
javascript/examples/vitest/recordings/ → javascript/examples/vitest/outputs/recordings/ (semantic clarity + future-proof for traces/logs siblings)

Pre-existing-on-main cleanup landed separately as PR #586 (deletes orphan docs, publishes happy-path guides, folds capability-matrix duplication, python recordings → outputs/recordings rename, python noise-sample parity refresh). After #586 merges this branch rebases off cleaned main.

Test plan

pnpm -F @langwatch/scenario test → 791 pass / 1 skipped (incl. the new interrupt-truncation, noise-energy, byte-cursor, proceed-loop pre-step, and Gemini Live spurious-pair unit tests).
@ts-e2e round-trip gate (real keys) green; tsc --noEmit, build:all, lint:all, typecheck:all clean.
Regression guard: the interrupt clock-mismatch fix is covered by a unit test exercising divergent cursor-vs-wall-clock times (the prior same-scale test masked the bug).
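
The noise-energy tests rest on a deterministic comparison of signal energy rather than a judge's opinion. A minimal sketch of that kind of assertion, assuming RMS as the energy measure and a 10x ratio as the threshold; both are assumptions, as are the helper names and the synthetic noise source, and the actual unit tests may measure this differently.

```typescript
// Root-mean-square energy of PCM16 samples.
function rms(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] ?? 0;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Deterministic pseudo-noise from a linear congruential generator, so the test never flakes.
function syntheticNoise(length: number, amplitude: number, seed: number): Int16Array {
  const out = new Int16Array(length);
  let state = seed >>> 0;
  for (let i = 0; i < length; i++) {
    state = (state * 1664525 + 1013904223) >>> 0;
    out[i] = Math.round((state / 0x100000000 - 0.5) * 2 * amplitude);
  }
  return out;
}

const silence = new Int16Array(24000); // one second of digital silence at 24 kHz
const noisy = syntheticNoise(24000, 2000, 42);

// Assumed threshold: noise must exceed silence by 10x, with a floor of 1 so that
// all-zero silence does not make the comparison trivially true.
const NOISE_FLOOR_RATIO = 10;
const noiseFloorHolds = rms(noisy) > NOISE_FLOOR_RATIO * Math.max(rms(silence), 1);
```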

How I can prove it works

16 committed demo recordings (javascript/examples/vitest/outputs/recordings/<demo>/ — full.wav + byte-accurate manifest.json), generated against live providers + the bundled Pipecat bot. The ones demonstrating the headline behaviors (open the blob → GitHub shows an audio player):

Behavior	Demo
Reply cut off mid-sentence, then recovers (judge hard-gated)	interruption_recovery
Probabilistic barge-ins via inline-TTS + canned-phrase strategy	random_interruptions
Server-VAD barge-in on Gemini Live (real mid-stream cut-off)	gemini_live_interruption
Audible anger (ElevenLabs tonal markers) + cafe noise	angry_customer
Genuine engagement, not a canned greeting	basic_greeting

Full set (16) also covers the adapters (openai_realtime ×2, elevenlabs hosted/branded, gemini_live), composable_stt_swap, recording_playback, voice_text_parity, pipecat ×2, background_handoff.

Anything surprising

evaluate check is red: expected and non-blocking. The fix lands with PR #586 (the main-side cleanup: docs + spec + python/TS parity). This PR's diff exceeds GitHub's 20k-line API cap, so the eval bot can't fetch it (HTTP 406). Not a required check. PR #586 (commits bafdbf7e15 + cdce271bb2) catches the 406 with a grep-specific oversized=true path + env: pattern hardening for oversized_reason; it supersedes PR #572 ("fix(ci/#571): soft-pass oversized PR diffs in pr-auto-approve evaluate") and closes #571 ("ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path") on merge. Once #586 merges and this branch rebases on the new main, evaluate will go green on this PR too.
The 6 Pipecat-bot demos run only in the manual voice-integration workflow (which now sets SCENARIO_PIPECAT_BOT_UP); pre-merge they're proven by the committed recordings + local real-key runs, and the workflow guards them post-merge.
random_interruptions recording is honest about Pipecat-bot limits. The bundled Pipecat stub bot bursts TTS frames in ~50 ms of wall time (not realtime streaming), so by the time adapter.interrupt() fires the bot has already sent all frames — real mid-stream audio cut-off isn't observable with this transport. The demo's assertions encode what the bot can prove: interrupt fires + fired_after_speech outcome + canned-phrase strategy ran + truncation label + agent recovers + multi-turn conversation. For real audio cut-off under server-side cancel, see gemini_live_interruption. A transport-upgrade follow-up is drafted (see /tmp/voice-spec/issue-random-interruptions-followup.md).
Follow-ups filed: #568, "fix(voice/python): port noise-sample generator + split audio-property claims out of judge criteria (TS parity from #561)" (Python demo-fidelity parity); #569, "fix(voice/ts): AgentSpeakingEvent.set() fires on an empty first audio chunk (adapter.runtime.ts)" (speakingEvent empty-chunk); #570, "harden(voice/ts): validate paths/URLs in backgroundNoise/multipleVoices + pipecat adapter" (path/URL hardening); #571, "ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path" (evaluate oversized-diff). Remaining transports (LiveKit/Vapi/WebRTC/WebSocket) are tracked under #371, "Voice agents: remaining platform adapter transports".

🤖 Generated with Claude Code

Closes (574-585 grind — landed in this PR)

Closes #574 #575 #576 #578 #579 #580 #581 #582 #583 #584 #585

All 11 follow-up issues from the post-review NIT batch were addressed in this PR via 60+ commits since 4d83724. Per-issue close-out comments are on each issue; partial outcomes for #580 (Gemini adapter improved, demo workaround retained) and #583 (adapter dequeue race fixed, transport switch reverted) are documented honestly there.

drewdrewthis · 2026-05-27T09:25:24Z

🎧 Voice demo recordings — click to listen

Real audio captured from each demo's scenario.run() (committed under javascript/recordings/<demo>/). Click to play in a browser tab:

Demo	Listen
OpenAI Realtime — agent	▶ play
OpenAI Realtime — user-sim	▶ play
ElevenLabs — hosted ConvAI	▶ play
ElevenLabs — branded/composable	▶ play
Gemini Live	▶ play
Composable (STT swap)	▶ play
Pipecat (WebSocket)	▶ play
Recording + playback	▶ play
Voice ↔ text parity	▶ play

Embedded player: GitHub renders a player only for files uploaded as comment attachments — drag a .wav into a reply box and it embeds inline. Per-segment WAVs + manifest.json live in each demo folder.

drewdrewthis · 2026-05-27T18:02:43Z


…s → outputs

User feedback: "recordings" describes the file format; "outputs" describes the purpose (these dirs hold what the example tests produced). The helper that writes here keeps its name (saveDemoRecording): it still SAVES a recording, the recording is just NAMED an output now.

Updates the writing helper's RECORDINGS_ROOT to point at outputs/, all test-file doc-comment path refs, the recordings README (title, intro, GitHub blob URL example, section header), .gitignore patterns, the voice-integration CI workflow's upload path, TESTING.md fixture paths, and fixes the (pre-existing) broken link in javascript/README.md that pointed at ./recordings/README.md.

Python's python/recordings/ stays for now; renaming there is a follow-up issue (filed separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@e2e

…t types

User feedback: outputs/ should be a parent for all test-run artifact types (recordings now, traces/logs/screenshots later). Moves every demo into outputs/recordings/<demo>/ and adds a new thin outputs/README.md that documents the artifact-parent shape. The rich audio policy / per-demo coverage table stays where it belongs at outputs/recordings/README.md.

Writer (tests/voice/helpers/save-demo-recording.ts) updated: RECORDINGS_ROOT now resolves to .../outputs/recordings/, so newly written recordings land in the new shape without further changes.

Other ref updates:
- .gitignore: every committed-demo whitelist + segments re-ignore moved under outputs/recordings/, plus a sibling re-include for the new outputs/README.md.
- .github/workflows/javascript-voice-integration.yml: upload-artifact path → outputs/recordings/**.
- javascript/README.md: doc link → outputs/recordings/README.md.
- TESTING.md: footprint paths + du command.
- All @e2e demo test docstrings (15 files): "Recording lands in outputs/recordings/<demo>/".

Sanity: typecheck PASS, build PASS, tests 791/792 PASS (1 pre-existing skip, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed / PR #492) lists `interruption` in the adapter-capabilities declaration. The vitest-cucumber binding at voice-contract-surface.test.ts:177 still had the pre-71dd5ed step title (missing `interruption`), so StepAble couldn't find the matching feature step. Update the step title to match the feature file and add the live-adapter `typeof caps.interruption === "boolean"` check (the empty-adapter check on line 192 already exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ts to input_audio shape

The TS SDK was pre-stringifying message content arrays in the AG-UI conversion (convertModelMessagesToAguiMessages), which had two consequences for voice runs:

1. The langwatch ingest content-extractor walks `content` only when it is an ARRAY of parts. JSON.stringify happens BEFORE the POST, so the extractor saw a string and never recursed → inline base64 audio bytes flowed straight through.
2. The extractor's array walker handles the OpenAI Realtime `input_audio` shape but not the canonical AI-SDK `file`+`audio/*` shape that `createAudioMessage` emits, so even an array would have been a no-op.

End-to-end consequence: voice runs persisted full base64 PCM16 audio inline in ClickHouse Messages.Content. The simulations list query (`getSuiteRunData`) slurped the first 6 messages' Content back per scenario; a single voice scenario set returned 90+ MB.

This commit:
- Stops pre-stringifying user/assistant array content. The langwatch ingest schema (`chatMessageSchema.content`) accepts arrays via `union(string, array(chatRichContent))`, so the wire contract is preserved. AG-UI's stricter `string`-only typing is bypassed with a cast at the conversion boundary (single point, well-commented).
- Translates AI-SDK `{type:"file", mediaType:"audio/*", data:"<b64>"}` parts into the OpenAI Realtime `{type:"input_audio", input_audio:{data, format, mimeType}}` shape so the langwatch extractor's existing inputAudio handler externalises the bytes to stored-objects.
- Collapses pure single-text-part arrays back to a plain string to keep the preview payload compact for the list view.

Tests updated to assert the new contract (array passthrough + input_audio translation + non-audio file-part passthrough).

Companion langwatch backend changes (separate repo PRs):
- Add a `file`-part branch to the content-extractor visitor (defence in depth for any future SDK that emits the AI-SDK file shape).
- Cap Messages.Content size in the simulation-run projection so a misbehaving SDK can never again turn into a 90 MB list-page response.
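The three conversion rules in the commit above (array passthrough, `file`+`audio/*` to `input_audio` translation, and single-text-part collapse) can be sketched roughly as follows. This is a minimal illustration, not the SDK's converter: the part types and the `translatePart` / `convertContent` names are hypothetical, and deriving `format` from the media subtype is an assumption.

```typescript
// Hypothetical part types for illustration; the SDK's real types may differ.
type TextPart = { type: "text"; text: string };
type FilePart = { type: "file"; mediaType: string; data: string };
type InputAudioPart = {
  type: "input_audio";
  input_audio: { data: string; format: string; mimeType: string };
};
type InPart = TextPart | FilePart;
type OutPart = TextPart | FilePart | InputAudioPart;

// Translate one part: audio/* file parts become input_audio,
// everything else (text, non-audio files) passes through unchanged.
function translatePart(part: InPart): OutPart {
  if (part.type === "file" && part.mediaType.startsWith("audio/")) {
    // Assumption: the format is the media subtype, e.g. "audio/wav" -> "wav".
    const format = part.mediaType.slice("audio/".length);
    return {
      type: "input_audio",
      input_audio: { data: part.data, format, mimeType: part.mediaType },
    };
  }
  return part;
}

// Convert message content without stringifying arrays. A lone text part
// collapses to a plain string to keep list-view previews compact.
function convertContent(content: string | InPart[]): string | OutPart[] {
  if (typeof content === "string") return content;
  const first = content[0];
  if (content.length === 1 && first?.type === "text") return first.text;
  return content.map(translatePart);
}
```

Keeping the array as an array is the point of the sketch: the ingest extractor described above only recurses into arrays, so a stringified payload would bypass it.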

Main's f716e46 added the pnpm override (CWE-502 bump) to package.json; the branch lockfile predated it, so CI died on ERR_PNPM_LOCKFILE_CONFIG_MISMATCH before running any tests. Regenerated via pnpm install --no-frozen-lockfile; --frozen-lockfile now exits 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ble + hosted URL

Two stale doc-contract expectations, pre-existing on the branch tip but never caught: CI aborted on the lockfile mismatch before the suite ran, and the author's 791-pass count predates the docs restructure alignment.

- voice-steps: the UnsupportedCapabilityError message points at the hosted docs URL (scenario-docs.langwatch.ai/voice/capability-matrix), same as Python's capabilities.py; the test still expected the old repo-relative .md path. The sibling assertion in voice-contract-surface already used the hosted URL.
- voice-contract-surface: the capability rows now live in the auto-generated _generated/voice/capability-matrix.mdx imported by the wrapper page; assert the underscore column keys the feature step actually names (streaming_transcripts, native_vad, dtmf, input_formats, output_formats) across wrapper + generated content.

Suite: 796 pass / 1 skip / 0 fail; build:all + tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The OpenAI Realtime adapter emitted the model's audio as a raw `audio/pcm16` file part. PCM16 is headerless, so the LangWatch simulations UI (and any browser `<audio>`) could not decode it and rendered an `[error]` badge instead of an inline player.

WAV-wrap the PCM before persisting and emit `audio/wav`, reusing the existing `encodeWav` (now exported). Mirrors the Python twin (`python/scenario/voice/messages.py`, which already emits `format: wav`). This was a TS-vs-Python parity gap, not a wire-protocol issue.

Adds a ResponseFormatter unit test asserting the emitted part is `audio/wav` with a valid RIFF/WAVE header.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rs a player

Supersedes 5db36c8, which wrapped at response-formatter.ts, the wrong layer: it only touched the realtime-agent path and missed the user-simulator audio (both speakers showed `audio/pcm16 [error]` in the simulations UI).

The SDK deliberately carries in-message audio as raw headerless PCM16 (one encoder/extractor in voice/messages.ts; the WAV-vs-pcm16 disagreement was a prior live bug, so keep it closed). So wrap ONLY at the langwatch-bound converter (convert-core-messages-to-agui-messages.ts): raw `audio/pcm16` file parts become `audio/wav` + `format:"wav"` with a RIFF container, so a browser `<audio>` can decode them. Matches the Python twin's shipped shape (voice/messages.py -> format:"wav"). SDK-internal raw-PCM16 contract is untouched (readers never see this conversion).

Reverts the response-formatter.ts / recording.runtime.ts changes from 5db36c8; removes the now-moot response-formatter unit test.

Verified end-to-end: live openai-realtime run -> langwatch /api/files serves content-type audio/wav, RIFF/WAVE header, ffprobe pcm_s16le/24kHz/3.25s (was audio/pcm16, undecodable).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…y WAV wrap

885d294 moved the WAV wrap to the langwatch-bound converter but left the pcm16-passthrough expectation pinned in message-conversion.test.ts (the only red in CI run 26958577920: 1 failed / 795 passed). Pin the deterministic wrapped shape instead: RIFF/WAV container at the AudioChunk contract params (24kHz mono 16-bit), format "wav", mimeType audio/wav, matching the Python twin and the commit's verified e2e behavior.

Local: file 10/10; full suite 796 pass / 1 skip; tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
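For readers unfamiliar with why headerless PCM16 cannot be played by a browser, the WAV wrap discussed in the commits above amounts to prepending a 44-byte RIFF/WAVE header. The sketch below is a hypothetical stand-in written for illustration, not the SDK's `encodeWav`; the `wrapPcm16AsWav` name is invented, and the defaults follow the 24kHz mono 16-bit parameters quoted above.

```typescript
// Hypothetical stand-in for a WAV encoder; not the SDK's encodeWav.
// Wraps raw little-endian PCM16 bytes in a canonical 44-byte RIFF/WAVE header.
function wrapPcm16AsWav(
  pcm: Uint8Array,
  sampleRate = 24000, // AudioChunk contract rate quoted in the commit above
  channels = 1,
): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign; // bytes per second
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) out[offset + i] = text.charCodeAt(i);
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size = file size - 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // data chunk size
  out.set(pcm, 44); // samples follow the header unchanged
  return out;
}
```

Because the samples are copied unchanged, the wrap is reversible by dropping the first 44 bytes, which is consistent with keeping raw PCM16 as the SDK-internal contract and converting only at the outbound boundary.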

github-actions · 2026-06-04T15:01:08Z

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR's diff could not be evaluated automatically: Diff too large to fetch via GitHub API (fetch error: could not find pull request diff: HTTP 406: Sorry, the diff exceeded the maximum number of lines (20000) (https://api.github.com/repos/langwatch/scenario/pulls/561)
PullRequest.diff too_large). Manual review required.

This PR requires a manual review before merging.

…#604 reality

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…t-gen (#610)

* docs(voice/#606): expand STT/TTS doc comments and relax audio-to-text judge criteria

  Adds deliberate-choice rationale comments to OPENAI_STT_MODEL and OPENAI_TTS_MODEL in both JS (voice-models.ts) and Python (voice_models.py), noting no gpt-5-family transcription/TTS models exist on the public API as of 2026-06. Also documents the Python-only OPENAI_BOT_STT_MODEL gap in the TS file.

  Relaxes the multimodal-audio-to-text judge criteria from overly-specific assertions (exact voice gender, exact repeat phrasing) to behavioural checks (processed audio, coherent response, non-text format acknowledgement). Updates the stale skip comment to reflect the model swap in PR #607.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(voice/#606): update feature-file contract counts to match post-#561/#604 reality

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(voice/#606): add AC4/AC5 doc comments — STT lock rationale + TTS callable-swap pattern

  - openai-realtime.ts: explain why `input.transcription.model` is locked to OPENAI_STT_MODEL and not exposed as a constructor option (Realtime API only accepts transcription-class models; callers who need a different model subclass the adapter)
  - openai-tts.ts: document that the TTS model is not a parameter by design; the pattern is to swap the whole TTSCallable rather than parameterise this one; link to OPENAI_TTS_MODEL for the current-gen rationale

  Closes #606 (AC4 + AC5)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(examples/voice/#606): correct stale comment — model swap + unskip are in #607, not this branch

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

The voice-to-voice example helper and the audio-to-text example pinned `gpt-4o-audio-preview`, which OpenAI has removed (404 model_not_found since 2026-05-19). Any user running the canonical voice example hit an immediate 404.

Switch to `gpt-audio-mini`, OpenAI's current cost-efficient GA audio-chat model, matching the Python twin, which already migrated (python/scenario/config/voice_models.py:44 OPENAI_AUDIO_CHAT_MODEL, python/examples/test_audio_to_text.py:157).

Verified live: gpt-audio-mini accepts the identical chat.completions shape (modalities:["text","audio"], audio:{voice,format}) and returns audio. Re-ran the voice-to-voice e2e against prod LangWatch: success: true, real 2-turn conversation, traces landed (project_bZspxwkhCD4POvqmIgOr2). SDK core was unaffected (OpenAIRealtimeAgentAdapter uses gpt-realtime-mini). This closes a py↔ts example-parity gap left by #561.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
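The chat.completions shape named in the commit above (modalities plus an audio object) might look like the request body below. This is a sketch only: `buildAudioChatRequest` and the `AudioChatRequest` type are hypothetical names, the `"alloy"` voice and `"wav"` format are illustrative assumptions rather than values taken from the examples, and no network call is made.

```typescript
// Hypothetical request-body builder; the example helpers in the repo may differ.
type AudioChatRequest = {
  model: string;
  modalities: Array<"text" | "audio">;
  audio: { voice: string; format: string };
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
};

function buildAudioChatRequest(
  userText: string,
  model = "gpt-audio-mini", // the replacement model named in the commit above
): AudioChatRequest {
  return {
    model,
    modalities: ["text", "audio"],
    // Assumed values for illustration; pick whatever voice/format the example needs.
    audio: { voice: "alloy", format: "wav" },
    messages: [{ role: "user", content: userText }],
  };
}
```

Keeping the model as a single defaulted parameter means a future model removal would touch one line rather than every example.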

drewdrewthis mentioned this pull request May 27, 2026

Voice Agents #370

Open

drewdrewthis requested review from 0xdeafcafe, Aryansharma28, rogeriochaves and sergioestebance May 28, 2026 12:06

drewdrewthis changed the title ~~feat(typescript-sdk/#372): voice agent testing — consolidated clean stack~~ feat(typescript-sdk): voice agent testing — consolidated clean stack May 28, 2026

drewdrewthis and others added 6 commits June 4, 2026 10:21

drewdrewthis force-pushed the voice/372-refactor branch from 180bab4 to f2cdf58 Compare June 4, 2026 10:37

drewdrewthis requested a review from rogeriochaves June 4, 2026 12:22

drewdrewthis and others added 3 commits June 4, 2026 16:15

0xdeafcafe approved these changes Jun 4, 2026

View reviewed changes

drewdrewthis merged commit 5847c4b into main Jun 4, 2026
19 checks passed

drewdrewthis deleted the voice/372-refactor branch June 4, 2026 15:10

rogeriochaves mentioned this pull request Jun 4, 2026

chore(main): release javascript 0.4.12 #386

Merged

This was referenced Jun 4, 2026

fix(examples/voice): swap deleted gpt-4o-audio-preview → gpt-audio-mini #607

Closed

fix(voice): main python-ci red — stale feature-file contract counts (108→127) after #561 #609

Closed

drewdrewthis added a commit that referenced this pull request Jun 4, 2026

fix(voice/#606): update feature-file contract counts to match post-#561/#604 reality (8199d0e)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

drewdrewthis mentioned this pull request Jun 4, 2026

Voice STT/TTS defaults still use gpt-4o-* models — decide modernization path #606

Closed

drewdrewthis added a commit that referenced this pull request Jun 4, 2026

fix(voice/#606): update feature-file contract counts to match post-#561/#604 reality (6ea8b8d)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This was referenced Jun 4, 2026

chore(examples/voice/#486): retire legacy gpt-4o-audio-preview surface, migrate supported audio examples to gpt-audio-mini #612

Open

docs(voice/#606): document STT/TTS model choices as deliberate current-gen #610

Merged

drewdrewthis merged 179 commits into main from voice/372-refactor

5 participants
