feat(typescript-sdk): voice agent testing — consolidated clean stack #561

Merged
drewdrewthis merged 179 commits into main from voice/372-refactor
Jun 4, 2026

Collaborator

@drewdrewthis drewdrewthis commented May 27, 2026

Why

Closes #372 (epic #370). The TypeScript voice-agent-testing port had fragmented into 10 flat-sibling PRs (#513/#515/#528/#534–540) carrying pre-EDR design drift. This replaces all of them with one clean stack rebuilt against main, conforming to the EDR (#560 / ADR-002) and the decided public API (PRD). Python is the reference implementation; this brings TS to parity.

Acceptance Criteria

The behavioral contract is specs/voice-agents.feature — 127 scenarios ported from the Python source-of-truth (79 @unit, 14 @integration, 39 @e2e, 2 @todo). Collapsed to the headline ACs below, each ticked only with evidence on this branch. "Done" = (1) the ci-checks gate (units + build + non-bot @ts-e2e) green, (2) the committed demo recordings present, and (3) the app-side ACs below (AC13–18) verified — a voice run correctly ingested, queryable, and rendered in the langwatch app. SDK-green alone is necessary but not sufficient.

Status legend: ✅ = evidence verified in-tree/observed · ⏳ = implemented + author-reported green locally, not yet confirmed by CI · ◻️ = open. (2026-06-04: the unit suite now runs in CI — ci-checks (24.x) green on run 26960103187 after the rebase + lockfile fix; formerly-⏳ unit-backed rows are CI-confirmed.)

Binding status: 29/127 scenarios carry @ts-bound (wired to an executable TS test). The remaining 98 are contract-level (ported from the Python spec), covered indirectly rather than by 1:1 bindings. The full suite is CI-confirmed at 796 pass / 1 skip on HEAD 390e52c (run 26960103187) — the former lockfile abort is fixed, so unit-backed ACs below are ✅ on real CI evidence. Demo-recording-backed ACs are ✅ (the 16 recordings are committed and present in-tree — verified).

# Acceptance criterion Evidence Status
AC1 Same entrypoint — voice uses scenario.run(), text-only scenarios unaffected by voice deps voice_text_parity recording (in-tree ✅) + unit "Existing text-only scenarios unaffected" (✅ CI)
AC2 Per-run provider state (ADR-002) — voice config on ScenarioConfig.voice, no module-global configure() composable_stt_swap recording (in-tree ✅); STT-swap units (✅ CI)
AC3 Barge-in primitive: agent({ wait:false }) + interrupt() cuts off a reply mid-utterance, marked truncated interruption_recovery recording, judge hard-gated (in-tree ✅)
AC4 Real server-VAD barge-in on Gemini Live + ElevenLabs (mid-stream cut-off) gemini_live_interruption + elevenlabs_interruption recordings (in-tree ✅)
AC5 Adapter parity — OpenAI Realtime (agent+user), ElevenLabs (hosted/branded/composable), Gemini Live, Pipecat openai_realtime_{agent,user}, elevenlabs_{hosted,branded}, gemini_live, pipecat_{scenario,ws} recordings (in-tree ✅)
AC6 Audio effects + tonal realism — noise floor, distortion, audible anger angry_customer recording, noise-floor≫silence assertion (in-tree ✅)
AC7 Voice-aware judge — auto-detects audio, transcript fallback for non-multimodal, structured timeline @ts-judge 7-scenario unit suite — CI-confirmed (run 26960103187)
AC8 Capability matrix is a contract — every adapter publishes capabilities; dtmf() raises UnsupportedCapabilityError off-telephony; matrix in docs @ts-contract-surface units (✅ CI) + adapters/*.mdx tables (in-tree ✅)
AC9 PCM16 @ 24kHz mono internal format; pluggable STT (default OpenAI gpt-4o-transcribe); SDK-side VAD fallback with one-shot warning @ts-contract-surface / @ts-stt / @ts-vad units — CI-confirmed (same run)
AC10 CI merge gate green: ci-checks (units + build + non-bot @ts-e2e) passes on HEAD ci-checks (24.x) pass (6m2s, suite executed) + javascript-complete/docs-complete/python-complete/evaluate all pass on HEAD 390e52c, run 26960103187. Fixed by rebase onto main + recording the @ungap/structured-clone override in pnpm-lock.yaml (f7273e7) + aligning 2 stale doc-contract test assertions (f2cdf58). Local: 796 pass / 1 skip.
AC11 Telephony transport (Twilio) — TwiML endpoint, signature rejection, tunnel harness, clear-buffer interrupt @ts-twilio-proto + @ts-twilio-server integration scenarios bound — unit/integration layers CI-confirmed (same run)
AC12 Remaining platform adapters (LiveKit, Vapi, generic WebRTC/WebSocket) raise PendingTransportError stub adapters + units (✅ CI); full transports deferred to #371
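
AC8's "capability matrix is a contract" can be pictured with the sketch below. It is a minimal illustration, not the SDK's code: the `Capabilities` field names, the `requireCapability` helper, the error's constructor shape, and the sample values are all assumptions. Only the `UnsupportedCapabilityError` name, the `dtmf` capability, and the hosted docs location come from this PR.

```typescript
// Hypothetical capability shape for illustration; the SDK's AdapterCapabilities may differ.
interface Capabilities {
  streamingTranscripts: boolean;
  nativeVad: boolean;
  dtmf: boolean;
  interruption: boolean;
  inputFormats: readonly string[];
  outputFormats: readonly string[];
}

type BooleanCapability = "streamingTranscripts" | "nativeVad" | "dtmf" | "interruption";

// Hosted docs location named in this PR's commit notes.
const CAPABILITY_MATRIX_URL = "scenario-docs.langwatch.ai/voice/capability-matrix";

class UnsupportedCapabilityError extends Error {
  constructor(adapterName: string, capability: BooleanCapability) {
    super(`${adapterName} does not support ${capability}; see ${CAPABILITY_MATRIX_URL}`);
    this.name = "UnsupportedCapabilityError";
  }
}

// Guard a step (for example dtmf()) behind the adapter's published capabilities.
function requireCapability(
  adapterName: string,
  caps: Capabilities,
  capability: BooleanCapability,
): void {
  if (!caps[capability]) {
    throw new UnsupportedCapabilityError(adapterName, capability);
  }
}

// Illustrative values only, not the published matrix.
const webSocketBotCaps: Capabilities = {
  streamingTranscripts: false,
  nativeVad: false,
  dtmf: false,
  interruption: true,
  inputFormats: ["pcm16"],
  outputFormats: ["pcm16"],
};
const telephonyCaps: Capabilities = { ...webSocketBotCaps, dtmf: true };
```

With these sample values, `requireCapability("websocket-bot", webSocketBotCaps, "dtmf")` throws, while the same call against `telephonyCaps` returns normally.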

Net (honest read, updated 2026-06-04): all SDK-side ACs (AC1–12) are now ✅. AC1–6 were demo-recording-verified; AC7–9 + AC11 flipped from ⏳ to ✅ when the unit suite executed in CI for the first time on this PR — ci-checks (24.x) pass in 6m2s on HEAD 390e52c (run 26960103187), 796 pass / 1 skip. AC10's lockfile blocker is fixed (rebase onto main + @ungap/structured-clone override recorded in the lockfile + 2 stale doc-contract test assertions aligned).

App-side ACs — langwatch ingestion + rendering (SDK-green is necessary but NOT sufficient)

Parity is not done when the SDK suite passes — it's done when a voice scenario.run() is correctly ingested, queryable, and rendered in the langwatch app. These require a live end-to-end run (SDK → langwatch ingest → API/UI), not a unit test. Surface mapped against langwatch/langwatch:

# App-side acceptance criterion Concrete check Status
AC13 Traces received — a voice run's OTel trace (incl. input_audio content parts) lands in langwatch POST /api/collector ingests the span; trace appears for the project ✅ verified 2026-06-04 — live TS run scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA (openai_realtime_agent demo) landed 2 traces in project_bZspxwkhCD4POvqmIgOr2 with scenario.run_id metadata (sdk: langwatch-observability-sdk typescript 0.16.1); judge spans carry the audio as file parts: {"type":"file","mediaType":"audio/pcm16",…~230KB}
AC14 Traces queryable via API — the received trace is retrievable through the public API POST /api/traces/search {projectId, filters} returns the voice run's trace POST /api/traces/search (startDate/endDate window) returned both: c50549165a184a73d5fb509525230755, 7c45a3b4edf40ad1466e33afd176b2f5; GET /api/trace/:id returns full span detail (5 spans incl. OpenAIRealtimeAgentAdapter.call, _JudgeAgent.call)
AC15 Scenario events ingested: RUN_STARTED / MESSAGE_SNAPSHOT / RUN_FINISHED persisted, each message carrying optional trace_id scenario_events ES index holds the run's events; queryable via tRPC ✅ run persisted with 4 messages (MESSAGE_SNAPSHOT) carrying 8 input_audio content parts; the 2 assistant messages carry trace_id refs matching the two traces above (cross-linkage proven); status: SUCCESS + results.verdict: success (3/3 criteria) = RUN_FINISHED recorded
AC16 Scenario run visible in the app — the voice run shows in the simulations UI + REST GET /api/simulation-runs/:scenarioRunId returns it; /[project]/simulations renders it GET /api/simulation-runs/scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA → HTTP 200 (demo_openai_realtime_agent, 41.8s, cost $0.0018); platformUrl: app.langwatch.ai/scenario-tracing-bZspxw/simulations (API-verified; in-app visual spot-check needs an authed browser — one click)
AC17 Audio messages rendered: input_audio content renders as an inline player, not a raw JSON blob <ScenarioMessageRenderer><MediaPart> emits <audio controls> for the message ✅ shipped on langwatch@main via #4058 (the old #3781 is stale/superseded — not a blocker). MediaPart.tsx emits <audio data-testid="media-part-audio" controls> with onLoadedData probing; visit-content-part.ts handles input_audio parts; integration-tested (MediaPart.integration.test.tsx). The verified run's messages carry exactly the shape it renders: {"type":"input_audio","input_audio":{"mimeType":"audio/pcm16","url":"/api/files/so_…"}}
AC18 Audio itself verified — the rendered audio actually decodes + plays the correct content base64 WAV decodes in <audio> (onLoadedData → ok); externalized blobs resolve via /api/files/:id ✅ verified byte-level 2026-06-04 on run scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA: GET /api/files/so_000000000002CR60L0eAWXH3ILtke → HTTP 200, 249,600 bytes; decodes as valid 5.20s pcm16 @ 24kHz mono (the AC9 contract format); OpenAI STT transcribes it to "Of course! Where would you like to go, or what kind of activities are you interested in?" — the coherent reply to the run's user turn "Hello, can you help me plan a weekend trip?". The user-side part (the SDK's other audio path — user-sim TTS) round-trips verbatim: 3.60s pcm16 → STT → "Hello, can you help me plan a weekend trip?", the exact scripted turn. Decode ✓ format ✓ correct-content ✓ both-paths ✓ serving-route ✓; the in-browser <audio> element rendering these same URLs is covered by the integration tests above. Follow-up hardening on-branch: 885d294 WAV-wraps raw pcm16 at the langwatch-bound converter (supersedes 5db36c8; e2e-verified in its commit: /api/files now serves audio/wav with a RIFF header, ffprobe-decodable) + 390e52c aligns the conversion test
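
The byte and duration figures in AC18 follow directly from the AC9 contract format (PCM16 @ 24kHz mono). A small worked check, with hypothetical helper names:

```typescript
// AC9 contract format: 16-bit PCM, 24 kHz, mono.
const SAMPLE_RATE_HZ = 24000;
const BYTES_PER_SAMPLE = 2;
const CHANNELS = 1;

// Hypothetical helper: duration of a headerless PCM16 buffer.
function pcm16DurationSeconds(byteLength: number): number {
  return byteLength / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS);
}

// Hypothetical helper: expected byte length for a given duration.
function pcm16ByteLength(seconds: number): number {
  return Math.round(seconds * SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE * CHANNELS;
}

// The agent reply served by /api/files was 249,600 bytes.
const agentReplySeconds = pcm16DurationSeconds(249600); // 5.2

// Derived, not measured: a 3.60 s user turn would occupy this many bytes.
const userTurnBytes = pcm16ByteLength(3.6); // 172800
```

At 48,000 bytes per second, 249,600 bytes is exactly the 5.20 s reported for the agent reply.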

Verification matrix (2026-06-04, all live against production app.langwatch.ai):

Every run below links to the live prod app — open it (project members) and press play on any audio message to hear the run the table describes.

Family / variant Run Ingest (AC13–15) REST (AC16) Audio bytes (AC18)
OpenAI Realtime (agent) scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA ✅ 2 traces x-linked, 4 msgs / 8 parts ✅ 200, SUCCESS 3/3 ✅ agent reply STT-coherent + user TTS verbatim round-trip
Gemini Live + interruption scenariorun_3Efeze6fAvYmZd8gQeyVxz8SoOm ✅ 3 traces, 5 msgs / 5 parts — FAILED verdict persists correctly (2 met / 1 unmet). The unmet criterion is a judge semantic trap, not an adapter fault: the criterion says "over a real Gemini Live session" and the judge read the framework's own origin: simulation trace metadata as disqualifying — while its reasoning simultaneously confirms the live mid-utterance VAD cut-off. Vitest mechanics assertions: 3/3 pass. Suggested post-merge polish: drop the word "real" from that criterion ✅ 200 ✅ truncated segment is its own cut-off transcript ("I am a large" / STT "I am a lar", 0.96s)
ElevenLabs hosted scenariorun_3EffH3ID8Q7HU5S5anb2c8u4hhy ✅ 3 msgs / 3 parts ✅ 200, SUCCESS ✅ 1.56s → "Hello, how can I help you today?"
ElevenLabs branded + audioEffects scenariorun_3EffKAJl4dXbP1IU41rxCzYyiwT ✅ 4 msgs / 4 parts ✅ 200, SUCCESS ✅ effects-processed audio STT-clean
Pipecat (mulaw@8000 source) scenariorun_3EffZC024s9IrYTOaIoKbQawMpq ✅ 5 msgs / 5 parts, all normalized audio/pcm16 — AC9 contract held for a telephony source format ✅ 200, SUCCESS ✅ 3.80s → "Hello, thank you for calling. How can I help you today?"
Twilio (PSTN) not runnable in this environment (needs live telephony + public tunnel); SDK layer CI-confirmed per AC11

App-side net (final, 2026-06-04): all app-side ACs (AC13–18) are ✅ verified against production across the adapter matrix — 5 live runs / 4 adapter families (matrix above), covering happy-path, interruption/truncation, failed-verdict persistence, audio effects, and a telephony source format. Every leg: SDK → collector → traces/search → simulation-runs REST → /api/files audio bytes, with message↔trace cross-linkage intact and served audio STT-verified as the correct conversational content at the contract format (pcm16/24kHz). The inline player shipped via #4058 (#3781 is stale, not a blocker). Parity board AC1–18: complete. Optional formality: eyeball any run at the simulations page — every layer beneath that click is independently verified.

What changed (decisions)

  • Per-run provider state, not module-global (ADR-002): voice config rides on ScenarioConfig.voice and reaches the adapters per-run — no global configure(stt=). One AI-SDK file audio format end-to-end; STT/TTS are one-file-per-provider.
  • agent({ wait: false }) is the barge-in primitive (PRD §4.4): the executor starts the agent's reply without blocking so a user() / interrupt() lands mid-utterance; the transport's native cancel fires and the cut-off segment is marked truncated.
  • Demo promises are encoded as gates, split by what's verifiable: an LLM judge reading a transcript can't see audio properties, so judge criteria assert the conversational half (empathy, acknowledging the specific request, recovery) and deterministic code assertions assert the audio half (transcriptTruncated + shorter segment for cut-off; noise-floor ≫ silence for mixed ambience). A hollow demo now fails.
  • Real-key voice demos run via a manual voice-integration workflow, not PR CI — mirroring Python's voice-integration.yml. They cost real API money and can flake, so they never gate a merge; ci-checks (units + build + the non-bot @ts-e2e gate) is the merge gate.
  • Pre-step interrupt scheduling (maybeScheduleInterruptedAgentTurn, daa357d): ports Python's pre-step pattern; the executor decides to interrupt before the next agent turn begins rather than patching it up post-step. Eliminates the prior hollow post-step path.
  • Inline-TTS barge-in (fa84c8f): voiceProceed fires the interrupt via voiceifyText while the AGENT is still in-flight, bypassing the user-sim LLM to win the race against fast-streaming bots. delayRange is honoured in this path (d20a49c).
  • Pipecat adapter buffer-clear on interrupt (557cac2): sets a discardingInboundAudio flag so late bot frames don't contaminate the next agent turn.
  • VoiceEvent discriminated union (4a49585): five variant interfaces (AgentSpeakingEvent, UserSpeakingEvent, etc.) replace the prior type:string — the compiler now narrows on event.type without casts.
  • RealtimeUserAgent + VoiceUserSimulator structural interfaces + type guards (3f234f4): kills as unknown as casts in the executor; adapter conformance is checked structurally at compile time.
  • interruptRng / interruptWaitForSpeechMs renamed to drop leading underscore (50cf3f2): test-seam fields are now @internal JSDoc-tagged rather than visually private — consistent with the project's naming convention.
  • interruption_recovery judge promoted to hard gate (5bce766): was informative-only; now the scenario fails if the judge says the agent didn't recover. Recording regenerated at be600de.
  • Gemini Live spurious-pair handling proven by 3 new adapter unit tests (4bd40c6): receiveAudio() deduplicate logic is exercised deterministically without a live connection.
  • Docs caught up to the shipped API (4 commits, 8eb7f55 → c9c0f9a):
    • recipes/interrupt.mdx now documents voiceProceed({ interruptions: new InterruptionConfig({...}) }) — the PR's primary random-barge-in API — alongside the explicit agent({ wait: false }) + interrupt() primitive
    • adapters/pipecat.mdx capability table corrected (drops opus that the TS adapter doesn't support; source: pipecat.ts:139-140)
    • adapters/elevenlabs.mdx adds systemPromptOverride / firstMessageOverride per-session options
    • recipes/effects.mdx drops the stale "on the roadmap" callout (Python script.py:user() already accepts voice_style + audio_effects)

How it works

Module tree: 45 created + 18 modified source modules (+44 test files) under javascript/src/, each with a single responsibility, followed by how they function together

Created

javascript/src/
├── config/configure.ts ················ global scenario configuration entry point (voice config rides ScenarioConfig.voice per ADR-002 — no module-global)
├── domain/agents/agent-shapes.ts ······ narrow structural interfaces for duck-typed user-simulator capabilities
├── script/voice-steps.ts ·············· voice script steps (sleep/silence/audio/dtmf/interrupt) + voiceProceed pre-step scheduling
└── voice/
    ├── config.ts ······················ per-run VoiceConfig + resolveVoiceConfig — keystone of the voice state model
    ├── adapter.runtime.ts ············· executor-side adapter wiring: connected-state gate, defaultVoiceCall, response draining
    ├── messages.ts ···················· the SOLE AudioChunk ↔ ModelMessage gateway (unified audio-message producer)
    ├── factories.ts ··················· lowercase adapter factories (PRD §9 idiom): openAIRealtimeAgent(), pipecatAgent(), …
    ├── interruption.ts ················ InterruptionConfig for proceed({ interruptions }) — probabilistic barge-in
    ├── vad.ts ························· SDK-side VAD fallback for adapters without native VAD (one-shot warning)
    ├── judge-stt.ts ··················· judge STT pre-pass — auto-transcription seam for non-multimodal judges
    ├── transcribe.ts ·················· post-hoc STT over a VoiceRecording (fills segment transcripts)
    ├── recording.runtime.ts ··········· WAV/MP3 save + segment directory + byte-accurate JSON manifest
    ├── playback.ts ···················· local-speaker playback sink (configure({ audioPlayback }))
    ├── segment-utils.ts ··············· pure segment/timeline post-processing (byte-cursor model)
    ├── ffmpeg.ts ······················ ffmpeg binary resolution (bundled via ffmpeg-static — no system dep)
    ├── utils.ts ······················· shared voice-layer utilities
    ├── agent-shapes.ts ················ @deprecated re-export shim → domain/agents/agent-shapes
    ├── adapters/ ······················ one transport per file
    │   ├── openai-realtime.ts ········· OpenAI Realtime (agent + user roles — the model IS the agent)
    │   ├── gemini-live.ts ············· Gemini Live native-audio (real server-VAD barge-in)
    │   ├── elevenlabs.ts ·············· ElevenLabs hosted ConvAI + branded transports
    │   ├── pipecat.ts ················· WebSocket client to a user-run Pipecat bot (pcm16/mulaw/opus)
    │   ├── twilio.ts ·················· real-phone transport via Twilio Media Streams
    │   ├── twilio-server.ts ··········· TwiML webhook + WS server impersonating Twilio's edge locally
    │   ├── twilio-shared.ts ··········· shared Media-Streams primitives (one canonical copy — Gap #6)
    │   ├── twilio-tunnel.ts ··········· public-tunnel harness so Twilio can reach the local server
    │   ├── twilio-logger.ts ··········· minimal parity logger for the Twilio adapter
    │   ├── composable.ts ·············· ComposableVoiceAgent: local STT → LLM → TTS (voice-test a text agent without a transport)
    │   ├── pending-transport-error.ts · PendingTransportError for stub transports (LiveKit/Vapi/WebRTC/WS — full transports → #371)
    │   └── index.ts ··················· adapter barrel
    ├── stt/ ··························· pluggable speech-to-text
    │   ├── stt-provider.ts ············ provider contract + "provider/model" router
    │   ├── openai-stt.ts ·············· default leaf (gpt-4o-transcribe)
    │   ├── elevenlabs-stt.ts ·········· ElevenLabs Scribe leaf
    │   ├── wav.ts ····················· minimal RIFF/WAV encoder for the STT upload edge
    │   └── index.ts ··················· barrel + registration site
    ├── tts/ ··························· pluggable text-to-speech
    │   ├── tts.ts ····················· router + cache core
    │   ├── openai-tts.ts ·············· default openai/<voice> leaf
    │   ├── elevenlabs-tts.ts ·········· elevenlabs/<voiceId> leaf (Gap #10)
    │   └── index.ts ··················· barrel + registration site
    ├── effects/ ······················· user-simulator audio-effects pipeline (§4.5)
    │   ├── index.ts ··················· pipeline + public surface
    │   ├── common.ts ·················· PCM16 @ 24kHz mono bytes ↔ Int16Array helpers
    │   ├── noise.ts ··················· backgroundNoise, static, multipleVoices
    │   ├── quality.ts ················· phoneQuality, lowQuality, packetLoss, echo, robotic, breakingUp
    │   ├── prosody.ts ················· volume scaling, time-stretching
    │   └── custom.ts ·················· arbitrary user Uint8Array → Uint8Array effects
    └── assets/noise/ ·················· 5 bundled noise WAVs (airport/babble/cafe/office/street) + LICENSES.md
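
The "provider/model" router listed under stt/ can be sketched roughly as below. This is an assumption-laden illustration rather than the contents of stt-provider.ts: `parseModelSpec`, `registerStt`, `resolveStt`, and the `SttLeaf` shape are hypothetical names. Only the "provider/model" string form and the default `gpt-4o-transcribe` model come from this PR.

```typescript
interface ModelSpec {
  provider: string;
  model: string;
}

// Split on the first slash so the model part may itself contain slashes.
function parseModelSpec(spec: string): ModelSpec {
  const slash = spec.indexOf("/");
  if (slash <= 0 || slash === spec.length - 1) {
    throw new Error(`Expected "provider/model", got "${spec}"`);
  }
  return { provider: spec.slice(0, slash), model: spec.slice(slash + 1) };
}

// Hypothetical leaf shape; a real provider would expose an async transcribe().
interface SttLeaf {
  provider: string;
  model: string;
}
type SttFactory = (model: string) => SttLeaf;

// Prototype-free object so keys like "toString" never resolve to inherited members.
const sttRegistry: Record<string, SttFactory | undefined> = Object.create(null);

function registerStt(provider: string, factory: SttFactory): void {
  sttRegistry[provider] = factory;
}

function resolveStt(spec: string): SttLeaf {
  const { provider, model } = parseModelSpec(spec);
  const factory = sttRegistry[provider];
  if (!factory) {
    throw new Error(`No STT provider registered for "${provider}"`);
  }
  return factory(model);
}

// Registration site: one line per provider leaf.
registerStt("openai", (model) => ({ provider: "openai", model }));
const defaultStt = resolveStt("openai/gpt-4o-transcribe");
```

Keeping registration in one place is what lets a run swap STT providers by changing a single string.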

Modified (existing modules the voice layer hooks into)

javascript/src/
├── index.ts ··························· public surface: voice namespace + lowercase factories exported
├── runner/run.ts ······················ scenario.run() host — resolves per-run voice config, threads it to the executor
├── execution/scenario-execution.ts ···· executor core — voice turn loop, pre-step interrupted-agent scheduling, interruptOverrides test seam
├── execution/scenario-execution-state.ts  run state — carries the voice executor reference across resets
├── agents/user-simulator-agent.ts ····· voice-aware user sim — per-run TTS voice, persona, audioEffects
├── agents/judge/judge-agent.ts ········ voice-aware judge — audio auto-detect, transcript fallback, structured timeline
├── config/index.ts ···················· config barrel — ScenarioConfig.voice wiring
├── domain/agents/index.ts ············· domain barrel — UserSimulatorAgentWithVoice
├── domain/core/execution.ts ··········· core types — agent({ wait:false }) non-blocking primitive
├── domain/scenarios/index.ts ·········· scenario domain types — voice fields
├── script/index.ts ···················· script barrel — voice steps surfaced
├── utils/convert-core-messages-to-agui-messages.ts  message conversion — audio file parts → input_audio; raw pcm16 is RIFF/WAV-wrapped at this langwatch-bound boundary so the app player decodes it (`885d294`)
└── voice/ (pre-existing PR1 bases, extended)
    ├── adapter.ts ····················· VoiceAgentAdapter base + AdapterCapabilities contract (+ isConnected gate)
    ├── index.ts ······················· voice namespace barrel
    ├── messages.types.ts ·············· audio message type surface
    ├── recording.types.ts ············· VoiceRecording/segment type surface
    ├── voice-executor-state.ts ········ runtime voice state — agent-speaking event, byte cursor
    └── voice-models.ts ················ canonical model ids (OPENAI_REALTIME_MODEL, …)
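
The RIFF/WAV wrap at the langwatch-bound boundary can be sketched with the standard 44-byte PCM header. This is not the code from `885d294` or stt/wav.ts; the function name and defaults are assumptions, and the header layout is the canonical RIFF/WAVE one for uncompressed PCM with little-endian samples.

```typescript
// Hypothetical helper: prepend a canonical 44-byte RIFF/WAVE header to raw PCM16.
function wrapPcm16AsWav(pcm: Uint8Array, sampleRate = 24000, channels = 1): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8);
  const byteRate = sampleRate * blockAlign;

  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true);     // payload size in bytes
  wav.set(pcm, 44);
  return wav;
}

// Two silent samples (4 bytes) become a 48-byte WAV.
const wrapped = wrapPcm16AsWav(new Uint8Array(4));
```

A browser `<audio>` element cannot infer sample rate or width from headerless PCM, which is why the header is added before the audio leaves the SDK.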

How it functions together

scenario.run({ voice }) (runner/run.ts) resolves a per-run VoiceConfig (voice/config.ts; no module-global state, ADR-002) and hands it to the executor (execution/scenario-execution.ts), which connects the chosen transport (voice/adapters/*) through the uniform connected-state gate (voice/adapter.runtime.ts). On each user turn the simulator (agents/user-simulator-agent.ts) synthesizes speech via the TTS router (voice/tts/) and layers realism through the effects pipeline (voice/effects/ + bundled assets/noise/); audio streams over the transport and the agent's reply is drained on tail-silence windows (adapter.runtime.ts). Every chunk crosses a single gateway (voice/messages.ts) into ModelMessages carrying audio file parts, which the langwatch-bound converter (utils/convert-core-messages-to-agui-messages.ts) translates into the input_audio shape langwatch ingests. Barge-ins come from script steps (script/voice-steps.ts) or probabilistic config (voice/interruption.ts): native server-VAD where the adapter supports it, SDK fallback otherwise (voice/vad.ts), with truncation marked on the byte cursor (voice/segment-utils.ts, voice-executor-state.ts). At judgment time a pre-pass (voice/judge-stt.ts → voice/stt/) transcribes audio for non-multimodal judges (agents/judge/judge-agent.ts). The run's audio persists as full.wav + per-segment files + a byte-accurate manifest (voice/recording.runtime.ts), optionally monitored live (voice/playback.ts), and rides ScenarioResult alongside timeline + latency extensions.
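
The barge-in mechanics in this flow (a reply started without blocking, an interrupt landing mid-utterance, truncation recorded on the byte cursor) can be modelled in a few lines. This is a synchronous toy model under stated assumptions, not the executor: `InFlightReply`, `pump`, `close`, and the `Segment` shape are hypothetical, and real audio arrives asynchronously over a transport.

```typescript
interface Segment {
  role: "agent" | "user";
  bytes: number;      // byte cursor at the moment the segment closed
  truncated: boolean; // true when a barge-in cut the reply short
}

// Hypothetical stand-in for a reply started with agent({ wait: false }).
class InFlightReply {
  private delivered = 0;
  private nextChunk = 0;
  private cancelled = false;

  constructor(private readonly chunkSizes: readonly number[]) {}

  // Deliver one chunk; returns false once the reply has ended or was cancelled.
  pump(): boolean {
    const size = this.chunkSizes[this.nextChunk];
    if (this.cancelled || size === undefined) return false;
    this.delivered += size;
    this.nextChunk += 1;
    return true;
  }

  // Models the transport's native cancel firing on a barge-in.
  interrupt(): void {
    this.cancelled = true;
  }

  close(): Segment {
    const unsent = this.nextChunk < this.chunkSizes.length;
    return { role: "agent", bytes: this.delivered, truncated: this.cancelled && unsent };
  }
}

// Five chunks of 4,800 bytes: 0.1 s each at PCM16 / 24 kHz mono.
const chunks = [4800, 4800, 4800, 4800, 4800];

const interrupted = new InFlightReply(chunks);
interrupted.pump();
interrupted.pump();
interrupted.interrupt(); // the user barges in mid-utterance
interrupted.pump();      // late audio is no longer accepted
const cutOff = interrupted.close();

const uninterrupted = new InFlightReply(chunks);
while (uninterrupted.pump()) {
  // drain to the end
}
const full = uninterrupted.close();
```

Here `cutOff` closes at 9,600 bytes with `truncated: true`, while `full` closes at 24,000 bytes with `truncated: false`; a cut-off segment being both marked and shorter is the pair of properties the deterministic assertions check.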

Asset parity: the 5 noise WAVs are byte-identical between javascript/src/voice/assets/noise/ and python/scenario/voice/assets/noise/ (md5-verified 2026-06-04, 144,044 B each), produced by the single deterministic generator javascript/scripts/generate-noise-samples.mjs; both sides carry LICENSES.md.

File-org corrections (from audit)

Three PR-internal file-org corrections landed late in the PR after a structure audit:

  • scripts/generate-noise-samples.mjs → javascript/scripts/ (TS-only generator belongs in package scripts, not repo-root cross-language scripts)
  • docs/voice/internal-design.md → docs/adr/003-voice-internal-design.md (it's an ADR by description; lives with ADR-001/002)
  • javascript/examples/vitest/recordings/ → javascript/examples/vitest/outputs/recordings/ (semantic clarity + future-proof for traces/logs siblings)

Pre-existing-on-main cleanup landed separately as PR #586 (deletes orphan docs, publishes happy-path guides, folds capability-matrix duplication, python recordings → outputs/recordings rename, python noise-sample parity refresh). After #586 merges, this branch rebases onto the cleaned main.

Test plan

  • pnpm -F @langwatch/scenario test → 791 pass / 1 skipped (incl. the new interrupt-truncation, noise-energy, byte-cursor, proceed-loop pre-step, and Gemini Live spurious-pair unit tests).
  • @ts-e2e round-trip gate (real keys) green; tsc --noEmit, build:all, lint:all, typecheck:all clean.
  • Regression guard: the interrupt clock-mismatch fix is covered by a unit test exercising divergent cursor-vs-wall-clock times (the prior same-scale test masked the bug).
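
For a sense of what a noise-energy assertion of the "noise-floor ≫ silence" kind can look like, here is a minimal sketch. The helper names, the 10× ratio, and the synthetic signals are assumptions for illustration; the branch's actual unit tests may measure energy differently.

```typescript
// Root-mean-square amplitude of PCM16 samples.
function rmsPcm16(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] ?? 0;
    sumSquares += s * s;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical gate: mixed audio must sit well above the silence floor.
// The floor is clamped to 1 so digital silence (RMS 0) still gives a usable bar.
function isWellAboveSilence(mixed: Int16Array, silence: Int16Array, minRatio = 10): boolean {
  const floor = Math.max(rmsPcm16(silence), 1);
  return rmsPcm16(mixed) >= floor * minRatio;
}

// Deterministic pseudo-noise (linear congruential generator) so the check is repeatable.
function syntheticNoise(length: number, amplitude: number, seed: number): Int16Array {
  const out = new Int16Array(length);
  let state = seed;
  for (let i = 0; i < length; i++) {
    state = (state * 1664525 + 1013904223) % 4294967296;
    out[i] = Math.round((state / 4294967296 - 0.5) * 2 * amplitude);
  }
  return out;
}

const silence = new Int16Array(2400);              // 0.1 s of digital silence at 24 kHz
const ambience = syntheticNoise(2400, 1000, 42);   // stand-in for mixed cafe noise
const ambienceIsAudible = isWellAboveSilence(ambience, silence);
```

A deterministic energy check like this covers the audio half that a transcript-reading judge cannot see.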

How I can prove it works

16 committed demo recordings (javascript/examples/vitest/outputs/recordings/<demo>/full.wav + byte-accurate manifest.json), generated against live providers + the bundled Pipecat bot. The ones demonstrating the headline behaviors (open the blob → GitHub shows an audio player):

Behavior Demo
Reply cut off mid-sentence, then recovers (judge hard-gated) interruption_recovery
Probabilistic barge-ins via inline-TTS + canned-phrase strategy random_interruptions
Server-VAD barge-in on Gemini Live (real mid-stream cut-off) gemini_live_interruption
Audible anger (ElevenLabs tonal markers) + cafe noise angry_customer
Genuine engagement, not a canned greeting basic_greeting

Full set (16) also covers the adapters (openai_realtime ×2, elevenlabs hosted/branded, gemini_live), composable_stt_swap, recording_playback, voice_text_parity, pipecat ×2, background_handoff.

Anything surprising

🤖 Generated with Claude Code


Closes (574-585 grind — landed in this PR)

Closes #574 #575 #576 #578 #579 #580 #581 #582 #583 #584 #585

All 11 follow-up issues from the post-review NIT batch were addressed in this PR via 60+ commits since 4d83724. Per-issue close-out comments are on each issue; partial outcomes for #580 (Gemini adapter improved, demo workaround retained) and #583 (adapter dequeue race fixed, transport switch reverted) are documented honestly there.

This was referenced May 27, 2026
@drewdrewthis
Collaborator Author

🎧 Voice demo recordings — click to listen

Real audio captured from each demo's scenario.run() (committed under javascript/recordings/<demo>/). Click to play in a browser tab:

Demo Listen
OpenAI Realtime — agent ▶ play
OpenAI Realtime — user-sim ▶ play
ElevenLabs — hosted ConvAI ▶ play
ElevenLabs — branded/composable ▶ play
Gemini Live ▶ play
Composable (STT swap) ▶ play
Pipecat (WebSocket) ▶ play
Recording + playback ▶ play
Voice ↔ text parity ▶ play

Embedded player: GitHub renders a player only for files uploaded as comment attachments — drag a .wav into a reply box and it embeds inline. Per-segment WAVs + manifest.json live in each demo folder.


@drewdrewthis drewdrewthis mentioned this pull request May 27, 2026
@drewdrewthis drewdrewthis changed the title from "feat(typescript-sdk/#372): voice agent testing — consolidated clean stack" to "feat(typescript-sdk): voice agent testing — consolidated clean stack" May 28, 2026
drewdrewthis and others added 6 commits June 4, 2026 10:21
…s → outputs

User feedback: "recordings" describes the file format; "outputs"
describes the purpose (these dirs hold what the example tests
produced). The helper that writes here keeps its name
(saveDemoRecording) — it still SAVES a recording, the recording is
just NAMED an output now.

Updates the writing helper's RECORDINGS_ROOT to point at outputs/,
all test-file doc-comment path refs, the recordings README (title,
intro, GitHub blob URL example, section header), .gitignore patterns,
the voice-integration CI workflow's upload path, TESTING.md fixture
paths, and fixes the (pre-existing) broken link in javascript/README.md
that pointed at ./recordings/README.md.

Python's python/recordings/ stays for now; renaming there is a
follow-up issue (filed separately).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t types

User feedback: outputs/ should be a parent for all test-run artifact
types (recordings now, traces/logs/screenshots later). Moves every demo
into outputs/recordings/<demo>/ and adds a new thin outputs/README.md
that documents the artifact-parent shape. The rich audio policy /
per-demo coverage table stays where it belongs at
outputs/recordings/README.md.

Writer (tests/voice/helpers/save-demo-recording.ts) updated:
RECORDINGS_ROOT now resolves to .../outputs/recordings/, so newly
written recordings land in the new shape without further changes.

Other ref updates:
- .gitignore: every committed-demo whitelist + segments re-ignore moved
  under outputs/recordings/, plus a sibling re-include for the new
  outputs/README.md.
- .github/workflows/javascript-voice-integration.yml: upload-artifact
  path → outputs/recordings/**.
- javascript/README.md: doc link → outputs/recordings/README.md.
- TESTING.md: footprint paths + du command.
- All @e2e demo test docstrings (15 files): "Recording lands in
  outputs/recordings/<demo>/".

Sanity: typecheck PASS, build PASS, tests 791/792 PASS (1 pre-existing
skip, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed
/ PR #492) lists `interruption` in the adapter-capabilities declaration.
The vitest-cucumber binding at voice-contract-surface.test.ts:177 still
had the pre-71dd5ed step title (missing `interruption`), so StepAble
couldn't find the matching feature step.

Update the step title to match the feature file and add the live-adapter
`typeof caps.interruption === "boolean"` check (the empty-adapter check
on line 192 already exists).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts to input_audio shape

The TS SDK was pre-stringifying message content arrays in the AG-UI
conversion (convertModelMessagesToAguiMessages), which had two
consequences for voice runs:

1. The langwatch ingest content-extractor walks `content` only when it
   is an ARRAY of parts. JSON.stringify happens BEFORE the POST, so the
   extractor saw a string and never recursed → inline base64 audio
   bytes flowed straight through.

2. The extractor's array walker handles the OpenAI Realtime `input_audio`
   shape but not the canonical AI-SDK `file`+`audio/*` shape that
   `createAudioMessage` emits, so even an array would have been a no-op.

End-to-end consequence: voice runs persisted full base64 PCM16 audio
inline in ClickHouse Messages.Content. The simulations list query
(`getSuiteRunData`) slurped the first 6 messages' Content back per
scenario — a single voice scenario set returned 90+ MB.

This commit:
- Stops pre-stringifying user/assistant array content. The langwatch
  ingest schema (`chatMessageSchema.content`) accepts arrays via
  `union(string, array(chatRichContent))`, so the wire contract is
  preserved. AG-UI's stricter `string`-only typing is bypassed with a
  cast at the conversion boundary (single point, well-commented).
- Translates AI-SDK `{type:"file", mediaType:"audio/*", data:"<b64>"}`
  parts into the OpenAI Realtime
  `{type:"input_audio", input_audio:{data, format, mimeType}}` shape so
  the langwatch extractor's existing inputAudio handler externalises the
  bytes to stored-objects.
- Collapses pure single-text-part arrays back to a plain string to keep
  the preview payload compact for the list view.

Tests updated to assert the new contract (array passthrough +
input_audio translation + non-audio file-part passthrough).
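
A rough sketch of the conversion contract described above (array passthrough, audio file parts translated to input_audio, single text parts collapsed). It is not the body of convertModelMessagesToAguiMessages: the `toWireContent` name and part types are hypothetical, and deriving `format` from the media-type subtype is an assumption.

```typescript
interface TextPart {
  type: "text";
  text: string;
}
interface FilePart {
  type: "file";
  mediaType: string;
  data: string; // base64
}
interface InputAudioPart {
  type: "input_audio";
  input_audio: { data: string; format: string; mimeType: string };
}
type SdkPart = TextPart | FilePart;
type WirePart = TextPart | FilePart | InputAudioPart;

function toWireContent(parts: readonly SdkPart[]): string | WirePart[] {
  // A lone text part collapses to a plain string to keep list previews compact.
  const first = parts[0];
  if (parts.length === 1 && first !== undefined && first.type === "text") {
    return first.text;
  }
  // Otherwise keep the array (no pre-stringifying) and translate audio file parts.
  return parts.map((part): WirePart => {
    if (part.type === "file" && part.mediaType.indexOf("audio/") === 0) {
      return {
        type: "input_audio",
        input_audio: {
          data: part.data,
          format: part.mediaType.slice("audio/".length), // assumption: subtype as format
          mimeType: part.mediaType,
        },
      };
    }
    return part; // text and non-audio file parts pass through unchanged
  });
}

const preview = toWireContent([{ type: "text", text: "hello" }]);
const voiceTurn = toWireContent([
  { type: "text", text: "hello" },
  { type: "file", mediaType: "audio/pcm16", data: "AAAA" },
  { type: "file", mediaType: "image/png", data: "BBBB" },
]);
```

Because the content stays an array on the wire, the ingest extractor can recurse into it and externalise the audio bytes instead of storing them inline.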

Companion langwatch backend changes (separate repo PRs):
- Add a `file`-part branch to the content-extractor visitor (defence in
  depth for any future SDK that emits the AI-SDK file shape).
- Cap Messages.Content size in the simulation-run projection so a
  misbehaving SDK can never again turn into a 90 MB list-page response.

Main's f716e46 added the pnpm override (CWE-502 bump) to package.json;
the branch lockfile predated it, so CI died on
ERR_PNPM_LOCKFILE_CONFIG_MISMATCH before running any tests. Regenerated
via pnpm install --no-frozen-lockfile; --frozen-lockfile now exits 0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ble + hosted URL

Two stale doc-contract expectations, pre-existing on the branch tip but
never caught: CI aborted on the lockfile mismatch before the suite ran,
and the author's 791-pass count predates the docs restructure alignment.

- voice-steps: the UnsupportedCapabilityError message points at the
  hosted docs URL (scenario-docs.langwatch.ai/voice/capability-matrix),
  same as Python's capabilities.py — the test still expected the old
  repo-relative .md path. The sibling assertion in
  voice-contract-surface already used the hosted URL.
- voice-contract-surface: the capability rows now live in the
  auto-generated _generated/voice/capability-matrix.mdx imported by the
  wrapper page; assert the underscore column keys the feature step
  actually names (streaming_transcripts, native_vad, dtmf,
  input_formats, output_formats) across wrapper + generated content.

Suite: 796 pass / 1 skip / 0 fail; build:all + tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
drewdrewthis and others added 3 commits June 4, 2026 16:15
The OpenAI Realtime adapter emitted the model's audio as a raw
`audio/pcm16` file part. PCM16 is headerless, so the LangWatch
simulations UI (and any browser `<audio>`) could not decode it and
rendered an `[error]` badge instead of an inline player.

WAV-wrap the PCM before persisting and emit `audio/wav`, reusing the
existing `encodeWav` (now exported). Mirrors the Python twin
(`python/scenario/voice/messages.py`, which already emits `format: wav`)
— this was a TS-vs-Python parity gap, not a wire-protocol issue.

Adds a ResponseFormatter unit test asserting the emitted part is
`audio/wav` with a valid RIFF/WAVE header.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rs a player

Supersedes 5db36c8, which wrapped at response-formatter.ts — the wrong
layer: it only touched the realtime-agent path and missed the user-simulator
audio (both speakers showed `audio/pcm16 [error]` in the simulations UI).

The SDK deliberately carries in-message audio as raw headerless PCM16 (one
encoder/extractor in voice/messages.ts; the WAV-vs-pcm16 disagreement was a
prior live bug — keep it closed). So wrap ONLY at the langwatch-bound
converter (convert-core-messages-to-agui-messages.ts): raw `audio/pcm16`
file parts become `audio/wav` + `format:"wav"` with a RIFF container, so a
browser `<audio>` can decode them. Matches the Python twin's shipped shape
(voice/messages.py -> format:"wav"). SDK-internal raw-PCM16 contract is
untouched (readers never see this conversion).

Reverts the response-formatter.ts / recording.runtime.ts changes from
5db36c8; removes the now-moot response-formatter unit test.

Verified end-to-end: live openai-realtime run -> langwatch /api/files serves
content-type audio/wav, RIFF/WAVE header, ffprobe pcm_s16le/24kHz/3.25s
(was audio/pcm16, undecodable).
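
For illustration, a WAV wrap of headerless PCM16 can be sketched as below. This is not the repo's `encodeWav`; `wrapPcm16AsWav` is a hypothetical name, and the defaults (24 kHz, mono, 16-bit) are taken from the AudioChunk contract params mentioned in this PR. It prepends the standard 44-byte RIFF/WAVE header to the raw samples.

```typescript
// Wrap raw little-endian PCM16 bytes in a 44-byte RIFF/WAVE header.
function wrapPcm16AsWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // RIFF chunk size = file size - 8
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for PCM
  view.setUint16(20, 1, true); // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, "data");
  view.setUint32(40, pcm.length, true); // data chunk size
  out.set(pcm, 44);
  return out;
}
```

Because the header carries the sample rate and bit depth, a browser `<audio>` element has what it needs to decode the bytes, which raw `audio/pcm16` does not provide.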

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y WAV wrap

885d294 moved the WAV wrap to the langwatch-bound converter but left the
pcm16-passthrough expectation pinned in message-conversion.test.ts (the
only red in CI run 26958577920: 1 failed / 795 passed). Pin the
deterministic wrapped shape instead: RIFF/WAV container at the AudioChunk
contract params (24kHz mono 16-bit), format "wav", mimeType audio/wav —
matching the Python twin and the commit's verified e2e behavior.

Local: file 10/10; full suite 796 pass / 1 skip; tsc clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot commented Jun 4, 2026

Automated low-risk assessment

This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.

This PR's diff could not be evaluated automatically: Diff too large to fetch via GitHub API (fetch error: could not find pull request diff: HTTP 406: Sorry, the diff exceeded the maximum number of lines (20000) (https://api.github.com/repos/langwatch/scenario/pulls/561)
PullRequest.diff too_large). Manual review required.

This PR requires a manual review before merging.

@drewdrewthis drewdrewthis merged commit 5847c4b into main Jun 4, 2026
19 checks passed
@drewdrewthis drewdrewthis deleted the voice/372-refactor branch June 4, 2026 15:10
drewdrewthis added a commit that referenced this pull request Jun 4, 2026
#604 reality

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
drewdrewthis added a commit that referenced this pull request Jun 5, 2026
…t-gen (#610)

* docs(voice/#606): expand STT/TTS doc comments and relax audio-to-text judge criteria

Adds deliberate-choice rationale comments to OPENAI_STT_MODEL and
OPENAI_TTS_MODEL in both JS (voice-models.ts) and Python (voice_models.py),
noting no gpt-5-family transcription/TTS models exist on the public API as
of 2026-06. Also documents the Python-only OPENAI_BOT_STT_MODEL gap in the
TS file. Relaxes the multimodal-audio-to-text judge criteria from
overly-specific assertions (exact voice gender, exact repeat phrasing) to
behavioural checks (processed audio, coherent response, non-text format
acknowledgement). Updates the stale skip comment to reflect the model swap
in PR #607.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(voice/#606): update feature-file contract counts to match post-#561/#604 reality

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(voice/#606): add AC4/AC5 doc comments — STT lock rationale + TTS callable-swap pattern

- openai-realtime.ts: explain why `input.transcription.model` is locked to
  OPENAI_STT_MODEL and not exposed as a constructor option (Realtime API
  only accepts transcription-class models; callers who need a different model
  subclass the adapter)
- openai-tts.ts: document that the TTS model is not a parameter by design —
  the pattern is to swap the whole TTSCallable rather than parameterise this
  one; link to OPENAI_TTS_MODEL for the current-gen rationale

Closes #606 (AC4 + AC5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs(examples/voice/#606): correct stale comment — model swap + unskip are in #607, not this branch

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
drewdrewthis added a commit that referenced this pull request Jun 5, 2026
The voice-to-voice example helper and the audio-to-text example pinned
`gpt-4o-audio-preview`, which OpenAI has removed (404 model_not_found
since 2026-05-19). Any user running the canonical voice example hit an
immediate 404.

Switch to `gpt-audio-mini` — OpenAI's current cost-efficient GA
audio-chat model — matching the Python twin, which already migrated
(python/scenario/config/voice_models.py:44 OPENAI_AUDIO_CHAT_MODEL,
python/examples/test_audio_to_text.py:157). Verified live: gpt-audio-mini
accepts the identical chat.completions shape (modalities:["text","audio"],
audio:{voice,format}) and returns audio. Re-ran the voice-to-voice e2e
against prod LangWatch — success: true, real 2-turn conversation, traces
landed (project_bZspxwkhCD4POvqmIgOr2).

SDK core was unaffected (OpenAIRealtimeAgentAdapter uses gpt-realtime-mini).
This closes a py↔ts example-parity gap left by #561.
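
As a rough sketch, the request shape described above (`modalities:["text","audio"]`, `audio:{voice,format}`) could be built like this. `buildAudioChatRequest` and `AudioChatRequest` are hypothetical names, and the `"alloy"` voice and `"wav"` format are assumptions chosen for illustration; the examples in the repo may use different values. The function only builds the body and does not call the API.

```typescript
// Hypothetical, simplified request body for an audio-capable chat completion.
type AudioChatRequest = {
  model: string;
  modalities: Array<"text" | "audio">;
  audio: { voice: string; format: string };
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
};

// Build the body only; sending it is left to whichever client the example uses.
function buildAudioChatRequest(
  prompt: string,
  model = "gpt-audio-mini",
): AudioChatRequest {
  return {
    model,
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" }, // assumed values
    messages: [{ role: "user", content: prompt }],
  };
}
```

Keeping the model name as a defaulted parameter means a future model removal needs a change in one place only.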

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>