feat(typescript-sdk): voice agent testing - consolidated clean stack #561
…s → outputs User feedback: "recordings" describes the file format; "outputs" describes the purpose (these dirs hold what the example tests produced). The helper that writes here keeps its name (saveDemoRecording) — it still SAVES a recording, the recording is just NAMED an output now. Updates the writing helper's RECORDINGS_ROOT to point at outputs/, all test-file doc-comment path refs, the recordings README (title, intro, GitHub blob URL example, section header), .gitignore patterns, the voice-integration CI workflow's upload path, TESTING.md fixture paths, and fixes the (pre-existing) broken link in javascript/README.md that pointed at ./recordings/README.md. Python's python/recordings/ stays for now; renaming there is a follow-up issue (filed separately). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t types
User feedback: outputs/ should be a parent for all test-run artifact types (recordings now, traces/logs/screenshots later). Moves every demo into outputs/recordings/<demo>/ and adds a new thin outputs/README.md that documents the artifact-parent shape. The rich audio policy / per-demo coverage table stays where it belongs at outputs/recordings/README.md.
Writer (tests/voice/helpers/save-demo-recording.ts) updated: RECORDINGS_ROOT now resolves to .../outputs/recordings/, so newly written recordings land in the new shape without further changes.
Other ref updates:
- .gitignore: every committed-demo whitelist + segments re-ignore moved under outputs/recordings/, plus a sibling re-include for the new outputs/README.md.
- .github/workflows/javascript-voice-integration.yml: upload-artifact path → outputs/recordings/**.
- javascript/README.md: doc link → outputs/recordings/README.md.
- TESTING.md: footprint paths + du command.
- All @e2e demo test docstrings (15 files): "Recording lands in outputs/recordings/<demo>/".
Sanity: typecheck PASS, build PASS, tests 791/792 PASS (1 pre-existing skip, unrelated).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Feature file `specs/voice-agents.feature:971` (added by main commit 71dd5ed / PR #492) lists `interruption` in the adapter-capabilities declaration. The vitest-cucumber binding at voice-contract-surface.test.ts:177 still had the pre-71dd5ed step title (missing `interruption`), so StepAble couldn't find the matching feature step. Update the step title to match the feature file and add the live-adapter `typeof caps.interruption === "boolean"` check (the empty-adapter check on line 192 already exists). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ts to input_audio shape
The TS SDK was pre-stringifying message content arrays in the AG-UI
conversion (convertModelMessagesToAguiMessages), which had two
consequences for voice runs:
1. The langwatch ingest content-extractor walks `content` only when it
is an ARRAY of parts. JSON.stringify happens BEFORE the POST, so the
extractor saw a string and never recursed → inline base64 audio
bytes flowed straight through.
2. The extractor's array walker handles the OpenAI Realtime `input_audio`
shape but not the canonical AI-SDK `file`+`audio/*` shape that
`createAudioMessage` emits, so even an array would have been a no-op.
End-to-end consequence: voice runs persisted full base64 PCM16 audio
inline in ClickHouse Messages.Content. The simulations list query
(`getSuiteRunData`) slurped the first 6 messages' Content back per
scenario — a single voice scenario set returned 90+ MB.
This commit:
- Stops pre-stringifying user/assistant array content. The langwatch
ingest schema (`chatMessageSchema.content`) accepts arrays via
`union(string, array(chatRichContent))`, so the wire contract is
preserved. AG-UI's stricter `string`-only typing is bypassed with a
cast at the conversion boundary (single point, well-commented).
- Translates AI-SDK `{type:"file", mediaType:"audio/*", data:"<b64>"}`
parts into the OpenAI Realtime
`{type:"input_audio", input_audio:{data, format, mimeType}}` shape so
the langwatch extractor's existing inputAudio handler externalises the
bytes to stored-objects.
- Collapses pure single-text-part arrays back to a plain string to keep
the preview payload compact for the list view.
Tests updated to assert the new contract (array passthrough +
input_audio translation + non-audio file-part passthrough).
Companion langwatch backend changes (separate repo PRs):
- Add a `file`-part branch to the content-extractor visitor (defence in
depth for any future SDK that emits the AI-SDK file shape).
- Cap Messages.Content size in the simulation-run projection so a
misbehaving SDK can never again turn into a 90 MB list-page response.
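The three conversion rules in this commit (array passthrough, `file`+`audio/*` to `input_audio` translation, single-text-part collapse) can be sketched as below. This is a minimal illustration, not the SDK's converter: the type names and the `toWirePart` / `toWireContent` helpers are hypothetical, and deriving `format` from the media-type suffix is an assumption made for the example.

```typescript
// Hypothetical part types for illustration; the SDK's real types may differ.
type TextPart = { type: "text"; text: string };
type FilePart = { type: "file"; mediaType: string; data: string }; // data is base64
type InputAudioPart = {
  type: "input_audio";
  input_audio: { data: string; format: string; mimeType: string };
};
type InPart = TextPart | FilePart;
type OutPart = TextPart | FilePart | InputAudioPart;

// Translate one part: audio/* file parts become input_audio,
// everything else (text, non-audio files) passes through unchanged.
function toWirePart(part: InPart): OutPart {
  if (part.type === "file" && part.mediaType.startsWith("audio/")) {
    // Assumption: the format label is the media-type suffix, e.g. "pcm16" or "wav".
    const format = part.mediaType.slice("audio/".length);
    return {
      type: "input_audio",
      input_audio: { data: part.data, format, mimeType: part.mediaType },
    };
  }
  return part;
}

// Keep arrays as arrays (no JSON.stringify before the POST),
// but collapse a lone text part back to a plain string.
function toWireContent(content: string | InPart[]): string | OutPart[] {
  if (typeof content === "string") return content;
  const first = content[0];
  if (content.length === 1 && first?.type === "text") return first.text;
  return content.map(toWirePart);
}
```

Because the content stays an array on the wire, a server-side walker that only recurses into arrays gets the chance to see the `input_audio` part and externalise its bytes.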
Main's f716e46 added the pnpm override (CWE-502 bump) to package.json; the branch lockfile predated it, so CI died on ERR_PNPM_LOCKFILE_CONFIG_MISMATCH before running any tests. Regenerated via pnpm install --no-frozen-lockfile; --frozen-lockfile now exits 0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ble + hosted URL
Two stale doc-contract expectations, pre-existing on the branch tip but never caught: CI aborted on the lockfile mismatch before the suite ran, and the author's 791-pass count predates the docs restructure alignment.
- voice-steps: the UnsupportedCapabilityError message points at the hosted docs URL (scenario-docs.langwatch.ai/voice/capability-matrix), same as Python's capabilities.py; the test still expected the old repo-relative .md path. The sibling assertion in voice-contract-surface already used the hosted URL.
- voice-contract-surface: the capability rows now live in the auto-generated _generated/voice/capability-matrix.mdx imported by the wrapper page; assert the underscore column keys the feature step actually names (streaming_transcripts, native_vad, dtmf, input_formats, output_formats) across wrapper + generated content.
Suite: 796 pass / 1 skip / 0 fail; build:all + tsc clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
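A minimal sketch of the pattern this commit's tests pin down: a capability record keyed by the underscore column names, and an error whose message points at the hosted capability matrix. The interface shape, the message wording, the `requireCapability` helper, and the `https://` scheme on the URL are assumptions for illustration; only the class name, the key names, and the URL path come from the commit text.

```typescript
// Hypothetical capability record; key names mirror the columns named in the commit.
interface VoiceCapabilities {
  streaming_transcripts: boolean;
  native_vad: boolean;
  dtmf: boolean;
  interruption: boolean;
  input_formats: string[];
  output_formats: string[];
}

type BooleanCapability = "streaming_transcripts" | "native_vad" | "dtmf" | "interruption";

// URL path is from the commit message; the scheme is assumed.
const CAPABILITY_DOCS_URL = "https://scenario-docs.langwatch.ai/voice/capability-matrix";

class UnsupportedCapabilityError extends Error {
  constructor(adapter: string, capability: BooleanCapability) {
    // Message wording is illustrative, not the SDK's actual text.
    super(`${adapter} does not support "${capability}". See ${CAPABILITY_DOCS_URL}`);
    this.name = "UnsupportedCapabilityError";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
  }
}

// Throws when the adapter declares the capability as unsupported.
function requireCapability(
  adapter: string,
  caps: VoiceCapabilities,
  capability: BooleanCapability,
): void {
  if (!caps[capability]) throw new UnsupportedCapabilityError(adapter, capability);
}
```

Asserting on the hosted URL rather than a repo-relative path keeps the test stable when the docs tree is restructured, which is the failure mode the commit describes.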
Force-push: 180bab4 to f2cdf58
The OpenAI Realtime adapter emitted the model's audio as a raw `audio/pcm16` file part. PCM16 is headerless, so the LangWatch simulations UI (and any browser `<audio>`) could not decode it and rendered an `[error]` badge instead of an inline player. WAV-wrap the PCM before persisting and emit `audio/wav`, reusing the existing `encodeWav` (now exported). Mirrors the Python twin (`python/scenario/voice/messages.py`, which already emits `format: wav`) — this was a TS-vs-Python parity gap, not a wire-protocol issue. Adds a ResponseFormatter unit test asserting the emitted part is `audio/wav` with a valid RIFF/WAVE header. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rs a player Supersedes 5db36c8, which wrapped at response-formatter.ts — the wrong layer: it only touched the realtime-agent path and missed the user-simulator audio (both speakers showed `audio/pcm16 [error]` in the simulations UI). The SDK deliberately carries in-message audio as raw headerless PCM16 (one encoder/extractor in voice/messages.ts; the WAV-vs-pcm16 disagreement was a prior live bug — keep it closed). So wrap ONLY at the langwatch-bound converter (convert-core-messages-to-agui-messages.ts): raw `audio/pcm16` file parts become `audio/wav` + `format:"wav"` with a RIFF container, so a browser `<audio>` can decode them. Matches the Python twin's shipped shape (voice/messages.py -> format:"wav"). SDK-internal raw-PCM16 contract is untouched (readers never see this conversion). Reverts the response-formatter.ts / recording.runtime.ts changes from 5db36c8; removes the now-moot response-formatter unit test. Verified end-to-end: live openai-realtime run -> langwatch /api/files serves content-type audio/wav, RIFF/WAVE header, ffprobe pcm_s16le/24kHz/3.25s (was audio/pcm16, undecodable). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…y WAV wrap 885d294 moved the WAV wrap to the langwatch-bound converter but left the pcm16-passthrough expectation pinned in message-conversion.test.ts (the only red in CI run 26958577920: 1 failed / 795 passed). Pin the deterministic wrapped shape instead: RIFF/WAV container at the AudioChunk contract params (24kHz mono 16-bit), format "wav", mimeType audio/wav — matching the Python twin and the commit's verified e2e behavior. Local: file 10/10; full suite 796 pass / 1 skip; tsc clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
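For readers unfamiliar with why headerless PCM16 cannot be played by a browser `<audio>` element, here is a minimal sketch of a RIFF/WAVE wrap at the contract parameters named above (24 kHz, mono, 16-bit). It is not the SDK's `encodeWav`; `wrapPcm16AsWav` and `pcm16DurationSeconds` are hypothetical helpers written for this illustration.

```typescript
// Wrap raw little-endian PCM16 samples in a canonical 44-byte RIFF/WAVE header.
function wrapPcm16AsWav(pcm: Uint8Array, sampleRate = 24000, channels = 1): Uint8Array {
  const bitsPerSample = 16;
  const blockAlign = channels * (bitsPerSample / 8); // bytes per sample frame
  const byteRate = sampleRate * blockAlign;
  const out = new Uint8Array(44 + pcm.length);
  const view = new DataView(out.buffer);
  const ascii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) view.setUint8(offset + i, text.charCodeAt(i));
  };

  ascii(0, "RIFF");
  view.setUint32(4, 36 + pcm.length, true); // file size minus the first 8 bytes
  ascii(8, "WAVE");
  ascii(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size for plain PCM
  view.setUint16(20, 1, true); // audio format 1 = uncompressed PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  ascii(36, "data");
  view.setUint32(40, pcm.length, true); // payload size in bytes
  out.set(pcm, 44);
  return out;
}

// Duration of a raw PCM16 payload: bytes / (sampleRate * channels * 2 bytes per sample).
function pcm16DurationSeconds(byteLength: number, sampleRate = 24000, channels = 1): number {
  return byteLength / (sampleRate * channels * 2);
}
```

The same arithmetic explains the figures quoted in the PR description: a 249,600-byte pcm16 payload at 24 kHz mono is 249,600 / 48,000 = 5.20 s.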
Automated low-risk assessment This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk.
This PR requires a manual review before merging. |
#604 reality Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t-gen (#610)
* docs(voice/#606): expand STT/TTS doc comments and relax audio-to-text judge criteria
Adds deliberate-choice rationale comments to OPENAI_STT_MODEL and OPENAI_TTS_MODEL in both JS (voice-models.ts) and Python (voice_models.py), noting no gpt-5-family transcription/TTS models exist on the public API as of 2026-06. Also documents the Python-only OPENAI_BOT_STT_MODEL gap in the TS file. Relaxes the multimodal-audio-to-text judge criteria from overly-specific assertions (exact voice gender, exact repeat phrasing) to behavioural checks (processed audio, coherent response, non-text format acknowledgement). Updates the stale skip comment to reflect the model swap in PR #607.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(voice/#606): update feature-file contract counts to match post-#561/#604 reality
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(voice/#606): add AC4/AC5 doc comments: STT lock rationale + TTS callable-swap pattern
- openai-realtime.ts: explain why `input.transcription.model` is locked to OPENAI_STT_MODEL and not exposed as a constructor option (Realtime API only accepts transcription-class models; callers who need a different model subclass the adapter)
- openai-tts.ts: document that the TTS model is not a parameter by design; the pattern is to swap the whole TTSCallable rather than parameterise this one; link to OPENAI_TTS_MODEL for the current-gen rationale
Closes #606 (AC4 + AC5)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs(examples/voice/#606): correct stale comment: model swap + unskip are in #607, not this branch
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
The voice-to-voice example helper and the audio-to-text example pinned
`gpt-4o-audio-preview`, which OpenAI has removed (404 model_not_found
since 2026-05-19). Any user running the canonical voice example hit an
immediate 404.
Switch to `gpt-audio-mini` — OpenAI's current cost-efficient GA
audio-chat model — matching the Python twin, which already migrated
(python/scenario/config/voice_models.py:44 OPENAI_AUDIO_CHAT_MODEL,
python/examples/test_audio_to_text.py:157). Verified live: gpt-audio-mini
accepts the identical chat.completions shape (modalities:["text","audio"],
audio:{voice,format}) and returns audio. Re-ran the voice-to-voice e2e
against prod LangWatch — success: true, real 2-turn conversation, traces
landed (project_bZspxwkhCD4POvqmIgOr2).
SDK core was unaffected (OpenAIRealtimeAgentAdapter uses gpt-realtime-mini).
This closes a py↔ts example-parity gap left by #561.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
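The request shape this commit says carried over unchanged (`modalities:["text","audio"]`, `audio:{voice,format}`) looks roughly like the sketch below. The `AudioChatRequest` type and `buildAudioChatRequest` helper are hypothetical, and the `"alloy"` voice and `"wav"` format are assumed values chosen for the example; the example helper in the repo may use different ones.

```typescript
// Hypothetical request-body type for an audio-capable chat.completions call.
type AudioChatRequest = {
  model: string;
  modalities: Array<"text" | "audio">;
  audio: { voice: string; format: string };
  messages: Array<{ role: "system" | "user" | "assistant"; content: string }>;
};

// Build the body only; sending it (SDK client or fetch) is left to the caller.
function buildAudioChatRequest(userText: string, model = "gpt-audio-mini"): AudioChatRequest {
  return {
    model,
    modalities: ["text", "audio"],
    audio: { voice: "alloy", format: "wav" }, // assumed voice and format
    messages: [{ role: "user", content: userText }],
  };
}
```

Keeping the model name as a single defaulted parameter means a future model retirement is a one-line change, which is the parity gap this commit closes.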
Why
Closes #372 (epic #370). The TypeScript voice-agent-testing port had fragmented into 10 flat-sibling PRs (#513/#515/#528/#534–540) carrying pre-EDR design drift. This replaces all of them with one clean stack rebuilt against main, conforming to the EDR (#560 / ADR-002) and the decided public API (PRD). Python is the reference implementation; this brings TS to parity.
Acceptance Criteria
The behavioral contract is specs/voice-agents.feature: 127 scenarios ported from the Python source-of-truth (79 @unit, 14 @integration, 39 @e2e, 2 @todo). Collapsed to the headline ACs below, each ticked only with evidence on this branch. "Done" = (1) the ci-checks gate (units + build + non-bot @ts-e2e) green, (2) the committed demo recordings present, and (3) the app-side ACs below (AC13–18) verified: a voice run correctly ingested, queryable, and rendered in the langwatch app. SDK-green alone is necessary but not sufficient.
Status legend: ✅ = evidence verified in-tree/observed · ⏳ = implemented + author-reported green locally, not yet confirmed by CI · ◻️ = open. (2026-06-04: the unit suite now runs in CI: ci-checks (24.x) green on run 26960103187 after the rebase + lockfile fix; formerly-⏳ unit-backed rows are CI-confirmed.)
Binding status: 29/127 scenarios carry @ts-bound (wired to an executable TS test). The remaining 98 are contract-level (ported from the Python spec), covered indirectly rather than by 1:1 bindings. The full suite is CI-confirmed at 796 pass / 1 skip on HEAD 390e52c (run 26960103187); the former lockfile abort is fixed, so unit-backed ACs below are ✅ on real CI evidence. Demo-recording-backed ACs are ✅ (the 16 recordings are committed and present in-tree, verified).
AC1
- scenario.run(), text-only scenarios unaffected by voice deps
- voice_text_parity recording (in-tree ✅) + unit "Existing text-only scenarios unaffected" (✅ CI)
AC2
- ScenarioConfig.voice, no module-global configure()
- composable_stt_swap recording (in-tree ✅); STT-swap units (✅ CI)
AC3
- agent({ wait:false }) + interrupt() cuts off a reply mid-utterance, marked truncated
- interruption_recovery recording, judge hard-gated (in-tree ✅)
AC4
- gemini_live_interruption + elevenlabs_interruption recordings (in-tree ✅)
AC5
- openai_realtime_{agent,user}, elevenlabs_{hosted,branded}, gemini_live, pipecat_{scenario,ws} recordings (in-tree ✅)
AC6
- angry_customer recording, noise-floor ≫ silence assertion (in-tree ✅)
AC7
- @ts-judge 7-scenario unit suite, CI-confirmed (run 26960103187)
AC8
- capabilities; dtmf() raises UnsupportedCapabilityError off-telephony; matrix in docs
- @ts-contract-surface units (✅ CI) + adapters/*.mdx tables (in-tree ✅)
AC9
- gpt-4o-transcribe); SDK-side VAD fallback with one-shot warning
- @ts-contract-surface/@ts-stt/@ts-vad units, CI-confirmed (same run)
AC10
- ci-checks (units + build + non-bot @ts-e2e) passes on HEAD
- ci-checks (24.x) pass (6m2s, suite executed) + javascript-complete/docs-complete/python-complete/evaluate all pass on HEAD 390e52c, run 26960103187. Fixed by rebase onto main + recording the @ungap/structured-clone override in pnpm-lock.yaml (f7273e7) + 2 stale doc-contract test assertions (f2cdf58). Local: 796 pass / 1 skip.
AC11
- @ts-twilio-proto + @ts-twilio-server integration scenarios bound; unit/integration layers CI-confirmed (same run)
AC12
- PendingTransportError
Net (honest read, updated 2026-06-04): all SDK-side ACs (AC1–12) are now ✅. AC1–6 were demo-recording-verified; AC7–9 + AC11 flipped from ⏳ to ✅ when the unit suite executed in CI for the first time on this PR: ci-checks (24.x) pass in 6m2s on HEAD 390e52c (run 26960103187), 796 pass / 1 skip. AC10's lockfile blocker is fixed (rebase onto main + @ungap/structured-clone override recorded in the lockfile + 2 stale doc-contract test assertions aligned).
App-side ACs: langwatch ingestion + rendering (SDK-green is necessary but NOT sufficient)
Parity is not done when the SDK suite passes; it's done when a voice scenario.run() is correctly ingested, queryable, and rendered in the langwatch app. These require a live end-to-end run (SDK → langwatch ingest → API/UI), not a unit test. Surface mapped against langwatch/langwatch:
AC13
- input_audio content parts) lands in langwatch
- POST /api/collector ingests the span; trace appears for the project
- scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA (openai_realtime_agent demo) landed 2 traces in project_bZspxwkhCD4POvqmIgOr2 with scenario.run_id metadata (sdk: langwatch-observability-sdk typescript 0.16.1); judge spans carry the audio as file parts: {"type":"file","mediaType":"audio/pcm16",…~230KB}
AC14
- POST /api/traces/search {projectId, filters} returns the voice run's trace
- POST /api/traces/search (startDate/endDate window) returned both: c50549165a184a73d5fb509525230755, 7c45a3b4edf40ad1466e33afd176b2f5; GET /api/trace/:id returns full span detail (5 spans incl. OpenAIRealtimeAgentAdapter.call, _JudgeAgent.call)
AC15
- RUN_STARTED/MESSAGE_SNAPSHOT/RUN_FINISHED persisted, each message carrying optional trace_id
- scenario_events ES index holds the run's events; queryable via tRPC
- MESSAGE_SNAPSHOT) carrying 8 input_audio content parts; the 2 assistant messages carry trace_id refs matching the two traces above (cross-linkage proven); status: SUCCESS + results.verdict: success (3/3 criteria) = RUN_FINISHED recorded
AC16
- GET /api/simulation-runs/:scenarioRunId returns it; /[project]/simulations renders it
- GET /api/simulation-runs/scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA → HTTP 200 (demo_openai_realtime_agent, 41.8s, cost $0.0018); platformUrl: app.langwatch.ai/scenario-tracing-bZspxw/simulations (API-verified; in-app visual spot-check needs an authed browser, one click)
AC17
- input_audio content renders as an inline player, not a raw JSON blob
- <ScenarioMessageRenderer> → <MediaPart> emits <audio controls> for the message
- langwatch@main via #4058 (the old #3781 is stale/superseded, not a blocker). MediaPart.tsx emits <audio data-testid="media-part-audio" controls> with onLoadedData probing; visit-content-part.ts handles input_audio parts; integration-tested (MediaPart.integration.test.tsx). The verified run's messages carry exactly the shape it renders: {"type":"input_audio","input_audio":{"mimeType":"audio/pcm16","url":"/api/files/so_…"}}
AC18
- <audio> (onLoadedData → ok); externalized blobs resolve via /api/files/:id
- scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA: GET /api/files/so_000000000002CR60L0eAWXH3ILtke → HTTP 200, 249,600 bytes; decodes as valid 5.20s pcm16 @ 24kHz mono (the AC9 contract format); OpenAI STT transcribes it to "Of course! Where would you like to go, or what kind of activities are you interested in?", the coherent reply to the run's user turn "Hello, can you help me plan a weekend trip?". The user-side part (the SDK's other audio path, user-sim TTS) round-trips verbatim: 3.60s pcm16 → STT → "Hello, can you help me plan a weekend trip?", the exact scripted turn. Decode ✓ format ✓ correct-content ✓ both-paths ✓ serving-route ✓; the in-browser <audio> element rendering these same URLs is covered by the integration tests above. Follow-up hardening on-branch: 885d294 WAV-wraps raw pcm16 at the langwatch-bound converter (supersedes 5db36c8; e2e-verified in its commit: /api/files now serves audio/wav with a RIFF header, ffprobe-decodable) + 390e52c aligns the conversion test
Verification matrix (2026-06-04, all live against production app.langwatch.ai):
Every run below links to the live prod app; open it (project members) and press play on any audio message to hear the run the table describes.
- scenariorun_3EfWWrSEiX95SWhzwUCwd7n22OA
- scenariorun_3Efeze6fAvYmZd8gQeyVxz8SoOm
- origin: simulation trace metadata as disqualifying, while its reasoning simultaneously confirms the live mid-utterance VAD cut-off. Vitest mechanics assertions: 3/3 pass. Suggested post-merge polish: drop the word "real" from that criterion
- "I am a large" / STT "I am a lar", 0.96s)
- scenariorun_3EffH3ID8Q7HU5S5anb2c8u4hhy
- scenariorun_3EffKAJl4dXbP1IU41rxCzYyiwT
- scenariorun_3EffZC024s9IrYTOaIoKbQawMpq
- audio/pcm16: AC9 contract held for a telephony source format
App-side net (final, 2026-06-04): all app-side ACs (AC13–18) are ✅ verified against production across the adapter matrix: 5 live runs / 4 adapter families (matrix above), covering happy-path, interruption/truncation, failed-verdict persistence, audio effects, and a telephony source format. Every leg: SDK → collector → traces/search → simulation-runs REST → /api/files audio bytes, with message↔trace cross-linkage intact and served audio STT-verified as the correct conversational content at the contract format (pcm16/24kHz). The inline player shipped via #4058 (#3781 is stale, not a blocker). Parity board AC1–18: complete. Optional formality: eyeball any run at the simulations page; every layer beneath that click is independently verified.
What changed (decisions)
- ScenarioConfig.voice and reaches the adapters per-run; no global configure(stt=). One AI-SDK file audio format end-to-end; STT/TTS are one-file-per-provider.
- agent({ wait: false }) is the barge-in primitive (PRD §4.4): the executor starts the agent's reply without blocking so a user()/interrupt() lands mid-utterance; the transport's native cancel fires and the cut-off segment is marked truncated.
- transcriptTruncated + shorter segment for cut-off; noise-floor ≫ silence for mixed ambience). A hollow demo now fails.
- voice-integration workflow, not PR CI, mirroring Python's voice-integration.yml. They cost real API money and can flake, so they never gate a merge; ci-checks (units + build + the non-bot @ts-e2e gate) is the merge gate.
- maybeScheduleInterruptedAgentTurn, daa357d): ports Python's pre-step pattern; the executor decides to interrupt before the next agent turn begins rather than patching it up post-step. Eliminates the prior hollow post-step path.
- fa84c8f): voiceProceed fires the interrupt via voiceifyText while the AGENT is still in-flight, bypassing the user-sim LLM to win the race against fast-streaming bots. delayRange is honoured in this path (d20a49c).
- 557cac2): sets a discardingInboundAudio flag so late bot frames don't contaminate the next agent turn.
- VoiceEvent discriminated union (4a49585): five variant interfaces (AgentSpeakingEvent, UserSpeakingEvent, etc.) replace the prior type:string; the compiler now narrows on event.type without casts.
- RealtimeUserAgent + VoiceUserSimulator structural interfaces + type guards (3f234f4): kills as unknown as casts in the executor; adapter conformance is checked structurally at compile time.
- interruptRng/interruptWaitForSpeechMs renamed to drop leading underscore (50cf3f2): test-seam fields are now @internal JSDoc-tagged rather than visually private, consistent with the project's naming convention.
- interruption_recovery judge promoted to hard gate (5bce766): was informative-only; now the scenario fails if the judge says the agent didn't recover. Recording regenerated at be600de.
- 4bd40c6): receiveAudio() deduplicate logic is exercised deterministically without a live connection.
- 8eb7f55…c9c0f9a):
  - recipes/interrupt.mdx now documents voiceProceed({ interruptions: new InterruptionConfig({...}) }), the PR's primary random-barge-in API, alongside the explicit agent({ wait: false }) + interrupt() primitive
  - adapters/pipecat.mdx capability table corrected (drops opus that the TS adapter doesn't support; source: pipecat.ts:139-140)
  - adapters/elevenlabs.mdx adds systemPromptOverride/firstMessageOverride per-session options
  - recipes/effects.mdx drops the stale "on the roadmap" callout (Python script.py:user() already accepts voice_style + audio_effects)
How it works
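As one concrete illustration of the type-level change noted for VoiceEvent, the sketch below shows how a discriminated union lets the compiler narrow on `event.type` without casts. Only the names `VoiceEvent`, `AgentSpeakingEvent`, and `UserSpeakingEvent` come from the PR text; the `type` literals, the fields, the `InterruptionEvent` variant, and `describeEvent` are hypothetical, and the real union has five variants rather than three.

```typescript
// Variant interfaces: names of the first two are from the PR, fields are assumed.
interface AgentSpeakingEvent {
  type: "agent_speaking";
  startMs: number;
}
interface UserSpeakingEvent {
  type: "user_speaking";
  startMs: number;
}
// Hypothetical third variant, included to show per-variant fields.
interface InterruptionEvent {
  type: "interruption";
  atMs: number;
  truncated: boolean;
}

type VoiceEvent = AgentSpeakingEvent | UserSpeakingEvent | InterruptionEvent;

// Each case sees only its own variant's fields; no `as` casts are needed.
function describeEvent(event: VoiceEvent): string {
  switch (event.type) {
    case "agent_speaking":
      return `agent started speaking at ${event.startMs}ms`;
    case "user_speaking":
      return `user started speaking at ${event.startMs}ms`;
    case "interruption":
      return `interrupted at ${event.atMs}ms (truncated: ${event.truncated})`;
    default: {
      // Exhaustiveness check: adding a variant without a case fails to compile here.
      const unreachable: never = event;
      return unreachable;
    }
  }
}
```

With a plain `type: string` field, each branch would need a cast to reach variant-specific fields, and a forgotten variant would go unnoticed until runtime.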
Module tree: 45 created + 18 modified source modules (+44 test files) under javascript/src/, single responsibilities, and how it functions together
Created
Modified (existing modules the voice layer hooks into)
How it functions together
scenario.run({ voice }) (runner/run.ts) resolves a per-run VoiceConfig (voice/config.ts; no module-global state, ADR-002) and hands it to the executor (execution/scenario-execution.ts), which connects the chosen transport (voice/adapters/*) through the uniform connected-state gate (voice/adapter.runtime.ts). On each user turn the simulator (agents/user-simulator-agent.ts) synthesizes speech via the TTS router (voice/tts/) and layers realism through the effects pipeline (voice/effects/ + bundled assets/noise/); audio streams over the transport and the agent's reply is drained on tail-silence windows (adapter.runtime.ts). Every chunk crosses a single gateway (voice/messages.ts) into ModelMessages carrying input_audio file parts, the exact shape langwatch ingests (utils/convert-core-messages-to-agui-messages.ts). Barge-ins come from script steps (script/voice-steps.ts) or probabilistic config (voice/interruption.ts): native server-VAD where the adapter supports it, SDK fallback otherwise (voice/vad.ts), with truncation marked on the byte cursor (voice/segment-utils.ts, voice-executor-state.ts). At judgment time a pre-pass (voice/judge-stt.ts → voice/stt/) transcribes audio for non-multimodal judges (agents/judge/judge-agent.ts). The run's audio persists as full.wav + per-segment files + a byte-accurate manifest (voice/recording.runtime.ts), optionally monitored live (voice/playback.ts), and rides ScenarioResult alongside timeline + latency extensions.
Asset parity: the 5 noise WAVs are byte-identical between javascript/src/voice/assets/noise/ and python/scenario/voice/assets/noise/ (md5-verified 2026-06-04, 144,044 B each), produced by the single deterministic generator javascript/scripts/generate-noise-samples.mjs; both sides carry LICENSES.md.
File-org corrections (from audit)
Three PR-internal file-org corrections landed late in the PR after a structure audit:
- scripts/generate-noise-samples.mjs → javascript/scripts/ (TS-only generator belongs in package scripts, not repo-root cross-language scripts)
- docs/voice/internal-design.md → docs/adr/003-voice-internal-design.md (it's an ADR by description; lives with ADR-001/002)
- javascript/examples/vitest/recordings/ → javascript/examples/vitest/outputs/recordings/ (semantic clarity + future-proof for traces/logs siblings)
Pre-existing-on-main cleanup landed separately as PR #586 (deletes orphan docs, publishes happy-path guides, folds capability-matrix duplication, python recordings → outputs/recordings rename, python noise-sample parity refresh). After #586 merges this branch rebases off cleaned main.
Test plan
- pnpm -F @langwatch/scenario test → 791 pass / 1 skipped (incl. the new interrupt-truncation, noise-energy, byte-cursor, proceed-loop pre-step, and Gemini Live spurious-pair unit tests).
- @ts-e2e round-trip gate (real keys) green; tsc --noEmit, build:all, lint:all, typecheck:all clean.
How I can prove it works
16 committed demo recordings (javascript/examples/vitest/outputs/recordings/<demo>/: full.wav + byte-accurate manifest.json), generated against live providers + the bundled Pipecat bot. The ones demonstrating the headline behaviors (open the blob → GitHub shows an audio player):
Full set (16) also covers the adapters (openai_realtime ×2, elevenlabs hosted/branded, gemini_live), composable_stt_swap, recording_playback, voice_text_parity, pipecat ×2, background_handoff.
Anything surprising
- evaluate check is red: expected & non-blocking, fix lands with PR #586 (chore: main-side cleanup, docs + spec + python/TS parity). This PR's diff exceeds GitHub's 20k-line API cap, so the eval bot can't fetch it (HTTP 406). Not a required check. PR #586 (commits bafdbf7e15 + cdce271bb2) catches the 406 with a grep-specific oversized=true path + env: pattern hardening for oversized_reason; supersedes PR #572 (fix(ci/#571): soft-pass oversized PR diffs in pr-auto-approve evaluate) and closes #571 (ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path) on merge. Once #586 merges and this branch rebases on the new main, evaluate will go green on this PR too.
- voice-integration workflow (which now sets SCENARIO_PIPECAT_BOT_UP); pre-merge they're proven by the committed recordings + local real-key runs, and the workflow guards them post-merge.
- random_interruptions recording is honest about Pipecat-bot limits. The bundled Pipecat stub bot bursts TTS frames in ~50 ms of wall time (not realtime streaming), so by the time adapter.interrupt() fires the bot has already sent all frames; real mid-stream audio cut-off isn't observable with this transport. The demo's assertions encode what the bot can prove: interrupt fires + fired_after_speech outcome + canned-phrase strategy ran + truncation label + agent recovers + multi-turn conversation. For real audio cut-off under server-side cancel, see gemini_live_interruption. A transport-upgrade follow-up is drafted (see /tmp/voice-spec/issue-random-interruptions-followup.md).
- speakingEvent empty-chunk), #570 (harden(voice/ts): validate paths/URLs in backgroundNoise/multipleVoices + pipecat adapter; path/URL hardening), #571 (ci: evaluate workflow hard-fails on PRs >20k-line diff instead of its oversized path; evaluate oversized-diff). Remaining transports (LiveKit/Vapi/WebRTC/WebSocket) tracked under #371 (Voice agents: remaining platform adapter transports).
🤖 Generated with Claude Code
Closes (574-585 grind — landed in this PR)
Closes #574 #575 #576 #578 #579 #580 #581 #582 #583 #584 #585
All 11 follow-up issues from the post-review NIT batch were addressed in this PR via 60+ commits since
4d83724. Per-issue close-out comments are on each issue; partial outcomes for #580 (Gemini adapter improved, demo workaround retained) and #583 (adapter dequeue race fixed, transport switch reverted) are documented honestly there.