feat(inference): enhance schemas and models for TTS and STT #1680
Merged
- Added `ttsGenerationConfigSchema` for TTS generation configuration.
- Updated `ttsInputTranscriptEventSchema` to include optional `generation_config` and `extra` fields.
- Expanded model types in `llm.ts`, `stt.ts`, and `tts.ts` to include new models and options for various providers.
- Introduced new interfaces for `DeepgramFluxOptions`, `InworldSTTOptions`, and `XaiTTSOptions` to support additional configurations.
- Improved handling of mid-stream updates in `SpeechStream` and `SynthesizeStream` classes for better performance and flexibility.
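As a rough illustration of the optional `generation_config` and `extra` fields described above, the sketch below builds an `input_transcript` message that includes them only when something is actually set. The `transcript` key, the `type` discriminator, and the helper `buildInputTranscript` are assumptions made for illustration; the real definitions live in `api_protos.ts` and may differ.

```typescript
// Hypothetical shapes for illustration only; not the actual api_protos.ts schema.
type TtsGenerationConfig = {
  voice?: string;
  model?: string;
  language?: string;
};

type InputTranscriptEvent = {
  type: 'input_transcript';
  transcript: string; // assumed field name
  generation_config?: TtsGenerationConfig;
  extra?: Record<string, unknown>;
};

// Copy only the entries that are defined, so unset options are never serialized.
function dropUndefined(obj: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(obj)) {
    if (obj[key] !== undefined) out[key] = obj[key];
  }
  return out;
}

function buildInputTranscript(
  transcript: string,
  config: TtsGenerationConfig = {},
  modelOptions: Record<string, unknown> = {},
): InputTranscriptEvent {
  const event: InputTranscriptEvent = { type: 'input_transcript', transcript };

  const generationConfig: TtsGenerationConfig = {};
  if (config.voice !== undefined) generationConfig.voice = config.voice;
  if (config.model !== undefined) generationConfig.model = config.model;
  if (config.language !== undefined) generationConfig.language = config.language;
  if (Object.keys(generationConfig).length > 0) {
    event.generation_config = generationConfig;
  }

  const extra = dropUndefined(modelOptions);
  if (Object.keys(extra).length > 0) {
    event.extra = extra;
  }
  return event;
}

// No overrides: the message carries only the transcript.
const plain = buildInputTranscript('Hello there.');
// Per-utterance override: language changes and a model option rides along in `extra`.
const updated = buildInputTranscript('Bonjour.', { language: 'fr' }, { add_timestamps: true });
console.log(JSON.stringify(plain));
console.log(JSON.stringify(updated));
```

Keeping both fields optional means a client that never changes options mid-stream sends the same payload as before.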
toubatbrian approved these changes on Jun 1, 2026
tinalenguyen approved these changes on Jun 1, 2026
osimhi213 added a commit to de-id/livekit-agents-js that referenced this pull request on Jun 4, 2026
* upstream/main: (26 commits)
  fix(voice): align session.start recording with Python primary-session semantics (livekit#1704)
  docs: add TcpSessionTransport to Remote Sessions section (livekit#1703)
  (format): remove whitespace (livekit#1705)
  feat(realtime): add reasoning configuration for gpt-realtime-2 models (livekit#1575)
  feat(voice): support granular RecordingOptions in session.start (livekit#1702)
  fix(inference): guard agent sid header (livekit#1700)
  feat(voice): add TcpSessionTransport and updateIo session handler (livekit#1693)
  fix(amd): defer SIP listening until answer (livekit#1639)
  fix(job): close RecorderIO at session end (livekit#1682)
  docs: fix incorrect inference model file reference (livekit#1685)
  docs: update cartesia plugin capabilities for STT support (livekit#1686)
  feat(inference): add agent ID header to inference requests (livekit#1687)
  fix(soniox): exclude test files from dist build (livekit#1689)
  Version Packages (livekit#1683)
  fix(recorder): prevent close hang (livekit#1684)
  docs(cartesia): add cartesia stt to readme (livekit#1681)
  fix(inworld): harden TTS connection layer and default to inworld-tts-2 (livekit#1675)
  feat(inference): enhance schemas and models for TTS and STT (livekit#1680)
  fix(llm): make ToolOptions.abortSignal required (livekit#1678)
  chore(deps): update dependency vitest to v4.1.0 [security] (livekit#1673)
  ...
Description
Brings the `@livekit/agents` Inference clients (LLM / STT / TTS under `agents/src/inference`) back in sync with the `inference-gateway` capabilities. This covers model-catalog drift plus three functional gaps where the gateway supports a feature the JS client never exercised: LLM `response_format`, STT mid-stream `session.update`, and TTS per-utterance `generation_config`.

Changes Made
- `llm.ts`: Synced the model catalog with the gateway config (added `zai/glm-5.1`, `moonshotai/kimi-k2.5` / `k2.6`, `openai/gpt-5.5` / `gpt-5.4-mini` / `gpt-5.4-nano` / `chat-latest`, `google/gemini-3.1-pro` / `gemini-3.1-flash-lite` / `gemini-3.5-flash`; dropped stale entries). Added a `response_format` passthrough field to `ChatCompletionOptions` so structured-output requests reach the gateway.
- `stt.ts`: Split Deepgram Nova vs. Flux into `DeepgramModels` / `DeepgramFluxModels` (incl. `deepgram/flux-general-multi`) with a dedicated `DeepgramFluxOptions` (`eager_eot_threshold`, etc.); added `cartesia/ink-2(-latest)`, `assemblyai/u3-rt-pro`, and `inworld/inworld-stt-1` + `InworldSTTOptions`. Reworked `SpeechStream.updateOptions` to send a mid-stream `session.update` over the live socket (AssemblyAI/Flux apply it without a reconnect) instead of forcing a reconnect.
- `tts.ts`, `api_protos.ts`: Added the `xai/tts-1` provider + `XaiTTSOptions`; expanded Cartesia (`sonic-3.5` / `sonic-3-latest` / `sonic-latest`), ElevenLabs (`eleven_v3`), Rime (`coda` / `mistv3` / `mist`), and Inworld (`inworld-tts-1.5`) catalogs. `input_transcript` now carries `generation_config` (voice/model/language) and `extra` (`modelOptions`) so mid-stream changes ride the gateway hot path; extended the Zod schema accordingly.
- `index.ts`: Exported the new public types (`ZAIModels`, `DeepgramFluxModels` / `DeepgramFluxOptions`, `InworldSTTModels` / `InworldSTTOptions`, `XaiTTSModels` / `XaiTTSOptions`).

Pre-Review Checklist
- `tsc --noEmit` clean, ESLint clean (only pre-existing tsdoc nits in untouched `InworldOptions` / `RimeOptions` comments), 122/122 inference unit tests pass
- Scope: `agents/src/inference`; each change is either catalog sync or a gateway feature the client was missing

Testing
- `inference/*.test.ts` (incl. `api_protos`) cover the changed surfaces
- `restaurant_agent.ts` and `realtime_agent.ts` work properly: not run; ran a direct inference smoke test against prod instead (below)

Ran a live end-to-end smoke test against the production gateway (built the package, exercised each client; STT tests feed real TTS-synthesized speech back through the recognizers): 9/9 passed.
- `openai/gpt-5.5`, `google/gemini-3.5-flash`, `zai/glm-5.1` all replied; `response_format: { type: 'json_object' }` returned valid JSON.
- `cartesia/sonic-3` + `add_timestamps` produced audio and 9 word-level alignments (confirms `generation_config` / `extra` → `output_alignment`); `xai/tts-1` produced audio.
- `inworld/inworld-stt-1`, `deepgram/flux-general-multi`, and `assemblyai/u3-rt-pro` (with a mid-stream `session.update`) all transcribed the phrase correctly.

Additional Notes
- `deepgram/flux-general-multi` only finalizes with real-time-paced audio + a trailing silence gap (turn-based end-of-turn detection); a burst feed yields only a preflight transcript. This is model behavior, not a client issue: the client parses Flux's `preflight` / `final` events correctly.
- The model catalog tracks `agent-gateway`'s `config.yaml`; if a model is added to the client before prod deploys it, calls for that model will be rejected by the gateway until rollout.
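
To make the pacing note concrete, here is a minimal sketch of how a test harness might schedule a prerecorded clip as real-time-paced frames followed by trailing silence. The helper `planPacedFrames`, the 16 kHz mono 16-bit format, the 20 ms frame size, and the 1 s silence gap are all illustrative assumptions, not values taken from the client or the gateway.

```typescript
// Hypothetical helper: plans when each audio frame should be sent so the
// stream is paced like live speech and ends with a quiet gap.
interface PacedFrame {
  sendAtMs: number; // offset from stream start at which to send this frame
  samples: Int16Array;
}

function planPacedFrames(
  pcm: Int16Array,
  sampleRate = 16000, // assumed input format
  frameMs = 20, // assumed frame duration
  trailingSilenceMs = 1000, // assumed end-of-turn gap
): PacedFrame[] {
  const samplesPerFrame = Math.floor((sampleRate * frameMs) / 1000);
  const silenceSamples = Math.floor((sampleRate * trailingSilenceMs) / 1000);

  // Append zeros so the recognizer sees silence after the last word.
  const padded = new Int16Array(pcm.length + silenceSamples);
  padded.set(pcm);

  const frames: PacedFrame[] = [];
  for (let start = 0, i = 0; start < padded.length; start += samplesPerFrame, i++) {
    const end = Math.min(start + samplesPerFrame, padded.length);
    frames.push({ sendAtMs: i * frameMs, samples: padded.subarray(start, end) });
  }
  return frames;
}

// One second of dummy audio plus one second of silence: 100 frames of 20 ms.
const frames = planPacedFrames(new Int16Array(16000).fill(1000));
console.log(frames.length, frames[frames.length - 1].sendAtMs);
```

A sender would wait until each `sendAtMs` offset before writing the frame, for example with `setTimeout`; a burst feed corresponds to ignoring those offsets and sending everything at once.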