feat(inference): enhance schemas and models for TTS and STT by russellmartin-livekit · Pull Request #1680 · livekit/agents-js

russellmartin-livekit · 2026-06-01T20:58:33Z

Description

Brings the @livekit/agents Inference clients (LLM / STT / TTS under agents/src/inference) back in sync with the inference-gateway capabilities. This covers model-catalog drift plus three functional gaps where the gateway supports a feature the JS client never exercised: LLM response_format, STT mid-stream session.update, and TTS per-utterance generation_config.

Changes Made

LLM (llm.ts) — Synced the model catalog with the gateway config (added zai/glm-5.1, moonshotai/kimi-k2.5/k2.6, openai/gpt-5.5/gpt-5.4-mini/gpt-5.4-nano/chat-latest, google/gemini-3.1-pro/gemini-3.1-flash-lite/gemini-3.5-flash; dropped stale entries). Added a response_format passthrough field to ChatCompletionOptions so structured-output requests reach the gateway.
STT (stt.ts) — Split Deepgram Nova vs. Flux into DeepgramModels/DeepgramFluxModels (incl. deepgram/flux-general-multi) with a dedicated DeepgramFluxOptions (eager_eot_threshold, etc.); added cartesia/ink-2(-latest), assemblyai/u3-rt-pro, and inworld/inworld-stt-1 + InworldSTTOptions. Reworked SpeechStream.updateOptions to send a mid-stream session.update over the live socket (AssemblyAI/Flux apply it without a reconnect) instead of forcing a reconnect.
TTS (tts.ts, api_protos.ts) — Added the xai/tts-1 provider + XaiTTSOptions; expanded Cartesia (sonic-3.5/sonic-3-latest/sonic-latest), ElevenLabs (eleven_v3), Rime (coda/mistv3/mist), and Inworld (inworld-tts-1.5) catalogs. input_transcript now carries generation_config (voice/model/language) and extra (modelOptions) so mid-stream changes ride the gateway hot path; extended the Zod schema accordingly.
Exports (index.ts) — Exported the new public types (ZAIModels, DeepgramFluxModels/DeepgramFluxOptions, InworldSTTModels/InworldSTTOptions, XaiTTSModels/XaiTTSOptions).
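The STT change above (send a mid-stream session.update over the live socket rather than reconnecting) can be sketched roughly as follows. This is a minimal illustration, not the code in stt.ts: `SpeechStreamSketch`, `SocketLike`, `SttOptionsSketch`, the `'deferred'` fallback, and the `{ type, settings }` wire shape are all assumptions made for the example.

```typescript
// Hypothetical sketch: class names, option fields, and the wire shape are
// assumptions for illustration, not the actual @livekit/agents implementation.

/** Minimal socket surface the sketch needs (stand-in for a real WebSocket). */
interface SocketLike {
  readonly isOpen: boolean;
  send(data: string): void;
}

/** Illustrative subset of STT options. */
interface SttOptionsSketch {
  model?: string;
  language?: string;
  extra?: Record<string, unknown>;
}

type UpdateResult = 'sent' | 'deferred';

class SpeechStreamSketch {
  private socket: SocketLike | undefined;
  private opts: SttOptionsSketch;

  constructor(socket: SocketLike | undefined, initial: SttOptionsSketch) {
    this.socket = socket;
    this.opts = { ...initial };
  }

  /** Merge the patch locally, then push it over the live socket if one is open. */
  updateOptions(patch: SttOptionsSketch): UpdateResult {
    this.opts = {
      ...this.opts,
      ...patch,
      extra: { ...this.opts.extra, ...patch.extra },
    };
    if (this.socket !== undefined && this.socket.isOpen) {
      // Assumed message shape: only the changed fields are sent.
      this.socket.send(JSON.stringify({ type: 'session.update', settings: patch }));
      return 'sent';
    }
    // No live socket: the merged options apply to the next connection instead.
    return 'deferred';
  }

  get options(): SttOptionsSketch {
    return this.opts;
  }
}
```

Sending only the patch keeps the update small, while the merged copy held locally means a later reconnect would still start from the latest options.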

Pre-Review Checklist

Build passes: tsc --noEmit clean, ESLint clean (only pre-existing tsdoc nits in untouched InworldOptions/RimeOptions comments), 122/122 inference unit tests pass
AI-generated code reviewed: No narration comments; comments explain only non-obvious gateway-protocol intent
Changes explained: See above
Scope appropriate: All changes are within agents/src/inference; each is either catalog sync or a gateway feature the client was missing
Video demo: N/A — verified directly against the production gateway instead (see Testing)

Testing

Automated tests added/updated (if applicable) — no new tests; existing inference/*.test.ts (incl. api_protos) cover the changed surfaces
All tests pass — 122/122 inference unit tests
Make sure both restaurant_agent.ts and realtime_agent.ts work properly — not run; ran a direct inference smoke test against prod instead (below)

Ran a live end-to-end smoke test against the production gateway (built the package, exercised each client; STT tests feed real TTS-synthesized speech back through the recognizers): 9/9 passed.

LLM: openai/gpt-5.5, google/gemini-3.5-flash, zai/glm-5.1 all replied; response_format: { type: 'json_object' } returned valid JSON.
TTS: cartesia/sonic-3 + add_timestamps produced audio and 9 word-level alignments (confirms generation_config/extra → output_alignment); xai/tts-1 produced audio.
STT: inworld/inworld-stt-1, deepgram/flux-general-multi, and assemblyai/u3-rt-pro (with a mid-stream session.update) all transcribed the phrase correctly.
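For readers who want to reproduce the response_format check from the LLM smoke test, the idea can be sketched as below. The helper names (`buildRequestBody`, `parseJsonReply`) and the `ChatCompletionOptionsSketch` type are hypothetical; the only detail taken from this PR is that `response_format: { type: 'json_object' }` is passed through to the gateway unchanged.

```typescript
// Hypothetical sketch: helper and type names are assumptions, not the
// ChatCompletionOptions type from llm.ts.

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

type ResponseFormat = { type: 'text' } | { type: 'json_object' };

interface ChatCompletionOptionsSketch {
  temperature?: number;
  response_format?: ResponseFormat;
}

/** Spread the options into the request body so response_format passes through as-is. */
function buildRequestBody(
  model: string,
  messages: ChatMessage[],
  options: ChatCompletionOptionsSketch = {},
): Record<string, unknown> {
  return { model, messages, ...options };
}

/** Check that a reply's content is a JSON object, as json_object mode implies. */
function parseJsonReply(content: string): Record<string, unknown> {
  const parsed: unknown = JSON.parse(content);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('expected a JSON object');
  }
  return parsed as Record<string, unknown>;
}
```

A smoke test along these lines would build the body with `{ response_format: { type: 'json_object' } }`, send it with whatever client is in use, and run the reply text through `parseJsonReply`.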

Additional Notes

deepgram/flux-general-multi only finalizes with real-time-paced audio + a trailing silence gap (turn-based end-of-turn detection); a burst feed yields only a preflight transcript. This is model behavior, not a client issue — the client parses Flux's preflight/final events correctly.
Catalog entries are synced from agent-gateway's config.yaml; if a model is added to the client before prod deploys it, calls for that model will be rejected by the gateway until rollout.
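The Flux note above (real-time pacing plus a trailing silence gap) suggests a test harness shaped roughly like the sketch below. It only plans the feed: it splits 16-bit PCM into fixed-size frames, appends silence, and assigns each frame a send offset. `planPacedFeed`, the 20 ms frame size, and the 1000 ms silence default are assumptions; this PR does not state how much silence Flux needs to finalize.

```typescript
// Hypothetical sketch: function name and default durations are assumptions.

interface PacedFrame {
  /** Offset from the start of the feed at which to send this frame. */
  sendAtMs: number;
  samples: Int16Array;
}

/**
 * Split mono 16-bit PCM into frameMs-sized frames, append trailing silence,
 * and schedule one frame per frameMs so the feed runs at real-time speed.
 */
function planPacedFeed(
  pcm: Int16Array,
  sampleRate: number,
  frameMs = 20,
  trailingSilenceMs = 1000, // assumed value, tune per model
): PacedFrame[] {
  const samplesPerFrame = Math.round((sampleRate * frameMs) / 1000);
  const silenceSamples = Math.round((sampleRate * trailingSilenceMs) / 1000);

  // A new Int16Array is zero-filled, so the tail past the speech is silence.
  const padded = new Int16Array(pcm.length + silenceSamples);
  padded.set(pcm);

  const frames: PacedFrame[] = [];
  for (let start = 0, i = 0; start < padded.length; start += samplesPerFrame, i++) {
    frames.push({
      sendAtMs: i * frameMs,
      samples: padded.subarray(start, start + samplesPerFrame),
    });
  }
  return frames;
}
```

A caller would then wait until each frame's `sendAtMs` has elapsed before writing it to the stream, which avoids the burst feed that yields only a preflight transcript.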

- Added `ttsGenerationConfigSchema` for TTS generation configuration.
- Updated `ttsInputTranscriptEventSchema` to include optional `generation_config` and `extra` fields.
- Expanded model types in `llm.ts`, `stt.ts`, and `tts.ts` to include new models and options for various providers.
- Introduced new interfaces for `DeepgramFluxOptions`, `InworldSTTOptions`, and `XaiTTSOptions` to support additional configurations.
- Improved handling of mid-stream updates in `SpeechStream` and `SynthesizeStream` classes for better performance and flexibility.
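To make the shape of the extended input_transcript event concrete, here is a dependency-free sketch. The PR itself extends a Zod schema in api_protos.ts; the hand-written validator below is a stand-in, not that schema. The `transcript` field name and the function name `parseInputTranscriptEvent` are assumptions; `generation_config` (voice/model/language) and `extra` come from the description above.

```typescript
// Hypothetical sketch: a hand-rolled validator standing in for the Zod schema.
// The `transcript` field name is an assumption.

interface GenerationConfig {
  voice?: string;
  model?: string;
  language?: string;
}

interface InputTranscriptEvent {
  type: 'input_transcript';
  transcript: string;
  generation_config?: GenerationConfig;
  extra?: Record<string, unknown>;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v);
}

function isOptionalString(v: unknown): boolean {
  return v === undefined || typeof v === 'string';
}

/** Validate an unknown payload; both new fields are optional. */
function parseInputTranscriptEvent(raw: unknown): InputTranscriptEvent {
  if (!isPlainObject(raw)) throw new Error('event must be an object');
  if (raw.type !== 'input_transcript') throw new Error('unexpected event type');
  if (typeof raw.transcript !== 'string') throw new Error('transcript must be a string');

  const config = raw.generation_config;
  if (config !== undefined) {
    if (
      !isPlainObject(config) ||
      !isOptionalString(config.voice) ||
      !isOptionalString(config.model) ||
      !isOptionalString(config.language)
    ) {
      throw new Error('invalid generation_config');
    }
  }
  if (raw.extra !== undefined && !isPlainObject(raw.extra)) {
    throw new Error('extra must be an object');
  }
  return raw as unknown as InputTranscriptEvent;
}
```

Because both fields are optional, a payload without `generation_config` or `extra` still validates, which matches the description of them as optional additions.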

devin-ai-integration

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

* upstream/main: (26 commits)
  fix(voice): align session.start recording with Python primary-session semantics (livekit#1704)
  docs: add TcpSessionTransport to Remote Sessions section (livekit#1703)
  (format): remove whitespace (livekit#1705)
  feat(realtime): add reasoning configuration for gpt-realtime-2 models (livekit#1575)
  feat(voice): support granular RecordingOptions in session.start (livekit#1702)
  fix(inference): guard agent sid header (livekit#1700)
  feat(voice): add TcpSessionTransport and updateIo session handler (livekit#1693)
  fix(amd): defer SIP listening until answer (livekit#1639)
  fix(job): close RecorderIO at session end (livekit#1682)
  docs: fix incorrect inference model file reference (livekit#1685)
  docs: update cartesia plugin capabilities for STT support (livekit#1686)
  feat(inference): add agent ID header to inference requests (livekit#1687)
  fix(soniox): exclude test files from dist build (livekit#1689)
  Version Packages (livekit#1683)
  fix(recorder): prevent close hang (livekit#1684)
  docs(cartesia): add cartesia stt to readme (livekit#1681)
  fix(inworld): harden TTS connection layer and default to inworld-tts-2 (livekit#1675)
  feat(inference): enhance schemas and models for TTS and STT (livekit#1680)
  fix(llm): make ToolOptions.abortSignal required (livekit#1678)
  chore(deps): update dependency vitest to v4.1.0 [security] (livekit#1673)
  ...

russellmartin-livekit requested review from a team, theomonnom and tinalenguyen June 1, 2026 20:58

devin-ai-integration Bot reviewed Jun 1, 2026

Comment thread agents/src/inference/stt.ts

tinalenguyen mentioned this pull request Jun 1, 2026

feat(inference): update model literals #1676

Closed

toubatbrian approved these changes Jun 1, 2026

tinalenguyen approved these changes Jun 1, 2026

toubatbrian merged commit 6dc619f into main Jun 1, 2026
9 checks passed

toubatbrian deleted the inference-model-update branch June 1, 2026 22:18

detail-app Bot mentioned this pull request Jun 2, 2026

docs: fix incorrect inference model file reference #1685

Merged

toubatbrian merged 1 commit into main from inference-model-update