feat(inference): enhance schemas and models for TTS and STT #1680

Merged: toubatbrian merged 1 commit into main from inference-model-update on Jun 1, 2026
@russellmartin-livekit
Contributor

Description

Brings the @livekit/agents Inference clients (LLM / STT / TTS under agents/src/inference) back in sync with the inference-gateway capabilities. This covers model-catalog drift plus three functional gaps where the gateway supports a feature the JS client never exercised: LLM response_format, STT mid-stream session.update, and TTS per-utterance generation_config.

Changes Made

  • LLM (llm.ts) — Synced the model catalog with the gateway config (added zai/glm-5.1, moonshotai/kimi-k2.5/k2.6, openai/gpt-5.5/gpt-5.4-mini/gpt-5.4-nano/chat-latest, google/gemini-3.1-pro/gemini-3.1-flash-lite/gemini-3.5-flash; dropped stale entries). Added a response_format passthrough field to ChatCompletionOptions so structured-output requests reach the gateway.
  • STT (stt.ts) — Split Deepgram Nova vs. Flux into DeepgramModels/DeepgramFluxModels (incl. deepgram/flux-general-multi) with a dedicated DeepgramFluxOptions (eager_eot_threshold, etc.); added cartesia/ink-2(-latest), assemblyai/u3-rt-pro, and inworld/inworld-stt-1 + InworldSTTOptions. Reworked SpeechStream.updateOptions to send a mid-stream session.update over the live socket (AssemblyAI/Flux apply it without a reconnect) instead of forcing a reconnect.
  • TTS (tts.ts, api_protos.ts) — Added the xai/tts-1 provider + XaiTTSOptions; expanded Cartesia (sonic-3.5/sonic-3-latest/sonic-latest), ElevenLabs (eleven_v3), Rime (coda/mistv3/mist), and Inworld (inworld-tts-1.5) catalogs. input_transcript now carries generation_config (voice/model/language) and extra (modelOptions) so mid-stream changes ride the gateway hot path; extended the Zod schema accordingly.
  • Exports (index.ts) — Exported the new public types (ZAIModels, DeepgramFluxModels/DeepgramFluxOptions, InworldSTTModels/InworldSTTOptions, XaiTTSModels/XaiTTSOptions).
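
As a rough illustration of the TTS change above, the sketch below builds an input_transcript event that carries per-utterance overrides. Only the names input_transcript, generation_config (voice/model/language), extra, and add_timestamps come from this description; the `transcript` field name, the interfaces, and `buildInputTranscript` are hypothetical and may not match the actual schema in api_protos.ts.

```typescript
// Hypothetical shapes for illustration; not the real api_protos.ts types.
interface GenerationConfigSketch {
  voice?: string;
  model?: string;
  language?: string;
}

interface InputTranscriptSketch {
  type: 'input_transcript';
  transcript: string; // assumed field name for the text to synthesize
  generation_config?: GenerationConfigSketch;
  extra?: Record<string, unknown>;
}

function buildInputTranscript(
  text: string,
  config: GenerationConfigSketch = {},
  modelOptions: Record<string, unknown> = {},
): InputTranscriptSketch {
  const message: InputTranscriptSketch = { type: 'input_transcript', transcript: text };

  // Copy only explicitly set fields so the event carries real overrides, not undefined keys.
  const generationConfig: GenerationConfigSketch = {};
  if (config.voice !== undefined) generationConfig.voice = config.voice;
  if (config.model !== undefined) generationConfig.model = config.model;
  if (config.language !== undefined) generationConfig.language = config.language;

  // Omit both optional fields entirely when there is nothing to override.
  if (Object.keys(generationConfig).length > 0) message.generation_config = generationConfig;
  if (Object.keys(modelOptions).length > 0) message.extra = { ...modelOptions };
  return message;
}

// Example: switch voice for one utterance and request word timestamps.
const ttsMessage = buildInputTranscript(
  'Hello there',
  { voice: 'example-voice' }, // placeholder voice id
  { add_timestamps: true },
);
console.log(JSON.stringify(ttsMessage));
```

Leaving the optional fields out when nothing changed keeps the common case identical to the previous wire format, which is one plausible reason both fields are optional in the extended schema.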

Pre-Review Checklist

  • Build passes: tsc --noEmit clean, ESLint clean (only pre-existing tsdoc nits in untouched InworldOptions/RimeOptions comments), 122/122 inference unit tests pass
  • AI-generated code reviewed: No narration comments; comments explain only non-obvious gateway-protocol intent
  • Changes explained: See above
  • Scope appropriate: All changes are within agents/src/inference; each is either catalog sync or a gateway feature the client was missing
  • Video demo: N/A — verified directly against the production gateway instead (see Testing)

Testing

  • Automated tests added/updated (if applicable) — no new tests; existing inference/*.test.ts (incl. api_protos) cover the changed surfaces
  • All tests pass — 122/122 inference unit tests
  • Make sure both restaurant_agent.ts and realtime_agent.ts work properly — not run; ran a direct inference smoke test against prod instead (below)

Ran a live end-to-end smoke test against the production gateway (built the package, exercised each client; STT tests feed real TTS-synthesized speech back through the recognizers): 9/9 passed.

  • LLM: openai/gpt-5.5, google/gemini-3.5-flash, zai/glm-5.1 all replied; response_format: { type: 'json_object' } returned valid JSON.
  • TTS: cartesia/sonic-3 + add_timestamps produced audio and 9 word-level alignments (confirms the generation_config/extra → output_alignment path); xai/tts-1 produced audio.
  • STT: inworld/inworld-stt-1, deepgram/flux-general-multi, and assemblyai/u3-rt-pro (with a mid-stream session.update) all transcribed the phrase correctly.
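
The response_format check in the LLM test can be pictured roughly as follows. This is a minimal sketch, not the client's implementation: `ChatOptionsSketch`, `buildChatRequest`, and `isJsonObject` are hypothetical names, and the request layout is assumed to follow the common chat-completions shape. Only the `response_format: { type: 'json_object' }` value and the model id come from this description.

```typescript
// Hypothetical option and request shapes for illustration.
interface ResponseFormatSketch {
  type: 'text' | 'json_object';
}

interface ChatOptionsSketch {
  temperature?: number;
  response_format?: ResponseFormatSketch;
}

interface ChatMessageSketch {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequestSketch extends ChatOptionsSketch {
  model: string;
  messages: ChatMessageSketch[];
}

function buildChatRequest(
  model: string,
  messages: ChatMessageSketch[],
  options: ChatOptionsSketch = {},
): ChatRequestSketch {
  // Spread options first so they can never overwrite model or messages.
  return { ...options, model, messages };
}

// True only when the reply parses as JSON and the top-level value is a plain object.
function isJsonObject(content: string): boolean {
  try {
    const parsed: unknown = JSON.parse(content);
    return typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed);
  } catch {
    return false;
  }
}

const chatRequest = buildChatRequest(
  'openai/gpt-5.5',
  [{ role: 'user', content: 'Reply with a JSON object containing a "city" key.' }],
  { response_format: { type: 'json_object' } },
);
console.log(JSON.stringify(chatRequest));
console.log(isJsonObject('{"city":"Paris"}'));
```

The point of the passthrough is simply that the option survives into the request body unchanged; validating the reply with something like `isJsonObject` is how a smoke test could confirm the gateway honored it.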

Additional Notes

  • deepgram/flux-general-multi only finalizes with real-time-paced audio + a trailing silence gap (turn-based end-of-turn detection); a burst feed yields only a preflight transcript. This is model behavior, not a client issue — the client parses Flux's preflight/final events correctly.
  • Catalog entries are synced from agent-gateway's config.yaml; if a model is added to the client before prod deploys it, calls for that model will be rejected by the gateway until rollout.
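
To make the Flux note concrete, here is one way a test harness might turn a burst of PCM into real-time-paced frames followed by trailing silence. It is a sketch under assumptions: `paceAudio` and `PacedFrameSketch` are hypothetical, and the 20 ms frame size and 1000 ms silence gap are arbitrary illustrative values, not figures stated by the gateway or by Deepgram.

```typescript
interface PacedFrameSketch {
  sendAtMs: number; // offset from stream start at which the frame should be sent
  samples: Int16Array; // mono 16-bit PCM for this frame
}

function paceAudio(
  pcm: Int16Array,
  sampleRate: number,
  frameMs = 20, // assumed frame duration
  trailingSilenceMs = 1000, // assumed silence gap appended after the speech
): PacedFrameSketch[] {
  const samplesPerFrame = Math.floor((sampleRate * frameMs) / 1000);
  if (samplesPerFrame <= 0) throw new Error('frameMs is too small for this sample rate');

  // Int16Array is zero-initialized, so the tail after the speech is digital silence.
  const silenceSamples = Math.floor((sampleRate * trailingSilenceMs) / 1000);
  const padded = new Int16Array(pcm.length + silenceSamples);
  padded.set(pcm);

  const frames: PacedFrameSketch[] = [];
  for (let offset = 0, index = 0; offset < padded.length; offset += samplesPerFrame, index++) {
    // subarray clamps the end index, so the final frame may be shorter than the rest.
    frames.push({
      sendAtMs: index * frameMs,
      samples: padded.subarray(offset, offset + samplesPerFrame),
    });
  }
  return frames;
}

// 100 ms of audio at 16 kHz, padded with 1 s of silence.
const pacedFrames = paceAudio(new Int16Array(1600), 16000);
console.log(pacedFrames.length, pacedFrames[pacedFrames.length - 1].sendAtMs);
```

A sender would wait until each `sendAtMs` has elapsed before writing the frame to the stream; pushing every frame at once reproduces the burst feed described above, which yields only a preflight transcript.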

- Added `ttsGenerationConfigSchema` for TTS generation configuration.
- Updated `ttsInputTranscriptEventSchema` to include optional `generation_config` and `extra` fields.
- Expanded model types in `llm.ts`, `stt.ts`, and `tts.ts` to include new models and options for various providers.
- Introduced new interfaces for `DeepgramFluxOptions`, `InworldSTTOptions`, and `XaiTTSOptions` to support additional configurations.
- Reworked mid-stream updates in the `SpeechStream` and `SynthesizeStream` classes so option changes are sent over the live connection instead of forcing a reconnect.
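
The mid-stream update path for STT can be sketched as below. This is an assumption-laden outline rather than the code in stt.ts: `SpeechStreamSketch`, `SocketLike`, and the `settings` key in the message are hypothetical. Only the `session.update` message type and the `eager_eot_threshold` option name come from the PR description.

```typescript
// Minimal socket surface for the sketch; a standard WebSocket satisfies it.
interface SocketLike {
  readyState: number;
  send(data: string): void;
}

const SOCKET_OPEN = 1; // same numeric value as WebSocket.OPEN

type SttSettingsSketch = Record<string, unknown>;

class SpeechStreamSketch {
  private socket: SocketLike | undefined;
  private settings: SttSettingsSketch;

  constructor(socket: SocketLike | undefined, initial: SttSettingsSketch) {
    this.socket = socket;
    this.settings = { ...initial };
  }

  // Returns true when the change went out over the live socket,
  // false when it was only stored for the next connection.
  updateOptions(changes: SttSettingsSketch): boolean {
    this.settings = { ...this.settings, ...changes };
    if (!this.socket || this.socket.readyState !== SOCKET_OPEN) return false;
    // Assumed message layout; only the `session.update` type is from the PR.
    this.socket.send(JSON.stringify({ type: 'session.update', settings: changes }));
    return true;
  }

  currentSettings(): SttSettingsSketch {
    return { ...this.settings };
  }
}

// Example with an in-memory socket that records what was sent.
const sentMessages: string[] = [];
const demoSocket: SocketLike = {
  readyState: SOCKET_OPEN,
  send: (data) => {
    sentMessages.push(data);
  },
};
const demoStream = new SpeechStreamSketch(demoSocket, { model: 'deepgram/flux-general-multi' });
demoStream.updateOptions({ eager_eot_threshold: 0.6 }); // illustrative value
console.log(sentMessages[0]);
```

Storing the merged settings even when the socket is closed means a later reconnect still picks up the change, so the live update acts as an optimization rather than the only delivery path.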

@devin-ai-integration devin-ai-integration Bot left a comment

Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.

Comment thread agents/src/inference/stt.ts
@toubatbrian toubatbrian merged commit 6dc619f into main Jun 1, 2026
9 checks passed
@toubatbrian toubatbrian deleted the inference-model-update branch June 1, 2026 22:18
osimhi213 added a commit to de-id/livekit-agents-js that referenced this pull request Jun 4, 2026
* upstream/main: (26 commits)
  fix(voice): align session.start recording with Python primary-session semantics (livekit#1704)
  docs: add TcpSessionTransport to Remote Sessions section (livekit#1703)
  (format): remove whitespace (livekit#1705)
  feat(realtime): add reasoning configuration for gpt-realtime-2 models (livekit#1575)
  feat(voice): support granular RecordingOptions in session.start (livekit#1702)
  fix(inference): guard agent sid header (livekit#1700)
  feat(voice): add TcpSessionTransport and updateIo session handler (livekit#1693)
  fix(amd): defer SIP listening until answer (livekit#1639)
  fix(job): close RecorderIO at session end (livekit#1682)
  docs: fix incorrect inference model file reference (livekit#1685)
  docs: update cartesia plugin capabilities for STT support (livekit#1686)
  feat(inference): add agent ID header to inference requests (livekit#1687)
  fix(soniox): exclude test files from dist build (livekit#1689)
  Version Packages (livekit#1683)
  fix(recorder): prevent close hang (livekit#1684)
  docs(cartesia): add cartesia stt to readme (livekit#1681)
  fix(inworld): harden TTS connection layer and default to inworld-tts-2 (livekit#1675)
  feat(inference): enhance schemas and models for TTS and STT (livekit#1680)
  fix(llm): make ToolOptions.abortSignal required (livekit#1678)
  chore(deps): update dependency vitest to v4.1.0 [security] (livekit#1673)
  ...