feat: add Sarvam AI STT plugin #1046
Add @livekit/agents-plugin-sarvam with text-to-speech support using Sarvam AI's Bulbul models. Supports 11 Indian languages and 45+ speaker voices via the Sarvam REST API.
- TTS and ChunkedStream classes following existing plugin patterns
- Models/speakers/languages type definitions
- Test file using shared @livekit/agents-plugins-test harness
- SARVAM_API_KEY added to turbo.json globalEnv
- Calls AudioByteStream.flush() to prevent trailing audio truncation
…peaker names
- Split TTSSpeakers into TTSV2Speakers and TTSV3Speakers types
- Add TTSSampleRates and TTSAudioCodecs types to models.ts
- Rewrite TTS with discriminated union options (TTSV2Options/TTSV3Options)
- V2-specific: pitch, loudness, enablePreprocessing
- V3-specific: temperature
- Extract resolveOptions() and buildRequestBody() for SRP
- Fix speaker names to lowercase (API requires lowercase, not capitalized)
- Export new types from index.ts
AudioByteStream requires raw PCM data, which we obtain by stripping the 44-byte WAV header. Allowing user-configurable outputAudioCodec would produce compressed audio (mp3, opus, etc.) that silently breaks the pipeline. Remove outputAudioCodec from public options and hardcode WAV in the API request.
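The header stripping described above can be sketched as follows. This is a minimal illustration, not the plugin's code: `stripWavHeader` and `WAV_HEADER_BYTES` are hypothetical names, and the sketch assumes the canonical 44-byte RIFF/WAVE header that the commit message mentions (WAV files with extra chunks have longer headers and would need real chunk parsing).

```typescript
// Size of the canonical RIFF/WAVE header assumed by the commit message.
const WAV_HEADER_BYTES = 44;

/**
 * Returns the PCM payload of a WAV buffer by dropping the first 44 bytes.
 * Hypothetical helper for illustration; not taken from the plugin source.
 */
function stripWavHeader(wav: Uint8Array): Uint8Array {
  if (wav.byteLength < WAV_HEADER_BYTES) {
    throw new Error(`WAV buffer too short: ${wav.byteLength} bytes`);
  }
  // Check the "RIFF" magic so a compressed payload (mp3, opus, ...) fails
  // loudly instead of being fed into a PCM-only pipeline.
  const magic = String.fromCharCode(wav[0]!, wav[1]!, wav[2]!, wav[3]!);
  if (magic !== 'RIFF') {
    throw new Error(`expected RIFF header, got "${magic}"`);
  }
  // subarray returns a view over the same memory; no samples are copied.
  return wav.subarray(WAV_HEADER_BYTES);
}
```

The magic-number check is an addition for illustration; it shows why hardcoding WAV in the request is the safer choice, since any other codec would be rejected here rather than silently corrupting audio.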
When updateOptions switches the model (e.g. v2 -> v3), the previous shallow merge kept stale model-specific fields like speaker, pitch, and loudness from the old model. Now delegates to resolveOptions() so model-specific defaults are re-applied correctly.
The spread of ResolvedTTSOptions (model: TTSModels) doesn't satisfy the discriminated union TTSOptions. Cast to TTSOptions before passing to resolveOptions, which handles discrimination internally via isV3 check.
…ioCodecs
- Drop model-specific fields (speaker, pitch, loudness, temperature, enablePreprocessing) when switching models so resolveOptions applies correct defaults for the new model
- Add type assertions for discriminated union compatibility
- Remove unused TTSAudioCodecs type from models.ts
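A rough sketch of the model-switch behaviour these commits describe is shown below. The field names come from the commit messages; everything else is an assumption for illustration: the model identifiers `bulbul:v2` / `bulbul:v3`, the helper `applyDefaults` (standing in for `resolveOptions()`), and all default values are placeholders, not the plugin's real ones.

```typescript
// Assumed model identifiers; the PR only says "Bulbul models" v2 and v3.
type Model = 'bulbul:v2' | 'bulbul:v3';

interface Options {
  model: Model;
  languageCode?: string;
  speaker?: string;
  pitch?: number; // v2 only
  loudness?: number; // v2 only
  enablePreprocessing?: boolean; // v2 only
  temperature?: number; // v3 only
}

// Fields that must not survive a model switch (list taken from the commit).
const MODEL_SPECIFIC = [
  'speaker',
  'pitch',
  'loudness',
  'temperature',
  'enablePreprocessing',
] as const;

// Stand-in for resolveOptions(): placeholder defaults, illustrative only.
function applyDefaults(opts: Options): Options {
  const defaults: Options =
    opts.model === 'bulbul:v3'
      ? { model: 'bulbul:v3', speaker: 'v3-default', temperature: 0.6 }
      : {
          model: 'bulbul:v2',
          speaker: 'v2-default',
          pitch: 0,
          loudness: 1,
          enablePreprocessing: false,
        };
  return { ...defaults, ...opts };
}

function updateOptions(current: Options, patch: Partial<Options>): Options {
  const base: Options = { ...current };
  if (patch.model !== undefined && patch.model !== current.model) {
    // Switching models: drop stale model-specific fields so the new
    // model's defaults are applied instead of the old model's values.
    for (const key of MODEL_SPECIFIC) delete base[key];
  }
  return applyDefaults({ ...base, ...patch });
}
```

With a plain shallow merge, a v2 `pitch` would leak into a v3 request; dropping the fields first is what lets the defaults step behave as if the options had been created fresh for the new model.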
🦋 Changeset detected
Latest commit: 8bcf7fb
The changes in this PR will be included in the next version bump. This PR includes changesets to release 20 packages.
fec3009 to 7df750e
Add speech-to-text support to the Sarvam plugin using the Sarvam AI speech-to-text REST API. Defaults to the recommended saaras:v3 model with support for 22+ Indian languages and 5 transcription modes (transcribe, translate, verbatim, translit, codemix). Also supports the deprecated saarika:v2.5 model for backward compatibility.
7df750e to f3fb024
- Add saaras:v2.5 model with /speech-to-text-translate endpoint routing
- Add STTTranslateOptions with prompt param for translate endpoint
- Model-aware buildFormData: language_code for saarika/saaras-v3, mode for saaras-v3, prompt for saaras-v2.5
- Endpoint routing: saaras:v2.5 → /speech-to-text-translate, others → /speech-to-text
- updateOptions handles model switching across all three models
- Language fallback chain: API language_code → configured → 'unknown'
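The routing and fallback rules in this commit can be sketched roughly as below. The model names, endpoint paths, and the `language_code` / `mode` / `prompt` field names come from the commit message; the function names, the `STTRequestOptions` shape, and the `model` form field are assumptions made for the example.

```typescript
type STTModel = 'saaras:v3' | 'saarika:v2.5' | 'saaras:v2.5';

// Hypothetical option shape for this sketch.
interface STTRequestOptions {
  model: STTModel;
  languageCode?: string;
  mode?: string;
  prompt?: string;
}

// saaras:v2.5 goes to the translate endpoint, everything else to plain STT.
function endpointFor(model: STTModel): string {
  return model === 'saaras:v2.5' ? '/speech-to-text-translate' : '/speech-to-text';
}

// Only send the fields the selected model understands.
function buildFormFields(opts: STTRequestOptions): Record<string, string> {
  const fields: Record<string, string> = { model: opts.model };
  if (opts.model === 'saaras:v2.5') {
    if (opts.prompt) fields['prompt'] = opts.prompt;
    return fields;
  }
  if (opts.languageCode) fields['language_code'] = opts.languageCode;
  if (opts.model === 'saaras:v3' && opts.mode) fields['mode'] = opts.mode;
  return fields;
}

// Fallback chain: API language_code → configured → 'unknown'.
function resolveLanguage(apiLanguageCode?: string | null, configured?: string): string {
  return apiLanguageCode || configured || 'unknown';
}
```

For example, `resolveLanguage(undefined, 'hi-IN')` yields `'hi-IN'`, and `resolveLanguage(null)` yields `'unknown'`.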
- Add SpeechStream class with full WS streaming (sendTask, listenTask, wsMonitor)
- Support both /speech-to-text/ws and /speech-to-text-translate/ws endpoints
- Handle VAD events (START_SPEECH, END_SPEECH) and final transcripts
- Add all WS query params: high_vad_sensitivity, flush_signal, vad_signals
- Add prompt and withTimestamps support for REST endpoints
- Robust error parsing (data.message + data.error + top-level fallbacks)
- Retry loop with linear backoff on disconnect (matches idle timeout behavior)
- end_of_stream includes empty audio field per Sarvam protocol requirement
- Add ws and @types/ws dependencies
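The "robust error parsing" item could look something like the sketch below. Only the three sources (`data.message`, `data.error`, top-level fields) come from the commit message; the function name `extractErrorMessage`, the lookup order, and the `'unknown error'` fallback string are assumptions.

```typescript
// Narrow an unknown value to a plain record, or undefined if it is not one.
function asRecord(value: unknown): Record<string, unknown> | undefined {
  return typeof value === 'object' && value !== null
    ? (value as Record<string, unknown>)
    : undefined;
}

/**
 * Picks the first non-empty string among data.message, data.error and the
 * top-level message/error fields. The order is an assumption for this sketch.
 */
function extractErrorMessage(payload: unknown): string {
  const top = asRecord(payload);
  const data = asRecord(top?.['data']);
  const candidates: unknown[] = [
    data?.['message'],
    data?.['error'],
    top?.['message'],
    top?.['error'],
  ];
  for (const candidate of candidates) {
    if (typeof candidate === 'string' && candidate.length > 0) return candidate;
  }
  return 'unknown error';
}
```

Treating the payload as `unknown` and narrowing step by step keeps the parser from throwing on malformed server messages, which is the point of having fallbacks at all.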
…nection
listenTask was created with this.abortController (the SpeechStream's main controller). When listenTask.cancel() was called in the finally block, it permanently aborted the stream's main signal, causing sendTask to exit immediately on every subsequent WS reconnection (infinite rapid reconnect loop with no audio sent). Fix: remove the shared abortController arg so Task.from() creates its own internal controller. listenTask.cancel() now only aborts that local controller, leaving the stream's main signal intact for reconnection.
…se.all
Task objects are not thenables, so passing wsMonitor directly to Promise.all caused it to resolve immediately, making WS close detection ineffective. The stream would hang on idle timeout instead of triggering the retry loop.
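The pitfall is standard `Promise.all` behaviour: any element without a `then` method is treated as an already-resolved value. The sketch below reproduces it with a stand-in `FakeTask` class (hypothetical, not the `Task` class from `@livekit/agents`) that exposes its promise on `.result`, mirroring the `wsMonitor.result` usage in the review snippet.

```typescript
// Stand-in for a task wrapper that holds a promise but is not itself thenable.
class FakeTask<T> {
  constructor(readonly result: Promise<T>) {}
}

function isThenable(value: unknown): value is PromiseLike<unknown> {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { then?: unknown }).then === 'function'
  );
}

async function demo(): Promise<string[]> {
  const log: string[] = [];
  let closeSocket!: () => void;
  const monitor = new FakeTask(
    new Promise<void>((resolve) => {
      closeSocket = resolve;
    }),
  );

  // Bug: the task object has no `then`, so Promise.all treats it as a plain
  // value and resolves without waiting for the socket to close.
  await Promise.all([monitor]);
  log.push('Promise.all([task]) resolved before close');

  // Fix: pass the underlying promise, which settles only on close.
  setTimeout(closeSocket, 10);
  await Promise.all([monitor.result]);
  log.push('Promise.all([task.result]) resolved after close');
  return log;
}
```

A quick `isThenable(task)` check in a unit test is one way to catch this class of mistake before it shows up as a hang.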
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a55b1ee053
plugins/sarvam/src/stt.ts
Outdated
await Promise.race([
  this.#resetWS.await,
  Promise.all([sendTask(), listenTask.result, wsMonitor.result]),
]);
Cancel stale send loop on websocket reset
In #runWS, Promise.race can resolve via this.#resetWS.await when updateOptions() is called, but sendTask() is a plain async function and is never cancelled. After the finally block closes the websocket, that old send loop can still be blocked on this.input.next() and then consume subsequent audio frames, attempting to write them to a closed socket; this drops user audio right after an option/model change.
plugins/sarvam/src/stt.ts
Outdated
const listenMessage = new Promise<void>((resolve, reject) => {
  ws.on('message', (msg: RawData) => {
    try {
Resolve listener when websocket closes without messages
listenMessage only subscribes to ws.on('message') and only resolves from inside that callback, so a normal shutdown path with no trailing transcript/event message can hang forever. This is reachable when sendTask sets closing = true and cancels wsMonitor (e.g., silent input), because a subsequent socket close then has no code path to settle listenTask, preventing stream completion.
- Add .gitattributes with `* text=auto eol=lf` to enforce LF endings
- Fix sendTask not cancellable on WS reset (session-scoped AbortController)
- Fix listenMessage hanging on WS close without trailing messages
- Normalize all sarvam plugin files from CRLF to LF
- Fix prettier formatting in index.ts and models.ts
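A session-scoped abort for the send loop might be sketched as follows. This is not the plugin's implementation: `sendLoop` and `whenAborted` are hypothetical names, and the sketch only shows the core idea of racing each read against the session's `AbortSignal` so the loop stops writing once the WebSocket session is reset.

```typescript
// Resolves once the signal aborts (immediately if it already has).
function whenAborted(signal: AbortSignal): Promise<'aborted'> {
  return new Promise((resolve) => {
    if (signal.aborted) {
      resolve('aborted');
      return;
    }
    signal.addEventListener('abort', () => resolve('aborted'), { once: true });
  });
}

/**
 * Forwards frames to `send` until the input ends or the session is aborted.
 * Caveat: a read that is already pending when the abort fires still owns the
 * next frame; a real implementation has to hand that read to the next session.
 */
async function sendLoop(
  input: AsyncIterator<Uint8Array>,
  send: (frame: Uint8Array) => void,
  signal: AbortSignal,
): Promise<void> {
  const stop = whenAborted(signal);
  while (true) {
    const next = await Promise.race([input.next(), stop]);
    if (next === 'aborted' || next.done) return;
    send(next.value);
  }
}
```

Usage would be one `AbortController` per WebSocket session: create it when the socket opens, pass `controller.signal` to the loop, and call `controller.abort()` in the session's cleanup path so the stream-level controller stays untouched.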
Prevents stale #speaking=true from suppressing START_OF_SPEECH events after a WS disconnect that occurred mid-speech.
….cancel()
listenTask was created without the parent abort controller, causing a
deadlock when SpeechStream.close() was called — the finally block
couldn't run because Promise.all was stuck waiting for listenTask.
Passing this.abortController to Task.from allows listenTask to exit
when the stream is closed. Removing listenTask.cancel() from the
finally block prevents it from permanently aborting the parent
controller on WS reconnection. Instead, ws.close() triggers the
ws.once('close') handler in listenMessage for clean exit.
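The listener shape described here, settling on `close` as well as on messages, can be sketched as below. `SocketLike` is a minimal hypothetical stand-in for the subset of the `ws` API involved, and this `listenMessage` is an illustrative function rather than the one in `stt.ts`.

```typescript
type Handler = (payload?: string) => void;

// Minimal stand-in for the parts of a WebSocket client used in this sketch.
interface SocketLike {
  on(event: 'message', handler: Handler): void;
  once(event: 'close', handler: Handler): void;
}

/**
 * Resolves when the socket closes, even if no message ever arrives.
 * Rejects if handling a message throws.
 */
function listenMessage(ws: SocketLike, onMessage: (raw: string) => void): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    ws.on('message', (raw) => {
      try {
        onMessage(raw ?? '');
      } catch (err) {
        reject(err);
      }
    });
    // Without this handler, a silent session (no trailing transcript) would
    // leave the promise pending forever and block stream shutdown.
    ws.once('close', () => resolve());
  });
}
```

Because the promise settles on close, the caller can simply call `ws.close()` during cleanup and await the listener, rather than cancelling it through a shared abort controller.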
Resetting retries to 0 after TCP connect meant that if the session immediately failed (e.g. auth error, server rejection), the counter never reached maxRetry, causing an infinite tight loop with 0ms delay.
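The retry accounting can be sketched as a pure function. The linear backoff and the idea of resetting only after sessions longer than 5 s both appear in commits on this PR; the function name `afterSession`, its parameters, and the exact delay formula (`retries * intervalMs`) are assumptions for illustration.

```typescript
// Sessions longer than this are treated as healthy (e.g. ended by an idle
// timeout), so the retry counter starts over. Threshold taken from the PR.
const HEALTHY_SESSION_MS = 5_000;

type RetryDecision = { retries: number; delayMs: number } | 'give-up';

/**
 * Decides what to do after a WebSocket session ends.
 * `retries` is the number of consecutive failed attempts so far.
 */
function afterSession(
  retries: number,
  sessionMs: number,
  maxRetry: number,
  intervalMs: number,
): RetryDecision {
  // Reset based on how long the session lasted, not on TCP connect:
  // a session that fails right after connecting must keep counting up.
  const base = sessionMs > HEALTHY_SESSION_MS ? 0 : retries;
  if (base >= maxRetry) return 'give-up';
  // Linear backoff: 0, intervalMs, 2 * intervalMs, ...
  return { retries: base + 1, delayMs: base * intervalMs };
}
```

With reset-on-connect, `base` would always be 0 for a server that accepts the connection and then rejects the session, which is exactly the 0 ms tight loop described above.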
Escape curly braces in JSDoc comment that TSDoc parser was interpreting as malformed inline tags.
…fer slice
- Reset retry counter only after sessions that ran >5s, distinguishing expected idle-timeout reconnections from persistent connection failures
- Use buffer.slice(byteOffset, byteOffset+byteLength) for AudioByteStream to handle typed array views into pooled/shared ArrayBuffers correctly
- Fix TSDoc comment with unescaped braces
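The buffer-slice point is easy to get wrong, so here is a small sketch of why `byteOffset` and `byteLength` matter. `exactArrayBuffer` is a hypothetical helper name; the `slice(byteOffset, byteOffset + byteLength)` expression is the one named in the commit.

```typescript
/**
 * Copies exactly the bytes covered by `view` into a fresh ArrayBuffer.
 * `view.buffer` alone would expose the whole backing buffer, which is wrong
 * when the view is a window into a pooled or shared allocation.
 */
function exactArrayBuffer(view: Uint8Array): ArrayBuffer {
  return view.buffer.slice(view.byteOffset, view.byteOffset + view.byteLength) as ArrayBuffer;
}

// A pooled buffer of 8 bytes: 0, 1, 2, ..., 7.
const pool = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7]);
// A 4-byte window into the pool covering bytes 2..5.
const chunk = pool.subarray(2, 6);

// Wrong: reinterprets all 8 pooled bytes as 16-bit samples (4 samples).
const wrongSamples = new Int16Array(chunk.buffer);
// Right: only the 4 bytes of the chunk (2 samples).
const rightSamples = new Int16Array(exactArrayBuffer(chunk));
```

Node.js `Buffer` instances are a common source of such views, since small buffers are typically carved out of a shared pool.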
Can you merge main into the branch to resolve some conflicts? Just merged the TTS plugin
Summary
- `_recognize()` for single-shot transcription via `/speech-to-text` and `/speech-to-text-translate`
- `SpeechStream` class with real-time audio streaming via `/speech-to-text/ws` and `/speech-to-text-translate/ws`
- `saaras:v3` model with 22+ Indian languages and 5 transcription modes (`transcribe`, `translate`, `verbatim`, `translit`, `codemix`)
- `saarika:v2.5` (deprecated) and `saaras:v2.5` (Indic-to-English translation with auto language detection)
- … `updateOptions` on model switch)
WebSocket streaming details
- `sendTask` (audio input → base64 JSON), `listenTask` (server messages → SpeechEvents), `wsMonitor` (disconnect detection)
- VAD events: `START_SPEECH`/`END_SPEECH` mapped to `SpeechEventType`
- WS query params: `high_vad_sensitivity`, `flush_signal`, `vad_signals`, `input_audio_codec`
- Error parsing: `data.message`, `data.error`, and top-level fallbacks
- `end_of_stream` includes empty audio field per Sarvam protocol requirement
Files changed
| File | Changes |
| --- | --- |
| `plugins/sarvam/src/stt.ts` | `_recognize()` + `SpeechStream` class with WS streaming |
| `plugins/sarvam/src/stt.test.ts` | `@livekit/agents-plugins-test` harness |
| `plugins/sarvam/src/models.ts` | `STTModels`, `STTModes`, `STTV2Languages`, `STTV3Languages` types |
| `plugins/sarvam/src/index.ts` | |
| `plugins/sarvam/package.json` | `ws` and `@types/ws` dependencies |
| `plugins/sarvam/README.md` | |
Usage
Test plan
- `STT.recognize()` transcribes audio correctly with `saaras:v3`
- `updateOptions()` handles model switching (v3 → v2.5 and back) without leaking model-specific fields
- `pnpm vitest plugins/sarvam` with `SARVAM_API_KEY` set