feat: add Sarvam AI STT plugin #1046

Merged
toubatbrian merged 23 commits into livekit:main from mshivam019:feat/sarvam-stt-plugin
Feb 14, 2026

@mshivam019 mshivam019 commented Feb 12, 2026

Summary

  • Adds speech-to-text (STT) support to the Sarvam AI plugin with both REST and WebSocket streaming
  • REST: _recognize() for single-shot transcription via /speech-to-text and /speech-to-text-translate
  • WebSocket streaming: SpeechStream class with real-time audio streaming via /speech-to-text/ws and /speech-to-text-translate/ws
  • Defaults to the recommended saaras:v3 model with 22+ Indian languages and 5 transcription modes (transcribe, translate, verbatim, translit, codemix)
  • Also supports saarika:v2.5 (deprecated) and saaras:v2.5 (Indic-to-English translation with auto language detection)
  • Follows the same discriminated-union option pattern as the existing TTS plugin (model-specific types, per-model defaults, safe updateOptions on model switch)
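The discriminated-union option pattern mentioned in the last bullet can be sketched roughly as follows. This is a minimal illustration and not the plugin's actual code: the type names other than STTTranslateOptions, the input shape, the 'en-IN' default, and the bodies of resolveOptions and updateOptions are all assumptions made for the example.

```typescript
// Hypothetical option shapes, discriminated by the `model` field.
type STTMode = 'transcribe' | 'translate' | 'verbatim' | 'translit' | 'codemix';
type STTV3Options = { model: 'saaras:v3'; languageCode: string; mode: STTMode };
type STTV2Options = { model: 'saarika:v2.5'; languageCode: string };
type STTTranslateOptions = { model: 'saaras:v2.5'; prompt?: string };
type STTOptions = STTV3Options | STTV2Options | STTTranslateOptions;

// Loose input type: every field is optional until a model is chosen.
type STTOptionsInput = {
  model?: STTOptions['model'];
  languageCode?: string;
  mode?: STTMode;
  prompt?: string;
};

// Apply per-model defaults and keep only the fields that model understands.
function resolveOptions(input: STTOptionsInput): STTOptions {
  const model = input.model ?? 'saaras:v3';
  switch (model) {
    case 'saaras:v3':
      return {
        model,
        languageCode: input.languageCode ?? 'en-IN', // assumed default
        mode: input.mode ?? 'transcribe',
      };
    case 'saarika:v2.5':
      return { model, languageCode: input.languageCode ?? 'en-IN' };
    case 'saaras:v2.5':
      return input.prompt === undefined ? { model } : { model, prompt: input.prompt };
  }
}

// On a model switch, drop model-specific fields so stale values cannot leak.
function updateOptions(current: STTOptions, patch: STTOptionsInput): STTOptions {
  const switching = patch.model !== undefined && patch.model !== current.model;
  if (switching) {
    const carried = 'languageCode' in current ? { languageCode: current.languageCode } : {};
    return resolveOptions({ ...carried, ...patch });
  }
  return resolveOptions({ ...current, ...patch });
}
```

Switching from saaras:v3 to saaras:v2.5 through this updateOptions yields an object with no mode field, which is the "no leaking model-specific fields" property the test plan checks for.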

WebSocket streaming details

  • Tri-task architecture: sendTask (audio input → base64 JSON), listenTask (server messages → SpeechEvents), wsMonitor (disconnect detection)
  • VAD events: START_SPEECH / END_SPEECH mapped to SpeechEventType
  • All WS query params supported: high_vad_sensitivity, flush_signal, vad_signals, input_audio_codec
  • Robust error parsing: handles data.message, data.error, and top-level fallbacks
  • Retry loop with linear backoff on idle timeout (~20s server-side disconnect)
  • end_of_stream includes empty audio field per Sarvam protocol requirement
  • Matches the Python SDK's reconnect-on-idle approach (no keepalive needed)
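As an illustration of the error-parsing bullet above, here is a small sketch of a fallback chain that checks data.message, then data.error, then top-level fields. The payload shapes and the function name extractErrorMessage are assumptions for the example; the server's actual error formats are not shown in this PR description.

```typescript
// Pull a human-readable message out of an error payload of unknown shape.
// Lookup order: data.message, data.error, top-level message, top-level error.
function extractErrorMessage(payload: unknown): string {
  if (typeof payload === 'string' && payload.length > 0) return payload;
  if (payload === null || typeof payload !== 'object') return 'Unknown error';

  const top = payload as Record<string, unknown>;
  const nested = top['data'];
  const data: Record<string, unknown> =
    nested !== null && typeof nested === 'object' ? (nested as Record<string, unknown>) : {};

  const candidates = [data['message'], data['error'], top['message'], top['error']];
  for (const candidate of candidates) {
    if (typeof candidate === 'string' && candidate.length > 0) return candidate;
  }
  return 'Unknown error';
}
```

Accepting unknown and narrowing step by step keeps the parser from throwing on malformed messages, which matters inside a listener loop.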

Files changed

| File | Change |
| --- | --- |
| plugins/sarvam/src/stt.ts | STT class with _recognize() + SpeechStream class with WS streaming |
| plugins/sarvam/src/stt.test.ts | Unit tests using @livekit/agents-plugins-test harness |
| plugins/sarvam/src/models.ts | Added STTModels, STTModes, STTV2Languages, STTV3Languages types |
| plugins/sarvam/src/index.ts | Export STT, SpeechStream, and option types |
| plugins/sarvam/package.json | Added ws and @types/ws dependencies |
| plugins/sarvam/README.md | Updated with STT usage, language list, and mode reference |

Usage

import * as sarvam from '@livekit/agents-plugin-sarvam';

// REST (single-shot)
const stt = new sarvam.STT({
  model: 'saaras:v3',
  languageCode: 'en-IN',
  mode: 'transcribe',
});

// WebSocket streaming (real-time)
const stream = stt.stream();

Test plan

  • Verify STT.recognize() transcribes audio correctly with saaras:v3
  • Verify updateOptions() handles model switching (v3 → v2.5 and back) without leaking model-specific fields
  • Verify WS streaming receives transcripts and VAD events in real-time
  • Verify WS reconnects gracefully on idle timeout
  • Verify error parsing handles all server error formats
  • Run pnpm vitest plugins/sarvam with SARVAM_API_KEY set

mshivam019 and others added 7 commits February 10, 2026 12:37
Add @livekit/agents-plugin-sarvam with text-to-speech support using
Sarvam AI's Bulbul models. Supports 11 Indian languages and 45+ speaker
voices via the Sarvam REST API.

- TTS and ChunkedStream classes following existing plugin patterns
- Models/speakers/languages type definitions
- Test file using shared @livekit/agents-plugins-test harness
- SARVAM_API_KEY added to turbo.json globalEnv
- Calls AudioByteStream.flush() to prevent trailing audio truncation
…peaker names

- Split TTSSpeakers into TTSV2Speakers and TTSV3Speakers types
- Add TTSSampleRates and TTSAudioCodecs types to models.ts
- Rewrite TTS with discriminated union options (TTSV2Options/TTSV3Options)
- V2-specific: pitch, loudness, enablePreprocessing
- V3-specific: temperature
- Extract resolveOptions() and buildRequestBody() for SRP
- Fix speaker names to lowercase (API requires lowercase, not capitalized)
- Export new types from index.ts

AudioByteStream requires raw PCM data, which we obtain by stripping
the 44-byte WAV header. Allowing user-configurable outputAudioCodec
would produce compressed audio (mp3, opus, etc.) that silently breaks
the pipeline. Remove outputAudioCodec from public options and hardcode
WAV in the API request.
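For illustration, stripping a fixed 44-byte header might look like the sketch below. The helper name stripWavHeader is hypothetical. The 44-byte figure only holds for a canonical PCM WAV file with no extra chunks before the data chunk, so this sketch assumes the API always returns that layout.

```typescript
const WAV_HEADER_BYTES = 44; // canonical RIFF/WAVE PCM header size

// Return the raw PCM payload of a canonical WAV file as a view (no copy).
function stripWavHeader(wav: Uint8Array): Uint8Array {
  if (wav.byteLength < WAV_HEADER_BYTES) {
    throw new Error('Buffer is too short to contain a WAV header');
  }
  // Sanity-check the "RIFF" magic bytes before trusting the fixed offset.
  const isRiff = wav[0] === 0x52 && wav[1] === 0x49 && wav[2] === 0x46 && wav[3] === 0x46;
  if (!isRiff) {
    throw new Error('Buffer does not start with a RIFF header');
  }
  return wav.subarray(WAV_HEADER_BYTES);
}
```

A compressed response (mp3, opus) would fail the magic-byte check here instead of silently feeding non-PCM bytes downstream.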
When updateOptions switches the model (e.g. v2 -> v3), the previous
shallow merge kept stale model-specific fields like speaker, pitch,
and loudness from the old model. Now delegates to resolveOptions()
so model-specific defaults are re-applied correctly.

The spread of ResolvedTTSOptions (model: TTSModels) doesn't satisfy the
discriminated union TTSOptions. Cast to TTSOptions before passing to
resolveOptions, which handles discrimination internally via isV3 check.
…ioCodecs

- Drop model-specific fields (speaker, pitch, loudness, temperature,
  enablePreprocessing) when switching models so resolveOptions applies
  correct defaults for the new model
- Add type assertions for discriminated union compatibility
- Remove unused TTSAudioCodecs type from models.ts

changeset-bot bot commented Feb 12, 2026

🦋 Changeset detected

Latest commit: 8bcf7fb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 20 packages
Name Type
@livekit/agents-plugin-sarvam Patch
@livekit/agents Patch
@livekit/agents-plugin-anam Patch
@livekit/agents-plugin-baseten Patch
@livekit/agents-plugin-bey Patch
@livekit/agents-plugin-cartesia Patch
@livekit/agents-plugin-deepgram Patch
@livekit/agents-plugin-elevenlabs Patch
@livekit/agents-plugin-google Patch
@livekit/agents-plugin-hedra Patch
@livekit/agents-plugin-inworld Patch
@livekit/agents-plugin-lemonslice Patch
@livekit/agents-plugin-livekit Patch
@livekit/agents-plugin-neuphonic Patch
@livekit/agents-plugin-openai Patch
@livekit/agents-plugin-resemble Patch
@livekit/agents-plugin-rime Patch
@livekit/agents-plugin-silero Patch
@livekit/agents-plugin-xai Patch
@livekit/agents-plugins-test Patch

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 potential issue.

View 6 additional findings in Devin Review.

@mshivam019 mshivam019 force-pushed the feat/sarvam-stt-plugin branch 3 times, most recently from fec3009 to 7df750e Compare February 12, 2026 11:45
Add speech-to-text support to the Sarvam plugin using the Sarvam AI
speech-to-text REST API. Defaults to the recommended saaras:v3 model
with support for 22+ Indian languages and 5 transcription modes
(transcribe, translate, verbatim, translit, codemix). Also supports
the deprecated saarika:v2.5 model for backward compatibility.
@mshivam019 mshivam019 force-pushed the feat/sarvam-stt-plugin branch from 7df750e to f3fb024 Compare February 12, 2026 11:47
- Add saaras:v2.5 model with /speech-to-text-translate endpoint routing
- Add STTTranslateOptions with prompt param for translate endpoint
- Model-aware buildFormData: language_code for saarika/saaras-v3,
  mode for saaras-v3, prompt for saaras-v2.5
- Endpoint routing: saaras:v2.5 → /speech-to-text-translate,
  others → /speech-to-text
- updateOptions handles model switching across all three models
- Language fallback chain: API language_code → configured → 'unknown'
- Add SpeechStream class with full WS streaming (sendTask, listenTask, wsMonitor)
- Support both /speech-to-text/ws and /speech-to-text-translate/ws endpoints
- Handle VAD events (START_SPEECH, END_SPEECH) and final transcripts
- Add all WS query params: high_vad_sensitivity, flush_signal, vad_signals
- Add prompt and withTimestamps support for REST endpoints
- Robust error parsing (data.message + data.error + top-level fallbacks)
- Retry loop with linear backoff on disconnect (matches idle timeout behavior)
- end_of_stream includes empty audio field per Sarvam protocol requirement
- Add ws and @types/ws dependencies
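The endpoint routing and language fallback chain described in these commits can be sketched as below. The paths and the 'unknown' fallback come from the commit messages; the function names and the STTModel type alias are hypothetical, and the base URL is deliberately left out because it is not stated here.

```typescript
type STTModel = 'saaras:v3' | 'saarika:v2.5' | 'saaras:v2.5';

// saaras:v2.5 goes to the translate endpoint; every other model uses /speech-to-text.
function restPath(model: STTModel): string {
  return model === 'saaras:v2.5' ? '/speech-to-text-translate' : '/speech-to-text';
}

// The streaming endpoints mirror the REST ones with a /ws suffix.
function wsPath(model: STTModel): string {
  return `${restPath(model)}/ws`;
}

// Fallback chain: language reported by the API, then the configured one, then 'unknown'.
function resolveLanguage(apiLanguageCode?: string | null, configured?: string): string {
  return apiLanguageCode || configured || 'unknown';
}
```

Using `||` rather than `??` in resolveLanguage means an empty string from the API also falls through to the configured language.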

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 16 additional findings in Devin Review.

…nection

listenTask was created with this.abortController (the SpeechStream's
main controller). When listenTask.cancel() was called in the finally
block, it permanently aborted the stream's main signal, causing
sendTask to exit immediately on every subsequent WS reconnection
(infinite rapid reconnect loop with no audio sent).

Fix: remove the shared abortController arg so Task.from() creates its
own internal controller. listenTask.cancel() now only aborts that
local controller, leaving the stream's main signal intact for
reconnection.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 18 additional findings in Devin Review.

…se.all

Task objects are not thenables — passing wsMonitor directly to Promise.all
caused it to resolve immediately, making WS close detection ineffective.
The stream would hang on idle timeout instead of triggering the retry loop.
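The pitfall is easy to reproduce in isolation. The sketch below uses a hypothetical FakeTask class that, like the Task objects described above, exposes a result promise but has no then method; it is not the real Task from @livekit/agents.

```typescript
// Stand-in for a task wrapper that is NOT a thenable: it only exposes `.result`.
class FakeTask<T> {
  readonly result: Promise<T>;
  constructor(fn: () => Promise<T>) {
    this.result = fn();
  }
}

const sleep = (ms: number): Promise<void> => new Promise((resolve) => setTimeout(resolve, ms));

async function demo(): Promise<string[]> {
  const log: string[] = [];
  const task = new FakeTask(async () => {
    await sleep(20);
    log.push('task finished');
  });

  // A non-thenable value is passed through as-is, so this does not wait for the task.
  await Promise.all([task]);
  log.push('Promise.all([task]) resolved');

  // Awaiting the underlying promise is what actually waits for completion.
  await Promise.all([task.result]);
  log.push('Promise.all([task.result]) resolved');

  return log;
}
```

Running demo() logs the first Promise.all as resolved before the task finishes, which is why the fix passes wsMonitor.result instead of wsMonitor.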
@toubatbrian

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a55b1ee053

Comment on lines 693 to 696
await Promise.race([
  this.#resetWS.await,
  Promise.all([sendTask(), listenTask.result, wsMonitor.result]),
]);

P1: Cancel stale send loop on websocket reset

In #runWS, Promise.race can resolve via this.#resetWS.await when updateOptions() is called, but sendTask() is a plain async function and is never cancelled. After the finally block closes the websocket, that old send loop can still be blocked on this.input.next() and then consume subsequent audio frames, attempting to write them to a closed socket; this drops user audio right after an option/model change.

Comment on lines 610 to 612
const listenMessage = new Promise<void>((resolve, reject) => {
  ws.on('message', (msg: RawData) => {
    try {

P1: Resolve listener when websocket closes without messages

listenMessage only subscribes to ws.on('message') and only resolves from inside that callback, so a normal shutdown path with no trailing transcript/event message can hang forever. This is reachable when sendTask sets closing = true and cancels wsMonitor (e.g., silent input), because a subsequent socket close then has no code path to settle listenTask, preventing stream completion.
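One way to address this kind of hang is to settle the listener promise on close and error as well as from inside the message callback. The sketch below uses Node's EventEmitter as a stand-in for the socket (the ws WebSocket class is also an EventEmitter); the function name listenUntilClosed and the string message type are assumptions, and this is not the code that was merged.

```typescript
import { EventEmitter } from 'node:events';

// Resolve when the socket closes, reject on a socket error or a handler failure.
function listenUntilClosed(
  socket: EventEmitter,
  onMessage: (raw: string) => void,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const handleMessage = (raw: string): void => {
      try {
        onMessage(raw);
      } catch (err) {
        cleanup();
        reject(err);
      }
    };
    const handleClose = (): void => {
      cleanup();
      resolve(); // a close with no trailing message still settles the promise
    };
    const handleError = (err: Error): void => {
      cleanup();
      reject(err);
    };
    // Remove every listener so a reconnect does not accumulate stale handlers.
    const cleanup = (): void => {
      socket.off('message', handleMessage);
      socket.off('close', handleClose);
      socket.off('error', handleError);
    };

    socket.on('message', handleMessage);
    socket.once('close', handleClose);
    socket.once('error', handleError);
  });
}
```

With this shape, a silent session that ends in a plain close resolves immediately instead of waiting for a message that never arrives.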

- Add .gitattributes with `* text=auto eol=lf` to enforce LF endings
- Fix sendTask not cancellable on WS reset (session-scoped AbortController)
- Fix listenMessage hanging on WS close without trailing messages
- Normalize all sarvam plugin files from CRLF to LF
- Fix prettier formatting in index.ts and models.ts

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 20 additional findings in Devin Review.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 24 additional findings in Devin Review.

Prevents stale #speaking=true from suppressing START_OF_SPEECH
events after a WS disconnect that occurred mid-speech.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 27 additional findings in Devin Review.

….cancel()

listenTask was created without the parent abort controller, causing a
deadlock when SpeechStream.close() was called — the finally block
couldn't run because Promise.all was stuck waiting for listenTask.

Passing this.abortController to Task.from allows listenTask to exit
when the stream is closed. Removing listenTask.cancel() from the
finally block prevents it from permanently aborting the parent
controller on WS reconnection. Instead, ws.close() triggers the
ws.once('close') handler in listenMessage for clean exit.
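The underlying tension in this and the earlier listenTask commit is that a per-connection task must stop when the stream closes, yet cancelling it must not abort the stream itself. A generic way to get both properties is a child AbortController linked one-way to the parent signal. The helper below, linkedController, is a hypothetical sketch using only the standard AbortController API; it does not describe how Task.from is implemented.

```typescript
// Create a child controller that aborts when `parent` aborts,
// while aborting the child leaves the parent untouched.
function linkedController(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
    return child;
  }
  const onParentAbort = (): void => child.abort();
  parent.addEventListener('abort', onParentAbort, { once: true });
  // Detach from the parent once the child is done, to avoid listener build-up
  // across many reconnections.
  child.signal.addEventListener(
    'abort',
    () => parent.removeEventListener('abort', onParentAbort),
    { once: true },
  );
  return child;
}
```

Each WebSocket session would get its own child controller: ending one session aborts only that child, and closing the stream aborts whichever child is currently active.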

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 31 additional findings in Devin Review.

Resetting retries to 0 after TCP connect meant that if the session
immediately failed (e.g. auth error, server rejection), the counter
never reached maxRetry, causing an infinite tight loop with 0ms delay.
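A sketch of a retry policy that avoids this loop is shown below. It resets the counter only after a session has run for more than 5 seconds, which is the threshold a later commit in this PR describes, and uses a linear backoff. The function names, the RetryState shape, and the 1-second backoff step are assumptions for illustration.

```typescript
interface RetryState {
  retries: number;
}

const MIN_HEALTHY_SESSION_MS = 5_000; // threshold taken from the later commit message

// Decide the next counter value once a session has ended.
function nextRetryState(state: RetryState, sessionDurationMs: number): RetryState {
  // A long-lived session that ended is treated as an expected idle timeout.
  if (sessionDurationMs > MIN_HEALTHY_SESSION_MS) return { retries: 0 };
  // A short-lived session (auth error, server rejection) counts as a failure.
  return { retries: state.retries + 1 };
}

// Linear backoff: the delay grows by a fixed step per consecutive failure.
function backoffDelayMs(retries: number, stepMs = 1_000): number {
  return retries * stepMs;
}

function shouldGiveUp(retries: number, maxRetry: number): boolean {
  return retries >= maxRetry;
}
```

Because the reset depends on how long the session lasted and not on whether the TCP connect succeeded, a server that accepts the connection and then rejects it immediately still drives the counter up to maxRetry.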
Escape curly braces in JSDoc comment that TSDoc parser was
interpreting as malformed inline tags.

@devin-ai-integration devin-ai-integration bot left a comment

Devin Review found 1 new potential issue.

View 32 additional findings in Devin Review.

…fer slice

- Reset retry counter only after sessions that ran >5s, distinguishing
  expected idle-timeout reconnections from persistent connection failures
- Use buffer.slice(byteOffset, byteOffset+byteLength) for AudioByteStream
  to handle typed array views into pooled/shared ArrayBuffers correctly
- Fix TSDoc comment with unescaped braces
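The buffer-slice point is worth a small illustration: a typed array can be a window into a larger ArrayBuffer, so passing view.buffer directly hands over bytes that do not belong to the view. The helper name exactBytes below is hypothetical.

```typescript
// Copy exactly the bytes a typed-array view covers, ignoring the rest of its backing buffer.
function exactBytes(view: ArrayBufferView): ArrayBuffer {
  const { buffer, byteOffset, byteLength } = view;
  return buffer.slice(byteOffset, byteOffset + byteLength) as ArrayBuffer;
}

// A 16-byte pool holding the byte values 0..15.
const pool = new ArrayBuffer(16);
new Uint8Array(pool).set(Array.from({ length: 16 }, (_, i) => i));

// A view over bytes 4..7 of the pool (two 16-bit samples).
const frame = new Int16Array(pool, 4, 2);

const wholePoolLength = frame.buffer.byteLength; // 16: includes unrelated bytes
const frameBytes = new Uint8Array(exactBytes(frame)); // bytes 4, 5, 6, 7 only
```

Here frame.buffer is the full 16-byte pool, while exactBytes(frame) returns a 4-byte copy that matches the view.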
@toubatbrian

Can you merge main into the branch to resolve some conflicts? Just merged the TTS plugin

@toubatbrian toubatbrian merged commit 47251f4 into livekit:main Feb 14, 2026
4 checks passed
@github-actions github-actions bot mentioned this pull request Feb 14, 2026