Bug Description
The ElevenLabs STT implementation passes the sample rate/audio format in the wrong parameter when opening the WSS connection. The correct parameter is audio_format, but the format is currently passed as encoding, a parameter that does not exist. API Reference here.
Expected Behavior
The sample rate/encoding should be passed in the audio_format parameter when opening the WSS connection.
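As a rough illustration of the expected behavior, the connection parameters could be assembled as in the sketch below. The helper name `build_stt_query` and its signature are hypothetical and not part of the plugin; only the parameter names (`model_id`, `audio_format`, `commit_strategy`) and the `pcm_<rate>` value format are taken from this report.

```python
from urllib.parse import urlencode


def build_stt_query(model_id: str, sample_rate: int, server_vad: bool) -> str:
    """Hypothetical helper: build the query string for the realtime STT connection."""
    # Mirrors the plugin's choice between manual and VAD-based commits.
    commit_strategy = "vad" if server_vad else "manual"
    params = {
        "model_id": model_id,
        # The format goes in audio_format, not in the non-existent encoding parameter.
        "audio_format": f"pcm_{sample_rate}",
        "commit_strategy": commit_strategy,
    }
    return urlencode(params)


query = build_stt_query("scribe_v2_realtime", 8000, server_vad=True)
print(query)
```

With `sample_rate=8000`, the resulting query carries `audio_format=pcm_8000` and contains no `encoding` key.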
Reproduction Steps
elevenlabs.STT(model="scribe_v2_realtime", sample_rate=8000)
With sample_rate=8000, pcm_8000 is passed to the non-existent encoding parameter, so the audio format falls back to the default pcm_16000. For VAD-based commits this results in incorrect transcription timestamps, silence length detection, min silence/speech threshold detection, etc.
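To make the timing impact concrete, here is a small arithmetic sketch. It assumes that durations are derived from the sample count divided by the assumed sample rate; that is an assumption about how the mismatch manifests, not a confirmed description of the server's internals. The function name is illustrative only.

```python
def perceived_duration_s(num_samples: int, assumed_rate_hz: int) -> float:
    """Duration implied by a sample count at a given (possibly wrong) sample rate."""
    return num_samples / assumed_rate_hz


actual_rate_hz = 8000    # rate the audio was actually captured at
assumed_rate_hz = 16000  # default used when audio_format is not received

# One second of real audio at 8 kHz.
num_samples = actual_rate_hz * 1

actual = perceived_duration_s(num_samples, actual_rate_hz)
perceived = perceived_duration_s(num_samples, assumed_rate_hz)

print(f"actual: {actual} s, perceived: {perceived} s")
```

Under this assumption, every duration is halved: one second of real silence would be measured as half a second, which would skew silence thresholds and timestamps by the same factor.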
Operating System
macOS Tahoe
Models Used
ElevenLabs Scribe V2 Realtime
Package Versions
livekit-plugins-elevenlabs==1.5.6
Session/Room/Call IDs
No response
Proposed Solution
--- a/livekit-plugins/livekit-plugins-elevenlabs/livekit/plugins/elevenlabs/stt.py
+++ b/livekit-plugins/livekit-plugins-elevenlabs/livekit/plugins/elevenlabs/stt.py
@@ -365,7 +365,6 @@ class SpeechStream(stt.SpeechStream):
"message_type": "input_audio_chunk",
"audio_base_64": "",
"commit": False,
- "sample_rate": self._opts.sample_rate,
}
)
)
@@ -403,7 +402,6 @@ class SpeechStream(stt.SpeechStream):
"message_type": "input_audio_chunk",
"audio_base_64": audio_b64,
"commit": False,
- "sample_rate": self._opts.sample_rate,
}
)
)
@@ -484,7 +482,7 @@ class SpeechStream(stt.SpeechStream):
commit_strategy = "manual" if self._opts.server_vad is None else "vad"
params = [
f"model_id={self._opts.model_id}",
- f"encoding=pcm_{self._opts.sample_rate}",
+ f"audio_format=pcm_{self._opts.sample_rate}",
f"commit_strategy={commit_strategy}",
]
Additional Context
No response
Screenshots and Recordings
No response