Bug Description
ElevenLabs Scribe V2 Realtime STT via LiveKit Inference produces stt_audio_duration=0.0 (no transcriptions at all), while Deepgram Nova-3 works perfectly with the exact same setup.
I've tested extensively with different initialization approaches and language formats. Audio is definitely being received and processed (confirmed via gain boost logs showing -27 to -30 dB RMS), but ElevenLabs STT never produces any transcription output.
The documentation at https://docs.livekit.io/agents/models/stt/inference/elevenlabs/ shows this should work with the inference.STT() approach.
Expected Behavior
ElevenLabs Scribe V2 Realtime should produce transcriptions similar to Deepgram Nova-3, with non-zero stt_audio_duration in usage metrics and user_input_transcribed events firing when speech is detected.
Reproduction Steps
1. Create an AgentSession with ElevenLabs STT via LiveKit Inference
2. Connect a participant and speak into microphone
3. Observe that no transcriptions are produced (stt_audio_duration=0.0)
4. Switch to Deepgram Nova-3 with same setup
5. Observe that transcriptions work perfectly
**Working Code (Deepgram):**
from livekit.agents import AgentSession, inference

STT_MODEL = "deepgram/nova-3"
STT_LANGUAGE = "en"

session = AgentSession(
    stt=inference.STT(model=STT_MODEL, language=STT_LANGUAGE),
    llm="openai/gpt-4o-mini",
    tts="cartesia/sonic-3:f786b574-daa5-4673-aa0c-cbe3e8534c02",
    vad=vad,  # VAD instance created earlier in the agent setup (not shown)
)
**Failing Code (ElevenLabs):**
Tested 3 variations, changing only the stt argument of the AgentSession above.
Option 1 - String shorthand:
stt="elevenlabs/scribe_v2_realtime:en"
Option 2 - inference.STT with "en":
stt=inference.STT(model="elevenlabs/scribe_v2_realtime", language="en")
Option 3 - inference.STT with "en-US":
stt=inference.STT(model="elevenlabs/scribe_v2_realtime", language="en-US")
All three ElevenLabs variations produce stt_audio_duration=0.0.
Operating System
Linux (Railway deployment)
Models Used
STT: elevenlabs/scribe_v2_realtime (FAILING) / deepgram/nova-3 (WORKING)
LLM: openai/gpt-4o-mini
TTS: cartesia/sonic-3
Package Versions
livekit-agents==1.1.2
livekit-plugins-silero==1.1.2
livekit-plugins-turn-detector==1.1.2
Python 3.11
Session/Room/Call IDs
Room ID: voice-test-7e493f46
Participant: web-tester-7e493f46
Proposed Solution
Unclear - may be an issue with how LiveKit Inference routes requests to ElevenLabs, or a configuration issue on the ElevenLabs integration side.
Additional Context
Key observations from logs:
WORKING (Deepgram):
stt_audio_duration=19.55
Multiple user_input_transcribed events
FAILING (ElevenLabs):
stt_audio_duration=0.0
Zero user_input_transcribed events
Audio IS being received (GAIN_BOOST logs show -27 to -30 dB RMS)
Full usage summary from ElevenLabs test:
UsageSummary(llm_prompt_tokens=178, llm_completion_tokens=9, tts_characters_count=32, tts_audio_duration=1.95, stt_audio_duration=0.0)
Note: LLM and TTS work fine; only STT fails with ElevenLabs.
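For anyone trying to reproduce this, the failure signature in the usage summary can be checked mechanically. The snippet below is a minimal sketch in plain Python that parses a logged UsageSummary line and flags the case where the LLM and TTS did work but the STT reported no audio. The helper names parse_usage_summary and stt_is_silent are hypothetical, not part of livekit-agents, and the parsing assumes the summary is logged in the repr format shown in this report.

```python
import re

# The ElevenLabs summary line, copied verbatim from the logs in this report.
SUMMARY = (
    "UsageSummary(llm_prompt_tokens=178, llm_completion_tokens=9, "
    "tts_characters_count=32, tts_audio_duration=1.95, stt_audio_duration=0.0)"
)


def parse_usage_summary(text: str) -> dict[str, float]:
    """Extract name=number pairs from a logged UsageSummary repr string.

    Hypothetical helper: it works on the log text, not on the library object.
    """
    return {name: float(value) for name, value in re.findall(r"(\w+)=([0-9.]+)", text)}


def stt_is_silent(usage: dict[str, float]) -> bool:
    """Return True when LLM or TTS usage is non-zero but STT audio duration is zero."""
    other_activity = (
        usage.get("llm_prompt_tokens", 0) > 0
        or usage.get("tts_audio_duration", 0) > 0
    )
    return other_activity and usage.get("stt_audio_duration", 0) == 0


usage = parse_usage_summary(SUMMARY)
print(stt_is_silent(usage))
```

Running the same check against a Deepgram session log (stt_audio_duration=19.55) should not flag it, which makes it easy to compare the two providers across repeated test runs.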
Screenshots and Recordings
No response