Bug Description
When using livekit-plugins-elevenlabs with ElevenLabs Scribe v2 Realtime and server_vad, AgentSession(turn_detection="stt") receives partial and final transcripts, but the user turn is not committed in time because no useful END_OF_SPEECH event is emitted.
In a real call, the ElevenLabs STT stream produced multiple partial/final transcripts for short user utterances, for example:
Can you hear me?
Hello?
Are you receiving audio?
However, the agent did not respond. The transcript only appeared as a user_turn when the participant disconnected, and LiveKit then logged:
skipping user input, speech scheduling is paused
This looks like a plugin/turn-detection integration issue rather than an STT recognition issue: Scribe did hear the user, but the turn stayed open until shutdown.
Related but not exactly the same as #4087. That issue describes Scribe v2 turns committing very late with local/Silero VAD. This issue is specifically about the new ElevenLabs server_vad path with AgentSession(turn_detection="stt").
Expected Behavior
With ElevenLabs server_vad enabled and AgentSession(turn_detection="stt"), the server-side VAD endpoint should cause LiveKit Agents to commit the user turn promptly, so the LLM/agent can answer as soon as ElevenLabs decides the utterance ended.
Actual Behavior
The ElevenLabs plugin emits partial/interim and committed/final transcript events, but the user turn is not committed promptly. The agent keeps waiting and only sees the accumulated user transcript during shutdown/disconnect.
Likely Cause
From livekit.plugins.elevenlabs.stt in livekit-plugins-elevenlabs==1.5.12:
- _connect_ws() uses commit_strategy = "vad" when server_vad is configured.
- _process_stream_event() maps non-empty committed_transcript / committed_transcript_with_timestamps to SpeechEventType.FINAL_TRANSCRIPT and keeps _speaking = True.
- SpeechEventType.END_OF_SPEECH is only emitted for an empty committed transcript.
With AgentSession(turn_detection="stt") and no local VAD, audio_recognition.py appears to rely on an STT END_OF_SPEECH event to mark _user_turn_committed=True. If ElevenLabs does not send an empty committed transcript after a server-VAD commit, LiveKit never gets the end-of-speech signal and the turn remains open.
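The suspected flow can be modelled with a small self-contained sketch. It does not import the plugin: EventType, map_committed_transcript and TurnTracker are hypothetical stand-ins that mirror the behaviour described above, not the actual code in livekit.plugins.elevenlabs.stt or audio_recognition.py.

```python
from enum import Enum, auto


class EventType(Enum):
    # Hypothetical stand-in for SpeechEventType; only the two members used here.
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


def map_committed_transcript(text: str) -> list[EventType]:
    """Model of the mapping described above (assumed, not copied from the plugin)."""
    if text.strip():
        # Non-empty commit: a final transcript is emitted, speaking state stays on.
        return [EventType.FINAL_TRANSCRIPT]
    # Empty commit: the only path that produces END_OF_SPEECH.
    return [EventType.END_OF_SPEECH]


class TurnTracker:
    """Model of a consumer that commits the user turn only on END_OF_SPEECH."""

    def __init__(self) -> None:
        self.transcript: list[str] = []
        self.committed = False

    def feed(self, text: str) -> None:
        for event in map_committed_transcript(text):
            if event is EventType.FINAL_TRANSCRIPT:
                self.transcript.append(text)
            elif event is EventType.END_OF_SPEECH:
                self.committed = True


tracker = TurnTracker()
for utterance in ["Can you hear me?", "Hello?", "Are you receiving audio?"]:
    tracker.feed(utterance)

# Transcripts accumulate, but nothing ever commits the turn.
print(tracker.transcript, tracker.committed)
```

In this model the turn only closes if an empty commit arrives, which matches the symptom of the transcript surfacing only at disconnect.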
Reproduction Steps
- Create an agent using ElevenLabs Scribe v2 Realtime STT with server_vad configured.
- Configure the session to use STT turn detection, without local VAD:
    from livekit.agents import AgentSession
    from livekit.plugins import elevenlabs

    stt = elevenlabs.STT(
        model="scribe_v2_realtime",
        server_vad=elevenlabs.stt.VADOptions(
            vad_threshold=0.4,
            vad_silence_threshold_secs=0.5,
            min_speech_duration_ms=100,
            min_silence_duration_ms=100,
        ),
    )

    session = AgentSession(
        stt=stt,
        turn_detection="stt",
        # no local VAD
    )
- Join a voice room/call and speak a few short utterances, e.g. “hello”, “can you hear me?”.
- Observe that ElevenLabs produces partial and final transcripts, but the agent does not answer until disconnect/shutdown, if at all.
Package Versions
livekit-agents==1.5.12
livekit-plugins-elevenlabs==1.5.12
Python 3.12
Proposed Direction
For the server_vad / commit_strategy="vad" path, the ElevenLabs plugin may need to translate the server-VAD committed transcript into an end-of-speech signal that LiveKit Agents can use for turn_detection="stt".
For example, after emitting FINAL_TRANSCRIPT for a non-empty server-VAD committed_transcript, the plugin could also emit END_OF_SPEECH (or otherwise mark the STT turn as committed) when the committed transcript represents an endpointed utterance.
The important part is that turn_detection="stt" should not depend on a local VAD fallback to make ElevenLabs server_vad usable for low-latency agent responses.
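As a rough illustration of this direction, the sketch below shows one possible event mapping. It is a standalone model under stated assumptions, not a patch: EventType and map_committed_transcript_proposed are hypothetical names, and the server_vad flag stands in for whatever the plugin uses to know that commit_strategy="vad" is active.

```python
from enum import Enum, auto


class EventType(Enum):
    # Hypothetical stand-in for SpeechEventType; only the two members used here.
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


def map_committed_transcript_proposed(text: str, server_vad: bool) -> list[EventType]:
    """Hypothetical mapping: a server-VAD commit also ends the speech segment."""
    if not text.strip():
        # Empty commit keeps its existing meaning.
        return [EventType.END_OF_SPEECH]
    events = [EventType.FINAL_TRANSCRIPT]
    if server_vad:
        # Assumption: with server VAD, a non-empty commit marks an endpointed
        # utterance, so the end-of-speech signal follows immediately.
        events.append(EventType.END_OF_SPEECH)
    return events


print(map_committed_transcript_proposed("Can you hear me?", server_vad=True))
print(map_committed_transcript_proposed("Can you hear me?", server_vad=False))
```

Gating on the server-VAD path keeps other commit strategies unchanged, which is why the flag is passed explicitly in this sketch.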