ElevenLabs server_vad with turn_detection="stt" does not end user turns #5849

@IngLP

Bug Description

When using livekit-plugins-elevenlabs with ElevenLabs Scribe v2 Realtime and server_vad, AgentSession(turn_detection="stt") receives partial and final transcripts, but the user turn is not committed in time because no useful END_OF_SPEECH event is emitted.

In a real call, the ElevenLabs STT stream produced multiple partial/final transcripts for short user utterances, for example:

Can you hear me?
Hello?
Are you receiving audio?

However, the agent did not respond. The transcript only appeared as a user_turn when the participant disconnected, and LiveKit then logged:

skipping user input, speech scheduling is paused

This looks like a plugin/turn-detection integration issue rather than an STT recognition issue: Scribe did hear the user, but the turn stayed open until shutdown.

This is related to, but not the same as, #4087. That issue describes Scribe v2 turns committing very late with local/Silero VAD. This issue is specifically about the new ElevenLabs server_vad path with AgentSession(turn_detection="stt").

Expected Behavior

With ElevenLabs server_vad enabled and AgentSession(turn_detection="stt"), the server-side VAD endpoint should cause LiveKit Agents to commit the user turn promptly, so the LLM/agent can answer as soon as ElevenLabs decides the utterance ended.

Actual Behavior

The ElevenLabs plugin emits partial/interim and committed/final transcript events, but the user turn is not committed promptly. The agent keeps waiting and only sees the accumulated user transcript during shutdown/disconnect.

Likely Cause

From livekit.plugins.elevenlabs.stt in livekit-plugins-elevenlabs==1.5.12:

  • _connect_ws() uses commit_strategy = "vad" when server_vad is configured.
  • _process_stream_event() maps non-empty committed_transcript / committed_transcript_with_timestamps to SpeechEventType.FINAL_TRANSCRIPT and keeps _speaking = True.
  • SpeechEventType.END_OF_SPEECH is only emitted for an empty committed transcript.

With AgentSession(turn_detection="stt") and no local VAD, audio_recognition.py appears to rely on an STT END_OF_SPEECH event to mark _user_turn_committed=True. If ElevenLabs does not send an empty committed transcript after a server-VAD commit, LiveKit never gets the end-of-speech signal and the turn remains open.
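
To make the suspected failure mode concrete, here is a minimal, self-contained model of the mapping described above. It does not import or reproduce the plugin: the enum is a stand-in for SpeechEventType, map_server_message and turn_committed are hypothetical helpers, and the "partial_transcript" message name is an assumption. It only shows that a stream ending in a non-empty committed transcript never yields END_OF_SPEECH under this mapping.

```python
from enum import Enum, auto


class SpeechEventType(Enum):
    """Stand-in for the LiveKit enum; not the real class."""
    INTERIM_TRANSCRIPT = auto()
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


COMMITTED_TYPES = ("committed_transcript", "committed_transcript_with_timestamps")


def map_server_message(message_type: str, text: str) -> list[SpeechEventType]:
    """Hypothetical simplification of the mapping described in this issue."""
    if message_type == "partial_transcript":  # assumed message name
        return [SpeechEventType.INTERIM_TRANSCRIPT]
    if message_type in COMMITTED_TYPES:
        if text:
            # Non-empty commit: final transcript only, speaking state stays on.
            return [SpeechEventType.FINAL_TRANSCRIPT]
        # Only an empty commit produces the end-of-speech signal.
        return [SpeechEventType.END_OF_SPEECH]
    return []


def turn_committed(messages: list[tuple[str, str]]) -> bool:
    """Toy turn tracker: the turn commits only once END_OF_SPEECH is seen."""
    return any(
        SpeechEventType.END_OF_SPEECH in map_server_message(kind, text)
        for kind, text in messages
    )


messages = [
    ("partial_transcript", "Can you"),
    ("partial_transcript", "Can you hear me"),
    ("committed_transcript", "Can you hear me?"),
]
print(turn_committed(messages))  # False: the turn stays open
```

If the server never follows up with an empty committed transcript, nothing in this model ever closes the turn, which matches the behavior observed in the call.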

Reproduction Steps

  1. Create an agent using ElevenLabs Scribe v2 Realtime STT with server_vad configured.
  2. Configure the session to use STT turn detection, without local VAD:
from livekit.agents import AgentSession
from livekit.plugins import elevenlabs

stt = elevenlabs.STT(
    model="scribe_v2_realtime",
    server_vad=elevenlabs.stt.VADOptions(
        vad_threshold=0.4,
        vad_silence_threshold_secs=0.5,
        min_speech_duration_ms=100,
        min_silence_duration_ms=100,
    ),
)

session = AgentSession(
    stt=stt,
    turn_detection="stt",
    # no local VAD
)
  3. Join a voice room/call and speak a few short utterances, e.g. “hello”, “can you hear me?”.
  4. Observe that ElevenLabs produces partial and final transcripts, but the agent does not answer until disconnect/shutdown, if at all.

Package Versions

livekit-agents==1.5.12
livekit-plugins-elevenlabs==1.5.12
Python 3.12

Proposed Direction

For the server_vad / commit_strategy="vad" path, the ElevenLabs plugin may need to translate the server-VAD committed transcript into an end-of-speech signal that LiveKit Agents can use for turn_detection="stt".

For example, after emitting FINAL_TRANSCRIPT for a non-empty server-VAD committed_transcript, the plugin could also emit END_OF_SPEECH (or otherwise mark the STT turn as committed) when the committed transcript represents an endpointed utterance.
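
A rough sketch of that idea, written as a standalone model and not as a patch against the plugin. The enum is a stand-in for SpeechEventType, and map_committed_transcript with its server_vad flag is a hypothetical helper invented for illustration. The real change would have to live in _process_stream_event() and respect whatever state the plugin tracks there.

```python
from enum import Enum, auto


class SpeechEventType(Enum):
    """Stand-in for the LiveKit enum; not the real class."""
    FINAL_TRANSCRIPT = auto()
    END_OF_SPEECH = auto()


def map_committed_transcript(text: str, server_vad: bool) -> list[SpeechEventType]:
    """Hypothetical mapping for a committed transcript message.

    With server VAD, a non-empty commit is treated as an endpointed
    utterance, so END_OF_SPEECH follows FINAL_TRANSCRIPT immediately.
    """
    if not text:
        # Behavior described in this issue: an empty commit ends speech.
        return [SpeechEventType.END_OF_SPEECH]
    events = [SpeechEventType.FINAL_TRANSCRIPT]
    if server_vad:
        # Proposed addition: the server already decided the utterance ended.
        events.append(SpeechEventType.END_OF_SPEECH)
    return events


print([e.name for e in map_committed_transcript("Hello?", server_vad=True)])
# ['FINAL_TRANSCRIPT', 'END_OF_SPEECH']
print([e.name for e in map_committed_transcript("Hello?", server_vad=False)])
# ['FINAL_TRANSCRIPT']
```

Gating the extra event on the server_vad path keeps other commit strategies unchanged, where a committed transcript may not correspond to an endpointed utterance.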

The important part is that turn_detection="stt" should not depend on a local VAD fallback to make ElevenLabs server_vad usable for low-latency agent responses.
