Allow decoupling user_state source from VAD when STT emits speech events #5580

@miguelmoralai

Summary

When both VAD and a streaming STT are configured on AgentSession, VAD unconditionally drives the user_state machine, even when turn_detection="stt". The only way to make STT drive user_state is to pass no VAD at all, which forces users to give up VAD-based interruption sensing. These two concerns are conceptually independent and could be controlled separately.

Verified on livekit-agents 1.5.5.

Current behavior

In livekit/agents/voice/audio_recognition.py, the VAD and STT paths set self._speaking asymmetrically.

VAD path is unconditional:

async def _on_vad_event(self, ev):
    if ev.type == vad.VADEventType.START_OF_SPEECH:
        ...
        self._speaking = True
    elif ev.type == vad.VADEventType.END_OF_SPEECH:
        ...
        self._speaking = False

STT path is gated on turn_detection_mode:

elif ev.type == stt.SpeechEventType.START_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = True
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = False

Both paths funnel through the same on_start_of_speech / on_end_of_speech hook in agent_activity.py, which calls _session._update_user_state(...). There is no public configuration to suppress VAD's contribution to user_state while keeping VAD active for interruption detection.
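The control flow described above can be reproduced in a toy model. This is a simplified sketch, not the library's code: `ToyHooks`, `ToyRecognition`, the string event names, the `source` argument, and the `"listening"` idle state are all hypothetical stand-ins chosen so the example runs without livekit-agents.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHooks:
    """Hypothetical stand-in for the hooks implemented in agent_activity.py."""

    user_state: str = "listening"
    history: list = field(default_factory=list)

    def on_start_of_speech(self, source: str) -> None:
        # Both paths end up here, so either one can flip user_state.
        self.user_state = "speaking"
        self.history.append((source, "speaking"))

    def on_end_of_speech(self, source: str) -> None:
        self.user_state = "listening"
        self.history.append((source, "listening"))


class ToyRecognition:
    """Simplified model of the two event paths (not the real class)."""

    def __init__(self, hooks: ToyHooks, turn_detection_mode: str) -> None:
        self._hooks = hooks
        self._turn_detection_mode = turn_detection_mode
        self._speaking = False

    def on_vad_event(self, ev_type: str) -> None:
        # VAD path: no check on turn_detection_mode.
        if ev_type == "start_of_speech":
            self._speaking = True
            self._hooks.on_start_of_speech("vad")
        elif ev_type == "end_of_speech":
            self._speaking = False
            self._hooks.on_end_of_speech("vad")

    def on_stt_event(self, ev_type: str) -> None:
        # STT path: only acts when turn detection is STT-driven.
        if self._turn_detection_mode != "stt":
            return
        if ev_type == "start_of_speech":
            self._speaking = True
            self._hooks.on_start_of_speech("stt")
        elif ev_type == "end_of_speech":
            self._speaking = False
            self._hooks.on_end_of_speech("stt")


hooks = ToyHooks()
rec = ToyRecognition(hooks, turn_detection_mode="stt")
rec.on_vad_event("start_of_speech")  # VAD fires; STT has emitted nothing
print(hooks.user_state)
```

In this model the final line prints `speaking` even though the mode is `"stt"` and no STT event arrived, which is the behavior this issue is about.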

Why this matters

In noisy environments, Silero VAD can flip user_state to speaking on background noise even when STT emits no transcript. This breaks downstream behaviors that key off user_state.

The available workaround is to pass vad=None. That stops the false user_state flips, but it also disables VAD-based interruption detection. Users then have to rely on STT interim transcripts as the interruption signal, which adds noticeable latency compared to VAD's min_duration shortcut.

Proposed solution

Either of these would solve it cleanly. The first is the smaller change.

Option A — symmetric gate on the VAD path. Apply a turn_detection_mode gate on the VAD branch so it stops writing _speaking when the user has explicitly opted into STT-driven turn detection. This is a 2-line change in _on_vad_event. It mirrors what the STT branch already does.
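A minimal sketch of what the Option A gate could look like, under stated assumptions: `VADEventType` here is a local stand-in for `vad.VADEventType`, `GatedRecognition` and `vad_events_seen` are hypothetical, and the handler is synchronous for brevity. The real `_on_vad_event` contains logic elided in the excerpt above, so where exactly the gate belongs relative to the interruption handling is not something this sketch can show.

```python
import enum


class VADEventType(enum.Enum):
    """Local stand-in for vad.VADEventType so the sketch runs on its own."""

    START_OF_SPEECH = "start_of_speech"
    END_OF_SPEECH = "end_of_speech"


class GatedRecognition:
    """Hypothetical model of the VAD handler with the Option A gate applied."""

    def __init__(self, turn_detection_mode: str) -> None:
        self._turn_detection_mode = turn_detection_mode
        self._speaking = False
        self.vad_events_seen = 0  # VAD events are still processed

    def on_vad_event(self, ev_type: VADEventType) -> None:
        self.vad_events_seen += 1
        # The gate: skip the _speaking writes when STT drives turn detection.
        stt_driven = self._turn_detection_mode == "stt"
        if ev_type is VADEventType.START_OF_SPEECH and not stt_driven:
            self._speaking = True
        elif ev_type is VADEventType.END_OF_SPEECH and not stt_driven:
            self._speaking = False


rec = GatedRecognition(turn_detection_mode="stt")
rec.on_vad_event(VADEventType.START_OF_SPEECH)
print(rec._speaking, rec.vad_events_seen)
```

With the mode set to `"stt"`, the VAD event is still counted but `_speaking` stays `False`. With any other mode the handler behaves as it does today.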

Option B — explicit configuration. Add a user_state_source: Literal["vad", "stt", "auto"] field to TurnHandlingOptions (or AgentSession). "auto" keeps the current behavior. "stt" makes the VAD branch skip the _speaking writes while still running VAD inference for interruption detection. "vad" pins VAD as the user_state source, which is what happens today by default whenever a VAD is configured.
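As a rough illustration of Option B, the VAD-side decision could reduce to a single predicate. Everything here is hypothetical: `user_state_source` does not exist in the library, and `ToyTurnHandlingOptions` and `vad_writes_speaking` are illustrative names only. The proposal does not spell out how the STT branch should behave under each value, so the sketch leaves that side out.

```python
from dataclasses import dataclass
from typing import Literal

UserStateSource = Literal["vad", "stt", "auto"]


@dataclass
class ToyTurnHandlingOptions:
    """Hypothetical options object carrying the proposed field."""

    user_state_source: UserStateSource = "auto"  # "auto" keeps current behavior


def vad_writes_speaking(options: ToyTurnHandlingOptions) -> bool:
    """Return True if the VAD branch should keep writing _speaking.

    Only "stt" suppresses the writes; VAD inference itself would keep
    running for interruption detection in every case.
    """
    return options.user_state_source != "stt"


for source in ("auto", "vad", "stt"):
    opts = ToyTurnHandlingOptions(user_state_source=source)
    print(source, vad_writes_speaking(opts))
```

Keeping the decision in one predicate means the VAD handler needs a single extra check, independent of the turn_detection mode.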

Option A is the simplest fix and follows the existing code style. Option B is more explicit and lets users decouple the two signals regardless of the turn_detection mode chosen.
