Summary
When both VAD and a streaming STT are configured on AgentSession, VAD unconditionally drives the user_state machine, even when turn_detection="stt". The only way to make STT drive user_state is to pass no VAD at all, which forces users to give up VAD-based interruption sensing. These two concerns are conceptually independent and could be controlled separately.
Verified on livekit-agents 1.5.5.
Current behavior
In livekit/agents/voice/audio_recognition.py, the VAD and STT paths set self._speaking asymmetrically.
VAD path is unconditional:
async def _on_vad_event(self, ev):
    if ev.type == vad.VADEventType.START_OF_SPEECH:
        ...
        self._speaking = True
    elif ev.type == vad.VADEventType.END_OF_SPEECH:
        ...
        self._speaking = False
STT path is gated on turn_detection_mode:
elif ev.type == stt.SpeechEventType.START_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = True
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = False
Both paths funnel through the same on_start_of_speech / on_end_of_speech hook in agent_activity.py, which calls _session._update_user_state(...). There is no public configuration to suppress VAD's contribution to user_state while keeping VAD active for interruption detection.
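The asymmetry can be reproduced without the library using a small stand-in. The sketch below is a toy model, not the actual livekit-agents code: ToyRecognition, the string event names, the "listening" state label, and simulate_noise_burst are all hypothetical. Only the gating logic mirrors the two excerpts above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecognition:
    """Toy stand-in for the recognizer; not the real livekit-agents class."""

    turn_detection_mode: str
    speaking: bool = False
    # Records every user_state transition the shared hook would emit.
    states: list = field(default_factory=list)

    def _set_speaking(self, value: bool) -> None:
        # Stands in for on_start_of_speech / on_end_of_speech,
        # which end up updating user_state.
        if value != self.speaking:
            self.speaking = value
            self.states.append("speaking" if value else "listening")

    def on_vad_event(self, event: str) -> None:
        # VAD path: turn_detection_mode is never consulted.
        if event == "start_of_speech":
            self._set_speaking(True)
        elif event == "end_of_speech":
            self._set_speaking(False)

    def on_stt_event(self, event: str) -> None:
        # STT path: writes only when turn_detection_mode == "stt".
        if event == "start_of_speech" and self.turn_detection_mode == "stt":
            self._set_speaking(True)
        elif event == "end_of_speech" and self.turn_detection_mode == "stt":
            self._set_speaking(False)


def simulate_noise_burst(mode: str) -> list:
    """Feed a VAD-only burst (no STT events) and return the transitions."""
    rec = ToyRecognition(turn_detection_mode=mode)
    rec.on_vad_event("start_of_speech")
    rec.on_vad_event("end_of_speech")
    return rec.states
```

In this model, a noise burst that only VAD reacts to produces the same state transitions whether the mode is "stt" or "vad", which is the behavior this issue describes.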
Why this matters
In noisy environments, Silero VAD can flip user_state to speaking on background noise even when STT emits no transcript. This breaks downstream behaviors that key off user_state.
The available workaround is to pass vad=None. That stops the false user_state flips, but it also disables VAD-based interruption detection. Users then have to rely on STT interim transcripts as the interruption signal, which adds noticeable latency compared to VAD's min_duration shortcut.
Proposed solution
Either of these would solve it cleanly. The first is the smaller change.
Option A: symmetric gate on the VAD path. Apply a turn_detection_mode gate on the VAD branch so it stops writing _speaking when the user has explicitly opted into STT-driven turn detection. This is a two-line change in _on_vad_event, and it mirrors what the STT branch already does.
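A rough sketch of the Option A gate, again on a toy model rather than the real class. GatedRecognition and the on_interrupt_signal callback are hypothetical. How interruption detection is actually wired is not shown in the excerpts above, so the callback only stands in for "VAD keeps feeding interruption detection".

```python
from typing import Callable, Optional


class GatedRecognition:
    """Toy model of Option A; names are illustrative, not the library's API."""

    def __init__(
        self,
        turn_detection_mode: str,
        on_interrupt_signal: Optional[Callable[[], None]] = None,
    ) -> None:
        self.turn_detection_mode = turn_detection_mode
        self.speaking = False
        # Hypothetical hook standing in for VAD-based interruption detection.
        self._on_interrupt_signal = on_interrupt_signal

    def on_vad_event(self, event: str) -> None:
        if event == "start_of_speech" and self._on_interrupt_signal is not None:
            # VAD still reports activity for interruption in every mode.
            self._on_interrupt_signal()

        # Proposed gate: skip the speaking writes when STT owns turn detection.
        if self.turn_detection_mode == "stt":
            return

        if event == "start_of_speech":
            self.speaking = True
        elif event == "end_of_speech":
            self.speaking = False
```

With turn_detection_mode set to "stt", a VAD start event in this sketch still fires the interruption callback but leaves speaking untouched. Any other mode behaves as before.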
Option B: explicit configuration. Add a user_state_source: Literal["vad", "stt", "auto"] field to TurnHandlingOptions (or AgentSession). "auto" keeps current behavior. "stt" makes the VAD branch skip the _speaking writes while still running VAD inference for interruption detection. "vad" always lets VAD drive user_state, which is what happens today whenever a VAD is configured.
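One possible way to resolve the Option B setting, as a sketch only. resolve_user_state_source and its has_vad parameter are hypothetical, and reading "auto" as "VAD if configured, otherwise STT" is an assumption drawn from the current behavior described in this issue.

```python
from typing import Literal

UserStateSource = Literal["vad", "stt", "auto"]


def resolve_user_state_source(source: UserStateSource, has_vad: bool) -> str:
    """Return which signal ("vad" or "stt") should write user_state."""
    if source == "auto":
        # Assumed reading of current behavior: VAD wins whenever it is configured.
        return "vad" if has_vad else "stt"
    if source == "vad" and not has_vad:
        # An explicit "vad" request cannot be honored without a VAD.
        raise ValueError('user_state_source="vad" requires a VAD to be configured')
    return source
```

The VAD branch would then write _speaking only when the resolved source is "vad", while VAD inference keeps running for interruption detection in either case.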
Option A is the simplest fix and follows the existing code style. Option B is more explicit and lets users decouple the two signals regardless of the turn_detection mode chosen.