Summary
When turn_detection="stt" and a VAD is attached, every STT END_OF_SPEECH triggers AudioRecognition.update_vad(self._vad) to reset the VAD pipeline. This has two problems:
- The reset is unconditional even though the SDK's own comment justifies it only for the case where the user is still speaking when the EOT arrives.
- When the reset does happen, the new VAD task waits for the previous one to fully shut down before processing any audio. The result is a deaf window of several hundred milliseconds during which a user resuming speech is not detected.
In production with Soniox + Silero, this consistently causes the agent to start TTS just as the user resumes speaking, and to talk audibly over the user before VAD recovers and triggers an interruption.
Reproduction
livekit-agents==1.5.8, default Silero, Soniox turn_detection="stt", min_silence_duration=1.1s, min_speech_duration=0.1s, interruption.mode="vad".
https://cloud.livekit.io/projects/p_2tsga5unx60/sessions/RM_UdamB8QYwwVx/observability?mode=transcript
Turn around 01:08. Transcript right before the issue:
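agent: "Gracias. Te explico el motivo de la llamada... ¿Podrías hacer el pago hoy para resolver esta situación?"
  [English: "Thank you. Let me explain the reason for the call... Could you make the payment today to resolve this situation?"]
user: "Eh, quería sí, quería..." (hesitation, trails off)
  [English: "Uh, I wanted, yes, I wanted..."]
agent: "Sin problema, termina lo que ibas" (interrupted mid-sentence)
  [English: "No problem, finish what you were"]
user: "La, la cuota, la, la."
  [English: "The, the installment, the, the."]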
Logs at the moment Soniox commits the user's hesitation as end of turn:
01:08.870 User state speaking → listening (spoke for 2.2s) [STT EOT]
01:08.871 WARN stt end of speech received while user is speaking, resetting vad
01:08.893 Agent state listening → thinking
01:09.609 Agent state thinking → speaking [TTS starts]
01:09.707 ElevenLabs TTS request completed
01:10.221 User state listening → speaking [VAD finally fires]
If you scrub the recording, the user starts speaking again clearly around 01:09.50, roughly 300ms before the agent's TTS is audible. The SDK does not register User state listening → speaking until 01:10.22, which is about 700ms after the real onset. By then the agent has been talking for ~400ms over the user.
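The figures above can be re-derived from the log timestamps. The snippet below is only a sanity check: the speech onset (01:09.50) and the 300ms gap to audible TTS are by-ear estimates from the recording, not logged values, and the helper name ts is made up for this example.

```python
from datetime import timedelta


def ts(stamp: str) -> timedelta:
    """Parse an 'MM:SS.mmm' log timestamp into a timedelta."""
    minutes, seconds = stamp.split(":")
    return timedelta(minutes=int(minutes), seconds=float(seconds))


tts_state_change = ts("01:09.609")  # Agent state thinking -> speaking (logged)
vad_fires = ts("01:10.221")         # User state listening -> speaking (logged)
speech_onset = ts("01:09.500")      # ASSUMPTION: estimated by ear from the recording
audible_tts_lag_ms = 300            # ASSUMPTION: onset precedes audible TTS by ~300ms

# How late the SDK registered the user's speech.
detection_delay_ms = round((vad_fires - speech_onset).total_seconds() * 1000)
# Time between the agent's state change and the VAD event.
state_overlap_ms = round((vad_fires - tts_state_change).total_seconds() * 1000)
# Estimated time the agent was audibly talking over the user.
audible_overlap_ms = detection_delay_ms - audible_tts_lag_ms

print(detection_delay_ms, state_overlap_ms, audible_overlap_ms)
```

The state-change overlap is larger than the audible overlap because the thinking → speaking transition is logged before the first TTS audio is actually heard.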
Root cause
The relevant SDK code is in audio_recognition.py:
    elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
        ...
        if self._vad:
            if self._speaking:
                logger.warning("stt end of speech received while user is speaking, resetting vad")
            self.update_vad(self._vad)
        self._speaking = False
        ...

    def update_vad(self, vad):
        self._vad = vad
        if vad:
            self._vad_ch = aio.Chan[rtc.AudioFrame]()
            self._vad_atask = asyncio.create_task(
                self._vad_task(vad, self._vad_ch, self._vad_atask)
            )

    async def _vad_task(self, vad, audio_input, task):
        if task is not None:
            await aio.cancel_and_wait(task)  # blocks here
        stream = vad.stream()
        ...
There are two distinct problems in this code path.
Problem 1: the reset is unconditional
The comment above update_vad(self._vad) reads:
# reset VAD so that incorrect end of turn from STT can be corrected by VAD interruption
# if user is still speaking (an immediate VAD SOS will interrupt the agent)
The intent is clearly to reset only when _speaking=True at EOT time. But the actual code calls update_vad outside the if self._speaking block, so it runs on every EOT regardless. When _speaking=False, Silero has already emitted its own EOS, its internal state is pub_speaking=False, and the next event it produces will naturally be a SOS as soon as audio crosses the activation threshold. The reset in that case is pure overhead.
Problem 2: when the reset is necessary, it blocks
On every reset:
- _vad_ch is reassigned to a fresh empty channel immediately.
- push_audio keeps writing frames into the new channel.
- The new _vad_task is scheduled, but its first action is await aio.cancel_and_wait(task). It does not consume any frame from the new channel until the old task fully exits.
- The old task receives CancelledError and runs its finally, which awaits cancel_and_wait(forward_task) and await stream.aclose(). The ONNX shutdown alone takes around 100 to 300ms.
- Only after that does the new task call vad.stream(), start its forward loop, and begin draining the backlog of frames that accumulated during the swap.
Until the new task catches up and Silero has accumulated min_speech_duration of high probability windows on freshly processed audio, no SOS is emitted. End to end this is the ~600ms window of deafness observed in the call above.
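To make the blocking concrete, here is a toy asyncio model of the reset path. It is a sketch and not the SDK code: asyncio.Queue stands in for aio.Chan, the local cancel_and_wait is a simplified stand-in for aio.cancel_and_wait, and SLOW_CLOSE_S is an assumed 200ms cleanup cost chosen from the 100 to 300ms range mentioned above.

```python
import asyncio
import time
from typing import List, Optional

SLOW_CLOSE_S = 0.2  # ASSUMPTION: stand-in for the slow stream shutdown


async def cancel_and_wait(task: asyncio.Task) -> None:
    """Simplified stand-in for aio.cancel_and_wait: cancel, then wait for exit."""
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def vad_task(ch: asyncio.Queue, prev: Optional[asyncio.Task], seen: List[float]) -> None:
    if prev is not None:
        await cancel_and_wait(prev)  # the replacement task blocks here
    try:
        while True:
            await ch.get()
            seen.append(time.monotonic())  # a frame was finally processed
    finally:
        await asyncio.sleep(SLOW_CLOSE_S)  # simulated slow cleanup of the stream


async def measure_deaf_window() -> float:
    old = asyncio.create_task(vad_task(asyncio.Queue(), None, []))
    await asyncio.sleep(0.01)  # let the old task start waiting for audio

    # Reset: swap in a fresh channel and schedule the replacement task.
    reset_at = time.monotonic()
    new_ch: asyncio.Queue = asyncio.Queue()
    new_seen: List[float] = []
    new = asyncio.create_task(vad_task(new_ch, old, new_seen))
    new_ch.put_nowait(b"frame")  # audio keeps arriving right away

    while not new_seen:
        await asyncio.sleep(0.005)
    await cancel_and_wait(new)
    return new_seen[0] - reset_at


deaf_window_s = asyncio.run(measure_deaf_window())
print(f"first frame processed {deaf_window_s * 1000:.0f}ms after the reset")
```

In this model the frame pushed immediately after the reset is not processed until the old task's cleanup has finished, so the measured delay tracks SLOW_CLOSE_S rather than the time needed to start a new task.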
Proposed fix
Two changes, independent of each other, both in livekit-agents. They can be reviewed separately.
Change 1: only reset when needed
Move update_vad inside the if self._speaking branch so the reset only happens when Silero is actually still in a speaking state at EOT time. This is exactly what the comment was promising.
    if self._vad and self._speaking:
        logger.warning("stt end of speech received while user is speaking, resetting vad")
        self.update_vad(self._vad)
When _speaking=False, Silero has already EOS'd, its state machine is ready to fire SOS on the next high probability window without intervention, and the deaf window disappears entirely.
This is a one-line change. It only changes behavior in the case the comment never intended to cover.
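The scope of Change 1 can be spelled out as a small truth table. The helper should_reset_vad is hypothetical and exists only for this illustration; it mirrors the reset condition before and after the change.

```python
def should_reset_vad(has_vad: bool, speaking: bool, *, patched: bool) -> bool:
    """Hypothetical helper: does an STT END_OF_SPEECH reset the VAD?"""
    if patched:
        return has_vad and speaking  # Change 1: reset only if still speaking
    return has_vad  # current behavior: reset on every STT END_OF_SPEECH


cases = [(has_vad, speaking) for has_vad in (False, True) for speaking in (False, True)]
changed = [
    case
    for case in cases
    if should_reset_vad(*case, patched=False) != should_reset_vad(*case, patched=True)
]
print(changed)
```

The only combination whose behavior differs is a VAD being attached while _speaking is False, which is the case where the reset was described above as pure overhead.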
Change 2: when the reset runs, don't block on the previous task
Cancel the old task and let it clean up its own resources in the background. Use a generation counter so the old task stops dispatching events into _on_vad_event once it has been superseded.
    # audio_recognition.py
    self._vad_generation: int = 0  # init in __init__

    def update_vad(self, vad):
        self._vad = vad
        if vad:
            self._vad_generation += 1
            gen = self._vad_generation
            self._vad_ch = aio.Chan[rtc.AudioFrame]()
            self._vad_atask = asyncio.create_task(
                self._vad_task(vad, self._vad_ch, self._vad_atask, gen)
            )

    async def _vad_task(self, vad, audio_input, prev_task, gen):
        if prev_task is not None:
            prev_task.cancel()  # fire and forget, its finally cleans up
        stream = vad.stream()

        async def _forward():
            async for frame in audio_input:
                stream.push_frame(frame)

        forward_task = asyncio.create_task(_forward())
        try:
            async for ev in stream:
                if gen != self._vad_generation:
                    break  # superseded by a newer update_vad call
                await self._on_vad_event(ev)
        finally:
            await aio.cancel_and_wait(forward_task)
            await stream.aclose()
The new task starts processing audio immediately. Expected deaf window in this reproduction drops from ~600ms to under 200ms, with the residual being natural Silero onset latency on quiet ramps, not a structural defect of the SDK.
The generation token closes the race the original await was implicitly preventing: an in-flight old task could otherwise dispatch stale SOS or EOS events into _on_vad_event after the new task has taken over. Without it, removing the await would be unsafe.
The old task's finally still runs (cancels its forward_task, closes its stream), just off the critical path. No public API change. No plugin change.
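A toy model of the generation counter, under the same assumptions as the rest of this issue, might look like the following. The Recognizer class is hypothetical and only imitates the shape of update_vad and _vad_task; asyncio.Queue stands in for aio.Chan, frames double as events, and SLOW_CLOSE_S is an assumed cleanup cost.

```python
import asyncio
import time
from typing import Optional

SLOW_CLOSE_S = 0.2  # ASSUMPTION: stand-in for a slow stream.aclose()


class Recognizer:
    """Toy model of the proposed update_vad; not the real AudioRecognition."""

    def __init__(self) -> None:
        self.generation = 0
        self.task: Optional[asyncio.Task] = None
        self.ch: asyncio.Queue = asyncio.Queue()
        self.events: list = []  # (generation, frame, timestamp)

    def update_vad(self) -> None:
        self.generation += 1
        self.ch = asyncio.Queue()
        self.task = asyncio.create_task(self._vad_task(self.ch, self.task, self.generation))

    async def _vad_task(self, ch: asyncio.Queue, prev: Optional[asyncio.Task], gen: int) -> None:
        if prev is not None:
            prev.cancel()  # fire and forget, no await on the old task
        try:
            while True:
                frame = await ch.get()
                if gen != self.generation:
                    break  # superseded: drop the stale event
                self.events.append((gen, frame, time.monotonic()))
        finally:
            await asyncio.sleep(SLOW_CLOSE_S)  # simulated slow cleanup


async def demo():
    rec = Recognizer()
    rec.update_vad()  # generation 1
    old_ch = rec.ch
    await asyncio.sleep(0.01)  # let the first task start waiting

    old_ch.put_nowait("stale")  # queued for the old task just before the reset
    reset_at = time.monotonic()
    rec.update_vad()  # generation 2
    rec.ch.put_nowait("fresh")
    await asyncio.sleep(0.05)

    rec.task.cancel()
    try:
        await rec.task
    except asyncio.CancelledError:
        pass
    return rec.events, rec.events[0][2] - reset_at


events, latency_s = asyncio.run(demo())
print([(gen, frame) for gen, frame, _ in events], f"{latency_s * 1000:.1f}ms")
```

In this sketch the frame queued for the superseded task never reaches the event list, and the fresh frame is handled without waiting for the old task's cleanup.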
Proposal 2 (Preferred one)
Remove the VAD-related if block from the END_OF_SPEECH handler altogether, so that an STT END_OF_SPEECH never resets the VAD. The code in question:
    elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
        ...
        if self._vad:
            if self._speaking:
                logger.warning("stt end of speech received while user is speaking, resetting vad")
            self.update_vad(self._vad)
        self._speaking = False
        ...
We chose self._turn_detection_mode == "stt" precisely because we want STT, not VAD, to be the only signal that affects the EOT decision.
Happy to open the PR with both changes and the tests.