Bug: update_vad() on STT EOT introduces ~600ms deaf window because the new VAD task waits for the old one to shut down #5674

@miguelmoralai

Description

Summary

When turn_detection="stt" and a VAD is attached, every STT END_OF_SPEECH triggers AudioRecognition.update_vad(self._vad) to reset the VAD pipeline. This has two problems:

  1. The reset is unconditional even though the SDK's own comment justifies it only for the case where the user is still speaking when the EOT arrives.
  2. When the reset does happen, the new VAD task waits for the previous one to fully shut down before processing any audio. The result is a deaf window of several hundred milliseconds during which a user resuming speech is not detected.

In production with Soniox + Silero, this consistently causes the agent to start TTS after the user has already resumed speaking, and to speak audibly over the user before VAD recovers and triggers an interruption.

Reproduction

livekit-agents==1.5.8, default Silero, Soniox turn_detection="stt", min_silence_duration=1.1s, min_speech_duration=0.1s, interruption.mode="vad".

https://cloud.livekit.io/projects/p_2tsga5unx60/sessions/RM_UdamB8QYwwVx/observability?mode=transcript

See the turn around 01:08. Transcript right before the issue (translated from Spanish):

agent: "Thank you. Let me explain the reason for the call... Could you make the payment
        today to resolve this situation?"
user:  "Uh, I wanted, yes, I wanted..."     (hesitation, trails off)
agent: "No problem, finish what you were"  (interrupted mid-sentence)
user:  "The, the installment, the, the."

Logs at the moment Soniox commits the user's hesitation as end of turn:

01:08.870  User state speaking → listening (spoke for 2.2s)   [STT EOT]
01:08.871  WARN  stt end of speech received while user is speaking, resetting vad
01:08.893  Agent state listening → thinking
01:09.609  Agent state thinking → speaking                    [TTS starts]
01:09.707  ElevenLabs TTS request completed
01:10.221  User state listening → speaking                    [VAD finally fires]

If you scrub the recording, the user starts speaking again clearly around 01:09.50, roughly 300ms before the agent's TTS is audible. The SDK does not register User state listening → speaking until 01:10.22, which is about 700ms after the real onset. By then the agent has been talking for ~400ms over the user.
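
As a quick sanity check on these figures, the intervals can be recomputed from the log timestamps. This is only a minimal sketch: the helper name to_ms is made up for illustration, and the 01:09.500 onset is the by-ear estimate from the recording, not a logged value.

```python
def to_ms(stamp: str) -> int:
    """Convert an 'MM:SS.mmm' log timestamp into integer milliseconds."""
    minutes, rest = stamp.split(":")
    seconds, millis = rest.split(".")
    return (int(minutes) * 60 + int(seconds)) * 1000 + int(millis)


# Timestamps copied from the log excerpt above.
stt_eot = to_ms("01:08.870")     # User state speaking -> listening
tts_start = to_ms("01:09.609")   # Agent state thinking -> speaking
vad_fires = to_ms("01:10.221")   # User state listening -> speaking

# Assumption: estimated by scrubbing the recording, not present in the logs.
user_onset = to_ms("01:09.500")

onset_to_vad_ms = vad_fires - user_onset
eot_to_vad_ms = vad_fires - stt_eot
tts_to_vad_ms = vad_fires - tts_start

print(f"real onset -> VAD SOS:      {onset_to_vad_ms} ms")  # 721 ms
print(f"STT EOT -> VAD SOS:         {eot_to_vad_ms} ms")    # 1351 ms
print(f"TTS state change -> VAD SOS: {tts_to_vad_ms} ms")   # 612 ms
```

The 721 ms result is the "about 700ms" quoted above. The 612 ms figure is measured from the agent's state change, not from the moment TTS audio becomes audible, so it is an upper bound on the time the agent talks over the user.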

Root cause

The relevant SDK code is in audio_recognition.py:

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    if self._vad:
        if self._speaking:
            logger.warning("stt end of speech received while user is speaking, resetting vad")
        self.update_vad(self._vad)
    self._speaking = False
    ...

def update_vad(self, vad):
    self._vad = vad
    if vad:
        self._vad_ch = aio.Chan[rtc.AudioFrame]()
        self._vad_atask = asyncio.create_task(
            self._vad_task(vad, self._vad_ch, self._vad_atask)
        )

async def _vad_task(self, vad, audio_input, task):
    if task is not None:
        await aio.cancel_and_wait(task)        # blocks here
    stream = vad.stream()
    ...

There are two distinct problems in this code path.

Problem 1: the reset is unconditional

The comment above update_vad(self._vad) reads:

# reset VAD so that incorrect end of turn from STT can be corrected by VAD interruption
# if user is still speaking (an immediate VAD SOS will interrupt the agent)

The intent is clearly to reset only when _speaking=True at EOT time. But the actual code calls update_vad outside the if self._speaking block, so it runs on every EOT regardless. When _speaking=False, Silero has already emitted its own EOS, its internal state is pub_speaking=False, and the next event it produces will naturally be an SOS as soon as audio crosses the activation threshold. The reset in that case buys nothing and only adds the cost described in Problem 2.

Problem 2: when the reset is necessary, it blocks

On every reset:

  1. _vad_ch is reassigned to a fresh empty channel immediately.
  2. push_audio keeps writing frames into the new channel.
  3. The new _vad_task is scheduled, but its first action is await aio.cancel_and_wait(task). It does not consume any frame from the new channel until the old task fully exits.
  4. The old task receives CancelledError and runs its finally block, which awaits cancel_and_wait(forward_task) and then stream.aclose(). The ONNX shutdown alone takes around 100 to 300ms.
  5. Only after that does the new task call vad.stream(), start its forward loop, and begin draining the backlog of frames that accumulated during the swap.

Until the new task catches up and Silero has accumulated min_speech_duration of high probability windows on freshly processed audio, no SOS is emitted. End to end this is the ~600ms window of deafness observed in the call above.
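
The cost of the blocking handoff in steps 3 to 5 can be reproduced without the SDK. The sketch below uses only asyncio. The names old_task, new_task and measure are hypothetical stand-ins, CLEANUP_S is an assumed teardown cost picked from the 100 to 300ms range quoted above, and asyncio.gather(..., return_exceptions=True) stands in for aio.cancel_and_wait.

```python
import asyncio
import time

CLEANUP_S = 0.2  # assumed teardown cost of the old task (stand-in for stream.aclose())


async def old_task() -> None:
    try:
        await asyncio.sleep(3600)  # pretend to be processing audio
    finally:
        await asyncio.sleep(CLEANUP_S)  # slow teardown that runs on cancellation


async def new_task(prev: asyncio.Task, frames: asyncio.Queue, wait_for_prev: bool) -> float:
    """Return the seconds elapsed before the first frame is consumed."""
    started = time.monotonic()
    prev.cancel()
    if wait_for_prev:
        # Blocking handoff: wait until the old task has fully exited.
        await asyncio.gather(prev, return_exceptions=True)
    await frames.get()  # first frame actually consumed by the new task
    return time.monotonic() - started


async def measure(wait_for_prev: bool) -> float:
    frames: asyncio.Queue = asyncio.Queue()
    prev = asyncio.create_task(old_task())
    await asyncio.sleep(0)        # let the old task start running
    frames.put_nowait(b"frame")   # audio is already queued in the new channel
    delay = await new_task(prev, frames, wait_for_prev)
    await asyncio.gather(prev, return_exceptions=True)  # let the old task finish
    return delay


blocking_s = asyncio.run(measure(wait_for_prev=True))
nonblocking_s = asyncio.run(measure(wait_for_prev=False))
print(f"blocking handoff:     first frame after {blocking_s * 1000:.0f} ms")
print(f"non-blocking handoff: first frame after {nonblocking_s * 1000:.0f} ms")
```

In the blocking variant the first frame is consumed only after the old task's teardown completes, even though the frame was queued before the new task started. This toy does not model backlog draining or min_speech_duration, which add further delay in the real pipeline.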

Proposed fix

Two changes, independent of each other, both in livekit-agents. They can be reviewed separately.

Change 1: only reset when needed

Move update_vad inside the if self._speaking branch so the reset only happens when Silero is actually still in a speaking state at EOT time. This is exactly what the comment was promising.

if self._vad and self._speaking:
    logger.warning("stt end of speech received while user is speaking, resetting vad")
    self.update_vad(self._vad)

When _speaking=False, Silero has already EOS'd, its state machine is ready to fire SOS on the next high probability window without intervention, and the deaf window disappears entirely.
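
To illustrate why no reset is needed once the VAD has already emitted EOS, here is a toy hysteresis state machine. It is not Silero's implementation: ToyVad, its thresholds and its window counts are assumptions chosen only to show the shape of the argument.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyVad:
    """Hypothetical two-state VAD with hysteresis (illustration only)."""

    threshold: float = 0.5
    min_speech_windows: int = 3   # windows of speech needed to emit SOS
    min_silence_windows: int = 3  # windows of silence needed to emit EOS
    speaking: bool = False
    _run: int = 0  # consecutive windows that contradict the current state

    def push(self, prob: float) -> Optional[str]:
        is_speech = prob >= self.threshold
        if is_speech != self.speaking:
            self._run += 1
        else:
            self._run = 0
        needed = self.min_silence_windows if self.speaking else self.min_speech_windows
        if self._run >= needed:
            self.speaking = not self.speaking
            self._run = 0
            return "SOS" if self.speaking else "EOS"
        return None


vad = ToyVad()
# Speech, then a pause long enough for EOS, then the user resumes speaking.
probs = [0.9] * 4 + [0.1] * 4 + [0.9] * 4

events: List[str] = []
for p in probs:
    ev = vad.push(p)
    if ev is not None:
        events.append(ev)

print(events)  # ['SOS', 'EOS', 'SOS']
```

The second SOS is produced by the same instance with no reset in between, which is the situation Change 1 relies on when _speaking=False at EOT time.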

This is a one-line change. It only changes behavior in the case the comment never intended to cover.

Change 2: when the reset runs, don't block on the previous task

Cancel the old task and let it clean up its own resources in the background. Use a generation counter so the old task stops dispatching events into _on_vad_event once it has been superseded.

# audio_recognition.py

self._vad_generation: int = 0   # init in __init__

def update_vad(self, vad):
    self._vad = vad
    if vad:
        self._vad_generation += 1
        gen = self._vad_generation
        self._vad_ch = aio.Chan[rtc.AudioFrame]()
        self._vad_atask = asyncio.create_task(
            self._vad_task(vad, self._vad_ch, self._vad_atask, gen)
        )

async def _vad_task(self, vad, audio_input, prev_task, gen):
    if prev_task is not None:
        prev_task.cancel()      # fire and forget, its finally cleans up

    stream = vad.stream()

    async def _forward():
        async for frame in audio_input:
            stream.push_frame(frame)
    forward_task = asyncio.create_task(_forward())

    try:
        async for ev in stream:
            if gen != self._vad_generation:
                break           # superseded by a newer update_vad call
            await self._on_vad_event(ev)
    finally:
        await aio.cancel_and_wait(forward_task)
        await stream.aclose()

The new task starts processing audio immediately. Expected deaf window in this reproduction drops from ~600ms to under 200ms, with the residual being natural Silero onset latency on quiet ramps, not a structural defect of the SDK.

The generation token closes the race the original await was implicitly preventing: an in-flight old task could otherwise dispatch stale SOS or EOS events into _on_vad_event after the new task has taken over. Without it, removing the await would be unsafe.

The old task's finally still runs (cancels its forward_task, closes its stream), just off the critical path. No public API change. No plugin change.
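
The generation token can be exercised in isolation. The following is a minimal sketch, assuming a hypothetical Dispatcher class in place of AudioRecognition. The old task is deliberately not cancelled here, to show that the token alone is enough to drop an event that was already in flight when the newer task took over.

```python
import asyncio
from typing import List


class Dispatcher:
    """Hypothetical stand-in for the object that owns the VAD task."""

    def __init__(self) -> None:
        self.generation = 0
        self.delivered: List[str] = []

    def start(self, label: str, delay_s: float) -> asyncio.Task:
        # Each new task captures the generation that was current when it started.
        self.generation += 1
        return asyncio.create_task(self._run(label, delay_s, self.generation))

    async def _run(self, label: str, delay_s: float, gen: int) -> None:
        await asyncio.sleep(delay_s)  # the event is still in flight
        if gen != self.generation:
            return  # superseded by a newer start() call: drop the stale event
        self.delivered.append(label)


async def demo() -> List[str]:
    d = Dispatcher()
    old = d.start("old-SOS", 0.05)
    new = d.start("new-SOS", 0.01)  # supersedes the old task immediately
    await asyncio.gather(old, new)
    return d.delivered


delivered = asyncio.run(demo())
print(delivered)  # ['new-SOS']
```

Only the event from the current generation is delivered. In the proposed change the old task is also cancelled, so the check mainly covers the interval between cancel() being requested and the old task actually stopping.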

Proposal 2 (preferred)

Remove the VAD-related if block from the STT END_OF_SPEECH handler altogether, so that no VAD reset happens on EOT:

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    if self._vad:
        if self._speaking:
            logger.warning("stt end of speech received while user is speaking, resetting vad")
        self.update_vad(self._vad)
    self._speaking = False
    ...

If we chose self._turn_detection_mode == "stt", it is because we want STT, and not VAD, to be the only thing driving the EOT decision.

Happy to open the PR with both changes and the tests.
