Bug: update_vad() on STT EOT introduces ~600ms deaf window because the new VAD task waits for the old one to shut down

## Summary

When `turn_detection="stt"` and a VAD is attached, every STT `END_OF_SPEECH` triggers `AudioRecognition.update_vad(self._vad)` to reset the VAD pipeline. This has two problems:

1. The reset is unconditional even though the SDK's own comment justifies it only for the case where the user is still speaking when the EOT arrives.
2. When the reset does happen, the new VAD task waits for the previous one to fully shut down before processing any audio. The result is a deaf window of several hundred milliseconds during which a user resuming speech is not detected.

In production with Soniox + Silero, this consistently causes the agent to start TTS just as the user resumes speaking, and to talk audibly over the user before VAD recovers and triggers an interruption.

## Reproduction

Setup: `livekit-agents==1.5.8`, default Silero VAD, Soniox STT with `turn_detection="stt"`, `min_silence_duration=1.1s`, `min_speech_duration=0.1s`, `interruption.mode="vad"`.

https://cloud.livekit.io/projects/p_2tsga5unx60/sessions/RM_UdamB8QYwwVx/observability?mode=transcript

Turn around 01:08. Transcript right before the issue:

```
agent: "Gracias. Te explico el motivo de la llamada... ¿Podrías hacer el pago hoy
        para resolver esta situación?"
user:  "Eh, quería sí, quería—"        (hesitation, trails off)
agent: "Sin problema, termina lo que ibas"  (interrupted mid-sentence)
user:  "La, la cuota, la, la."
```

Logs at the moment Soniox commits the user's hesitation as end of turn:

```
01:08.870  User state speaking → listening (spoke for 2.2s)   [STT EOT]
01:08.871  WARN  stt end of speech received while user is speaking, resetting vad
01:08.893  Agent state listening → thinking
01:09.609  Agent state thinking → speaking                    [TTS starts]
01:09.707  ElevenLabs TTS request completed
01:10.221  User state listening → speaking                    [VAD finally fires]
```

If you scrub the recording, the user clearly starts speaking again at around `01:09.50`, roughly 300ms before the agent's TTS becomes audible. The SDK does not register `User state listening → speaking` until `01:10.22`, about 700ms after the real onset. By then the agent has been talking over the user for ~400ms.
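For reference, the intervals can be recomputed from the log timestamps. This is only a small arithmetic sketch: `to_seconds` is a helper written for this issue, and the `01:09.50` onset is my estimate from listening to the recording, not a logged value. The third figure is measured from the `thinking → speaking` state change rather than from audible TTS, so it is an upper bound on the overlap.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'MM:SS.mmm' log timestamp into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


stt_eot = to_seconds("01:08.870")      # STT commits end of turn
tts_state = to_seconds("01:09.609")    # agent state thinking -> speaking
user_onset = to_seconds("01:09.50")    # estimated by ear, not from the logs
vad_sos = to_seconds("01:10.221")      # VAD finally reports start of speech

# round() absorbs floating point noise in the subtraction
detection_delay_ms = round((vad_sos - user_onset) * 1000)
eot_to_tts_ms = round((tts_state - stt_eot) * 1000)
tts_state_to_sos_ms = round((vad_sos - tts_state) * 1000)

print(detection_delay_ms, eot_to_tts_ms, tts_state_to_sos_ms)  # 721 739 612
```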

<img width="1163" height="285" alt="Image" src="https://github.com/user-attachments/assets/9cd6155c-138d-4365-b890-008247230aa7" />

## Root cause

The relevant SDK code is in [`audio_recognition.py`](https://github.com/livekit/agents/blob/main/livekit-agents/livekit/agents/voice/audio_recognition.py):

```python
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    if self._vad:
        if self._speaking:
            logger.warning("stt end of speech received while user is speaking, resetting vad")
        self.update_vad(self._vad)
    self._speaking = False
    ...
```

```python
def update_vad(self, vad):
    self._vad = vad
    if vad:
        self._vad_ch = aio.Chan[rtc.AudioFrame]()
        self._vad_atask = asyncio.create_task(
            self._vad_task(vad, self._vad_ch, self._vad_atask)
        )
```

```python
async def _vad_task(self, vad, audio_input, task):
    if task is not None:
        await aio.cancel_and_wait(task)        # blocks here
    stream = vad.stream()
    ...
```

There are two distinct problems in this code path.

### Problem 1: the reset is unconditional

The comment above `update_vad(self._vad)` reads:

```
# reset VAD so that incorrect end of turn from STT can be corrected by VAD interruption
# if user is still speaking (an immediate VAD SOS will interrupt the agent)
```

The intent is clearly to reset only when `_speaking=True` at EOT time. But the actual code calls `update_vad` outside the `if self._speaking` block, so it runs on every EOT regardless. When `_speaking=False`, Silero has already emitted its own EOS, its internal state is `pub_speaking=False`, and the next event it produces will naturally be a SOS as soon as audio crosses the activation threshold. The reset in that case is pure overhead.

### Problem 2: when the reset is necessary, it blocks

On every reset:

1. `_vad_ch` is reassigned to a fresh empty channel immediately.
2. `push_audio` keeps writing frames into the new channel.
3. The new `_vad_task` is scheduled, but its first action is `await aio.cancel_and_wait(task)`. It does not consume any frame from the new channel until the old task fully exits.
4. The old task receives `CancelledError` and runs its `finally` block, which awaits `aio.cancel_and_wait(forward_task)` and then `stream.aclose()`. The ONNX shutdown alone takes around 100 to 300ms.
5. Only after that does the new task call `vad.stream()`, start its forward loop, and begin draining the backlog of frames that accumulated during the swap.

Until the new task catches up and Silero has accumulated `min_speech_duration` of high probability windows on freshly processed audio, no SOS is emitted. End to end this is the ~600ms window of deafness observed in the call above.
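The hand-off described above can be reproduced without the SDK. The sketch below uses plain `asyncio` only; `old_vad_task`, `new_vad_task`, `measure` and the 200ms `CLOSE_DELAY` are stand-ins I made up for illustration, not SDK code or measured values. It compares how long the first queued frame waits when the new task blocks on the old one versus when it only cancels it.

```python
import asyncio
import time

CLOSE_DELAY = 0.2  # assumed stand-in for the old stream's shutdown cost


async def old_vad_task() -> None:
    try:
        await asyncio.sleep(3600)         # pretend to process audio
    finally:
        await asyncio.sleep(CLOSE_DELAY)  # simulated slow stream.aclose()


async def new_vad_task(prev: asyncio.Task, frames: asyncio.Queue, wait_for_prev: bool) -> float:
    prev.cancel()
    if wait_for_prev:
        # mirrors the blocking hand-off: nothing is consumed until prev exits
        await asyncio.gather(prev, return_exceptions=True)
    await frames.get()                    # first frame from the new channel
    return time.monotonic()


async def measure(wait_for_prev: bool) -> float:
    prev = asyncio.create_task(old_vad_task())
    await asyncio.sleep(0)                # let the old task start running
    frames: asyncio.Queue = asyncio.Queue()
    frames.put_nowait(b"frame")           # audio is already queued at swap time
    start = time.monotonic()
    first_frame_at = await new_vad_task(prev, frames, wait_for_prev)
    await asyncio.gather(prev, return_exceptions=True)  # let cleanup finish
    return first_frame_at - start


blocked = asyncio.run(measure(wait_for_prev=True))
direct = asyncio.run(measure(wait_for_prev=False))
print(f"blocking hand-off: {blocked * 1000:.0f}ms, non-blocking: {direct * 1000:.0f}ms")
```

In the blocking variant the first frame is consumed only after the simulated shutdown finishes. In the non-blocking variant it is consumed right away, while the old task still cleans up in the background.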

## Proposed fix

Two changes, independent of each other, both in `livekit-agents`. They can be reviewed separately.

### Change 1: only reset when needed

Move `update_vad` inside the `if self._speaking` branch so the reset only happens when Silero is actually still in a speaking state at EOT time. This is exactly what the comment was promising.

```python
if self._vad and self._speaking:
    logger.warning("stt end of speech received while user is speaking, resetting vad")
    self.update_vad(self._vad)
```

When `_speaking=False`, Silero has already EOS'd, its state machine is ready to fire SOS on the next high probability window without intervention, and the deaf window disappears entirely.

This is a one-line change. It only changes behavior in the case the comment never intended to cover.
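A unit test for this change could look roughly like the sketch below. `ToyRecognition` is a hypothetical model of the handler written for this issue, not the SDK's `AudioRecognition`; it only counts how often the reset would run under the current and the proposed condition.

```python
class ToyRecognition:
    """Hypothetical model of the STT END_OF_SPEECH handler; counts VAD resets."""

    def __init__(self, has_vad: bool = True) -> None:
        self.has_vad = has_vad
        self.speaking = False
        self.resets = 0

    def update_vad(self) -> None:
        self.resets += 1

    def on_stt_end_of_speech(self, conditional: bool) -> None:
        # conditional=False models today's behavior (reset on every EOT),
        # conditional=True models Change 1 (reset only while still speaking)
        if self.has_vad and (self.speaking or not conditional):
            self.update_vad()
        self.speaking = False


def count_resets(conditional: bool) -> int:
    rec = ToyRecognition()
    rec.speaking = False                  # VAD already emitted EOS before EOT
    rec.on_stt_end_of_speech(conditional)
    rec.speaking = True                   # user still speaking at EOT
    rec.on_stt_end_of_speech(conditional)
    return rec.resets


print(count_resets(conditional=False))  # 2
print(count_resets(conditional=True))   # 1
```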

### Change 2: when the reset runs, don't block on the previous task

Cancel the old task and let it clean up its own resources in the background. Use a generation counter so the old task stops dispatching events into `_on_vad_event` once it has been superseded.

```python
# audio_recognition.py

self._vad_generation: int = 0   # init in __init__

def update_vad(self, vad):
    self._vad = vad
    if vad:
        self._vad_generation += 1
        gen = self._vad_generation
        self._vad_ch = aio.Chan[rtc.AudioFrame]()
        self._vad_atask = asyncio.create_task(
            self._vad_task(vad, self._vad_ch, self._vad_atask, gen)
        )

async def _vad_task(self, vad, audio_input, prev_task, gen):
    if prev_task is not None:
        prev_task.cancel()      # fire and forget, its finally cleans up

    stream = vad.stream()

    async def _forward():
        async for frame in audio_input:
            stream.push_frame(frame)
    forward_task = asyncio.create_task(_forward())

    try:
        async for ev in stream:
            if gen != self._vad_generation:
                break           # superseded by a newer update_vad call
            await self._on_vad_event(ev)
    finally:
        await aio.cancel_and_wait(forward_task)
        await stream.aclose()
```

The new task starts processing audio immediately. Expected deaf window in this reproduction drops from ~600ms to under 200ms, with the residual being natural Silero onset latency on quiet ramps, not a structural defect of the SDK.

The generation token closes the race the original `await` was implicitly preventing: an in-flight old task could otherwise dispatch stale SOS or EOS events into `_on_vad_event` after the new task has taken over. Without it, removing the await would be unsafe.

The old task's `finally` still runs (cancels its `forward_task`, closes its stream), just off the critical path. No public API change. No plugin change.
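To show the stale-event race in isolation, here is a self-contained sketch of the generation check. `Recognizer`, `reset` and `run` are toy names made up for this issue, and `asyncio.Queue` stands in for the VAD stream; only the gating logic is meant to match the proposal.

```python
import asyncio


class Recognizer:
    """Toy stand-in for the recognizer; only the generation check is modeled."""

    def __init__(self) -> None:
        self.generation = 0
        self.delivered: list[str] = []

    def reset(self) -> int:
        self.generation += 1
        return self.generation

    async def run(self, gen: int, events: asyncio.Queue) -> None:
        while True:
            ev = await events.get()
            if ev is None:                # sentinel: stream closed
                return
            if gen != self.generation:    # superseded by a newer reset
                return
            self.delivered.append(ev)


async def demo() -> list[str]:
    rec = Recognizer()
    old_events: asyncio.Queue = asyncio.Queue()
    new_events: asyncio.Queue = asyncio.Queue()

    old = asyncio.create_task(rec.run(rec.reset(), old_events))
    old_events.put_nowait("old:SOS")
    await asyncio.sleep(0)                # old task delivers while it is current

    new = asyncio.create_task(rec.run(rec.reset(), new_events))
    old_events.put_nowait("old:EOS")      # stale event from the superseded task
    new_events.put_nowait("new:SOS")
    new_events.put_nowait(None)
    await asyncio.gather(old, new)
    return rec.delivered


delivered = asyncio.run(demo())
print(delivered)  # ['old:SOS', 'new:SOS']
```

The stale `old:EOS` never reaches the handler, even though the old task is still alive when it arrives.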

## Proposal 2 (Preferred one)

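Remove the `if self._vad:` block from the STT `END_OF_SPEECH` handler entirely, so that no VAD reset happens on EOT. This is the block that would go away:
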
```python
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    if self._vad:
        if self._speaking:
            logger.warning("stt end of speech received while user is speaking, resetting vad")
        self.update_vad(self._vad)
    self._speaking = False
    ...
```

If we chose `self._turn_detection_mode == "stt"`, it is because we want STT, not VAD, to be the only signal that affects the EOT decision.

Happy to open the PR with both changes and the tests.

