Allow decoupling `user_state` source from VAD when STT emits speech events #5580

### Summary

When both VAD and a streaming STT are configured on `AgentSession`, VAD unconditionally drives the `user_state` machine, even when `turn_detection="stt"`. The only way to make STT drive `user_state` is to pass no VAD at all, which forces users to give up VAD-based interruption sensing. These two concerns are conceptually independent and could be controlled separately.

Verified on `livekit-agents` 1.5.5.

### Current behavior

In `livekit/agents/voice/audio_recognition.py`, the VAD and STT paths set `self._speaking` asymmetrically.

The VAD path is unconditional:

```python
async def _on_vad_event(self, ev):
    if ev.type == vad.VADEventType.START_OF_SPEECH:
        ...
        self._speaking = True
    elif ev.type == vad.VADEventType.END_OF_SPEECH:
        ...
        self._speaking = False
```

The STT path is gated on `turn_detection_mode`:

```python
elif ev.type == stt.SpeechEventType.START_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = True
elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._speaking = False
```

Both paths funnel through the same `on_start_of_speech` / `on_end_of_speech` hooks in `agent_activity.py`, which call `_session._update_user_state(...)`. There is no public configuration to suppress VAD's contribution to `user_state` while keeping VAD active for interruption detection.
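The asymmetry can be reproduced in isolation with a toy model. The sketch below is not the `livekit-agents` code: `ToyRecognition`, its string event names, and the `"listening"` state label are hypothetical stand-ins assumed for illustration only. It shows a VAD-only noise burst changing the state even though `turn_detection_mode` is `"stt"`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecognition:
    """Hypothetical stand-in for the recognition logic, not the library class."""

    turn_detection_mode: str
    speaking: bool = False
    # History of state updates, standing in for calls to the user_state hook.
    user_states: list = field(default_factory=list)

    def _set_speaking(self, value: bool) -> None:
        self.speaking = value
        self.user_states.append("speaking" if value else "listening")

    def on_vad_event(self, event_type: str) -> None:
        # VAD path: writes the flag without consulting turn_detection_mode.
        if event_type == "start_of_speech":
            self._set_speaking(True)
        elif event_type == "end_of_speech":
            self._set_speaking(False)

    def on_stt_event(self, event_type: str) -> None:
        # STT path: only writes the flag when turn_detection_mode is "stt".
        if self.turn_detection_mode != "stt":
            return
        if event_type == "start_of_speech":
            self._set_speaking(True)
        elif event_type == "end_of_speech":
            self._set_speaking(False)


def simulate_noise_burst(mode: str) -> list:
    """Feed a VAD start/end pair with no STT events, as background noise might."""
    rec = ToyRecognition(turn_detection_mode=mode)
    rec.on_vad_event("start_of_speech")
    rec.on_vad_event("end_of_speech")
    return rec.user_states
```

In this model, `simulate_noise_burst("stt")` still records a flip to `"speaking"` and back, which is the behavior described above.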

### Why this matters

In noisy environments, Silero VAD can flip `user_state` to `speaking` on background noise even when STT emits no transcript. This breaks downstream behaviors that key off `user_state`.

The available workaround is to pass `vad=None`. That stops the false `user_state` flips, but it also disables VAD-based interruption detection. Users then have to rely on STT interim transcripts as the interruption signal, which adds noticeable latency compared to VAD's `min_duration` shortcut.

### Proposed solution

Either of these would solve it cleanly. The first is the smaller change.

**Option A: symmetric gate on the VAD path.** Apply a `turn_detection_mode` gate on the VAD branch so it stops writing `_speaking` when the user has explicitly opted into STT-driven turn detection. This is a two-line change in `_on_vad_event`, and it mirrors what the STT branch already does.
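A minimal sketch of what Option A could look like, written against a toy class rather than the real `audio_recognition.py`. `GatedRecognition` and the `interruption_signals` counter are hypothetical; the counter only stands in for whatever interruption logic consumes VAD events, which this sketch assumes is left untouched.

```python
class GatedRecognition:
    """Hypothetical model of Option A, not the livekit-agents implementation."""

    def __init__(self, turn_detection_mode: str) -> None:
        self.turn_detection_mode = turn_detection_mode
        self.speaking = False
        # Stand-in for VAD-based interruption handling (assumed unchanged).
        self.interruption_signals = 0

    def on_vad_event(self, event_type: str) -> None:
        # The gate: VAD only writes the flag when STT does not own turn detection.
        vad_owns_state = self.turn_detection_mode != "stt"
        if event_type == "start_of_speech":
            self.interruption_signals += 1  # VAD still feeds interruption logic
            if vad_owns_state:
                self.speaking = True
        elif event_type == "end_of_speech":
            if vad_owns_state:
                self.speaking = False

    def on_stt_event(self, event_type: str) -> None:
        # Same gate the STT branch already applies today.
        if self.turn_detection_mode != "stt":
            return
        if event_type == "start_of_speech":
            self.speaking = True
        elif event_type == "end_of_speech":
            self.speaking = False
```

With `turn_detection_mode="stt"`, a VAD start event increments the interruption counter but leaves `speaking` unchanged; only an STT start event sets it.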

**Option B: explicit configuration.** Add a `user_state_source: Literal["vad", "stt", "auto"]` field to `TurnHandlingOptions` (or `AgentSession`). `"auto"` keeps current behavior. `"stt"` makes the VAD branch skip the `_speaking` writes while still running VAD inference for interruption detection. `"vad"` is today's default.
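As a rough illustration of Option B, the sketch below models only the VAD side of the proposed `user_state_source` field, since that is the part specified above. `ConfigurableRecognition`, `vad_may_write_speaking`, and `vad_events_seen` are hypothetical names, and the field does not exist in `livekit-agents` today.

```python
from typing import Literal

# Proposed value set from this issue; not an existing library type.
UserStateSource = Literal["vad", "stt", "auto"]


def vad_may_write_speaking(source: str) -> bool:
    """Return True if the VAD branch should write the speaking flag."""
    if source not in ("vad", "stt", "auto"):
        raise ValueError(f"unknown user_state_source: {source!r}")
    # Only "stt" suppresses VAD writes; "auto" and "vad" keep them.
    return source != "stt"


class ConfigurableRecognition:
    """Hypothetical model of Option B, covering the VAD branch only."""

    def __init__(self, user_state_source: UserStateSource = "auto") -> None:
        self.user_state_source = user_state_source
        self.speaking = False
        # Counts events VAD produced, showing inference keeps running.
        self.vad_events_seen = 0

    def on_vad_event(self, event_type: str) -> None:
        self.vad_events_seen += 1
        if not vad_may_write_speaking(self.user_state_source):
            return  # VAD stays active but no longer drives the flag
        if event_type == "start_of_speech":
            self.speaking = True
        elif event_type == "end_of_speech":
            self.speaking = False
```

Keeping the decision in one small predicate means the gate is independent of `turn_detection`, which is the decoupling this option is meant to provide.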

Option A is the simpler fix and follows the existing code style. Option B is more explicit and lets users decouple the two signals regardless of the `turn_detection` mode chosen.
