Agent transitions `thinking → speaking` after user has already transitioned to `speaking` (stale reply plays over new user turn) #5509

### Summary

We can reliably reproduce a race on `livekit-agents ~= 1.5.2`: the agent enters `speaking` and begins TTS playout even though the `UserStateChangedEvent` for the user's transition into `speaking` has already fired (about 170 ms earlier in the call logged below). The stale reply is played over the user's new turn, so both sides talk at once, which usually leads to a bad interruption or end-of-turn (EOT) resolution.

We shipped a workaround on our side that works, but we believe the SDK should guard this internally — once the session has observed the user transitioning into `speaking`, it shouldn't let a reply that is still in `thinking` (no audio emitted yet) promote to `speaking`.

The bug is visible when turn detection is driven by both VAD and STT.

### Reproduction — real call logs

In this call the LLM generation for the previous user turn had just started when the user began a new turn. The SDK still flipped the agent to `speaking`:

```
02:43:58.700  Agent state changed: listening → thinking
02:43:59.094  VAD speech detected (triggering_probability=0.90)
02:43:59.095  User state changed: away → speaking
02:43:59.144  VAD waiting for EOS (probability=0.92, min_silence_required=1.10s)
02:43:59.264  Agent state changed: thinking → speaking   ← should not happen
02:43:59.299  ElevenLabs TTS request completed
```

Between the user going `speaking` (02:43:59.095) and the agent going `speaking` (02:43:59.264) there is ~170 ms — plenty of time for the SDK to observe the user state transition and cancel the pending generation.
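For reference, the gap quoted above can be recomputed from the two log timestamps. This is only a small sketch: the helper name `ms_between` is ours, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def ms_between(start: str, end: str) -> int:
    """Milliseconds between two HH:MM:SS.mmm log timestamps (same day assumed)."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return round(delta.total_seconds() * 1000)


# Timestamps copied from the log excerpt above.
user_speaking_at = "02:43:59.095"   # User state changed: away → speaking
agent_speaking_at = "02:43:59.264"  # Agent state changed: thinking → speaking

gap_ms = ms_between(user_speaking_at, agent_speaking_at)
print(gap_ms)  # 169
```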

A full chat history + traces + logs bundle from one of these calls is available on request — we didn't want to paste customer data in a public issue.

### Our workaround

In our `user_state_changed` handler, when the new state is `speaking` and `session.agent_state == "thinking"`, we call `session.interrupt()` to cancel the in-flight generation before it can promote to `speaking`. The committed user turn's `on_end_of_turn` path then regenerates naturally against the new transcript.

Guardrails we had to add:

- Skip when `current_speech.num_steps > 1` — otherwise we cancel the follow-up generation that runs after a tool call (e.g. agent handoff).
- Skip when `current_speech.allow_interruptions` is `False` — `SpeechHandle.interrupt(force=False)` raises `RuntimeError` for non-interruptible speeches (our initial greeting, mandatory compliance messages, etc.).
- Wrap `session.interrupt()` in `try/except RuntimeError` as defence in depth for non-interruptible speeches sitting in the scheduling queue.
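
A minimal sketch of the workaround and its guardrails, for illustration only. It is written against duck-typed objects rather than the SDK classes, so it runs without `livekit-agents` installed. The function names and the `on_fired` metric hook are hypothetical, and the handling of a missing `current_speech` is our assumption; the attributes it reads (`new_state`, `agent_state`, `current_speech`, `num_steps`, `allow_interruptions`) are the ones described above.

```python
from typing import Any, Callable, Optional


def should_interrupt_thinking(
    new_user_state: str, agent_state: str, current_speech: Optional[Any]
) -> bool:
    """Return True when an in-flight `thinking` reply should be cancelled."""
    # Only act on the race itself: the user starts speaking while the
    # agent is still thinking (no audio emitted yet).
    if new_user_state != "speaking" or agent_state != "thinking":
        return False
    # Assumption: with no current speech handle there is nothing to protect,
    # so we fall through to the guarded interrupt call.
    if current_speech is None:
        return True
    # Guardrail: num_steps > 1 is the follow-up generation after a tool call.
    if current_speech.num_steps > 1:
        return False
    # Guardrail: never try to cancel a non-interruptible speech.
    if not current_speech.allow_interruptions:
        return False
    return True


def handle_user_state_changed(
    session: Any, ev: Any, on_fired: Optional[Callable[[], None]] = None
) -> bool:
    """Cancel the pending reply if the guardrails allow it; report whether we did."""
    if not should_interrupt_thinking(
        ev.new_state, session.agent_state, session.current_speech
    ):
        return False
    try:
        session.interrupt()
    except RuntimeError:
        # Defence in depth: a non-interruptible speech may still be queued.
        return False
    if on_fired is not None:
        on_fired()  # e.g. increment agent.thinking_interrupted_by_user
    return True
```

In an application this would be registered as the `user_state_changed` listener on the session, for example `session.on("user_state_changed", lambda ev: handle_user_state_changed(session, ev))`.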

### How often we see it

We added a metric (`agent.thinking_interrupted_by_user`) that increments every time our workaround fires. Across **72 completed calls** with **~51 s average duration**, the metric fired **127 times**, or roughly 1.8 times per call on average. This is not a rare edge case: in some conversations it fires several times within a single call. Since shipping the workaround we haven't seen the bug in call reviews.

### Why we think this belongs in the SDK

- Every agent hitting this race has to reimplement the same `user_state_changed` listener with the same guardrails (tool-call follow-ups, non-interruptible speeches). That logic is easy to get wrong.
- The SDK already knows the user's state and the agent's state — it is in a much better position than application code to break the tie atomically.
- Semantically, if the user is already in `speaking` when an interruptible reply is still in `thinking` (nothing audible yet), promoting it to `speaking` is never the right call.

### Suggested fix direction

Inside the session scheduler, before allowing a speech to transition from `thinking → speaking`, check whether the user is currently in the `speaking` state. If so, cancel the pending speech (with the same semantics `session.interrupt()` has today) rather than starting TTS playout. Respect `allow_interruptions=False` and the tool-call follow-up case in the same way we had to externally.
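
As a rough illustration of the proposed check, the promotion gate could look something like the toy model below. Nothing here is SDK code: `Decision`, `PendingSpeech`, and `gate_promotion` are hypothetical names, and the real scheduler will have more state to consider.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PLAY = "play"      # promote thinking → speaking and start TTS playout
    CANCEL = "cancel"  # drop the pending reply, as session.interrupt() would


@dataclass
class PendingSpeech:
    """Hypothetical stand-in for a speech that is still in `thinking`."""
    allow_interruptions: bool = True
    num_steps: int = 1


def gate_promotion(user_state: str, speech: PendingSpeech) -> Decision:
    """Decide, at promotion time, whether a pending reply may start playing."""
    # No conflict unless the user is already speaking.
    if user_state != "speaking":
        return Decision.PLAY
    # Non-interruptible speeches (greetings, compliance messages) always play.
    if not speech.allow_interruptions:
        return Decision.PLAY
    # Follow-up generations after a tool call are left alone.
    if speech.num_steps > 1:
        return Decision.PLAY
    # The user has taken the floor and nothing is audible yet: cancel.
    return Decision.CANCEL
```

Because the check runs at the moment of promotion rather than in an event listener, it would not depend on the ordering of the user and agent state events.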

If maintainers agree with this direction, happy to open a PR.

### Versions

```
livekit-agents ~= 1.5.2 (with deepgram/soniox/silero/turn-detector extras)
livekit-api >= 1.1.0, < 2
```
