Agent transitions thinking → speaking after user has already transitioned to speaking (stale reply plays over new user turn) #5509

@miguelmoralai

Summary

We are reliably reproducing a race on livekit-agents ~= 1.5.2: the agent enters speaking and begins TTS playout even though the UserStateChangedEvent for listening → speaking has already fired milliseconds earlier. The stale reply gets played over the user's new turn, causing both sides to talk at once and typically resulting in a bad interruption/EOT resolution.

We shipped a workaround on our side that works, but we believe the SDK should guard this internally — once the session has observed the user transitioning into speaking, it shouldn't let a reply that is still in thinking (no audio emitted yet) promote to speaking.

The bug is visible when turn detection is driven by both VAD and STT.

Reproduction — real call logs

In this call the LLM generation for the previous user turn had just started when the user began a new turn. The SDK still flipped the agent to speaking:

02:43:58.700  Agent state changed: listening → thinking
02:43:59.094  VAD speech detected (triggering_probability=0.90)
02:43:59.095  User state changed: away → speaking
02:43:59.144  VAD waiting for EOS (probability=0.92, min_silence_required=1.10s)
02:43:59.264  Agent state changed: thinking → speaking   ← should not happen
02:43:59.299  ElevenLabs TTS request completed

Between the user going speaking (02:43:59.095) and the agent going speaking (02:43:59.264) there is ~170 ms — plenty of time for the SDK to observe the user state transition and cancel the pending generation.
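The gap quoted above can be checked directly from the two log timestamps. This is a minimal sketch that assumes the HH:MM:SS.mmm format shown in the excerpt; `gap_ms` is a hypothetical helper name, not part of any SDK.

```python
from datetime import datetime, timedelta

# Assumed from the log excerpt above, e.g. "02:43:59.095".
LOG_TIME_FORMAT = "%H:%M:%S.%f"


def gap_ms(earlier: str, later: str) -> int:
    """Return the whole milliseconds elapsed between two same-day log timestamps."""
    start = datetime.strptime(earlier, LOG_TIME_FORMAT)
    end = datetime.strptime(later, LOG_TIME_FORMAT)
    return (end - start) // timedelta(milliseconds=1)


user_speaking_at = "02:43:59.095"   # User state changed: away → speaking
agent_speaking_at = "02:43:59.264"  # Agent state changed: thinking → speaking
print(gap_ms(user_speaking_at, agent_speaking_at))  # 169
```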

A full chat history + traces + logs bundle from one of these calls is available on request — we didn't want to paste customer data in a public issue.

Our workaround

In our user_state_changed handler, when the new state is speaking and session.agent_state == "thinking", we call session.interrupt() to cancel the in-flight generation before it can promote to speaking. The committed user turn's on_end_of_turn path then regenerates naturally against the new transcript.

Guardrails we had to add:

  • Skip when current_speech.num_steps > 1 — otherwise we cancel the follow-up generation that runs after a tool call (e.g. agent handoff).
  • Skip when current_speech.allow_interruptions is False: SpeechHandle.interrupt(force=False) raises RuntimeError for non-interruptible speeches (our initial greeting, mandatory compliance messages, etc.).
  • Wrap session.interrupt() in try/except RuntimeError as defence in depth for non-interruptible speeches sitting in the scheduling queue.
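For readers who want to reproduce the workaround, the handler and its guardrails can be sketched roughly as follows. This is not our production code: the function names `should_cancel_pending_reply` and `make_user_state_handler` and the `on_fired` metric hook are hypothetical, the session and event are duck-typed so the sketch runs without the SDK installed, and treating a missing `current_speech` as "go ahead and interrupt" is an assumption.

```python
from typing import Any, Callable, Optional


def should_cancel_pending_reply(
    new_user_state: str, agent_state: str, current_speech: Optional[Any]
) -> bool:
    """Decide whether a reply that is still in `thinking` should be cancelled."""
    if new_user_state != "speaking" or agent_state != "thinking":
        return False
    if current_speech is None:
        # Assumption: with no current speech handle there is nothing to guard.
        return True
    if current_speech.num_steps > 1:
        # Follow-up generation after a tool call (e.g. agent handoff): keep it.
        return False
    if not current_speech.allow_interruptions:
        # Non-interruptible speech (greeting, compliance message): keep it.
        return False
    return True


def make_user_state_handler(
    session: Any, on_fired: Optional[Callable[[], None]] = None
) -> Callable[[Any], None]:
    """Build a `user_state_changed` listener bound to `session`."""

    def on_user_state_changed(ev: Any) -> None:
        if not should_cancel_pending_reply(
            ev.new_state, session.agent_state, session.current_speech
        ):
            return
        try:
            session.interrupt()
        except RuntimeError:
            # Defence in depth: a non-interruptible speech may sit in the queue.
            return
        if on_fired is not None:
            on_fired()  # e.g. increment agent.thinking_interrupted_by_user

    return on_user_state_changed
```

Registration would then look something like `session.on("user_state_changed", make_user_state_handler(session))`. Keeping the decision in a pure function makes the guardrails easy to unit test without a live call.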

How often we see it

We added a metric (agent.thinking_interrupted_by_user) that increments every time our workaround fires. Across 72 completed calls with ~51s average duration, the metric fired 127 times, or about 1.8 times per call on average. This is not a rare edge case: in some conversations it fires multiple times per call. Since shipping the workaround we haven't seen the bug in call reviews.

Why we think this belongs in the SDK

  • Every agent hitting this race has to reimplement the same user_state_changed listener with the same guardrails (tool-call follow-ups, non-interruptible speeches). That logic is easy to get wrong.
  • The SDK already knows the user's state and the agent's state — it is in a much better position than application code to break the tie atomically.
  • Semantically, if the user is already in speaking when a reply is still in thinking (nothing audible yet), promoting it to speaking is never the right call.

Suggested fix direction

Inside the session scheduler, before allowing a speech to transition from thinking → speaking, check whether the user is currently in the speaking state. If so, cancel the pending speech (with the same semantics session.interrupt() has today) rather than starting TTS playout. Respect allow_interruptions=False and the tool-call follow-up case the same way we had to externally.
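To make the proposed check concrete, here is a toy model of the decision at the thinking → speaking boundary. It is a sketch only and does not reflect the SDK's actual scheduler: `PendingSpeech`, `Promotion`, and `decide_promotion` are hypothetical names, and the two exemptions mirror the guardrails from our workaround.

```python
from dataclasses import dataclass
from enum import Enum


class Promotion(Enum):
    PLAY = "play"      # let the speech move from thinking to speaking
    CANCEL = "cancel"  # drop the stale reply before any audio is emitted


@dataclass
class PendingSpeech:
    """Hypothetical stand-in for a speech that has not emitted audio yet."""

    allow_interruptions: bool = True
    num_steps: int = 1


def decide_promotion(user_state: str, speech: PendingSpeech) -> Promotion:
    """Decide what to do with a pending speech right before TTS playout."""
    if user_state != "speaking":
        return Promotion.PLAY
    if not speech.allow_interruptions:
        # Non-interruptible speech must still play.
        return Promotion.PLAY
    if speech.num_steps > 1:
        # Follow-up generation after a tool call must still play.
        return Promotion.PLAY
    return Promotion.CANCEL
```

Applied to the call above, the user was already speaking at 02:43:59.095, so an ordinary interruptible reply reaching this check at 02:43:59.264 would be cancelled instead of played.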

If maintainers agree with this direction, happy to open a PR.

Versions

livekit-agents ~= 1.5.2 (with deepgram/soniox/silero/turn-detector extras)
livekit-api >= 1.1.0, < 2
