Summary
We are reliably reproducing a race on livekit-agents ~= 1.5.2: the agent enters speaking and begins TTS playout even though the UserStateChangedEvent for the user's transition into speaking (listening → speaking, or away → speaking as in the logs below) has already fired milliseconds earlier. The stale reply gets played over the user's new turn, causing both sides to talk at once and typically resulting in a bad interruption/EOT resolution.
We shipped a workaround on our side that works, but we believe the SDK should guard this internally — once the session has observed the user transitioning into speaking, it shouldn't let a reply that is still in thinking (no audio emitted yet) promote to speaking.
The bug is visible when turn detection is driven by both VAD and STT.
Reproduction — real call logs
In this call the LLM generation for the previous user turn had just started when the user began a new turn. The SDK still flipped the agent to speaking:
02:43:58.700 Agent state changed: listening → thinking
02:43:59.094 VAD speech detected (triggering_probability=0.90)
02:43:59.095 User state changed: away → speaking
02:43:59.144 VAD waiting for EOS (probability=0.92, min_silence_required=1.10s)
02:43:59.264 Agent state changed: thinking → speaking ← should not happen
02:43:59.299 ElevenLabs TTS request completed
Between the user going speaking (02:43:59.095) and the agent going speaking (02:43:59.264) there is ~170 ms — plenty of time for the SDK to observe the user state transition and cancel the pending generation.
A full chat history + traces + logs bundle from one of these calls is available on request — we didn't want to paste customer data in a public issue.
Our workaround
In our user_state_changed handler, when the new state is speaking and session.agent_state == "thinking", we call session.interrupt() to cancel the in-flight generation before it can promote to speaking. The committed user turn's on_end_of_turn path then regenerates naturally against the new transcript.
Guardrails we had to add:
- Skip when current_speech.num_steps > 1; otherwise we cancel the follow-up generation that runs after a tool call (e.g. agent handoff).
- Skip when current_speech.allow_interruptions is False, because SpeechHandle.interrupt(force=False) raises RuntimeError for non-interruptible speeches (our initial greeting, mandatory compliance messages, etc.).
- Wrap session.interrupt() in try/except RuntimeError as defence in depth for non-interruptible speeches sitting in the scheduling queue.
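For reference, the handler and guardrails described above can be sketched roughly as follows. This is a simplified illustration, not our production code: the function names should_interrupt_thinking and on_user_state_changed are hypothetical, the objects are duck-typed so the logic can be read without the SDK installed, and treating a missing current_speech as "safe to interrupt" is an assumption of this sketch.

```python
def should_interrupt_thinking(new_user_state, agent_state, current_speech) -> bool:
    """Decide whether a reply that is still 'thinking' should be cancelled."""
    # Only act when the user starts speaking while the agent has not emitted audio yet.
    if new_user_state != "speaking" or agent_state != "thinking":
        return False
    # Assumption of this sketch: with no current speech there is nothing to protect.
    if current_speech is None:
        return True
    # Guardrail 1: leave tool-call follow-up generations (e.g. agent handoff) alone.
    if current_speech.num_steps > 1:
        return False
    # Guardrail 2: never touch non-interruptible speeches (greeting, compliance text).
    if not current_speech.allow_interruptions:
        return False
    return True


def on_user_state_changed(session, ev) -> bool:
    """Event handler body; returns True if an interrupt was actually issued."""
    if not should_interrupt_thinking(
        ev.new_state, session.agent_state, session.current_speech
    ):
        return False
    try:
        # Guardrail 3: a non-interruptible speech may still sit in the queue.
        session.interrupt()
    except RuntimeError:
        return False
    return True


# Wiring (assumed, requires a live AgentSession):
# session.on("user_state_changed", lambda ev: on_user_state_changed(session, ev))
```

Keeping the decision in a pure function makes the guardrails easy to unit-test separately from the session wiring.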
How often we see it
We added a metric (agent.thinking_interrupted_by_user) that increments every time our workaround fires. Across 72 completed calls with ~51s average duration, the metric fired 127 times, or roughly 1.8 times per call on average. This is not a rare edge case: it fires multiple times per call in some conversations. Since shipping the workaround we haven't seen the bug in call reviews.
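To put those numbers in perspective, here is the back-of-the-envelope arithmetic behind them. The inputs are the figures quoted above; the ~51s average is approximate, so the per-minute rate is only indicative.

```python
calls = 72            # completed calls in the sample
avg_duration_s = 51   # approximate average call duration, in seconds
fires = 127           # times agent.thinking_interrupted_by_user incremented

fires_per_call = fires / calls                 # average workaround triggers per call
total_minutes = calls * avg_duration_s / 60    # total call time in the sample
fires_per_minute = fires / total_minutes       # triggers per minute of call time

print(f"{fires_per_call:.2f} fires per call")                    # 1.76 fires per call
print(f"{fires_per_minute:.2f} fires per minute of call time")   # 2.08 fires per minute of call time
```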
Why we think this belongs in the SDK
- Every agent hitting this race has to reimplement the same user_state_changed listener with the same guardrails (tool-call follow-ups, non-interruptible speeches). That logic is easy to get wrong.
- The SDK already knows the user's state and the agent's state, so it is in a much better position than application code to break the tie atomically.
- Semantically, if the user is already in speaking when a reply is still in thinking (nothing audible yet), promoting it to speaking is never the right call.
Suggested fix direction
Inside the session scheduler, before allowing a speech to transition from thinking → speaking, check whether the user is currently in speaking state. If yes, cancel the pending speech (same semantics as session.interrupt() does today) rather than starting TTS playout. Respect allow_interruptions=False and the tool-call follow-up case the same way we had to externally.
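As a rough illustration of the check we have in mind, the gate could look something like the sketch below. This is not the SDK's scheduler code: Decision, PendingSpeech and gate_promotion are hypothetical names, and PendingSpeech only mirrors the two SpeechHandle attributes mentioned above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PROMOTE = "promote"  # proceed from thinking to speaking and start TTS playout
    CANCEL = "cancel"    # drop the pending reply; the new user turn regenerates it


@dataclass
class PendingSpeech:
    # Hypothetical stand-in for a speech that has not emitted any audio yet.
    allow_interruptions: bool = True
    num_steps: int = 1


def gate_promotion(user_state: str, speech: PendingSpeech) -> Decision:
    """Run immediately before a thinking → speaking transition."""
    if user_state != "speaking":
        return Decision.PROMOTE
    # Non-interruptible speeches must still play (greeting, compliance messages).
    if not speech.allow_interruptions:
        return Decision.PROMOTE
    # Tool-call follow-ups (num_steps > 1) are kept, as in our external workaround.
    if speech.num_steps > 1:
        return Decision.PROMOTE
    return Decision.CANCEL
```

Doing this check at the transition point, rather than in an application-level event listener, is what would let the SDK resolve the race in one place.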
If maintainers agree with this direction, happy to open a PR.
Versions
livekit-agents ~= 1.5.2 (with deepgram/soniox/silero/turn-detector extras)
livekit-api >= 1.1.0, < 2