fix(voice): pause output when user starts speaking during thinking#5535

Merged
longcw merged 3 commits into main from longc/pause-on-user-speaking on Apr 24, 2026

longcw commented Apr 23, 2026

Summary

Fixes #5509. When the user starts a new turn while the agent is in thinking (LLM generating, TTS audio not yet flowing), the stale reply could still reach speaking and play over the new user turn. This pauses the pausable audio output on VAD SOS before any TTS frame hits the transport, so the stale reply neither plays nor promotes to speaking.

Changes

livekit-agents/livekit/agents/voice/agent_activity.py

  • New on_start_of_speech path: when the agent is not speaking, pauses the output if it supports pause. Uses timeout=0 so a brief false-positive VAD resumes immediately on VAD EOS; _interrupt_by_audio_activity upgrades to the real timeout if VAD confirms active speech.
  • _update_paused_speech preserves the agent_state captured at first pause across repeat calls.
  • _PausedSpeechInfo carries handle + agent_state + timeout so resume restores the exact state (e.g. thinking, not speaking) that was active when the pause began.
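
The state-preservation behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in agent_activity.py: `PausedSpeechInfo`, `PauseTracker`, and the string handle are hypothetical stand-ins, and only the field names (handle, agent_state, timeout) come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedSpeechInfo:
    """Hypothetical stand-in for the PR's _PausedSpeechInfo."""
    handle: str        # identifies the paused speech (illustrative type)
    agent_state: str   # state captured when the pause first began
    timeout: float     # seconds to wait before resuming


class PauseTracker:
    """Hypothetical holder for the currently paused speech, if any."""

    def __init__(self) -> None:
        self.paused: Optional[PausedSpeechInfo] = None

    def update_paused_speech(
        self, handle: str, agent_state: str, timeout: float
    ) -> PausedSpeechInfo:
        if self.paused is not None and self.paused.handle == handle:
            # Repeat call for the same speech: keep the originally captured
            # agent_state and only refresh the timeout.
            self.paused.timeout = timeout
        else:
            self.paused = PausedSpeechInfo(handle, agent_state, timeout)
        return self.paused

    def resume(self) -> Optional[str]:
        """Clear the pause and return the state that should be restored."""
        info, self.paused = self.paused, None
        return info.agent_state if info is not None else None


tracker = PauseTracker()
# VAD start-of-speech while thinking: pause with timeout=0.
tracker.update_paused_speech("speech-1", "thinking", timeout=0.0)
# A later call for the same handle upgrades the timeout (2.0 is illustrative).
upgraded = tracker.update_paused_speech("speech-1", "speaking", timeout=2.0)
captured = (upgraded.agent_state, upgraded.timeout)
restored_state = tracker.resume()
print(captured, restored_state)
```

In this sketch the second call cannot overwrite the captured state, so resume restores thinking rather than speaking.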

Test helpers (tests/fake_*.py)

  • FakeAudioOutput(can_pause=True) now supports pausing.
  • FakeUserSpeech with empty transcript fires VAD SOS/EOS only (no STT events) — simulates sub-min_duration noise.

Known limitation

  • With timeout=0 on the new pause, if the user's speech is shorter than min_interruption_duration but STT still produces a final transcript that arrives after VAD EOS, the paused agent speech briefly resumes (timer fires on EOS) and then gets interrupted when the STT transcript lands — so the user hears a short snippet of the stale reply before it cuts off.
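
The timing behind this limitation can be worked through with a small calculation. The function name and the timestamps below are illustrative assumptions, not measurements from the PR; the sketch only models "resume fires at VAD EOS plus timeout, interrupt fires when the final transcript lands".

```python
def audible_blip_seconds(
    vad_eos_at: float, resume_timeout: float, final_transcript_at: float
) -> float:
    """Seconds of stale audio heard between resume and the STT-driven interrupt."""
    resume_at = vad_eos_at + resume_timeout
    # If the transcript arrives before the resume timer fires, nothing is heard.
    return max(0.0, final_transcript_at - resume_at)


# Illustrative timings: VAD EOS at 1.0 s, final transcript arrives at 1.5 s.
blip_with_zero_timeout = audible_blip_seconds(1.0, 0.0, 1.5)
blip_with_real_timeout = audible_blip_seconds(1.0, 2.0, 1.5)
print(blip_with_zero_timeout, blip_with_real_timeout)
```

With timeout=0 the stale reply is audible for the whole gap between VAD EOS and the transcript; with a longer timeout the transcript arrives first and no snippet plays.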

longcw added 3 commits April 23, 2026 09:53
…5509)

When the user starts a new turn while the agent is in thinking (LLM
generating, TTS audio not yet flowing), pause the pausable audio output
on VAD SOS so no stale TTS frames reach the transport and the stale
reply never promotes to speaking.

- New on_start_of_speech path pauses the output when agent_state is not
  "speaking" and the output supports pause. Uses timeout=0 so a brief
  false-positive VAD resumes immediately on VAD EOS.
- _update_paused_speech preserves the agent_state captured at first
  pause across repeat calls (Path A may re-call for the same handle
  with a real timeout; Path B yields to an existing pause via a
  call-site guard).
- _pause_enabled() centralizes the resume_false_interruption /
  false_interruption_timeout / can_pause gate.
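
A centralized gate of this kind might look like the sketch below. It assumes all three conditions must hold and that an unset timeout is represented as None; both are assumptions, and `PauseConfig` and `pause_enabled` are hypothetical names rather than the actual method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PauseConfig:
    """Hypothetical bundle of the three settings named in the commit message."""
    resume_false_interruption: bool
    false_interruption_timeout: Optional[float]  # None is assumed to mean "disabled"
    can_pause: bool  # whether the audio output supports pause


def pause_enabled(cfg: PauseConfig) -> bool:
    # Assumption: pausing is only attempted when every condition is satisfied.
    return (
        cfg.resume_false_interruption
        and cfg.false_interruption_timeout is not None
        and cfg.can_pause
    )


enabled = pause_enabled(PauseConfig(True, 2.0, True))
disabled_no_pause_support = pause_enabled(PauseConfig(True, 2.0, False))
disabled_no_timeout = pause_enabled(PauseConfig(True, None, True))
print(enabled, disabled_no_pause_support, disabled_no_timeout)
```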

Test helpers:
- FakeAudioOutput gains optional can_pause with a virtual playout clock
  (_started_at / _paused_at / _total_paused) so flush completion is
  deferred on pause and rescheduled on resume — played_duration is
  accurate for clear_buffer's playback_position.
- FakeUserSpeech with an empty transcript now fires VAD SOS/EOS only
  (no STT events), simulating sub-min_duration noise.
- create_session gains can_pause_audio passthrough.
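
The virtual playout clock can be sketched as below. The attribute names (_started_at, _paused_at, _total_paused) come from the commit message; the class, its methods, and the explicit `now` argument are assumptions made so the example is deterministic, and this is not the FakeAudioOutput code itself.

```python
from typing import Optional


class VirtualPlayoutClock:
    """Tracks how much audio has played, excluding time spent paused."""

    def __init__(self) -> None:
        self._started_at: Optional[float] = None
        self._paused_at: Optional[float] = None
        self._total_paused = 0.0

    def start(self, now: float) -> None:
        self._started_at = now

    def pause(self, now: float) -> None:
        # Ignore pause before playback starts or while already paused.
        if self._started_at is not None and self._paused_at is None:
            self._paused_at = now

    def resume(self, now: float) -> None:
        if self._paused_at is not None:
            self._total_paused += now - self._paused_at
            self._paused_at = None

    def played_duration(self, now: float) -> float:
        if self._started_at is None:
            return 0.0
        # While paused, the clock is frozen at the moment the pause began.
        end = self._paused_at if self._paused_at is not None else now
        return end - self._started_at - self._total_paused


clock = VirtualPlayoutClock()
clock.start(0.0)
clock.pause(1.0)
during_pause = clock.played_duration(2.5)
clock.resume(3.0)
after_resume = clock.played_duration(4.0)
print(during_pause, after_resume)
```

Because paused time is subtracted, a playback position reported at interruption reflects audio actually played rather than wall-clock time elapsed.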

Tests:
- test_interrupt_before_speaking_with_pausable_audio: #5509 regression —
  no speaking transition, playback_finished interrupted with
  playback_position=0, stale reply dropped from chat_ctx.
- test_false_interruption_before_speaking_resumes: brief VAD-only noise
  during thinking pauses then resumes on VAD EOS; speaking fires at
  ~3.8s (postponed) and playback completes without interruption.

chenghao-mou requested a review from a team on April 23, 2026 09:10

devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

miguelmoralai commented Apr 23, 2026

Nice approach — pausing sidesteps the idle risk entirely, no backstop needed.

One thought on the known limitation (short utterance → brief audio blip before cut-off): the resume on VAD EOS could be gated on whether the current STT interim has non-empty content. Empty interim → blip was noise, safe to resume. Non-empty interim → real speech, skip resume and let the interrupt path handle it. That's the heuristic we landed on for our own workaround and it eliminated that race cleanly.
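
The heuristic suggested here could be expressed roughly as follows. This sketches the commenter's idea only, not behavior implemented in the PR; the function name is hypothetical, and treating whitespace-only text as empty is an added assumption.

```python
def should_resume_on_vad_eos(interim_transcript: str) -> bool:
    """Resume paused output on VAD EOS only if STT has produced no interim text."""
    # Assumption: whitespace-only interim text counts as empty (noise).
    return interim_transcript.strip() == ""


decisions = {
    text: should_resume_on_vad_eos(text)
    for text in ["", "   ", "wait, actually"]
}
print(decisions)
```

An empty interim means the blip was likely noise and resuming is safe; a non-empty interim leaves the output paused so the interrupt path can handle it.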


chenghao-mou left a comment

lgtm. It worked well when I spoke during its thinking state.

longcw merged commit 63fc7fc into main on Apr 24, 2026
26 checks passed
longcw deleted the longc/pause-on-user-speaking branch on April 24, 2026 01:40

This is an automated Claude Code Routine created by @toubatbrian. It is currently in an experimental stage. The automation will start porting this PR into agents-js automatically.

Tracking: this core-runtime fix (voice/agent_activity.py — pause output when user starts speaking during thinking) qualifies for porting. A corresponding PR will be opened in livekit/agents-js shortly.


Generated by Claude Code

longcw commented Apr 24, 2026

@miguelmoralai

the resume on VAD EOS could be gated on whether the current STT interim has non-empty content.

Thanks for the advice. There is already interruption logic on the interim transcript: if a non-empty interim transcript is received, the timeout for resume is updated to the default value (e.g. 2 seconds).


Successfully merging this pull request may close these issues.

Agent transitions thinking → speaking after user has already transitioned to speaking (stale reply plays over new user turn)