fix(voice): pause output when user starts speaking during thinking #5535
…5509) When the user starts a new turn while the agent is in thinking (LLM generating, TTS audio not yet flowing), pause the pausable audio output on VAD SOS so no stale TTS frames reach the transport and the stale reply never promotes to speaking.

- New on_start_of_speech path pauses the output when agent_state is not "speaking" and the output supports pause. Uses timeout=0 so a brief false-positive VAD resumes immediately on VAD EOS.
- _update_paused_speech preserves the agent_state captured at first pause across repeat calls (Path A may re-call for the same handle with a real timeout; Path B yields to an existing pause via a call-site guard).
- _pause_enabled() centralizes the resume_false_interruption / false_interruption_timeout / can_pause gate.

Test helpers:
- FakeAudioOutput gains optional can_pause with a virtual playout clock (_started_at / _paused_at / _total_paused) so flush completion is deferred on pause and rescheduled on resume; played_duration is accurate for clear_buffer's playback_position.
- FakeUserSpeech with an empty transcript now fires VAD SOS/EOS only (no STT events), simulating sub-min_duration noise.
- create_session gains can_pause_audio passthrough.

Tests:
- test_interrupt_before_speaking_with_pausable_audio: #5509 regression. No speaking transition, playback_finished interrupted with playback_position=0, stale reply dropped from chat_ctx.
- test_false_interruption_before_speaking_resumes: brief VAD-only noise during thinking pauses then resumes on VAD EOS; speaking fires at ~3.8s (postponed) and playback completes without interruption.
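For readers skimming the commit message, here is a rough sketch of the pause gate and the "capture state at first pause" rule it describes. This is not the code from `agent_activity.py`: the class `ActivitySketch`, the string handle, and the method bodies are hypothetical stand-ins, and only the names `agent_state`, `timeout`, and `can_pause` come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PausedSpeechInfo:
    handle: str       # stand-in for the real speech handle (assumption)
    agent_state: str  # state captured when the pause first began
    timeout: float    # 0 means "resume as soon as VAD reports end of speech"


class ActivitySketch:
    """Hypothetical stand-in, not the livekit-agents class."""

    def __init__(self, can_pause: bool = True) -> None:
        self.agent_state = "thinking"
        self.can_pause = can_pause
        self.paused: Optional[PausedSpeechInfo] = None

    def update_paused_speech(self, handle: str, timeout: float) -> None:
        if self.paused is not None and self.paused.handle == handle:
            # Repeat call for the same handle: keep the state captured at
            # first pause and only refresh the timeout.
            self.paused.timeout = timeout
            return
        self.paused = PausedSpeechInfo(handle, self.agent_state, timeout)

    def on_start_of_speech(self, handle: str) -> None:
        # Pause only before speaking, only if the output is pausable,
        # and yield to a pause that already exists.
        if self.agent_state != "speaking" and self.can_pause and self.paused is None:
            self.update_paused_speech(handle, timeout=0.0)

    def on_end_of_speech(self) -> Optional[str]:
        # With timeout == 0 the pause was only a precaution: resume now
        # and restore the state that was active when the pause began.
        if self.paused is not None and self.paused.timeout == 0:
            restored = self.paused.agent_state
            self.paused = None
            self.agent_state = restored
            return restored
        return None
```

In this sketch a later call with a real timeout (for example 2.0) keeps the captured `thinking` state, and `on_end_of_speech` then leaves the pause in place for the timeout path to handle.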
Nice approach: pausing sidesteps the idle risk entirely, no backstop needed. One thought on the known limitation (a short utterance produces a brief audio blip before cut-off): the resume on VAD EOS could be gated on whether the current STT interim has non-empty content. If the interim is empty, the blip was noise and it is safe to resume. If the interim is non-empty, it was real speech, so skip the resume and let the interrupt path handle it. That's the heuristic we landed on for our own workaround and it eliminated that race cleanly.
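As a minimal sketch of the heuristic suggested in this comment, assuming the interim transcript is available as a plain string at VAD EOS (the function name is hypothetical and not part of livekit-agents):

```python
def should_resume_on_vad_eos(interim_transcript: str) -> bool:
    """Return True when a paused reply may resume at VAD end of speech."""
    # Empty or whitespace-only interim text: treat the VAD blip as noise.
    # Any real text: skip the resume and leave it to the interrupt path.
    return not interim_transcript.strip()
```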
chenghao-mou
left a comment
lgtm. It worked well when I spoke during its thinking state.
This is an automated Claude Code Routine created by @toubatbrian. Right now it is in experimentation stage. The automation will start porting this PR into agents-js automatically. Tracking: this core-runtime fix (voice/agent_activity.py: pause output when user starts speaking during
Generated by Claude Code
Thanks for the advice. There is already interruption logic on the interim transcript: if a non-empty interim transcript is received, the resume timeout is updated to the default value (e.g. 2 seconds).
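The behavior described in this reply could look roughly like the following. It is a sketch only: the function and constant names are hypothetical, and the 2-second default is taken from the "e.g. 2 seconds" above rather than from the library.

```python
DEFAULT_RESUME_TIMEOUT = 2.0  # assumed default, per the "e.g. 2 seconds" above


def resume_timeout_after_interim(
    current_timeout: float,
    interim_transcript: str,
    default_timeout: float = DEFAULT_RESUME_TIMEOUT,
) -> float:
    """Upgrade the resume timeout once STT confirms real speech."""
    if interim_transcript.strip():
        # Non-empty interim: stop treating the pause as a false positive.
        return default_timeout
    # No text yet: keep the current timeout (0 resumes on VAD EOS).
    return current_timeout
```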
Summary
Fixes #5509. When the user starts a new turn while the agent is in `thinking` (LLM generating, TTS audio not yet flowing), the stale reply could still reach `speaking` and play over the new user turn. This pauses the pausable audio output on VAD SOS before any TTS frame hits the transport, so the stale reply neither plays nor promotes to `speaking`.

Changes

`livekit-agents/livekit/agents/voice/agent_activity.py`

- `on_start_of_speech` path: when the agent is not `speaking`, pauses the output if it supports pause. Uses `timeout=0` so a brief false-positive VAD resumes immediately on VAD EOS; `_interrupt_by_audio_activity` upgrades to the real timeout if VAD confirms active speech.
- `_update_paused_speech` preserves the `agent_state` captured at first pause across repeat calls.
- `_PausedSpeechInfo` carries `handle + agent_state + timeout` so resume restores the exact state (e.g. `thinking`, not `speaking`) that was active when the pause began.

Test helpers (`tests/fake_*.py`)

- `FakeAudioOutput(can_pause=True)` supports pause now.
- `FakeUserSpeech` with empty `transcript` fires VAD SOS/EOS only (no STT events), which simulates sub-`min_duration` noise.

Known limitation