[Detail Bug] Voice: Canceling a push-to-talk turn causes user turn limits to carry over to the next turn #5851

@detail-app

Summary

  • Context: AudioRecognition.clear_user_turn() is used to discard a user's in-progress turn (e.g., push-to-talk cancel_turn). When a turn is discarded, all per-turn state should be reset.
  • Bug: clear_user_turn() resets _speech_start_time but fails to reset _turn_tracker, causing word count, transcript, and duration to persist across discarded turns.
  • Actual vs. Expected: After clear_user_turn(), the next turn should start with a fresh tracker. Instead, it inherits the word count, transcript, and start timestamp of the discarded turn, so user_turn_limit triggers prematurely.
  • Impact: In push-to-talk scenarios with user_turn_limit configured, valid user speech is incorrectly interrupted based on accumulated data from canceled turns.

Response to Reviewer Objections

Objection 1: "Tests bypass the actual event pipeline"

Clarification: The tests call _check_user_turn_limit(transcript) directly, which is the SAME code path invoked by the STT event handler at line 903:

# audio_recognition.py:858-903
if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
    transcript = ev.alternatives[0].text  # Fresh from STT
    # ...
    self._check_user_turn_limit(transcript)  # Called with fresh transcript

The transcript argument comes fresh from the STT provider. The bug is NOT in the transcript content - it's in the _turn_tracker state that _check_user_turn_limit uses:

# audio_recognition.py:1244-1245
self._turn_tracker.words += len(words)  # Accumulates into OLD state
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()

Key insight: Even if STT sends fresh transcripts, if _turn_tracker.words already has 3 from the canceled turn, the new 3-word transcript becomes 6 total.
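The arithmetic can be reproduced with a minimal stand-alone sketch. TurnTracker and check_user_turn_limit below are hypothetical stand-ins for _UserTurnTracker and _check_user_turn_limit. Whitespace splitting stands in for the real word tokenizer, and the strict "greater than" comparison is an assumption. Only the two accumulation lines mirror the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class TurnTracker:
    """Hypothetical stand-in for _UserTurnTracker (fields assumed from the excerpt)."""

    words: int = 0
    transcript: str = ""


def check_user_turn_limit(tracker: TurnTracker, transcript: str, max_words: int) -> bool:
    """Accumulate a fresh transcript into the tracker; return True if the limit is exceeded."""
    words = transcript.split()  # assumption: whitespace tokenization
    tracker.words += len(words)
    tracker.transcript = f"{tracker.transcript} {transcript}".strip()
    return tracker.words > max_words  # assumption: strict comparison


tracker = TurnTracker()
first = check_user_turn_limit(tracker, "one two three", max_words=5)
# The canceled turn is discarded here, but nothing resets the tracker.
second = check_user_turn_limit(tracker, "four five six", max_words=5)
print(first, second, tracker.words, tracker.transcript)
```

Both transcripts are fresh and individually under the limit. The second call still reports an exceeded limit because it adds to the count left behind by the first.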

Objection 2: "STT pipeline is reset in clear_user_turn()"

Acknowledgment: Yes, clear_user_turn() calls update_stt(None) then update_stt(stt) at lines 706-708, which creates a fresh STT pipeline.

But this doesn't reset _turn_tracker:

# audio_recognition.py:593-622 - update_stt() implementation
def update_stt(self, stt: io.STTNode | None, *, pipeline: _STTPipeline | None = None) -> None:
    self._stt = stt
    # Creates new pipeline, resets transcript_buffer, etc.
    # BUT: Does NOT touch self._turn_tracker

The update_stt() method resets _transcript_buffer, _ignore_user_transcript_until, and _input_started_at, but not _turn_tracker. This is the bug - inconsistent reset behavior.
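A toy model makes the asymmetry concrete. Everything here is a hypothetical stand-in: Pipeline for _STTPipeline, Tracker for _UserTurnTracker, and a boolean argument in place of the real stt parameter. It only illustrates that replacing one attribute leaves an unrelated attribute untouched.

```python
from typing import List, Optional


class Pipeline:
    """Hypothetical stand-in for _STTPipeline."""


class Tracker:
    """Hypothetical stand-in for _UserTurnTracker."""

    def __init__(self) -> None:
        self.words = 0


class Recognition:
    """Toy model holding only the attributes discussed above."""

    def __init__(self) -> None:
        self.pipeline: Optional[Pipeline] = Pipeline()
        self.transcript_buffer: List[str] = []
        self.tracker = Tracker()

    def update_stt(self, enabled: bool) -> None:
        # Mirrors the described behavior: new pipeline, cleared buffer,
        # and no assignment to self.tracker.
        self.pipeline = Pipeline() if enabled else None
        self.transcript_buffer.clear()


rec = Recognition()
rec.tracker.words = 3  # state left over from the canceled turn
old_pipeline, old_tracker = rec.pipeline, rec.tracker

rec.update_stt(False)
rec.update_stt(True)

pipeline_replaced = rec.pipeline is not old_pipeline
tracker_survived = rec.tracker is old_tracker and rec.tracker.words == 3
print(pipeline_replaced, tracker_survived)
```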

Objection 3: "The feature is brand new, accumulation may be intentional"

Design intent analysis:

  1. Line 220 comment: "accumulates across turns until agent speaks"
  2. Lines 297-298: Reset happens in on_start_of_agent_speech():
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()
  3. The clear_user_turn() method has a clear purpose in its docstring:
# audio_recognition.py:688
def clear_user_turn(self) -> None:

The method is called to discard a turn (see push_to_talk.py line 88). A discarded turn should not contribute to limits because:

  • It was never committed to chat context
  • The user explicitly aborted it
  • It should be as if the turn never happened

Contrast with committed turns:

  • In test_accumulation_across_interrupted_turns, turn 1 IS committed (via on_user_turn_completed)
  • The accumulation is intentional because turn 1 is part of conversation history
  • In cancel_turn, turn 1 is DISCARDED, NOT committed

Objection 4: "No demonstration with real STT events"

CONCLUSIVE EVIDENCE: The integration test test_clear_user_turn_integration_word_limit demonstrates the bug through the actual STT event pipeline using FakeSession, FakeSTT, and FakeVAD.

Test Output (Bug Confirmed)

Bug demonstration:
  exceeded_events count: 1
  accumulated_word_count: 6
  accumulated_transcript: 'one two three four five six'

Turn 1: 3 words → clear_user_turn() → Turn 2: 3 words → 6 accumulated (limit was 5).

Real Event Flow

FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch → 
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()

This is the exact same code path used in production. The STT pipeline IS reset by update_stt(None) + update_stt(stt) at lines 706-708, creating a fresh _STTPipeline with new event_ch. But _turn_tracker persists.

Why pipeline reset doesn't help: The new STT events accumulate into the OLD _turn_tracker state (line 1244: self._turn_tracker.words += len(words)). The pipeline is fresh, but the tracker is stale.


Design Intent Clarification

The reviewer asks whether the accumulation is intentional design. The evidence shows it is NOT:

The Comment "accumulates across turns until agent speaks" (Line 220)

This comment describes behavior for committed turns:

  1. User speaks turn 1
  2. Turn 1 is committed to chat context (via on_user_turn_completed)
  3. Agent hasn't spoken yet (slow LLM, or user interrupts)
  4. User speaks turn 2
  5. Accumulation is correct - turn 1 IS part of conversation history

The clear_user_turn() Method Has Different Purpose

# audio_recognition.py:688
def clear_user_turn(self) -> None:

This method is called to discard a turn (see push_to_talk.py line 88). The method's purpose is clear from its name and from what it resets:

  • _audio_transcript = "" - Clear transcript
  • _speech_start_time = None - Clear timing
  • _vad_speech_started = False - Clear VAD state
  • update_stt(None); update_stt(stt) - Reset STT pipeline

All per-turn state is cleared except _turn_tracker.

Committed vs Discarded: The Critical Distinction

| Scenario | Turn 1 Status | Accumulation Correct? |
| --- | --- | --- |
| test_accumulation_across_interrupted_turns | Committed to chat context | Yes - turn 1 is history |
| cancel_turn in push_to_talk | Discarded (NOT committed) | No - turn 1 should be erased |

When clear_user_turn() is called, the turn is discarded. It is NOT committed to chat context. The user's intent is to cancel that turn. It should be as if the turn never happened.
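The distinction can be written down as a small replay function. This is a sketch of the semantics argued for in this report, not of the current implementation. The event names ("transcript", "commit", "discard", "agent_speech") are invented for illustration, and words are counted by whitespace splitting.

```python
from typing import List, Tuple


def expected_word_count(events: List[Tuple[str, str]]) -> int:
    """Replay (kind, payload) events under the semantics argued for above."""
    words = 0
    for kind, payload in events:
        if kind == "transcript":
            words += len(payload.split())
        elif kind in ("discard", "agent_speech"):
            words = 0  # discarded turns and agent speech both start from zero
        elif kind == "commit":
            pass  # committed turns stay in the running total
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return words


committed = expected_word_count(
    [("transcript", "one two three"), ("commit", ""), ("transcript", "four five six")]
)
discarded = expected_word_count(
    [("transcript", "one two three"), ("discard", ""), ("transcript", "four five six")]
)
print(committed, discarded)
```

Under these rules the committed sequence totals 6 words and the discarded sequence totals 3, matching the two rows of the table.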

Why the Fix is Correct

If accumulation across ALL turns, including canceled ones, were intentional, then:

  1. clear_user_turn() would NOT reset _speech_start_time
  2. The method would not reset _audio_transcript
  3. The method would not reset VAD state

But it DOES reset all of these. The intent is clear: discard all state for this turn. The _turn_tracker omission is an oversight.


Key Distinction: Committed vs. Discarded Turns

Committed Turn (Intentional Accumulation)

In test_accumulation_across_interrupted_turns():

  1. User speaks turn 1 ("one two three")
  2. Turn 1 is committed to chat context (via on_user_turn_completed)
  3. User interrupts before agent speaks
  4. Turn 2 starts ("four five six")
  5. Accumulation is intentional - turn 1 was committed, user interrupted

The accumulated_transcript correctly equals "one two three four five six" because turn 1 is part of the conversation history.

Discarded Turn (Bug Scenario)

In push-to-talk cancel_turn:

  1. User presses button, speaks turn 1 ("one two three")
  2. User releases button without committing → calls cancel_turn → calls clear_user_turn()
  3. Turn 1 is discarded (NOT committed to chat context)
  4. User presses button again, speaks turn 2 ("four five six")
  5. Turn 2's _check_user_turn_limit() uses accumulated values from discarded turn 1
  6. Bug: accumulated_word_count = 6 (should be 3)

The key difference: Discarded turns should NOT contribute to limits.


Evidence

Evidence 1: clear_user_turn() resets all per-turn state EXCEPT _turn_tracker

# audio_recognition.py:688-708
def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None        # Reset ✓
    self._last_speaking_time = None        # Reset ✓
    self._vad_speech_started = False       # Reset ✓
    self._user_turn_committed = False      # Reset ✓
    # ... user_turn_span handling ...
    # _turn_tracker NOT reset              # BUG ✗

The method resets all per-turn buffers and timing state. _turn_tracker serves the same purpose (per-turn limit tracking) but is not reset.

Evidence 2: _check_user_turn_limit() incorrectly inherits from discarded turn

# audio_recognition.py:1240-1245
if self._turn_tracker.started_at is None:
    self._turn_tracker.started_at = self._speech_start_time or now

words = self._word_tokenizer.tokenize(transcript)
self._turn_tracker.words += len(words)  # Accumulates from discarded turn!
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()

After clear_user_turn():

  • _turn_tracker.started_at is NOT None (has timestamp from discarded turn)
  • _turn_tracker.words has count from discarded turn
  • Duration calculation: now - _turn_tracker.started_at includes time from discarded turn
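The duration effect follows from the "is None" guard alone. The sketch below copies that guard into a stand-alone function. TurnTracker and turn_duration are hypothetical names, and the clock values are passed in explicitly so the numbers are deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnTracker:
    """Hypothetical stand-in for _UserTurnTracker, timing field only."""

    started_at: Optional[float] = None


def turn_duration(tracker: TurnTracker, speech_start_time: Optional[float], now: float) -> float:
    # Same guard as the excerpt: started_at is only assigned while it is still None.
    if tracker.started_at is None:
        tracker.started_at = speech_start_time or now
    return now - tracker.started_at


tracker = TurnTracker()
d1 = turn_duration(tracker, speech_start_time=100.0, now=100.5)  # turn 1, later discarded
# clear_user_turn() clears _speech_start_time, but tracker.started_at stays at 100.0.
d2 = turn_duration(tracker, speech_start_time=102.0, now=102.25)  # turn 2
d2_fresh = turn_duration(TurnTracker(), speech_start_time=102.0, now=102.25)
print(d1, d2, d2_fresh)
```

With the stale tracker, turn 2 is measured from the start of turn 1 (2.25 seconds). With a fresh tracker, the same turn measures 0.25 seconds.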

Evidence 3: End-to-End Integration Test (STT Event Pipeline)

CRITICAL: Test test_clear_user_turn_integration_word_limit in tests/test_clear_user_turn_integration.py demonstrates the bug through the actual STT event pipeline using FakeSession, FakeSTT, and FakeVAD.

Test Output (Actual Evidence)

Bug demonstration:
  exceeded_events count: 1
  accumulated_word_count: 6
  accumulated_transcript: 'one two three four five six'

What the Test Does

  1. Creates a session with user_turn_limit: {max_words: 5}
  2. Turn 1: FakeSTT sends FINAL_TRANSCRIPT "one two three" (3 words)
  3. Calls session.clear_user_turn() - simulating push-to-talk cancel_turn
  4. Turn 2: FakeSTT sends FINAL_TRANSCRIPT "four five six" (3 words)
  5. BUG: on_user_turn_exceeded fires with accumulated_word_count = 6

Code Path (Real Event Flow)

FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch → 
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()

This is the exact same code path used in production. The STT pipeline IS reset by update_stt(None) + update_stt(stt) at lines 706-708, but the _turn_tracker persists.

Why STT Pipeline Reset Doesn't Help

  • update_stt(None) closes the old _STTPipeline and cancels _stt_consumer_atask
  • update_stt(stt) creates a NEW _STTPipeline with fresh event_ch
  • BUT: Neither operation touches self._turn_tracker
  • The new STT events accumulate into the OLD _turn_tracker state

Evidence 4: Direct Unit Tests

Tests in tests/test_clear_user_turn_tracker_bug.py isolate the bug to _turn_tracker state:

Test 1: Direct state inspection (test_tracker_not_reset_on_clear_user_turn)

recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn()

assert recognition._speech_start_time is None  # Correctly reset
assert recognition._turn_tracker.words == 3    # BUG: Not reset
assert recognition._turn_tracker.transcript == "one two three"  # BUG: Not reset

Test 2: False trigger due to word accumulation (test_canceled_turn_causes_false_word_trigger)

# Turn 1: 3 words
recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn()  # Discard turn 1

# Turn 2: 3 more words
recognition._check_user_turn_limit("four five six")

# BUG: Event fires with accumulated_word_count = 6 (limit was 5)
assert len(hooks.exceeded_events) == 1
assert hooks.exceeded_events[0].accumulated_word_count == 6  # Should be 3

Test 3: Duration uses wrong timestamp (test_canceled_turn_causes_false_duration_trigger)

# max_duration = 1.0 second

# Turn 1: started 2 seconds ago
recognition._check_user_turn_limit("hello")
recognition.clear_user_turn()

# Turn 2: starts now
recognition._check_user_turn_limit("world")

# BUG: duration = now - turn1_start ≈ 2 seconds
# This exceeds max_duration immediately!
assert hooks.exceeded_events[1].duration > 1.5

Test 4: Fix verification (test_fix_verification)

# With manual reset of _turn_tracker after clear_user_turn:
recognition._turn_tracker = _UserTurnTracker()
recognition._check_user_turn_limit("four five six")

# No event - correct behavior
assert len(hooks.exceeded_events) == 0

All tests pass, confirming the bug exists and the fix resolves it.

Evidence 5: Push-to-talk pattern (Real World Impact)

# examples/voice_agents/push_to_talk.py:85-89
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)
    session.clear_user_turn()  # Called when user aborts turn

Users combining push-to-talk with user_turn_limit will encounter this bug. Example:

  1. User presses button, says "one two three"
  2. User releases button (accidentally), triggering cancel_turn
  3. User presses button again, says "four five six"
  4. If max_words=5, the second turn triggers on_user_turn_exceeded incorrectly

Evidence 6: Inconsistent reset behavior in update_stt()

# audio_recognition.py:607-610
def update_stt(self, stt: io.STTNode | None, ...) -> None:
    # ...
    # reset interruption handling related state
    self._transcript_buffer.clear()
    self._ignore_user_transcript_until = NOT_GIVEN
    self._input_started_at = None
    # Note: _turn_tracker NOT reset here either

The update_stt() method resets several state variables but not _turn_tracker. This shows the oversight is in both clear_user_turn() and update_stt().

Evidence 7: Complete analysis of _turn_tracker reset locations

The _turn_tracker is reset in exactly ONE place:

# audio_recognition.py:297-298
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()

This reset occurs when the agent speaks. There is NO reset when:

  • clear_user_turn() is called (user discards turn)
  • update_stt() is called (STT pipeline reset)
  • Agent handoff scenarios
  • Session close/cleanup

This confirms the reset logic is incomplete - it only covers the "agent speaks" case, not the "user cancels turn" case.
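A claim like "reset in exactly one place" can be checked mechanically by scanning for plain assignments to the attribute. The sketch below uses the standard ast module on an abridged, hypothetical SOURCE string that only imitates the method shapes quoted in this report. Running it against the real audio_recognition.py is left as an assumption. It looks only at plain assignments inside regular "def" functions, so augmented assignments such as "+=" and async methods are not reported.

```python
import ast
from typing import List

# Abridged, hypothetical source imitating the methods quoted in this report.
SOURCE = '''
class AudioRecognition:
    def on_start_of_agent_speech(self, started_at):
        self._turn_tracker = _UserTurnTracker()

    def clear_user_turn(self):
        self._speech_start_time = None
        self._vad_speech_started = False

    def update_stt(self, stt):
        self._stt = stt
        self._input_started_at = None
'''


def methods_assigning(source: str, attr: str) -> List[str]:
    """Return the names of functions that contain a plain assignment to self.<attr>."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        for sub in ast.walk(node):
            if not isinstance(sub, ast.Assign):
                continue
            for target in sub.targets:
                if (
                    isinstance(target, ast.Attribute)
                    and isinstance(target.value, ast.Name)
                    and target.value.id == "self"
                    and target.attr == attr
                ):
                    found.add(node.name)
    return sorted(found)


print(methods_assigning(SOURCE, "_turn_tracker"))
print(methods_assigning(SOURCE, "_speech_start_time"))
```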


Why the Existing Tests Don't Catch This

The tests in test_user_turn_exceeded.py only cover:

  1. test_reset_on_agent_speaking - Agent speaks between turns (resets via on_start_of_agent_speech)
  2. test_accumulation_across_interrupted_turns - Turn 1 is COMMITTED before accumulation

Neither test covers the clear_user_turn() code path where a turn is discarded.


Recommended Fix

Add _turn_tracker reset to clear_user_turn():

def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None
    self._last_speaking_time = None
    self._vad_speech_started = False
    self._user_turn_committed = False
    # FIX: Reset _turn_tracker to prevent accumulation from discarded turns
    self._turn_tracker = _UserTurnTracker()
    # ... rest of method ...

This is consistent with:

  1. The method's purpose of clearing all per-turn state
  2. The reset pattern in on_start_of_agent_speech() (line 298)
  3. The intent that discarded turns should not contribute to limits
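A regression test for the fix could take roughly the following shape. ToyRecognition is a hypothetical, heavily reduced model rather than the real AudioRecognition class. Its exceeded_events list stands in for the on_user_turn_exceeded hook, and the strict comparison against max_words is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tracker:
    """Hypothetical stand-in for _UserTurnTracker."""

    words: int = 0
    transcript: str = ""


@dataclass
class ToyRecognition:
    """Reduced model of the limit check with the proposed fix applied."""

    max_words: int
    tracker: Tracker = field(default_factory=Tracker)
    exceeded_events: List[int] = field(default_factory=list)

    def check_user_turn_limit(self, transcript: str) -> None:
        self.tracker.words += len(transcript.split())
        self.tracker.transcript = f"{self.tracker.transcript} {transcript}".strip()
        if self.tracker.words > self.max_words:  # assumption: strict comparison
            self.exceeded_events.append(self.tracker.words)

    def clear_user_turn(self) -> None:
        # The proposed fix: start from a fresh tracker when a turn is discarded.
        self.tracker = Tracker()


def test_discarded_turn_does_not_count() -> None:
    rec = ToyRecognition(max_words=5)
    rec.check_user_turn_limit("one two three")
    rec.clear_user_turn()
    rec.check_user_turn_limit("four five six")
    assert rec.exceeded_events == []
    assert rec.tracker.words == 3
    assert rec.tracker.transcript == "four five six"


test_discarded_turn_does_not_count()
```

Removing the assignment in clear_user_turn() makes the first assertion fail with one recorded event, which is the behavior this report describes.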

History

This bug was introduced in commit c4daef3 (@longcw, 2026-05-18, PR #5492). The commit added the user_turn_limit feature with _UserTurnTracker to track word count, transcript, and duration across user turns. The developer correctly reset _turn_tracker in on_start_of_agent_speech() (lines 297-298) but did not add the same reset to clear_user_turn(), even though that method already resets the other per-turn state variables such as _speech_start_time, _vad_speech_started, and _audio_transcript. As a result, turns discarded via clear_user_turn() contribute their accumulated values to subsequent turns.
