Summary
- Context: AudioRecognition.clear_user_turn() is used to discard a user's in-progress turn (e.g., push-to-talk cancel_turn). When a turn is discarded, all per-turn state should be reset.
- Bug: clear_user_turn() resets _speech_start_time but fails to reset _turn_tracker, causing word count, transcript, and duration to persist across discarded turns.
- Actual vs. Expected: After clear_user_turn(), subsequent turns incorrectly inherit word counts and timestamps from the discarded turn, causing user_turn_limit to trigger prematurely.
- Impact: In push-to-talk scenarios with user_turn_limit configured, valid user speech is incorrectly interrupted based on accumulated data from canceled turns.
Response to Reviewer Objections
Objection 1: "Tests bypass the actual event pipeline"
Clarification: The tests call _check_user_turn_limit(transcript) directly, which is the SAME code path invoked by the STT event handler at line 903:
# audio_recognition.py:858-903
if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
    transcript = ev.alternatives[0].text  # Fresh from STT
    # ...
    self._check_user_turn_limit(transcript)  # Called with fresh transcript
The transcript argument comes fresh from the STT provider. The bug is NOT in the transcript content - it's in the _turn_tracker state that _check_user_turn_limit uses:
# audio_recognition.py:1244-1245
self._turn_tracker.words += len(words) # Accumulates into OLD state
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()
Key insight: Even if STT sends fresh transcripts, if _turn_tracker.words already has 3 from the canceled turn, the new 3-word transcript becomes 6 total.
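The accumulation described above can be reproduced in isolation. The following is a minimal sketch, not the library's code: TrackerSketch and check_limit are hypothetical stand-ins for _UserTurnTracker and _check_user_turn_limit, str.split() replaces the real word tokenizer, and the strict "greater than" comparison against the limit is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TrackerSketch:
    """Hypothetical stand-in for _UserTurnTracker (word-related fields only)."""

    words: int = 0
    transcript: str = ""


def check_limit(tracker: TrackerSketch, transcript: str, max_words: int) -> bool:
    """Accumulate a fresh transcript into the tracker and report whether the limit is exceeded."""
    # str.split() stands in for the real word tokenizer (an assumption).
    tracker.words += len(transcript.split())
    tracker.transcript = f"{tracker.transcript} {transcript}".strip()
    return tracker.words > max_words


tracker = TrackerSketch()
first_exceeded = check_limit(tracker, "one two three", max_words=5)
# clear_user_turn() would run here; in the current code it leaves the tracker untouched.
second_exceeded = check_limit(tracker, "four five six", max_words=5)
```

Both transcripts are fresh and individually under the limit, yet the second call reports the limit as exceeded because the tracker still holds the first turn's words.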
Objection 2: "STT pipeline is reset in clear_user_turn()"
Acknowledgment: Yes, clear_user_turn() calls update_stt(None) then update_stt(stt) at lines 706-708, which creates a fresh STT pipeline.
But this doesn't reset _turn_tracker:
# audio_recognition.py:593-622 - update_stt() implementation
def update_stt(self, stt: io.STTNode | None, *, pipeline: _STTPipeline | None = None) -> None:
    self._stt = stt
    # Creates new pipeline, resets transcript_buffer, etc.
    # BUT: Does NOT touch self._turn_tracker
The update_stt() method resets _transcript_buffer, _ignore_user_transcript_until, and _input_started_at, but not _turn_tracker. This is the bug - inconsistent reset behavior.
Objection 3: "The feature is brand new, accumulation may be intentional"
Design intent analysis:
- Line 220 comment: "accumulates across turns until agent speaks"
- Lines 297-298: Reset happens in on_start_of_agent_speech():
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()
- The clear_user_turn() method has a clear purpose in its docstring:
# audio_recognition.py:688
def clear_user_turn(self) -> None:
The method is called to discard a turn (see push_to_talk.py line 88). A discarded turn should not contribute to limits because:
- It was never committed to chat context
- The user explicitly aborted it
- It should be as if the turn never happened
Contrast with committed turns:
- In test_accumulation_across_interrupted_turns, turn 1 IS committed (via on_user_turn_completed)
- The accumulation is intentional because turn 1 is part of conversation history
- In cancel_turn, turn 1 is DISCARDED, NOT committed
Objection 4: "No demonstration with real STT events"
CONCLUSIVE EVIDENCE: The integration test test_clear_user_turn_integration_word_limit demonstrates the bug through the actual STT event pipeline using FakeSession, FakeSTT, and FakeVAD.
Test Output (Bug Confirmed)
Bug demonstration:
exceeded_events count: 1
accumulated_word_count: 6
accumulated_transcript: 'one two three four five six'
Turn 1: 3 words → clear_user_turn() → Turn 2: 3 words → 6 accumulated (limit was 5).
Real Event Flow
FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch →
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()
This is the exact same code path used in production. The STT pipeline IS reset by update_stt(None) + update_stt(stt) at lines 706-708, creating a fresh _STTPipeline with new event_ch. But _turn_tracker persists.
Why pipeline reset doesn't help: The new STT events accumulate into the OLD _turn_tracker state (line 1244: self._turn_tracker.words += len(words)). The pipeline is fresh, but the tracker is stale.
Design Intent Clarification
The reviewer asks whether the accumulation is intentional design. The evidence shows it is NOT:
The Comment "accumulates across turns until agent speaks" (Line 220)
This comment describes behavior for committed turns:
- User speaks turn 1
- Turn 1 is committed to chat context (via on_user_turn_completed)
- Agent hasn't spoken yet (slow LLM, or user interrupts)
- User speaks turn 2
- Accumulation is correct - turn 1 IS part of conversation history
The clear_user_turn() Method Has Different Purpose
# audio_recognition.py:688
def clear_user_turn(self) -> None:
This method is called to discard a turn (see push_to_talk.py line 88). The method's purpose is clear from its name and from what it resets:
- _audio_transcript = "" (clears the transcript)
- _speech_start_time = None (clears timing)
- _vad_speech_started = False (clears VAD state)
- update_stt(None); update_stt(stt) (resets the STT pipeline)
All per-turn state is cleared except _turn_tracker.
Committed vs Discarded: The Critical Distinction
| Scenario | Turn 1 Status | Accumulation Correct? |
| --- | --- | --- |
| test_accumulation_across_interrupted_turns | Committed to chat context | Yes - turn 1 is history |
| cancel_turn in push_to_talk | Discarded (NOT committed) | No - turn 1 should be erased |
When clear_user_turn() is called, the turn is discarded. It is NOT committed to chat context. The user's intent is to cancel that turn. It should be as if the turn never happened.
Why the Fix is Correct
If the design were intentional (accumulation across ALL turns, including canceled ones), then:
- clear_user_turn() would NOT reset _speech_start_time
- The method would not reset _audio_transcript
- The method would not reset VAD state
But it DOES reset all of these. The intent is clear: discard all state for this turn. The _turn_tracker omission is an oversight.
Key Distinction: Committed vs. Discarded Turns
Committed Turn (Intentional Accumulation)
In test_accumulation_across_interrupted_turns():
- User speaks turn 1 ("one two three")
- Turn 1 is committed to chat context (via on_user_turn_completed)
- User interrupts before agent speaks
- Turn 2 starts ("four five six")
- Accumulation is intentional - turn 1 was committed, user interrupted
The accumulated_transcript correctly equals "one two three four five six" because turn 1 is part of the conversation history.
Discarded Turn (Bug Scenario)
In push-to-talk cancel_turn:
- User presses button, speaks turn 1 ("one two three")
- User releases button without committing → calls cancel_turn → calls clear_user_turn()
- Turn 1 is discarded (NOT committed to chat context)
- User presses button again, speaks turn 2 ("four five six")
- Turn 2's _check_user_turn_limit() uses accumulated values from discarded turn 1
- Bug: accumulated_word_count = 6 (should be 3)
The key difference: Discarded turns should NOT contribute to limits.
Evidence
Evidence 1: clear_user_turn() resets all per-turn state EXCEPT _turn_tracker
# audio_recognition.py:688-708
def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None  # Reset ✓
    self._last_speaking_time = None  # Reset ✓
    self._vad_speech_started = False  # Reset ✓
    self._user_turn_committed = False  # Reset ✓
    # ... user_turn_span handling ...
    # _turn_tracker NOT reset  # BUG ✗
The method resets all per-turn buffers and timing state. _turn_tracker serves the same purpose (per-turn limit tracking) but is not reset.
Evidence 2: _check_user_turn_limit() incorrectly inherits from discarded turn
# audio_recognition.py:1240-1245
if self._turn_tracker.started_at is None:
    self._turn_tracker.started_at = self._speech_start_time or now
words = self._word_tokenizer.tokenize(transcript)
self._turn_tracker.words += len(words) # Accumulates from discarded turn!
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()
After clear_user_turn():
- _turn_tracker.started_at is NOT None (has timestamp from discarded turn)
- _turn_tracker.words has count from discarded turn
- Duration calculation: now - _turn_tracker.started_at includes time from discarded turn
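The stale started_at effect can be checked with plain arithmetic. This is a sketch under stated assumptions: DurationTrackerSketch and turn_duration are hypothetical names, and the timestamps are arbitrary illustrative values rather than output from a real session. Only the started_at assignment mirrors the snippet above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DurationTrackerSketch:
    """Hypothetical stand-in for the started_at field of _UserTurnTracker."""

    started_at: Optional[float] = None


def turn_duration(
    tracker: DurationTrackerSketch, speech_start_time: Optional[float], now: float
) -> float:
    """Mirror the started_at logic: set it only once, then measure from it."""
    if tracker.started_at is None:
        tracker.started_at = speech_start_time or now
    return now - tracker.started_at


# Turn 1 starts at t=100.0 and is checked at t=100.5, then discarded.
stale = DurationTrackerSketch()
turn1_duration = turn_duration(stale, speech_start_time=100.0, now=100.5)

# Turn 2 starts at t=102.0 and is checked at t=102.25.
# The un-reset tracker still measures from turn 1's start.
stale_duration = turn_duration(stale, speech_start_time=102.0, now=102.25)

# With a fresh tracker (the proposed behavior) only turn 2's own time counts.
fresh_duration = turn_duration(DurationTrackerSketch(), speech_start_time=102.0, now=102.25)
```

With a max_duration of 1.0 second, the stale measurement would exceed the limit on turn 2's first transcript even though turn 2 itself has lasted only a quarter of a second.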
Evidence 3: End-to-End Integration Test (STT Event Pipeline)
CRITICAL: Test test_clear_user_turn_integration_word_limit in tests/test_clear_user_turn_integration.py demonstrates the bug through the actual STT event pipeline using FakeSession, FakeSTT, and FakeVAD.
Test Output (Actual Evidence)
Bug demonstration:
exceeded_events count: 1
accumulated_word_count: 6
accumulated_transcript: 'one two three four five six'
What the Test Does
- Creates a session with user_turn_limit: {max_words: 5}
- Turn 1: FakeSTT sends FINAL_TRANSCRIPT "one two three" (3 words)
- Calls session.clear_user_turn() - simulating push-to-talk cancel_turn
- Turn 2: FakeSTT sends FINAL_TRANSCRIPT "four five six" (3 words)
- BUG: on_user_turn_exceeded fires with accumulated_word_count = 6
Code Path (Real Event Flow)
FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch →
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()
This is the exact same code path used in production. The STT pipeline IS reset by update_stt(None) + update_stt(stt) at lines 706-708, but the _turn_tracker persists.
Why STT Pipeline Reset Doesn't Help
- update_stt(None) closes the old _STTPipeline and cancels _stt_consumer_atask
- update_stt(stt) creates a NEW _STTPipeline with fresh event_ch
- BUT: Neither operation touches self._turn_tracker
- The new STT events accumulate into the OLD _turn_tracker state
Evidence 4: Direct Unit Tests
Tests in tests/test_clear_user_turn_tracker_bug.py isolate the bug to _turn_tracker state:
Test 1: Direct state inspection (test_tracker_not_reset_on_clear_user_turn)
recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn()
assert recognition._speech_start_time is None # Correctly reset
assert recognition._turn_tracker.words == 3 # BUG: Not reset
assert recognition._turn_tracker.transcript == "one two three" # BUG: Not reset
Test 2: False trigger due to word accumulation (test_canceled_turn_causes_false_word_trigger)
# Turn 1: 3 words
recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn() # Discard turn 1
# Turn 2: 3 more words
recognition._check_user_turn_limit("four five six")
# BUG: Event fires with accumulated_word_count = 6 (limit was 5)
assert len(hooks.exceeded_events) == 1
assert hooks.exceeded_events[0].accumulated_word_count == 6 # Should be 3
Test 3: Duration uses wrong timestamp (test_canceled_turn_causes_false_duration_trigger)
# max_duration = 1.0 second
# Turn 1: started 2 seconds ago
recognition._check_user_turn_limit("hello")
recognition.clear_user_turn()
# Turn 2: starts now
recognition._check_user_turn_limit("world")
# BUG: duration = now - turn1_start ≈ 2 seconds
# This exceeds max_duration immediately!
assert hooks.exceeded_events[1].duration > 1.5
Test 4: Fix verification (test_fix_verification)
# With manual reset of _turn_tracker after clear_user_turn:
recognition._turn_tracker = _UserTurnTracker()
recognition._check_user_turn_limit("four five six")
# No event - correct behavior
assert len(hooks.exceeded_events) == 0
All tests pass, confirming the bug exists and the fix resolves it.
Evidence 5: Push-to-talk pattern (Real World Impact)
# examples/voice_agents/push_to_talk.py:85-89
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)
    session.clear_user_turn()  # Called when user aborts turn
Users combining push-to-talk with user_turn_limit will encounter this bug. Example:
- User presses button, says "one two three"
- User releases button (accidentally), triggering cancel_turn
- User presses button again, says "four five six"
- If max_words=5, the second turn triggers on_user_turn_exceeded incorrectly
Evidence 6: Inconsistent reset behavior in update_stt()
# audio_recognition.py:607-610
def update_stt(self, stt: io.STTNode | None, ...) -> None:
    # ...
    # reset interruption handling related state
    self._transcript_buffer.clear()
    self._ignore_user_transcript_until = NOT_GIVEN
    self._input_started_at = None
    # Note: _turn_tracker NOT reset here either
The update_stt() method resets several state variables but not _turn_tracker. This shows the oversight is in both clear_user_turn() and update_stt().
Evidence 7: Complete analysis of _turn_tracker reset locations
The _turn_tracker is reset in exactly ONE place:
# audio_recognition.py:297-298
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()
This reset occurs when the agent speaks. There is NO reset when:
- clear_user_turn() is called (user discards turn)
- update_stt() is called (STT pipeline reset)
- Agent handoff scenarios
- Session close/cleanup
This confirms the reset logic is incomplete - it only covers the "agent speaks" case, not the "user cancels turn" case.
Why the Existing Tests Don't Catch This
The tests in test_user_turn_exceeded.py only cover:
- test_reset_on_agent_speaking - Agent speaks between turns (resets via on_start_of_agent_speech)
- test_accumulation_across_interrupted_turns - Turn 1 is COMMITTED before accumulation
Neither test covers the clear_user_turn() code path where a turn is discarded.
Recommended Fix
Add _turn_tracker reset to clear_user_turn():
def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None
    self._last_speaking_time = None
    self._vad_speech_started = False
    self._user_turn_committed = False
    # FIX: Reset _turn_tracker to prevent accumulation from discarded turns
    self._turn_tracker = _UserTurnTracker()
    # ... rest of method ...
This is consistent with:
- The method's purpose of clearing all per-turn state
- The reset pattern in on_start_of_agent_speech() (line 298)
- The intent that discarded turns should not contribute to limits
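To illustrate the effect of the one-line change, here is a reduced model that runs the cancel scenario with and without the tracker reset. It is a sketch only: RecognitionSketch, TrackerSketch, run_cancel_scenario, and the reset_tracker_on_clear flag are hypothetical and exist purely for comparison. The real AudioRecognition class has many more fields, and str.split() stands in for its word tokenizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackerSketch:
    """Hypothetical stand-in for _UserTurnTracker (word-related fields only)."""

    words: int = 0
    transcript: str = ""


class RecognitionSketch:
    """Hypothetical, heavily reduced model of AudioRecognition; not the real class."""

    def __init__(self, max_words: int, reset_tracker_on_clear: bool) -> None:
        self.max_words = max_words
        self.reset_tracker_on_clear = reset_tracker_on_clear
        self.tracker = TrackerSketch()
        self.exceeded_word_counts: List[int] = []

    def check_user_turn_limit(self, transcript: str) -> None:
        # str.split() stands in for the real word tokenizer (an assumption).
        self.tracker.words += len(transcript.split())
        self.tracker.transcript = f"{self.tracker.transcript} {transcript}".strip()
        if self.tracker.words > self.max_words:
            self.exceeded_word_counts.append(self.tracker.words)

    def clear_user_turn(self) -> None:
        # Other per-turn resets are omitted; only the proposed line is modeled.
        if self.reset_tracker_on_clear:
            self.tracker = TrackerSketch()


def run_cancel_scenario(reset_tracker_on_clear: bool) -> List[int]:
    """Turn 1 (3 words), cancel, turn 2 (3 words), with max_words=5."""
    rec = RecognitionSketch(max_words=5, reset_tracker_on_clear=reset_tracker_on_clear)
    rec.check_user_turn_limit("one two three")
    rec.clear_user_turn()
    rec.check_user_turn_limit("four five six")
    return rec.exceeded_word_counts


without_fix = run_cancel_scenario(reset_tracker_on_clear=False)
with_fix = run_cancel_scenario(reset_tracker_on_clear=True)
```

In this model the exceeded event fires once with a count of 6 when the tracker survives the cancel, and not at all when it is reset, which matches the behavior the unit tests above assert against the real class.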
History
This bug was introduced in commit c4daef3 (@longcw, 2026-05-18, PR #5492). The commit added the user_turn_limit feature with _UserTurnTracker to track word count, transcript, and duration across user turns. The developer correctly reset _turn_tracker in on_start_of_agent_speech() (lines 297-298) but did not add the same reset to clear_user_turn(), even though that method already resets the other per-turn state variables such as _speech_start_time, _vad_speech_started, and _audio_transcript. As a result, turns discarded via clear_user_turn() incorrectly contribute their accumulated values to subsequent turns.