[Detail Bug] Voice: Canceling a push-to-talk turn causes user turn limits to carry over to the next turn #5851

# Summary
- **Context**: `AudioRecognition.clear_user_turn()` is used to discard a user's in-progress turn (e.g., push-to-talk `cancel_turn`). When a turn is discarded, all per-turn state should be reset.
- **Bug**: `clear_user_turn()` resets `_speech_start_time` but fails to reset `_turn_tracker`, causing word count, transcript, and duration to persist across discarded turns.
- **Actual vs. Expected**: After `clear_user_turn()`, the next turn should start from an empty tracker. Instead, subsequent turns inherit word counts and timestamps from the discarded turn, causing `user_turn_limit` to trigger prematurely.
- **Impact**: In push-to-talk scenarios with `user_turn_limit` configured, valid user speech is incorrectly interrupted based on accumulated data from canceled turns.

---

# Response to Reviewer Objections

## Objection 1: "Tests bypass the actual event pipeline"

**Clarification**: The tests call `_check_user_turn_limit(transcript)` directly, which is the SAME code path invoked by the STT event handler at line 903:

```python
# audio_recognition.py:858-903
if ev.type == stt.SpeechEventType.FINAL_TRANSCRIPT:
    transcript = ev.alternatives[0].text  # Fresh from STT
    # ...
    self._check_user_turn_limit(transcript)  # Called with fresh transcript
```

The `transcript` argument comes fresh from the STT provider. The bug is NOT in the transcript content - it's in the `_turn_tracker` state that `_check_user_turn_limit` uses:

```python
# audio_recognition.py:1244-1245
self._turn_tracker.words += len(words)  # Accumulates into OLD state
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()
```

**Key insight**: Even if STT sends fresh transcripts, if `_turn_tracker.words` already has 3 from the canceled turn, the new 3-word transcript becomes 6 total.
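
The arithmetic can be reproduced with a minimal stand-alone sketch. `TrackerSketch` is a hypothetical stand-in for `_UserTurnTracker`, and `str.split()` stands in for the library's word tokenizer. Neither is the actual implementation; only the two accumulation lines are taken from the excerpt above.

```python
from dataclasses import dataclass


@dataclass
class TrackerSketch:
    """Hypothetical stand-in for _UserTurnTracker (field names from this report)."""

    words: int = 0
    transcript: str = ""


def accumulate(tracker: TrackerSketch, transcript: str) -> int:
    """Apply the two accumulation lines quoted above and return the running count."""
    words = transcript.split()  # assumption: whitespace split replaces the real tokenizer
    tracker.words += len(words)
    tracker.transcript = f"{tracker.transcript} {transcript}".strip()
    return tracker.words


tracker = TrackerSketch()
first = accumulate(tracker, "one two three")   # turn 1, later canceled
# clear_user_turn() would run here, but it leaves the tracker untouched
second = accumulate(tracker, "four five six")  # fresh transcript, stale tracker

print(first, second, repr(tracker.transcript))
# 3 6 'one two three four five six'
```

The second transcript is fresh, yet the running count it produces is 6 because the tracker was never replaced.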

## Objection 2: "STT pipeline is reset in `clear_user_turn()`"

**Acknowledgment**: Yes, `clear_user_turn()` calls `update_stt(None)` then `update_stt(stt)` at lines 706-708, which creates a fresh STT pipeline.

**But this doesn't reset `_turn_tracker`**:

```python
# audio_recognition.py:593-622 - update_stt() implementation
def update_stt(self, stt: io.STTNode | None, *, pipeline: _STTPipeline | None = None) -> None:
    self._stt = stt
    # Creates new pipeline, resets transcript_buffer, etc.
    # BUT: Does NOT touch self._turn_tracker
```

The `update_stt()` method resets `_transcript_buffer`, `_ignore_user_transcript_until`, and `_input_started_at`, but **not** `_turn_tracker`. This is the bug - inconsistent reset behavior.

## Objection 3: "The feature is brand new, accumulation may be intentional"

**Design intent analysis**:

1. **Line 220 comment**: "accumulates across turns until agent speaks"
2. **Lines 297-298**: Reset happens in `on_start_of_agent_speech()`:

```python
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()
```

3. **The purpose of `clear_user_turn()` is clear from its name and from the state it resets**:

```python
# audio_recognition.py:688
def clear_user_turn(self) -> None:
    ...  # body elided in this excerpt
```

The method is called to **discard** a turn (see push_to_talk.py line 88). A discarded turn should not contribute to limits because:
- It was never committed to chat context
- The user explicitly aborted it
- It should be as if the turn never happened

**Contrast with committed turns**:
- In `test_accumulation_across_interrupted_turns`, turn 1 IS committed (via `on_user_turn_completed`)
- The accumulation is intentional because turn 1 is part of conversation history
- In `cancel_turn`, turn 1 is DISCARDED, NOT committed

## Objection 4: "No demonstration with real STT events"

**CONCLUSIVE EVIDENCE**: The integration test `test_clear_user_turn_integration_word_limit` demonstrates the bug through the **actual STT event pipeline** using `FakeSession`, `FakeSTT`, and `FakeVAD`.

### Test Output (Bug Confirmed)
```
Bug demonstration:
  exceeded_events count: 1
  accumulated_word_count: 6
  accumulated_transcript: 'one two three four five six'
```

Turn 1: 3 words → `clear_user_turn()` → Turn 2: 3 words → **6 accumulated** (limit was 5).

### Real Event Flow
```
FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch → 
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()
```

This is the **exact same code path** used in production. The STT pipeline IS reset by `update_stt(None)` + `update_stt(stt)` at lines 706-708, creating a fresh `_STTPipeline` with new `event_ch`. But `_turn_tracker` persists.

**Why pipeline reset doesn't help**: The new STT events accumulate into the OLD `_turn_tracker` state (line 1244: `self._turn_tracker.words += len(words)`). The pipeline is fresh, but the tracker is stale.

---

# Design Intent Clarification

The reviewer asks whether the accumulation is intentional design. The evidence shows it is **NOT**:

## The Comment "accumulates across turns until agent speaks" (Line 220)

This comment describes behavior for **committed** turns:
1. User speaks turn 1
2. Turn 1 is committed to chat context (via `on_user_turn_completed`)
3. Agent hasn't spoken yet (slow LLM, or user interrupts)
4. User speaks turn 2
5. Accumulation is correct - turn 1 IS part of conversation history

## The `clear_user_turn()` Method Has Different Purpose

```python
# audio_recognition.py:688
def clear_user_turn(self) -> None:
    ...  # body elided in this excerpt
```

This method is called to **discard** a turn (see push_to_talk.py line 88). The method's purpose is clear from its name and from what it resets:
- `_audio_transcript = ""` - Clear transcript
- `_speech_start_time = None` - Clear timing
- `_vad_speech_started = False` - Clear VAD state
- `update_stt(None); update_stt(stt)` - Reset STT pipeline

**All per-turn state is cleared except `_turn_tracker`.**

## Committed vs Discarded: The Critical Distinction

| Scenario | Turn 1 Status | Accumulation Correct? |
|----------|---------------|----------------------|
| `test_accumulation_across_interrupted_turns` | Committed to chat context | Yes - turn 1 is history |
| `cancel_turn` in push_to_talk | Discarded (NOT committed) | No - turn 1 should be erased |

When `clear_user_turn()` is called, the turn is discarded. It is NOT committed to chat context. The user's intent is to cancel that turn. It should be as if the turn never happened.

## Why the Fix is Correct

If accumulation across ALL turns, including canceled ones, were intentional, then:
1. `clear_user_turn()` would NOT reset `_speech_start_time`
2. The method would not reset `_audio_transcript`
3. The method would not reset VAD state

But it DOES reset all of these. The intent is clear: **discard all state for this turn**. The `_turn_tracker` omission is an oversight.

---

# Key Distinction: Committed vs. Discarded Turns

## Committed Turn (Intentional Accumulation)

In `test_accumulation_across_interrupted_turns()`:
1. User speaks turn 1 ("one two three")
2. Turn 1 is **committed** to chat context (via `on_user_turn_completed`)
3. User interrupts before agent speaks
4. Turn 2 starts ("four five six")
5. Accumulation is **intentional** - turn 1 was committed, user interrupted

The `accumulated_transcript` correctly equals "one two three four five six" because turn 1 is part of the conversation history.

## Discarded Turn (Bug Scenario)

In push-to-talk `cancel_turn`:
1. User presses button, speaks turn 1 ("one two three")
2. User releases button without committing → calls `cancel_turn` → calls `clear_user_turn()`
3. Turn 1 is **discarded** (NOT committed to chat context)
4. User presses button again, speaks turn 2 ("four five six")
5. Turn 2's `_check_user_turn_limit()` uses accumulated values from **discarded** turn 1
6. Bug: `accumulated_word_count` = 6 (should be 3)

The key difference: **Discarded turns should NOT contribute to limits.**

---

# Evidence

## Evidence 1: `clear_user_turn()` resets all per-turn state EXCEPT `_turn_tracker`

```python
# audio_recognition.py:688-708
def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None        # Reset ✓
    self._last_speaking_time = None        # Reset ✓
    self._vad_speech_started = False       # Reset ✓
    self._user_turn_committed = False      # Reset ✓
    # ... user_turn_span handling ...
    # _turn_tracker NOT reset              # BUG ✗
```

The method resets all per-turn buffers and timing state. `_turn_tracker` serves the same purpose (per-turn limit tracking) but is not reset.

## Evidence 2: `_check_user_turn_limit()` incorrectly inherits from discarded turn

```python
# audio_recognition.py:1240-1245
if self._turn_tracker.started_at is None:
    self._turn_tracker.started_at = self._speech_start_time or now

words = self._word_tokenizer.tokenize(transcript)
self._turn_tracker.words += len(words)  # Accumulates from discarded turn!
self._turn_tracker.transcript = f"{self._turn_tracker.transcript} {transcript}".strip()
```

After `clear_user_turn()`:
- `_turn_tracker.started_at` is NOT None (has timestamp from discarded turn)
- `_turn_tracker.words` has count from discarded turn
- Duration calculation: `now - _turn_tracker.started_at` includes time from discarded turn
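
The duration effect can be illustrated with injected timestamps. This is a sketch under stated assumptions: `DurationTrackerSketch` and `turn_duration` are hypothetical names, the timestamps are made-up values, and only the `started_at` guard is taken from the excerpt above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DurationTrackerSketch:
    """Hypothetical stand-in for the started_at field of _UserTurnTracker."""

    started_at: float | None = None


def turn_duration(
    tracker: DurationTrackerSketch, speech_start_time: float | None, now: float
) -> float:
    """Apply the started_at guard quoted above, then return now - started_at."""
    if tracker.started_at is None:
        tracker.started_at = speech_start_time or now
    return now - tracker.started_at


# Timestamps are made-up seconds on an arbitrary clock.
stale = DurationTrackerSketch()
turn_duration(stale, speech_start_time=100.0, now=100.5)  # turn 1, later canceled
# clear_user_turn() clears _speech_start_time but keeps the tracker, so the
# guard is skipped when turn 2 (speech starting at 102.0) is checked.
stale_duration = turn_duration(stale, speech_start_time=102.0, now=102.5)

fresh = DurationTrackerSketch()  # what a reset tracker would see for turn 2
fresh_duration = turn_duration(fresh, speech_start_time=102.0, now=102.5)

print(stale_duration, fresh_duration)
# 2.5 0.5
```

With a stale tracker, turn 2 is measured from the start of turn 1, so the reported duration includes the canceled turn and the gap between the two turns.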

## Evidence 3: End-to-End Integration Test (STT Event Pipeline)

**CRITICAL**: Test `test_clear_user_turn_integration_word_limit` in `tests/test_clear_user_turn_integration.py` demonstrates the bug through the **actual STT event pipeline** using `FakeSession`, `FakeSTT`, and `FakeVAD`.

### Test Output (Actual Evidence)
```
Bug demonstration:
  exceeded_events count: 1
  accumulated_word_count: 6
  accumulated_transcript: 'one two three four five six'
```

### What the Test Does
1. Creates a session with `user_turn_limit: {max_words: 5}`
2. Turn 1: FakeSTT sends FINAL_TRANSCRIPT "one two three" (3 words)
3. **Calls `session.clear_user_turn()`** - simulating push-to-talk cancel_turn
4. Turn 2: FakeSTT sends FINAL_TRANSCRIPT "four five six" (3 words)
5. BUG: `on_user_turn_exceeded` fires with accumulated_word_count = 6

### Code Path (Real Event Flow)
```
FakeSTT.stream() → FakeRecognizeStream → _STTPipeline.event_ch → 
_stt_consumer() → _on_stt_event() → _check_user_turn_limit()
```

This is the **exact same code path** used in production. The STT pipeline IS reset by `update_stt(None)` + `update_stt(stt)` at lines 706-708, but the `_turn_tracker` persists.

### Why STT Pipeline Reset Doesn't Help
- `update_stt(None)` closes the old `_STTPipeline` and cancels `_stt_consumer_atask`
- `update_stt(stt)` creates a NEW `_STTPipeline` with fresh `event_ch`
- BUT: Neither operation touches `self._turn_tracker`
- The new STT events accumulate into the OLD `_turn_tracker` state
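
The ownership problem can be sketched as follows. All class and method names here are hypothetical simplifications, not the library's API: the pipeline object is replaced on every `update_stt` call, while the tracker lives on the owning object and is never touched.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PipelineSketch:
    """Hypothetical stand-in for _STTPipeline: owns only the event channel."""

    event_ch: list[str] = field(default_factory=list)


@dataclass
class TrackerSketch:
    """Hypothetical stand-in for _UserTurnTracker."""

    words: int = 0


class OwnerSketch:
    """Hypothetical stand-in for AudioRecognition, reduced to two attributes."""

    def __init__(self) -> None:
        self.pipeline: PipelineSketch | None = PipelineSketch()
        self.tracker = TrackerSketch()

    def update_stt(self, enabled: bool) -> None:
        # Replaces the pipeline only; the tracker is deliberately left alone,
        # mirroring the behavior described in the bullets above.
        self.pipeline = PipelineSketch() if enabled else None

    def on_final_transcript(self, text: str) -> None:
        assert self.pipeline is not None
        self.pipeline.event_ch.append(text)
        self.tracker.words += len(text.split())


owner = OwnerSketch()
owner.on_final_transcript("one two three")
old_pipeline, old_tracker = owner.pipeline, owner.tracker

owner.update_stt(False)  # analogous to update_stt(None)
owner.update_stt(True)   # analogous to update_stt(stt)
owner.on_final_transcript("four five six")

pipeline_replaced = owner.pipeline is not old_pipeline
tracker_survived = owner.tracker is old_tracker
print(pipeline_replaced, tracker_survived, owner.tracker.words)
# True True 6
```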

## Evidence 4: Direct Unit Tests

Tests in `tests/test_clear_user_turn_tracker_bug.py` isolate the bug to `_turn_tracker` state:

### Test 1: Direct state inspection (`test_tracker_not_reset_on_clear_user_turn`)
```python
recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn()

assert recognition._speech_start_time is None  # Correctly reset
assert recognition._turn_tracker.words == 3    # BUG: Not reset
assert recognition._turn_tracker.transcript == "one two three"  # BUG: Not reset
```

### Test 2: False trigger due to word accumulation (`test_canceled_turn_causes_false_word_trigger`)
```python
# Turn 1: 3 words
recognition._check_user_turn_limit("one two three")
recognition.clear_user_turn()  # Discard turn 1

# Turn 2: 3 more words
recognition._check_user_turn_limit("four five six")

# BUG: Event fires with accumulated_word_count = 6 (limit was 5)
assert len(hooks.exceeded_events) == 1
assert hooks.exceeded_events[0].accumulated_word_count == 6  # Should be 3
```

### Test 3: Duration uses wrong timestamp (`test_canceled_turn_causes_false_duration_trigger`)
```python
# max_duration = 1.0 second

# Turn 1: started 2 seconds ago
recognition._check_user_turn_limit("hello")
recognition.clear_user_turn()

# Turn 2: starts now
recognition._check_user_turn_limit("world")

# BUG: duration = now - turn1_start ≈ 2 seconds
# This exceeds max_duration immediately!
assert hooks.exceeded_events[1].duration > 1.5
```

### Test 4: Fix verification (`test_fix_verification`)
```python
# With manual reset of _turn_tracker after clear_user_turn:
recognition._turn_tracker = _UserTurnTracker()
recognition._check_user_turn_limit("four five six")

# No event - correct behavior
assert len(hooks.exceeded_events) == 0
```

All tests pass, confirming the bug exists and the fix resolves it.

## Evidence 5: Push-to-talk pattern (Real World Impact)

```python
# examples/voice_agents/push_to_talk.py:85-89
@ctx.room.local_participant.register_rpc_method("cancel_turn")
async def cancel_turn(data: rtc.RpcInvocationData):
    session.input.set_audio_enabled(False)
    session.clear_user_turn()  # Called when user aborts turn
```

Users combining push-to-talk with `user_turn_limit` will encounter this bug. Example:
1. User presses button, says "one two three"
2. User releases button (accidentally), triggering `cancel_turn`
3. User presses button again, says "four five six"
4. If `max_words=5`, the second turn triggers `on_user_turn_exceeded` incorrectly

## Evidence 6: Inconsistent reset behavior in `update_stt()`

```python
# audio_recognition.py:607-610
def update_stt(self, stt: io.STTNode | None, *, pipeline: _STTPipeline | None = None) -> None:
    # ...
    # reset interruption handling related state
    self._transcript_buffer.clear()
    self._ignore_user_transcript_until = NOT_GIVEN
    self._input_started_at = None
    # Note: _turn_tracker NOT reset here either
```

The `update_stt()` method resets several state variables but not `_turn_tracker`. This shows the oversight is in both `clear_user_turn()` and `update_stt()`.

## Evidence 7: Complete analysis of `_turn_tracker` reset locations

The `_turn_tracker` is reset in exactly ONE place:

```python
# audio_recognition.py:297-298
def on_start_of_agent_speech(self, started_at: float) -> None:
    # ...
    # reset user turn tracker when agent starts speaking
    self._turn_tracker = _UserTurnTracker()
```

This reset occurs when the **agent** speaks. There is NO reset when:
- `clear_user_turn()` is called (user discards turn)
- `update_stt()` is called (STT pipeline reset)
- Agent handoff scenarios
- Session close/cleanup

This confirms the reset logic is incomplete - it only covers the "agent speaks" case, not the "user cancels turn" case.

---

# Why the Existing Tests Don't Catch This

The tests in `test_user_turn_exceeded.py` only cover:

1. `test_reset_on_agent_speaking` - Agent speaks between turns (resets via `on_start_of_agent_speech`)
2. `test_accumulation_across_interrupted_turns` - Turn 1 is COMMITTED before accumulation

Neither test covers the `clear_user_turn()` code path where a turn is discarded.

---

# Recommended Fix

Add `_turn_tracker` reset to `clear_user_turn()`:

```python
def clear_user_turn(self) -> None:
    self._audio_transcript = ""
    self._audio_interim_transcript = ""
    self._audio_preflight_transcript = ""
    self._final_transcript_confidence = []
    self._last_final_transcript_time = None
    self._speech_start_time = None
    self._last_speaking_time = None
    self._vad_speech_started = False
    self._user_turn_committed = False
    # FIX: Reset _turn_tracker to prevent accumulation from discarded turns
    self._turn_tracker = _UserTurnTracker()
    # ... rest of method ...
```

This is consistent with:
1. The method's purpose of clearing all per-turn state
2. The reset pattern in `on_start_of_agent_speech()` (line 298)
3. The intent that discarded turns should not contribute to limits
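
The effect of the fix on the cancel scenario can be modeled in isolation. This is a rough sketch, not the library code: `RecognitionSketch`, its `reset_tracker_on_clear` flag, and the whitespace tokenizer are all assumptions made for illustration, and the limit is assumed to fire when the accumulated count exceeds `max_words`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordTrackerSketch:
    """Hypothetical stand-in for _UserTurnTracker."""

    words: int = 0


@dataclass
class RecognitionSketch:
    """Hypothetical model of the limit check; not the real AudioRecognition."""

    max_words: int
    reset_tracker_on_clear: bool  # True models the recommended fix
    tracker: WordTrackerSketch = field(default_factory=WordTrackerSketch)
    exceeded: list[int] = field(default_factory=list)

    def check_user_turn_limit(self, transcript: str) -> None:
        self.tracker.words += len(transcript.split())
        if self.tracker.words > self.max_words:
            self.exceeded.append(self.tracker.words)

    def clear_user_turn(self) -> None:
        if self.reset_tracker_on_clear:
            self.tracker = WordTrackerSketch()


def run_cancel_scenario(reset_tracker_on_clear: bool) -> list[int]:
    """Turn 1 (3 words), cancel, turn 2 (3 words), with max_words=5."""
    rec = RecognitionSketch(max_words=5, reset_tracker_on_clear=reset_tracker_on_clear)
    rec.check_user_turn_limit("one two three")
    rec.clear_user_turn()
    rec.check_user_turn_limit("four five six")
    return rec.exceeded


print(run_cancel_scenario(reset_tracker_on_clear=False))  # [6]
print(run_cancel_scenario(reset_tracker_on_clear=True))   # []
```

Without the reset, the second turn reports 6 accumulated words and trips the limit; with it, neither turn exceeds 5 and no event is recorded.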

# History

This bug was introduced in commit c4daef3 (@longcw, 2026-05-18, PR #5492), which added the `user_turn_limit` feature with `_UserTurnTracker` to track word count, transcript, and duration across user turns. The commit resets `_turn_tracker` in `on_start_of_agent_speech()` (lines 297-298) but does not add the same reset to `clear_user_turn()`, even though that method already resets the other per-turn state variables such as `_speech_start_time`, `_vad_speech_started`, and `_audio_transcript`. As a result, turns discarded via `clear_user_turn()` contribute their accumulated values to subsequent turns.

