bug: STT errors close AgentSession immediately, lacking the tolerance counter that LLM and TTS errors have #5665

@AKomplished-bug

Bug Description

AgentSession._on_error applies a max_unrecoverable_errors tolerance counter to llm_error and tts_error, but stt_error has no such counter: it falls through and closes the session on the first non-recoverable STT error.

Root Cause

In livekit-agents/livekit/agents/voice/agent_session.py, the _on_error method:

def _on_error(self, error: llm.LLMError | stt.STTError | tts.TTSError | ...) -> None:
    if self._closing_task or error.recoverable:
        return

    if error.type == "llm_error":
        self._llm_error_counts += 1
        if self._llm_error_counts <= self.conn_options.max_unrecoverable_errors:
            return  # ← tolerance applied
    elif error.type == "tts_error":
        self._tts_error_counts += 1
        if self._tts_error_counts <= self.conn_options.max_unrecoverable_errors:
            return  # ← tolerance applied

    # stt_error: no branch → falls through → session closes immediately ❌

    self._closing_task = asyncio.create_task(
        self._aclose_impl(error=error, reason=CloseReason.ERROR)
    )

_llm_error_counts and _tts_error_counts are initialised and reset in multiple places (__init__, _aclose_impl, _update_agent_state), but there is no equivalent _stt_error_counts.
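
The asymmetry can be reproduced without livekit-agents installed. The following is a minimal sketch, not the library's code: ToySession, FakeError and errors_until_close are hypothetical names, and the limit of 3 is the default quoted in this issue. It only models the counting logic from the excerpt above.

```python
from dataclasses import dataclass

MAX_UNRECOVERABLE_ERRORS = 3  # default value as quoted in this issue


@dataclass
class FakeError:
    """Hypothetical stand-in for LLMError / TTSError / STTError."""
    type: str  # "llm_error", "tts_error" or "stt_error"
    recoverable: bool = False


class ToySession:
    """Hypothetical model of the counting logic only; not AgentSession."""

    def __init__(self) -> None:
        self.llm_error_counts = 0
        self.tts_error_counts = 0
        self.closed = False

    def on_error(self, error: FakeError) -> None:
        if self.closed or error.recoverable:
            return
        if error.type == "llm_error":
            self.llm_error_counts += 1
            if self.llm_error_counts <= MAX_UNRECOVERABLE_ERRORS:
                return  # tolerated
        elif error.type == "tts_error":
            self.tts_error_counts += 1
            if self.tts_error_counts <= MAX_UNRECOVERABLE_ERRORS:
                return  # tolerated
        # No stt_error branch: the first such error reaches this line.
        self.closed = True


def errors_until_close(error_type: str) -> int:
    """Count how many non-recoverable errors of one type close the session."""
    session = ToySession()
    count = 0
    while not session.closed:
        session.on_error(FakeError(error_type))
        count += 1
    return count


print(errors_until_close("llm_error"), errors_until_close("stt_error"))
```

In this model an LLM or TTS error closes the session only once the counter exceeds the limit, while a single STT error is enough.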

Real-World Impact

When an STT provider returns transient errors (e.g. HTTP 403/429 from Groq after exhausting per-utterance retries in the RecognizeStream._main_task loop), the session closes immediately on the very first non-recoverable STT error. There is no opportunity for application-level error handlers (registered via session.on("error", ...)) to perform graceful recovery such as transferring to a human agent.

The per-utterance retry loop in RecognizeStream._main_task already handles transient network blips. The session-level tolerance is meant to handle the case where retries are truly exhausted — but currently STT gets zero tolerance at the session level while LLM and TTS each get max_unrecoverable_errors (default 3).

Expected Behaviour

stt_error should receive the same tolerance as llm_error and tts_error: allow up to max_unrecoverable_errors non-recoverable STT errors before closing the session, giving application error handlers a chance to act (e.g. fall back to another STT, transfer to human, speak an acknowledgement).

Proposed Fix

Add an _stt_error_counts counter mirroring the existing LLM/TTS pattern:

# __init__ and reset sites:
self._stt_error_counts = 0

# _on_error:
elif error.type == "stt_error":
    self._stt_error_counts += 1
    if self._stt_error_counts <= self.conn_options.max_unrecoverable_errors:
        return

And reset _stt_error_counts alongside the other counters in _aclose_impl and _update_agent_state (when state transitions to "speaking").
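
To check the intended behaviour in isolation, here is a small sketch of the proposed counter together with the reset on "speaking". It is an illustration under assumptions rather than a patch: ToySession and its methods are hypothetical names, and the real change would live in AgentSession._on_error, _aclose_impl and _update_agent_state.

```python
from dataclasses import dataclass


@dataclass
class ToySession:
    """Hypothetical model of the proposed STT tolerance; not AgentSession."""
    max_unrecoverable_errors: int = 3  # default value as quoted in this issue
    stt_error_counts: int = 0
    closed: bool = False

    def on_stt_error(self) -> None:
        """Handle one non-recoverable STT error."""
        if self.closed:
            return
        self.stt_error_counts += 1
        if self.stt_error_counts <= self.max_unrecoverable_errors:
            return  # session stays open, so application handlers can react
        self.closed = True

    def update_agent_state(self, state: str) -> None:
        """Reset the counter once the agent is speaking again."""
        if state == "speaking":
            self.stt_error_counts = 0


session = ToySession()
for _ in range(3):
    session.on_stt_error()
print(session.closed)  # still open after three errors

session.update_agent_state("speaking")  # counter resets
for _ in range(4):
    session.on_stt_error()
print(session.closed)  # closed on the fourth error after the reset
```

The reset matters because the limit is then applied to errors since the agent last spoke, not to the total over the lifetime of the session.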

Environment

  • livekit-agents (latest main)
  • STT plugin: livekit-plugins-groq (wraps livekit-plugins-openai STT)
  • Reproduced with any STT provider that returns HTTP 4xx errors after retries are exhausted
