AssemblyAI STT: transcription text jumps/regresses mid-segment #4779

@mmsbrggr

Bug Description

When using AssemblyAISTT, transcription text on the client occasionally "jumps": the last speech bubble's text shrinks to just a few words, then reappears in full on the next update. STT and voice interaction work correctly; the issue is purely visual.

Example sequence on a single segment ID (all final: false):

  1. "I need you to look beyond the eyes and be honest with me" — correct, growing
  2. "And be honest with me." — text regresses to just the tail end
  3. "I need you to look beyond the eyes and be honest with me, always" — full text restored

Expected Behavior

Transcription text for a segment should only grow or be replaced by the final transcript. It should never shrink mid-utterance.

Reproduction Steps

  1. Set up a voice agent with AssemblyAISTT (streaming mode)
  2. Connect a frontend client that renders TranscriptionReceived segments by ID
  3. Speak continuously for 10+ seconds
  4. Observe the last speech bubble's text occasionally shrinking then restoring

Operating System

Linux (also observed on iOS/Safari client)

Models Used

AssemblyAI universal-streaming-multilingual

Package Versions

  • livekit-agents[assemblyai] == 1.4.1
  • livekit-client (JS) latest

Session/Room/Call IDs

No response

Proposed Solution

AssemblyAI Turn messages contain both a words array (cumulative) and an utterance field (chunk-based). It appears the plugin emits INTERIM_TRANSCRIPT from cumulative words and PREFLIGHT_TRANSCRIPT from the chunk-based utterance. AudioRecognition routes both through on_interim_transcript, and since the transcription uses replacement mode (is_delta_stream=False), the PREFLIGHT's chunk text overwrites the INTERIM's cumulative text on the same segment ID.
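To make the suspected mechanism concrete, here is a toy model of replacement-mode rendering. It does not use the plugin or framework code; `Update` and `render` are hypothetical names, and the event labels and texts are taken from the example sequence above. It only shows that if a chunk-based update lands on the same segment ID as a cumulative one, the displayed text regresses.

```python
from dataclasses import dataclass


@dataclass
class Update:
    kind: str        # "INTERIM" or "PREFLIGHT" (labels borrowed from the issue text)
    segment_id: str
    text: str


def render(updates):
    """Replacement-mode rendering: the newest text for a segment ID
    overwrites whatever was shown before. Returns what the bubble
    displays after each update."""
    bubbles = {}
    history = []
    for u in updates:
        bubbles[u.segment_id] = u.text
        history.append(bubbles[u.segment_id])
    return history


updates = [
    # cumulative text (assumed to come from the words array)
    Update("INTERIM", "seg-1", "I need you to look beyond the eyes and be honest with me"),
    # chunk-based text (assumed to come from the utterance field)
    Update("PREFLIGHT", "seg-1", "And be honest with me."),
    # cumulative text again
    Update("INTERIM", "seg-1", "I need you to look beyond the eyes and be honest with me, always"),
]

history = render(updates)
lengths = [len(t) for t in history]  # the middle entry is shorter than both neighbours
```

In this model the second update replaces the full sentence with only its tail, which matches the visual regression described in the bug report.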

A length guard in the framework (skip updates where text shrinks for the same segment) would fix this generically. We're currently applying this workaround on the client side.
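A minimal sketch of such a length guard, written as standalone Python rather than against the framework's actual internals. `ShrinkGuard` and `accept` are hypothetical names; the only assumption is that each update carries a segment ID, a text, and a final flag.

```python
class ShrinkGuard:
    """Drops non-final updates whose text is shorter than the last
    accepted text for the same segment ID."""

    def __init__(self):
        self._last = {}  # segment_id -> last accepted non-final text

    def accept(self, segment_id: str, text: str, final: bool) -> bool:
        if final:
            # The final transcript always wins; forget the segment afterwards.
            self._last.pop(segment_id, None)
            return True
        prev = self._last.get(segment_id, "")
        if len(text) < len(prev):
            return False  # text would shrink mid-utterance: skip this update
        self._last[segment_id] = text
        return True


guard = ShrinkGuard()
full = "I need you to look beyond the eyes and be honest with me"
decisions = [
    guard.accept("seg-1", full, final=False),
    guard.accept("seg-1", "And be honest with me.", final=False),
    guard.accept("seg-1", full + ", always", final=False),
]
```

One trade-off to keep in mind: a pure length comparison also suppresses legitimate interim revisions that happen to be shorter, for example when the recognizer corrects a long word to a short one. Those would only be displayed once a longer update or the final transcript arrives.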

Additional Context

This is independent of format_turns: the utterance field is always chunk-based regardless of that setting.

Screenshots and Recordings

No response

Labels: bug
