
BufferedTokenStream never yields the completed tokens (sentence or word) till the next token arrives causing delays in TTS Speech generation #3798

@zaheerabbas-prodigal

Bug Description

Issue 1:
Streaming tokenizers in livekit.agents.tokenize.token_stream.BufferedTokenStream never emit the first token unless a second token appears. Ref code

With real LLM output, where gaps between chunks can last several seconds, Cartesia/ElevenLabs TTS stays silent until another chunk arrives.

If we relax the guard to allow a single token, the loop spins forever because sentence/word tokenizers return inclusive end indices (tok[2]), so the buffer slice never shrinks.
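The look-ahead behaviour described above can be illustrated with a small self-contained model. This is only a sketch: ToyBufferedStream and toy_sentence_tokenizer are hypothetical stand-ins written for this illustration, not the library's classes, and the regex splitter is far simpler than BlingFire. It keeps the same kind of guard (stop when at most one token is available), which is enough to show the first sentence being held back until a second one arrives.

```python
import re
from typing import Callable, List


def toy_sentence_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


class ToyBufferedStream:
    """Simplified model of a look-ahead buffered stream (not the library's class)."""

    def __init__(self, tokenize: Callable[[str], List[str]]) -> None:
        self._tokenize = tokenize
        self._buf = ""
        self.emitted: List[str] = []

    def push_text(self, text: str) -> None:
        self._buf += text
        while True:
            tokens = self._tokenize(self._buf)
            # Look-ahead guard: a lone token is never emitted, because the
            # stream cannot tell whether more text will extend it.
            if len(tokens) <= 1:
                break
            first = tokens[0]
            self.emitted.append(first)
            # Drop the emitted token (and the whitespace after it) from the buffer.
            self._buf = self._buf[self._buf.index(first) + len(first):].lstrip()

    def flush(self) -> None:
        """Emit whatever is left, as would happen at the end of a turn."""
        if self._buf.strip():
            self.emitted.append(self._buf.strip())
        self._buf = ""


stream = ToyBufferedStream(toy_sentence_tokenizer)
stream.push_text("Hello there. ")   # a complete sentence, but nothing is emitted yet
print(stream.emitted)
stream.push_text("How are you?")    # the second sentence releases the first
print(stream.emitted)
stream.flush()                      # the last sentence only appears on flush
print(stream.emitted)
```

In this model the delay between the first and second push_text calls is exactly the silence the TTS would experience.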

Issue 2:
blingfire.SentenceTokenizer treats back-to-back sentences with no intervening space as a single token. When the LLM outputs something like "Sentence one.Sentence two" (common when a tool call finishes and a new LLM request is made), BlingFire glues everything together, and the streaming path does not yield a sentence until the entire turn finishes, which further delays speech generation.

As an example, run the tokenization script shared below. Its output is

    Could you please help me with your full name?What could go wrong. <break time="0.25s" />.

There are two complete sentences and one SSML tag here, yet they are sent to TTS as a single sentence. Please refer to this
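One possible workaround on the application side, until the tokenizer itself handles this, is to re-insert the missing space before the text reaches the sentence tokenizer. The sketch below is a heuristic, not part of livekit-agents or BlingFire: unglue_sentences is a hypothetical helper that adds a space when '.', '!' or '?' sits directly between a lowercase letter and an uppercase letter. It will misfire on text such as "example.Com", and in a streaming setting it would have to run on the accumulated buffer, since a boundary can straddle two chunks.

```python
import re

# Terminal punctuation preceded by a lowercase letter and immediately
# followed by an uppercase letter, e.g. "name?What" or "one.Sentence".
_GLUED_BOUNDARY = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")


def unglue_sentences(text: str) -> str:
    """Hypothetical helper: insert a space at glued sentence boundaries."""
    return _GLUED_BOUNDARY.sub(r"\1 ", text)


print(unglue_sentences("Could you please help me with your full name?What could go wrong."))
print(unglue_sentences("Sentence one.Sentence two"))
print(unglue_sentences("The U.S.A. is large."))  # abbreviations in capitals are left alone
```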

Expected Behavior

Issue 1:
As soon as a tokenizer detects a complete token (a sentence or a word), it should be emitted immediately, even if it is the only token so far. Emission should not depend on the look-ahead logic, because the delay between LLM chunks can be several seconds in some cases.

Issue 2:
The BlingFire sentence tokenizer should split between sentences even when there is no intervening space, instead of gluing the rest of the turn onto the first token.

Reproduction Steps

  1. pip install livekit-agents[elevenlabs,cartesia,openai,deepgram]==1.2.17
  2. Two scripts are provided in a GitHub Gist:
  • tok.py simulates LLM responses with the delays observed in realistic cases when tool calls occur, and shows how the tokenization implementations behave.
  • tts_filler.py uses the same simulated responses but demonstrates the experience with Cartesia and ElevenLabs TTS. You have to update the script manually to switch TTS providers and run it again.
  3. The output of these scripts is added as comments in the Gist.

Operating System

macOS Tahoe 26.0

Models Used

Deepgram nova-2-phonecall, elevenlabs, cartesia, custom llm simulation

Package Versions

livekit==1.0.16
livekit-agents==1.2.17
livekit-api==1.0.7
livekit-blingfire==1.0.0
livekit-plugins-anthropic==1.2.17
livekit-plugins-cartesia==1.2.17
livekit-plugins-deepgram==1.2.17
livekit-plugins-elevenlabs==1.2.17
livekit-plugins-google==1.2.17
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.2.17
livekit-plugins-silero==1.2.17
livekit-plugins-turn-detector==1.2.17
livekit-protocol==1.0.8

Session/Room/Call IDs

No response

Proposed Solution

I tried relaxing the guard on the number of tokens, but that sends the buffered stream into an infinite loop, as mentioned in the bug description.

I am also unsure how to implement this without introducing timers, so I opened this issue to see whether there are other solutions, since this is a core part of the code that has not changed much.
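For reference, the timer-based option mentioned above could look roughly like the following. This is a sketch under stated assumptions, not a proposed patch: IdleFlushBuffer, its idle_timeout parameter and the on_flush callback are hypothetical names, and the "ends with terminal punctuation" check is a crude stand-in for a real tokenizer. It uses only asyncio's call_later to flush the pending text once no new chunk has arrived for idle_timeout seconds.

```python
import asyncio
from typing import Callable, List, Optional


class IdleFlushBuffer:
    """Hypothetical buffer that flushes a complete-looking sentence after an idle gap."""

    def __init__(self, on_flush: Callable[[str], None], idle_timeout: float = 0.3) -> None:
        self._on_flush = on_flush
        self._idle_timeout = idle_timeout
        self._buf = ""
        self._timer: Optional[asyncio.TimerHandle] = None

    def push_text(self, text: str) -> None:
        self._buf += text
        # Restart the idle timer on every chunk (a simple debounce).
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(self._idle_timeout, self._flush_if_complete)

    def _flush_if_complete(self) -> None:
        self._timer = None
        pending = self._buf.strip()
        # Only flush text that looks like a finished sentence.
        if pending and pending[-1] in ".!?":
            self._on_flush(pending)
            self._buf = ""


async def _demo() -> List[str]:
    out: List[str] = []
    buf = IdleFlushBuffer(out.append, idle_timeout=0.05)
    buf.push_text("Let me check")
    buf.push_text(" that for you.")
    await asyncio.sleep(0.2)        # simulated gap, e.g. while a tool call runs
    buf.push_text("One moment")     # no terminal punctuation, so it stays buffered
    await asyncio.sleep(0.2)
    return out


def run_demo() -> List[str]:
    return asyncio.run(_demo())


if __name__ == "__main__":
    print(run_demo())
```

The trade-off is that a short timeout can split text at an abbreviation such as "Dr." when the next chunk is slow, while a long timeout gives back most of the latency this issue is about.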

Additional Context

Please go through this GH Gist

Screenshots and Recordings

No response
