Bug Description
Issue 1:
Streaming tokenizers in livekit.agents.tokenize.token_stream.BufferedTokenStream never emit the first token until a second token appears (see Ref code).
With real LLM output, where gaps between tokens can last several seconds, this means Cartesia/ElevenLabs TTS stays silent until another chunk arrives.
If we relax the guard to allow a single token, the loop spins forever: the sentence/word tokenizers return inclusive end indices (tok[2]), so the buffer slice never shrinks.
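To make the look-ahead behavior concrete, here is a minimal toy model of it. This is a sketch under stated assumptions, not the LiveKit implementation: `split_sentences` and `LookAheadStream` are hypothetical names, and the regex tokenizer only stands in for the real sentence tokenizer.

```python
import re


def split_sentences(text: str) -> list[tuple[str, int, int]]:
    """Toy tokenizer (assumption, not BlingFire): (sentence, start, end), end exclusive."""
    return [
        (m.group().strip(), m.start(), m.end())
        for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)
    ]


class LookAheadStream:
    """Hypothetical stand-in for a buffered stream with a look-ahead guard."""

    def __init__(self) -> None:
        self.buf = ""
        self.emitted: list[str] = []

    def push_text(self, text: str) -> None:
        self.buf += text
        while True:
            tokens = split_sentences(self.buf)
            # Look-ahead guard: a lone token is held back because it may
            # still grow when the next chunk arrives.
            if len(tokens) <= 1:
                break
            tok, _, end = tokens[0]
            self.emitted.append(tok)
            # Drop the consumed text so the buffer shrinks on every pass.
            self.buf = self.buf[end:]


stream = LookAheadStream()
stream.push_text("Hello there.")
print(stream.emitted)  # [] : the complete sentence is held back
stream.push_text(" How are")
print(stream.emitted)  # ['Hello there.'] : released only once a second token exists
```

In this model the first sentence reaches the consumer only after the next chunk arrives, which is the silence described above when that chunk is seconds away.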
Issue 2:
blingfire.SentenceTokenizer treats back‑to‑back sentences with no intervening space as a single token. When the LLM outputs something like "Sentence one.Sentence two" (common when a tool call finishes and a new LLM request is made), BlingFire glues everything together, and the streaming path won't yield a sentence until the entire turn finishes, which further delays speech generation.
As an example, run the tokenization script shared below. Its output contains:
Could you please help me with your full name?What could go wrong. <break time="0.25s" />.
There are 2 complete sentences and 1 SSML tag here, but they are sent to TTS as a single complete sentence. Please refer to this
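One possible workaround, sketched below, is to insert the missing space before the text reaches the sentence tokenizer. This is only an illustration: `unglue_sentences` is a hypothetical helper, and the rule it applies (a terminator directly followed by an uppercase letter marks a boundary) is an assumption that will misfire on abbreviations such as "U.S.A".

```python
import re

# Assumption: ".", "!" or "?" immediately followed by an uppercase ASCII letter
# is a glued sentence boundary. Digits are not matched, so "0.25s" is untouched.
_GLUED_BOUNDARY = re.compile(r"([.!?])(?=[A-Z])")


def unglue_sentences(text: str) -> str:
    """Insert a space after a sentence terminator that is glued to the next sentence."""
    return _GLUED_BOUNDARY.sub(r"\1 ", text)


sample = 'Could you please help me with your full name?What could go wrong. <break time="0.25s" />.'
print(unglue_sentences(sample))
# Could you please help me with your full name? What could go wrong. <break time="0.25s" />.
```

In a streaming setting this would need to run on the accumulated buffer rather than on each chunk, since the terminator and the following capital letter can arrive in different chunks.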
Expected Behavior
Issue 1:
As soon as a tokenizer detects a complete token, whether a sentence or a word, it should be emitted immediately, even if it is the only token so far. Emission should not depend on the look-ahead logic, because the delay between LLM chunks can be several seconds in some cases.
Issue 2:
The BlingFire sentence tokenizer should split between sentences even when there is no intervening space, instead of gluing the rest of the turn onto the first token.
Reproduction Steps
- pip install livekit-agents[elevenlabs,cartesia,openai,deepgram]==1.2.17
- There are two scripts below, shared as a GH Gist.
- tok.py contains simulated LLM responses, with delays observed in realistic cases when tool calls occur, and shows how the tokenizer implementations behave.
- tts_filler.py uses the same simulated responses but shows the resulting experience with Cartesia and ElevenLabs TTS. You have to update the script manually to switch TTS providers and run it again.
- The outputs of these scripts are added as comments in the Gist: tok.py for all tokenizer behavior, tts_filler.py with Cartesia, and tts_filler.py with ElevenLabs.
Operating System
macOS Tahoe 26.0
Models Used
Deepgram nova-2-phonecall, elevenlabs, cartesia, custom llm simulation
Package Versions
livekit==1.0.16
livekit-agents==1.2.17
livekit-api==1.0.7
livekit-blingfire==1.0.0
livekit-plugins-anthropic==1.2.17
livekit-plugins-cartesia==1.2.17
livekit-plugins-deepgram==1.2.17
livekit-plugins-elevenlabs==1.2.17
livekit-plugins-google==1.2.17
livekit-plugins-noise-cancellation==0.2.5
livekit-plugins-openai==1.2.17
livekit-plugins-silero==1.2.17
livekit-plugins-turn-detector==1.2.17
livekit-protocol==1.0.8
Session/Room/Call IDs
No response
Proposed Solution
I tried relaxing the token-count guard, but that sends the buffered stream into an infinite loop, as mentioned in the bug description.
I am also unsure how to implement this without introducing timers, so I created this bug to see whether there are other solutions, since this is a core part of the code that hasn't changed much.
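As a starting point for discussion, one timer-free idea is to release a lone token only when it already ends in terminal punctuation, and to always advance the buffer by an exclusive end offset so the loop terminates. The sketch below is a toy under those assumptions: `split_sentences` and `EagerStream` are hypothetical names, and the regex tokenizer is not BlingFire.

```python
import re

TERMINATORS = (".", "!", "?")


def split_sentences(text: str) -> list[tuple[str, int, int]]:
    """Toy tokenizer (assumption): (sentence, start, end), end exclusive."""
    return [
        (m.group().strip(), m.start(), m.end())
        for m in re.finditer(r"[^.!?]+[.!?]*\s*", text)
    ]


class EagerStream:
    """Hypothetical stream that emits a lone token once it looks complete."""

    def __init__(self) -> None:
        self.buf = ""
        self.emitted: list[str] = []

    def push_text(self, text: str) -> None:
        self.buf += text
        while True:
            tokens = split_sentences(self.buf)
            if not tokens:
                break
            tok, _, end = tokens[0]
            is_last = len(tokens) == 1
            # Hold back the final token only if it has no terminator yet.
            if is_last and not tok.endswith(TERMINATORS):
                break
            self.emitted.append(tok)
            # end is exclusive and > 0, so the buffer strictly shrinks.
            self.buf = self.buf[end:]


stream = EagerStream()
stream.push_text("Hello there.")
print(stream.emitted)  # ['Hello there.'] : emitted without waiting for a second token
stream.push_text(" How are")
stream.push_text(" you?")
print(stream.emitted)  # ['Hello there.', 'How are you?']
```

The trade-off is that a chunk ending in "3." would be emitted before a following "14" arrives, so a heuristic like this would need extra care around numbers and abbreviations.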
Additional Context
Please go through this GH Gist
Screenshots and Recordings
No response