
blingfire SentenceTokenizer(retain_format=True) emits a trailing whitespace-only sentence #6296

Description

@immuhammadfurqan

Bug Description

blingfire.SentenceTokenizer — the default sentence tokenizer used by the TTS StreamAdapter — emits a trailing whitespace-only token when retain_format=True and the input ends in whitespace.

In livekit/agents/tokenize/blingfire.py::_split_sentences, the trailing text segment is appended unconditionally in retain_format mode. The non-retain path strips the segment and skips it when nothing is left, so the two paths disagree:

if start < len(text):
    raw_sentence = text[start:]
    if retain_format:
        merged_sentences.append((raw_sentence, start, len(text)))   # "\n\n" leaks through
    elif sentence := raw_sentence.strip():
        merged_sentences.append((sentence, start, len(text)))
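
One possible shape for a guard is sketched below. It is an illustration only, not necessarily what #6295 does: append_trailing_segment is a hypothetical standalone helper that mirrors the branch above, and its name and signature are assumptions. The idea is to test the stripped text in both modes, but to append the raw, unstripped segment when retain_format is set so that real trailing content keeps its formatting.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (sentence text, start offset, end offset)


def append_trailing_segment(
    merged_sentences: List[Segment],
    text: str,
    start: int,
    retain_format: bool,
) -> List[Segment]:
    """Hypothetical helper mirroring the tail of _split_sentences.

    Skips the trailing segment when it is whitespace-only, in both modes.
    """
    if start < len(text):
        raw_sentence = text[start:]
        stripped = raw_sentence.strip()
        if not stripped:
            # Whitespace-only tail such as "\n\n": emit nothing.
            return merged_sentences
        # retain_format keeps the original formatting of real content.
        segment = raw_sentence if retain_format else stripped
        merged_sentences.append((segment, start, len(text)))
    return merged_sentences
```

With this guard, a tail of "\n\n" is dropped in both modes, while a tail of "\n\nMore" is still appended verbatim when retain_format=True.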

Reproduction Steps

from livekit.agents.tokenize import blingfire

tok = blingfire.SentenceTokenizer(min_sentence_len=20, retain_format=True)
print(tok.tokenize("This is a real sentence to speak.\n\n"))
# ['This is a real sentence to speak.', '\n\n']   <-- trailing '\n\n' is a spurious whitespace-only sentence

print(blingfire.SentenceTokenizer(min_sentence_len=20).tokenize("This is a real sentence to speak.\n\n"))
# ['This is a real sentence to speak.']           <-- non-retain path correctly drops it

The streamed path (.stream()) produces the same whitespace-only token, and StreamAdapterWrapper._synthesize pushes it into the timed transcript (push_timed_transcript) unconditionally. The audio synth call itself is already guarded by a .strip() check, so no empty TTS request is sent; the practical impact is on tokenizer-contract correctness and on clean transcript / .tokenize() output.

Expected Behavior

retain_format=True should match the non-retain path and not emit whitespace-only trailing segments, while still preserving the original formatting of real trailing content (e.g. a retained "\n\nMore" must be kept intact).
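
The expected contract can be expressed as a small check over tokenizer output. The sketch below is self-contained and does not import livekit: whitespace_only_tokens is a hypothetical helper, and the two sample lists are copied from the outputs shown in the reproduction steps.

```python
from typing import List


def whitespace_only_tokens(tokens: List[str]) -> List[str]:
    """Return every token that contains nothing but whitespace (hypothetical helper)."""
    return [tok for tok in tokens if not tok.strip()]


# Outputs copied from the reproduction steps in this issue.
retained_output = ["This is a real sentence to speak.", "\n\n"]
plain_output = ["This is a real sentence to speak."]

# Current behavior: the retain_format=True output carries one offending token.
print(whitespace_only_tokens(retained_output))
# Expected behavior for both modes: no whitespace-only tokens at all.
print(whitespace_only_tokens(plain_output))
```

A regression test could assert that this helper returns an empty list for the retain_format=True output, while separately checking that a token such as "\n\nMore" is still returned unchanged.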

Package Versions

  • livekit-agents (main)
  • livekit-blingfire ~=1.1

Additional Context

Fix in #6295.
