fix(tokenizer): strip special tokens from offline decode text by localai-bot · Pull Request #48 · mudler/parakeet.cpp

localai-bot · 2026-07-01T17:16:23Z

Summary

Offline transcribe leaked raw special-token pieces (e.g. the <en-US> language tag) at the end of the transcript, because decode_enc_out(), transcribe_16k_batch(), decode_enc_out_with_timestamps(), and transcribe_16k_batch_with_timestamps() passed decoded ids straight to detokenize() with no filtering.
Add pk::strip_special_tokens() (drops ids whose piece matches <...>/[...], leaving ordinary ▁-prefixed content tokens untouched) and apply it at all four offline call sites, matching what the streaming path already does for <EOU>/<EOB>.
detokenize() itself is left unmodified since it mirrors NeMo's raw ids_to_text and test_tokenizer.cpp's baseline parity depends on that.
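
For readers without the diff at hand, the filtering described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `is_special_piece` and `strip_special_tokens_sketch` are hypothetical names, the `(pieces, ids)` argument order is inferred from the call-site expression quoted in the test plan, and the decision to drop out-of-range ids is an assumption made here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical predicate: a piece counts as "special" when it is wrapped in
// <...> or [...], for example "<en-US>", "<EOU>" or "[PAD]".
static bool is_special_piece(const std::string& piece) {
    if (piece.size() < 2) return false;
    const char first = piece.front();
    const char last = piece.back();
    return (first == '<' && last == '>') || (first == '[' && last == ']');
}

// Sketch of the filter: keep only ids whose vocabulary piece is not special.
// Ids outside the vocabulary are dropped as well; that is a choice made for
// this sketch and may differ from the real pk::strip_special_tokens().
static std::vector<int32_t> strip_special_tokens_sketch(
        const std::vector<std::string>& pieces,
        const std::vector<int32_t>& ids) {
    std::vector<int32_t> kept;
    kept.reserve(ids.size());
    for (int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        if (!is_special_piece(pieces[static_cast<std::size_t>(id)])) {
            kept.push_back(id);
        }
    }
    return kept;
}
```

Filtering ids before detokenization, rather than inside `detokenize()`, is what lets the raw detokenizer keep its parity with NeMo's `ids_to_text`.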

Fixes #40.

Test plan

TDD: added tests/test_special_token_filter.cpp, which reproduces the defect via the real, unmodified pk::detokenize() (the exact call every offline site made pre-fix) and verifies the fixed call-site expression (detokenize(pieces, strip_special_tokens(pieces, ids))) strips the tag while leaving normal content untouched.
Full ctest suite passes (model-gated tests skip as usual without local GGUF fixtures).
Manually ran parakeet-cli transcribe against a real local model (tdt_ctc-110m-q4_k.gguf) + fixture audio to confirm no regression in normal transcription output.
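
The reproduce-then-fix shape of the new test can be illustrated with a self-contained sketch. Everything here is a stand-in: `toy_detokenize` only approximates a raw detokenizer (it joins pieces and turns the SentencePiece word marker into a space), `toy_strip` is a compact hypothetical filter, and the four-entry vocabulary is invented for the example. The actual test in tests/test_special_token_filter.cpp calls the real `pk::detokenize()`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// UTF-8 encoding of U+2581, the SentencePiece word-boundary marker.
static const std::string kMarker = "\xE2\x96\x81";

// Toy stand-in for a raw detokenizer: concatenates pieces in order and
// replaces a leading marker with a space (no space before the first word).
static std::string toy_detokenize(const std::vector<std::string>& pieces,
                                  const std::vector<int32_t>& ids) {
    std::string text;
    for (int32_t id : ids) {
        const std::string& p = pieces.at(static_cast<std::size_t>(id));
        if (p.compare(0, kMarker.size(), kMarker) == 0) {
            if (!text.empty()) text += ' ';
            text += p.substr(kMarker.size());
        } else {
            text += p;  // special pieces pass through verbatim
        }
    }
    return text;
}

// Hypothetical filter: drops ids whose piece looks like <...> or [...].
static std::vector<int32_t> toy_strip(const std::vector<std::string>& pieces,
                                      const std::vector<int32_t>& ids) {
    std::vector<int32_t> kept;
    for (int32_t id : ids) {
        const std::string& p = pieces.at(static_cast<std::size_t>(id));
        const bool special = p.size() >= 2 &&
            ((p.front() == '<' && p.back() == '>') ||
             (p.front() == '[' && p.back() == ']'));
        if (!special) kept.push_back(id);
    }
    return kept;
}

// True when the unfiltered text leaks the tag and the filtered text does not.
static bool reproduces_and_fixes() {
    const std::vector<std::string> pieces = {
        "<unk>", kMarker + "hello", kMarker + "world", "<en-US>"};
    const std::vector<int32_t> ids = {1, 2, 3};  // content, content, tag
    const std::string before = toy_detokenize(pieces, ids);
    const std::string after = toy_detokenize(pieces, toy_strip(pieces, ids));
    return before == "hello world<en-US>" && after == "hello world";
}
```

Asserting on the unfiltered output first is what makes the check a regression test: it fails loudly if the raw detokenizer ever starts filtering on its own.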

Offline transcribe emitted raw special-token pieces (e.g. the <en-US> language tag) at the end of the transcript because decode_enc_out(), transcribe_16k_batch(), decode_enc_out_with_timestamps(), and transcribe_16k_batch_with_timestamps() passed decoded ids straight to detokenize() with no filtering.

Add pk::strip_special_tokens() (drops ids whose piece matches <...>/[...]) and apply it at all four call sites, matching the filtering the streaming path already does for <EOU>/<EOB>.

Assisted-by: Claude:claude-sonnet-5 [Claude Code]

mudler merged commit e8acc61 into master Jul 1, 2026
10 checks passed

mudler merged 1 commit into master from fix-issue-40-special-tokens
