fix(tokenizer): strip special tokens from offline decode text #48
Merged
Offline transcribe emitted raw special-token pieces (e.g. the <en-US> language tag) at the end of the transcript because decode_enc_out(), transcribe_16k_batch(), decode_enc_out_with_timestamps(), and transcribe_16k_batch_with_timestamps() passed decoded ids straight to detokenize() with no filtering.

Add pk::strip_special_tokens() (drops ids whose piece matches <...>/[...]) and apply it at all four call sites, matching the filtering the streaming path already does for <EOU>/<EOB>.

Assisted-by: Claude:claude-sonnet-5 [Claude Code]
Summary
- Offline transcribe emitted raw special-token pieces (e.g. the `<en-US>` language tag) at the end of the transcript, because `decode_enc_out()`, `transcribe_16k_batch()`, `decode_enc_out_with_timestamps()`, and `transcribe_16k_batch_with_timestamps()` passed decoded ids straight to `detokenize()` with no filtering.
- Add `pk::strip_special_tokens()` (drops ids whose piece matches `<...>`/`[...]`, leaving ordinary `▁`-prefixed content tokens untouched) and apply it at all four offline call sites, matching what the streaming path already does for `<EOU>`/`<EOB>`.
- `detokenize()` itself is left unmodified, since it mirrors NeMo's raw `ids_to_text` and `test_tokenizer.cpp`'s baseline parity depends on that.

Fixes #40.
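For readers who want to see the shape of the filter, here is a minimal sketch of the approach described above. It is not the code from this PR: the `sketch` namespace, the `is_special_piece()` helper, the exact signatures, and the choice to drop out-of-range ids are all assumptions, and the `detokenize()` shown is a toy stand-in rather than the project's `pk::detokenize()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {  // hypothetical namespace, deliberately not the project's pk::

// True when a piece looks like a special token, i.e. is wrapped in <...> or [...].
inline bool is_special_piece(const std::string& piece) {
    if (piece.size() < 2) return false;
    const char first = piece.front();
    const char last = piece.back();
    return (first == '<' && last == '>') || (first == '[' && last == ']');
}

// Returns a copy of `ids` without ids whose vocabulary piece is special.
// Assumption: out-of-range ids are dropped as well.
inline std::vector<std::int32_t> strip_special_tokens(
        const std::vector<std::string>& pieces,
        const std::vector<std::int32_t>& ids) {
    std::vector<std::int32_t> kept;
    kept.reserve(ids.size());
    for (std::int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        if (!is_special_piece(pieces[static_cast<std::size_t>(id)])) kept.push_back(id);
    }
    return kept;
}

// Toy stand-in for a raw detokenizer: concatenates pieces, turns the
// SentencePiece word marker (U+2581) into a space, and trims one leading space.
inline std::string detokenize(const std::vector<std::string>& pieces,
                              const std::vector<std::int32_t>& ids) {
    static const std::string kWordMark = "\xE2\x96\x81";
    std::string text;
    for (std::int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        text += pieces[static_cast<std::size_t>(id)];
    }
    for (std::size_t pos = text.find(kWordMark); pos != std::string::npos;
         pos = text.find(kWordMark, pos + 1)) {
        text.replace(pos, kWordMark.size(), " ");
    }
    if (!text.empty() && text.front() == ' ') text.erase(0, 1);
    return text;
}

// Usage sketch: the raw call leaks the language tag, the filtered call does not.
inline std::string example_usage() {
    const std::vector<std::string> pieces = {
        "<en-US>", "\xE2\x96\x81" "hello", "\xE2\x96\x81" "world"};
    const std::vector<std::int32_t> ids = {1, 2, 0};
    assert(detokenize(pieces, ids) == "hello world<en-US>");  // pre-fix behaviour
    return detokenize(pieces, strip_special_tokens(pieces, ids));
}

}  // namespace sketch
```

Filtering the ids at the call site, rather than inside the detokenizer, keeps the raw detokenizer usable for parity checks against a reference implementation.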
Test plan
- `tests/test_special_token_filter.cpp` reproduces the defect via the real, unmodified `pk::detokenize()` (the exact call every offline site made pre-fix) and verifies that the fixed call-site expression (`detokenize(pieces, strip_special_tokens(pieces, ids))`) strips the tag while leaving normal content untouched.
- The `ctest` suite passes (model-gated tests skip as usual without local GGUF fixtures).
- `parakeet-cli transcribe` was run against a real local model (`tdt_ctc-110m-q4_k.gguf`) plus fixture audio to confirm no regression in normal transcription output.