Bug
The offline transcription path emits raw special tokens like <en-US> in the output text:
parakeet-cli transcribe --model nemotron-3.5-asr-streaming-0.6b-q4_k.gguf --input test.wav --lang en
# → "The sun was setting slowly, casting long shadows across the empty field. <en-US>"
The trailing <en-US> is a language tag token from the model's tokenizer vocabulary that should be stripped.
Root cause
detokenize() in src/tokenizer.cpp concatenates all token pieces for the given IDs, including special tokens. It does not filter tokens matching the <...> or [...] special-token pattern.
The streaming path already handles this correctly — src/streaming.cpp filters to non_special_ tokens before calling detokenize():
// streaming.cpp — correct
text_ = detokenize(ml_.config().tokenizer_pieces, non_special_);
But the offline path in src/model.cpp passes all decoded IDs (including special tokens) directly to detokenize():
// model.cpp decode_enc_out() — bug: no special-token filtering
return detokenize(loader.tokenizer_pieces(), ids);
This affects all four offline decode paths:
- decode_enc_out() (line ~67, used by transcribe_16k)
- transcribe_16k_batch() (line ~175, the TDT/RNNT batch path)
- decode_enc_out_with_timestamps() (line ~210, used by transcribe_16k_with_timestamps)
- transcribe_16k_batch_with_timestamps() (line ~310, the batch timestamped path)
Fix
Filter special tokens before detokenizing in the offline path. The is_special_token() function already exists in src/transcription.cpp and could be reused (or moved to a shared header):
// In model.cpp, before detokenize():
const auto& pieces = loader.tokenizer_pieces();  // same pieces decode_enc_out() passes today
std::vector<int32_t> non_special_ids;
non_special_ids.reserve(ids.size());
for (int32_t id : ids) {
    // Out-of-range IDs are dropped along with special tokens.
    if (id >= 0 && (size_t)id < pieces.size()) {
        const std::string& piece = pieces[(size_t)id];
        if (!is_special_token(piece))  // reuse from transcription.cpp
            non_special_ids.push_back(id);
    }
}
return detokenize(pieces, non_special_ids);
Alternatively, add special-token filtering directly inside detokenize() in src/tokenizer.cpp. That filter would also run on the streaming path, which already pre-filters, so the second pass there would be harmless but redundant.
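Since the same filtering is needed at four call sites, it may be easier to factor it into one helper. Below is a minimal, self-contained sketch of that idea. The name filter_special_ids() is hypothetical, and the is_special_token() here is only a stand-in that matches the <...> / [...] pattern described above; the real function in src/transcription.cpp may use different rules.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in for is_special_token() in src/transcription.cpp (assumption:
// the real implementation may differ). Treats pieces that are wrapped in
// <...> or [...] as special.
static bool is_special_token(const std::string& piece) {
    if (piece.size() < 2) return false;
    const char first = piece.front();
    const char last = piece.back();
    return (first == '<' && last == '>') || (first == '[' && last == ']');
}

// Hypothetical shared helper: returns only the IDs that are in range for
// `pieces` and whose piece is not a special token. Order is preserved.
static std::vector<int32_t> filter_special_ids(
        const std::vector<std::string>& pieces,
        const std::vector<int32_t>& ids) {
    std::vector<int32_t> kept;
    kept.reserve(ids.size());
    for (int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        if (!is_special_token(pieces[static_cast<std::size_t>(id)])) {
            kept.push_back(id);
        }
    }
    return kept;
}
```

With a helper like this, each offline path could reduce to something like detokenize(pieces, filter_special_ids(pieces, ids)), assuming detokenize() keeps the (pieces, ids) signature shown in the snippets above.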
Environment
- parakeet.cpp built from master (f469a57)
- Model: nemotron-3.5-asr-streaming-0.6b (q4_k GGUF)
- Linux, Vulkan backend (AMD Radeon 880M)
- Reproduced with --lang en and --lang en-US