Offline decode emits raw language tag tokens (<en-US>) in transcript text #40

@HaleTom

Bug

The offline transcription path emits raw special tokens like <en-US> in the output text:

parakeet-cli transcribe --model nemotron-3.5-asr-streaming-0.6b-q4_k.gguf --input test.wav --lang en
# → "The sun was setting slowly, casting long shadows across the empty field. <en-US>"

The trailing <en-US> is a language tag token from the model's tokenizer vocabulary that should be stripped.

Root cause

detokenize() in src/tokenizer.cpp concatenates all token pieces for the given IDs, including special tokens. It does not filter tokens matching the <...> or [...] special-token pattern.

The streaming path already handles this correctly — src/streaming.cpp filters to non_special_ tokens before calling detokenize():

// streaming.cpp — correct
text_ = detokenize(ml_.config().tokenizer_pieces, non_special_);

But the offline path in src/model.cpp passes all decoded IDs (including special tokens) directly to detokenize():

// model.cpp decode_enc_out() — bug: no special-token filtering
return detokenize(loader.tokenizer_pieces(), ids);

This affects all four offline decode paths:

  • decode_enc_out() (line ~67, used by transcribe_16k)
  • transcribe_16k_batch() (line ~175, the TDT/RNNT batch path)
  • decode_enc_out_with_timestamps() (line ~210, used by transcribe_16k_with_timestamps)
  • transcribe_16k_batch_with_timestamps() (line ~310, the batch timestamped path)

Fix

Filter special tokens before detokenizing in the offline path. The is_special_token() function already exists in src/transcription.cpp and could be reused (or moved to a shared header):

// In model.cpp, before detokenize():
const auto& pieces = loader.tokenizer_pieces();  // same accessor the current call uses
std::vector<int32_t> non_special_ids;
non_special_ids.reserve(ids.size());
for (int32_t id : ids) {
    // Skip out-of-range IDs, then drop special pieces such as <en-US>.
    if (id >= 0 && (size_t)id < pieces.size()) {
        const std::string& piece = pieces[(size_t)id];
        if (!is_special_token(piece))  // reuse from transcription.cpp
            non_special_ids.push_back(id);
    }
}
return detokenize(pieces, non_special_ids);
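
Since four call sites need the same filtering, it may be easier to put it in one shared helper. The following is a minimal, self-contained sketch and not the project's code. The names strip_special_ids() and looks_like_special_token() are hypothetical. The predicate is only a stand-in for is_special_token(), whose real rules in src/transcription.cpp may differ. The piece table is assumed to be a std::vector<std::string>.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for is_special_token(): treats a piece as special
// when it is wrapped as <...> or [...] with at least one character inside.
inline bool looks_like_special_token(const std::string& piece) {
    if (piece.size() < 3) return false;  // "<", "<>", "[]" stay ordinary
    const char first = piece.front();
    const char last = piece.back();
    return (first == '<' && last == '>') || (first == '[' && last == ']');
}

// Hypothetical shared helper: returns the IDs that map to ordinary pieces.
// Out-of-range IDs are dropped as well, matching the loop above.
inline std::vector<std::int32_t> strip_special_ids(
        const std::vector<std::string>& pieces,
        const std::vector<std::int32_t>& ids) {
    std::vector<std::int32_t> kept;
    kept.reserve(ids.size());
    for (std::int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        if (!looks_like_special_token(pieces[static_cast<std::size_t>(id)]))
            kept.push_back(id);
    }
    return kept;
}
```

Each offline path could then call something like detokenize(pieces, strip_special_ids(pieces, ids)). For the timestamped paths, any per-token timestamps would presumably need to be filtered in step with the IDs, which this sketch does not cover.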

Alternatively, add special-token filtering directly inside detokenize() in src/tokenizer.cpp. This would fix all four offline paths in one place, but it would also apply to the streaming path, which already pre-filters, so the second filter there would be harmless but redundant.
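
To illustrate why double-filtering is harmless, here is a rough sketch of a detokenizer that skips special pieces internally. It is a toy model only. detokenize_filtered() and looks_like_special_token() are hypothetical names. Plain concatenation stands in for whatever the real detokenize() does with word-boundary markers. The piece table is assumed to be a std::vector<std::string>.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for is_special_token(): matches <...> and [...].
inline bool looks_like_special_token(const std::string& piece) {
    if (piece.size() < 3) return false;
    const char first = piece.front();
    const char last = piece.back();
    return (first == '<' && last == '>') || (first == '[' && last == ']');
}

// Toy detokenizer: concatenates pieces, skipping invalid IDs and special
// pieces. Filtering here means callers need not pre-filter.
inline std::string detokenize_filtered(
        const std::vector<std::string>& pieces,
        const std::vector<std::int32_t>& ids) {
    std::string text;
    for (std::int32_t id : ids) {
        if (id < 0 || static_cast<std::size_t>(id) >= pieces.size()) continue;
        const std::string& piece = pieces[static_cast<std::size_t>(id)];
        if (looks_like_special_token(piece)) continue;
        text += piece;
    }
    return text;
}
```

Under these assumptions, a caller that already removed the special IDs (as the streaming path does) gets the same string as a caller that passes the raw IDs, so the inner filter only costs one extra check per token.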

Environment

  • parakeet.cpp built from master (f469a57)
  • Model: nemotron-3.5-asr-streaming-0.6b (q4_k GGUF)
  • Linux, Vulkan backend (AMD Radeon 880M)
  • Reproduced with both --lang en and --lang en-US
