
feat(loader): JSON config plumbing — special_tokens_map + Mistral V3 + tokenizer_config flags#77

Merged
kekzl merged 2 commits into main from feat/json-config-plumbing
Apr 28, 2026
Conversation

@kekzl (Owner) commented Apr 28, 2026

Summary

Combines the rebased content of the previously-orphaned PRs #75 and #76, which were stuck behind a stacking issue after their base branches were merged and deleted. Both diffs land cleanly on top of current main (which already includes #74).

Both changes are preserved as separate commits for reviewability:

  1. special_tokens_map.json overlay + Mistral V3-Tekken family (was #75)
  2. tokenizer_config.json add_bos/add_eos/add_prefix_space plumbing (was #76)

Why one PR

PRs #75 and #76 were independent topical changes, but they're small enough (and adjacent in scope) that one stacked review is cheaper than two separate threads. Each commit can be reviewed/cherry-picked individually.

1. special_tokens_map.json + Mistral V3

  • New HFConfigLoader::SpecialTokensMap parser: object form (Mistral) + plain-string form (Qwen)
  • Tokenizer::mark_as_control(id) defensive method (lazy-allocates type vector)
  • SafeTensors loader cross-checks additional_special_tokens against the tokenizer's CONTROL flag column; missing tags get patched
  • New ChatTemplateFamily::MISTRAL_V3 enum value with [TOOL_CALLS] / [AVAILABLE_TOOLS] substring detection BEFORE the LLAMA2 [INST] fallback
  • Fixes Mistral-Small-3.2 misclassification: was logging Chat template: llama2 despite Tekken-format jinja; now logs mistral_v3
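The ordering constraint above is the crux of the fix: V3-Tekken templates also contain [INST], so the tool-marker probe has to run first. A minimal sketch of that ordering (enum and function names are illustrative, not the project's actual identifiers):

```cpp
#include <cassert>
#include <string>

enum class ChatTemplateFamily { UNKNOWN, LLAMA2, MISTRAL_V3 };

// Detection order matters: V3-Tekken templates also contain [INST],
// so probing the tool-call markers must happen BEFORE the generic
// [INST] fallback, or V3 templates misclassify as LLAMA2.
ChatTemplateFamily detect_family(const std::string& jinja) {
    if (jinja.find("[TOOL_CALLS]") != std::string::npos ||
        jinja.find("[AVAILABLE_TOOLS]") != std::string::npos)
        return ChatTemplateFamily::MISTRAL_V3;
    // Older Mistral V1/V2 (and Llama-2) templates land here.
    if (jinja.find("[INST]") != std::string::npos)
        return ChatTemplateFamily::LLAMA2;
    return ChatTemplateFamily::UNKNOWN;
}
```

Swapping the two checks reproduces the Mistral-Small-3.2 misclassification described above.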

2. tokenizer_config.json author flags

  • HFConfigLoader::TokenizerFlags struct + load_tokenizer_flags() parser
  • SafeTensors loader applies add_bos_token / add_prefix_space to the tokenizer (mirrors the GGUF-side handling in gguf_loader.cpp:1335-1354)
  • Fixes silent BOS injection for Qwen3-Coder-30B-A3B-FP4: the model ships add_bos_token: false, but before this fix imp auto-prepended <|endoftext|> (token 151643) to every prompt
  • Falls back to the gpt2 default (false) when the JSON omits a flag, matching GGUF behavior
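A tri-state flag (set-true / set-false / absent) plus a default-false fallback is the natural shape for this plumbing. A minimal sketch, assuming a struct similar to the TokenizerFlags described above (field and function names are illustrative):

```cpp
#include <cassert>
#include <optional>

// Tri-state author flags: std::nullopt means tokenizer_config.json
// omitted the key entirely, which must be distinguishable from an
// explicit false written by the model author.
struct TokenizerFlags {
    std::optional<bool> add_bos_token;
    std::optional<bool> add_eos_token;
    std::optional<bool> add_prefix_space;
};

// Resolve a flag against the gpt2-style default (false) used when
// the JSON is missing the key, matching the GGUF-side behavior.
bool resolve(const std::optional<bool>& flag) {
    return flag.value_or(false);
}
```

Keeping the unset state around also lets the loader log `unset` (as in the CLI smoke output below) instead of silently printing the fallback value.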

Test plan

  • test-core → 79/79 pass (was 72 on main; +3 SpecialTokensMapTest, +4 TokenizerFlagsTest)
  • test-text → 148/148 pass (was 140; +5 ChatTemplateDetectTest, +4 TokenizerControlTest)
  • test-e2e --gtest_filter='LlmCompressorE2E.{Gemma4,Mistral}*' → 3/3 pass (Gemma-4 NVFP4 coherent, Mistral-Small-3.2 NVFP4 "Paris" gate)
  • CLI smoke confirms log per model:
    • Mistral 3.2: add_bos=true add_eos=false add_prefix_space=unset + 1000 additional_special_tokens, 0 patched
    • Qwen3-Coder: add_bos=false add_eos=unset add_prefix_space=false
    • Gemma-4: add_bos=unset add_eos=unset add_prefix_space=unset (falls through to existing tokenizer.json metadata)

Note on perf gate

Documented skip via IMP_VERIFY_SKIP_PERF=1: the baseline is currently stale on main itself (Qwen3-8B Q8_0 measures tg=156 vs the 258 baseline, identical on this branch and on main). This PR doesn't touch any GGUF or compute code path.

Closes JSON coverage

With #74 already merged + this PR, every author-shipped HF config relevant for text-only inference is consumed:

  • config.json, model.safetensors.index.json, chat_template.jinja, recipe.yaml, hf_quant_config.json, tokenizer.json, tokenizer_config.json (chat_template + added_tokens) — pre-existing
  • generation_config.json — #74
  • special_tokens_map.json — this PR
  • tokenizer_config.json (add_bos/add_eos/add_prefix_space) — this PR

⏭️ Out of scope (multimodal pipeline, separate effort): tokenizer.json::post_processor, tekken.json, processor_config.json, preprocessor_config.json.

Closes #75, #76

🤖 Generated with Claude Code

kekzl and others added 2 commits April 28, 2026 12:54
Two related additions to harden tokenizer + chat-template handling for
SafeTensors models that ship config beyond what tokenizer.json carries.

1. special_tokens_map.json defensive overlay
   - HFConfigLoader::load_special_tokens_map parses the model author's
     authoritative additional_special_tokens list (object form for
     Mistral, plain-string form for Qwen).
   - SafeTensors loader cross-checks each entry against the tokenizer's
     CONTROL flag column. If a string is in the vocab but not tagged
     CONTROL, mark_as_control() patches it. tokenizer.json conversions
     are usually complete — this is belt-and-suspenders for buggy ones.
   - Mistral-Small-3.2-NVFP4 reports `1000 additional_special_tokens,
     0 patched` (tokenizer.json already tagged them all). Still useful
     as a regression gate for future model dirs.

2. Mistral V3-Tekken family detection
   - New ChatTemplateFamily::MISTRAL_V3 enum value, with
     [TOOL_CALLS] / [AVAILABLE_TOOLS] substring detection BEFORE the
     LLAMA2 [INST] fallback. Older Mistral V1/V2 templates without
     tool markers still detect as LLAMA2.
   - Init / apply share the LLAMA2 path (same [INST] tokens). The
     Jinja2 path handles tool-call rendering when present.
   - Fixes: Mistral-Small-3.2 was previously misclassified as `llama2`
     in `chat_template.cpp:246` despite Tekken-format jinja being loaded.

Tests
   - SpecialTokensMapTest x3 (object form, plain-string form, missing)
   - TokenizerControlTest x4 (lazy alloc, idempotent, invalid id, preserves
     pre-existing types)
   - ChatTemplateDetectTest x5 new (V3 ToolCalls, V3 AvailableTools,
     OldMistralStillLlama2, ParseFamilyMistralV3 + name lookup)
   - test-core 75/75, test-text 148/148, LlmCompressorE2E 3/3

Stacked on top of #74 (generation_config.json defaults).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit ef6fa67 into main Apr 28, 2026
2 checks passed
@kekzl kekzl deleted the feat/json-config-plumbing branch April 28, 2026 12:38
kekzl added a commit that referenced this pull request Apr 30, 2026