
feat(loader): JSON config plumbing — special_tokens_map + Mistral V3 + tokenizer_config flags#77

Merged
kekzl merged 2 commits into main from feat/json-config-plumbing
Apr 28, 2026
Conversation

@kekzl (Owner) commented Apr 28, 2026

Summary

Combines the rebased content of the previously-orphaned PRs #75 and #76, which were stuck behind a stacking issue after their base branches were merged and deleted. Both diffs land cleanly on top of current main (which already includes #74).

Both changes are preserved as separate commits for reviewability:

  1. special_tokens_map.json overlay + Mistral V3-Tekken family (was #75)
  2. tokenizer_config.json add_bos/add_eos/add_prefix_space plumbing (was #76)

Why one PR

PRs #75 and #76 were independent topical changes, but they're small enough (and adjacent in scope) that one stacked review is cheaper than two separate threads. Each commit can be reviewed/cherry-picked individually.

1. special_tokens_map.json + Mistral V3

  • New HFConfigLoader::SpecialTokensMap parser: object form (Mistral) + plain-string form (Qwen)
  • Tokenizer::mark_as_control(id) defensive method (lazy-allocates type vector)
  • SafeTensors loader cross-checks additional_special_tokens against the tokenizer's CONTROL flag column; missing tags get patched
  • New ChatTemplateFamily::MISTRAL_V3 enum value with [TOOL_CALLS] / [AVAILABLE_TOOLS] substring detection BEFORE the LLAMA2 [INST] fallback
  • Fixes Mistral-Small-3.2 misclassification: was logging Chat template: llama2 despite Tekken-format jinja; now logs mistral_v3
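The ordering constraint above is the crux of the fix: V3-Tekken templates also contain [INST], so the tool-marker probe has to run first. A minimal sketch of that ordering (enum and function names are illustrative, not the project's actual identifiers):

```cpp
#include <cassert>
#include <string>

enum class ChatTemplateFamily { UNKNOWN, LLAMA2, MISTRAL_V3 };

// Detection order matters: V3-Tekken templates also contain [INST],
// so probing the tool-call markers must happen BEFORE the generic
// [INST] fallback, or V3 templates misclassify as LLAMA2.
ChatTemplateFamily detect_family(const std::string& jinja) {
    if (jinja.find("[TOOL_CALLS]") != std::string::npos ||
        jinja.find("[AVAILABLE_TOOLS]") != std::string::npos)
        return ChatTemplateFamily::MISTRAL_V3;
    // Older Mistral V1/V2 (and Llama-2) templates land here.
    if (jinja.find("[INST]") != std::string::npos)
        return ChatTemplateFamily::LLAMA2;
    return ChatTemplateFamily::UNKNOWN;
}
```

Swapping the two checks reproduces the Mistral-Small-3.2 misclassification described above.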

2. tokenizer_config.json author flags

  • HFConfigLoader::TokenizerFlags struct + load_tokenizer_flags() parser
  • SafeTensors loader applies add_bos_token / add_prefix_space to the tokenizer (mirrors the GGUF-side handling in gguf_loader.cpp:1335-1354)
  • Fixes silent BOS injection for Qwen3-Coder-30B-A3B-FP4: the model ships add_bos_token: false, but before this fix imp auto-prepended <|endoftext|> (token 151643) to every prompt
  • Falls back to the gpt2 default (false) when the JSON omits a flag, matching GGUF behavior
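A tri-state flag (set-true / set-false / absent) plus a default-false fallback is the natural shape for this plumbing. A minimal sketch, assuming a struct similar to the TokenizerFlags described above (field and function names are illustrative):

```cpp
#include <cassert>
#include <optional>

// Tri-state author flags: std::nullopt means tokenizer_config.json
// omitted the key entirely, which must be distinguishable from an
// explicit false written by the model author.
struct TokenizerFlags {
    std::optional<bool> add_bos_token;
    std::optional<bool> add_eos_token;
    std::optional<bool> add_prefix_space;
};

// Resolve a flag against the gpt2-style default (false) used when
// the JSON is missing the key, matching the GGUF-side behavior.
bool resolve(const std::optional<bool>& flag) {
    return flag.value_or(false);
}
```

Keeping the unset state around also lets the loader log `unset` (as in the CLI smoke output below) instead of silently printing the fallback value.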

Test plan

  • test-core → 79/79 pass (was 72 on main; +3 SpecialTokensMapTest, +4 TokenizerFlagsTest)
  • test-text → 148/148 pass (was 140; +5 ChatTemplateDetectTest, +4 TokenizerControlTest)
  • test-e2e --gtest_filter='LlmCompressorE2E.{Gemma4,Mistral}*' → 3/3 pass (Gemma-4 NVFP4 coherent, Mistral-Small-3.2 NVFP4 "Paris" gate)
  • CLI smoke confirms log per model:
    • Mistral 3.2: add_bos=true add_eos=false add_prefix_space=unset + 1000 additional_special_tokens, 0 patched
    • Qwen3-Coder: add_bos=false add_eos=unset add_prefix_space=false
    • Gemma-4: add_bos=unset add_eos=unset add_prefix_space=unset (falls through to existing tokenizer.json metadata)

Note on perf gate

Documented skip via IMP_VERIFY_SKIP_PERF=1: the baseline is currently stale on main itself (Qwen3-8B Q8_0 measures tg=156 vs the 258 baseline, identical on this branch and on main). This PR doesn't touch any GGUF or compute code path.

Closes JSON coverage

With #74 already merged + this PR, every author-shipped HF config relevant for text-only inference is consumed:

  • config.json, model.safetensors.index.json, chat_template.jinja, recipe.yaml, hf_quant_config.json, tokenizer.json, tokenizer_config.json (chat_template + added_tokens) — pre-existing
  • generation_config.json — #74
  • special_tokens_map.json — this PR
  • tokenizer_config.json (add_bos/add_eos/add_prefix_space) — this PR

⏭️ Out of scope (multimodal pipeline, separate effort): tokenizer.json::post_processor, tekken.json, processor_config.json, preprocessor_config.json.

Closes #75, #76

🤖 Generated with Claude Code

kekzl and others added 2 commits April 28, 2026 12:54
Two related additions to harden tokenizer + chat-template handling for
SafeTensors models that ship config beyond what tokenizer.json carries.

1. special_tokens_map.json defensive overlay
   - HFConfigLoader::load_special_tokens_map parses the model author's
     authoritative additional_special_tokens list (object form for
     Mistral, plain-string form for Qwen).
   - SafeTensors loader cross-checks each entry against the tokenizer's
     CONTROL flag column. If a string is in the vocab but not tagged
     CONTROL, mark_as_control() patches it. tokenizer.json conversions
     are usually complete — this is belt-and-suspenders for buggy ones.
   - Mistral-Small-3.2-NVFP4 reports `1000 additional_special_tokens,
     0 patched` (tokenizer.json already tagged them all). Still useful
     as a regression gate for future model dirs.

2. Mistral V3-Tekken family detection
   - New ChatTemplateFamily::MISTRAL_V3 enum value, with
     [TOOL_CALLS] / [AVAILABLE_TOOLS] substring detection BEFORE the
     LLAMA2 [INST] fallback. Older Mistral V1/V2 templates without
     tool markers still detect as LLAMA2.
   - Init / apply share the LLAMA2 path (same [INST] tokens). The
     Jinja2 path handles tool-call rendering when present.
   - Fixes: Mistral-Small-3.2 was previously misclassified as `llama2`
     in `chat_template.cpp:246` despite Tekken-format jinja being loaded.

Tests
   - SpecialTokensMapTest x3 (object form, plain-string form, missing)
   - TokenizerControlTest x4 (lazy alloc, idempotent, invalid id, preserves
     pre-existing types)
   - ChatTemplateDetectTest x5 new (V3 ToolCalls, V3 AvailableTools,
     OldMistralStillLlama2, ParseFamilyMistralV3 + name lookup)
   - test-core 75/75, test-text 148/148, LlmCompressorE2E 3/3

Stacked on top of #74 (generation_config.json defaults).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kekzl kekzl merged commit ef6fa67 into main Apr 28, 2026
2 checks passed
@kekzl kekzl deleted the feat/json-config-plumbing branch April 28, 2026 12:38
kekzl added a commit that referenced this pull request Apr 30, 2026