feat(loader): JSON config plumbing — special_tokens_map + Mistral V3 + tokenizer_config flags #77
Merged
Conversation
Two related additions to harden tokenizer + chat-template handling for
SafeTensors models that ship config beyond what tokenizer.json carries.
1. special_tokens_map.json defensive overlay
- HFConfigLoader::load_special_tokens_map parses the model author's
authoritative additional_special_tokens list (object form for
Mistral, plain-string form for Qwen).
- SafeTensors loader cross-checks each entry against the tokenizer's
CONTROL flag column. If a string is in the vocab but not tagged
CONTROL, mark_as_control() patches it. tokenizer.json conversions
are usually complete — this is belt-and-suspenders for buggy ones.
- Mistral-Small-3.2-NVFP4 reports `1000 additional_special_tokens,
0 patched` (tokenizer.json already tagged them all). Still useful
as a regression gate for future model dirs.
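To make the overlay concrete, here's a minimal sketch of the two-form parse plus the CONTROL cross-check, assuming nlohmann::json for parsing; the `Tokenizer` struct below is a stand-in for the real class, and every name other than `load_special_tokens_map` / `mark_as_control` is illustrative.

```cpp
#include <nlohmann/json.hpp>
#include <string>
#include <unordered_map>
#include <vector>

struct Tokenizer {                         // stand-in for the real tokenizer
    std::unordered_map<std::string, int> vocab;
    std::vector<bool> control;             // lazily sized CONTROL flag column
    int token_to_id(const std::string& s) const {
        auto it = vocab.find(s);
        return it == vocab.end() ? -1 : it->second;
    }
    bool is_control(int id) const {
        return id < (int)control.size() && control[id];
    }
    void mark_as_control(int id) {         // lazy-allocate, then tag
        if (id >= (int)control.size()) control.resize(id + 1, false);
        control[id] = true;
    }
};

// special_tokens_map.json ships additional_special_tokens either as plain
// strings (Qwen) or as {"content": "...", ...} objects (Mistral).
std::vector<std::string> load_special_tokens_map(const nlohmann::json& j) {
    std::vector<std::string> out;
    for (const auto& e : j.value("additional_special_tokens", nlohmann::json::array())) {
        if (e.is_string())
            out.push_back(e.get<std::string>());
        else if (e.is_object() && e.contains("content"))
            out.push_back(e["content"].get<std::string>());
    }
    return out;
}

// Overlay pass: patch vocab entries the author lists as special but
// tokenizer.json failed to tag CONTROL; returns the patch count
// (e.g. "1000 additional_special_tokens, 0 patched").
size_t overlay_control_flags(Tokenizer& tok, const std::vector<std::string>& specials) {
    size_t patched = 0;
    for (const auto& s : specials) {
        int id = tok.token_to_id(s);       // -1 if not in vocab
        if (id >= 0 && !tok.is_control(id)) {
            tok.mark_as_control(id);
            ++patched;
        }
    }
    return patched;
}
```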
2. Mistral V3-Tekken family detection
- New ChatTemplateFamily::MISTRAL_V3 enum value, with
[TOOL_CALLS] / [AVAILABLE_TOOLS] substring detection BEFORE the
LLAMA2 [INST] fallback. Older Mistral V1/V2 templates without
tool markers still detect as LLAMA2.
- Init / apply share the LLAMA2 path (same [INST] tokens). The
Jinja2 path handles tool-call rendering when present.
- Fixes: Mistral-Small-3.2 was previously misclassified as `llama2`
in `chat_template.cpp:246` despite Tekken-format jinja being loaded.
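The ordering is the whole fix, so here's a sketch of the detection logic, assuming the template text arrives as a `std::string`; the helper and any enum values beyond `MISTRAL_V3`/`LLAMA2` are illustrative, not the actual `chat_template.cpp` code.

```cpp
#include <string>

enum class ChatTemplateFamily { UNKNOWN, LLAMA2, MISTRAL_V3 /* , ... */ };

ChatTemplateFamily detect_family(const std::string& tmpl) {
    auto has = [&](const char* needle) {
        return tmpl.find(needle) != std::string::npos;
    };
    // V3-Tekken: tool markers only appear in the newer templates, so this
    // check must run BEFORE the generic [INST] fallback below.
    if (has("[TOOL_CALLS]") || has("[AVAILABLE_TOOLS]"))
        return ChatTemplateFamily::MISTRAL_V3;
    // Older Mistral V1/V2 (and Llama-2 proper) share the [INST] wrapper.
    if (has("[INST]"))
        return ChatTemplateFamily::LLAMA2;
    return ChatTemplateFamily::UNKNOWN;
}
```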
Tests
- SpecialTokensMapTest x3 (object form, plain-string form, missing)
- TokenizerControlTest x4 (lazy alloc, idempotent, invalid id, preserves
pre-existing types)
- ChatTemplateDetectTest x5 new (V3 ToolCalls, V3 AvailableTools,
OldMistralStillLlama2, ParseFamilyMistralV3 + name lookup)
- test-core 75/75, test-text 148/148, LlmCompressorE2E 3/3
Stacked on top of #74 (generation_config.json defaults).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Combines the rebased content of the previously-orphaned PRs #75 + #76, which got stuck behind a stacking issue when their base branches were merged/deleted. Both PRs' diffs land cleanly on top of current
`main` (which already has #74). Both commits are preserved as separate commits for reviewability:
- `special_tokens_map.json` overlay + Mistral V3-Tekken family (was feat(loader): special_tokens_map.json overlay + Mistral V3-Tekken family #75)
- `tokenizer_config.json` add_bos/add_eos/add_prefix_space plumbing (was feat(loader): plumb tokenizer_config.json add_bos/add_eos/add_prefix_space #76)

Why one PR
PRs #75 and #76 were independent topical changes, but they're small enough (and adjacent in scope) that one stacked review is cheaper than two separate threads. Each commit can be reviewed/cherry-picked individually.
1. special_tokens_map.json + Mistral V3
- `HFConfigLoader::SpecialTokensMap` parser: object form (Mistral) + plain-string form (Qwen)
- `Tokenizer::mark_as_control(id)` defensive method (lazy-allocates the type vector)
- Cross-checks `additional_special_tokens` against the tokenizer's CONTROL flag column; missing tags get patched
- `ChatTemplateFamily::MISTRAL_V3` enum value with `[TOOL_CALLS]`/`[AVAILABLE_TOOLS]` substring detection BEFORE the LLAMA2 `[INST]` fallback
- Chat template log: was `llama2` despite Tekken-format jinja; now logs `mistral_v3`

2. tokenizer_config.json author flags
- `HFConfigLoader::TokenizerFlags` struct + `load_tokenizer_flags()` parser (see the sketch below)
- Plumbs `add_bos_token`/`add_prefix_space` to the tokenizer (mirrors `gguf_loader.cpp:1335-1354`'s GGUF-side handling)
- Fixes: the author ships `add_bos_token: false`, but the pre-fix implementation auto-prepended `<|endoftext|>` (token 151643) to every prompt
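A minimal sketch of the tri-state flag parse, again assuming nlohmann::json; `std::optional` models "author set it" vs. "author left it unset" so unset fields can fall through to tokenizer.json metadata. The struct layout is an assumption; only the `TokenizerFlags` / `load_tokenizer_flags` names come from this PR.

```cpp
#include <nlohmann/json.hpp>
#include <optional>

struct TokenizerFlags {
    std::optional<bool> add_bos_token;     // unset => keep tokenizer.json default
    std::optional<bool> add_eos_token;
    std::optional<bool> add_prefix_space;
};

TokenizerFlags load_tokenizer_flags(const nlohmann::json& j) {
    TokenizerFlags f;
    auto read = [&](const char* key) -> std::optional<bool> {
        if (j.contains(key) && j[key].is_boolean())
            return j[key].get<bool>();
        return std::nullopt;               // author didn't set it
    };
    f.add_bos_token    = read("add_bos_token");
    f.add_eos_token    = read("add_eos_token");
    f.add_prefix_space = read("add_prefix_space");
    return f;
}
```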
Test plan
- `test-core` → 79/79 pass (was 72 on main; +3 SpecialTokensMapTest, +4 TokenizerFlagsTest)
- `test-text` → 148/148 pass (was 140; +5 ChatTemplateDetectTest, +4 TokenizerControlTest)
- `test-e2e --gtest_filter='LlmCompressorE2E.{Gemma4,Mistral}*'` → 3/3 pass (Gemma-4 NVFP4 coherent, Mistral-Small-3.2 NVFP4 "Paris" gate)

Per-model flag readouts:
- `add_bos=true add_eos=false add_prefix_space=unset` + `1000 additional_special_tokens, 0 patched`
- `add_bos=false add_eos=unset add_prefix_space=false`
- `add_bos=unset add_eos=unset add_prefix_space=unset` (falls through to existing tokenizer.json metadata)

Note on perf gate
`IMP_VERIFY_SKIP_PERF=1` documented skip: the baseline is currently stale on `main` itself (Qwen3-8B Q8_0 measures `tg=156` vs `258` baseline; same on this branch and on `main`). This PR doesn't touch any GGUF or compute code path.

Closes JSON coverage
With #74 already merged + this PR, every author-shipped HF config relevant for text-only inference is consumed:
- `config.json`, `model.safetensors.index.json`, `chat_template.jinja`, `recipe.yaml`, `hf_quant_config.json`, `tokenizer.json`, `tokenizer_config.json` (chat_template + added_tokens) — pre-existing
- `generation_config.json` — feat(loader): plumb generation_config.json sampling defaults #74
- `special_tokens_map.json` — this PR
- `tokenizer_config.json` (add_bos/add_eos/add_prefix_space) — this PR

⏭️ Out of scope (multimodal pipeline, separate effort):
`tokenizer.json::post_processor`, `tekken.json`, `processor_config.json`, `preprocessor_config.json`.

Closes #75, #76
🤖 Generated with Claude Code