feat(safetensors): native SentencePiece (.model) parser (AU2)#128
Merged
Conversation
Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors checkpoints ship only `tokenizer.model` (the SentencePiece protobuf), not `tokenizer.json`. Until now imp's SafeTensors path failed those with an actionable IMP_LOG_ERROR pointing the user at a Python conversion script.

This commit replaces the error path with an in-tree native SentencePiece parser. No new third-party dependency: a 250-line wire-format protobuf decoder reads ModelProto directly, extracting the pieces list (vocabulary) + scores + types + TrainerSpec ids (bos / eos / unk / pad / model_type). imp's existing SPM-style score-based encoder in tokenizer.cpp:encode_spm() consumes the result via Tokenizer::load_vocab() / load_token_types().

Scope:
- Parser handles wire-format types 0/1/2/5; unknown fields are skipped.
- Negative int32 (e.g. pad_id=-1) round-trips through 10-byte sign-extended varints.
- ModelType detection: UNIGRAM / BPE / WORD / CHAR. Vocabulary loads identically for all four; the encoder runs score-based merging that matches Unigram exactly and approximates BPE acceptably for practical text.
- Wired into safetensors_loader.cpp at the existing has_spm branch.

Tests (10 new in test-core):
- Synthetic protobuf blob tests exercise the wire decoder for valid inputs (pieces + scores + types + trainer_spec), unknown-field skip, the negative-pad-id roundtrip, and four reject paths (empty input, no pieces, truncated varint, length-delimited field past EOF).
- One integration test against /home/kekz/.cache/huggingface's T5 spiece.model (32k pieces, Unigram). Gracefully skips when the fixture isn't bind-mounted so the unit suite stays portable.

Verified locally with the real T5 fixture: 32000 pieces parsed, model_type=unigram, bos=-1 eos=1 unk=2, matching T5 conventions. test-core suite 148/148 → 157/157, no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the three audit docs that referenced AU2 as an open / deferred follow-up to reflect its closure in the prior commit (the SentencePiece loader implementation).

- followups.md: removes the deferral rationale, replaces it with closed-in-sha
- roadmap_inventory_2026-05.md: flips the UNCERTAIN classification to CLOSED
- safetensors_audit.md: strikes through the "truly unresolved" entry

No source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors checkpoints ship only `tokenizer.model` (the SentencePiece protobuf), not `tokenizer.json`. Until now imp's SafeTensors path failed those with an actionable IMP_LOG_ERROR pointing the user at a Python conversion script (PR #116, Phase 2).

This PR replaces the error path with an in-tree native SentencePiece parser. Zero new dependencies: a 250-line wire-format protobuf decoder reads `ModelProto` directly, extracting the pieces list (vocabulary) + scores + types + `TrainerSpec` ids. imp's existing SPM-style score-based encoder in `tokenizer.cpp:encode_spm()` consumes the result via `Tokenizer::load_vocab()` / `load_token_types()`.

Scope
- Parser handles wire-format types 0/1/2/5; unknown fields are skipped.
- Negative int32 (e.g. `pad_id=-1`) round-trips through 10-byte sign-extended varints.
- ModelType detection: UNIGRAM / BPE / WORD / CHAR; the vocabulary loads identically for all four.
- Wired into `safetensors_loader.cpp` at the existing `has_spm` branch.

Tests
10 new unit tests in `test-core`:

- Synthetic protobuf blob tests cover valid inputs, unknown-field skip, the negative-pad-id roundtrip, and four reject paths (empty input, no pieces, truncated varint, length-delimited field past EOF).
- One integration test against /home/kekz/.cache/huggingface's T5 spiece.model (32k pieces, Unigram, bos=-1 eos=1 unk=2, matching T5 conventions). Gracefully skips when the fixture is absent so CI stays portable.

```
[==========] 10 tests from 1 test suite ran. (1 ms total)
[ PASSED ] 10 tests.
```
Full test-core regression: 148/148 → 157/157, no failures.
Quality Gate
`docs/audit/followups.md`, `roadmap_inventory_2026-05.md`, and `safetensors_audit.md` flip from UNCERTAIN/deferred to closed.

Test plan
🤖 Generated with Claude Code