feat(safetensors): native SentencePiece (.model) parser (AU2)#128

Merged
github-actions[bot] merged 2 commits into main from feat/sentencepiece-loader
May 8, 2026
Conversation

@kekzl (Owner)

@kekzl kekzl commented May 8, 2026

Summary

Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors checkpoints ship only tokenizer.model (the SentencePiece protobuf), not tokenizer.json. Until now, imp's SafeTensors path rejected those checkpoints with an actionable IMP_LOG_ERROR pointing the user at a Python conversion script (PR #116, Phase 2).

This PR replaces the error path with an in-tree native SentencePiece parser. Zero new dependencies: a 250-line wire-format protobuf decoder reads ModelProto directly, extracting the pieces list (vocabulary) plus scores, types, and TrainerSpec ids. imp's existing SPM-style score-based encoder in tokenizer.cpp:encode_spm() consumes the result via Tokenizer::load_vocab() / load_token_types().

Scope

  • Wire-format types 0/1/2/5 supported; unknown fields are skipped without error.
  • Negative int32 values (e.g. pad_id=-1) round-trip through 10-byte sign-extended varints.
  • ModelType detection: UNIGRAM / BPE / WORD / CHAR. The vocabulary loads identically for all four; the encoder runs score-based merging that matches Unigram exactly and approximates BPE acceptably for practical text.
  • Wired into safetensors_loader.cpp at the existing has_spm branch.

Tests

10 new unit tests in test-core:

  • 5 happy paths: pieces+scores, TrainerSpec, ModelType=BPE, unknown-field skip, negative-pad-id round-trip
  • 4 reject paths: empty input, no pieces, truncated varint, length-delim past EOF
  • 1 integration test against /home/kekz/.cache/huggingface's T5 spiece.model (32k pieces, Unigram, bos=-1 eos=1 unk=2, matching T5 conventions). Gracefully skips when the fixture is absent so CI stays portable.

```
[==========] 10 tests from 1 test suite ran. (1 ms total)
[ PASSED ] 10 tests.
```

Full test-core regression: 148/148 → 157/157, no failures.

Quality Gate

  • ✅ Numerical correctness: real-world T5 32K-piece vocabulary parses cleanly, ids match T5 conventions
  • ✅ No regression: full test-core green
  • ✅ No new third-party deps: pure C++17 stdlib + POSIX mmap, no protobuf library, no sentencepiece dep
  • ✅ Error paths covered: 4 negative tests for malformed wire format
  • ✅ Documentation updated: docs/audit/followups.md, roadmap_inventory_2026-05.md, safetensors_audit.md flip from UNCERTAIN/deferred to closed
  • ✅ Root-cause oriented: implements the missing parser instead of papering over the failure

Test plan

  • `make build` — Docker build green
  • `docker run ... test-core --gtest_filter='SentencePieceLoader.*'` — 10/10 pass with bind-mounted HF cache
  • Full test-core regression — 157/157
  • Pre-push hook (verify-fast) — PASS, decode within baseline
  • CI matrix on push

🤖 Generated with Claude Code

kekzl and others added 2 commits May 8, 2026 09:31
Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors
checkpoints ship only `tokenizer.model` (the SentencePiece protobuf), not
`tokenizer.json`. Until now imp's SafeTensors path failed those with an
actionable IMP_LOG_ERROR pointing the user at a Python conversion script.

This commit replaces the error path with an in-tree native SentencePiece
parser. No new third-party dependency: a 250-line wire-format protobuf
decoder reads ModelProto directly, extracting the pieces list (vocabulary)
+ scores + types + TrainerSpec ids (bos / eos / unk / pad / model_type).
imp's existing SPM-style score-based encoder in tokenizer.cpp:encode_spm()
consumes the result via Tokenizer::load_vocab() / load_token_types().

Scope:
- Parser handles wire-format types 0/1/2/5; unknown fields skipped.
- Negative int32 (e.g. pad_id=-1) round-trips through 10-byte sign-extended
  varints.
- ModelType detection: UNIGRAM / BPE / WORD / CHAR. Vocabulary loads
  identically for all four; the encoder runs score-based merging that
  matches Unigram exactly and BPE acceptably for practical text.
- Wired into safetensors_loader.cpp at the existing has_spm branch.

Tests (10 new in test-core):
- Synthetic protobuf blob tests exercise the wire decoder for valid
  inputs (pieces + scores + types + trainer_spec), unknown-field skip,
  negative-pad-id roundtrip, and four reject paths (empty input, no
  pieces, truncated varint, length-delim past EOF).
- One integration test against /home/kekz/.cache/huggingface's T5
  spiece.model (32k pieces, Unigram). Gracefully skips when the fixture
  isn't bind-mounted so the unit suite stays portable.

Verified locally with the real T5 fixture: 32000 pieces parsed,
model_type=unigram, bos=-1 eos=1 unk=2 — matches T5 conventions.
test-core suite 148/148 → 157/157, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Updates the three audit docs that referenced AU2 as an open / deferred
follow-up to reflect its closure in the prior commit (the SentencePiece
loader implementation).

- followups.md: removes the deferral rationale, replaces with closed-in-sha
- roadmap_inventory_2026-05.md: flips UNCERTAIN classification to CLOSED
- safetensors_audit.md: strikethrough on the "truly unresolved" entry

No source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions[bot] enabled auto-merge (squash) May 8, 2026 07:33
@github-actions[bot] merged commit 394593a into main May 8, 2026
3 checks passed
@kekzl deleted the feat/sentencepiece-loader branch May 8, 2026 08:00