feat(safetensors): native SentencePiece (.model) parser (AU2)#128

Merged
github-actions[bot] merged 2 commits into main from feat/sentencepiece-loader
May 8, 2026
Conversation

@kekzl (Owner)

@kekzl kekzl commented May 8, 2026

Summary

Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors checkpoints ship only tokenizer.model (the SentencePiece protobuf), not tokenizer.json. Until now, imp's SafeTensors path rejected those checkpoints with an actionable IMP_LOG_ERROR pointing the user at a Python conversion script (PR #116, Phase 2).

This PR replaces the error path with an in-tree native SentencePiece parser. Zero new dependencies: a 250-line wire-format protobuf decoder reads ModelProto directly, extracting the pieces list (vocabulary) plus scores, types, and TrainerSpec ids. imp's existing SPM-style score-based encoder in tokenizer.cpp:encode_spm() consumes the result via Tokenizer::load_vocab() / load_token_types().

Scope

  • Wire-format types 0/1/2/5 supported; unknown fields are skipped without error.
  • Negative int32 values (e.g. pad_id=-1) round-trip through 10-byte sign-extended varints.
  • ModelType detection: UNIGRAM / BPE / WORD / CHAR. The vocabulary loads identically for all four; the encoder runs score-based merging that matches Unigram exactly and approximates BPE acceptably for practical text.
  • Wired into safetensors_loader.cpp at the existing has_spm branch.

Tests

10 new unit tests in test-core:

  • 5 happy paths: pieces+scores, TrainerSpec, ModelType=BPE, unknown-field skip, negative-pad-id round-trip
  • 4 reject paths: empty input, no pieces, truncated varint, length-delim past EOF
  • 1 integration test against /home/kekz/.cache/huggingface's T5 spiece.model (32k pieces, Unigram, bos=-1 eos=1 unk=2, matching T5 conventions). Gracefully skips when the fixture is absent so CI stays portable.

```
[==========] 10 tests from 1 test suite ran. (1 ms total)
[ PASSED ] 10 tests.
```

Full test-core regression: 148/148 → 157/157, no failures.

Quality Gate

  • ✅ Numerical correctness: real-world T5 32K-piece vocabulary parses cleanly, ids match T5 conventions
  • ✅ No regression: full test-core green
  • ✅ No new third-party deps: pure C++17 stdlib + POSIX mmap, no protobuf library, no sentencepiece dep
  • ✅ Error paths covered: 4 negative tests for malformed wire format
  • ✅ Documentation updated: docs/audit/followups.md, roadmap_inventory_2026-05.md, safetensors_audit.md flip from UNCERTAIN/deferred to closed
  • ✅ Root-cause oriented: implements the missing parser instead of papering over the failure

Test plan

  • `make build` — Docker build green
  • `docker run ... test-core --gtest_filter='SentencePieceLoader.*'` — 10/10 pass with bind-mounted HF cache
  • Full test-core regression — 157/157
  • Pre-push hook (verify-fast) — PASS, decode within baseline
  • CI matrix on push

🤖 Generated with Claude Code

kekzl and others added 2 commits May 8, 2026 09:31
Closes audit followup AU2. Older Llama 1/2 and some Mistral SafeTensors
checkpoints ship only `tokenizer.model` (the SentencePiece protobuf), not
`tokenizer.json`. Until now imp's SafeTensors path failed those with an
actionable IMP_LOG_ERROR pointing the user at a Python conversion script.

This commit replaces the error path with an in-tree native SentencePiece
parser. No new third-party dependency: a 250-line wire-format protobuf
decoder reads ModelProto directly, extracting the pieces list (vocabulary)
+ scores + types + TrainerSpec ids (bos / eos / unk / pad / model_type).
imp's existing SPM-style score-based encoder in tokenizer.cpp:encode_spm()
consumes the result via Tokenizer::load_vocab() / load_token_types().

Scope:
- Parser handles wire-format types 0/1/2/5; unknown fields skipped.
- Negative int32 (e.g. pad_id=-1) round-trips through 10-byte sign-extended
  varints.
- ModelType detection: UNIGRAM / BPE / WORD / CHAR. Vocabulary loads
  identically for all four; the encoder runs score-based merging that
  matches Unigram exactly and BPE acceptably for practical text.
- Wired into safetensors_loader.cpp at the existing has_spm branch.

Tests (10 new in test-core):
- Synthetic protobuf blob tests exercise the wire decoder for valid
  inputs (pieces + scores + types + trainer_spec), unknown-field skip,
  negative-pad-id roundtrip, and four reject paths (empty input, no
  pieces, truncated varint, length-delim past EOF).
- One integration test against /home/kekz/.cache/huggingface's T5
  spiece.model (32k pieces, Unigram). Gracefully skips when the fixture
  isn't bind-mounted so the unit suite stays portable.

Verified locally with the real T5 fixture: 32000 pieces parsed,
model_type=unigram, bos=-1 eos=1 unk=2 — matches T5 conventions.
test-core suite 148/148 → 157/157, no regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Updates the three audit docs that referenced AU2 as an open / deferred
follow-up to reflect its closure in the prior commit (the SentencePiece
loader implementation).

- followups.md: removes the deferral rationale, replaces with closed-in-sha
- roadmap_inventory_2026-05.md: flips UNCERTAIN classification to CLOSED
- safetensors_audit.md: strikethrough on the "truly unresolved" entry

No source changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions[bot] enabled auto-merge (squash) May 8, 2026 07:33
@github-actions[bot] merged commit 394593a into main May 8, 2026
3 checks passed
@kekzl deleted the feat/sentencepiece-loader branch May 8, 2026 08:00