Status: Closed
Labels: P0 (Critical priority), SSC (Shell Safety Classifier), enhancement (New feature or request)
Description
SSC-023: BPE Tokenizer — Load HuggingFace tokenizer.json
Priority: P0
Blocked by: —
Blocks: SSC-026 (training loop needs real tokenization)
Contract: provable-contracts/contracts/aprender/tokenizer-loading-v1.yaml
Problem
`BpeTokenizer::from_huggingface()` is declared but not implemented. `BpeTokenizer::new(config)` creates an empty vocab and merge list. Without it, we can only do byte-level tokenization (what the demo does), which destroys all pretrained knowledge from Qwen2.5-Coder.
What Exists
- `src/text/bpe/qwen2bpe_tokenizer.rs` has the `BpeConfig::qwen2()` preset (vocab_size=151,936, special tokens)
- `BpeTokenizer` struct has all fields (vocab HashMap, merges Vec, byte_encoder)
- Merge-rule priority system works
- `ShellVocabulary` (250 tokens) works for the v1 MLP but not for the v2 transformer
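The merge-rule priority system mentioned above follows the standard BPE scheme: each merge is ranked by its position in the merges list, and encoding repeatedly applies the lowest-ranked (highest-priority) adjacent merge until none apply. A minimal sketch of that loop (names and types are illustrative, not the actual `BpeTokenizer` internals):

```rust
use std::collections::HashMap;

// Repeatedly apply the lowest-ranked merge until no pair in the
// sequence appears in the merge table.
fn apply_merges(mut tokens: Vec<String>, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    loop {
        // Find the adjacent pair with the lowest merge rank.
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..tokens.len().saturating_sub(1) {
            let pair = (tokens[i].clone(), tokens[i + 1].clone());
            if let Some(&rank) = ranks.get(&pair) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge the winning pair in place.
                let merged = format!("{}{}", tokens[i], tokens[i + 1]);
                tokens[i] = merged;
                tokens.remove(i + 1);
            }
            None => return tokens,
        }
    }
}

fn main() {
    let mut ranks = HashMap::new();
    ranks.insert(("e".to_string(), "c".to_string()), 0); // rank 0 = highest priority
    ranks.insert(("ec".to_string(), "h".to_string()), 1);
    let toks: Vec<String> = ["e", "c", "h", "o"].iter().map(|s| s.to_string()).collect();
    let out = apply_merges(toks, &ranks);
    assert_eq!(out, vec!["ech".to_string(), "o".to_string()]);
}
```

Rank order matters: applying merges in any other order than ascending rank can produce different token sequences than the reference tokenizer.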
What's Missing
Loading from the HuggingFace tokenizer.json format:
- Parse JSON with `model.vocab` (HashMap<String, u32>), `model.merges` (Vec), and `added_tokens` (Vec)
- Populate `BpeTokenizer.vocab` and `BpeTokenizer.merges` from the parsed data
- Handle special tokens (151,643..151,935): `<|endoftext|>`, `<|im_start|>`, `<|im_end|>`, etc.
- Build the byte_encoder for UTF-8 byte-level BPE
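For orientation, the relevant shape of tokenizer.json looks roughly like the fragment below (abridged; entries and IDs beyond the three special tokens listed above are illustrative, and `merges` entries are space-separated pair strings in the Qwen2 file):

```json
{
  "model": {
    "type": "BPE",
    "vocab": { "!": 0, "\"": 1, "ĠĠ": 256 },
    "merges": ["Ġ Ġ", "ĠĠ ĠĠ"]
  },
  "added_tokens": [
    { "id": 151643, "content": "<|endoftext|>", "special": true },
    { "id": 151644, "content": "<|im_start|>", "special": true },
    { "id": 151645, "content": "<|im_end|>", "special": true }
  ]
}
```

Note that `added_tokens` entries carry their own IDs rather than continuing `model.vocab`, so the loader must insert them explicitly.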
Contract Invariants
| ID | Invariant | Description |
|---|---|---|
| F-TOK-001 | Roundtrip | decode(encode(text)) == text for all valid UTF-8 |
| F-TOK-002 | Special tokens | Special token IDs match config (e.g., `<\|endoftext\|>` = 151,643) |
| F-TOK-003 | Vocab size | tokenizer.vocab_size() == config.vocab_size (151,936 for Qwen2) |
| F-TOK-004 | Determinism | Same input always produces same token IDs |
| F-TOK-005 | Empty input | encode("") returns only special tokens (BOS/EOS if configured) |
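F-TOK-001 and F-TOK-004 can be exercised even before `from_huggingface()` lands, using the byte-level fallback (what the demo does today) as a stand-in encoder. A sketch with an illustrative stub, not the real `BpeTokenizer`:

```rust
// Byte-level stand-in: each UTF-8 byte becomes one token ID.
fn encode(text: &str) -> Vec<u32> {
    text.bytes().map(u32::from).collect()
}

fn decode(ids: &[u32]) -> String {
    let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
    String::from_utf8(bytes).expect("token IDs must decode to valid UTF-8")
}

fn main() {
    for text in ["", "echo $HOME", "héllo 日本語", "\u{1F600}"] {
        // F-TOK-001: decode(encode(text)) == text for all valid UTF-8
        assert_eq!(decode(&encode(text)), text);
        // F-TOK-004: same input always produces the same token IDs
        assert_eq!(encode(text), encode(text));
    }
}
```

The same assertions, re-pointed at the real tokenizer, become the contract falsification tests: any input for which either assertion fails falsifies the invariant.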
Acceptance Criteria
- `BpeTokenizer::from_huggingface("path/to/tokenizer.json")` loads real vocab + merges
- `encode("echo $HOME")` produces valid token IDs within the vocab range
- `decode(encode(text)) == text` for ASCII and UTF-8 inputs
- Special tokens preserved at correct IDs
- vocab_size matches BpeConfig::qwen2() (151,936)
- All existing BPE tests pass
- New tests: roundtrip, special tokens, edge cases (empty, long, unicode)
- Contract falsification tests pass
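The byte_encoder mentioned in the criteria follows the GPT-2 bytes-to-unicode scheme (which Qwen2's byte-level BPE inherits): printable byte values map to themselves, and the remaining bytes are remapped to code points 256, 257, ... so every byte has a visible character in the vocab. A sketch, assuming the standard GPT-2 mapping:

```rust
use std::collections::HashMap;

// GPT-2 style bytes-to-unicode: keep printable bytes as-is; remap the
// rest (in ascending byte order) to code points starting at 256, so no
// vocab entry contains an unprintable character.
fn build_byte_encoder() -> HashMap<u8, char> {
    let mut encoder = HashMap::new();
    let mut shift = 0u32;
    for b in 0u32..=255 {
        let printable = (b'!' as u32..=b'~' as u32).contains(&b) // '!'..'~'
            || (0xA1..=0xAC).contains(&b)                        // '¡'..'¬'
            || (0xAE..=0xFF).contains(&b);                       // '®'..'ÿ'
        let cp = if printable {
            b
        } else {
            let cp = 256 + shift;
            shift += 1;
            cp
        };
        encoder.insert(b as u8, char::from_u32(cp).unwrap());
    }
    encoder
}

fn main() {
    let enc = build_byte_encoder();
    assert_eq!(enc.len(), 256);         // every byte is covered
    assert_eq!(enc[&b'a'], 'a');        // printable bytes map to themselves
    assert_eq!(enc[&32u8], '\u{0120}'); // space (0x20) is remapped to 'Ġ'
}
```

The inverse map (byte_decoder) is the same table reversed; decoding uses it to recover raw bytes before UTF-8 validation, which is what makes the F-TOK-001 roundtrip hold for arbitrary input.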
References
- HuggingFace tokenizer.json format: https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B/blob/main/tokenizer.json
- `BpeConfig::qwen2()` in `src/text/bpe/qwen2bpe_tokenizer.rs`
- Spec: `bashrs/docs/specifications/shell-safety-inference.md`, Section 14