SSC-023: BPE tokenizer: implement from_huggingface() for Qwen2 151K vocab

## SSC-023: BPE Tokenizer — Load HuggingFace tokenizer.json

**Priority**: P0
**Blocked by**: —
**Blocks**: SSC-026 (training loop needs real tokenization)
**Contract**: `provable-contracts/contracts/aprender/tokenizer-loading-v1.yaml`

### Problem

`BpeTokenizer::from_huggingface()` is declared but **not implemented**, and `BpeTokenizer::new(config)` creates a tokenizer with empty vocab and merges. Until the loader exists, only byte-level tokenization is available (what the demo does). Byte-level IDs do not match the token IDs Qwen2.5-Coder was trained on, so the model's pretrained knowledge is lost.

### What Exists

- `src/text/bpe/qwen2bpe_tokenizer.rs` has `BpeConfig::qwen2()` preset (vocab_size=151,936, special tokens)
- `BpeTokenizer` struct has all fields (vocab HashMap, merges Vec, byte_encoder)
- Merge-rule priority system works
- `ShellVocabulary` (250 tokens) works for v1 MLP but not for v2 transformer

### What's Missing

Loading from HuggingFace `tokenizer.json` format:
1. Parse JSON with `model.vocab` (HashMap<String, u32>), `model.merges` (Vec<String>), `added_tokens` (Vec<AddedToken>)
2. Populate `BpeTokenizer.vocab` and `BpeTokenizer.merges` from parsed data
3. Handle special tokens (151,643..151,935): `<|endoftext|>`, `<|im_start|>`, `<|im_end|>`, etc.
4. Build byte_encoder for UTF-8 byte-level BPE
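
Steps 2 and 4 can be sketched with the standard library alone. The block below assumes that Qwen2 uses the GPT-2 style byte-to-unicode table (where a space becomes `Ġ`) and that each entry of `model.merges` is a single `"left right"` string; both should be checked against the real `tokenizer.json`. The function names `build_byte_encoder`, `build_byte_decoder`, `to_byte_level`, and `parse_merge` are hypothetical and not part of the existing `BpeTokenizer` API.

```rust
use std::collections::HashMap;

/// Build a GPT-2 style byte-to-char table (assumed to be what Qwen2 uses).
/// Printable Latin-1 bytes map to themselves; every other byte is assigned
/// the next free code point starting at U+0100, in increasing byte order.
fn build_byte_encoder() -> HashMap<u8, char> {
    let mut table = HashMap::with_capacity(256);
    let mut next_extra = 0u32;
    for b in 0u32..=255 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code_point = if printable {
            b
        } else {
            let cp = 256 + next_extra;
            next_extra += 1;
            cp
        };
        // Every code point produced here is below U+0144, so it is a valid char.
        let ch = char::from_u32(code_point).expect("valid code point");
        table.insert(b as u8, ch);
    }
    table
}

/// Invert the table so decoding can map chars back to raw bytes.
fn build_byte_decoder(encoder: &HashMap<u8, char>) -> HashMap<char, u8> {
    encoder.iter().map(|(&b, &c)| (c, b)).collect()
}

/// Rewrite UTF-8 text into the byte-level alphabet that vocab keys use.
fn to_byte_level(text: &str, encoder: &HashMap<u8, char>) -> String {
    text.bytes().map(|b| encoder[&b]).collect()
}

/// Split one merge entry of the assumed form "left right" into a pair.
/// Returns None when the entry does not contain exactly one space.
fn parse_merge(line: &str) -> Option<(String, String)> {
    let (left, right) = line.split_once(' ')?;
    if left.is_empty() || right.is_empty() || right.contains(' ') {
        return None;
    }
    Some((left.to_string(), right.to_string()))
}
```

With this table, `to_byte_level("echo $HOME", &build_byte_encoder())` yields `echoĠ$HOME`, which is the form that vocab keys and merge rules are expected to be written in. Some `tokenizer.json` files store merges as two-element arrays instead of strings, so the loader may need to accept both shapes.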

### Contract Invariants

| ID | Invariant | Description |
|----|-----------|-------------|
| F-TOK-001 | Roundtrip | `decode(encode(text)) == text` for all valid UTF-8 |
| F-TOK-002 | Special tokens | Special token IDs match config (e.g., `<|endoftext|>` = 151,643) |
| F-TOK-003 | Vocab size | `tokenizer.vocab_size() == config.vocab_size` (151,936 for Qwen2) |
| F-TOK-004 | Determinism | Same input always produces same token IDs |
| F-TOK-005 | Empty input | `encode("")` returns only special tokens (BOS/EOS if configured) |
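
The invariants above lend themselves to small reusable checks. The following is a minimal sketch, assuming a hypothetical `Tokenizer` trait with `encode`, `decode`, and `vocab_size`; the real `BpeTokenizer` signatures may differ. `ByteTokenizer` is a toy stand-in with no BOS/EOS configured, used only so the checks can run without a loaded vocab.

```rust
/// Hypothetical trait; the real BpeTokenizer API may differ.
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
    fn vocab_size(&self) -> usize;
}

/// Toy stand-in: one token per UTF-8 byte, no special tokens.
struct ByteTokenizer;

impl Tokenizer for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }

    fn decode(&self, ids: &[u32]) -> String {
        let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }

    fn vocab_size(&self) -> usize {
        256
    }
}

/// F-TOK-001: decode(encode(text)) == text for every sample.
fn check_roundtrip<T: Tokenizer>(tok: &T, samples: &[&str]) -> bool {
    samples.iter().all(|s| tok.decode(&tok.encode(s)) == *s)
}

/// F-TOK-004: encoding the same input twice gives identical IDs.
fn check_determinism<T: Tokenizer>(tok: &T, samples: &[&str]) -> bool {
    samples.iter().all(|s| tok.encode(s) == tok.encode(s))
}

/// Every produced ID must be a valid index into the vocab.
fn check_ids_in_range<T: Tokenizer>(tok: &T, samples: &[&str]) -> bool {
    samples
        .iter()
        .flat_map(|s| tok.encode(s))
        .all(|id| (id as usize) < tok.vocab_size())
}

/// F-TOK-005 when no BOS/EOS is configured: empty input gives no tokens.
fn check_empty_input<T: Tokenizer>(tok: &T) -> bool {
    tok.encode("").is_empty()
}
```

Once `from_huggingface()` exists, the same checks could be pointed at the loaded tokenizer by implementing the trait for it (or by calling its methods directly), with samples such as `"echo $HOME"`, an empty string, and non-ASCII text.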

### Acceptance Criteria

- [ ] `BpeTokenizer::from_huggingface("path/to/tokenizer.json")` loads real vocab + merges
- [ ] `encode("echo $HOME")` produces valid token IDs within vocab range
- [ ] `decode(encode(text)) == text` for ASCII and UTF-8 inputs
- [ ] Special tokens preserved at correct IDs
- [ ] `vocab_size()` matches `BpeConfig::qwen2()` (151,936)
- [ ] All existing BPE tests pass
- [ ] New tests: roundtrip, special tokens, edge cases (empty, long, unicode)
- [ ] Contract falsification tests pass

### References

- HuggingFace tokenizer.json format: https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B/blob/main/tokenizer.json
- `BpeConfig::qwen2()` in `src/text/bpe/qwen2bpe_tokenizer.rs`
- Spec: `bashrs/docs/specifications/shell-safety-inference.md` Section 14
