
SSC-023: BPE tokenizer: implement from_huggingface() for Qwen2 151K vocab #334

@noahgift

SSC-023: BPE Tokenizer — Load HuggingFace tokenizer.json

Priority: P0
Blocked by: —
Blocks: SSC-026 (training loop needs real tokenization)
Contract: provable-contracts/contracts/aprender/tokenizer-loading-v1.yaml

Problem

BpeTokenizer::from_huggingface() is declared but not implemented, and BpeTokenizer::new(config) creates a tokenizer with empty vocab and merges. Until the loader exists, the only option is byte-level tokenization (what the demo does), which produces token IDs that do not match the Qwen2.5-Coder vocabulary and so discards the model's pretrained knowledge.

What Exists

  • src/text/bpe/qwen2bpe_tokenizer.rs has BpeConfig::qwen2() preset (vocab_size=151,936, special tokens)
  • BpeTokenizer struct has all fields (vocab HashMap, merges Vec, byte_encoder)
  • Merge-rule priority system works
  • ShellVocabulary (250 tokens) works for v1 MLP but not for v2 transformer
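
For readers unfamiliar with how merge priority is usually applied in BPE, here is a minimal sketch. It is not the code in this repository: `apply_merges` and `merge_ranks` are hypothetical names, and the sketch assumes that a merge's position in the merges list is its priority (lower index merges first).

```rust
use std::collections::HashMap;

/// Build a rank table from an ordered merges list (hypothetical helper).
/// Earlier entries get lower ranks, i.e. higher priority.
fn merge_ranks(merges: &[(&str, &str)]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

/// Repeatedly merge the adjacent pair with the lowest rank until no
/// adjacent pair appears in the rank table.
fn apply_merges(
    mut symbols: Vec<String>,
    ranks: &HashMap<(String, String), usize>,
) -> Vec<String> {
    loop {
        // Track (rank, index) of the best pair seen in this pass.
        let mut best: Option<(usize, usize)> = None;
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                // Strict comparison keeps the leftmost pair on ties.
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        let i = match best {
            Some((_, i)) => i,
            None => break, // nothing left to merge
        };
        let merged = format!("{}{}", symbols[i], symbols[i + 1]);
        symbols[i] = merged;
        symbols.remove(i + 1);
    }
    symbols
}
```

This quadratic scan is chosen for readability; a production tokenizer would typically use a priority queue or cache results per word.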

What's Missing

Loading from HuggingFace tokenizer.json format:

  1. Parse JSON with model.vocab (HashMap<String, u32>), model.merges (Vec), added_tokens (Vec)
  2. Populate BpeTokenizer.vocab and BpeTokenizer.merges from parsed data
  3. Handle special tokens (151,643..151,935): <|endoftext|>, <|im_start|>, <|im_end|>, etc.
  4. Build byte_encoder for UTF-8 byte-level BPE
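
Steps 2 to 4 could look roughly like the sketch below, which uses only the standard library and starts from already-parsed data (JSON parsing is left out). Several things here are assumptions rather than facts from this issue: the byte encoder follows the GPT-2 style byte-to-unicode table, merges are assumed to be stored as space-separated strings (some tokenizer.json files store pairs instead), and `AddedToken`, `parse_merge`, and `build_vocab` are hypothetical names, not the existing BpeTokenizer API.

```rust
use std::collections::HashMap;

/// GPT-2 style byte-to-unicode table (assumed to apply to Qwen2):
/// printable bytes map to themselves, all other bytes map to code
/// points starting at U+0100, assigned in ascending byte order.
fn bytes_to_unicode() -> HashMap<u8, char> {
    let mut table = HashMap::with_capacity(256);
    let mut next = 0u32;
    for b in 0u32..=255 {
        let printable = (0x21..=0x7E).contains(&b)
            || (0xA1..=0xAC).contains(&b)
            || (0xAE..=0xFF).contains(&b);
        let code = if printable {
            b
        } else {
            let c = 256 + next;
            next += 1;
            c
        };
        table.insert(b as u8, char::from_u32(code).expect("valid code point"));
    }
    table
}

/// Parse one merge rule, assuming the "left right" string form.
fn parse_merge(line: &str) -> Option<(String, String)> {
    let (a, b) = line.split_once(' ')?;
    if a.is_empty() || b.is_empty() {
        return None;
    }
    Some((a.to_string(), b.to_string()))
}

/// Hypothetical stand-in for one entry of `added_tokens`.
struct AddedToken {
    id: u32,
    content: String,
}

/// Combine the base vocab with added (special) tokens, rejecting any
/// token or ID that would end up with two different assignments.
fn build_vocab(
    base: HashMap<String, u32>,
    added: &[AddedToken],
) -> Result<HashMap<String, u32>, String> {
    let mut vocab = base;
    // Reverse index so an ID already taken by another token is detected.
    let mut by_id: HashMap<u32, String> =
        vocab.iter().map(|(tok, &id)| (id, tok.clone())).collect();
    for tok in added {
        if let Some(existing) = by_id.get(&tok.id) {
            if existing != &tok.content {
                return Err(format!(
                    "id {} is used by both {:?} and {:?}",
                    tok.id, existing, tok.content
                ));
            }
        }
        if let Some(&old_id) = vocab.get(&tok.content) {
            if old_id != tok.id {
                return Err(format!(
                    "token {:?} has ids {} and {}",
                    tok.content, old_id, tok.id
                ));
            }
        }
        vocab.insert(tok.content.clone(), tok.id);
        by_id.insert(tok.id, tok.content.clone());
    }
    Ok(vocab)
}
```

Failing loudly on conflicting IDs, rather than silently overwriting, is a deliberate choice here: it surfaces a mismatched tokenizer.json at load time instead of as a wrong special-token ID later.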

Contract Invariants

| ID | Invariant | Description |
|----|-----------|-------------|
| F-TOK-001 | Roundtrip | decode(encode(text)) == text for all valid UTF-8 |
| F-TOK-002 | Special tokens | Special token IDs match config (e.g., `< |
| F-TOK-003 | Vocab size | tokenizer.vocab_size() == config.vocab_size (151,936 for Qwen2) |
| F-TOK-004 | Determinism | Same input always produces same token IDs |
| F-TOK-005 | Empty input | encode("") returns only special tokens (BOS/EOS if configured) |
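
One possible shape for a falsification harness covering F-TOK-001, F-TOK-003, and F-TOK-004, plus the "IDs within vocab range" check, is sketched below. The `Tokenizer` trait and `ByteTokenizer` are hypothetical and exist only to make the sketch self-contained; the real BpeTokenizer signatures may differ. F-TOK-002 and F-TOK-005 depend on the configured special tokens and are not covered.

```rust
/// Minimal interface the checks need (an assumption, not the real API).
trait Tokenizer {
    fn encode(&self, text: &str) -> Vec<u32>;
    fn decode(&self, ids: &[u32]) -> String;
    fn vocab_size(&self) -> usize;
}

/// Toy byte-level tokenizer used only to exercise the checks.
struct ByteTokenizer;

impl Tokenizer for ByteTokenizer {
    fn encode(&self, text: &str) -> Vec<u32> {
        text.bytes().map(u32::from).collect()
    }
    fn decode(&self, ids: &[u32]) -> String {
        let bytes: Vec<u8> = ids.iter().map(|&id| id as u8).collect();
        String::from_utf8_lossy(&bytes).into_owned()
    }
    fn vocab_size(&self) -> usize {
        256
    }
}

/// Return the first violated invariant, if any.
fn check_invariants<T: Tokenizer>(
    tok: &T,
    expected_vocab: usize,
    samples: &[&str],
) -> Result<(), String> {
    // F-TOK-003: vocab size matches the configured size.
    if tok.vocab_size() != expected_vocab {
        return Err(format!(
            "F-TOK-003: vocab_size {} != {}",
            tok.vocab_size(),
            expected_vocab
        ));
    }
    for &text in samples {
        let ids = tok.encode(text);
        // F-TOK-004: encoding the same input twice gives the same IDs.
        if ids != tok.encode(text) {
            return Err(format!("F-TOK-004: nondeterministic for {:?}", text));
        }
        // Every ID must fall inside the vocab range.
        if let Some(bad) = ids.iter().find(|&&id| (id as usize) >= expected_vocab) {
            return Err(format!("id {} out of range for {:?}", bad, text));
        }
        // F-TOK-001: decode(encode(text)) == text.
        if tok.decode(&ids) != text {
            return Err(format!("F-TOK-001: roundtrip failed for {:?}", text));
        }
    }
    Ok(())
}
```

With a real implementation, the same function could be called with the loaded tokenizer, the expected size of 151,936, and samples such as "echo $HOME", an empty string, a long input, and non-ASCII text.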

Acceptance Criteria

  • BpeTokenizer::from_huggingface("path/to/tokenizer.json") loads real vocab + merges
  • encode("echo $HOME") produces valid token IDs within vocab range
  • decode(encode(text)) == text for ASCII and UTF-8 inputs
  • Special tokens preserved at correct IDs
  • vocab_size matches BpeConfig::qwen2() (151,936)
  • All existing BPE tests pass
  • New tests: roundtrip, special tokens, edge cases (empty, long, unicode)
  • Contract falsification tests pass

Labels: P0 (Critical priority), SSC (Shell Safety Classifier), enhancement (New feature or request)
