
Overhaul microgpt training pipeline: BPE tokenizer, optimizer, and data quality#1023

Merged
aaylward merged 3 commits into main from caddddy
Feb 19, 2026

Conversation

@aaylward
Collaborator

Summary

  • Training infrastructure: Custom Adam optimizer with checkpoint/resume, batched training, vectorized updates, BF16 mixed-precision on Metal, LR warmup + linear decay, periodic checkpointing, --resume/--dtype/--checkpoint-every flags
  • BPE tokenization: Migrate from character-level to byte-level BPE (HuggingFace tokenizers crate) with configurable vocab size and chat special tokens
  • Data quality: Auto-trim trailing user turns with no assistant reply, --skip-long flag to exclude documents exceeding block_size, markdown-stripped OASST2 dataset variant
  • Inference: Suppress special tokens during generation, min_tokens guard, leading punctuation stripping, rolling context window with BOS restoration and dynamic generation budget
  • Serve: Input validation, chat prompt truncation with tokens_dropped reporting, BOS restoration matching CLI behavior
  • Documentation: Recommended parameters by model size, training loss targets, data cleanup and LR warmup guidance
  • Tests: Comprehensive coverage for tokenizer, data cleanup, token suppression, context window management, and training safety; fix for the flaky bpe_handles_unicode test

Test plan

  • bazel test //domains/ai/... — all 14 targets pass
  • bpe_handles_unicode verified stable over 10 consecutive runs
  • Manual chat testing with quick-trained models on OASST2 plain data

Made with Cursor

@aaylward aaylward enabled auto-merge (squash) February 19, 2026 22:06
Co-authored-by: Cursor <cursoragent@cursor.com>
@aaylward aaylward merged commit 0f0c6d0 into main Feb 19, 2026
9 checks passed
@aaylward aaylward deleted the caddddy branch February 19, 2026 22:29
aaylward and others added 2 commits February 21, 2026 23:21
…ta quality

Training infrastructure:
- Custom Adam optimizer with serializable state for checkpoint/resume
- Batched training with cross-entropy loss and masked padding
- Vectorized Adam update (~15 GPU kernel launches per step instead of ~15 × N_params)
- BF16 mixed-precision on Metal with F32 rmsnorm/log_softmax for stability
- Linear LR warmup (default 200 steps) followed by linear decay to zero
- Periodic checkpointing (--checkpoint-every) with atomic file staging
- --resume flag restores weights, optimizer state, and training config
- --dtype flag (f32/bf16/f16) with bf16 default on Metal
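The warmup-then-decay schedule above can be sketched as a single pure function. This is an illustrative helper, not the repository's actual code; the function name, base LR, and step counts are assumptions.

```rust
// Hypothetical schedule helper: linear warmup over `warmup` steps,
// then linear decay to zero at `total` steps. Illustrative only.
fn lr_at(step: usize, warmup: usize, total: usize, base_lr: f64) -> f64 {
    if step < warmup {
        // Ramp from base_lr/warmup up to base_lr at the end of warmup.
        base_lr * (step + 1) as f64 / warmup as f64
    } else {
        // Decay linearly to zero; clamp to zero past `total`.
        let remaining = total.saturating_sub(step) as f64;
        base_lr * remaining / (total - warmup) as f64
    }
}

fn main() {
    let base = 3e-4;
    println!("{}", lr_at(0, 200, 1000, base));    // tiny LR at step 0
    println!("{}", lr_at(199, 200, 1000, base));  // reaches base_lr
    println!("{}", lr_at(1000, 200, 1000, base)); // decayed to 0
}
```

The piecewise-linear shape keeps early Adam updates small (the default 200-step warmup) while still spending most of the run near the peak LR.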

Tokenization:
- Migrate from character-level to byte-level BPE (HuggingFace tokenizers)
- Configurable --vocab-size (default 4096), tokenizer saved as tokenizer.json
- Special tokens for chat: <bos>, <user>, <assistant>, <end_turn>
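The core BPE training loop boils down to repeatedly merging the most frequent adjacent pair. A toy sketch (not the HuggingFace `tokenizers` crate) — note the deterministic tie-break on the pair value, which is the kind of thing the flaky-test fix below depends on:

```rust
use std::collections::HashMap;

// Count adjacent token pairs and return the most frequent one,
// breaking frequency ties by pair value (smallest wins) so the
// result never depends on HashMap iteration order.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(pair, n)| (n, std::cmp::Reverse(pair)))
        .map(|(pair, _)| pair)
}

// Replace every occurrence of `pair` with a freshly minted token id.
fn merge(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Byte-level start: raw UTF-8 bytes as initial token ids 0..=255.
    let bytes: Vec<u32> = "abab".bytes().map(|b| b as u32).collect();
    let pair = most_frequent_pair(&bytes).unwrap(); // ('a','b') twice
    println!("{:?}", merge(&bytes, pair, 256));     // [256, 256]
}
```

Byte-level BPE never hits an out-of-vocabulary character: every input decomposes into the 256 base bytes, and the configured vocab size (default 4096) bounds how many merges are learned on top.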

Data quality:
- Auto-trim trailing user turns with no assistant reply (33% of OASST2)
- --skip-long flag to exclude documents exceeding block_size
- Add markdown-stripped variant of OASST2 English dataset
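The trailing-turn cleanup can be sketched as follows; the `Role` enum is illustrative and the real dataset schema may differ:

```rust
// Illustrative roles; the actual chat schema may carry text payloads.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    User,
    Assistant,
}

// Drop user turns at the end of a conversation that never received
// an assistant reply, so the model is never trained to stop after
// an unanswered question.
fn trim_trailing_user_turns(turns: &mut Vec<Role>) {
    while turns.last() == Some(&Role::User) {
        turns.pop();
    }
}

fn main() {
    let mut convo = vec![Role::User, Role::Assistant, Role::User];
    trim_trailing_user_turns(&mut convo);
    println!("{:?}", convo); // [User, Assistant]
}
```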

Inference:
- Suppress special tokens during generation (logits masked to -inf)
- min_tokens guard prevents immediate end-of-turn on first position
- Strip leading ASCII punctuation from chat output
- Rolling context window: re-prepend BOS after truncation, skip empty
  responses in history, dynamic generation budget with 64-token floor
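The logits masking and min_tokens guard together look roughly like this; the token ids and function shape are assumptions for the sketch, not the repository's API:

```rust
// Mask special tokens to -inf so they can never be sampled, and
// additionally suppress <end_turn> until `generated >= min_tokens`,
// preventing an immediate end-of-turn on the first position.
fn mask_logits(
    logits: &mut [f32],
    special: &[usize],   // ids to suppress unconditionally
    end_turn: usize,     // <end_turn> id, suppressed only early on
    generated: usize,
    min_tokens: usize,
) {
    for &id in special {
        logits[id] = f32::NEG_INFINITY;
    }
    if generated < min_tokens {
        logits[end_turn] = f32::NEG_INFINITY;
    }
}

fn main() {
    let mut logits = vec![0.5_f32; 8];
    // Pretend ids 0..=2 are <bos>/<user>/<assistant> and 3 is <end_turn>.
    mask_logits(&mut logits, &[0, 1, 2], 3, 0, 1);
    println!("{:?}", logits); // ids 0..=3 are -inf on the first position
}
```

Because softmax of negative infinity is exactly zero probability, masking is a hard guarantee rather than a soft penalty, regardless of temperature.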

Serve:
- Input validation for temperature, num_samples, max_tokens, messages
- Chat prompt truncation with tokens_dropped reporting
- BOS restoration and dynamic max_gen matching CLI behavior
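The request validation might look like the sketch below; the field names, bounds, and error format are assumptions, not the server's actual schema:

```rust
// Reject obviously malformed generation requests before touching the
// model. Bounds here (num_samples <= 16) are illustrative.
fn validate(temperature: f64, num_samples: usize, max_tokens: usize) -> Result<(), String> {
    if temperature < 0.0 || !temperature.is_finite() {
        return Err("temperature must be a non-negative finite number".into());
    }
    if num_samples == 0 || num_samples > 16 {
        return Err("num_samples must be in 1..=16".into());
    }
    if max_tokens == 0 {
        return Err("max_tokens must be positive".into());
    }
    Ok(())
}

fn main() {
    println!("{:?}", validate(0.8, 1, 128));  // Ok(())
    println!("{:?}", validate(-1.0, 1, 128)); // Err(...)
}
```

Validating up front keeps bad values (a NaN temperature, a zero token budget) from surfacing later as confusing sampling failures.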

CLI:
- --skip-long, --warmup-steps, --batch-size, --vocab-size flags
- export subcommand with --half for f16 inference models
- Data cleanup stats logged at load time

Documentation:
- Recommended parameters by model size, training loss targets
- Data cleanup, --skip-long, and LR warmup guidance

Tests:
- Comprehensive tokenizer, dataset, and chat data cleanup tests
- Token suppression, min_tokens guard, BOS restoration tests
- Dynamic max_gen, empty output handling, long-context integration
- Training safety and optimizer serialization tests
- Fix flaky bpe_handles_unicode test (deterministic merge frequency)

Co-authored-by: Cursor <cursoragent@cursor.com>