
Overhaul microgpt training pipeline: BPE tokenizer, optimizer, and data quality#1023

Merged
aaylward merged 3 commits into main from caddddy
Feb 19, 2026

Conversation

@aaylward
Collaborator

Summary

  • Training infrastructure: Custom Adam optimizer with checkpoint/resume, batched training, vectorized updates, BF16 mixed-precision on Metal, LR warmup + linear decay, periodic checkpointing, --resume/--dtype/--checkpoint-every flags
  • BPE tokenization: Migrate from character-level to byte-level BPE (HuggingFace tokenizers crate) with configurable vocab size and chat special tokens
  • Data quality: Auto-trim trailing user turns with no assistant reply, --skip-long flag to exclude documents exceeding block_size, markdown-stripped OASST2 dataset variant
  • Inference: Suppress special tokens during generation, min_tokens guard, leading punctuation stripping, rolling context window with BOS restoration and dynamic generation budget
  • Serve: Input validation, chat prompt truncation with tokens_dropped reporting, BOS restoration matching CLI behavior
  • Documentation: Recommended parameters by model size, training loss targets, data cleanup and LR warmup guidance
  • Tests: Comprehensive coverage for tokenizer, data cleanup, token suppression, context window management, and training safety; fix for the flaky bpe_handles_unicode test

Test plan

  • bazel test //domains/ai/... — all 14 targets pass
  • bpe_handles_unicode verified stable over 10 consecutive runs
  • Manual chat testing with quick-trained models on OASST2 plain data

Made with Cursor

@aaylward aaylward enabled auto-merge (squash) February 19, 2026 22:06
Co-authored-by: Cursor <cursoragent@cursor.com>
@aaylward aaylward merged commit 0f0c6d0 into main Feb 19, 2026
9 checks passed
@aaylward aaylward deleted the caddddy branch February 19, 2026 22:29
aaylward and others added 2 commits February 21, 2026 23:21
…ta quality

Training infrastructure:
- Custom Adam optimizer with serializable state for checkpoint/resume
- Batched training with cross-entropy loss and masked padding
- Vectorized Adam update (~15 GPU kernel launches per step instead of ~15 × N_params)
- BF16 mixed-precision on Metal with F32 rmsnorm/log_softmax for stability
- Linear LR warmup (default 200 steps) followed by linear decay to zero
- Periodic checkpointing (--checkpoint-every) with atomic file staging
- --resume flag restores weights, optimizer state, and training config
- --dtype flag (f32/bf16/f16) with bf16 default on Metal
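The warmup-then-decay schedule above can be sketched as a single pure function. This is an illustrative helper, not the repository's actual code; the function name, base LR, and step counts are assumptions.

```rust
// Hypothetical schedule helper: linear warmup over `warmup` steps,
// then linear decay to zero at `total` steps. Illustrative only.
fn lr_at(step: usize, warmup: usize, total: usize, base_lr: f64) -> f64 {
    if step < warmup {
        // Ramp from base_lr/warmup up to base_lr at the end of warmup.
        base_lr * (step + 1) as f64 / warmup as f64
    } else {
        // Decay linearly to zero; clamp to zero past `total`.
        let remaining = total.saturating_sub(step) as f64;
        base_lr * remaining / (total - warmup) as f64
    }
}

fn main() {
    let base = 3e-4;
    println!("{}", lr_at(0, 200, 1000, base));    // tiny LR at step 0
    println!("{}", lr_at(199, 200, 1000, base));  // reaches base_lr
    println!("{}", lr_at(1000, 200, 1000, base)); // decayed to 0
}
```

The piecewise-linear shape keeps early Adam updates small (the default 200-step warmup) while still spending most of the run near the peak LR.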

Tokenization:
- Migrate from character-level to byte-level BPE (HuggingFace tokenizers)
- Configurable --vocab-size (default 4096), tokenizer saved as tokenizer.json
- Special tokens for chat: <bos>, <user>, <assistant>, <end_turn>
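The core BPE training loop boils down to repeatedly merging the most frequent adjacent pair. A toy sketch (not the HuggingFace `tokenizers` crate) — note the deterministic tie-break on the pair value, which is the kind of thing the flaky-test fix below depends on:

```rust
use std::collections::HashMap;

// Count adjacent token pairs and return the most frequent one,
// breaking frequency ties by pair value (smallest wins) so the
// result never depends on HashMap iteration order.
fn most_frequent_pair(tokens: &[u32]) -> Option<(u32, u32)> {
    let mut counts: HashMap<(u32, u32), usize> = HashMap::new();
    for w in tokens.windows(2) {
        *counts.entry((w[0], w[1])).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by_key(|&(pair, n)| (n, std::cmp::Reverse(pair)))
        .map(|(pair, _)| pair)
}

// Replace every occurrence of `pair` with a freshly minted token id.
fn merge(tokens: &[u32], pair: (u32, u32), new_id: u32) -> Vec<u32> {
    let mut out = Vec::with_capacity(tokens.len());
    let mut i = 0;
    while i < tokens.len() {
        if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
            out.push(new_id);
            i += 2;
        } else {
            out.push(tokens[i]);
            i += 1;
        }
    }
    out
}

fn main() {
    // Byte-level start: raw UTF-8 bytes as initial token ids 0..=255.
    let bytes: Vec<u32> = "abab".bytes().map(|b| b as u32).collect();
    let pair = most_frequent_pair(&bytes).unwrap(); // ('a','b') twice
    println!("{:?}", merge(&bytes, pair, 256));     // [256, 256]
}
```

Byte-level BPE never hits an out-of-vocabulary character: every input decomposes into the 256 base bytes, and the configured vocab size (default 4096) bounds how many merges are learned on top.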

Data quality:
- Auto-trim trailing user turns with no assistant reply (33% of OASST2)
- --skip-long flag to exclude documents exceeding block_size
- Add markdown-stripped variant of OASST2 English dataset
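The trailing-turn cleanup can be sketched as follows; the `Role` enum is illustrative and the real dataset schema may differ:

```rust
// Illustrative roles; the actual chat schema may carry text payloads.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Role {
    User,
    Assistant,
}

// Drop user turns at the end of a conversation that never received
// an assistant reply, so the model is never trained to stop after
// an unanswered question.
fn trim_trailing_user_turns(turns: &mut Vec<Role>) {
    while turns.last() == Some(&Role::User) {
        turns.pop();
    }
}

fn main() {
    let mut convo = vec![Role::User, Role::Assistant, Role::User];
    trim_trailing_user_turns(&mut convo);
    println!("{:?}", convo); // [User, Assistant]
}
```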

Inference:
- Suppress special tokens during generation (logits masked to -inf)
- min_tokens guard prevents immediate end-of-turn on first position
- Strip leading ASCII punctuation from chat output
- Rolling context window: re-prepend BOS after truncation, skip empty
  responses in history, dynamic generation budget with 64-token floor
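The logits masking and min_tokens guard together look roughly like this; the token ids and function shape are assumptions for the sketch, not the repository's API:

```rust
// Mask special tokens to -inf so they can never be sampled, and
// additionally suppress <end_turn> until `generated >= min_tokens`,
// preventing an immediate end-of-turn on the first position.
fn mask_logits(
    logits: &mut [f32],
    special: &[usize],   // ids to suppress unconditionally
    end_turn: usize,     // <end_turn> id, suppressed only early on
    generated: usize,
    min_tokens: usize,
) {
    for &id in special {
        logits[id] = f32::NEG_INFINITY;
    }
    if generated < min_tokens {
        logits[end_turn] = f32::NEG_INFINITY;
    }
}

fn main() {
    let mut logits = vec![0.5_f32; 8];
    // Pretend ids 0..=2 are <bos>/<user>/<assistant> and 3 is <end_turn>.
    mask_logits(&mut logits, &[0, 1, 2], 3, 0, 1);
    println!("{:?}", logits); // ids 0..=3 are -inf on the first position
}
```

Because softmax of negative infinity is exactly zero probability, masking is a hard guarantee rather than a soft penalty, regardless of temperature.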

Serve:
- Input validation for temperature, num_samples, max_tokens, messages
- Chat prompt truncation with tokens_dropped reporting
- BOS restoration and dynamic max_gen matching CLI behavior
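The request validation might look like the sketch below; the field names, bounds, and error format are assumptions, not the server's actual schema:

```rust
// Reject obviously malformed generation requests before touching the
// model. Bounds here (num_samples <= 16) are illustrative.
fn validate(temperature: f64, num_samples: usize, max_tokens: usize) -> Result<(), String> {
    if temperature < 0.0 || !temperature.is_finite() {
        return Err("temperature must be a non-negative finite number".into());
    }
    if num_samples == 0 || num_samples > 16 {
        return Err("num_samples must be in 1..=16".into());
    }
    if max_tokens == 0 {
        return Err("max_tokens must be positive".into());
    }
    Ok(())
}

fn main() {
    println!("{:?}", validate(0.8, 1, 128));  // Ok(())
    println!("{:?}", validate(-1.0, 1, 128)); // Err(...)
}
```

Validating up front keeps bad values (a NaN temperature, a zero token budget) from surfacing later as confusing sampling failures.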

CLI:
- --skip-long, --warmup-steps, --batch-size, --vocab-size flags
- export subcommand with --half for f16 inference models
- Data cleanup stats logged at load time

Documentation:
- Recommended parameters by model size, training loss targets
- Data cleanup, --skip-long, and LR warmup guidance

Tests:
- Comprehensive tokenizer, dataset, and chat data cleanup tests
- Token suppression, min_tokens guard, BOS restoration tests
- Dynamic max_gen, empty output handling, long-context integration
- Training safety and optimizer serialization tests
- Fix flaky bpe_handles_unicode test (deterministic merge frequency)

Co-authored-by: Cursor <cursoragent@cursor.com>