…ta quality

Training infrastructure:
- Custom Adam optimizer with serializable state for checkpoint/resume
- Batched training with cross-entropy loss and masked padding
- Vectorized Adam update (~15 GPU kernels instead of ~15xN_params)
- BF16 mixed precision on Metal, with F32 rmsnorm/log_softmax for stability
- Linear LR warmup (default 200 steps) followed by linear decay to zero
- Periodic checkpointing (--checkpoint-every) with atomic file staging
- --resume flag restores weights, optimizer state, and training config
- --dtype flag (f32/bf16/f16), with bf16 the default on Metal

Tokenization:
- Migrate from character-level to byte-level BPE (HuggingFace tokenizers)
- Configurable --vocab-size (default 4096); tokenizer saved as tokenizer.json
- Special tokens for chat: <bos>, <user>, <assistant>, <end_turn>

Data quality:
- Auto-trim trailing user turns with no assistant reply (33% of OASST2)
- --skip-long flag to exclude documents exceeding block_size
- Add markdown-stripped variant of the OASST2 English dataset

Inference:
- Suppress special tokens during generation (logits masked to -inf)
- min_tokens guard prevents immediate end-of-turn on the first position
- Strip leading ASCII punctuation from chat output
- Rolling context window: re-prepend BOS after truncation, skip empty responses in history, dynamic generation budget with a 64-token floor

Serve:
- Input validation for temperature, num_samples, max_tokens, and messages
- Chat prompt truncation with tokens_dropped reporting
- BOS restoration and dynamic max_gen matching CLI behavior

CLI:
- --skip-long, --warmup-steps, --batch-size, and --vocab-size flags
- export subcommand with --half for f16 inference models
- Data cleanup stats logged at load time

Documentation:
- Recommended parameters by model size; training loss targets
- Data cleanup, --skip-long, and LR warmup guidance

Tests:
- Comprehensive tokenizer, dataset, and chat data cleanup tests
- Token suppression, min_tokens guard, and BOS restoration tests
- Dynamic max_gen, empty output handling, and long-context integration tests
- Training safety and optimizer serialization tests
- Fix flaky bpe_handles_unicode test (deterministic merge frequency)

Co-authored-by: Cursor <cursoragent@cursor.com>
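The warmup-then-decay schedule described above (linear ramp over the warmup steps, then linear decay to zero over the remaining steps) can be sketched as follows. This is an illustrative Python sketch, not code from this PR (the project does not appear to be written in Python); the function and parameter names are made up for the example.

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to base_lr over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Decay linearly over the remaining steps; clamp at zero.
    remaining = max(1, total_steps - warmup_steps)
    progress = (step - warmup_steps) / remaining
    return base_lr * max(0.0, 1.0 - progress)
```

With the defaults in this PR (200 warmup steps), the learning rate peaks at step 199 and then falls linearly, reaching zero at the final step.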
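"Atomic file staging" for checkpoints usually means writing to a temporary file in the same directory and then renaming it over the target, so a crash mid-write never corrupts an existing checkpoint. A minimal Python sketch of that pattern (again illustrative; the name `atomic_save` is invented for the example):

```python
import os
import tempfile

def atomic_save(payload: bytes, path: str) -> None:
    """Stage payload in a temp file, then atomically rename over path.

    The temp file lives in the destination directory so the final
    os.replace stays on one filesystem, where rename is atomic on POSIX.
    """
    dest_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp, path)     # atomic swap; old checkpoint survives a crash
    except BaseException:
        os.unlink(tmp)            # clean up the staging file on failure
        raise
```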
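The inference-side suppression works by setting the logits of unwanted token ids to -inf before sampling, so they can never be selected; the min_tokens guard additionally masks the end-of-turn token at early positions. A hedged sketch of that masking step, with invented names (`mask_logits`, `special_ids`) and plain Python lists standing in for whatever tensor type the project actually uses:

```python
NEG_INF = float("-inf")

def mask_logits(logits, special_ids, end_turn_id, position, min_tokens):
    """Return a copy of logits with suppressed tokens set to -inf.

    special_ids: tokens never emitted during generation (e.g. <bos>, <user>,
    <assistant>). end_turn_id is additionally suppressed until min_tokens
    tokens have been generated, preventing an immediate empty reply.
    """
    out = list(logits)
    for tok in special_ids:
        out[tok] = NEG_INF
    if position < min_tokens:
        out[end_turn_id] = NEG_INF
    return out
```

Because -inf survives softmax as a zero probability, this composes with both greedy and temperature sampling without any special-casing downstream.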
Summary
Test plan
- `bazel test //domains/ai/...`: all 14 targets pass
- `bpe_handles_unicode` verified stable over 10 consecutive runs