Added
Memory-mapped dataset support: TextDataset now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
Automatically enabled for ULTRA config (batch_size=1)
Tokens stored on disk, loaded on-demand
Reduces RAM usage from ~170MB to <10MB for a 42M-token dataset
9 comprehensive tests for mmap functionality
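The memory-mapped storage described above can be sketched as follows. This is a minimal illustration built on numpy.memmap, not the actual TextDataset code: the class name MmapTokenDataset, the file layout (a flat array of token IDs) and the int32 dtype are all assumptions. The int32 choice is at least consistent with the figures above, since 42M tokens at 4 bytes each come to about 168MB.

```python
import os
import tempfile

import numpy as np


class MmapTokenDataset:
    """Hypothetical stand-in for a memory-mapped token dataset (not the atlas TextDataset)."""

    def __init__(self, path, seq_len, dtype=np.int32):
        # np.memmap maps the file into virtual memory; pages are read from
        # disk only when the corresponding tokens are actually accessed.
        self.tokens = np.memmap(path, dtype=dtype, mode="r")
        self.seq_len = seq_len

    def __len__(self):
        # Number of complete, non-overlapping sequences in the file.
        return len(self.tokens) // self.seq_len

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)  # negative indexing, e.g. dataset[-1]
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.seq_len
        # Copy the slice so the caller gets a regular in-memory array.
        return np.array(self.tokens[start:start + self.seq_len])


# Write a small token file to disk, then read it back through the memory map.
path = os.path.join(tempfile.mkdtemp(), "tokens.bin")
np.arange(1000, dtype=np.int32).tofile(path)

dataset = MmapTokenDataset(path, seq_len=128)
first, last = dataset[0], dataset[-1]
```

Only the requested slice is copied into RAM on each access, so resident memory stays small regardless of the file size.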
Strict config loading in train.py: Training script now uses atlas.config.load_config() for type-safe, validated configuration loading instead of raw YAML dict access
Advanced training time estimator: New estimate_training_time.py script that performs comprehensive benchmarking with:
Thermal throttling detection
Checkpoint overhead measurement
Validation time estimation
First epoch overhead calculation
Timeline predictions with completion dates
No-resume flag: Added --no-resume flag to prevent duplicate checkpoint prompts in automated scripts
Built-in checkpoint saving: Trainer now handles checkpoint saving internally during training
Accepts checkpoint_manager and auto_save_interval in constructor
Saves checkpoints automatically at specified step intervals
Tracks current_epoch for checkpoint metadata
More reliable than the callback-based approach (no Python import cache issues)
Mid-epoch checkpoint callback: Trainer.train_epoch() still supports an optional step_callback parameter for extensibility
Callback function receives (trainer, loss) after each global step
Used for custom logic beyond checkpointing
5 comprehensive tests for callback functionality
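To show how built-in checkpoint saving and the optional step_callback can coexist in one training loop, here is a toy sketch. It is not the atlas Trainer. The parameter names checkpoint_manager, auto_save_interval, step_callback and the (trainer, loss) callback signature come from the entries above. The InMemoryCheckpointManager class, its save(step, epoch) signature, and the placeholder loss are assumptions made for illustration.

```python
class InMemoryCheckpointManager:
    """Hypothetical checkpoint manager that just records what it was asked to save."""

    def __init__(self):
        self.saved = []

    def save(self, step, epoch):
        self.saved.append({"step": step, "epoch": epoch})


class Trainer:
    """Toy trainer: a sketch of the control flow, not the atlas implementation."""

    def __init__(self, checkpoint_manager=None, auto_save_interval=100):
        self.checkpoint_manager = checkpoint_manager
        self.auto_save_interval = auto_save_interval
        self.global_step = 0
        self.current_epoch = 0

    def train_epoch(self, batches, step_callback=None):
        for batch in batches:
            loss = float(batch)  # placeholder for forward/backward/optimizer step
            self.global_step += 1
            # Built-in saving: runs inside the loop, right after the step counter
            # is updated, so it does not depend on any callback being wired up.
            if (self.checkpoint_manager is not None
                    and self.global_step % self.auto_save_interval == 0):
                self.checkpoint_manager.save(self.global_step, self.current_epoch)
            # Optional hook for custom logic beyond checkpointing.
            if step_callback is not None:
                step_callback(self, loss)
        self.current_epoch += 1


manager = InMemoryCheckpointManager()
trainer = Trainer(checkpoint_manager=manager, auto_save_interval=100)

losses = []
trainer.train_epoch(range(250), step_callback=lambda t, loss: losses.append(loss))
```

With 250 steps and an interval of 100, the manager is asked to save at steps 100 and 200, while the callback sees every step.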
Cache clearing utilities: New clear_all_cache.ps1 script for manual cache clearing
17 new comprehensive tests (total: 324 passing tests)
9 tests for memory-mapped dataset
3 tests for config loading validation
5 tests for step_callback functionality (mid-epoch checkpointing)
Changed
TrainingConfig now supports all YAML field names with proper aliases (max_grad_norm, scheduler_type, gradient_checkpointing, keep_checkpoints)
DataConfig enhanced with max_seq_len and num_workers support from YAML configs
All config field aliases properly synced in __post_init__ methods
Training script now uses attribute access (config.training.learning_rate) instead of dict access (config['training']['learning_rate'])
TextDataset.__getitem__ now supports negative indexing (e.g., dataset[-1])
Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
Trainer.__init__() now accepts checkpoint_manager and auto_save_interval parameters
Checkpoint saving happens directly in train_epoch() method after global step updates
Eliminates Python import cache issues that prevented callbacks from executing
More reliable and maintainable implementation
Pipeline scripts enhanced: Both .ps1 and .sh scripts now clear Python cache and use -B flag before training
Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (roughly every 4-5 minutes with the ULTRA config)
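The config changes in this section (YAML aliases synced in __post_init__, and attribute access replacing dict access) might look roughly like the sketch below. It is an assumption-heavy illustration, not the atlas.config code. max_grad_norm is an alias named above, but the canonical field name gradient_clip, the defaults, the validation rule and the config_from_dict helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    learning_rate: float = 3e-4
    gradient_clip: float = 1.0             # hypothetical canonical field name
    max_grad_norm: Optional[float] = None  # YAML alias named in the changelog

    def __post_init__(self):
        # Keep alias and canonical field in sync, whichever one was supplied.
        if self.max_grad_norm is not None:
            self.gradient_clip = self.max_grad_norm
        else:
            self.max_grad_norm = self.gradient_clip
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")


@dataclass
class Config:
    training: TrainingConfig


def config_from_dict(raw):
    """Hypothetical loader: turns a parsed YAML dict into typed config objects."""
    # Unknown keys raise TypeError here instead of being silently ignored.
    return Config(training=TrainingConfig(**raw.get("training", {})))


raw = {"training": {"learning_rate": 1e-4, "max_grad_norm": 0.5}}
config = config_from_dict(raw)

# Attribute access replaces raw["training"]["learning_rate"].
lr = config.training.learning_rate
```

A misspelled key in the YAML fails at load time in this sketch, whereas plain dict access would only fail (or silently fall back to a default) when the value is first read.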
Fixed
System freeze issue: The 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, preventing the system freeze during optimizer.step()
Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
Previously, with batch_size=1 and 166K sequences, no checkpoint was saved until the epoch completed (~166K steps)
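The scale of the missing-checkpoint problem can be worked out from the figures quoted in this changelog. The short calculation below uses only those numbers (166K sequences, batch_size=1, a 100-step interval, ~4-5 minutes per 100 steps). It assumes step time stays constant across the epoch, which thermal throttling or validation passes could easily violate.

```python
sequences = 166_000            # "166K sequences" from the entry above
batch_size = 1
interval = 100                 # new default checkpoint interval
minutes_per_interval = (4, 5)  # "~4-5 minutes" per 100 steps with ULTRA config

steps_per_epoch = sequences // batch_size
checkpoints_per_epoch = steps_per_epoch // interval

# Implied seconds per step and hours per epoch, derived only from the figures above.
seconds_per_step = [m * 60 / interval for m in minutes_per_interval]
hours_per_epoch = [s * steps_per_epoch / 3600 for s in seconds_per_step]

print(f"steps per epoch: {steps_per_epoch}")
print(f"checkpoints per epoch: {checkpoints_per_epoch}")
print(f"hours per epoch: {hours_per_epoch[0]:.0f} to {hours_per_epoch[1]:.0f}")
```

Under these assumptions an epoch takes on the order of 110 to 140 hours, so saving only at epoch end left several days of training exposed to a crash. The 100-step interval yields about 1,660 checkpoints per epoch.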