v1.1.0

juliuspleunes4 released this 07 Dec 02:13


afc5e86

[v1.1.0] - 2025-12-07 - Memory-Efficient Optimizer & Reliable Checkpointing

Added

8-bit optimizer support: Added the adamw8bit optimizer type, backed by bitsandbytes, which reduces optimizer-state memory by about 75%
- Stores momentum states in 8-bit precision instead of 32-bit
- Reduces optimizer memory from ~1.9GB to ~0.5GB for 655M parameter models
- Fixes system freeze issue that occurred at gradient accumulation boundary
- Requires bitsandbytes package (added to dependencies)
Multiple optimizer types: create_optimizer() now supports:
- adamw: Standard AdamW (2 momentum states, ~1.9GB for 655M params)
- adamw8bit: 8-bit AdamW (~0.5GB for 655M params, 75% memory reduction)
- sgd: SGD with Nesterov momentum (~0.95GB for 655M params, 50% memory reduction)
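The percentages above follow from how many state tensors each optimizer keeps and how many bits each value uses. The sketch below reproduces that arithmetic, taking the ~1.9GB AdamW figure from these notes as the baseline. The helper name `relative_state_memory` is hypothetical, and the calculation ignores any bookkeeping overhead that an 8-bit implementation may add.

```python
# Baseline quoted in these release notes: AdamW state for a 655M-parameter model.
BASELINE_GB = 1.9


def relative_state_memory(num_states, bits_per_value, base_states=2, base_bits=32):
    """Return state memory as a fraction of standard AdamW (2 states, 32-bit)."""
    return (num_states * bits_per_value) / (base_states * base_bits)


estimates = {
    "adamw": BASELINE_GB * relative_state_memory(2, 32),     # 2 states, 32-bit
    "adamw8bit": BASELINE_GB * relative_state_memory(2, 8),  # 2 states, 8-bit
    "sgd": BASELINE_GB * relative_state_memory(1, 32),       # 1 momentum buffer
}

for name, gigabytes in estimates.items():
    saving = 1 - gigabytes / BASELINE_GB
    print(f"{name}: ~{gigabytes:.2f} GB ({saving:.0%} less than adamw)")
```

The 8-bit case works out to roughly 0.48GB, which these notes round to ~0.5GB.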
Optimizer configuration fields: Added to TrainingConfig:
- optimizer_type: Select optimizer algorithm (default: "adamw")
- momentum: Momentum factor for SGD (default: 0.9)
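As a rough illustration of the two new fields, the sketch below models them on a small dataclass with the defaults listed above. `TrainingConfigSketch` and `SUPPORTED_OPTIMIZERS` are hypothetical names, not the project's actual `TrainingConfig`, and the validation rules are assumptions about what a sensible check might look like.

```python
from dataclasses import dataclass

# The three optimizer types named in these release notes.
SUPPORTED_OPTIMIZERS = ("adamw", "adamw8bit", "sgd")


@dataclass
class TrainingConfigSketch:
    """Hypothetical stand-in showing only the two new fields."""

    optimizer_type: str = "adamw"  # default per the release notes
    momentum: float = 0.9          # only meaningful for "sgd"

    def __post_init__(self):
        # Assumed validation: reject unknown optimizer names early.
        if self.optimizer_type not in SUPPORTED_OPTIMIZERS:
            raise ValueError(
                f"optimizer_type must be one of {SUPPORTED_OPTIMIZERS}, "
                f"got {self.optimizer_type!r}"
            )
        # Assumed validation: momentum cannot be negative.
        if self.momentum < 0:
            raise ValueError(f"momentum must be >= 0, got {self.momentum}")


default_cfg = TrainingConfigSketch()
sgd_cfg = TrainingConfigSketch(optimizer_type="sgd", momentum=0.95)

try:
    TrainingConfigSketch(optimizer_type="adam8")  # typo on purpose
    rejected = False
except ValueError:
    rejected = True
```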
Memory-mapped dataset support: TextDataset now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
- Automatically enabled for ULTRA config (batch_size=1)
- Tokens stored on disk, loaded on-demand
- Reduces RAM usage from ~170MB to <10MB for a 42M-token dataset
- 9 comprehensive tests for mmap functionality
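The idea behind memory-mapped storage can be sketched with the standard library alone: tokens live in a binary file, and only the requested slice is read into memory. `MmapTokenDataset` below is a hypothetical stand-in, not the project's `TextDataset`; the fixed-length, non-overlapping sequence layout and the native 4-byte integer encoding are assumptions made for the example.

```python
import mmap
import os
import tempfile
from array import array


class MmapTokenDataset:
    """Hypothetical dataset that reads fixed-length sequences from disk on demand."""

    def __init__(self, path, seq_len):
        self.seq_len = seq_len
        self._file = open(path, "rb")
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        # View the mapped bytes as native C ints without copying them into RAM.
        self._tokens = memoryview(self._mmap).cast("i")

    def __len__(self):
        # Trailing tokens that do not fill a whole sequence are ignored.
        return len(self._tokens) // self.seq_len

    def __getitem__(self, idx):
        if idx < 0:
            idx += len(self)  # allow negative indices such as dataset[-1]
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        start = idx * self.seq_len
        # Only this slice is materialized as Python objects.
        return list(self._tokens[start:start + self.seq_len])

    def close(self):
        self._tokens.release()
        self._mmap.close()
        self._file.close()


# Demo: write ten token ids to a temporary file, then read them back lazily.
tmp = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
with tmp:
    tmp.write(array("i", range(10)).tobytes())

dataset = MmapTokenDataset(tmp.name, seq_len=4)
num_sequences = len(dataset)
first = dataset[0]
last = dataset[-1]
dataset.close()
os.remove(tmp.name)
```

With ten tokens and a sequence length of four, the sketch exposes two sequences and drops the two leftover tokens.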
Strict config loading in train.py: Training script now uses atlas.config.load_config() for type-safe, validated configuration loading instead of raw YAML dict access
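To show what load-time validation buys over raw dict access, here is a minimal sketch that converts a parsed YAML dict into typed dataclasses and rejects unknown keys or wrong types before training would start. It does not reproduce `atlas.config.load_config()`; `TrainingSection`, `ConfigSketch`, and `build_section` are hypothetical, and only two illustrative fields are modeled.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainingSection:
    learning_rate: float
    batch_size: int = 1


@dataclass
class ConfigSketch:
    training: TrainingSection


def build_section(cls, raw):
    """Build a dataclass from a dict, failing fast on unknown keys or bad types."""
    known = {f.name for f in fields(cls)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown keys in {cls.__name__}: {sorted(unknown)}")
    section = cls(**raw)
    for f in fields(cls):
        value = getattr(section, f.name)
        if not isinstance(value, f.type):
            raise TypeError(
                f"{cls.__name__}.{f.name} expected {f.type.__name__}, "
                f"got {type(value).__name__}"
            )
    return section


# A dict as a YAML parser might return it.
raw = {"training": {"learning_rate": 0.0003, "batch_size": 1}}
config = ConfigSketch(training=build_section(TrainingSection, raw["training"]))
lr = config.training.learning_rate  # attribute access instead of dict lookups

# A string where a float is expected is caught at load time.
try:
    build_section(TrainingSection, {"learning_rate": "3e-4"})
    load_error = None
except TypeError as exc:
    load_error = str(exc)
```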
Advanced training time estimator: New estimate_training_time.py script that performs comprehensive benchmarking with:
- Thermal throttling detection
- Checkpoint overhead measurement
- Validation time estimation
- First epoch overhead calculation
- Timeline predictions with completion dates
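A timeline prediction of this kind reduces to adding the measured per-step time to the checkpoint and validation overheads. The sketch below shows that arithmetic only; `estimate_completion` and its parameters are hypothetical, it is not the logic of estimate_training_time.py, and every number in the example is illustrative rather than a benchmark result.

```python
from datetime import datetime, timedelta


def estimate_completion(start, total_steps, sec_per_step,
                        checkpoint_interval, sec_per_checkpoint,
                        num_validations=0, sec_per_validation=0.0):
    """Return a projected completion time from simple per-step measurements."""
    num_checkpoints = total_steps // checkpoint_interval
    total_seconds = (
        total_steps * sec_per_step
        + num_checkpoints * sec_per_checkpoint
        + num_validations * sec_per_validation
    )
    return start + timedelta(seconds=total_seconds)


# Illustrative inputs: 1000 steps at 3 s each, a 6 s checkpoint every
# 100 steps, and one 40 s validation pass.
eta = estimate_completion(
    start=datetime(2025, 12, 7, 0, 0, 0),
    total_steps=1000,
    sec_per_step=3.0,
    checkpoint_interval=100,
    sec_per_checkpoint=6.0,
    num_validations=1,
    sec_per_validation=40.0,
)
print(eta.isoformat())
```

A fuller estimator would also need to account for thermal throttling and first-epoch overhead, which this sketch leaves out.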
No-resume flag: Added --no-resume flag to prevent duplicate checkpoint prompts in automated scripts
Built-in checkpoint saving: Trainer now handles checkpoint saving internally during training
- Accepts checkpoint_manager and auto_save_interval in constructor
- Saves checkpoints automatically at specified step intervals
- Tracks current_epoch for checkpoint metadata
- More reliable than callback-based approach (no Python import cache issues)
Mid-epoch checkpoint callback: Trainer.train_epoch() still supports an optional step_callback parameter for extensibility
- Callback function receives (trainer, loss) after each global step
- Used for custom logic beyond checkpointing
- 5 comprehensive tests for callback functionality
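The interaction between built-in saving and the optional callback can be sketched with a toy training loop. `ToyTrainer` and `RecordingCheckpointManager` are hypothetical stand-ins, not the project's classes; only the parameter names (checkpoint_manager, auto_save_interval, step_callback) and the `(trainer, loss)` callback signature come from these notes, and the `save` method name is an assumption.

```python
class RecordingCheckpointManager:
    """Hypothetical manager that records when a save was requested."""

    def __init__(self):
        self.saved = []

    def save(self, trainer):
        self.saved.append((trainer.current_epoch, trainer.global_step))


class ToyTrainer:
    def __init__(self, checkpoint_manager=None, auto_save_interval=100):
        self.checkpoint_manager = checkpoint_manager
        self.auto_save_interval = auto_save_interval
        self.global_step = 0
        self.current_epoch = 0

    def train_epoch(self, batches, step_callback=None):
        for batch in batches:
            loss = float(batch)  # placeholder for forward/backward/optimizer.step()
            self.global_step += 1

            # Built-in saving: runs inside the loop, independent of any callback.
            if (self.checkpoint_manager is not None
                    and self.global_step % self.auto_save_interval == 0):
                self.checkpoint_manager.save(self)

            # Optional hook for custom logic beyond checkpointing.
            if step_callback is not None:
                step_callback(self, loss)
        self.current_epoch += 1


manager = RecordingCheckpointManager()
trainer = ToyTrainer(checkpoint_manager=manager, auto_save_interval=100)

losses = []
trainer.train_epoch(range(250), step_callback=lambda t, loss: losses.append(loss))
```

After 250 steps with an interval of 100, the manager has been asked to save at steps 100 and 200, and the callback has run once per step.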
Cache clearing utilities: New clear_all_cache.ps1 script for manual cache clearing
17 new comprehensive tests (total: 324 passing tests)
- 9 tests for memory-mapped dataset
- 3 tests for config loading validation
- 5 tests for step_callback functionality (mid-epoch checkpointing)

Changed

TrainingConfig now supports all YAML field names with proper aliases (max_grad_norm, scheduler_type, gradient_checkpointing, keep_checkpoints)
DataConfig now accepts max_seq_len and num_workers from YAML configs
All config field aliases properly synced in __post_init__ methods
Training script now uses attribute access (config.training.learning_rate) instead of dict access (config['training']['learning_rate'])
TextDataset.__getitem__ now supports negative indexing (e.g., dataset[-1])
Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
- Trainer.__init__() now accepts checkpoint_manager and auto_save_interval parameters
- Checkpoint saving happens directly in train_epoch() method after global step updates
- Eliminates Python import cache issues that prevented callbacks from executing
- More reliable and maintainable implementation
Pipeline scripts enhanced: Both .ps1 and .sh scripts now clear Python cache and use -B flag before training
Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (~4-5 minutes with ULTRA config)

Fixed

System freeze during optimizer.step(): The 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, which prevents the freeze
Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
- Previously, with batch_size=1 and 166K sequences, no checkpoints were saved until the epoch completed (~166K steps)
- Root cause: Python bytecode caching prevented updated callback code from loading
- Solution: Built checkpoint saving directly into Trainer.train_epoch() method
- Checkpoints now reliably save every N steps (default 100, configurable)
- Added Python cache clearing to pipeline scripts (-B flag prevents bytecode caching)
- Added -W ignore::SyntaxWarning to suppress PyTorch internal warnings
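The cache clearing step amounts to deleting every `__pycache__` directory under the project root before launching training. The pipeline scripts are written in PowerShell and shell; the function below is only a Python rendition of the same idea, and `clear_pycache` is a hypothetical name. The demo runs against a temporary directory so nothing real is deleted.

```python
import shutil
import tempfile
from pathlib import Path


def clear_pycache(root):
    """Delete all __pycache__ directories under root and return how many were removed."""
    removed = 0
    # Materialize the matches first so deletion does not disturb the traversal.
    for cache_dir in list(Path(root).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    return removed


# Demo on a throwaway tree with two cache directories.
with tempfile.TemporaryDirectory() as tmp_root:
    root = Path(tmp_root)
    (root / "pkg" / "__pycache__").mkdir(parents=True)
    (root / "pkg" / "__pycache__" / "mod.cpython-312.pyc").write_bytes(b"")
    (root / "pkg" / "sub" / "__pycache__").mkdir(parents=True)

    removed_count = clear_pycache(root)
    remaining = list(root.rglob("__pycache__"))
```

Running the interpreter with `-B` afterwards keeps new `.pyc` files from being written, so the directories are not recreated during the run.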
System freeze from RAM exhaustion: Memory-mapped storage prevents RAM exhaustion when training with large datasets
Python module caching: Pipeline scripts now clear __pycache__ and use -B flag to ensure latest code is always loaded
Config validation now happens at load time, catching errors before training starts
Duplicate num_workers field in DataConfig removed
Sequence length synchronization between max_seq_len and sequence_length
Duplicate checkpoint prompts in pipeline scripts eliminated with --no-resume flag
Tokenizer initialization in train.py (was incorrectly trying to access non-existent config.tokenizer)
