v1.1.0

@juliuspleunes4 released this 07 Dec 02:13 · 22 commits to main since this release

[v1.1.0] - 2025-12-07 - Memory-Efficient Optimizer & Reliable Checkpointing

Added

  • 8-bit optimizer support: Added the adamw8bit optimizer type, which uses bitsandbytes to cut optimizer-state memory by about 75%
    • Stores optimizer moment states in 8-bit precision instead of 32-bit
    • Reduces optimizer memory from ~1.9GB to ~0.5GB for 655M parameter models
    • Fixes the system freeze that occurred at the gradient accumulation boundary
    • Requires bitsandbytes package (added to dependencies)
  • Multiple optimizer types: create_optimizer() now supports:
    • adamw: Standard AdamW (two 32-bit moment states, ~1.9GB for 655M params)
    • adamw8bit: 8-bit AdamW (~0.5GB for 655M params, 75% memory reduction)
    • sgd: SGD with Nesterov momentum (~0.95GB for 655M params, 50% memory reduction)
  • Optimizer configuration fields: Added to TrainingConfig:
    • optimizer_type: Select optimizer algorithm (default: "adamw")
    • momentum: Momentum factor for SGD (default: 0.9)
  • Memory-mapped dataset support: TextDataset now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
    • Automatically enabled for ULTRA config (batch_size=1)
    • Tokens stored on disk, loaded on-demand
    • Reduces RAM usage from ~170MB to <10MB for a 42M-token dataset
    • 9 comprehensive tests for mmap functionality
  • Strict config loading in train.py: Training script now uses atlas.config.load_config() for type-safe, validated configuration loading instead of raw YAML dict access
  • Advanced training time estimator: New estimate_training_time.py script that performs comprehensive benchmarking with:
    • Thermal throttling detection
    • Checkpoint overhead measurement
    • Validation time estimation
    • First epoch overhead calculation
    • Timeline predictions with completion dates
  • No-resume flag: Added --no-resume flag to prevent duplicate checkpoint prompts in automated scripts
  • Built-in checkpoint saving: Trainer now handles checkpoint saving internally during training
    • Accepts checkpoint_manager and auto_save_interval in constructor
    • Saves checkpoints automatically at specified step intervals
    • Tracks current_epoch for checkpoint metadata
    • More reliable than callback-based approach (no Python import cache issues)
  • Mid-epoch step callback: Trainer.train_epoch() still supports an optional step_callback parameter for extensibility
    • Callback function receives (trainer, loss) after each global step
    • Used for custom logic beyond checkpointing
    • 5 comprehensive tests for callback functionality
  • Cache clearing utilities: New clear_all_cache.ps1 script for manual cache clearing
  • 17 new comprehensive tests (total: 324 passing tests)
    • 9 tests for memory-mapped dataset
    • 3 tests for config loading validation
    • 5 tests for step_callback functionality (mid-epoch checkpointing)
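
The memory-mapped storage and negative indexing described above can be sketched with the standard library alone. This is a rough illustration, not the actual TextDataset code: the class name MmapTokenDataset, the helper write_tokens, the flat binary file layout, and the native C int token width are all assumptions made for the example.

```python
import mmap
import os
import tempfile
from array import array


class MmapTokenDataset:
    """Hypothetical stand-in for a memory-mapped token dataset (not Atlas's TextDataset)."""

    def __init__(self, path, seq_len):
        self.seq_len = seq_len
        self._file = open(path, "rb")
        # Map the file read-only; the OS pages data in on demand.
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._raw = memoryview(self._mm)
        # View the bytes as native C ints without copying (assumed token width).
        self._tokens = self._raw.cast("i")
        self._count = len(self._tokens) // seq_len

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        # Support negative indices such as dataset[-1].
        if index < 0:
            index += self._count
        if not 0 <= index < self._count:
            raise IndexError("sequence index out of range")
        start = index * self.seq_len
        # Only this slice is copied into a Python list.
        return self._tokens[start:start + self.seq_len].tolist()

    def close(self):
        # Release the views before closing the map they point into.
        self._tokens.release()
        self._raw.release()
        self._mm.close()
        self._file.close()


def write_tokens(path, tokens):
    """Hypothetical helper: store token ids as a flat binary file of C ints."""
    with open(path, "wb") as handle:
        array("i", tokens).tofile(handle)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tokens.bin")
        write_tokens(path, range(10))
        dataset = MmapTokenDataset(path, seq_len=3)
        try:
            return len(dataset), dataset[0], dataset[-1]
        finally:
            dataset.close()
```

With ten tokens and seq_len=3 the sketch exposes three full sequences and drops the trailing token. A real implementation would more likely use an array library and return tensors.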

Changed

  • TrainingConfig now supports all YAML field names with proper aliases (max_grad_norm, scheduler_type, gradient_checkpointing, keep_checkpoints)
  • DataConfig now accepts max_seq_len and num_workers from YAML configs
  • All config field aliases properly synced in __post_init__ methods
  • Training script now uses attribute access (config.training.learning_rate) instead of dict access (config['training']['learning_rate'])
  • TextDataset.__getitem__ now supports negative indexing (e.g., dataset[-1])
  • Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
  • Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
    • Trainer.__init__() now accepts checkpoint_manager and auto_save_interval parameters
    • Checkpoint saving happens directly in train_epoch() method after global step updates
    • Eliminates Python import cache issues that prevented callbacks from executing
    • More reliable and maintainable implementation
  • Pipeline scripts enhanced: Both .ps1 and .sh scripts now clear Python cache and use -B flag before training
  • Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (~4-5 minutes with ULTRA config)
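
To illustrate the built-in checkpoint mechanism, here is a minimal sketch of a trainer that saves every auto_save_interval global steps and then calls an optional step_callback. Only the names checkpoint_manager, auto_save_interval, current_epoch, train_epoch and step_callback come from these notes. The classes SketchTrainer and RecordingCheckpointManager, the save() signature, and the list of precomputed losses standing in for real training steps are assumptions made for the example.

```python
class RecordingCheckpointManager:
    """Hypothetical manager that records save requests instead of writing files."""

    def __init__(self):
        self.saved = []

    def save(self, step, epoch, loss):
        self.saved.append((step, epoch, loss))


class SketchTrainer:
    """Illustrative trainer; not the actual Atlas Trainer."""

    def __init__(self, checkpoint_manager=None, auto_save_interval=100):
        self.checkpoint_manager = checkpoint_manager
        self.auto_save_interval = auto_save_interval
        self.global_step = 0
        self.current_epoch = 0

    def _maybe_save(self, loss):
        # Save directly in the training loop, with no dependency on a callback.
        if self.checkpoint_manager is None or self.auto_save_interval <= 0:
            return
        if self.global_step % self.auto_save_interval == 0:
            self.checkpoint_manager.save(self.global_step, self.current_epoch, loss)

    def train_epoch(self, losses, step_callback=None):
        # `losses` stands in for the per-step losses a real loop would compute.
        for loss in losses:
            self.global_step += 1
            self._maybe_save(loss)
            if step_callback is not None:
                # The callback receives (trainer, loss) after each global step.
                step_callback(self, loss)
        self.current_epoch += 1


def demo():
    manager = RecordingCheckpointManager()
    trainer = SketchTrainer(manager, auto_save_interval=100)
    seen_steps = []
    trainer.train_epoch(
        [0.5] * 250,
        step_callback=lambda t, loss: seen_steps.append(t.global_step),
    )
    saved_steps = [step for step, _, _ in manager.saved]
    return saved_steps, len(seen_steps)
```

In this sketch, 250 steps with an interval of 100 produce saves at steps 100 and 200, while the callback still fires on every step.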

Fixed

  • System freeze during optimizer.step(): The 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, which prevents the freeze
  • Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
    • Previously, with batch_size=1 and 166K sequences, no checkpoint was saved until the epoch completed (~166K steps)
    • Root cause: Python bytecode caching prevented updated callback code from loading
    • Solution: Built checkpoint saving directly into Trainer.train_epoch() method
    • Checkpoints now reliably save every N steps (default 100, configurable)
    • Added Python cache clearing to pipeline scripts (the -B flag stops Python from writing .pyc bytecode files)
    • Added -W ignore::SyntaxWarning to suppress PyTorch internal warnings
  • RAM exhaustion with large datasets: Memory-mapped storage prevents the system freeze caused by running out of RAM during training
  • Python module caching: Pipeline scripts now clear __pycache__ and use -B flag to ensure latest code is always loaded
  • Config validation now happens at load time, catching errors before training starts
  • Duplicate num_workers field in DataConfig removed
  • Sequence length settings max_seq_len and sequence_length are now kept in sync
  • Duplicate checkpoint prompts in pipeline scripts eliminated with --no-resume flag
  • Tokenizer initialization in train.py (previously tried to access a non-existent config.tokenizer)
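
The cache-clearing behavior of the pipeline scripts can be approximated in Python. This is a sketch under assumptions, not the contents of the .ps1 or .sh scripts: the function names clear_pycache and build_train_command are hypothetical, and the relative path "train.py" is a placeholder for wherever the training script lives. The -B and -W ignore::SyntaxWarning interpreter options are standard, and --no-resume is the flag named in these notes.

```python
import shutil
import sys
import tempfile
from pathlib import Path


def clear_pycache(root):
    """Delete every __pycache__ directory under root and return how many were removed."""
    removed = 0
    # Collect matches first so deletions do not disturb the directory walk.
    for cache_dir in list(Path(root).rglob("__pycache__")):
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed += 1
    return removed


def build_train_command(script="train.py", no_resume=True):
    """Build an interpreter command line; the script path is an assumed placeholder."""
    # -B: do not write .pyc files; -W: silence SyntaxWarning messages.
    command = [sys.executable, "-B", "-W", "ignore::SyntaxWarning", script]
    if no_resume:
        command.append("--no-resume")
    return command


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        cache_dir = Path(tmp) / "pkg" / "__pycache__"
        cache_dir.mkdir(parents=True)
        (cache_dir / "module.cpython-311.pyc").write_bytes(b"stale")
        removed = clear_pycache(tmp)
        return removed, cache_dir.exists(), build_train_command()
```

The resulting list could be passed to subprocess.run() by an automation script. Note that -B only stops new .pyc files from being written, which is why existing __pycache__ directories are removed first.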