Releases: juliuspleunes4/Atlas

v1.2.4

07 Dec 18:37

[v1.2.4] - 2025-12-07 - Persistent Best Checkpoint Tracking Across Sessions

Fixed

  • Persistent best checkpoint tracking: Fixed best checkpoint being overwritten by worse models across training sessions
    • Root cause: CheckpointManager initialized best_metric = float('inf') on every startup, never checking for existing atlas_best.pt
    • This caused a critical regression scenario: Session 1 saves loss 3.88 as best → Session 2 resumes from checkpoint with loss 5.0 → Session 2 thinks loss 4.5 is "best" → overwrites 3.88 with 4.5
    • CheckpointManager now loads existing best checkpoint on initialization:
      • Checks for {model_name}_best.pt file existence in checkpoint directory
      • Loads existing best checkpoint with torch.load(..., weights_only=False)
      • Extracts metadata['loss'] to initialize self.best_metric instead of float('inf')
      • Logs "Found existing best checkpoint with loss: X.XXXX" when found
    • train.py now initializes from persistent best metric:
      • Primary source: best_train_loss = checkpoint_manager.best_metric (loaded from existing best)
      • Fallback: If resumed checkpoint has better loss, use that instead
      • Logs initialization source for transparency
    • save_checkpoint now verifies improvement before overwriting:
      • Added comparison: if is_best and self.keep_best and metadata.loss < self.best_metric:
      • Only saves atlas_best.pt if new loss is strictly better than existing best
      • Updates self.best_metric only after successful save
      • Logs "[BEST] Saved new best checkpoint (loss: X.XXXX)" when overwriting
    • Best model now properly persists across all training sessions, preventing loss of superior checkpoints
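
The load-on-init and compare-before-overwrite behavior described above can be sketched as follows. This is a minimal illustration, not the project's CheckpointManager: the class name BestCheckpointTracker and the method save_if_best are hypothetical, and pickle stands in for torch.load(..., weights_only=False) so the sketch runs without PyTorch.

```python
import math
import pickle
import tempfile
from pathlib import Path


class BestCheckpointTracker:
    """Hypothetical stand-in for the best-checkpoint part of CheckpointManager."""

    def __init__(self, checkpoint_dir, model_name="atlas"):
        self.best_path = Path(checkpoint_dir) / f"{model_name}_best.pt"
        self.best_metric = math.inf
        # Reuse the loss of an existing best checkpoint instead of starting at inf.
        if self.best_path.exists():
            with open(self.best_path, "rb") as f:
                saved = pickle.load(f)  # assumption: pickle replaces torch.load here
            self.best_metric = saved["metadata"]["loss"]

    def save_if_best(self, state, loss):
        """Write the best checkpoint only when loss is strictly lower."""
        if not loss < self.best_metric:
            return False
        with open(self.best_path, "wb") as f:
            pickle.dump({"state": state, "metadata": {"loss": loss}}, f)
        self.best_metric = loss  # updated only after the save succeeded
        return True


# Replay the regression scenario: two sessions sharing one checkpoint directory.
with tempfile.TemporaryDirectory() as d:
    session1 = BestCheckpointTracker(d)
    session1.save_if_best({"step": 1400}, loss=3.88)
    session2 = BestCheckpointTracker(d)  # fresh start, same directory
    overwritten = session2.save_if_best({"step": 2000}, loss=4.5)
    restored = session2.best_metric
```

In this sketch the second session starts from 3.88 rather than infinity, so the 4.5 checkpoint is rejected and the earlier best file is left in place.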

[v1.2.3] - 2025-12-07 - Best Checkpoint Tracking for Step-Based Saves

Fixed

  • Best checkpoint creation at step intervals: Fixed best model checkpoint not being created during mid-epoch step saves
    • Root cause: Best loss tracking (is_best flag) only calculated at epoch boundaries, not at step-based checkpoints
    • Now tracks and updates best_train_loss in the step checkpoint callback using nonlocal
    • Best loss check happens on every training step, not just at epoch end
    • atlas_best.pt now created immediately when loss improves, even mid-epoch
    • Logs "[BEST] NEW BEST TRAINING LOSS" with improvement percentage when new best is found
    • Step checkpoint metadata now includes best_metric field with current best loss
    • Ensures best model is always preserved regardless of checkpoint cleanup (keep_checkpoints setting)
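
As a rough sketch of how a step callback can track the best loss with nonlocal: the factory make_step_callback and the (step, loss) signature are assumptions made for illustration, and only the "[BEST] NEW BEST TRAINING LOSS" prefix of the log line comes from the notes above.

```python
import math


def make_step_callback(initial_best=math.inf):
    """Build a callback that flags a new best loss at any step (hypothetical helper)."""
    best_train_loss = initial_best

    def on_step(step, loss):
        nonlocal best_train_loss  # rebind the enclosing variable, not a local copy
        is_best = loss < best_train_loss
        if is_best:
            if math.isfinite(best_train_loss):
                pct = (best_train_loss - loss) / best_train_loss * 100
                print(f"[BEST] NEW BEST TRAINING LOSS {loss:.4f} ({pct:.1f}% better)")
            best_train_loss = loss
        # best_metric travels with the step checkpoint metadata.
        return {"step": step, "loss": loss,
                "best_metric": best_train_loss, "is_best": is_best}

    return on_step


callback = make_step_callback()
history = [callback(step, loss) for step, loss in [(100, 5.0), (200, 4.2), (300, 4.6)]]
```

Without the nonlocal declaration, the assignment inside on_step would create a new local variable and the best loss would never carry over between steps.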

v1.2.2

07 Dec 15:28

[v1.2.2] - 2025-12-07 - Inference Script Fixes

Fixed

  • Inference script compatibility: Fixed multiple issues preventing inference from working
    • Fixed Tokenizer initialization: use encoding_name parameter instead of tokenizer_name
    • Fixed checkpoint config inference: use correct state dict keys (embeddings.token_embedding.embedding.weight and embeddings.positional_embedding.embedding.weight)
    • Fixed TextGenerator initialization: removed invalid tokenizer parameter
    • Added tokenizer parameter to generate_interactive() and generate_batch() functions
    • Pass tokenizer to generate_from_prompt() calls in both interactive and batch modes
    • Inference now works correctly with scripts/infer.py for both interactive and batch generation

v1.2.1

07 Dec 13:08

[v1.2.1] - 2025-12-07 - Best Model Checkpoint Resume Fix

Added

  • New test for checkpoint resume: Added test_best_checkpoint_restored_on_resume verifying best_metric restoration on resume (total: 326 passing tests)
    • Tests checkpoint saves best_metric correctly
    • Verifies loading checkpoint restores best_metric
    • Ensures subsequent saves only mark as "best" when loss improves from restored value
    • Prevents regression where best checkpoints weren't created after resume

Fixed

  • Best model tracking on resume: Fixed best checkpoint not being saved after resuming training
    • Root cause: best_train_loss was always reset to float('inf') on resume instead of being restored from checkpoint
    • Now restores best_train_loss from checkpoint metadata when resuming
    • Checkpoint metadata now saves best_train_loss (was saving None before when no validation data)
    • Ensures atlas_best.pt is created when training loss improves, even across multiple training sessions
    • Example: Resuming from step 600 now correctly recognizes if step 1400 has the best loss

v1.2.0

07 Dec 11:34

[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes

Added

  • Best model tracking without validation: Now saves atlas_best.pt based on training loss when no validation data provided
    • Tracks best training loss throughout training
    • Saves checkpoint whenever training loss improves
    • Logs improvement percentage
    • Ensures best model is always available even without validation set
  • New test for best model checkpoint: Added comprehensive test for training-loss-based best model tracking (total: 325 passing tests)

Fixed

  • Checkpoint resume double-prompt: Fixed pipeline script asking twice about resuming from checkpoint
    • Pipeline now passes --no-resume flag when user chooses not to resume
    • Eliminates duplicate prompts between pipeline and train.py
  • Epoch counter on resume: Fixed epoch display jumping to next epoch when resuming from checkpoint
    • Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
    • Initialize epoch = start_epoch - 1 to account for loop increment
  • Progress bar tracking: Fixed progress bar to show global steps instead of batches
    • Progress bar now displays correct position when resuming (e.g., 100/80000)
    • Eliminates confusing batch counter that resets each epoch
    • Shows consistent progress toward max_steps goal throughout training
    • Manual progress bar updates prevent jumping between batch and step counts
  • PowerShell checkpoint path quoting: Fixed checkpoint path being double-quoted in run_pipeline.ps1
    • Changed from building --resume "path" string to passing path variable directly
    • Resolves "unrecognized arguments" error when resuming
  • Ctrl+C interrupt handling: Fixed checkpoint not saving when pressing Ctrl+C during training
    • Added check_interrupt callback to trainer's batch loop
    • Trainer now checks for interrupt on every batch iteration
    • Saves checkpoint immediately when interrupt detected
    • Prevents duplicate interrupt messages
    • Avoids saving multiple checkpoints (epoch + step + interrupt)
  • Memory-mapped file cleanup: Fixed "file in use" warning when exiting training
    • Properly closes mmap file before unlinking
    • Silently ignores cleanup errors (OS temp directory handles cleanup)
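
The Ctrl+C handling above can be illustrated with a flag that the batch loop polls. This is a simplified sketch under assumptions: InterruptFlag, the toy train_epoch function, and the save_checkpoint(kind, step) signature are hypothetical, and only the check_interrupt name comes from the notes.

```python
import signal


class InterruptFlag:
    """Records Ctrl+C so the training loop can stop at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        self.requested = True  # no saving here; the loop decides when to save

    def __call__(self):
        return self.requested


def train_epoch(batches, check_interrupt, save_checkpoint):
    """Toy batch loop that polls check_interrupt once per batch."""
    step = 0
    for step, _batch in enumerate(batches, start=1):
        # ... forward and backward pass would run here ...
        if check_interrupt():
            save_checkpoint("interrupt", step)
            return True  # caller skips the epoch checkpoint and validation
    save_checkpoint("epoch", step)
    return False


saved = []
flag = InterruptFlag()


def batches():
    for i in range(10):
        if i == 3:
            flag.requested = True  # simulate Ctrl+C arriving during the 4th batch
        yield i


interrupted = train_epoch(batches(), flag, lambda kind, step: saved.append((kind, step)))
```

Returning early after the interrupt save is what keeps the run from writing an epoch or interval checkpoint on top of the interrupt checkpoint.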

Changed

  • Progress bar display: Training progress now shows global steps throughout entire training session
    • Before: Showed batches (0-166478 per epoch), reset each epoch
    • After: Shows global steps (100-80000), continuous across epochs
    • Time estimate shows seconds per global step instead of seconds per batch
    • Makes progress tracking more intuitive for long training runs
  • Interrupt checkpoint priority: When interrupted, only saves one checkpoint instead of multiple
    • Interrupt checkpoint takes priority over epoch and interval checkpoints
    • Skips epoch checkpoint and validation when interrupted
    • Reduces disk I/O and saves time when exiting
    • Clean, fast exit on Ctrl+C
  • Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
    • Keeps 5 recent auto-save checkpoints (every 100 global steps)
    • Automatically deletes older checkpoints to save disk space
    • Uses about 7.4GB more disk space than the previous setting (2 extra checkpoints at 3.68GB each)
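
A retention policy like the one described can be sketched as below. The function name prune_step_checkpoints and the atlas_step_N.pt naming scheme are assumptions for illustration; the actual file names used by the project may differ.

```python
import re
import tempfile
from pathlib import Path

STEP_PATTERN = re.compile(r"_step_(\d+)\.pt$")  # assumed naming scheme


def prune_step_checkpoints(checkpoint_dir, keep=5):
    """Delete all but the `keep` newest step checkpoints; return deleted names."""
    found = []
    for path in Path(checkpoint_dir).iterdir():
        match = STEP_PATTERN.search(path.name)
        if match:  # atlas_best.pt does not match, so it is never deleted
            found.append((int(match.group(1)), path))
    found.sort()  # oldest step first
    stale = found[:-keep] if keep > 0 else found
    for _, path in stale:
        path.unlink()
    return [path.name for _, path in stale]


with tempfile.TemporaryDirectory() as d:
    for step in range(100, 1100, 100):  # ten auto-saves, one every 100 steps
        (Path(d) / f"atlas_step_{step}.pt").touch()
    (Path(d) / "atlas_best.pt").touch()
    deleted = prune_step_checkpoints(d, keep=5)
    remaining = sorted(p.name for p in Path(d).iterdir())
```

Sorting by the parsed step number rather than by file name avoids treating step 1000 as older than step 600.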

v1.1.0

07 Dec 02:13

[v1.1.0] - 2025-12-07 - Memory-Efficient Optimizer & Reliable Checkpointing

Added

  • 8-bit optimizer support: Added adamw8bit optimizer type using bitsandbytes for 75% memory reduction
    • Stores momentum states in 8-bit precision instead of 32-bit
    • Reduces optimizer memory from ~1.9GB to ~0.5GB for 655M parameter models
    • Fixes system freeze issue that occurred at gradient accumulation boundary
    • Requires bitsandbytes package (added to dependencies)
  • Multiple optimizer types: create_optimizer() now supports:
    • adamw: Standard AdamW (2 momentum states, ~1.9GB for 655M params)
    • adamw8bit: 8-bit AdamW (~0.5GB for 655M params, 75% memory reduction)
    • sgd: SGD with Nesterov momentum (~0.95GB for 655M params, 50% memory reduction)
  • Optimizer configuration fields: Added to TrainingConfig:
    • optimizer_type: Select optimizer algorithm (default: "adamw")
    • momentum: Momentum factor for SGD (default: 0.9)
  • Memory-mapped dataset support: TextDataset now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
    • Automatically enabled for ULTRA config (batch_size=1)
    • Tokens stored on disk, loaded on-demand
    • Reduces RAM usage from ~170MB to <10MB for 42M token dataset
    • 9 comprehensive tests for mmap functionality
  • Strict config loading in train.py: Training script now uses atlas.config.load_config() for type-safe, validated configuration loading instead of raw YAML dict access
  • Advanced training time estimator: New estimate_training_time.py script that performs comprehensive benchmarking with:
    • Thermal throttling detection
    • Checkpoint overhead measurement
    • Validation time estimation
    • First epoch overhead calculation
    • Timeline predictions with completion dates
  • No-resume flag: Added --no-resume flag to prevent duplicate checkpoint prompts in automated scripts
  • Built-in checkpoint saving: Trainer now handles checkpoint saving internally during training
    • Accepts checkpoint_manager and auto_save_interval in constructor
    • Saves checkpoints automatically at specified step intervals
    • Tracks current_epoch for checkpoint metadata
    • More reliable than callback-based approach (no Python import cache issues)
  • Mid-epoch checkpoint callback: Trainer.train_epoch() still supports optional step_callback parameter for extensibility
    • Callback function receives (trainer, loss) after each global step
    • Used for custom logic beyond checkpointing
    • 5 comprehensive tests for callback functionality
  • Cache clearing utilities: New clear_all_cache.ps1 script for manual cache clearing
  • 17 new comprehensive tests (total: 324 passing tests)
    • 9 tests for memory-mapped dataset
    • 3 tests for config loading validation
    • 5 tests for step_callback functionality (mid-epoch checkpointing)
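
To illustrate the memory-mapped storage idea, here is a small standard-library sketch. It is not the project's TextDataset: the class name MmapTokenDataset, the uint32 on-disk layout, and the non-overlapping windows are all simplifying assumptions.

```python
import mmap
import os
import struct
import tempfile

TOKEN_BYTES = 4  # assumption: each token id stored as little-endian uint32


class MmapTokenDataset:
    """Fixed-length windows over a token file that stays on disk."""

    def __init__(self, tokens, seq_len):
        self.seq_len = seq_len
        fd, self.path = tempfile.mkstemp(suffix=".tokens")
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack(f"<{len(tokens)}I", *tokens))
        self._file = open(self.path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._count = len(tokens) // seq_len

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if index < 0:
            index += len(self)  # dataset[-1] is the last window
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.seq_len * TOKEN_BYTES
        raw = self._mm[start:start + self.seq_len * TOKEN_BYTES]  # read on demand
        return list(struct.unpack(f"<{self.seq_len}I", raw))

    def close(self):
        # Close the mapping and the file handle before unlinking.
        self._mm.close()
        self._file.close()
        try:
            os.unlink(self.path)
        except OSError:
            pass  # leave leftovers to the OS temp directory cleanup


dataset = MmapTokenDataset(list(range(10)), seq_len=4)
first, last = dataset[0], dataset[-1]
dataset.close()
```

Only the slice requested by __getitem__ is copied into memory, which is why RAM use stays small even when the token file is large.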

Changed

  • TrainingConfig now supports all YAML field names with proper aliases (max_grad_norm, scheduler_type, gradient_checkpointing, keep_checkpoints)
  • DataConfig enhanced with max_seq_len and num_workers support from YAML configs
  • All config field aliases properly synced in __post_init__ methods
  • Training script now uses attribute access (config.training.learning_rate) instead of dict access (config['training']['learning_rate'])
  • TextDataset.__getitem__ now supports negative indexing (e.g., dataset[-1])
  • Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
  • Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
    • Trainer.__init__() now accepts checkpoint_manager and auto_save_interval parameters
    • Checkpoint saving happens directly in train_epoch() method after global step updates
    • Eliminates Python import cache issues that prevented callbacks from executing
    • More reliable and maintainable implementation
  • Pipeline scripts enhanced: Both .ps1 and .sh scripts now clear Python cache and use -B flag before training
  • Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (~4-5 minutes with ULTRA config)

Fixed

  • System freeze issue: 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, preventing system freeze during optimizer.step()
  • Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
    • Previously, with batch_size=1 and 166K sequences, no checkpoints would save until epoch completed (~166K steps)
    • Root cause: Python bytecode caching prevented updated callback code from loading
    • Solution: Built checkpoint saving directly into Trainer.train_epoch() method
    • Checkpoints now reliably save every N steps (default 100, configurable)
    • Added Python cache clearing to pipeline scripts (-B flag prevents bytecode caching)
    • Added -W ignore::SyntaxWarning to suppress PyTorch internal warnings
  • System freeze issue: Memory-mapped storage prevents RAM exhaustion when training with large datasets
  • Python module caching: Pipeline scripts now clear __pycache__ and use -B flag to ensure latest code is always loaded
  • Config validation now happens at load time, catching errors before training starts
  • Duplicate num_workers field in DataConfig removed
  • Sequence length synchronization between max_seq_len and sequence_length
  • Duplicate checkpoint prompts in pipeline scripts eliminated with --no-resume flag
  • Tokenizer initialization in train.py (was incorrectly trying to access non-existent config.tokenizer)

v1.0.0

06 Dec 23:47

[v1.0.0] - 2025-12-07 - First Stable Release 🎉

Major Milestone: Atlas v1.0.0 represents the first complete, production-ready release of the from-scratch language model implementation.

🎯 Complete Features

Core Architecture (Phase 3):

  • Full decoder-only transformer architecture (GPT-style)
  • Multi-head self-attention with causal masking
  • Feed-forward networks with multiple activation functions (GELU, SiLU, ReLU)
  • Pre-norm architecture with residual connections
  • Learned positional embeddings
  • Weight tying between embeddings and output head
  • Gradient checkpointing for memory efficiency
  • 51 comprehensive model tests

Training Infrastructure (Phase 5):

  • Complete training loop with gradient accumulation
  • Learning rate scheduling (warmup + cosine decay)
  • Checkpoint management (step-based, epoch-based, best model)
  • Automatic checkpoint resumption with interactive prompts
  • Progress tracking and logging
  • Validation and evaluation
  • 62 training tests including auto-resume
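
A warmup plus cosine decay schedule can be written in a few lines. The sketch below assumes linear warmup and a decay to min_lr at max_steps; the function name lr_at_step and its parameters are illustrative and may not match the scheduler in the repository.

```python
import math


def lr_at_step(step, max_lr, warmup_steps, max_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # progress runs from 0.0 at the end of warmup to 1.0 at max_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a 100-step run with 10 warmup steps.
schedule = [lr_at_step(s, max_lr=3e-4, warmup_steps=10, max_steps=100)
            for s in (0, 9, 10, 55, 100)]
```

The learning rate rises to max_lr by the last warmup step, passes through the midpoint between max_lr and min_lr halfway through the decay, and reaches min_lr at max_steps.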

Data Pipeline (Phase 4):

  • Text dataset with sliding window tokenization
  • Multiple file format support (txt, JSONL)
  • Preprocessing utilities (cleaning, chunking, filtering)
  • Efficient data loading with PyTorch DataLoader
  • Train/validation splitting
  • 72 data pipeline tests

Configuration System (Phase 1):

  • YAML-based configuration
  • CLI override support
  • Multiple pre-configured model sizes (TINY to ULTRA)
  • Validation and type checking
  • 32 configuration tests

Tokenizer (Phase 2):

  • GPT-2 BPE tokenizer via tiktoken
  • Batch encoding/decoding
  • Special token handling
  • 27 tokenizer tests

Inference (Phase 6):

  • Text generation with sampling strategies
  • Temperature, top-k, top-p sampling
  • Interactive and batch modes
  • 33 inference tests
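
For readers unfamiliar with the sampling strategies listed above, the following pure-Python sketch shows how temperature, top-k and top-p filtering can be combined. It operates on a plain list of logits and is not taken from the project's TextGenerator; the function name sample_token and its signature are assumptions.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    if top_k is not None:
        ranked = ranked[:top_k]  # keep the k most likely tokens
    if top_p is not None:
        kept, cumulative = [], 0.0
        for prob, idx in ranked:
            kept.append((prob, idx))
            cumulative += prob
            if cumulative >= top_p:  # smallest prefix reaching mass top_p
                break
        ranked = kept

    ids = [i for _, i in ranked]
    probs = [p for p, _ in ranked]
    # choices() renormalizes the remaining weights for us.
    return rng.choices(ids, weights=probs, k=1)[0]


greedy_like = sample_token([1.0, 5.0, 2.0], top_k=1)
```

With top_k=1 the filter leaves a single candidate, so the call reduces to picking the highest-logit token; larger k or a top_p below 1.0 trades determinism for variety.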

Model Export (Phase 7):

  • GGUF format export
  • Float32 and Float16 quantization
  • Metadata embedding
  • 17 export tests

📊 Statistics

  • 307 passing tests across all components
  • 6 model configurations (40M to 500M parameters)
  • 10 comprehensive documentation files
  • Clean, modular codebase with 94%+ coverage on core modules

🎁 Model Configurations

Six production-ready configurations:

  • TINY (40M params): Testing and development
  • SMALL (124M params): GPT-2 Small equivalent
  • DEFAULT (350M params): Recommended, GPT-2 Medium equivalent
  • LARGE (500M params): Maximum quality
  • XLARGE (500M params): Memory-optimized
  • ULTRA (500M params): Extreme low-temperature operation

📚 Documentation

Complete documentation suite:

  • README.md - Project overview and quickstart
  • ROADMAP.md - Development plan and progress
  • CHANGELOG.md - This file
  • ARCHITECTURE.md - Technical deep-dive
  • CONTRIBUTING.md - Contribution guidelines
  • CODE_OF_CONDUCT.md - Community standards
  • SECURITY.md - Security policy
  • LICENSE_GUIDE.md - Licensing information
  • TESTING.md - Testing guide
  • FAQ.md - Frequently asked questions

🚀 Getting Started

git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas
.\scripts\run_pipeline.ps1  # Windows
./scripts/run_pipeline.sh   # Linux/Mac

πŸ™ Acknowledgments

This release represents the culmination of comprehensive development work across all phases of the project. Special thanks to all contributors and users who provided feedback during development.