Releases · juliuspleunes4/Atlas

07 Dec 18:37

juliuspleunes4

v1.2.4

2c78cf0

v1.2.4 Latest


[v1.2.4] - 2025-12-07 - Persistent Best Checkpoint Tracking Across Sessions

Fixed

Persistent best checkpoint tracking: Fixed best checkpoint being overwritten by worse models across training sessions
- Root cause: CheckpointManager initialized best_metric = float('inf') on every startup, never checking for existing atlas_best.pt
- This caused a critical regression scenario: Session 1 saves loss 3.88 as best → Session 2 resumes from checkpoint with loss 5.0 → Session 2 thinks loss 4.5 is "best" → overwrites 3.88 with 4.5
- CheckpointManager now loads existing best checkpoint on initialization:
  - Checks for {model_name}_best.pt file existence in checkpoint directory
  - Loads existing best checkpoint with torch.load(..., weights_only=False)
  - Extracts metadata['loss'] to initialize self.best_metric instead of float('inf')
  - Logs "Found existing best checkpoint with loss: X.XXXX" when found
- train.py now initializes from persistent best metric:
  - Primary source: best_train_loss = checkpoint_manager.best_metric (loaded from existing best)
  - Fallback: If resumed checkpoint has better loss, use that instead
  - Logs initialization source for transparency
- save_checkpoint now verifies improvement before overwriting:
  - Added comparison: if is_best and self.keep_best and metadata.loss < self.best_metric:
  - Only saves atlas_best.pt if new loss is strictly better than existing best
  - Updates self.best_metric only after successful save
  - Logs "[BEST] Saved new best checkpoint (loss: X.XXXX)" when overwriting
- Best model now properly persists across all training sessions, preventing loss of superior checkpoints
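The load-on-startup and strict-improvement behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual CheckpointManager: the class name BestTracker and the method save_if_best are hypothetical, and pickle stands in for torch.save / torch.load so the example runs without PyTorch.

```python
import math
import pickle
import tempfile
from pathlib import Path


class BestTracker:
    """Hypothetical stand-in for the best-checkpoint part of CheckpointManager."""

    def __init__(self, checkpoint_dir, model_name="atlas"):
        self.best_path = Path(checkpoint_dir) / f"{model_name}_best.pt"
        self.best_metric = math.inf
        # Recover the best loss from a previous session, if a best file exists.
        if self.best_path.exists():
            with open(self.best_path, "rb") as f:
                saved = pickle.load(f)  # stand-in for torch.load(...)
            self.best_metric = saved["metadata"]["loss"]

    def save_if_best(self, state, loss):
        """Overwrite the best file only on a strict improvement."""
        if not loss < self.best_metric:
            return False
        with open(self.best_path, "wb") as f:
            pickle.dump({"state": state, "metadata": {"loss": loss}}, f)
        # Update the in-memory metric only after the write succeeded.
        self.best_metric = loss
        return True


with tempfile.TemporaryDirectory() as ckpt_dir:
    session_1 = BestTracker(ckpt_dir)
    session_1.save_if_best({"step": 1400}, loss=3.88)

    session_2 = BestTracker(ckpt_dir)  # simulates a new process, same directory
    overwrote = session_2.save_if_best({"step": 1500}, loss=4.5)
    restored_best = session_2.best_metric
```

In this sketch the second session starts from 3.88 rather than infinity, so the 4.5 checkpoint is rejected and the earlier best file is left untouched.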

[v1.2.3] - 2025-12-07 - Best Checkpoint Tracking for Step-Based Saves

Fixed

Best checkpoint creation at step intervals: Fixed best model checkpoint not being created during mid-epoch step saves
- Root cause: Best loss tracking (is_best flag) only calculated at epoch boundaries, not at step-based checkpoints
- Now tracks and updates best_train_loss in the step checkpoint callback using nonlocal
- Best loss check happens on every training step, not just at epoch end
- atlas_best.pt now created immediately when loss improves, even mid-epoch
- Logs "[BEST] NEW BEST TRAINING LOSS" with improvement percentage when new best is found
- Step checkpoint metadata now includes best_metric field with current best loss
- Ensures best model is always preserved regardless of checkpoint cleanup (keep_checkpoints setting)
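As a rough illustration of tracking the best loss inside a step callback with nonlocal, consider the sketch below. The factory make_step_callback and the save_best argument are hypothetical names; only the (trainer, loss) callback signature comes from these release notes.

```python
import math


def make_step_callback(save_best, initial_best=math.inf):
    """Build a callback that tracks the best training loss across steps.

    `save_best` is a hypothetical function that writes the best checkpoint.
    """
    best_train_loss = initial_best

    def on_step(trainer, loss):
        nonlocal best_train_loss  # rebind the enclosing variable, not a new local
        if loss < best_train_loss:
            best_train_loss = loss
            save_best(trainer, loss)
        # Returned so step checkpoint metadata could record the current best.
        return best_train_loss

    return on_step


saved = []
callback = make_step_callback(lambda trainer, loss: saved.append(loss))
history = [callback(None, loss) for loss in (5.0, 4.5, 4.7, 3.88)]
```

Without the nonlocal declaration, the assignment inside on_step would create a new local variable and the comparison would fail with an UnboundLocalError.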


07 Dec 15:28

juliuspleunes4

v1.2.2

94d3d9e


[v1.2.2] - 2025-12-07 - Inference Script Fixes

Fixed

Inference script compatibility: Fixed multiple issues preventing inference from working
- Fixed Tokenizer initialization: use encoding_name parameter instead of tokenizer_name
- Fixed checkpoint config inference: use correct state dict keys (embeddings.token_embedding.embedding.weight and embeddings.positional_embedding.embedding.weight)
- Fixed TextGenerator initialization: removed invalid tokenizer parameter
- Added tokenizer parameter to generate_interactive() and generate_batch() functions
- Pass tokenizer to generate_from_prompt() calls in both interactive and batch modes
- Inference now works correctly with scripts/infer.py for both interactive and batch generation
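One way the checkpoint config inference could work is to read dimensions from the shapes of the two embedding weights named above. The sketch below assumes both weights are laid out as (rows, hidden size), as in torch.nn.Embedding; the function name infer_config and the returned field names are hypothetical, and fake tensors with only a shape attribute are used so the example runs without PyTorch.

```python
from types import SimpleNamespace

TOKEN_KEY = "embeddings.token_embedding.embedding.weight"
POSITION_KEY = "embeddings.positional_embedding.embedding.weight"


def infer_config(state_dict):
    """Read basic model dimensions from embedding weight shapes."""
    vocab_size, hidden_size = state_dict[TOKEN_KEY].shape
    max_seq_len, pos_hidden = state_dict[POSITION_KEY].shape
    if pos_hidden != hidden_size:
        raise ValueError("token and positional embedding widths differ")
    return {
        "vocab_size": vocab_size,
        "hidden_size": hidden_size,
        "max_seq_len": max_seq_len,
    }


# Fake tensors: only the .shape attribute matters for this sketch.
fake_state = {
    TOKEN_KEY: SimpleNamespace(shape=(50257, 1024)),
    POSITION_KEY: SimpleNamespace(shape=(1024, 1024)),
}
inferred = infer_config(fake_state)
```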


07 Dec 13:08

juliuspleunes4

v1.2.1

cd4f6aa


[v1.2.1] - 2025-12-07 - Best Model Checkpoint Resume Fix

Added

New test for checkpoint resume: Added test_best_checkpoint_restored_on_resume verifying best_metric restoration on resume (total: 326 passing tests)
- Tests that a checkpoint saves best_metric correctly
- Verifies that loading a checkpoint restores best_metric
- Ensures subsequent saves only mark as "best" when loss improves from restored value
- Prevents regression where best checkpoints weren't created after resume

Fixed

Best model tracking on resume: Fixed best checkpoint not being saved after resuming training
- Root cause: best_train_loss was always reset to float('inf') on resume instead of being restored from checkpoint
- Now restores best_train_loss from checkpoint metadata when resuming
- Checkpoint metadata now saves best_train_loss (previously it saved None when no validation data was provided)
- Ensures atlas_best.pt is created when training loss improves, even across multiple training sessions
- Example: Resuming from step 600 now correctly recognizes if step 1400 has the best loss


07 Dec 11:34

juliuspleunes4

v1.2.0

04853ab


[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes

Added

Best model tracking without validation: Now saves atlas_best.pt based on training loss when no validation data is provided
- Tracks best training loss throughout training
- Saves checkpoint whenever training loss improves
- Logs improvement percentage
- Ensures best model is always available even without validation set
New test for best model checkpoint: Added comprehensive test for training-loss-based best model tracking (total: 325 passing tests)

Fixed

Checkpoint resume double-prompt: Fixed pipeline script asking twice about resuming from checkpoint
- Pipeline now passes --no-resume flag when user chooses not to resume
- Eliminates duplicate prompts between pipeline and train.py
Epoch counter on resume: Fixed epoch display jumping to next epoch when resuming from checkpoint
- Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
- Initialize epoch = start_epoch - 1 to account for loop increment
Progress bar tracking: Fixed progress bar to show global steps instead of batches
- Progress bar now displays correct position when resuming (e.g., 100/80000)
- Eliminates confusing batch counter that resets each epoch
- Shows consistent progress toward max_steps goal throughout training
- Manual progress bar updates prevent jumping between batch and step counts
PowerShell checkpoint path quoting: Fixed checkpoint path being double-quoted in run_pipeline.ps1
- Changed from building --resume "path" string to passing path variable directly
- Resolves "unrecognized arguments" error when resuming
Ctrl+C interrupt handling: Fixed checkpoint not saving when pressing Ctrl+C during training
- Added check_interrupt callback to trainer's batch loop
- Trainer now checks for interrupt on every batch iteration
- Saves checkpoint immediately when interrupt detected
- Prevents duplicate interrupt messages
- Avoids saving multiple checkpoints (epoch + step + interrupt)
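The interrupt flow above might look roughly like this: a SIGINT handler only sets a flag, and the batch loop polls that flag through a check_interrupt callback. Everything except the check_interrupt name is a hypothetical sketch; the demo sets the flag directly instead of sending a real signal so that it runs anywhere.

```python
import signal


class InterruptFlag:
    """Records that Ctrl+C was pressed instead of raising mid-batch."""

    def __init__(self):
        self.requested = False

    def handle(self, signum, frame):
        self.requested = True


def install_handler(flag):
    """Register the flag for SIGINT (main thread only); returns the old handler."""
    return signal.signal(signal.SIGINT, flag.handle)


def train(batches, check_interrupt, save_checkpoint):
    """Hypothetical batch loop that polls the flag once per batch."""
    step = 0
    for step, _batch in enumerate(batches, start=1):
        # ... forward and backward pass would go here ...
        if check_interrupt():
            save_checkpoint(step, "interrupt")  # the only checkpoint written
            return step
    save_checkpoint(step, "epoch")
    return step


flag = InterruptFlag()
saved = []


def batches():
    for i in range(10):
        if i == 3:
            flag.handle(signal.SIGINT, None)  # simulate Ctrl+C arriving here
        yield i


stopped_at = train(
    batches(),
    check_interrupt=lambda: flag.requested,
    save_checkpoint=lambda step, kind: saved.append((step, kind)),
)
```

Returning right after the interrupt save is what keeps the epoch and interval checkpoints from also being written.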
Memory-mapped file cleanup: Fixed "file in use" warning when exiting training
- Properly closes mmap file before unlinking
- Silently ignores cleanup errors (OS temp directory handles cleanup)
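The close-before-unlink ordering can be sketched with the standard library as follows. The helper name cleanup_mmap is hypothetical, and the actual dataset code may manage its mapping differently.

```python
import contextlib
import mmap
import os
import tempfile


def cleanup_mmap(mm, file_obj, path):
    """Close the mapping and file handle first, then delete the file."""
    mm.close()
    file_obj.close()
    # Deletion is best-effort: a leftover temp file is left for the OS to clean up.
    with contextlib.suppress(OSError):
        os.unlink(path)


fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 16)

file_obj = open(path, "rb")
mm = mmap.mmap(file_obj.fileno(), 0, access=mmap.ACCESS_READ)
cleanup_mmap(mm, file_obj, path)
file_removed = not os.path.exists(path)
```

On Windows, deleting a file that is still mapped or open fails, which is why the order matters there.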

Changed

Progress bar display: Training progress now shows global steps throughout entire training session
- Before: Showed batches (0-166478 per epoch), reset each epoch
- After: Shows global steps (100-80000), continuous across epochs
- Time estimate shows seconds per global step instead of seconds per batch
- Makes progress tracking more intuitive for long training runs
Interrupt checkpoint priority: When interrupted, only saves one checkpoint instead of multiple
- Interrupt checkpoint takes priority over epoch and interval checkpoints
- Skips epoch checkpoint and validation when interrupted
- Reduces disk I/O and saves time when exiting
- Clean, fast exit on Ctrl+C
Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
- Keeps 5 recent auto-save checkpoints (every 100 global steps)
- Automatically deletes older checkpoints to save disk space
- Retaining two additional checkpoints uses ~7.4GB more disk space (at 3.68GB per checkpoint)
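A retention policy like this one could be implemented along the following lines. The file name pattern atlas_step_<N>.pt and the function prune_step_checkpoints are assumptions for illustration; the real naming scheme may differ.

```python
import tempfile
from pathlib import Path


def prune_step_checkpoints(checkpoint_dir, keep=5, pattern="atlas_step_*.pt"):
    """Delete all but the `keep` highest-numbered step checkpoints."""

    def step_of(path):
        # "atlas_step_300.pt" -> 300 (hypothetical naming scheme)
        return int(path.stem.rsplit("_", 1)[1])

    checkpoints = sorted(Path(checkpoint_dir).glob(pattern), key=step_of)
    stale = checkpoints[:-keep] if keep > 0 else checkpoints
    for path in stale:
        path.unlink()
    return [path.name for path in stale]


with tempfile.TemporaryDirectory() as ckpt_dir:
    for step in range(100, 900, 100):
        (Path(ckpt_dir) / f"atlas_step_{step}.pt").touch()
    (Path(ckpt_dir) / "atlas_best.pt").touch()

    deleted = prune_step_checkpoints(ckpt_dir, keep=5)
    remaining = sorted(p.name for p in Path(ckpt_dir).iterdir())
```

Because the glob pattern only matches step checkpoints, the best checkpoint is never a candidate for deletion in this sketch.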


07 Dec 02:13

juliuspleunes4

v1.1.0

afc5e86


[v1.1.0] - 2025-12-07 - Memory-Efficient Optimizer & Reliable Checkpointing

Added

8-bit optimizer support: Added adamw8bit optimizer type using bitsandbytes for 75% memory reduction
- Stores momentum states in 8-bit precision instead of 32-bit
- Reduces optimizer memory from ~1.9GB to ~0.5GB for 655M parameter models
- Fixes system freeze issue that occurred at gradient accumulation boundary
- Requires bitsandbytes package (added to dependencies)
Multiple optimizer types: create_optimizer() now supports:
- adamw: Standard AdamW (2 momentum states, ~1.9GB for 655M params)
- adamw8bit: 8-bit AdamW (~0.5GB for 655M params, 75% memory reduction)
- sgd: SGD with Nesterov momentum (~0.95GB for 655M params, 50% memory reduction)
Optimizer configuration fields: Added to TrainingConfig:
- optimizer_type: Select optimizer algorithm (default: "adamw")
- momentum: Momentum factor for SGD (default: 0.9)
Memory-mapped dataset support: TextDataset now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
- Automatically enabled for ULTRA config (batch_size=1)
- Tokens stored on disk, loaded on-demand
- Reduces RAM usage from ~170MB to <10MB for 42M token dataset
- 9 comprehensive tests for mmap functionality
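To illustrate the idea of keeping tokens on disk and loading them on demand, here is a minimal standard-library sketch. It is not the actual TextDataset: the class name MmapTokens, the non-overlapping windows, and the native unsigned-int storage format are all assumptions.

```python
import mmap
import os
import tempfile
from array import array


class MmapTokens:
    """Hypothetical sketch: fixed-length windows read from a token file on disk."""

    def __init__(self, path, seq_len):
        self.seq_len = seq_len
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._tokens = memoryview(self._mm).cast("I")  # native unsigned ints

    def __len__(self):
        return len(self._tokens) // self.seq_len

    def __getitem__(self, index):
        if index < 0:  # negative indexing, e.g. dataset[-1]
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.seq_len
        # Only this slice is copied into RAM.
        return self._tokens[start:start + self.seq_len].tolist()

    def close(self):
        self._tokens.release()  # views must be released before the map closes
        self._mm.close()
        self._file.close()


fd, path = tempfile.mkstemp(suffix=".tokens")
with os.fdopen(fd, "wb") as f:
    array("I", range(12)).tofile(f)

dataset = MmapTokens(path, seq_len=4)
num_windows = len(dataset)
first_window = dataset[0]
last_window = dataset[-1]
dataset.close()
os.unlink(path)
```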
Strict config loading in train.py: Training script now uses atlas.config.load_config() for type-safe, validated configuration loading instead of raw YAML dict access
Advanced training time estimator: New estimate_training_time.py script that performs comprehensive benchmarking with:
- Thermal throttling detection
- Checkpoint overhead measurement
- Validation time estimation
- First epoch overhead calculation
- Timeline predictions with completion dates
No-resume flag: Added --no-resume flag to prevent duplicate checkpoint prompts in automated scripts
Built-in checkpoint saving: Trainer now handles checkpoint saving internally during training
- Accepts checkpoint_manager and auto_save_interval in constructor
- Saves checkpoints automatically at specified step intervals
- Tracks current_epoch for checkpoint metadata
- More reliable than callback-based approach (no Python import cache issues)
Mid-epoch checkpoint callback: Trainer.train_epoch() still supports optional step_callback parameter for extensibility
- Callback function receives (trainer, loss) after each global step
- Used for custom logic beyond checkpointing
- 5 comprehensive tests for callback functionality
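The built-in saving and the optional step_callback described above could fit together roughly as in this sketch. MiniTrainer and RecordingManager are hypothetical stand-ins, the save_checkpoint signature is assumed, and each loss value represents one global step.

```python
class RecordingManager:
    """Hypothetical stand-in that records save requests instead of writing files."""

    def __init__(self):
        self.saved = []

    def save_checkpoint(self, step, epoch, loss):
        self.saved.append((step, epoch, loss))


class MiniTrainer:
    """Sketch of a trainer that saves checkpoints itself at step intervals."""

    def __init__(self, checkpoint_manager=None, auto_save_interval=100):
        self.checkpoint_manager = checkpoint_manager
        self.auto_save_interval = auto_save_interval
        self.global_step = 0
        self.current_epoch = 0

    def train_epoch(self, losses, step_callback=None):
        self.current_epoch += 1
        for loss in losses:  # one loss per global step in this sketch
            self.global_step += 1
            due = self.global_step % self.auto_save_interval == 0
            if self.checkpoint_manager is not None and due:
                self.checkpoint_manager.save_checkpoint(
                    self.global_step, self.current_epoch, loss
                )
            if step_callback is not None:
                step_callback(self, loss)  # custom logic beyond checkpointing


manager = RecordingManager()
trainer = MiniTrainer(checkpoint_manager=manager, auto_save_interval=100)
callback_calls = []
trainer.train_epoch([2.0] * 250, step_callback=lambda t, loss: callback_calls.append(t.global_step))
saved_steps = [step for step, _, _ in manager.saved]
```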
Cache clearing utilities: New clear_all_cache.ps1 script for manual cache clearing
17 new comprehensive tests (total: 324 passing tests)
- 9 tests for memory-mapped dataset
- 3 tests for config loading validation
- 5 tests for step_callback functionality (mid-epoch checkpointing)

Changed

TrainingConfig now supports all YAML field names with proper aliases (max_grad_norm, scheduler_type, gradient_checkpointing, keep_checkpoints)
DataConfig enhanced with max_seq_len and num_workers support from YAML configs
All config field aliases properly synced in __post_init__ methods
Training script now uses attribute access (config.training.learning_rate) instead of dict access (config['training']['learning_rate'])
TextDataset.__getitem__ now supports negative indexing (e.g., dataset[-1])
Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
- Trainer.__init__() now accepts checkpoint_manager and auto_save_interval parameters
- Checkpoint saving happens directly in train_epoch() method after global step updates
- Eliminates Python import cache issues that prevented callbacks from executing
- More reliable and maintainable implementation
Pipeline scripts enhanced: Both .ps1 and .sh scripts now clear Python cache and use -B flag before training
Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (~4-5 minutes with ULTRA config)

Fixed

System freeze issue: 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, preventing system freeze during optimizer.step()
Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
- Previously, with batch_size=1 and 166K sequences, no checkpoints would save until epoch completed (~166K steps)
- Root cause: Python bytecode caching prevented updated callback code from loading
- Solution: Built checkpoint saving directly into Trainer.train_epoch() method
- Checkpoints now reliably save every N steps (default 100, configurable)
- Added Python cache clearing to pipeline scripts (-B flag prevents bytecode caching)
- Added -W ignore::SyntaxWarning to suppress PyTorch internal warnings
System freeze issue: Memory-mapped storage prevents RAM exhaustion when training with large datasets
Python module caching: Pipeline scripts now clear __pycache__ and use -B flag to ensure latest code is always loaded
Config validation now happens at load time, catching errors before training starts
Duplicate num_workers field in DataConfig removed
Sequence length synchronization between max_seq_len and sequence_length
Duplicate checkpoint prompts in pipeline scripts eliminated with --no-resume flag
Tokenizer initialization in train.py (was incorrectly trying to access non-existent config.tokenizer)


06 Dec 23:47

juliuspleunes4

v1.0.0

4edafa3


[v1.0.0] - 2025-12-07 - First Stable Release 🎉

Major Milestone: Atlas v1.0.0 represents the first complete, production-ready release of the from-scratch language model implementation.

🎯 Complete Features

Core Architecture (Phase 3):

- Full decoder-only transformer architecture (GPT-style)
- Multi-head self-attention with causal masking
- Feed-forward networks with multiple activation functions (GELU, SiLU, ReLU)
- Pre-norm architecture with residual connections
- Learned positional embeddings
- Weight tying between embeddings and output head
- Gradient checkpointing for memory efficiency
- 51 comprehensive model tests

Training Infrastructure (Phase 5):

- Complete training loop with gradient accumulation
- Learning rate scheduling (warmup + cosine decay)
- Checkpoint management (step-based, epoch-based, best model)
- Automatic checkpoint resumption with interactive prompts
- Progress tracking and logging
- Validation and evaluation
- 62 training tests including auto-resume
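For readers unfamiliar with warmup plus cosine decay, a common formulation is sketched below. It assumes linear warmup and a configurable floor min_lr; the function name lr_at_step and the exact schedule shape are illustrative and may not match the Atlas scheduler.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, max_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(1.0, max(0.0, progress))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine


schedule = [lr_at_step(s, base_lr=1.0, warmup_steps=10, max_steps=110) for s in (0, 9, 10, 60, 110)]
```

The learning rate rises during the first 10 steps, peaks at base_lr, passes through half of base_lr midway through the decay phase, and reaches min_lr at max_steps.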

Data Pipeline (Phase 4):

- Text dataset with sliding window tokenization
- Multiple file format support (txt, JSONL)
- Preprocessing utilities (cleaning, chunking, filtering)
- Efficient data loading with PyTorch DataLoader
- Train/validation splitting
- 72 data pipeline tests
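A sliding window over a token stream can be sketched as follows. The function name sliding_windows, the stride parameter, and the choice to pair each window with a target shifted by one token are assumptions based on common language-model practice, not a description of the Atlas dataset code.

```python
def sliding_windows(tokens, seq_len, stride):
    """Return (input, target) pairs; targets are inputs shifted by one token."""
    windows = []
    # Each window needs seq_len + 1 tokens so the target can be shifted by one.
    for start in range(0, len(tokens) - seq_len, stride):
        chunk = tokens[start:start + seq_len + 1]
        windows.append((chunk[:-1], chunk[1:]))
    return windows


pairs = sliding_windows(list(range(10)), seq_len=4, stride=2)
```

A stride smaller than seq_len produces overlapping windows, which yields more training sequences from the same text at the cost of repeated tokens.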

Configuration System (Phase 1):

- YAML-based configuration
- CLI override support
- Multiple pre-configured model sizes (TINY to ULTRA)
- Validation and type checking
- 32 configuration tests

Tokenizer (Phase 2):

- GPT-2 BPE tokenizer via tiktoken
- Batch encoding/decoding
- Special token handling
- 27 tokenizer tests

Inference (Phase 6):

- Text generation with sampling strategies
- Temperature, top-k, top-p sampling
- Interactive and batch modes
- 33 inference tests
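The three sampling controls listed above can be combined as in this pure-Python sketch. It is a generic illustration rather than the TextGenerator implementation: the function name sample_next_token is hypothetical, and applying top-k before top-p is one of several reasonable orderings.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample a token id from logits with temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token ids ranked from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in ranked:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        ranked = kept

    # random.choices renormalises the remaining weights.
    return rng.choices(ranked, weights=[probs[i] for i in ranked], k=1)[0]


greedy = sample_next_token([1.0, 3.0, 2.0], top_k=1)
nucleus = sample_next_token([0.0, 10.0, 0.0], top_p=0.5)
```

With top_k=1 the function reduces to greedy decoding, and lower temperatures sharpen the distribution toward the most likely token.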

Model Export (Phase 7):

- GGUF format export
- Float32 and Float16 quantization
- Metadata embedding
- 17 export tests

📊 Statistics

- 307 passing tests across all components
- 6 model configurations (40M to 500M parameters)
- 10 comprehensive documentation files
- Clean, modular codebase with 94%+ coverage on core modules

🎁 Model Configurations

Six production-ready configurations:

- TINY (40M params): Testing and development
- SMALL (124M params): GPT-2 Small equivalent
- DEFAULT (350M params): Recommended, GPT-2 Medium equivalent
- LARGE (500M params): Maximum quality
- XLARGE (500M params): Memory-optimized
- ULTRA (500M params): Extreme low-temperature operation

📚 Documentation

Complete documentation suite:

- README.md - Project overview and quickstart
- ROADMAP.md - Development plan and progress
- CHANGELOG.md - This file
- ARCHITECTURE.md - Technical deep-dive
- CONTRIBUTING.md - Contribution guidelines
- CODE_OF_CONDUCT.md - Community standards
- SECURITY.md - Security policy
- LICENSE_GUIDE.md - Licensing information
- TESTING.md - Testing guide
- FAQ.md - Frequently asked questions

🚀 Getting Started

git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas
.\scripts\run_pipeline.ps1  # Windows
./scripts/run_pipeline.sh   # Linux/Mac

🙏 Acknowledgments

This release represents the culmination of comprehensive development work across all phases of the project. Special thanks to all contributors and users who provided feedback during development.
