Releases: juliuspleunes4/Atlas
v1.2.4
[v1.2.4] - 2025-12-07 - Persistent Best Checkpoint Tracking Across Sessions
Fixed
- Persistent best checkpoint tracking: Fixed best checkpoint being overwritten by worse models across training sessions
  - Root cause: `CheckpointManager` initialized `best_metric = float('inf')` on every startup, never checking for existing `atlas_best.pt`
    - This caused a critical regression scenario: Session 1 saves loss 3.88 as best → Session 2 resumes from checkpoint with loss 5.0 → Session 2 thinks loss 4.5 is "best" → overwrites 3.88 with 4.5
  - CheckpointManager now loads existing best checkpoint on initialization:
    - Checks for `{model_name}_best.pt` file existence in checkpoint directory
    - Loads existing best checkpoint with `torch.load(..., weights_only=False)`
    - Extracts `metadata['loss']` to initialize `self.best_metric` instead of `float('inf')`
    - Logs "Found existing best checkpoint with loss: X.XXXX" when found
  - train.py now initializes from persistent best metric:
    - Primary source: `best_train_loss = checkpoint_manager.best_metric` (loaded from existing best)
    - Fallback: If resumed checkpoint has better loss, use that instead
    - Logs initialization source for transparency
  - save_checkpoint now verifies improvement before overwriting:
    - Added comparison: `if is_best and self.keep_best and metadata.loss < self.best_metric:`
    - Only saves `atlas_best.pt` if new loss is strictly better than existing best
    - Updates `self.best_metric` only after successful save
    - Logs "[BEST] Saved new best checkpoint (loss: X.XXXX)" when overwriting
  - Best model now properly persists across all training sessions, preventing loss of superior checkpoints
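As a rough illustration of the v1.2.4 fix (not the actual Atlas `CheckpointManager`), the sketch below replays the regression scenario. To run without PyTorch it stores metadata as JSON rather than calling `torch.save`/`torch.load`; the class `BestTracker`, the method `save_if_best`, and the `.json` file suffix are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class BestTracker:
    """Hypothetical stand-in for the best-checkpoint logic in this release."""

    def __init__(self, checkpoint_dir, model_name="atlas"):
        self.best_path = Path(checkpoint_dir) / f"{model_name}_best.json"
        self.best_metric = float("inf")
        # Load the previous session's best loss instead of resetting to inf.
        if self.best_path.exists():
            metadata = json.loads(self.best_path.read_text())
            self.best_metric = metadata["loss"]
            print(f"Found existing best checkpoint with loss: {self.best_metric:.4f}")

    def save_if_best(self, loss):
        """Overwrite the best file only on a strict improvement."""
        if not loss < self.best_metric:
            return False
        self.best_path.write_text(json.dumps({"loss": loss}))
        # Update the in-memory metric only after the write succeeded.
        self.best_metric = loss
        print(f"[BEST] Saved new best checkpoint (loss: {loss:.4f})")
        return True


def demo():
    """Replay the scenario: 3.88 saved in session 1, 4.5 offered in session 2."""
    with tempfile.TemporaryDirectory() as tmp:
        session1 = BestTracker(tmp)
        first = session1.save_if_best(3.88)
        session2 = BestTracker(tmp)  # simulates a fresh process start
        second = session2.save_if_best(4.5)
        return first, second, session2.best_metric
```

Calling `demo()` shows the second session refusing to overwrite the 3.88 checkpoint with the worse 4.5 result.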
[v1.2.3] - 2025-12-07 - Best Checkpoint Tracking for Step-Based Saves
Fixed
- Best checkpoint creation at step intervals: Fixed best model checkpoint not being created during mid-epoch step saves
  - Root cause: Best loss tracking (`is_best` flag) only calculated at epoch boundaries, not at step-based checkpoints
  - Now tracks and updates `best_train_loss` in the step checkpoint callback using `nonlocal`
  - Best loss check happens on every training step, not just at epoch end
  - `atlas_best.pt` now created immediately when loss improves, even mid-epoch
  - Logs "[BEST] NEW BEST TRAINING LOSS" with improvement percentage when new best is found
  - Step checkpoint metadata now includes `best_metric` field with current best loss
  - Ensures best model is always preserved regardless of checkpoint cleanup (keep_checkpoints setting)
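The `nonlocal` pattern mentioned in v1.2.3 can be sketched as a closure that keeps the best loss between calls. This is only an illustration: the factory `make_step_callback`, the `(step, loss)` signature, the `save_best` hook, and the exact log wording are assumptions, not the Atlas code.

```python
def make_step_callback(save_best):
    """Build a per-step callback that tracks the best training loss.

    `save_best` is a hypothetical function that writes the best checkpoint.
    """
    best_train_loss = float("inf")

    def on_step(step, loss):
        # `nonlocal` lets the closure rebind the enclosing variable;
        # without it, the assignment below would create a new local.
        nonlocal best_train_loss
        if loss < best_train_loss:
            if best_train_loss != float("inf"):
                improvement = (best_train_loss - loss) / best_train_loss * 100
                print(f"[BEST] NEW BEST TRAINING LOSS: {loss:.4f} ({improvement:.2f}% better)")
            best_train_loss = loss
            save_best(step, loss)
        return best_train_loss

    return on_step


# Usage: the best checkpoint is written mid-epoch, as soon as the loss improves.
saved = []
callback = make_step_callback(lambda step, loss: saved.append((step, loss)))
for step, loss in enumerate([5.0, 4.2, 4.6, 3.9], start=1):
    callback(step, loss)
```

Here step 3 (loss 4.6) is skipped because it does not beat the best loss seen so far.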
v1.2.2
[v1.2.2] - 2025-12-07 - Inference Script Fixes
Fixed
- Inference script compatibility: Fixed multiple issues preventing inference from working
  - Fixed `Tokenizer` initialization: use `encoding_name` parameter instead of `tokenizer_name`
  - Fixed checkpoint config inference: use correct state dict keys (`embeddings.token_embedding.embedding.weight` and `embeddings.positional_embedding.embedding.weight`)
  - Fixed `TextGenerator` initialization: removed invalid `tokenizer` parameter
  - Added `tokenizer` parameter to `generate_interactive()` and `generate_batch()` functions
  - Pass `tokenizer` to `generate_from_prompt()` calls in both interactive and batch modes
  - Inference now works correctly with `scripts/infer.py` for both interactive and batch generation
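One plausible reading of "checkpoint config inference" is that model dimensions are read from the shapes of the two embedding weights named in v1.2.2. The sketch below assumes the usual `(num_embeddings, embedding_dim)` layout; the function `infer_config`, the returned field names, and the example sizes are hypothetical, and plain stand-in objects replace real tensors so the example runs without PyTorch.

```python
from types import SimpleNamespace

TOKEN_KEY = "embeddings.token_embedding.embedding.weight"
POSITION_KEY = "embeddings.positional_embedding.embedding.weight"


def infer_config(state_dict):
    """Read model dimensions from the embedding weight shapes.

    Assumes each weight is laid out as (num_embeddings, embedding_dim).
    The returned field names are illustrative.
    """
    vocab_size, hidden_size = state_dict[TOKEN_KEY].shape
    max_seq_len, _ = state_dict[POSITION_KEY].shape
    return {
        "vocab_size": vocab_size,
        "hidden_size": hidden_size,
        "max_seq_len": max_seq_len,
    }


# Stand-in objects with a `.shape` attribute replace real tensors here.
fake_state_dict = {
    TOKEN_KEY: SimpleNamespace(shape=(50257, 768)),
    POSITION_KEY: SimpleNamespace(shape=(1024, 768)),
}
config = infer_config(fake_state_dict)
```

A lookup with a wrong key raises `KeyError` immediately, which is the kind of failure the corrected key names avoid.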
v1.2.1
[v1.2.1] - 2025-12-07 - Best Model Checkpoint Resume Fix
Added
- New test for checkpoint resume: Added `test_best_checkpoint_restored_on_resume` verifying best_metric restoration on resume (total: 326 passing tests)
  - Tests checkpoint saves `best_metric` correctly
  - Verifies loading checkpoint restores `best_metric`
  - Ensures subsequent saves only mark as "best" when loss improves from restored value
  - Prevents regression where best checkpoints weren't created after resume
Fixed
- Best model tracking on resume: Fixed best checkpoint not being saved after resuming training
  - Root cause: `best_train_loss` was always reset to `float('inf')` on resume instead of being restored from checkpoint
  - Now restores `best_train_loss` from checkpoint metadata when resuming
  - Checkpoint metadata now saves `best_train_loss` (was saving `None` before when no validation data)
  - Ensures `atlas_best.pt` is created when training loss improves, even across multiple training sessions
  - Example: Resuming from step 600 now correctly recognizes if step 1400 has the best loss
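A minimal sketch of the restore step in v1.2.1, assuming checkpoint metadata is a plain dict with a `best_train_loss` key (the key name and both helper functions are assumptions). It also covers older checkpoints that stored `None`.

```python
def restore_best_train_loss(metadata):
    """Return the best loss recorded in checkpoint metadata.

    Falls back to infinity when the field is missing or None, which is
    what older checkpoints stored when no validation data was used.
    """
    value = metadata.get("best_train_loss")
    return float("inf") if value is None else float(value)


def is_new_best(loss, best_train_loss):
    """A loss counts as best only if it beats the restored value."""
    return loss < best_train_loss


# Resuming from a checkpoint whose best loss so far was 4.1 (example values).
restored = restore_best_train_loss({"step": 600, "best_train_loss": 4.1})
legacy = restore_best_train_loss({"step": 600, "best_train_loss": None})
```

With the restored value in place, a later loss of 4.5 is not marked as best, while 3.9 is.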
v1.2.0
[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes
Added
- Best model tracking without validation: Now saves `atlas_best.pt` based on training loss when no validation data provided
  - Tracks best training loss throughout training
  - Saves checkpoint whenever training loss improves
  - Logs improvement percentage
  - Ensures best model is always available even without validation set
- New test for best model checkpoint: Added comprehensive test for training-loss-based best model tracking (total: 325 passing tests)
Fixed
- Checkpoint resume double-prompt: Fixed pipeline script asking twice about resuming from checkpoint
  - Pipeline now passes `--no-resume` flag when user chooses not to resume
  - Eliminates duplicate prompts between pipeline and train.py
- Epoch counter on resume: Fixed epoch display jumping to next epoch when resuming from checkpoint
  - Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
  - Initialize `epoch = start_epoch - 1` to account for loop increment
- Progress bar tracking: Fixed progress bar to show global steps instead of batches
  - Progress bar now displays correct position when resuming (e.g., 100/80000)
  - Eliminates confusing batch counter that resets each epoch
  - Shows consistent progress toward max_steps goal throughout training
  - Manual progress bar updates prevent jumping between batch and step counts
- PowerShell checkpoint path quoting: Fixed checkpoint path being double-quoted in run_pipeline.ps1
  - Changed from building `--resume "path"` string to passing path variable directly
  - Resolves "unrecognized arguments" error when resuming
- Ctrl+C interrupt handling: Fixed checkpoint not saving when pressing Ctrl+C during training
  - Added `check_interrupt` callback to trainer's batch loop
  - Trainer now checks for interrupt on every batch iteration
  - Saves checkpoint immediately when interrupt detected
  - Prevents duplicate interrupt messages
  - Avoids saving multiple checkpoints (epoch + step + interrupt)
- Memory-mapped file cleanup: Fixed "file in use" warning when exiting training
  - Properly closes mmap file before unlinking
  - Silently ignores cleanup errors (OS temp directory handles cleanup)
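The Ctrl+C handling above can be sketched as a flag set by a signal handler and polled once per batch. This is an assumption-laden illustration: `InterruptFlag`, the toy `train` loop, and the `save_checkpoint(step, reason=...)` signature are hypothetical, and only the name `check_interrupt` comes from the notes.

```python
import signal


class InterruptFlag:
    """Records Ctrl+C so the training loop can stop at a safe point."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Route SIGINT (Ctrl+C) to the handler instead of raising KeyboardInterrupt.
        signal.signal(signal.SIGINT, self._handle)

    def _handle(self, signum, frame):
        if not self.requested:  # print the message only once
            print("Interrupt received, saving checkpoint...")
        self.requested = True

    def __call__(self):
        return self.requested


def train(batches, check_interrupt, save_checkpoint):
    """Toy batch loop that polls for an interrupt on every iteration."""
    for step, _batch in enumerate(batches, start=1):
        # ... the forward/backward pass would run here ...
        if check_interrupt():
            # Save exactly one checkpoint and skip the epoch-end save.
            save_checkpoint(step, reason="interrupt")
            return step
    save_checkpoint(len(batches), reason="epoch")
    return len(batches)
```

In a real script, `InterruptFlag().install()` would be called once before training and the flag passed as `check_interrupt`.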
Changed
- Progress bar display: Training progress now shows global steps throughout entire training session
  - Before: Showed batches (0-166478 per epoch), reset each epoch
  - After: Shows global steps (100-80000), continuous across epochs
  - Time estimate shows seconds per global step instead of seconds per batch
  - Makes progress tracking more intuitive for long training runs
- Interrupt checkpoint priority: When interrupted, only saves one checkpoint instead of multiple
  - Interrupt checkpoint takes priority over epoch and interval checkpoints
  - Skips epoch checkpoint and validation when interrupted
  - Reduces disk I/O and saves time when exiting
  - Clean, fast exit on Ctrl+C
- Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
  - Keeps 5 recent auto-save checkpoints (every 100 global steps)
  - Automatically deletes older checkpoints to save disk space
  - Saves ~7.4GB disk space per 1000 steps (at 3.68GB per checkpoint)
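The retention rule (keep the 5 most recent step checkpoints, delete the rest) might look like the sketch below. The `atlas_step_<N>.pt` naming scheme and the function `prune_step_checkpoints` are assumptions made for illustration; the actual file names in Atlas may differ.

```python
import tempfile
from pathlib import Path


def prune_step_checkpoints(checkpoint_dir, keep=5, pattern="atlas_step_*.pt"):
    """Delete all but the `keep` highest-numbered step checkpoints.

    The naming scheme is an assumption for this sketch. A best-model file
    does not match the pattern, so it is never deleted.
    """
    def step_of(path):
        return int(path.stem.rsplit("_", 1)[1])

    # Sort numerically so that step 1000 comes after step 900.
    checkpoints = sorted(Path(checkpoint_dir).glob(pattern), key=step_of)
    stale = checkpoints[:-keep] if keep > 0 else checkpoints
    for path in stale:
        path.unlink()
    return [p.name for p in checkpoints[len(stale):]]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        for step in range(100, 1100, 100):  # ten auto-saves, one per 100 steps
            (Path(tmp) / f"atlas_step_{step}.pt").touch()
        (Path(tmp) / "atlas_best.pt").touch()
        kept = prune_step_checkpoints(tmp, keep=5)
        best_survives = (Path(tmp) / "atlas_best.pt").exists()
        return kept, best_survives
```

After ten auto-saves, `demo()` leaves the checkpoints for steps 600 through 1000 plus the untouched best-model file.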
v1.1.0
[v1.1.0] - 2025-12-07 - Memory-Efficient Optimizer & Reliable Checkpointing
Added
- 8-bit optimizer support: Added `adamw8bit` optimizer type using bitsandbytes for 75% memory reduction
  - Stores momentum states in 8-bit precision instead of 32-bit
  - Reduces optimizer memory from ~1.9GB to ~0.5GB for 655M parameter models
  - Fixes system freeze issue that occurred at gradient accumulation boundary
  - Requires `bitsandbytes` package (added to dependencies)
- Multiple optimizer types: `create_optimizer()` now supports:
  - `adamw`: Standard AdamW (2 momentum states, ~1.9GB for 655M params)
  - `adamw8bit`: 8-bit AdamW (~0.5GB for 655M params, 75% memory reduction)
  - `sgd`: SGD with Nesterov momentum (~0.95GB for 655M params, 50% memory reduction)
- Optimizer configuration fields: Added to `TrainingConfig`:
  - `optimizer_type`: Select optimizer algorithm (default: "adamw")
  - `momentum`: Momentum factor for SGD (default: 0.9)
- Memory-mapped dataset support: `TextDataset` now supports memory-mapped file storage for large datasets to prevent RAM exhaustion
  - Automatically enabled for ULTRA config (batch_size=1)
  - Tokens stored on disk, loaded on-demand
  - Reduces RAM usage from ~170MB to <10MB for 42M token dataset
  - 9 comprehensive tests for mmap functionality
- Strict config loading in train.py: Training script now uses `atlas.config.load_config()` for type-safe, validated configuration loading instead of raw YAML dict access
- Advanced training time estimator: New `estimate_training_time.py` script that performs comprehensive benchmarking with:
  - Thermal throttling detection
  - Checkpoint overhead measurement
  - Validation time estimation
  - First epoch overhead calculation
  - Timeline predictions with completion dates
- No-resume flag: Added `--no-resume` flag to prevent duplicate checkpoint prompts in automated scripts
- Built-in checkpoint saving: `Trainer` now handles checkpoint saving internally during training
  - Accepts `checkpoint_manager` and `auto_save_interval` in constructor
  - Saves checkpoints automatically at specified step intervals
  - Tracks `current_epoch` for checkpoint metadata
  - More reliable than callback-based approach (no Python import cache issues)
- Mid-epoch checkpoint callback: `Trainer.train_epoch()` still supports optional `step_callback` parameter for extensibility
  - Callback function receives `(trainer, loss)` after each global step
  - Used for custom logic beyond checkpointing
  - 5 comprehensive tests for callback functionality
- Cache clearing utilities: New `clear_all_cache.ps1` script for manual cache clearing
- 17 new comprehensive tests (total: 324 passing tests)
  - 9 tests for memory-mapped dataset
  - 3 tests for config loading validation
  - 5 tests for step_callback functionality (mid-epoch checkpointing)
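To illustrate the memory-mapped dataset idea (tokens on disk, loaded on demand), here is a small sketch using only the standard library. It is not the Atlas `TextDataset`: the class `MmapTokenDataset`, the 32-bit token encoding, and the non-overlapping windows are assumptions chosen to keep the example short.

```python
import mmap
import os
import struct
import tempfile


class MmapTokenDataset:
    """Toy dataset that reads fixed-length token windows from a memory map."""

    ITEM_SIZE = 4  # bytes per token, stored as unsigned 32-bit integers

    def __init__(self, tokens, seq_len):
        self.seq_len = seq_len
        self.num_tokens = len(tokens)
        fd, self.path = tempfile.mkstemp(suffix=".bin")
        with os.fdopen(fd, "wb") as f:
            f.write(struct.pack(f"<{len(tokens)}I", *tokens))
        self._file = open(self.path, "rb")
        self._mmap = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return self.num_tokens // self.seq_len

    def __getitem__(self, index):
        if index < 0:  # support dataset[-1] like a list
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.seq_len * self.ITEM_SIZE
        end = start + self.seq_len * self.ITEM_SIZE
        # Only this slice is read from the mapped file.
        return list(struct.unpack(f"<{self.seq_len}I", self._mmap[start:end]))

    def close(self):
        # Close the map and file handle before unlinking; on Windows the
        # file cannot be deleted while it is still mapped.
        self._mmap.close()
        self._file.close()
        try:
            os.unlink(self.path)
        except OSError:
            pass  # leave leftovers to the OS temp directory cleanup


dataset = MmapTokenDataset(list(range(10)), seq_len=4)
first, last = dataset[0], dataset[-1]
dataset.close()
```

With ten tokens and a window of four, the dataset holds two full windows and the trailing two tokens are dropped.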
Changed
- `TrainingConfig` now supports all YAML field names with proper aliases (`max_grad_norm`, `scheduler_type`, `gradient_checkpointing`, `keep_checkpoints`)
- `DataConfig` enhanced with `max_seq_len` and `num_workers` support from YAML configs
- All config field aliases properly synced in `__post_init__` methods
- Training script now uses attribute access (`config.training.learning_rate`) instead of dict access (`config['training']['learning_rate']`)
- `TextDataset.__getitem__` now supports negative indexing (e.g., `dataset[-1]`)
- Dataset loading includes automatic garbage collection and CUDA cache clearing to free memory
- Checkpoint saving architecture: Moved from callback-based to built-in trainer mechanism
  - `Trainer.__init__()` now accepts `checkpoint_manager` and `auto_save_interval` parameters
  - Checkpoint saving happens directly in `train_epoch()` method after global step updates
  - Eliminates Python import cache issues that prevented callbacks from executing
  - More reliable and maintainable implementation
- Pipeline scripts enhanced: Both `.ps1` and `.sh` scripts now clear Python cache and use `-B` flag before training
- Default checkpoint interval: Reduced from 1000 to 100 steps for more frequent saves (~4-5 minutes with ULTRA config)
Fixed
- System freeze issue: 8-bit optimizer (adamw8bit) reduces optimizer memory by 75%, preventing system freeze during optimizer.step()
- Missing checkpoints during training: Moved checkpoint saving logic from callback to built-in trainer functionality
  - Previously, with batch_size=1 and 166K sequences, no checkpoints would save until epoch completed (~166K steps)
  - Root cause: Python bytecode caching prevented updated callback code from loading
  - Solution: Built checkpoint saving directly into `Trainer.train_epoch()` method
  - Checkpoints now reliably save every N steps (default 100, configurable)
  - Added Python cache clearing to pipeline scripts (`-B` flag prevents bytecode caching)
  - Added `-W ignore::SyntaxWarning` to suppress PyTorch internal warnings
- System freeze issue: Memory-mapped storage prevents RAM exhaustion when training with large datasets
- Python module caching: Pipeline scripts now clear `__pycache__` and use `-B` flag to ensure latest code is always loaded
- Config validation now happens at load time, catching errors before training starts
- Duplicate `num_workers` field in `DataConfig` removed
- Sequence length synchronization between `max_seq_len` and `sequence_length`
- Duplicate checkpoint prompts in pipeline scripts eliminated with `--no-resume` flag
- Tokenizer initialization in `train.py` (was incorrectly trying to access non-existent `config.tokenizer`)
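Alias syncing and load-time validation in `__post_init__` could work along these lines. This is a guess at the general shape, not the real `DataConfig`: the defaults, the error messages, and the choice to reject conflicting values are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataConfig:
    """Illustrative config: `sequence_length` is treated as an alias of `max_seq_len`."""

    max_seq_len: Optional[int] = None
    sequence_length: Optional[int] = None
    num_workers: int = 0

    def __post_init__(self):
        # Validate at load time so bad configs fail before training starts.
        if (
            self.max_seq_len is not None
            and self.sequence_length is not None
            and self.max_seq_len != self.sequence_length
        ):
            raise ValueError("max_seq_len and sequence_length disagree")
        value = self.max_seq_len if self.max_seq_len is not None else self.sequence_length
        if value is None or value <= 0:
            raise ValueError("a positive sequence length is required")
        # Keep both names in sync so either can be used downstream.
        self.max_seq_len = self.sequence_length = value
        if self.num_workers < 0:
            raise ValueError("num_workers must be non-negative")


config = DataConfig(sequence_length=512)
```

Setting either field is enough; after construction both names hold the same value.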
v1.0.0
[v1.0.0] - 2025-12-07 - First Stable Release
Major Milestone: Atlas v1.0.0 represents the first complete, production-ready release of the from-scratch language model implementation.
Complete Features
Core Architecture (Phase 3):
- Full decoder-only transformer architecture (GPT-style)
- Multi-head self-attention with causal masking
- Feed-forward networks with multiple activation functions (GELU, SiLU, ReLU)
- Pre-norm architecture with residual connections
- Learned positional embeddings
- Weight tying between embeddings and output head
- Gradient checkpointing for memory efficiency
- 51 comprehensive model tests
Training Infrastructure (Phase 5):
- Complete training loop with gradient accumulation
- Learning rate scheduling (warmup + cosine decay)
- Checkpoint management (step-based, epoch-based, best model)
- Automatic checkpoint resumption with interactive prompts
- Progress tracking and logging
- Validation and evaluation
- 62 training tests including auto-resume
Data Pipeline (Phase 4):
- Text dataset with sliding window tokenization
- Multiple file format support (txt, JSONL)
- Preprocessing utilities (cleaning, chunking, filtering)
- Efficient data loading with PyTorch DataLoader
- Train/validation splitting
- 72 data pipeline tests
Configuration System (Phase 1):
- YAML-based configuration
- CLI override support
- Multiple pre-configured model sizes (TINY to ULTRA)
- Validation and type checking
- 32 configuration tests
Tokenizer (Phase 2):
- GPT-2 BPE tokenizer via tiktoken
- Batch encoding/decoding
- Special token handling
- 27 tokenizer tests
Inference (Phase 6):
- Text generation with sampling strategies
- Temperature, top-k, top-p sampling
- Interactive and batch modes
- 33 inference tests
Model Export (Phase 7):
- GGUF format export
- Float32 and Float16 quantization
- Metadata embedding
- 17 export tests
Statistics
- 307 passing tests across all components
- 6 model configurations (40M to 500M parameters)
- 10 comprehensive documentation files
- Clean, modular codebase with 94%+ coverage on core modules
Model Configurations
Six production-ready configurations:
- TINY (40M params): Testing and development
- SMALL (124M params): GPT-2 Small equivalent
- DEFAULT (350M params): Recommended, GPT-2 Medium equivalent
- LARGE (500M params): Maximum quality
- XLARGE (500M params): Memory-optimized
- ULTRA (500M params): Extreme low-temperature operation
Documentation
Complete documentation suite:
- README.md - Project overview and quickstart
- ROADMAP.md - Development plan and progress
- CHANGELOG.md - This file
- ARCHITECTURE.md - Technical deep-dive
- CONTRIBUTING.md - Contribution guidelines
- CODE_OF_CONDUCT.md - Community standards
- SECURITY.md - Security policy
- LICENSE_GUIDE.md - Licensing information
- TESTING.md - Testing guide
- FAQ.md - Frequently asked questions
Getting Started
git clone https://github.com/juliuspleunes4/Atlas.git
cd Atlas
.\scripts\run_pipeline.ps1 # Windows
./scripts/run_pipeline.sh # Linux/Mac
Acknowledgments
This release represents the culmination of comprehensive development work across all phases of the project. Special thanks to all contributors and users who provided feedback during development.