v1.2.0

juliuspleunes4 released this 07 Dec 11:34


04853ab

[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes

Added

Best model tracking without validation: Now saves atlas_best.pt based on training loss when no validation data is provided
- Tracks best training loss throughout training
- Saves checkpoint whenever training loss improves
- Logs improvement percentage
- Ensures best model is always available even without validation set
New test for best model checkpoint: Added comprehensive test for training-loss-based best model tracking (total: 325 passing tests)
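
The training-loss-based tracking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the BestLossTracker class and its save_fn callback are hypothetical names, and the real trainer presumably writes atlas_best.pt where this sketch only records the loss.

```python
from typing import Callable, Optional


class BestLossTracker:
    """Tracks the lowest training loss seen so far (hypothetical helper)."""

    def __init__(self, save_fn: Callable[[float], None]) -> None:
        self.best_loss: Optional[float] = None
        # Assumption: save_fn stands in for "write atlas_best.pt".
        self.save_fn = save_fn

    def update(self, train_loss: float) -> bool:
        """Save and return True only when train_loss beats the previous best."""
        if self.best_loss is not None and train_loss >= self.best_loss:
            return False
        if self.best_loss is not None and self.best_loss > 0:
            # Log the relative improvement as a percentage.
            pct = 100.0 * (self.best_loss - train_loss) / self.best_loss
            print(f"Training loss improved by {pct:.2f}% to {train_loss:.4f}")
        self.best_loss = train_loss
        self.save_fn(train_loss)
        return True


# Usage: only improving losses trigger a save.
saved = []
tracker = BestLossTracker(save_fn=saved.append)
for loss in [2.0, 1.5, 1.7, 1.2]:
    tracker.update(loss)
```

The first loss always counts as an improvement, so a best checkpoint exists from the first update onward.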

Fixed

Checkpoint resume double-prompt: Fixed pipeline script asking twice about resuming from checkpoint
- Pipeline now passes --no-resume flag when user chooses not to resume
- Eliminates duplicate prompts between pipeline and train.py
Epoch counter on resume: Fixed epoch display jumping to next epoch when resuming from checkpoint
- Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
- Initialize epoch = start_epoch - 1 to account for loop increment
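
A small sketch of why the counter starts at start_epoch - 1. It assumes the training loop increments epoch at the top of each iteration before printing the banner; the function name and loop shape are illustrative, not taken from train.py.

```python
from typing import List


def epoch_banners(start_epoch: int, num_epochs: int) -> List[str]:
    """Return the banners a resumed run would print (illustrative only)."""
    # Start one below the resume epoch because the loop increments first.
    epoch = start_epoch - 1
    banners = []
    while epoch < num_epochs:
        epoch += 1  # loop increment happens before the banner is shown
        banners.append(f"EPOCH {epoch}")
    return banners


# Resuming inside epoch 1 shows "EPOCH 1" first, not "EPOCH 2".
banners = epoch_banners(start_epoch=1, num_epochs=3)
```

Initializing epoch to start_epoch directly would make the first banner read "EPOCH 2", which is the symptom this fix addresses.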
Progress bar tracking: Fixed progress bar to show global steps instead of batches
- Progress bar now displays correct position when resuming (e.g., 100/80000)
- Eliminates confusing batch counter that resets each epoch
- Shows consistent progress toward max_steps goal throughout training
- Manual progress bar updates prevent jumping between batch and step counts
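
The idea of driving the display from global steps rather than batches might look like the sketch below. It assumes one global step per accum_steps batches (gradient accumulation), which the notes do not state explicitly; all names are hypothetical and a plain string stands in for the real progress bar.

```python
from typing import List


def global_step_progress(start_step: int, max_steps: int,
                         batches_per_epoch: int, accum_steps: int) -> List[str]:
    """Return the progress labels shown during a (resumed) run."""
    if accum_steps > batches_per_epoch:
        raise ValueError("an epoch must contain at least one global step")
    global_step = start_step  # a resumed run starts at the saved step
    shown = []
    while global_step < max_steps:
        # The batch index resets every epoch, but it is never displayed.
        for batch_idx in range(1, batches_per_epoch + 1):
            if batch_idx % accum_steps == 0:
                global_step += 1  # manual update: one tick per global step
                shown.append(f"{global_step}/{max_steps}")
                if global_step >= max_steps:
                    break
    return shown


# Resume at step 100 and train to step 106 across two short epochs.
shown = global_step_progress(start_step=100, max_steps=106,
                             batches_per_epoch=8, accum_steps=2)
```

The displayed position keeps counting across the epoch boundary because only global_step feeds the label.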
PowerShell checkpoint path quoting: Fixed checkpoint path being double-quoted in run_pipeline.ps1
- Changed from building --resume "path" string to passing path variable directly
- Resolves "unrecognized arguments" error when resuming
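
The quoting problem can be reproduced in Python as an analogy to the PowerShell fix. The sketch assumes train.py parses its flags with argparse, which the "unrecognized arguments" wording suggests but these notes do not confirm; the checkpoint path is made up.

```python
import argparse

# Assumption: a parser resembling what train.py might define.
parser = argparse.ArgumentParser()
parser.add_argument("--resume", default=None)

path = "checkpoints/step 100.pt"  # hypothetical checkpoint path

# Before: the flag and a quoted path are fused into one argument string,
# so the parser never sees a standalone --resume token.
bad_ns, bad_extra = parser.parse_known_args([f'--resume "{path}"'])

# After: the flag and the path are passed as separate arguments.
good_ns, good_extra = parser.parse_known_args(["--resume", path])
```

With parse_args instead of parse_known_args, the first call would exit with an "unrecognized arguments" error, matching the symptom described above.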
Ctrl+C interrupt handling: Fixed checkpoint not saving when pressing Ctrl+C during training
- Added check_interrupt callback to trainer's batch loop
- Trainer now checks for interrupt on every batch iteration
- Saves checkpoint immediately when interrupt detected
- Prevents duplicate interrupt messages
- Avoids saving multiple checkpoints (epoch + step + interrupt)
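
A rough sketch of the per-batch interrupt check. The check_interrupt name comes from the notes, but its signature, the train function, and the checkpoint labels are assumptions; in a real script the flag would typically be set by a SIGINT handler, which is simulated here with a counter.

```python
from typing import Callable, List


def train(num_batches: int, check_interrupt: Callable[[], bool],
          save_every: int = 3) -> List[str]:
    """Run a fake batch loop and return the checkpoints it would save."""
    saved = []
    for step in range(1, num_batches + 1):
        # ... forward and backward pass would run here ...
        if check_interrupt():
            # Save once and leave immediately: the interval and epoch
            # checkpoints below are skipped for the interrupted run.
            saved.append(f"interrupt@{step}")
            return saved
        if step % save_every == 0:
            saved.append(f"step@{step}")
    saved.append("epoch_end")
    return saved


# Simulate Ctrl+C arriving during the fifth batch.
calls = {"n": 0}


def fake_interrupt() -> bool:
    calls["n"] += 1
    return calls["n"] >= 5


result = train(num_batches=10, check_interrupt=fake_interrupt)
```

Returning straight after the interrupt save is what keeps the exit down to a single checkpoint write.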
Memory-mapped file cleanup: Fixed "file in use" warning when exiting training
- Properly closes mmap file before unlinking
- Silently ignores cleanup errors (OS temp directory handles cleanup)
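
The close-before-unlink order can be illustrated with Python's standard mmap module. Whether the project uses mmap directly or a wrapper such as a NumPy memmap is not stated here, so treat the helper below as a hypothetical sketch.

```python
import mmap
import os
import tempfile


def cleanup_mmap(mm: mmap.mmap, fileobj, path: str) -> None:
    """Release the mapping and file handle, then remove the backing file."""
    try:
        mm.close()       # release the mapping first
        fileobj.close()  # then the underlying file handle
        os.unlink(path)  # only now is the file free to be deleted
    except OSError:
        # Ignore cleanup failures; a temp directory is cleaned by the OS.
        pass


# Create a small temporary file, map it, and clean it up.
fd, path = tempfile.mkstemp(suffix=".bin")
os.write(fd, b"\x00" * 4096)
f = os.fdopen(fd, "r+b")
mm = mmap.mmap(f.fileno(), 0)
cleanup_mmap(mm, f, path)
```

On Windows, unlinking a file that is still mapped fails with a "file in use" style error, which is why the order matters.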

Changed

Progress bar display: Training progress now shows global steps throughout entire training session
- Before: Showed batches (0-166478 per epoch), reset each epoch
- After: Shows global steps (100-80000), continuous across epochs
- Time estimate shows seconds per global step instead of seconds per batch
- Makes progress tracking more intuitive for long training runs
Interrupt checkpoint priority: When interrupted, only saves one checkpoint instead of multiple
- Interrupt checkpoint takes priority over epoch and interval checkpoints
- Skips epoch checkpoint and validation when interrupted
- Reduces disk I/O and saves time when exiting
- Clean, fast exit on Ctrl+C
Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
- Keeps 5 recent auto-save checkpoints (every 100 global steps)
- Automatically deletes older checkpoints to save disk space
- Keeping 2 extra checkpoints uses ~7.4GB more disk space (at 3.68GB per checkpoint)
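
Retention of the most recent step-based checkpoints could be implemented along these lines. The atlas_step_<N>.pt naming scheme and the prune_checkpoints helper are assumptions for illustration; the project's real file names may differ.

```python
import tempfile
from pathlib import Path
from typing import List


def prune_checkpoints(ckpt_dir: Path, keep: int = 5) -> List[str]:
    """Delete all but the `keep` highest-step checkpoints; return deleted names."""
    def step_of(p: Path) -> int:
        # Assumed naming: atlas_step_<N>.pt
        return int(p.stem.split("_")[-1])

    ckpts = sorted(ckpt_dir.glob("atlas_step_*.pt"), key=step_of)
    stale = ckpts[:-keep] if keep > 0 else ckpts
    for p in stale:
        p.unlink()
    return [p.name for p in stale]


# Simulate 1000 steps with an auto-save every 100 steps.
with tempfile.TemporaryDirectory() as tmp:
    d = Path(tmp)
    for step in range(100, 1100, 100):
        (d / f"atlas_step_{step}.pt").touch()
    deleted = prune_checkpoints(d, keep=5)
    remaining = sorted(p.name for p in d.glob("*.pt"))
```

Sorting by the numeric step rather than by file name keeps step 1000 ahead of step 600. At 3.68GB per checkpoint, the five retained files occupy about 18.4GB.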
