Releases
v1.2.0
[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes
Added
Best model tracking without validation: atlas_best.pt is now saved based on training loss when no validation data is provided
Tracks best training loss throughout training
Saves checkpoint whenever training loss improves
Logs improvement percentage
Ensures best model is always available even without validation set
New test for best model checkpoint: Added a comprehensive test for training-loss-based best model tracking (total: 325 passing tests)
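The training-loss-based tracking described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `BestLossTracker` class and the loss values are hypothetical, and a real trainer would write atlas_best.pt where the comment indicates.

```python
import math


class BestLossTracker:
    """Tracks the lowest training loss seen so far (hypothetical helper)."""

    def __init__(self):
        self.best_loss = math.inf

    def update(self, loss):
        """Return the improvement in percent if `loss` is a new best, else None."""
        if loss >= self.best_loss:
            return None
        previous = self.best_loss
        self.best_loss = loss
        if math.isinf(previous):
            return 0.0  # first measurement: nothing to compare against
        return (previous - loss) / previous * 100.0


tracker = BestLossTracker()
saved = []  # (step, improvement %) for every step that would trigger a save
for step, loss in enumerate([2.0, 1.5, 1.8, 1.2], start=1):
    improvement = tracker.update(loss)
    if improvement is not None:
        # A real trainer would save atlas_best.pt and log the improvement here.
        saved.append((step, round(improvement, 1)))
```

In this run, steps 1, 2, and 4 trigger a save, while step 3 (loss 1.8) does not because it is worse than the best loss so far.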
Fixed
Checkpoint resume double-prompt: Fixed the pipeline script asking twice about resuming from a checkpoint
Pipeline now passes --no-resume flag when user chooses not to resume
Eliminates duplicate prompts between pipeline and train.py
Epoch counter on resume: Fixed the epoch display jumping to the next epoch when resuming from a checkpoint
Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
Initializes epoch = start_epoch - 1 so that the loop's increment lands on the saved epoch rather than the one after it
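The off-by-one fix can be illustrated with a small sketch. It assumes the training loop increments the epoch counter at the top of each iteration before displaying it; the function name `epoch_labels` is hypothetical and only serves the example.

```python
def epoch_labels(start_epoch, num_epochs):
    """Return the epoch headers a resumed run would display (1-based)."""
    labels = []
    epoch = start_epoch - 1  # compensate for the increment at the top of the loop
    while epoch < num_epochs:
        epoch += 1
        labels.append(f"EPOCH {epoch}")
    return labels


# Resuming from a checkpoint saved during epoch 1 starts at "EPOCH 1" again.
resumed = epoch_labels(start_epoch=1, num_epochs=3)
```

Without the `- 1`, the same loop would display "EPOCH 2" first, which matches the behavior described in the fix.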
Progress bar tracking: Fixed the progress bar to show global steps instead of batches
Progress bar now displays correct position when resuming (e.g., 100/80000)
Eliminates confusing batch counter that resets each epoch
Shows consistent progress toward max_steps goal throughout training
Manual progress bar updates prevent jumping between batch and step counts
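A rough sketch of step-based progress tracking, under the assumption that each batch corresponds to one global step. The names `progress_text` and `simulate` are hypothetical. The point is that the displayed counter is advanced manually once per global step and never resets when a new epoch begins.

```python
def progress_text(global_step, max_steps):
    """Format the position the progress bar should show."""
    return f"{global_step}/{max_steps}"


def simulate(start_step, max_steps, batches_per_epoch):
    """Run a dummy loop and collect what the progress bar would display."""
    global_step = start_step  # a resumed run starts from the saved step
    shown = []
    while global_step < max_steps:
        for _batch in range(batches_per_epoch):  # batch index restarts each epoch
            if global_step >= max_steps:
                break
            global_step += 1  # manual update: one tick per global step
            shown.append(progress_text(global_step, max_steps))
    return shown


# Resume at step 100 with a goal of 106 steps and 4 batches per epoch.
shown = simulate(start_step=100, max_steps=106, batches_per_epoch=4)
```

With a library such as tqdm, the same idea maps to creating the bar with `total=max_steps` and `initial=start_step`, then calling `update(1)` per global step instead of wrapping the batch iterator.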
PowerShell checkpoint path quoting: Fixed the checkpoint path being double-quoted in run_pipeline.ps1
Changed from building --resume "path" string to passing path variable directly
Resolves "unrecognized arguments" error when resuming
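The release notes do not show the failing command, so the following only illustrates one way an "unrecognized arguments" error of this kind can arise. It assumes train.py parses its flags with `argparse`, which is not confirmed by the source, and the checkpoint path is made up. When the flag and a quoted path arrive fused into a single argument, `argparse` does not recognize it as `--resume`.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--resume")  # hypothetical stand-in for train.py's flag

# Flag and quoted path fused into one argument, literal quotes included.
_, extras_bad = parser.parse_known_args(['--resume "checkpoints/step_100.pt"'])

# Flag and path passed as two separate arguments.
args_good, extras_good = parser.parse_known_args(
    ["--resume", "checkpoints/step_100.pt"]
)
```

Here `extras_bad` holds the fused string, which `parse_args` would report as an unrecognized argument, while the second call populates `args_good.resume` with the path.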
Ctrl+C interrupt handling: Fixed the checkpoint not being saved when pressing Ctrl+C during training
Added check_interrupt callback to trainer's batch loop
Trainer now checks for interrupt on every batch iteration
Saves checkpoint immediately when interrupt detected
Prevents duplicate interrupt messages
Avoids saving multiple checkpoints (epoch + step + interrupt)
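The callback-based interrupt check might look roughly like this. Only the name `check_interrupt` comes from the notes above; `InterruptFlag`, `install`, `train`, and the `save_checkpoint` signature are hypothetical. The signal handler only sets a flag, and the batch loop polls it so that exactly one checkpoint is written before exiting.

```python
import signal


class InterruptFlag:
    """Callable flag that a SIGINT handler can set (hypothetical helper)."""

    def __init__(self):
        self.requested = False

    def handler(self, signum, frame):
        self.requested = True  # record the request; the loop does the saving

    def __call__(self):
        return self.requested


def install(flag):
    """Route Ctrl+C to the flag instead of raising KeyboardInterrupt."""
    signal.signal(signal.SIGINT, flag.handler)


def train(batches, check_interrupt, save_checkpoint):
    """Process batches; on interrupt, save one checkpoint and stop."""
    step = 0
    for _batch in batches:
        if check_interrupt():  # polled on every batch iteration
            save_checkpoint("interrupt", step)
            return step  # skip any epoch or interval checkpoint
        step += 1
    save_checkpoint("final", step)
    return step


flag = InterruptFlag()
saved = []


def batches():
    for i in range(10):
        if i == 3:
            flag.requested = True  # stands in for the user pressing Ctrl+C
        yield i


stopped_at = train(batches(), flag, lambda kind, step: saved.append((kind, step)))
```

In a real script, `install(flag)` would be called once before training starts. The demo sets the flag directly so that it runs without sending a signal.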
Memory-mapped file cleanup: Fixed the "file in use" warning when exiting training
Properly closes mmap file before unlinking
Silently ignores cleanup errors (OS temp directory handles cleanup)
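The close-before-unlink order can be sketched with the standard library. This is an assumption-laden illustration, since the project's actual mmap handling is not shown in these notes, and the function name `use_and_cleanup` is hypothetical.

```python
import mmap
import os
import tempfile


def use_and_cleanup():
    """Map a temp file, use it, then close the mapping before deleting the file."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "r+b") as f:
        f.write(b"\x00" * 4096)
        f.flush()
        mm = mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file
        try:
            mm[:4] = b"data"
            head = bytes(mm[:4])
        finally:
            mm.close()  # release the mapping before the file is removed
    try:
        os.unlink(path)
    except OSError:
        pass  # leave the file for the OS temp-directory cleanup
    return head, os.path.exists(path)


head, still_exists = use_and_cleanup()
```

On Windows, unlinking a file that is still mapped typically fails, which is why the mapping and the file handle are both closed first.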
Changed
Progress bar display: Training progress now shows global steps throughout the entire training session
Before: Showed batches (0-166478 per epoch), reset each epoch
After: Shows global steps (100-80000), continuous across epochs
Time estimate shows seconds per global step instead of seconds per batch
Makes progress tracking more intuitive for long training runs
Interrupt checkpoint priority: When interrupted, training now saves only one checkpoint instead of several
Interrupt checkpoint takes priority over epoch and interval checkpoints
Skips epoch checkpoint and validation when interrupted
Reduces disk I/O and saves time when exiting
Clean, fast exit on Ctrl+C
Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
Keeps 5 recent auto-save checkpoints (every 100 global steps)
Automatically deletes older checkpoints to save disk space
Keeping 5 instead of 3 checkpoints uses ~7.4GB more disk space (2 extra checkpoints at 3.68GB each)
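A retention policy like the one above could be implemented along these lines. The `step_<N>.pt` naming scheme and the `prune_checkpoints` function are assumptions made for the example; the actual file names used by the project are not given in these notes.

```python
import os
import re
import tempfile

STEP_PATTERN = re.compile(r"step_(\d+)\.pt$")  # hypothetical naming scheme


def prune_checkpoints(directory, keep=5):
    """Delete all but the `keep` highest-step checkpoints; return the kept names."""
    found = []
    for name in os.listdir(directory):
        match = STEP_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    found.sort()  # numeric order by step, oldest first
    stale = found[:-keep] if keep > 0 else found
    for _step, name in stale:
        os.remove(os.path.join(directory, name))
    return [name for _step, name in found[len(stale):]]


with tempfile.TemporaryDirectory() as tmp:
    for step in range(100, 1100, 100):  # auto-saves at steps 100..1000
        open(os.path.join(tmp, f"step_{step}.pt"), "w").close()
    open(os.path.join(tmp, "atlas_best.pt"), "w").close()  # never matched, never pruned
    kept = prune_checkpoints(tmp, keep=5)
    remaining = sorted(os.listdir(tmp))
```

Sorting by the parsed step number rather than by file name matters here, because a plain string sort would place `step_1000.pt` before `step_600.pt`.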