v1.2.0

juliuspleunes4 released this 07 Dec 11:34
· 14 commits to main since this release

[v1.2.0] - 2025-12-07 - Checkpoint Resume & Progress Tracking Fixes

Added

  • Best model tracking without validation: Now saves atlas_best.pt based on training loss when no validation data is provided
    • Tracks best training loss throughout training
    • Saves checkpoint whenever training loss improves
    • Logs improvement percentage
    • Ensures best model is always available even without validation set
  • New test for best model checkpoint: Added comprehensive test for training-loss-based best model tracking (total: 325 passing tests)
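
The training-loss-based tracking described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the BestLossTracker class and the save_fn callback are hypothetical names, and save_fn stands in for whatever writes atlas_best.pt.

```python
import math
from typing import Callable, List, Optional


class BestLossTracker:
    """Hypothetical helper: remembers the lowest training loss seen so far."""

    def __init__(self, save_fn: Callable[[float], None]) -> None:
        self.best_loss = math.inf
        self.save_fn = save_fn  # assumption: would write atlas_best.pt

    def update(self, loss: float) -> Optional[float]:
        """Return the improvement in percent if `loss` is a new best, else None."""
        # `not (loss < best)` also rejects NaN losses.
        if not loss < self.best_loss:
            return None
        # The very first loss has no baseline, so report 0% improvement.
        if math.isinf(self.best_loss):
            improvement = 0.0
        else:
            improvement = (self.best_loss - loss) / self.best_loss * 100.0
        self.best_loss = loss
        self.save_fn(loss)
        return improvement


saved: List[float] = []
tracker = BestLossTracker(save_fn=saved.append)
for step_loss in [2.0, 1.5, 1.8, 1.2]:
    pct = tracker.update(step_loss)
    if pct is not None:
        print(f"New best training loss {step_loss:.4f} ({pct:.1f}% improvement)")
```

In this sketch a save happens only when the loss strictly improves, so the loss of 1.8 in the loop above is skipped.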

Fixed

  • Checkpoint resume double-prompt: Fixed pipeline script asking twice about resuming from checkpoint
    • Pipeline now passes --no-resume flag when user chooses not to resume
    • Eliminates duplicate prompts between pipeline and train.py
  • Epoch counter on resume: Fixed epoch display jumping to next epoch when resuming from checkpoint
    • Resuming from epoch 1, step 100 now correctly shows "EPOCH 1" instead of "EPOCH 2"
    • Initializes epoch to start_epoch - 1 so that the loop's increment lands on the correct epoch
  • Progress bar tracking: Fixed progress bar to show global steps instead of batches
    • Progress bar now displays correct position when resuming (e.g., 100/80000)
    • Eliminates confusing batch counter that resets each epoch
    • Shows consistent progress toward max_steps goal throughout training
    • Manual progress bar updates prevent jumping between batch and step counts
  • PowerShell checkpoint path quoting: Fixed checkpoint path being double-quoted in run_pipeline.ps1
    • Changed from building --resume "path" string to passing path variable directly
    • Resolves "unrecognized arguments" error when resuming
  • Ctrl+C interrupt handling: Fixed checkpoint not saving when pressing Ctrl+C during training
    • Added check_interrupt callback to trainer's batch loop
    • Trainer now checks for interrupt on every batch iteration
    • Saves checkpoint immediately when interrupt detected
    • Prevents duplicate interrupt messages
    • Avoids saving multiple checkpoints (epoch + step + interrupt)
  • Memory-mapped file cleanup: Fixed "file in use" warning when exiting training
    • Properly closes mmap file before unlinking
    • Silently ignores cleanup errors (OS temp directory handles cleanup)
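
As an illustration of the interrupt fix above, the sketch below shows a batch loop that polls a check_interrupt callback on every iteration and saves exactly one checkpoint when it fires. Only the check_interrupt name comes from these notes; the train function, its parameters, and the checkpoint labels are assumptions made for the example.

```python
from typing import Callable, Iterable, List, Tuple


def train(
    batches: Iterable[int],
    check_interrupt: Callable[[], bool],
    save_checkpoint: Callable[[str, int], None],
    start_step: int = 0,
    save_interval: int = 100,
) -> int:
    """Toy batch loop (hypothetical); returns the global step reached."""
    global_step = start_step
    for _batch in batches:
        # Polled on every batch so an interrupt is noticed promptly.
        if check_interrupt():
            # Save once and return, skipping interval and epoch checkpoints.
            save_checkpoint("interrupt", global_step)
            return global_step
        global_step += 1  # stand-in for one optimizer step
        if global_step % save_interval == 0:
            save_checkpoint("step", global_step)
    save_checkpoint("epoch", global_step)
    return global_step


saved: List[Tuple[str, int]] = []
calls = {"n": 0}


def fake_interrupt() -> bool:
    """Simulates Ctrl+C arriving while batch 251 is about to run."""
    calls["n"] += 1
    return calls["n"] > 250


final_step = train(
    range(1000),
    fake_interrupt,
    lambda kind, step: saved.append((kind, step)),
)
print(final_step, saved)
```

In a real script, check_interrupt would more likely read a flag set by a handler registered with signal.signal(signal.SIGINT, ...); the counter here only makes the example deterministic.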

Changed

  • Progress bar display: Training progress now shows global steps throughout entire training session
    • Before: Showed batches (0-166478 per epoch), reset each epoch
    • After: Shows global steps (100-80000), continuous across epochs
    • Time estimate shows seconds per global step instead of seconds per batch
    • Makes progress tracking more intuitive for long training runs
  • Interrupt checkpoint priority: When interrupted, only saves one checkpoint instead of multiple
    • Interrupt checkpoint takes priority over epoch and interval checkpoints
    • Skips epoch checkpoint and validation when interrupted
    • Reduces disk I/O and saves time when exiting
    • Clean, fast exit on Ctrl+C
  • Auto-save checkpoint retention: Increased from 3 to 5 most recent step-based checkpoints
    • Keeps the 5 most recent auto-save checkpoints (one is written every 100 global steps)
    • Automatically deletes older checkpoints to save disk space
    • Retaining 5 checkpoints instead of 3 takes ~7.4GB of additional disk space (2 × 3.68GB per checkpoint)
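
The retention policy above could be implemented along these lines. This is a sketch under stated assumptions: the step_<N>.pt naming scheme and the prune_step_checkpoints function are hypothetical, chosen only to show how the newest 5 step-based checkpoints might be kept while other files such as atlas_best.pt are left alone.

```python
import re
import tempfile
from pathlib import Path
from typing import List

# Assumption: step-based checkpoints are named like step_100.pt.
STEP_PATTERN = re.compile(r"^step_(\d+)\.pt$")


def prune_step_checkpoints(directory: Path, keep: int = 5) -> List[Path]:
    """Delete all but the `keep` highest-step checkpoints; return deleted paths."""
    found = []
    for path in directory.iterdir():
        match = STEP_PATTERN.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    found.sort(key=lambda item: item[0])  # ascending by step number
    # Everything except the last `keep` entries is stale.
    stale = [path for _, path in found[: max(len(found) - keep, 0)]]
    for path in stale:
        path.unlink()
    return stale


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)
    for step in range(100, 1100, 100):  # steps 100, 200, ..., 1000
        (ckpt_dir / f"step_{step}.pt").touch()
    (ckpt_dir / "atlas_best.pt").touch()  # not step-based, must survive
    deleted = prune_step_checkpoints(ckpt_dir, keep=5)
    remaining = sorted(p.name for p in ckpt_dir.iterdir())
    print(remaining)
```

Sorting by the parsed step number rather than by filename matters here, since step_1000.pt would otherwise sort before step_600.pt.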