v1.2.4

Latest

juliuspleunes4 released this 07 Dec 18:37

2c78cf0

[v1.2.4] - 2025-12-07 - Persistent Best Checkpoint Tracking Across Sessions

Fixed

Persistent best checkpoint tracking: Fixed best checkpoint being overwritten by worse models across training sessions
- Root cause: CheckpointManager initialized best_metric = float('inf') on every startup without checking for an existing atlas_best.pt
- This allowed a critical regression: Session 1 saves loss 3.88 as best → Session 2 resumes from a checkpoint with loss 5.0 → Session 2 treats loss 4.5 as "best" → the 3.88 checkpoint is overwritten with the 4.5 one
- CheckpointManager now loads existing best checkpoint on initialization:
  - Checks for {model_name}_best.pt file existence in checkpoint directory
  - Loads existing best checkpoint with torch.load(..., weights_only=False)
  - Extracts metadata['loss'] to initialize self.best_metric instead of float('inf')
  - Logs "Found existing best checkpoint with loss: X.XXXX" when found
- train.py now initializes from persistent best metric:
  - Primary source: best_train_loss = checkpoint_manager.best_metric (loaded from existing best)
  - Fallback: If resumed checkpoint has better loss, use that instead
  - Logs initialization source for transparency
- save_checkpoint now verifies improvement before overwriting:
  - Added comparison: if is_best and self.keep_best and metadata.loss < self.best_metric:
  - Only saves atlas_best.pt if new loss is strictly better than existing best
  - Updates self.best_metric only after successful save
  - Logs "[BEST] Saved new best checkpoint (loss: X.XXXX)" when overwriting
- Best model now properly persists across all training sessions, preventing loss of superior checkpoints
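
The behaviour described above can be illustrated with a minimal sketch. It is not the project's code: BestCheckpointTracker, save_if_best, resolve_initial_best and simulate_two_sessions are hypothetical names, and a JSON file stands in for the torch.save / torch.load calls so the example runs with the standard library alone. The keep_best flag is omitted for brevity.

```python
import json
import math
import tempfile
from pathlib import Path


class BestCheckpointTracker:
    """Hypothetical stand-in for the best-checkpoint part of CheckpointManager."""

    def __init__(self, checkpoint_dir, model_name="atlas"):
        # Assumption: a JSON file replaces the real {model_name}_best.pt.
        self.best_path = Path(checkpoint_dir) / f"{model_name}_best.json"
        self.best_metric = math.inf
        # Seed best_metric from an existing best checkpoint instead of inf.
        if self.best_path.exists():
            saved = json.loads(self.best_path.read_text())
            self.best_metric = saved["metadata"]["loss"]
            print(f"Found existing best checkpoint with loss: {self.best_metric:.4f}")

    def save_if_best(self, state, loss):
        """Write the best checkpoint only on strict improvement."""
        if not loss < self.best_metric:
            return False
        payload = {"state": state, "metadata": {"loss": loss}}
        self.best_path.write_text(json.dumps(payload))
        # Update the in-memory metric only after the write has succeeded.
        self.best_metric = loss
        print(f"[BEST] Saved new best checkpoint (loss: {loss:.4f})")
        return True


def resolve_initial_best(persisted_best, resumed_loss=None):
    """Start from the persisted best; use the resumed loss only if it is lower."""
    if resumed_loss is not None and resumed_loss < persisted_best:
        return resumed_loss
    return persisted_best


def simulate_two_sessions():
    """Replay the scenario from the notes: 3.88 is saved, 4.5 is offered later."""
    with tempfile.TemporaryDirectory() as tmp:
        session_1 = BestCheckpointTracker(tmp)
        first = session_1.save_if_best({"step": 100}, 3.88)
        session_2 = BestCheckpointTracker(tmp)  # new process, same directory
        start = resolve_initial_best(session_2.best_metric, resumed_loss=5.0)
        second = session_2.save_if_best({"step": 150}, 4.5)
        return first, second, start, session_2.best_metric
```

In this sketch the second session starts from 3.88 rather than infinity, so the 4.5 candidate is rejected and the earlier best file is left untouched.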

[v1.2.3] - 2025-12-07 - Best Checkpoint Tracking for Step-Based Saves

Fixed

Best checkpoint creation at step intervals: Fixed best model checkpoint not being created during mid-epoch step saves
- Root cause: Best loss tracking (is_best flag) was only calculated at epoch boundaries, not at step-based checkpoints
- Now tracks and updates best_train_loss in the step checkpoint callback using nonlocal
- Best loss check happens on every training step, not just at epoch end
- atlas_best.pt is now created immediately when loss improves, even mid-epoch
- Logs "[BEST] NEW BEST TRAINING LOSS" with improvement percentage when new best is found
- Step checkpoint metadata now includes best_metric field with current best loss
- Ensures best model is always preserved regardless of checkpoint cleanup (keep_checkpoints setting)
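
As a rough illustration of the nonlocal pattern mentioned above, the following sketch keeps the best loss in an enclosing scope and updates it from inside a step callback. The function names, the metadata layout apart from the best_metric field, and the exact log format are assumptions, not taken from train.py.

```python
import math


def make_step_checkpoint_callback(initial_best=math.inf):
    """Return (callback, get_best). Hypothetical helper, not from train.py."""
    best_train_loss = initial_best

    def on_step_checkpoint(step, loss):
        # nonlocal lets the callback rebind the variable of the enclosing scope.
        nonlocal best_train_loss
        is_best = loss < best_train_loss
        if is_best:
            # Report a percentage only when a finite, positive previous best exists.
            if math.isfinite(best_train_loss) and best_train_loss > 0:
                pct = (best_train_loss - loss) / best_train_loss * 100
                print(f"[BEST] NEW BEST TRAINING LOSS: {loss:.4f} ({pct:.2f}% improvement)")
            best_train_loss = loss
        # Step checkpoint metadata carries the current best loss.
        return {"step": step, "loss": loss, "best_metric": best_train_loss, "is_best": is_best}

    def get_best():
        return best_train_loss

    return on_step_checkpoint, get_best
```

A caller would pass the returned is_best flag on to the checkpoint save, so a best checkpoint can be written mid-epoch instead of waiting for the epoch boundary.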
