
Checkpoint saving not working as expected #1645

Closed
roybenhayun opened this issue Oct 11, 2023 · 0 comments · Fixed by #1647
Labels
trainers PyTorch Lightning trainers
Comments

roybenhayun commented Oct 11, 2023

Description

After migrating to release 0.5.0, I noticed that checkpoint saving is not working as expected.

description

Tried different configurations, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1), when running 20-30 epochs to train a model.
After training completed, the expected ckpt files could not be found. All that was found was a single ckpt file from the first epoch only, in the wrong directory.
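
For reference, this is roughly how the callback was wired up (a minimal sketch only; the task class, its arguments, and the data module are placeholders rather than the exact code from my project):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from torchgeo.trainers import ClassificationTask

ckpt_dir = r"C:\foo"  # directory where the checkpoints are expected to land

# user-configured checkpoint callback, as described above
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    dirpath=ckpt_dir,
    save_last=True,
    every_n_epochs=1,
    save_top_k=1,
)

# placeholder torchgeo task and arguments, for illustration only
task = ClassificationTask(model="resnet18", num_classes=10)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=30)
# trainer.fit(task, datamodule=...)  # data module omitted here
```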

severity

The bug is very limiting. For example, after hours of training a model, there is no way to load the model from a checkpoint to run inference; the only chance to run inference is during the same training run.

expected behavior

Using the given configuration, expected to see:

  • checkpoint files saved every n epochs (per every_n_epochs)
  • a checkpoint file for the last epoch (per save_last=True)
  • the checkpoints saved to the given directory (per dirpath)

observed behavior

  • after training for several epochs, only the first epoch's checkpoint was saved
  • that single checkpoint was saved to a different directory, under the logger output

initial investigation

  1. the checkpoint callback is created and trainer.fit is called

  2. later (see the screenshot and call stack), the ModelCheckpoint constructor appears to be called again, this time with save_last=None
    [screenshot of the call stack]

  3. when the last-checkpoint save is later supposed to happen, save_last is None:
    [screenshot]

  4. saving of the last checkpoint is therefore skipped
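
My current working theory (not yet confirmed): the torchgeo task itself returns a default ModelCheckpoint from configure_callbacks, and per the Lightning docs a callback returned there takes priority over Trainer callbacks of the same type, so the user-configured one (save_last=True, dirpath=ckpt_dir) gets superseded by a default one where save_last is None. A minimal sketch of that mechanism, using a dummy module instead of the real torchgeo task:

```python
from lightning.pytorch import LightningModule
from lightning.pytorch.callbacks import ModelCheckpoint


class SomeTask(LightningModule):  # stand-in for the torchgeo task
    def configure_callbacks(self):
        # a ModelCheckpoint created here leaves save_last at its default (None)
        # and dirpath unset, so Lightning falls back to the logger's directory
        return [ModelCheckpoint(monitor="val_loss")]


# Lightning gives callbacks returned from configure_callbacks priority over
# same-type callbacks passed to Trainer(callbacks=[...]), which would explain
# both the missing last.ckpt and the checkpoints landing under the logger output.
```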

Steps to reproduce

  1. create a checkpoint callback with explicit checkpoint-saving parameters, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1)
  2. call trainer.fit and run several epochs
  3. check the expected results (see the check script after this list):
  • checkpoints were saved in the expected location, e.g., under C:\foo
  • the last-epoch checkpoint was saved - a last.ckpt file must exist
  • the expected number of checkpoints was saved, e.g., one every 2 epochs, etc.
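
To make step 3 concrete, here is a small check script (assuming ckpt_dir is the same directory that was passed to ModelCheckpoint as dirpath):

```python
from pathlib import Path

ckpt_dir = Path(r"C:\foo")  # same dirpath passed to ModelCheckpoint

# list everything that was actually written to the configured directory
ckpts = sorted(ckpt_dir.glob("*.ckpt"))
print(f"found {len(ckpts)} checkpoint(s) in {ckpt_dir}")
for ckpt in ckpts:
    print(" ", ckpt.name)

# with save_last=True a last.ckpt must exist in the configured directory
assert (ckpt_dir / "last.ckpt").exists(), "last.ckpt is missing"
```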

Version

torchgeo version 0.5.0, lightning version 2.0.9
