After migrating to release 0.5.0, I noticed that checkpoint saving is not working as expected.
Description
I tried several configurations, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1), when training a model for 20-30 epochs.
After training completed, the expected ckpt files could not be found. All that was found was a single ckpt file from the first epoch only, in the wrong directory.
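For reference, a minimal sketch of how the callback was wired up, assuming lightning 2.0.x; ckpt_dir, max_epochs, and the commented-out fit call are illustrative stand-ins, not taken verbatim from the original run:

```python
# Minimal sketch of the setup described above (assumes lightning 2.0.x).
# ckpt_dir and max_epochs are illustrative placeholders.
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

ckpt_dir = "C:/foo"  # hypothetical target directory
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    dirpath=ckpt_dir,
    save_last=True,
    every_n_epochs=1,
    save_top_k=1,
)
trainer = Trainer(max_epochs=20, callbacks=[checkpoint_callback])
# trainer.fit(task, datamodule=datamodule)  # task/datamodule not shown here
```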
Severity
This bug is very limiting. For example, after hours of training a model, there is no way to load it from a checkpoint to run inference; the only opportunity to run inference is during the same run.
Expected behavior
With the given configuration, I expected to see (a sketch encoding these checks follows the list):
- checkpoint files saved at the configured epoch interval
- a checkpoint file for the last epoch
- all checkpoints saved to the given directory
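A hypothetical helper encoding these expectations as checks; the function name and messages are mine, not from the report. Note that with save_top_k=1, Lightning keeps only the single best monitored checkpoint alongside last.ckpt:

```python
import os

def check_expected_checkpoints(ckpt_dir: str) -> None:
    """Hypothetical helper encoding the expectations listed above."""
    files = os.listdir(ckpt_dir)
    # save_last=True should write last.ckpt into the user-supplied dirpath
    assert "last.ckpt" in files, f"last.ckpt missing from {ckpt_dir}: {files}"
    # save_top_k=1 keeps exactly one monitored checkpoint alongside last.ckpt
    monitored = [f for f in files if f.endswith(".ckpt") and f != "last.ckpt"]
    assert len(monitored) == 1, f"expected one top-k checkpoint, found {monitored}"
```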
Observed behavior
After training for several epochs, only the first epoch's checkpoint was saved.
The single checkpoint was saved to a different directory, under the logger output.
Initial investigation
The checkpoint callback is created and Trainer.fit is called. Later (see the image and call stack), the ModelCheckpoint constructor appears to be called a second time with save_last=None. When the last checkpoint is later supposed to be saved, save_last is None, so saving of the last checkpoint is skipped.
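One way to confirm this suspicion is to inspect the checkpoint callbacks the Trainer actually ends up using; this probe is a hypothetical diagnostic I am suggesting, not code from the report:

```python
# Hypothetical probe: list the checkpoint callbacks the Trainer actually uses.
# If the constructor is re-invoked internally with save_last=None, the
# effective callback printed here will differ from the one passed in.
for cb in trainer.checkpoint_callbacks:
    print(type(cb).__name__, "save_last =", cb.save_last, "dirpath =", cb.dirpath)
```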
Steps to reproduce
1. Create a checkpoint callback with various saving parameters, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1).
2. Call trainer.fit and run several epochs.
3. Check the expected results (a full reproduction sketch follows this list):
   - the saving location is as expected, e.g., under C:\foo
   - the last epoch checkpoint was saved (a last.ckpt file must exist)
   - the number of saved checkpoints matches the configuration, e.g., every 2 epochs, etc.
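Putting the steps together, a self-contained reproduction sketch; the model/datamodule pair is a stand-in (any LightningModule that logs val_loss should do), and the paths are illustrative:

```python
import os
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint

ckpt_dir = "C:/foo"  # illustrative; use any writable directory
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    dirpath=ckpt_dir,
    save_last=True,
    every_n_epochs=1,
    save_top_k=1,
)
trainer = Trainer(max_epochs=5, callbacks=[checkpoint_callback])
# trainer.fit(model, datamodule=datamodule)  # any module logging val_loss

# After training, inspect what was actually written and where.
saved = os.listdir(ckpt_dir) if os.path.isdir(ckpt_dir) else []
print(f"checkpoints in {ckpt_dir}: {saved}")  # expected: last.ckpt + best ckpt
```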
Version
torchgeo version 0.5.0, lightning version 2.0.9