
Checkpoint saving not working as expected #1645

Closed
roybenhayun opened this issue Oct 11, 2023 · 0 comments · Fixed by #1647
Labels
trainers PyTorch Lightning trainers
Comments

roybenhayun commented Oct 11, 2023

Description

After migrating to release 0.5.0, I noticed that checkpoint saving is not working as expected.

description

Tried different configurations, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1), when running 20-30 epochs to train a model.
After training completed, the expected ckpt files could not be found. All that was found was a single ckpt file from the first epoch only, in the wrong directory.
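
For reference, this is roughly how the callback was wired up (a minimal sketch only; the task class, its arguments, and the data module are placeholders rather than the exact code from my project):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from torchgeo.trainers import ClassificationTask

ckpt_dir = r"C:\foo"  # directory where the checkpoints are expected to land

# user-configured checkpoint callback, as described above
checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",
    dirpath=ckpt_dir,
    save_last=True,
    every_n_epochs=1,
    save_top_k=1,
)

# placeholder torchgeo task and arguments, for illustration only
task = ClassificationTask(model="resnet18", num_classes=10)

trainer = Trainer(callbacks=[checkpoint_callback], max_epochs=30)
# trainer.fit(task, datamodule=...)  # data module omitted here
```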

severity

The bug is very limiting. For example, after hours of training a model, there is no way to load the model from a checkpoint to run inference; the only chance to run inference is during the same training run.

expected behavior

Using the given configuration, expected to see:

  • checkpoint files saved every n epochs (per every_n_epochs)
  • a checkpoint file for the last epoch (per save_last=True)
  • the checkpoints saved to the given directory (per dirpath)

observed behavior

  • after training for several epochs, only the first epoch's checkpoint was saved
  • that single checkpoint was saved to a different directory, under the logger output

initial investigation

  1. the checkpoint callback is created and trainer.fit is called

  2. later (see the screenshot and call stack), the ModelCheckpoint constructor appears to be called again, this time with save_last=None
    [screenshot of the call stack]

  3. when the last-checkpoint save is later supposed to happen, save_last is None:
    [screenshot]

  4. saving of the last checkpoint is therefore skipped
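
My current working theory (not yet confirmed): the torchgeo task itself returns a default ModelCheckpoint from configure_callbacks, and per the Lightning docs a callback returned there takes priority over Trainer callbacks of the same type, so the user-configured one (save_last=True, dirpath=ckpt_dir) gets superseded by a default one where save_last is None. A minimal sketch of that mechanism, using a dummy module instead of the real torchgeo task:

```python
from lightning.pytorch import LightningModule
from lightning.pytorch.callbacks import ModelCheckpoint


class SomeTask(LightningModule):  # stand-in for the torchgeo task
    def configure_callbacks(self):
        # a ModelCheckpoint created here leaves save_last at its default (None)
        # and dirpath unset, so Lightning falls back to the logger's directory
        return [ModelCheckpoint(monitor="val_loss")]


# Lightning gives callbacks returned from configure_callbacks priority over
# same-type callbacks passed to Trainer(callbacks=[...]), which would explain
# both the missing last.ckpt and the checkpoints landing under the logger output.
```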

Steps to reproduce

  1. create a checkpoint callback with explicit checkpoint-saving parameters, e.g., checkpoint_callback = ModelCheckpoint(monitor="val_loss", dirpath=ckpt_dir, save_last=True, every_n_epochs=1, save_top_k=1)
  2. call trainer.fit and run several epochs
  3. check the expected results (see the check script after this list):
  • checkpoints were saved in the expected location, e.g., under C:\foo
  • the last-epoch checkpoint was saved - a last.ckpt file must exist
  • the expected number of checkpoints was saved, e.g., one every 2 epochs, etc.
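
To make step 3 concrete, here is a small check script (assuming ckpt_dir is the same directory that was passed to ModelCheckpoint as dirpath):

```python
from pathlib import Path

ckpt_dir = Path(r"C:\foo")  # same dirpath passed to ModelCheckpoint

# list everything that was actually written to the configured directory
ckpts = sorted(ckpt_dir.glob("*.ckpt"))
print(f"found {len(ckpts)} checkpoint(s) in {ckpt_dir}")
for ckpt in ckpts:
    print(" ", ckpt.name)

# with save_last=True a last.ckpt must exist in the configured directory
assert (ckpt_dir / "last.ckpt").exists(), "last.ckpt is missing"
```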

Version

torchgeo version 0.5.0, lightning version 2.0.9
