
checkpoints not saved due to wrong loss comparison? #9168

Closed
riqiang-dp opened this issue May 11, 2024 · 2 comments
Labels
bug · stale

Comments

@riqiang-dp

Describe the bug

I'm using val_loss as the criterion to compare checkpoints and save the top k. However, unlike WER, the comparison seems to go wrong: during training I see the message 'val_loss' was not in top {k}, but when I then check the checkpoints directory, the latest model's val_loss is clearly within the top k. An example is shown in the screenshot below (the files are sorted by name, and the fact that this latest checkpoint sorts between two other kept checkpoints means its loss is at least better than the kth-best checkpoint):
[Screenshot: checkpoints directory sorted by filename; the newest checkpoint's val_loss falls between two retained checkpoints]
I first saw this about a year ago and didn't think much of it; I read through the PyTorch Lightning checkpoint code and didn't find anything obviously wrong there. For WER the top-k comparison seems to work fine. Since I'm still running into this bug, I'm not sure where else to start debugging.
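
As a starting point for debugging, here is a minimal sketch (not part of the original report, and assuming a recent PyTorch Lightning where best_k_models, kth_value and best_model_score are public attributes of ModelCheckpoint) that prints the checkpoint callback's internal top-k bookkeeping after every validation run, so the value actually used in the comparison is visible whenever a save is skipped:

    # Sketch only: dump the ModelCheckpoint top-k state after each validation run.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    class TopKDebugCallback(pl.Callback):
        def on_validation_end(self, trainer, pl_module):
            for cb in trainer.checkpoint_callbacks:
                if not isinstance(cb, ModelCheckpoint):
                    continue
                # Value logged for the monitored metric in this validation run.
                current = trainer.callback_metrics.get(cb.monitor)
                print(
                    f"[top-k debug] {cb.monitor}={current} "
                    f"kth_value={cb.kth_value} best={cb.best_model_score}"
                )
                # Scores the callback currently considers "top k".
                for path, score in cb.best_k_models.items():
                    print(f"  kept: {float(score):.4f}  {path}")

Comparing the printed current value against kth_value at the moment a save is skipped should show whether the callback is comparing against a different number than the one encoded in the checkpoint filenames.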

Steps/Code to reproduce bug

exp_manager:
  exp_dir: null
  name: ${name}
  version: trial_1
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: "val_loss"
    mode: "min"
    save_top_k: 15
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints

  resume_if_exists: true
  resume_ignore_no_checkpoint: true
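
To help isolate whether the skipped save comes from NeMo's exp_manager wrapper or from Lightning's own top-k logic, the settings above can be expressed as a bare PyTorch Lightning ModelCheckpoint (a sketch; the model and datamodule are placeholders, not from this report):

    # Sketch: equivalent plain PyTorch Lightning checkpoint configuration.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        monitor="val_loss",                 # metric compared for top-k
        mode="min",                         # lower val_loss is better
        save_top_k=15,                      # keep the 15 best checkpoints
        filename="{epoch}-{val_loss:.4f}",  # encode the monitored value in the name
    )

    trainer = Trainer(callbacks=[checkpoint_callback])
    # trainer.fit(model, datamodule=dm)  # model / dm are placeholders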

Expected behavior

The checkpoint shown in the screenshot should be saved as one of the top-k checkpoints.

Environment overview (please complete the following information)

  • Environment location: GCP
  • Method of NeMo install: from source

Environment details

  • PyTorch version: 2.1
  • Python version: 3.11

Additional context

@riqiang-dp added the bug label May 11, 2024
@github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions bot added the stale label Jun 10, 2024
@github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 17, 2024