
checkpoints not saved due to wrong loss comparison? #9168

Closed
riqiang-dp opened this issue May 11, 2024 · 2 comments
Labels
bug · stale

Comments

@riqiang-dp

Describe the bug

I'm using val_loss as the criterion to compare checkpoints and save the top k. However, unlike WER, the comparison seems to go wrong: during training I see the message 'val_loss' was not in top {k}, but when I then check the checkpoints directory, the latest model's val_loss is clearly within the top k. An example is shown in the screenshot below (the files are sorted by name, and the fact that this latest checkpoint sorts between two other kept checkpoints means its loss is at least better than the kth-best checkpoint):
[Screenshot: checkpoints directory sorted by filename; the newest checkpoint's val_loss falls between two retained checkpoints]
I first saw this about a year ago and didn't think much of it; I read through the PyTorch Lightning checkpoint code and didn't find anything obviously wrong there. For WER the top-k comparison seems to work fine. Since I'm still running into this bug, I'm not sure where else to start debugging.
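
As a starting point for debugging, here is a minimal sketch (not part of the original report, and assuming a recent PyTorch Lightning where best_k_models, kth_value and best_model_score are public attributes of ModelCheckpoint) that prints the checkpoint callback's internal top-k bookkeeping after every validation run, so the value actually used in the comparison is visible whenever a save is skipped:

    # Sketch only: dump the ModelCheckpoint top-k state after each validation run.
    import pytorch_lightning as pl
    from pytorch_lightning.callbacks import ModelCheckpoint

    class TopKDebugCallback(pl.Callback):
        def on_validation_end(self, trainer, pl_module):
            for cb in trainer.checkpoint_callbacks:
                if not isinstance(cb, ModelCheckpoint):
                    continue
                # Value logged for the monitored metric in this validation run.
                current = trainer.callback_metrics.get(cb.monitor)
                print(
                    f"[top-k debug] {cb.monitor}={current} "
                    f"kth_value={cb.kth_value} best={cb.best_model_score}"
                )
                # Scores the callback currently considers "top k".
                for path, score in cb.best_k_models.items():
                    print(f"  kept: {float(score):.4f}  {path}")

Comparing the printed current value against kth_value at the moment a save is skipped should show whether the callback is comparing against a different number than the one encoded in the checkpoint filenames.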

Steps/Code to reproduce bug

exp_manager:
  exp_dir: null
  name: ${name}
  version: trial_1
  create_tensorboard_logger: true
  create_checkpoint_callback: true
  checkpoint_callback_params:
    monitor: "val_loss"
    mode: "min"
    save_top_k: 15
    always_save_nemo: True # saves the checkpoints as nemo files instead of PTL checkpoints

  resume_if_exists: true
  resume_ignore_no_checkpoint: true
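
To help isolate whether the skipped save comes from NeMo's exp_manager wrapper or from Lightning's own top-k logic, the settings above can be expressed as a bare PyTorch Lightning ModelCheckpoint (a sketch; the model and datamodule are placeholders, not from this report):

    # Sketch: equivalent plain PyTorch Lightning checkpoint configuration.
    from pytorch_lightning import Trainer
    from pytorch_lightning.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        monitor="val_loss",                 # metric compared for top-k
        mode="min",                         # lower val_loss is better
        save_top_k=15,                      # keep the 15 best checkpoints
        filename="{epoch}-{val_loss:.4f}",  # encode the monitored value in the name
    )

    trainer = Trainer(callbacks=[checkpoint_callback])
    # trainer.fit(model, datamodule=dm)  # model / dm are placeholders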

Expected behavior

The checkpoint shown in the screenshot should be saved as one of the top-k checkpoints.

Environment overview (please complete the following information)

  • Environment location: GCP
  • Method of NeMo install: from source

Environment details

  • PyTorch version: 2.1
  • Python version: 3.11

Additional context

@riqiang-dp added the bug label May 11, 2024
@github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions bot added the stale label Jun 10, 2024
@github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jun 17, 2024