Describe the bug
I'm using val_loss as the criterion to compare checkpoints and save the top k. However, unlike WER, there seems to be some miscalculation happening: during training I see

'val_loss' was not in top {k}

but when I check the checkpoints directory, the latest model's val_loss is clearly within the top k. An example is given in the image: the files are sorted by name, and the fact that this last checkpoint appears between two other checkpoints indicates its loss is at least better than the kth checkpoint's.

I saw this a year ago and didn't think much of it; I checked the PyTorch Lightning code at the time and didn't find anything odd there. For WER it seems to work fine. Since I'm still stumbling over this bug after all this time, I find it strange and don't know where else to start debugging.
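For reference, below is a minimal sketch of the kind of checkpoint setup this describes; the actual configuration isn't included in the report, so the metric name, k, and filename pattern are assumptions.

```python
# Assumed setup (not the reporter's actual config): a ModelCheckpoint callback
# that ranks checkpoints by val_loss and keeps only the k best.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    monitor="val_loss",                 # metric used to rank checkpoints
    mode="min",                         # lower val_loss is better
    save_top_k=5,                       # keep only the 5 best checkpoints
    filename="{epoch}-{val_loss:.4f}",  # val_loss ends up in the filename, as in the screenshot
)

trainer = pl.Trainer(callbacks=[checkpoint_callback], max_epochs=100)
# trainer.fit(model)
# The reported symptom: Lightning logs "'val_loss' was not in top k" even though
# the new checkpoint's val_loss is better than the current kth-best checkpoint's.
```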
Steps/Code to reproduce bug
Expected behavior
The model shown in the image should be saved among the top k checkpoints.
Environment overview (please complete the following information)
Environment details
Additional context