Save only the best checkpoint #1475

Closed
bernardohenz opened this issue Jul 24, 2018 · 8 comments

Comments

@bernardohenz
Contributor

Instead of using EarlyStop to avoid overfitting, I would like to save the model at the point where it had the best (lowest) validation loss. In other words, I would like to keep only the best checkpoint, as measured on the val/dev set.

I tried to use a wrapper for tf.train.Saver (this one: https://github.com/vonclites/checkmate), but couldn't make it work with DeepSpeech. Is there an easy way to do that (maybe using the MonitoredTrainingSession, as you are doing)?
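
For readers skimming the thread, the behaviour requested above can be sketched independently of TensorFlow. This is only a minimal illustration, not the attached patch and not DeepSpeech code: the class name `BestCheckpointKeeper` and the `save_fn` callback are hypothetical, and in a real setup `save_fn` would wrap whatever call actually writes the checkpoint.

```python
import math
from typing import Callable, Optional


class BestCheckpointKeeper:
    """Track the lowest validation loss seen so far and save only on improvement.

    Hypothetical helper for illustration; `save_fn` stands in for the real
    checkpoint-writing call (for example a wrapper around a saver object).
    """

    def __init__(self, save_fn: Callable[[int], None]) -> None:
        self.save_fn = save_fn
        self.best_loss = math.inf          # no validation result seen yet
        self.best_epoch: Optional[int] = None

    def update(self, epoch: int, val_loss: float) -> bool:
        """Save and return True if `val_loss` is strictly lower than the best so far."""
        if val_loss < self.best_loss:      # a NaN loss never compares lower, so it is skipped
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.save_fn(epoch)
            return True
        return False


if __name__ == "__main__":
    saved_epochs = []
    keeper = BestCheckpointKeeper(save_fn=saved_epochs.append)
    # Made-up validation losses, one per epoch, purely to exercise the logic.
    for epoch, loss in enumerate([0.9, 0.7, 0.8, 0.6, 0.65]):
        keeper.update(epoch, loss)
    print(saved_epochs, keeper.best_epoch, keeper.best_loss)
```

If `save_fn` always writes to the same path, each improvement overwrites the previous best, so only one checkpoint is kept on disk.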

@reuben
Contributor

reuben commented Jul 24, 2018

I looked into this briefly but couldn't find a clean way to implement it with MonitoredTrainingSession (which is IMO a terrible API). I ended up just writing a hack that works, but isn't really code we can land. I'm attaching the patch.

save_best_val.patch.txt

@bernardohenz
Contributor Author

Thanks @reuben. I had to change the name of the MonitoredTrainingSession to train_session for the patch to work. It is now working perfectly fine.

@lissyx
Collaborator

lissyx commented Oct 2, 2018

@bernardohenz What's the status here? Is the issue fixed, or do you have a workaround? Should we close this?

@bernardohenz
Contributor Author

@lissyx yes, the patch from @reuben worked just fine.

@reuben
Contributor

reuben commented Oct 2, 2018

We should have a proper solution for this in-tree. This would be too much work with the current training setup, but would probably be very simple if we used TF Eager, for example. Reopening so we don't forget.

@reuben reuben reopened this Oct 2, 2018
@reuben
Contributor

reuben commented Nov 17, 2018

Updated version of the patch is here: https://gist.github.com/reuben/dcc2deaf85568591e34ce363bc3bac2a

@tilmankamp
Contributor

#1988 added this feature. It checkpoints the epoch with the best validation result and supports different ways to resume training; see the load command-line parameter.
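
As a rough illustration of what "different ways to resume" could look like, here is a small sketch of choosing which checkpoint to restore. Everything in it is an assumption made for illustration: the mode names (`"last"`, `"best"`, `"init"`, `"auto"`), the function `resolve_checkpoint`, and its parameters are hypothetical and are not taken from this thread or confirmed against #1988.

```python
import os
from typing import Callable, Optional


def resolve_checkpoint(
    load_mode: str,
    last_path: str,
    best_path: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> Optional[str]:
    """Return the checkpoint path to restore from, or None to start from scratch.

    Mode names are assumed for illustration:
      "last" -> most recent checkpoint, "best" -> best validation checkpoint,
      "init" -> fresh model, "auto" -> try last, then best, then fresh.
    """
    if load_mode == "init":
        return None
    if load_mode == "last":
        candidates = [last_path]
    elif load_mode == "best":
        candidates = [best_path]
    elif load_mode == "auto":
        candidates = [last_path, best_path]
    else:
        raise ValueError(f"unknown load mode: {load_mode!r}")

    for path in candidates:
        if exists(path):
            return path
    if load_mode == "auto":
        return None  # nothing to resume from, so fall back to a fresh model
    raise FileNotFoundError(f"no checkpoint found for load mode {load_mode!r}")


if __name__ == "__main__":
    # Pretend only the best-validation checkpoint exists on disk.
    on_disk = {"ckpt/best"}
    print(resolve_checkpoint("auto", "ckpt/last", "ckpt/best", exists=on_disk.__contains__))
```

The `exists` parameter is there only so the selection logic can be exercised without touching the file system.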

@lock

lock bot commented May 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators May 2, 2019