
Interrupting and Resuming Training #2

Open
burnedsap opened this issue Aug 16, 2018 · 6 comments

@burnedsap

Is it possible to interrupt training, and then resume it later?

Best case scenario, I train it on one machine, interrupt, and then continue training on another. No GPU involved, CPU only.

Or is it possible to interrupt training and save a model immediately, before it has trained completely? When I interrupt now, no Models folder is created to save the model.

Thanks

@cvalenzuela
Member

Yes, it is possible. Here's the argument you need to pass to the script:

    parser.add_argument('--init_from', type=str, default=None,
                        help="""continue training from saved model at this path. Path must contain files saved by previous training process:
                            'config.pkl'        : configuration;
                            'chars_vocab.pkl'   : vocabulary definitions;
                            'checkpoint'        : paths to model file(s) (created by tf).
                                                  Note: this file contains absolute paths, be careful when moving files around;
                            'model.ckpt-*'      : file(s) with model definition (created by tf)
                        """)

See https://github.com/ml5js/training-lstm/blob/master/train.py#L67
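
Roughly, --init_from makes the script reload everything a previous run wrote to its save directory before training continues. Here's a minimal sketch of that restore step, assuming the TF 1.x session-based setup train.py uses; the function and variable names are illustrative, not the exact ones in the script:

    import os
    import pickle

    import tensorflow as tf  # TF 1.x, as used by train.py

    def restore_previous_run(sess, saver, init_from):
        """Reload the config, vocab, and latest weights saved by an earlier run."""
        with open(os.path.join(init_from, 'config.pkl'), 'rb') as f:
            saved_args = pickle.load(f)       # hyperparameters from the previous run
        with open(os.path.join(init_from, 'chars_vocab.pkl'), 'rb') as f:
            chars, vocab = pickle.load(f)     # vocabulary must match the old model
        ckpt = tf.train.get_checkpoint_state(init_from)  # reads the 'checkpoint' file
        saver.restore(sess, ckpt.model_checkpoint_path)  # loads the model.ckpt-* weights
        return saved_args, chars, vocab

On the command line that just means pointing --init_from at the folder where the previous run saved those files.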

@burnedsap
Author

Oh cool, thanks! The model.ckpt file isn't created, so I'm a little worried about whether it actually picks up from where it left off. It does say that it's loading the preprocessed files, but then training takes just as long as starting from scratch.

I assume this is because this process is usually meant for re-training models, rather than moving training from one machine to another like I'm trying to do?

@cvalenzuela
Member

You're right. The current code only saves the checkpoint when the model finishes training: https://github.com/ml5js/training-lstm/blob/master/train.py#L168

We should add support for saving the model every N iterations by default.
Here's the original code doing that: https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py#L133

We removed it at some point, but it makes sense to put it back now.
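
For reference, here's a minimal sketch of what that periodic save could look like inside a TF 1.x training loop. This is an assumption, not the repo's current code: save_every, save_dir, and the loop variables are illustrative (the original char-rnn-tensorflow exposes --save_every, which our train.py would need added back):

    import os

    import tensorflow as tf  # TF 1.x, as used by train.py

    def maybe_save(sess, saver, save_dir, step, save_every=1000):
        """Write a checkpoint every `save_every` steps so an interrupted run can resume via --init_from."""
        if step % save_every == 0:
            checkpoint_path = os.path.join(save_dir, 'model.ckpt')
            saver.save(sess, checkpoint_path, global_step=step)  # writes model.ckpt-<step>, updates 'checkpoint'
            print('model saved to {}'.format(checkpoint_path))

    # inside the existing epoch/batch loop, something like:
    #   saver = tf.train.Saver(tf.global_variables())
    #   ...
    #   maybe_save(sess, saver, save_dir, e * num_batches + b, save_every)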

@burnedsap
Author

Good spot!

There's no easy way to put it back in, is there? My knowledge of Python is limited, but it looks like we would have to rewrite that entire last bit to make it work, right?

@zachwhalen

Has this ability been implemented (or re-implemented)? It would be very useful, in the spirit of making machine learning accessible, to be able to start, stop, and resume training, given the enormous amounts of time training can take. For example, I was trying this just now on an admittedly pretty large input corpus (around 300MB), and if I understand how it's working, it's going to take several weeks of computer time to finish on the fastest computer I could find today. Obviously, finding a good GPU or maybe renting some cloud computing time would speed that up, but ¯\_(ツ)_/¯

@burnedsap
Author

Hello, so I tried to figure something out with my rudimentary Python, but I can't seem to get anywhere. There's some error with the args somewhere.

Any luck with this?
