
The agent hit the global step limit, how do I restore from checkpoints and resume training #62

Closed
syhdog opened this issue Jan 31, 2017 · 2 comments

Comments

@syhdog

syhdog commented Jan 31, 2017

I am working on an idea that requires a long training run, about 10 days. I forgot to modify the global step limit, so the agent stopped at the 100M step. I want to restore the model and continue training. I have been looking through the code and wondered what I should do.

Sincere thanks for opening up this project; it is very respectable work and has helped a lot with my research. Truly, we find it really enjoyable to develop agents. Thank you a lot.

@KaixiangLin

In worker.py, add these two lines to specify the variables you want to restore:

    # Collect the shared ("global"-scoped) variables and build a saver for them.
    variables_to_restore = [v for v in tf.all_variables() if v.name.startswith("global")]
    pre_train_saver = FastSaver(variables_to_restore)

Then add a restore call inside init_fn:

    def init_fn(ses):
        logger.info("Initializing all parameters.")
        ses.run(init_all_op)
        pre_train_saver.restore(ses,
                                "THE_PATH_TO_YOUR_MODEL/model.ckpt-4986751")

@syhdog
Author

syhdog commented Feb 6, 2017

Thank you very much, it has been a great help.
