
How to restore the training #160

Closed
robeson1010 opened this issue Jul 5, 2018 · 3 comments

@robeson1010

I had been training for 3 days, but unfortunately the process was interrupted. I ran `python main.py -- train --pipeline_name unet_weighted` again, but it started training from epoch 0. How can I resume training from where it left off (54 epochs already completed)?

@jakubczakon
Collaborator

@robeson1010 Hi, sorry for the late response.
To restart training you need to override the `set_model` method, for example:

    def set_model(self):
        encoder = self.architecture_config['model_params']['encoder']
        if encoder == 'from_scratch':
            # Build a fresh UNet with randomly initialized weights.
            self.model = UNet(**self.architecture_config['model_params'])
        else:
            # Build one of the pretrained (ResNet-based) architectures.
            config = PRETRAINED_NETWORKS[encoder]
            self.model = config['model'](**config['model_config'])
            # Replace weight initialization with a no-op so the loaded
            # weights are not overwritten, then load your checkpoint.
            self._initialize_model_weights = lambda: None
            self.load('YOUR_FILEPATH_TO_MODEL')

This is for the case where you want to load a model that you pretrained with one of the ResNet-based architectures.
It is important to replace `self._initialize_model_weights` with a no-op, otherwise it would simply overwrite your loaded weights with random values.

When you restart, training will begin again from epoch 0 (though your weights from epoch 54 will be used). I would suggest using a smaller learning rate if you were using some sort of decay. As of now we are not checkpointing the optimizer state, so it will be difficult to restore the exact state of your training at epoch 54, but usually restarting with a new optimizer gets the job done.
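
For reference, here is a minimal generic PyTorch sketch of what checkpointing the optimizer state alongside the weights could look like. This is not the project's API; `model`, `optimizer`, and the checkpoint path are placeholders for illustration:

    import torch

    def save_checkpoint(model, optimizer, epoch, path):
        # Persist model weights and optimizer state (e.g. momentum buffers)
        # so training can be resumed exactly where it stopped.
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, path)

    def load_checkpoint(model, optimizer, path):
        # Restore both the weights and the optimizer state from a checkpoint
        # and return the epoch to resume from.
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch']

With something like this, training could resume from the stored epoch rather than 0, but as noted above the pipeline currently restores only the model weights.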

I hope this helps.

@robeson1010
Author

@jakubczakon Thanks a lot!

@carbonox-infernox

@jakubczakon

"As of now we are not checkpointing the optimizer state so it will be difficult to restore the exact state of your training"

Is this still the case? I was hoping to run the training 5-10 epochs at a time and keep checking on the model's progress. Then I'd like to add some new classes, but that's a different problem. Basically, I don't want to pay for the full 100 epochs and then find out that something went wrong, or pay for 100 when 50 might have sufficed.
