
How to restore the training #160

Closed
robeson1010 opened this issue Jul 5, 2018 · 3 comments

@robeson1010

I had been training for 3 days, but unfortunately the process was interrupted. I ran `python main.py -- train --pipeline_name unet_weighted` again, but it started training from epoch 0. How can I resume training from where it left off (54 epochs already completed)?

@jakubczakon
Collaborator

@robeson1010 Hi, sorry for the late response.
To restart training you need to override the `set_model` method, for example:

    def set_model(self):
        encoder = self.architecture_config['model_params']['encoder']
        if encoder == 'from_scratch':
            # Build a fresh UNet with randomly initialized weights.
            self.model = UNet(**self.architecture_config['model_params'])
        else:
            # Build one of the pretrained (ResNet-based) architectures.
            config = PRETRAINED_NETWORKS[encoder]
            self.model = config['model'](**config['model_config'])
            # Replace weight initialization with a no-op so the loaded
            # weights are not overwritten, then load your checkpoint.
            self._initialize_model_weights = lambda: None
            self.load('YOUR_FILEPATH_TO_MODEL')

This is for the case where you want to load a model that you pretrained with one of the ResNet-based architectures.
It is important to replace `self._initialize_model_weights` with a no-op, otherwise it would simply overwrite your loaded weights with random values.

When you restart, training will begin again from epoch 0 (though your weights from epoch 54 will be used). I would suggest using a smaller learning rate if you were using some sort of decay. As of now we are not checkpointing the optimizer state, so it will be difficult to restore the exact state of your training at epoch 54, but usually restarting with a new optimizer gets the job done.
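
For reference, here is a minimal generic PyTorch sketch of what checkpointing the optimizer state alongside the weights could look like. This is not the project's API; `model`, `optimizer`, and the checkpoint path are placeholders for illustration:

    import torch

    def save_checkpoint(model, optimizer, epoch, path):
        # Persist model weights and optimizer state (e.g. momentum buffers)
        # so training can be resumed exactly where it stopped.
        torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
        }, path)

    def load_checkpoint(model, optimizer, path):
        # Restore both the weights and the optimizer state from a checkpoint
        # and return the epoch to resume from.
        checkpoint = torch.load(path)
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        return checkpoint['epoch']

With something like this, training could resume from the stored epoch rather than 0, but as noted above the pipeline currently restores only the model weights.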

I hope this helps.

@robeson1010
Author

@jakubczakon Thanks a lot!

@carbonox-infernox

@jakubczakon

"As of now we are not checkpointing the optimizer state so it will be difficult to restore the exact state of your training"

Is this still the case? I was hoping to run the training 5-10 epochs at a time and keep checking on the model's progress. Then I'd like to add some new classes, but that's a different problem. Basically, I don't want to pay for the full 100 epochs and then find out that something went wrong, or pay for 100 when 50 might have sufficed.
