
how to save and load model, optimizer and scheduler's state dictionary? #154

Closed

samarth-b opened this issue Sep 2, 2021 · 5 comments

samarth-b commented Sep 2, 2021

How do I save and load the model, optimizer, and scheduler state dictionaries after they have gone through `accelerator.prepare()`?

For the model

I used the unwrap function as described in the documentation:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path, 
                            save_function=accelerator.save, 
                            state_dict=accelerator.get_state_dict(model))

However, I get the following error when loading the model:
model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

    model, optimizer, training_loader, dev_loader = accelerator.prepare(
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 227, in _prepare_one
    return self.prepare_model(obj)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 285, in prepare_model
    model = model.to(self.device)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

For the optimizer and scheduler

Currently, saving with `torch.save(optimizer.state_dict(), 'exp1/file.opt')` and then loading with `optimizer.load_state_dict(torch.load('exp1/file.opt'))` gives the error `RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable`.

Does `accelerator.unwrap_model()` work the same way for the optimizer as it does for the model?

accelerator.wait_for_everyone()
unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), filename)

Using `torch.save(scheduler.state_dict(), 'exp1/sch')` and loading with `scheduler.load_state_dict(torch.load('path'))` is working.

EDIT: I updated the original issue with more details and exact error messages.

samarth-b changed the title how to save optmizer state dictionary? → how to save optimizer state dictionary? Sep 2, 2021
samarth-b changed the title how to save optimizer state dictionary? → how to save and load model, optimizer and scheduler's state dictionary? Sep 2, 2021
sgugger (Collaborator) commented Sep 2, 2021

You should use `accelerator.save` everywhere instead of `torch.save` (though I must say I have never seen that particular error).
For reloading, you should be able to load a state dict into the unwrapped model or the optimizer. If you do

model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

You create a brand new model, so you should pass it to the prepare method again.

Note that adding a checkpointing utility to Accelerate is on the roadmap, to make all of this easier.
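
Putting those pieces together, a minimal sketch of the full round trip (the AdamW class, the `exp1/...` paths, and the loader variables are placeholders following the snippets in this issue, not a prescribed API):

```python
import torch
from accelerate import Accelerator
from transformers import MT5ForConditionalGeneration

accelerator = Accelerator()

# --- saving: always go through accelerator.save ---
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path,
                                save_function=accelerator.save,
                                state_dict=accelerator.get_state_dict(model))
accelerator.save(optimizer.state_dict(), "exp1/optimizer.pt")
accelerator.save(scheduler.state_dict(), "exp1/scheduler.pt")

# --- loading: rebuild the objects and restore their states on CPU... ---
model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)
optimizer = torch.optim.AdamW(model.parameters())  # same optimizer class/arguments as in training
optimizer.load_state_dict(torch.load("exp1/optimizer.pt", map_location="cpu"))
scheduler.load_state_dict(torch.load("exp1/scheduler.pt", map_location="cpu"))

# --- ...then pass the brand new model and optimizer through prepare() again ---
model, optimizer, training_loader, dev_loader = accelerator.prepare(
    model, optimizer, training_loader, dev_loader
)
```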

samarth-b (Author)

Thanks, I was able to load the model with `torch.load('filename', map_location='cpu')`, so it looks like `unwrap_model` needs to remove any GPU location information.
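
In code, the working load is simply (`filename` is a placeholder; note that `map_location` is an argument of `torch.load`, not `torch.save`):

```python
import torch

# map tensors onto CPU first so no CUDA device needs to be available at load time
state_dict = torch.load("filename", map_location="cpu")
model.load_state_dict(state_dict)
```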

@seanbenhur

> Note that adding a checkpointing utility to Accelerate is on the roadmap, to make all of this easier.

Is this feature currently available?

sgugger (Collaborator) commented Feb 24, 2022

It's under development in #255; we're hoping to have it merged next week.

muellerzr (Collaborator)

Closed with #255! 🎉
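
The utility from #255 checkpoints everything that went through `prepare()` in one call; a minimal sketch (the directory name is a placeholder, and depending on your Accelerate version the scheduler may need `accelerator.register_for_checkpointing` instead of `prepare`):

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# save model, optimizer, scheduler, and RNG states in one call
accelerator.save_state("my_checkpoint")

# ...later, restore everything that was prepared
accelerator.load_state("my_checkpoint")
```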
