
how to save and load model, optimizer and scheduler's state dictionary? #154

Closed

samarth-b opened this issue Sep 2, 2021 · 5 comments

samarth-b commented Sep 2, 2021

How do I save and load the model, optimizer, and scheduler state dictionaries after they have gone through `accelerator.prepare()`?

For the model

I used the unwrap function as described in the documentation:

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path, 
                            save_function=accelerator.save, 
                            state_dict=accelerator.get_state_dict(model))

However, I get the following error when loading the model:
model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

    model, optimizer, training_loader, dev_loader = accelerator.prepare(
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in prepare
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 269, in <genexpr>
    result = tuple(self._prepare_one(obj) for obj in args)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 227, in _prepare_one
    return self.prepare_model(obj)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/accelerate/accelerator.py", line 285, in prepare_model
    model = model.to(self.device)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/dccstor/cssblr/samarth/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

For the optimizer and scheduler

Currently, saving with `torch.save(optimizer.state_dict(), 'exp1/file.opt')` and then loading with `optimizer.load_state_dict(torch.load('exp1/file.opt'))` gives the error `RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable`.

Does `accelerator.unwrap_model()` work the same way for the optimizer as it does for the model?

accelerator.wait_for_everyone()
unwrapped_optimizer = accelerator.unwrap_model(optimizer)
accelerator.save(unwrapped_optimizer.state_dict(), filename)

Using `torch.save(scheduler.state_dict(), 'exp1/sch')` and loading with `scheduler.load_state_dict(torch.load('path'))` is working.

EDIT: I updated the original issue with more details and exact error messages.

samarth-b changed the title how to save optmizer state dictionary? → how to save optimizer state dictionary? Sep 2, 2021
samarth-b changed the title how to save optimizer state dictionary? → how to save and load model, optimizer and scheduler's state dictionary? Sep 2, 2021
sgugger (Collaborator) commented Sep 2, 2021

You should use `accelerator.save` everywhere instead of `torch.save` (though I must say I have never seen that particular error).
For reloading, you should be able to load a state dict into the unwrapped model or the optimizer. If you do

model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)

You create a brand new model, so you should pass it to the prepare method again.

Note that adding a checkpointing utility to Accelerate is on the roadmap, to make all of this easier.
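
Putting those pieces together, a minimal sketch of the full round trip (the AdamW class, the `exp1/...` paths, and the loader variables are placeholders following the snippets in this issue, not a prescribed API):

```python
import torch
from accelerate import Accelerator
from transformers import MT5ForConditionalGeneration

accelerator = Accelerator()

# --- saving: always go through accelerator.save ---
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(args.model_path,
                                save_function=accelerator.save,
                                state_dict=accelerator.get_state_dict(model))
accelerator.save(optimizer.state_dict(), "exp1/optimizer.pt")
accelerator.save(scheduler.state_dict(), "exp1/scheduler.pt")

# --- loading: rebuild the objects and restore their states on CPU... ---
model = MT5ForConditionalGeneration.from_pretrained(args.model_path, config=config)
optimizer = torch.optim.AdamW(model.parameters())  # same optimizer class/arguments as in training
optimizer.load_state_dict(torch.load("exp1/optimizer.pt", map_location="cpu"))
scheduler.load_state_dict(torch.load("exp1/scheduler.pt", map_location="cpu"))

# --- ...then pass the brand new model and optimizer through prepare() again ---
model, optimizer, training_loader, dev_loader = accelerator.prepare(
    model, optimizer, training_loader, dev_loader
)
```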

samarth-b (Author)

Thanks, I was able to load the model with `torch.load('filename', map_location='cpu')`, so it looks like `unwrap_model` needs to remove any GPU location information.
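
In code, the working load is simply (`filename` is a placeholder; note that `map_location` is an argument of `torch.load`, not `torch.save`):

```python
import torch

# map tensors onto CPU first so no CUDA device needs to be available at load time
state_dict = torch.load("filename", map_location="cpu")
model.load_state_dict(state_dict)
```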

@seanbenhur

> Note that adding a checkpointing utility to Accelerate is on the roadmap, to make all of this easier.

Is this feature currently available?

sgugger (Collaborator) commented Feb 24, 2022

It's under development in #255; we're hoping to have it merged next week.

muellerzr (Collaborator)

Closed with #255! 🎉
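
The utility from #255 checkpoints everything that went through `prepare()` in one call; a minimal sketch (the directory name is a placeholder, and depending on your Accelerate version the scheduler may need `accelerator.register_for_checkpointing` instead of `prepare`):

```python
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)

# save model, optimizer, scheduler, and RNG states in one call
accelerator.save_state("my_checkpoint")

# ...later, restore everything that was prepared
accelerator.load_state("my_checkpoint")
```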
