
Cannot resume training on multiple GPUs #31

Closed
shaibagon opened this issue Jul 26, 2022 · 2 comments

@shaibagon

I'm trying to train an inpainting model on multiple GPUs.
The initial training worked fine: it progressed well and saved checkpoints to the experiments/.../checkpoint folder.

However, when I try to resume the same training run (by setting "resume_state" in the config), I get this error:

Traceback (most recent call last):
  File "/.../Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/.../Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/.../Palette-Image-to-Image-Diffusion-Models/models/model.py", line 111, in train_step
    self.optG.step()
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

It seems that, when resuming on multiple GPUs, some tensors (parameters or the optimizer's internal state) are not moved to the right device.
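
For reference, this matches a known regression in torch.optim.Adam in PyTorch 1.12.0: load_state_dict() moves the per-parameter "step" counters onto the GPU, and the default capturable=False path then hits this assertion. A minimal workaround sketch (my own, not code from this repo; fix_optimizer_state is a hypothetical helper) moves those counters back to the CPU after the optimizer state is restored:

import torch

def fix_optimizer_state(optimizer):
    # In torch 1.12.0, Adam stores "step" as a tensor, and load_state_dict()
    # can leave it on the GPU, which trips the capturable=False assertion.
    # Moving it back to the CPU restores the expected invariant.
    for state in optimizer.state.values():
        step = state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            state['step'] = step.cpu()

# e.g. call it right after the checkpoint is loaded:
# fix_optimizer_state(self.optG)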

@rybchuk

rybchuk commented Aug 5, 2022

It looks like this is an issue tied to PyTorch 1.12.0. I ran into the same issue, and I'm now downgrading to PyTorch 1.11.0, which should solve the problem.
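
(If downgrading is not an option, another workaround that circulated for this regression is to set the optimizer's capturable flag after loading, which disables the failing assertion at some per-step cost; the regression is reported fixed in PyTorch 1.12.1. A sketch, assuming the Adam optimizer from the traceback; enable_capturable is a hypothetical helper:)

def enable_capturable(optimizer):
    # torch >= 1.12 exposes a per-param-group "capturable" flag; setting it
    # to True skips the "state_steps should not be CUDA tensors" assertion.
    for group in optimizer.param_groups:
        group['capturable'] = True

# e.g. enable_capturable(model.optG) right after resume_state is loaded.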

@Janspiry
Owner

Feel free to reopen the issue if you have any further questions.
