
Cannot resume training on multiple GPUs #31

Closed
shaibagon opened this issue Jul 26, 2022 · 2 comments

@shaibagon

I'm trying to train an inpainting model on multiple GPUs.
The initial training worked fine: it progressed well and saved checkpoints to the experiments/.../checkpoint folder.

However, when I try to resume the same training run (by setting "resume_state" in the config), I get this error:

Traceback (most recent call last):
  File "/.../Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/.../Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/.../Palette-Image-to-Image-Diffusion-Models/models/model.py", line 111, in train_step
    self.optG.step()
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.

It seems that, when resuming on multiple GPUs, some tensors (parameters or the optimizer's internal state) are not moved to the right device.
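
For reference, this matches a known regression in torch.optim.Adam in PyTorch 1.12.0: load_state_dict() moves the per-parameter "step" counters onto the GPU, and the default capturable=False path then hits this assertion. A minimal workaround sketch (my own, not code from this repo; fix_optimizer_state is a hypothetical helper) moves those counters back to the CPU after the optimizer state is restored:

import torch

def fix_optimizer_state(optimizer):
    # In torch 1.12.0, Adam stores "step" as a tensor, and load_state_dict()
    # can leave it on the GPU, which trips the capturable=False assertion.
    # Moving it back to the CPU restores the expected invariant.
    for state in optimizer.state.values():
        step = state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            state['step'] = step.cpu()

# e.g. call it right after the checkpoint is loaded:
# fix_optimizer_state(self.optG)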

@rybchuk

rybchuk commented Aug 5, 2022

It looks like this is an issue tied to PyTorch 1.12.0. I ran into the same issue, and I'm now downgrading to PyTorch 1.11.0, which should solve the problem.
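
(If downgrading is not an option, another workaround that circulated for this regression is to set the optimizer's capturable flag after loading, which disables the failing assertion at some per-step cost; the regression is reported fixed in PyTorch 1.12.1. A sketch, assuming the Adam optimizer from the traceback; enable_capturable is a hypothetical helper:)

def enable_capturable(optimizer):
    # torch >= 1.12 exposes a per-param-group "capturable" flag; setting it
    # to True skips the "state_steps should not be CUDA tensors" assertion.
    for group in optimizer.param_groups:
        group['capturable'] = True

# e.g. enable_capturable(model.optG) right after resume_state is loaded.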

@Janspiry
Owner

Feel free to reopen the issue if you have any further questions.
