I'm trying to train an inpainting model on multiple GPUs.
The initial training run worked fine: it progressed well and saved checkpoints to the experiments/.../checkpoint folder.
However, when I try to resume that run (by setting "resume_state" in the config, sketched below the traceback) I get this error:
Traceback (most recent call last):
  File "/.../Palette-Image-to-Image-Diffusion-Models/run.py", line 58, in main_worker
    model.train()
  File "/.../Palette-Image-to-Image-Diffusion-Models/core/base_model.py", line 45, in train
    train_log = self.train_step()
  File "/.../Palette-Image-to-Image-Diffusion-Models/models/model.py", line 111, in train_step
    self.optG.step()
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/optimizer.py", line 109, in wrapper
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 157, in step
    adam(params_with_grad,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 213, in adam
    func(params,
  File "/.../.conda/envs/.../lib/python3.10/site-packages/torch/optim/adam.py", line 255, in _single_tensor_adam
    assert not step_t.is_cuda, "If capturable=False, state_steps should not be CUDA tensors."
AssertionError: If capturable=False, state_steps should not be CUDA tensors.
It seems that when resuming multi-GPU training, some tensors (model parameters or the optimizer's internal state) end up on the wrong device.
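For reference, "resuming" here just means pointing "resume_state" in the JSON config's path section at a saved checkpoint prefix; roughly like this (surrounding keys omitted, and the step number 100 is a placeholder, not my actual checkpoint):

"path": {
    "resume_state": "experiments/.../checkpoint/100"
}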
It looks like this is tied to PyTorch 1.12.0. I ran into the same error, and I'm now downgrading to PyTorch 1.11.0, which should solve the problem.
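For anyone who can't downgrade: the assertion comes from a PyTorch 1.12.0 regression in Adam, where load_state_dict restores the per-parameter "step" counters as CUDA tensors while capturable is still False. A minimal workaround sketch, assuming the resumed optimizer is the optG seen in the traceback (fix_adam_state_after_resume is an illustrative helper, not part of this repo):

import torch

def fix_adam_state_after_resume(optimizer):
    # In PyTorch 1.12.0, restoring an optimizer saved from a GPU run can leave
    # Adam's 'step' counters on the GPU; with capturable=False the next
    # optimizer.step() then trips the assertion above. Moving the counters
    # back to the CPU restores the layout that 1.12.0 expects.
    for state in optimizer.state.values():
        step = state.get('step')
        if torch.is_tensor(step) and step.is_cuda:
            state['step'] = step.cpu()

# Call it right after the optimizer state is restored, e.g.:
# optG.load_state_dict(resume_state_dict['optimizer'])
# fix_adam_state_after_resume(optG)

Setting g['capturable'] = True on each of optG.param_groups after loading reportedly also sidesteps the assertion (at some speed cost), and the bug is fixed in later PyTorch releases.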