Hi there. First off, thanks for this great implementation of GPipe for PyTorch; it is very much appreciated. I am looking forward to getting it running for a research project I am working on; however, I have encountered an issue. I'm not sure if it is a bug, since I have essentially used the code exactly as provided in the docs.
In my project, I'm training ResNet on ImageNet. I am using the nn.Sequential ResNet adaptation provided under benchmarks, specifically the resnet101 function in torchgpipe/benchmarks/models/resnet/__init__.py (line 90 at commit a1b4ee2).
I have avoided doing any manual CUDA device assignment or manipulation; I am running on a single node with 2x P40 GPUs installed. Here is a simplified version of my code:
...
import torch
from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time
...
model = resnet101() # From torchgpipe
...
partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(partitions, model, sample)
model = GPipe(model, balance, chunks=8)
...
for i, (images, target) in enumerate(train_loader):
...
# Exception is thrown here!
# images has shape: torch.Size([256, 3, 224, 224])
output = model(images)
...
And here is the actual error thrown:
Traceback (most recent call last):
File "models/ImageNet/train.py", line 513, in <module>
main()
File "models/ImageNet/train.py", line 168, in main
main_worker(args.gpu, ngpus_per_node, args)
File "models/ImageNet/train.py", line 300, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "models/ImageNet/train.py", line 347, in train
output = model(images)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/gpipe.py", line 376, in forward
pipeline.run()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 115, in run
self.compute(schedule, skip_trackers, in_queues, out_queues)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 249, in compute
raise exc_info[0].with_traceback(exc_info[1], exc_info[2])
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 82, in worker
batch = task.compute()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 57, in compute
return self._compute()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 95, in checkpoint
self.function, input_atomic, *input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 254, in forward
output = function(input[0] if input_atomic else input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 202, in function
return partition(input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
Any idea why this error is being thrown? If not, would you be able to recommend any steps for debugging? I believe my code is valid, but please let me know if you see any issues. I'm not sure if this issue occurs for other models that already inherit from nn.Sequential, but perhaps the issue could be with the ResNet implementation? I have successfully trained a torchvision.models.ResNet with this script (using DistributedDataParallel instead of torchgpipe), so I am less suspicious of the code I have omitted.
Thanks so much for your help!
Thanks for the quick response @chiheonk, that did the trick! It is clearly explained in the docs, I don't know how I overlooked this. I appreciate your help, cheers!
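For anyone else landing on this error: GPipe places each partition on its own device, but the batch coming out of the DataLoader stays on the CPU, so the first partition's CUDA weights receive a CPU tensor. The fix, as described in the torchgpipe docs, is to move the input to the first partition's device (and the target to the last) before the forward pass. Below is a minimal, CPU-safe sketch of the pattern; the Conv2d stands in for the first partition, and the torchgpipe-specific lines are shown in comments since they assume a CUDA setup:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the first GPipe partition. With torchgpipe,
# the first partition's device is available as model.devices[0].
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
conv = nn.Conv2d(3, 64, kernel_size=7).to(device)  # weights live on `device`

images = torch.rand(4, 3, 224, 224)  # a CPU batch, as yielded by a DataLoader
# On a GPU, conv(images) would raise the FloatTensor / cuda.FloatTensor
# mismatch from the traceback above; the fix is to move the batch first:
output = conv(images.to(device))
print(output.shape)  # torch.Size([4, 64, 218, 218])

# With a GPipe-wrapped model this becomes (per the torchgpipe docs):
#   output = model(images.to(model.devices[0]))
#   loss = criterion(output, target.to(model.devices[-1]))
```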