Hi there. First off, thanks for this great implementation of GPipe for PyTorch; it is very much appreciated. I am looking forward to getting it running for a research project I am working on; however, I have encountered an issue. I'm not sure if it is a bug, since I have essentially used the code exactly as provided in the docs.
In my project, I'm training ResNet on ImageNet. I am using the nn.Sequential ResNet adaptation provided under benchmarks, specifically the resnet101 function in torchgpipe/benchmarks/models/resnet/__init__.py (line 90 at commit a1b4ee2).
I have avoided doing any manual CUDA device assignment or manipulation; I am running on a single node with 2x P40 GPUs installed. Here is a simplified version of my code:
...
import torch
from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time
...
model = resnet101() # From torchgpipe
...
partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(partitions, model, sample)
model = GPipe(model, balance, chunks=8)
...
for i, (images, target) in enumerate(train_loader):
...
# Exception is thrown here!
# images has shape: torch.Size([256, 3, 224, 224])
output = model(images)
...
And here is the actual error thrown:
Traceback (most recent call last):
File "models/ImageNet/train.py", line 513, in <module>
main()
File "models/ImageNet/train.py", line 168, in main
main_worker(args.gpu, ngpus_per_node, args)
File "models/ImageNet/train.py", line 300, in main_worker
train(train_loader, model, criterion, optimizer, epoch, args)
File "models/ImageNet/train.py", line 347, in train
output = model(images)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/gpipe.py", line 376, in forward
pipeline.run()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 115, in run
self.compute(schedule, skip_trackers, in_queues, out_queues)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 249, in compute
raise exc_info[0].with_traceback(exc_info[1], exc_info[2])
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 82, in worker
batch = task.compute()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 57, in compute
return self._compute()
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 95, in checkpoint
self.function, input_atomic, *input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 254, in forward
output = function(input[0] if input_atomic else input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 202, in function
return partition(input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
Any idea why this error is being thrown? If not, would you be able to recommend any steps for debugging? I believe my code is valid, but please let me know if you see any issues. I'm not sure if this issue occurs for other models that already inherit from nn.Sequential, but perhaps the issue could be with the ResNet implementation? I have successfully trained a torchvision.models.ResNet with this script (using DistributedDataParallel instead of torchgpipe), so I am less suspicious of the code I have omitted.
Thanks so much for your help!
Thanks for the quick response @chiheonk, that did the trick! It is clearly explained in the docs, I don't know how I overlooked this. I appreciate your help, cheers!
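For anyone else landing on this error: GPipe places each partition on its own device, but the batch coming out of the DataLoader stays on the CPU, so the first partition's CUDA weights receive a CPU tensor. The fix, as described in the torchgpipe docs, is to move the input to the first partition's device (and the target to the last) before the forward pass. Below is a minimal, CPU-safe sketch of the pattern; the Conv2d stands in for the first partition, and the torchgpipe-specific lines are shown in comments since they assume a CUDA setup:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the first GPipe partition. With torchgpipe,
# the first partition's device is available as model.devices[0].
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
conv = nn.Conv2d(3, 64, kernel_size=7).to(device)  # weights live on `device`

images = torch.rand(4, 3, 224, 224)  # a CPU batch, as yielded by a DataLoader
# On a GPU, conv(images) would raise the FloatTensor / cuda.FloatTensor
# mismatch from the traceback above; the fix is to move the batch first:
output = conv(images.to(device))
print(output.shape)  # torch.Size([4, 64, 218, 218])

# With a GPipe-wrapped model this becomes (per the torchgpipe docs):
#   output = model(images.to(model.devices[0]))
#   loss = criterion(output, target.to(model.devices[-1]))
```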