
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same #32

Closed
joelrorseth opened this issue Nov 27, 2021 · 2 comments
Labels: question (Further information is requested)

Comments

@joelrorseth

Hi there. First off, thanks for this great implementation of GPipe for PyTorch; it is very much appreciated. I am looking forward to getting it running for a research project I am working on; however, I have encountered an issue. I'm not sure whether it is a bug, but I have essentially used the code exactly as provided in the docs.

In my project, I'm training ResNet on ImageNet. I am using the nn.Sequential ResNet adaptation provided under benchmarks, specifically this function:

def resnet101(**kwargs: Any) -> nn.Sequential:

I have avoided any manual CUDA device assignment or manipulation, and I am running on a single node with 2x P40 GPUs installed. Here is a simplified version of my code:

import torch
from torchgpipe import GPipe
from torchgpipe.balance import balance_by_time
...
model = resnet101() # From torchgpipe's benchmarks
...
partitions = torch.cuda.device_count()
sample = torch.rand(128, 3, 224, 224)
balance = balance_by_time(partitions, model, sample)
model = GPipe(model, balance, chunks=8)
...
for i, (images, target) in enumerate(train_loader):
  ...
  # Exception is thrown here!
  # images has shape: torch.Size([256, 3, 224, 224])
  output = model(images)
  ...

And here is the actual error thrown:

Traceback (most recent call last):
  File "models/ImageNet/train.py", line 513, in <module>
    main()
  File "models/ImageNet/train.py", line 168, in main
    main_worker(args.gpu, ngpus_per_node, args)
  File "models/ImageNet/train.py", line 300, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "models/ImageNet/train.py", line 347, in train
    output = model(images)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/gpipe.py", line 376, in forward
    pipeline.run()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 115, in run
    self.compute(schedule, skip_trackers, in_queues, out_queues)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 249, in compute
    raise exc_info[0].with_traceback(exc_info[1], exc_info[2])
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 82, in worker
    batch = task.compute()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/worker.py", line 57, in compute
    return self._compute()
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 95, in checkpoint
    self.function, input_atomic, *input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/checkpoint.py", line 254, in forward
    output = function(input[0] if input_atomic else input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torchgpipe/pipeline.py", line 202, in function
    return partition(input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 141, in forward
    input = module(input)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 446, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/gpipe-project/venv/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor

Any idea why this error is being thrown? If not, could you recommend any steps for debugging? I believe my code is valid, but please let me know if you see any issues. I'm not sure whether this happens for other models that already inherit from nn.Sequential, but perhaps the issue lies with the ResNet implementation? I have successfully trained a torchvision.models.ResNet with this script (using DistributedDataParallel instead of torchgpipe), so I am less suspicious of the code I have omitted.

Thanks so much for your help!

@chiheonk
Contributor

chiheonk commented Nov 28, 2021

Hi @joelrorseth,

You need to move the input to the device where the first partition resides. In the case of your example, you may need:

images = images.to(model.devices[0])
output = model(images)

See the documentation regarding input and output devices for further details.
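For example, here is a minimal sketch of how your training loop could look (assuming criterion is the loss function already used in your script). Since the pipeline's output lives on the device of the last partition, the target should be moved there as well before computing the loss:

in_device = model.devices[0]   # device of the first partition
out_device = model.devices[-1] # device of the last partition (where the output lands)

for i, (images, target) in enumerate(train_loader):
  ...
  # Move the input to the first partition's device and the target to the
  # last partition's device, where the output (and thus the loss) lives.
  images = images.to(in_device, non_blocking=True)
  target = target.to(out_device, non_blocking=True)
  output = model(images)
  loss = criterion(output, target)
  ...

non_blocking=True is optional; it simply lets the host-to-device copy overlap with computation when the DataLoader uses pinned memory.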

@chiheonk added the question label on Nov 28, 2021
@joelrorseth
Author

Thanks for the quick response, @chiheonk; that did the trick! It is clearly explained in the docs; I don't know how I overlooked it. I appreciate your help, cheers!
