
Unet3+ with resnet34 (and higher) crash #46

Open
initze opened this issue Sep 15, 2022 · 3 comments
initze (Owner) commented Sep 15, 2022

Unet3+ with a backbone larger than resnet34 (e.g. resnet50, resnet101, ...) causes the error below; resnet18 and resnet34 work fine.

Version: 0.8.0
```
  File "train.py", line 153, in run
    self.train_epoch(data_loader)
  File "train.py", line 203, in train_epoch
    y_hat = self.model(img)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/isipd/projects/p_aicore_pf/initze/code/training/lib/models/unet3p/unet3p.py", line 208, in forward
    h2_PT_hd4 = self.h2_PT_hd4_relu(self.h2_PT_hd4_bn(self.h2_PT_hd4_conv(self.h2_PT_hd4(h2))))
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 64, 3, 3], expected input[2, 256, 64, 64] to have 64 channels, but got 256 channels instead
```

initze added the bug label Sep 15, 2022
khdlr (Collaborator) commented Sep 15, 2022

Right, ResNet50 and up have a different internal representation of their feature maps... Sorry for missing that in the first implementation. Should be an easy fix, will look into it 👍
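
For illustration, a minimal standalone sketch (not from the repository) of the difference khdlr describes: torchvision's resnet34 uses BasicBlocks, while resnet50 uses Bottleneck blocks with a 4x channel expansion, which is exactly the "expected 64 channels, but got 256" mismatch in the traceback above.

```python
# Sketch only: compare per-stage encoder channel widths of torchvision ResNets.
import torch
import torchvision.models as models

x = torch.randn(1, 3, 224, 224)

for name in ("resnet34", "resnet50"):
    net = getattr(models, name)()          # randomly initialised, no weights needed
    h = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    widths = []
    for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
        h = stage(h)
        widths.append(h.shape[1])
    print(name, widths)
    # resnet34 -> [64, 128, 256, 512]
    # resnet50 -> [256, 512, 1024, 2048]
```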

initze (Owner, Author) commented Sep 30, 2022

I have investigated potential error sources:

  1. CUDA_VISIBLE_DEVICES causes Unet3p with resnet50 to crash on branch fix/unet3p
     • starting the script without CUDA_VISIBLE_DEVICES='0,1,2,3' python <script>
       • CUDA error: invalid configuration argument
     • resnet34 is working in that configuration (with CUDA_VISIBLE_DEVICES)
     • running in my "standard" conda environment (python 3.7.12, pytorch 1.7 with CUDA 11)
  2. Same issue with an updated python/pytorch setup (python 3.10, pytorch 1.12.1)
     • CUDA_VISIBLE_DEVICES leads to a crash of Unet3p with resnet50
     • different error message:
```
(aicore_yml_v3) initze@pd-dgx-a100:/isipd/projects/p_aicore_pf/initze/code/training$ ./UNet3p_debug.sh
[2022-09-30T18:30:28.576+02:00] uc2.train INFO: Training on cuda device
running with 'ExponentialLR' learning rate scheduler with gamma = 0.9
[2022-09-30T18:30:30.056+02:00] uc2.train INFO: Starting phase "Training"
[2022-09-30T18:30:30.058+02:00] uc2.train INFO: Epoch 1 - Training Started
  0%|                                                                                            | 0/35 [00:20<?, ?it/s]
Traceback (most recent call last):
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 348, in <module>
    Engine().run()
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 153, in run
    self.train_epoch(data_loader)
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 217, in train_epoch
    loss.backward()
  File "/isipd/projects-noreplica/p_initze/anaconda3/envs/aicore_yml_v3/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/isipd/projects-noreplica/p_initze/anaconda3/envs/aicore_yml_v3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
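
As the traceback itself notes, the reported location can be misleading because CUDA errors surface asynchronously. A minimal sketch of the debugging step it suggests, assuming the variable is set at the very top of the entry script (or exported in the shell that launches UNet3p_debug.sh) before any CUDA work happens:

```python
# Sketch only: force synchronous CUDA kernel launches so the Python traceback
# points at the op that actually fails. Must take effect before the CUDA
# context is created, hence before any GPU use of torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the environment variable is set
```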

khdlr (Collaborator) commented Oct 4, 2022

Interesting, it could be related to the multi-GPU setup then. Could you try what happens when you remove this line: https://github.com/initze/thaw-slump-segmentation/blob/master/train.py#L75?
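
The thread does not show what train.py#L75 contains; judging from the torch/nn/parallel/data_parallel.py frames in the first traceback, it presumably wraps the model in nn.DataParallel. A hedged sketch of the single-GPU fallback being suggested, using hypothetical names:

```python
# Sketch under the assumption that the line in question is roughly
# `self.model = nn.DataParallel(self.model)`. Gating the wrapper behind a flag
# runs the model on a single device, which isolates whether the
# "invalid configuration argument" error comes from the multi-GPU path.
import torch
import torch.nn as nn

def prepare_model(model: nn.Module, use_data_parallel: bool = False) -> nn.Module:
    model = model.cuda()
    if use_data_parallel and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # suspected source of the crash
    return model
```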
