
Unet3+ with resnet34 (and higher) crash #46

Open
initze opened this issue Sep 15, 2022 · 3 comments
initze (Owner) commented Sep 15, 2022

Unet3+ with a backbone larger than resnet34 (e.g. resnet50, resnet101, ...) causes the error below; resnet18 and resnet34 work fine.

Version: 0.8.0
```
  File "train.py", line 153, in run
    self.train_epoch(data_loader)
  File "train.py", line 203, in train_epoch
    y_hat = self.model(img)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/isipd/projects/p_aicore_pf/initze/code/training/lib/models/unet3p/unet3p.py", line 208, in forward
    h2_PT_hd4 = self.h2_PT_hd4_relu(self.h2_PT_hd4_bn(self.h2_PT_hd4_conv(self.h2_PT_hd4(h2))))
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/pd/initze/anaconda3/envs/aicore/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [64, 64, 3, 3], expected input[2, 256, 64, 64] to have 64 channels, but got 256 channels instead
```

initze added the bug label Sep 15, 2022
khdlr (Collaborator) commented Sep 15, 2022

Right, ResNet50 and up have a different internal representation of their feature maps... Sorry for missing that in the first implementation. Should be an easy fix, will look into it 👍
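
For illustration, a minimal standalone sketch (not from the repository) of the difference khdlr describes: torchvision's resnet34 uses BasicBlocks, while resnet50 uses Bottleneck blocks with a 4x channel expansion, which is exactly the "expected 64 channels, but got 256" mismatch in the traceback above.

```python
# Sketch only: compare per-stage encoder channel widths of torchvision ResNets.
import torch
import torchvision.models as models

x = torch.randn(1, 3, 224, 224)

for name in ("resnet34", "resnet50"):
    net = getattr(models, name)()          # randomly initialised, no weights needed
    h = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    widths = []
    for stage in (net.layer1, net.layer2, net.layer3, net.layer4):
        h = stage(h)
        widths.append(h.shape[1])
    print(name, widths)
    # resnet34 -> [64, 128, 256, 512]
    # resnet50 -> [256, 512, 1024, 2048]
```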

initze (Owner, Author) commented Sep 30, 2022

I have investigated potential error sources:

  1. CUDA_VISIBLE_DEVICES causes Unet3p with resnet50 to crash on branch fix/unet3p
     • starting the script without CUDA_VISIBLE_DEVICES='0,1,2,3' python <script>
       • CUDA error: invalid configuration argument
     • resnet34 is working in that configuration (with CUDA_VISIBLE_DEVICES)
     • running in my "standard" conda environment (python 3.7.12, pytorch 1.7 with CUDA 11)
  2. Same issue with an updated python/pytorch setup (python 3.10, pytorch 1.12.1)
     • CUDA_VISIBLE_DEVICES leads to a crash of Unet3p with resnet50
     • different error message:
```
(aicore_yml_v3) initze@pd-dgx-a100:/isipd/projects/p_aicore_pf/initze/code/training$ ./UNet3p_debug.sh
[2022-09-30T18:30:28.576+02:00] uc2.train INFO: Training on cuda device
running with 'ExponentialLR' learning rate scheduler with gamma = 0.9
[2022-09-30T18:30:30.056+02:00] uc2.train INFO: Starting phase "Training"
[2022-09-30T18:30:30.058+02:00] uc2.train INFO: Epoch 1 - Training Started
  0%|                                                                                            | 0/35 [00:20<?, ?it/s]
Traceback (most recent call last):
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 348, in <module>
    Engine().run()
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 153, in run
    self.train_epoch(data_loader)
  File "/isipd/projects/p_aicore_pf/initze/code/training/train.py", line 217, in train_epoch
    loss.backward()
  File "/isipd/projects-noreplica/p_initze/anaconda3/envs/aicore_yml_v3/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/isipd/projects-noreplica/p_initze/anaconda3/envs/aicore_yml_v3/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
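
As the traceback itself notes, the reported location can be misleading because CUDA errors surface asynchronously. A minimal sketch of the debugging step it suggests, assuming the variable is set at the very top of the entry script (or exported in the shell that launches UNet3p_debug.sh) before any CUDA work happens:

```python
# Sketch only: force synchronous CUDA kernel launches so the Python traceback
# points at the op that actually fails. Must take effect before the CUDA
# context is created, hence before any GPU use of torch.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after the environment variable is set
```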

khdlr (Collaborator) commented Oct 4, 2022

Interesting, it could be related to the multi-GPU setup then. Could you try what happens when you remove this line: https://github.com/initze/thaw-slump-segmentation/blob/master/train.py#L75?
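
The thread does not show what train.py#L75 contains; judging from the torch/nn/parallel/data_parallel.py frames in the first traceback, it presumably wraps the model in nn.DataParallel. A hedged sketch of the single-GPU fallback being suggested, using hypothetical names:

```python
# Sketch under the assumption that the line in question is roughly
# `self.model = nn.DataParallel(self.model)`. Gating the wrapper behind a flag
# runs the model on a single device, which isolates whether the
# "invalid configuration argument" error comes from the multi-GPU path.
import torch
import torch.nn as nn

def prepare_model(model: nn.Module, use_data_parallel: bool = False) -> nn.Module:
    model = model.cuda()
    if use_data_parallel and torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # suspected source of the crash
    return model
```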
