[Question] The default train-run of CON caused Out-Of-Memory #1

soskek · 2021-08-11T00:41:32Z

(Not urgent question.)

I ran the training script from the CON example with the default args (= grid mode) on ShapeNet (downloaded with the occupancy_networks repo's script) using a 32 GB GPU, and it failed with an out-of-memory (OOM) error. With -bs 24 it works (memory usage 30622MiB / 32510MiB).
Is this intended behavior?

$ python -u examples/con/train.py -dd /mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/occupancy_networks/data/ShapeNet -sd saved_models_grid
Traceback (most recent call last):
  File "examples/con/train.py", line 218, in <module>
    main()
  File "examples/con/train.py", line 214, in main
    train(dataset, model, optimizer, args)
  File "examples/con/train.py", line 103, in train
    prediction = model(input_points, query_points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/pipeline/con.py", line 99, in forward
    features = self.feature_encoder(input_points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/local_pool_pointnet.py", line 275, in forward
    input_points, c, feature_grid=grid_id
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/local_pool_pointnet.py", line 191, in generate_coordinate_features
    fea_grid = self.feature_processing_fn(fea_grid)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 289, in forward
    x = layer(encoders_features[idx + 1], x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 172, in forward
    x = self.layer(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 82, in forward
    x = self.relu(self.convolution1(self.group_norm1(x)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 246, in forward
    input, self.num_groups, self.weight, self.bias, self.eps)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2112, in group_norm
    torch.backends.cudnn.enabled)
RuntimeError: CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 31.75 GiB total capacity; 27.60 GiB already allocated; 2.92 GiB free; 27.72 GiB reserved in total by PyTorch)
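
To read the numbers in that last line: the allocation fails because the requested block is slightly larger than what is still free on the device. The short sketch below only parses the message quoted above and does the subtraction; the helper name `parse_gib` is made up for this illustration and is not part of PyTorch or pynif3d.

```python
import re

# Final line of the traceback above, copied verbatim.
message = (
    "RuntimeError: CUDA out of memory. Tried to allocate 3.00 GiB "
    "(GPU 0; 31.75 GiB total capacity; 27.60 GiB already allocated; "
    "2.92 GiB free; 27.72 GiB reserved in total by PyTorch)"
)


def parse_gib(pattern: str, text: str) -> float:
    """Return the GiB value captured by `pattern` (hypothetical helper)."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"pattern not found: {pattern}")
    return float(match.group(1))


requested = parse_gib(r"Tried to allocate ([\d.]+) GiB", message)
allocated = parse_gib(r"([\d.]+) GiB already allocated", message)
free = parse_gib(r"([\d.]+) GiB free", message)

# The request exceeds the free memory, so the allocation cannot succeed.
shortfall = requested - free
print(f"requested={requested:.2f} GiB, free={free:.2f} GiB, "
      f"shortfall={shortfall:.2f} GiB")
```

The shortfall is small, which fits the observation that a somewhat smaller batch size (-bs 24) fits on the same GPU.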

The environment (at mnj) is below, as reported by https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py:

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.10

Python version: 3.7.4 (default, Aug 13 2019, 20:35:49)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 460.91.03
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-pfn-extras==0.3.2
[pip3] torch==1.7.1
[pip3] torchtext==0.8.1
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py37he8ac12f_0
[conda] mkl_fft                   1.3.0            py37h54f3939_0
[conda] mkl_random                1.1.1            py37h0573a6f_0
[conda] numpy                     1.19.2           py37h54aff64_0
[conda] numpy-base                1.19.2           py37hfa32c7d_0
[conda] pytorch                   1.7.1           py3.7_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] pytorch3d                 0.4.0           py37_cu102_pyt171    pytorch3d
[conda] torchvision               0.8.2                py37_cu102    pytorch

soskek · 2021-08-11T04:58:05Z

For grid mode (= 3D convolution), should we perhaps use --feat-dim 32 instead of the default 64?
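
As a rough illustration of why the feature dimension and the batch size matter here: a dense 3D feature grid of shape (batch, channels, D, H, W) grows linearly with both. The sketch below is back-of-the-envelope arithmetic only. The grid resolution of 32 and the batch size of 32 are assumed values chosen for the example, not numbers taken from the pynif3d defaults, and `activation_gib` is a hypothetical helper.

```python
def activation_gib(batch_size: int, channels: int, resolution: int,
                   bytes_per_element: int = 4) -> float:
    """Size in GiB of one dense (B, C, R, R, R) tensor (float32 by default)."""
    n_elements = batch_size * channels * resolution ** 3
    return n_elements * bytes_per_element / 2 ** 30


# Assumed values for illustration only (not the script's actual defaults).
RESOLUTION = 32
BATCH_SIZE = 32

for feat_dim in (64, 32):
    size = activation_gib(BATCH_SIZE, feat_dim, RESOLUTION)
    print(f"feat_dim={feat_dim}: {size:.3f} GiB per activation tensor")

# Reducing the batch size shrinks the same tensor proportionally.
print(f"batch 24, feat_dim 64: {activation_gib(24, 64, RESOLUTION):.4f} GiB")
```

During training, many such tensors are kept alive for the backward pass, so the total footprint is a multiple of a single activation; the snippet only shows the scaling, not the real total.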

t2kasa · 2021-09-02T11:03:20Z

Thanks for the issue.

Regarding the OOM: yes, this is expected when using grid mode with the default settings.
Please try adding the argument --n-levels 3 (or -nl 3) to set the depth of the 3D U-Net. This matches the setup of the official implementation, where the default depth is 4 for the 2D U-Net and 3 for the 3D U-Net; please see the official config for details.

The following document, which lists the pretrained models, is also useful for seeing which argument combinations are used:
https://github.com/pfnet/pynif3d/blob/main/examples/pretrained_models.md
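
Putting the suggestion together with the original invocation, the command might look like the one printed below. This is a sketch that only assembles the command line from flags already mentioned in this thread (-dd, -sd, -nl); the data directory is a placeholder, and the snippet prints the command rather than running it.

```python
import shlex

# Hypothetical placeholder; substitute the directory that holds ShapeNet.
DATA_DIR = "/path/to/ShapeNet"

argv = [
    "python", "-u", "examples/con/train.py",
    "-dd", DATA_DIR,            # dataset directory, as in the original run
    "-sd", "saved_models_grid",  # save directory, as in the original run
    "-nl", "3",                  # U-Net 3D depth suggested above
]

# Quote each argument so the printed line is safe to paste into a shell.
command = " ".join(shlex.quote(arg) for arg in argv)
print(command)
```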

soskek · 2021-09-02T11:18:01Z

Thank you! I'll modify the UNet arg too.

mihaimorariu transferred this issue from another repository Aug 17, 2021

mihaimorariu added this to Backlog in pynif3d Aug 17, 2021

mihaimorariu added the "high priority" and "question" labels Aug 17, 2021

soskek closed this as completed Sep 2, 2021

pynif3d automation moved this from Backlog to Done Sep 2, 2021
