[Question] The default train-run of CON caused Out-Of-Memory #1

Closed
soskek opened this issue Aug 11, 2021 · 3 comments
Labels: high priority, question

Comments

soskek (Member) commented Aug 11, 2021

(Not an urgent question.)

I ran the training script from the CON example with the default arguments (= grid mode) on ShapeNet (downloaded with the occupancy_networks repo's script) using a 32 GB GPU, and it ran out of memory. With -bs 24 it works (memory usage 30622MiB / 32510MiB).
Is this the intended behavior?

$ python -u examples/con/train.py -dd /mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/occupancy_networks/data/ShapeNet -sd saved_models_grid
Traceback (most recent call last):
  File "examples/con/train.py", line 218, in <module>
    main()
  File "examples/con/train.py", line 214, in main
    train(dataset, model, optimizer, args)
  File "examples/con/train.py", line 103, in train
    prediction = model(input_points, query_points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/pipeline/con.py", line 99, in forward
    features = self.feature_encoder(input_points)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/local_pool_pointnet.py", line 275, in forward
    input_points, c, feature_grid=grid_id
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/local_pool_pointnet.py", line 191, in generate_coordinate_features
    fea_grid = self.feature_processing_fn(fea_grid)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 289, in forward
    x = layer(encoders_features[idx + 1], x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 172, in forward
    x = self.layer(x)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/nfs-mnj-hot-02/tmp/sosk/pynif3dcon/pynif3d/pynif3d/models/con/unet3d.py", line 82, in forward
    x = self.relu(self.convolution1(self.group_norm1(x)))
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/normalization.py", line 246, in forward
    input, self.num_groups, self.weight, self.bias, self.eps)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2112, in group_norm
    torch.backends.cudnn.enabled)
RuntimeError: CUDA out of memory. Tried to allocate 3.00 GiB (GPU 0; 31.75 GiB total capacity; 27.60 GiB already allocated; 2.92 GiB free; 27.72 GiB reserved in total by PyTorch)
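
The figures in the last line of the traceback can be read off programmatically. The following is a minimal sketch, assuming the message keeps the PyTorch 1.7-style wording shown above with all sizes in GiB; `parse_oom_message` is a hypothetical helper name, not a PyTorch or pynif3d function. It shows that the failed 3.00 GiB request only slightly exceeds the 2.92 GiB reported as free.

```python
import re

# Final line of the traceback above, copied verbatim.
OOM_MESSAGE = (
    "RuntimeError: CUDA out of memory. Tried to allocate 3.00 GiB "
    "(GPU 0; 31.75 GiB total capacity; 27.60 GiB already allocated; "
    "2.92 GiB free; 27.72 GiB reserved in total by PyTorch)"
)


def parse_oom_message(message):
    """Extract the GiB figures from an OOM message worded like the one above.

    Assumption: every size is reported in GiB; messages that use MiB
    would need additional handling.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        figures[name] = float(match.group(1))
    return figures


figures = parse_oom_message(OOM_MESSAGE)
# How far the failed request exceeds the memory reported as free.
shortfall = figures["requested"] - figures["free"]
print(f"requested {figures['requested']:.2f} GiB, free {figures['free']:.2f} GiB")
print(f"shortfall {shortfall:.2f} GiB")
```

A shortfall this small is consistent with a slightly smaller batch size (-bs 24) fitting on the same GPU.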

The environment (at mnj) is below, collected by running https://github.com/pytorch/pytorch/blob/master/torch/utils/collect_env.py:

PyTorch version: 1.7.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.10.2
Libc version: glibc-2.10

Python version: 3.7.4 (default, Aug 13 2019, 20:35:49)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-58-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-SXM2-32GB
GPU 1: Tesla V100-SXM2-32GB

Nvidia driver version: 460.91.03
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.1
[pip3] pytorch-pfn-extras==0.3.2
[pip3] torch==1.7.1
[pip3] torchtext==0.8.1
[pip3] torchvision==0.8.2
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] mkl                       2020.2                      256
[conda] mkl-service               2.3.0            py37he8ac12f_0
[conda] mkl_fft                   1.3.0            py37h54f3939_0
[conda] mkl_random                1.1.1            py37h0573a6f_0
[conda] numpy                     1.19.2           py37h54aff64_0
[conda] numpy-base                1.19.2           py37hfa32c7d_0
[conda] pytorch                   1.7.1           py3.7_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] pytorch3d                 0.4.0           py37_cu102_pyt171    pytorch3d
[conda] torchvision               0.8.2                py37_cu102    pytorch

soskek (Member, Author) commented Aug 11, 2021

For grid mode (= 3DConv), should we perhaps use --feat-dim 32 instead of the default 64?
[Image: Screenshot 2021-08-11 13 57 02]

@mihaimorariu mihaimorariu transferred this issue from another repository Aug 17, 2021
@mihaimorariu mihaimorariu added this to Backlog in pynif3d Aug 17, 2021
@mihaimorariu mihaimorariu added the high priority and question labels Aug 17, 2021

t2kasa commented Sep 2, 2021

Thanks for the issue.

Regarding the OOM: yes, this is expected when using grid mode with the default settings.
Please try adding the argument --n-levels 3 (or -nl 3) to set the depth of the 3D U-Net. This matches the setup of the official implementation, where the default depth is 4 for the 2D U-Net and 3 for the 3D U-Net (see the official config for details).

The following document, which lists the pretrained models, is also useful for seeing typical argument combinations:
https://github.com/pfnet/pynif3d/blob/main/examples/pretrained_models.md
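
The --n-levels 3 suggestion above can be sketched as a full command line. This is only an illustration, assuming the -dd, -sd, -bs, and -nl flags behave as quoted in this thread; `build_train_command` is a hypothetical helper (not part of pynif3d) and `path/to/ShapeNet` is a placeholder for the local dataset directory.

```python
import shlex


def build_train_command(data_dir, save_dir, n_levels=3, batch_size=None):
    """Assemble the CON training command from the flags quoted in this thread."""
    command = [
        "python", "-u", "examples/con/train.py",
        "-dd", data_dir,        # dataset directory
        "-sd", save_dir,        # directory for saved models
        "-nl", str(n_levels),   # U-Net depth, as suggested above
    ]
    if batch_size is not None:
        # Optional: reduce the batch size as well, as in the original report.
        command += ["-bs", str(batch_size)]
    return command


# "path/to/ShapeNet" is a placeholder, not a real path from this thread.
command = build_train_command("path/to/ShapeNet", "saved_models_grid")
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root, or the printed line can be pasted into a shell.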

soskek (Member, Author) commented Sep 2, 2021

Thank you! I'll modify the UNet arg too.

@soskek soskek closed this as completed Sep 2, 2021
pynif3d automation moved this from Backlog to Done Sep 2, 2021