Problem with allocating memory (CUDA out of memory) #85

Closed
Einhartd opened this issue Nov 28, 2023 · 2 comments

Dist: Pop OS 22.04
nvidia-smi output:

Tue Nov 28 23:18:02 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off | 00000000:01:00.0 Off |                  N/A |
| N/A   58C    P8              10W /  60W |      9MiB /  4096MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2686      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I encountered an error while trying to train my first YOLO3D model.
I simply followed the instructions in docs/mono3d.md.
After I ran the command ./launchers/train.sh config/CONFIG_FILE_YOLO.py 0 proba, the program crashed with the error below:

Traceback (most recent call last):
  File "/home/einhart/visualDet3D/scripts/train.py", line 199, in <module>
    Fire(main)
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/einhart/anaconda3/envs/visualDet3D/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/einhart/visualDet3D/scripts/train.py", line 150, in main
    training_dection(data, detector, optimizer, writer, training_loss_logger, global_step, epoch_num, cfg)
  File "/home/einhart/visualDet3D/visualDet3D/networks/pipelines/trainers.py", line 35, in train_mono_detection
    classification_loss, regression_loss, loss_dict = module(
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_detector.py", line 126, in forward
    return self.training_forward(img_batch, annotations, calib)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_detector.py", line 91, in training_forward
    features  = self.core(dict(image=img_batch, P2=P2))
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/detectors/yolomono3d_core.py", line 16, in forward
    x = self.backbone(x['image'])
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/backbones/resnet.py", line 195, in forward
    x = layer(x)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/visualDet3D/visualDet3D/networks/backbones/resnet.py", line 82, in forward
    out = self.conv3(out)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/einhart/.local/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacty of 3.81 GiB of which 6.31 MiB is free. Including non-PyTorch memory, this process has 3.79 GiB memory in use. Of the allocated memory 3.60 GiB is allocated by PyTorch, and 89.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I have already tried reducing the batch_size value from 8 to 2 in CONFIG_FILE_YOLO.py, which I generated with this command:
cp Yolo3D_example $CONFIG_FILE.py

Do you have any idea how to reduce the allocated memory?
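
The figures in the error message above can be read off with a little arithmetic. The following is a minimal sketch, not part of visualDet3D or PyTorch: the helper names `to_mib` and `parse_oom` are hypothetical, and the message text is copied from the traceback (with its trailing hint omitted). It converts each figure to MiB and compares the reserved-but-unallocated cache with the size of the failed request.

```python
import re

# Error text copied from the traceback above (spelling as emitted by PyTorch).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 90.00 MiB. "
    "GPU 0 has a total capacty of 3.81 GiB of which 6.31 MiB is free. "
    "Including non-PyTorch memory, this process has 3.79 GiB memory in use. "
    "Of the allocated memory 3.60 GiB is allocated by PyTorch, and "
    "89.33 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value, unit):
    """Convert a (number, unit) pair from the message to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message):
    """Extract the memory figures from an OOM message; all values in MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capac\w* of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            stats[name] = to_mib(*match.groups())
    return stats


stats = parse_oom(OOM_MESSAGE)
# Share of the card occupied by live tensors.
live_share = stats["allocated"] / stats["total"]
print(f"live tensors: {live_share:.1%} of the GPU")
print(f"cached but unused: {stats['reserved_unallocated']:.2f} MiB")
print(f"failed request: {stats['requested']:.2f} MiB")
```

Under these numbers, live tensors occupy nearly the whole card and the unused cache is smaller than the single failed request, which suggests that the `max_split_size_mb` hint from the message would recover little here and that the model's actual footprint has to shrink.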

Owen-Liuyuxuan (Owner) commented Nov 29, 2023

I have not tried training the network with 4 GB of GPU memory. You could try changing the backbone to ResNet50, or reducing the batch size further to 1 (you may need to tune the learning rate after this).

However, it will be difficult to reproduce the full results this way.
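
The two suggestions above could be applied to a config along these lines. This is only a sketch under stated assumptions: the nested keys (`data.batch_size`, `detector.backbone.depth`, `optimizer.lr`), the placeholder values, and the helper `low_memory_overrides` are hypothetical stand-ins, so check the real field names in CONFIG_FILE_YOLO.py. The linear learning-rate scaling is a common starting heuristic, not something prescribed in this thread.

```python
import copy


def low_memory_overrides(cfg, batch_size=1, backbone_depth=50):
    """Return a copy of cfg with a smaller batch, a shallower backbone,
    and a learning rate scaled by the batch-size ratio."""
    new_cfg = copy.deepcopy(cfg)  # leave the original config untouched
    old_batch = new_cfg["data"]["batch_size"]
    new_cfg["data"]["batch_size"] = batch_size
    new_cfg["detector"]["backbone"]["depth"] = backbone_depth
    # Linear scaling heuristic (an assumption): treat it as a first guess
    # and tune from there.
    new_cfg["optimizer"]["lr"] *= batch_size / old_batch
    return new_cfg


# Hypothetical stand-in for the relevant values in CONFIG_FILE_YOLO.py.
example_cfg = {
    "data": {"batch_size": 8},
    "detector": {"backbone": {"depth": 101}},
    "optimizer": {"lr": 1e-4},
}

small_cfg = low_memory_overrides(example_cfg)
print(small_cfg)
```

Returning a modified copy keeps the original settings available for comparison if the low-memory run behaves differently.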

Einhartd (Author) commented Dec 4, 2023

It worked! Thank you so much!

Einhartd closed this as completed Dec 4, 2023