Question: RuntimeError: CUDA out of memory. #7

Closed
nyxrobotics opened this issue Dec 26, 2022 · 4 comments

Comments

@nyxrobotics

Please excuse the basic question; it is not a problem with this repository.
I want to annotate 1920x1080 images with labelme.
However, active learning runs out of GPU memory.
So I want to reduce the resolution to 640x360 only during training.
Can you give me some advice on where to change this?
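For what it's worth, here is a minimal sketch of the standard Detectron2 config keys that control training resolution (the keys themselves are standard Detectron2; exactly where maskal builds its cfg is an assumption, so this would go wherever the config is created before training):

```python
# Sketch only: lower the training/inference resolution via standard Detectron2 keys.
# The ResizeShortestEdge augmentation shown in the log below is driven by these values.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.INPUT.MIN_SIZE_TRAIN = (360,)  # shortest edge resized to 360 px during training
cfg.INPUT.MAX_SIZE_TRAIN = 640     # longest edge capped at 640 px during training
cfg.INPUT.MIN_SIZE_TEST = 360      # same idea at inference time
cfg.INPUT.MAX_SIZE_TEST = 640
```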

[12/26 13:00:27 d2.data.datasets.coco]: Loaded 5 images in COCO format from ./noodle2/datasets/train.json
[12/26 13:00:27 d2.data.build]: Removed 0 images with no usable annotations. 5 images left.
[12/26 13:00:27 d2.data.build]: Distribution of instances among all 3 categories:
|   category    | #instances   |   category    | #instances   |   category    | #instances   |
|:-------------:|:-------------|:-------------:|:-------------|:-------------:|:-------------|
| curry_noodles | 28           | seafood_noo.. | 29           | soy_sauce_n.. | 0            |
|               |              |               |              |               |              |
|     total     | 57           |               |              |               |              |
[12/26 13:00:27 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[12/26 13:00:27 d2.data.build]: Using training sampler TrainingSampler
[12/26 13:00:27 d2.data.common]: Serializing 5 elements to byte tensors and concatenating them all ...
[12/26 13:00:27 d2.data.common]: Serialized dataset takes 0.01 MiB
Skip loading parameter 'roi_heads.box_predictor.cls_score.weight' to the model due to incompatible shapes: (81, 1024) in the checkpoint but (4, 1024) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.cls_score.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (4,) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.bbox_pred.weight' to the model due to incompatible shapes: (320, 1024) in the checkpoint but (12, 1024) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.bbox_pred.bias' to the model due to incompatible shapes: (320,) in the checkpoint but (12,) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.mask_head.predictor.weight' to the model due to incompatible shapes: (80, 256, 1, 1) in the checkpoint but (3, 256, 1, 1) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.mask_head.predictor.bias' to the model due to incompatible shapes: (80,) in the checkpoint but (3,) in the model! You might want to double check if this is expected.
[12/26 13:00:28 d2.engine.train_loop]: Starting training from iteration 0
ERROR [12/26 13:00:29 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/numai/lib/maskal/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/numai/lib/maskal/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/numai/lib/maskal/detectron2/engine/train_loop.py", line 242, in run_step
    losses.backward()
  File "/home/numai/.local/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/numai/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 5.81 GiB total capacity; 3.16 GiB already allocated; 96.25 MiB free; 3.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Here is my nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   49C    P5    14W /  N/A |    837MiB /  5946MiB |     26%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
@nyxrobotics
Author

I got the same error even after resizing the images to 360p.
Is there a way to reduce GPU memory usage other than the image size?

[12/26 14:49:07 d2.engine.train_loop]: Starting training from iteration 0
ERROR [12/26 14:49:08 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/numai/lib/maskal/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/numai/lib/maskal/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/numai/lib/maskal/detectron2/engine/train_loop.py", line 242, in run_step
    losses.backward()
  File "/home/numai/.local/lib/python3.6/site-packages/torch/_tensor.py", line 307, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/numai/.local/lib/python3.6/site-packages/torch/autograd/__init__.py", line 156, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 5.81 GiB total capacity; 3.61 GiB already allocated; 77.75 MiB free; 3.74 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[12/26 14:49:08 d2.engine.hooks]: Total training time: 0:00:00 (0:00:00 on hooks)
[12/26 14:49:08 d2.utils.events]:  iter: 1  total_loss: 3.166  loss_cls: 1.695  loss_box_reg: 0.6886  loss_mask: 0.6974  loss_rpn_cls: 0.05768  loss_rpn_loc: 0.02732  data_time: 0.0986  lr: 1e-05  max_mem: 3713M
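The error message itself points at one knob to try: PYTORCH_CUDA_ALLOC_CONF. A minimal sketch of that, together with a couple of standard Detectron2 settings that are commonly lowered to reduce peak GPU memory (whether maskal exposes these directly is an assumption; the keys are standard Detectron2 config options):

```python
# Sketch only: allocator hint from the OOM message plus standard Detectron2 memory knobs.
# The environment variable must be set before any CUDA tensor is created,
# ideally at the very top of the training script.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

from detectron2.config import get_cfg

cfg = get_cfg()
cfg.SOLVER.IMS_PER_BATCH = 1                    # fewer images per training step
cfg.MODEL.RPN.BATCH_SIZE_PER_IMAGE = 128        # fewer sampled anchors (default 256)
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 256  # fewer sampled RoIs (default 512)
```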

@nyxrobotics
Author

I have verified that torch recognizes the CUDA device:

>>> print(torch.cuda.is_available())
True
>>> print(torch.cuda.current_device())
0
>>> print(torch.cuda.get_device_name())
NVIDIA GeForce RTX 3060 Laptop GPU
>>> print(torch.cuda.get_device_capability())
(8, 6)

@pieterblok
Owner

pieterblok commented Dec 26, 2022

I got the same error even after resizing the images to 360p.

Is there a way to reduce GPU memory usage other than the image size?

@nyxrobotics that's unfortunate. Quite honestly, I've never tested maskal or detectron2 on a GPU with less than 8 GB of memory. A 6 GB GPU might be too small for detectron2.

What you can do, besides lowering the image resolution, is to use a smaller Mask R-CNN architecture, such as the one with a ResNet-50 backbone. In maskal.yaml, change both the network_config and pretrained_weights to a Mask R-CNN config with a ResNet-50 backbone, for example: mask_rcnn_R_50_FPN_3x.yaml

That might help to reduce GPU memory load...
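For reference, assuming maskal.yaml points at the Detectron2 model zoo, the ResNet-50 entry that network_config and pretrained_weights would refer to can be looked up like this (a sketch; the exact layout of maskal.yaml is not shown here):

```python
# Sketch only: resolve the ResNet-50 Mask R-CNN entry in the Detectron2 model zoo,
# i.e. the config file and pretrained weights a smaller backbone would use.
from detectron2 import model_zoo

r50 = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
print(model_zoo.get_config_file(r50))      # local path to the config file
print(model_zoo.get_checkpoint_url(r50))   # URL of the pretrained R-50 weights
```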

@nyxrobotics
Author

mask_rcnn_R_50_FPN_3x.yaml worked for me.
Thank you!
