
DefaultCPUAllocator: can't allocate memory: you tried to allocate 195696230400 bytes #3

Closed
4ndr3aR opened this issue Jun 23, 2021 · 4 comments

4ndr3aR commented Jun 23, 2021

Hey there,

first of all, thank you for the wonderful repo, it works great!

However, I've been experimenting for a few hours now and I can't process more than seventy frames. I'm using a resolution of 960x480, but reducing the frame size doesn't seem to solve the problem.

Most of the time the script is interrupted by the usual CUDA OOM errors:

/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
  File "eval_generic.py", line 127, in <module>
    processor.interact(with_bg_msk, frame_idx, rgb.shape[1], obj_idx)
  File "/local/data/repos/STCN/inference_core_yv.py", line 119, in interact
    key_v = self.prop_net.encode_value(self.images[:,frame_idx].cuda(), qf16, self.prob[self.enabled_obj,frame_idx].cuda())
  File "/local/data/repos/STCN/model/eval_network.py", line 47, in encode_value
    f16 = self.value_encoder(frame, kf16.repeat(k,1,1,1), masks, others)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local/data/repos/STCN/model/modules.py", line 114, in forward
    x = self.bn1(x)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 178, in forward
    self.eps,
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/nn/functional.py", line 2282, in batch_norm
    input, weight, bias, running_mean, running_var, training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 1.04 GiB (GPU 0; 31.75 GiB total capacity; 29.64 GiB already allocated; 667.50 MiB free; 29.76 GiB reserved in total by PyTorch)
Processing video1 ...
N/A% (0 of 1) |                                                                                                                                        | Elapsed Time: 0:00:00 ETA:  --:--:--Traceback (most recent call last):
  File "eval_generic.py", line 107, in <module>
    mem_every=args.mem_every, include_last=args.include_last)
  File "/local/data/repos/STCN/inference_core_yv.py", line 38, in __init__
    self.prob = torch.zeros((self.k+1, t, 1, nh, nw), dtype=torch.float32, device=self.device)
RuntimeError: CUDA out of memory. Tried to allocate 47.21 GiB (GPU 0; 31.75 GiB total capacity; 416.90 MiB already allocated; 29.99 GiB free; 446.00 MiB reserved in total by PyTorch)
Processing video1 ...

But sometimes there are much more disturbing errors like this one:

Traceback (most recent call last):
  File "eval_generic.py", line 80, in <module>
    for data in progressbar(test_loader, max_value=len(test_loader), redirect_stdout=True):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/progressbar/shortcuts.py", line 10, in progressbar
    for result in progressbar(iterator):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/progressbar/bar.py", line 547, in __next__
    value = next(self._iterable)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/data/venvs/swav/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/local/data/repos/STCN/dataset/generic_test_dataset.py", line 101, in __getitem__
    masks = torch.from_numpy(all_to_onehot(masks, labels)).float()
RuntimeError: [enforce fail at CPUAllocator.cpp:71] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 195696230400 bytes. Error code 12 (Cannot allocate memory)

100% (1 of 1) |########################################################################################################################################| Elapsed Time: 0:00:00 ETA:  00:00:00

The command line is fairly standard:

python eval_generic.py --data_path /local/data/dataset/dummy-test-set --output /local/data/repos/STCN/output-dummy-test-set

The only thing that changes is the number of images and their resolution (960x480 is the maximum).

Is there a way to do inference one batch at a time, without allocating all the memory at the beginning, thereby avoiding all these OOMs?

Thank you!

hkchengrex (Owner) commented:

Can you print self.k?


4ndr3aR commented Jun 25, 2021

Here you are:

Processing video1 ...
InferenceCore.__init__() - t: 500 - h: 480 - w: 960
InferenceCore.__init__() - nh: 480 - nw: 960
N/A% (0 of 1) |                                                                                                                                        | Elapsed Time: 0:00:00 ETA:  --:--:--Traceback (most recent call last):
  File "eval_generic.py", line 107, in <module>
    mem_every=args.mem_every, include_last=args.include_last)
  File "/local/data/repos/STCN/inference_core_yv.py", line 42, in __init__
    self.prob = torch.zeros((self.k+1, t, 1, nh, nw), dtype=torch.float32, device=self.device)
RuntimeError: CUDA out of memory. Tried to allocate 47.21 GiB (GPU 0; 31.75 GiB total capacity; 416.90 MiB already allocated; 29.99 GiB free; 446.00 MiB reserved in total by PyTorch)
InferenceCore.__init__() - self.k: 54

Yep, 960 × 480 × 500 × 4 × 55 is exactly 47.21 GiB.
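A quick sanity check of that arithmetic (tensor shape taken from the traceback above, 4 bytes per float32 element):

```python
# Size of torch.zeros((k + 1, t, 1, nh, nw), dtype=torch.float32)
# with the values printed above: k = 54, t = 500, nh = 480, nw = 960.
k, t, nh, nw = 54, 500, 480, 960
size_bytes = (k + 1) * t * 1 * nh * nw * 4  # 4 bytes per float32
print(f"{size_bytes / 2**30:.2f} GiB")  # 47.21 GiB, matching the OOM message
```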

hkchengrex (Owner) commented:

self.k stores the number of objects -- do you really have 55 objects? Otherwise it seems like your mask files have some problems. We use np.unique to determine the number of objects.
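To illustrate the point with a toy mask (not from the repo): `np.unique` treats every distinct gray value as a separate label, so any interpolation artifact inflates the object count.

```python
import numpy as np

# A clean mask with one real object (label 5) on background 0:
clean = np.array([[0, 0, 5], [0, 5, 5]], dtype=np.uint8)
print(np.unique(clean))    # [0 5] -> one object

# Resizing with interpolation can introduce intermediate gray values,
# each of which is then counted as its own "object":
blurred = np.array([[0, 2, 5], [1, 3, 5]], dtype=np.uint8)
print(np.unique(blurred))  # [0 1 2 3 5] -> four "objects"
```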


4ndr3aR commented Jun 25, 2021

I actually have 5 objects in the mask, and I think I already know the problem: I resized the original masks to lower resolutions, and the anti-aliasing almost certainly produced a lot of intermediate colors, which are seen as new classes that don't really exist.
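The usual fix for this is to resize label masks with nearest-neighbor sampling (e.g. `Image.NEAREST` in PIL or `cv2.INTER_NEAREST` in OpenCV), which can never create values that weren't in the original mask. A minimal NumPy-only sketch of the idea (helper name is ours):

```python
import numpy as np

def resize_mask_nearest(mask: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbor resize: pick source pixels, never blend them."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source col for each output col
    return mask[rows[:, None], cols]

mask = np.array([[0, 0, 5, 5], [0, 0, 5, 5]], dtype=np.uint8)
small = resize_mask_nearest(mask, 1, 2)
print(np.unique(small))  # only the original labels survive: [0 5]
```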

Thanks for the support & debugging!
