
ResourceExhaustedError MultiResUNet3D #4

Closed
emanuelolaya opened this issue Jun 22, 2019 · 8 comments

Comments

@emanuelolaya

emanuelolaya commented Jun 22, 2019

When I try to train the MultiResUNet3D model with input shape = (128, 128, 128, 1) and batch size = 1, Keras raises this exception:

ResourceExhaustedError: OOM when allocating tensor with shape[1,128,128,128,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/Adadelta/gradients/zeros_70}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node loss/mul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
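
A minimal sketch of following that hint, assuming a TF 1.x-compatible Keras setup (the loss below is a placeholder; the optimizer is Adadelta, as in the trace above):

import tensorflow as tf

# RunOptions with report_tensor_allocations_upon_oom makes TF list the live tensors when an OOM occurs
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Keras with the TF backend forwards extra compile() kwargs to the underlying session calls
model3D.compile(optimizer='adadelta', loss='binary_crossentropy', options=run_opts)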

Also, I used the following function to estimate the GPU memory that Keras needs:

# function taken from:
# https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model

import numpy as np
from keras import backend as K

def get_model_memory_usage(batch_size, model):
    # memory taken by the output tensor (activations) of every layer
    shapes_mem_count = 0
    for l in model.layers:
        single_layer_mem = 1
        for s in l.output_shape:
            if s is None:
                continue
            single_layer_mem *= s
        shapes_mem_count += single_layer_mem
    # memory taken by the weights
    trainable_count = np.sum([K.count_params(p) for p in set(model.trainable_weights)])
    non_trainable_count = np.sum([K.count_params(p) for p in set(model.non_trainable_weights)])
    # bytes per number, depending on the float precision
    number_size = 4.0
    if K.floatx() == 'float16':
        number_size = 2.0
    if K.floatx() == 'float64':
        number_size = 8.0
    total_memory = number_size * (batch_size * shapes_mem_count + trainable_count + non_trainable_count)
    gbytes = np.round(total_memory / (1024.0 ** 3), 3)
    return gbytes

and the output when I execute the following code:

# where "model3D" is the MultiResUNet3D model.
get_model_memory_usage(1, model3D)  # output: 21.628 "GB"

# where "model2D" is the 2D MultiResUNet model.
get_model_memory_usage(1, model2D)  # output: 0.256 "GB"

Is it normal that the model does not run?
How much GPU memory do I need?

My GPU is a GTX 1080 Ti (11 GB).

@nibtehaz
Owner

In our experiments with the MultiResUNet3D we used 3D MRI images of dimension 80x80x48x4 and batch size = 2. We used a Titan Xp GPU with 12 GB of memory.

It would seem that your input images are too large to fit the model into an 11 GB GPU. Perhaps you can reduce the size of the images, as we did, to overcome the memory constraint, since the 3D model is indeed quite expensive. Alternatively, you can reduce the number of layers and/or kernels to fit the model with the given image size into your GPU. As the get_model_memory_usage() function reports, you would need 21.628 GB of memory to fit the model on your GPU.
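
For instance, a minimal sketch of the first suggestion, downsampling each input volume before training (scipy.ndimage.zoom and the 80x80x48 target are assumptions based on the sizes mentioned above, not code from this repository):

import numpy as np
from scipy.ndimage import zoom

def downsample_volume(vol, target_shape=(80, 80, 48)):
    # resize a single-channel 3D volume (D, H, W) to target_shape via linear interpolation
    factors = [t / s for t, s in zip(target_shape, vol.shape)]
    return zoom(vol, factors, order=1)

vol = np.random.rand(128, 128, 128).astype('float32')  # placeholder volume
print(downsample_volume(vol).shape)  # (80, 80, 48)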

@saskra

saskra commented May 5, 2020

I have a similar problem: even with severely reduced image resolution and batch size, I get 46.409 GB of model memory usage. But I also have three graphics cards with 32 GB of memory each. Do you have a hint on how I can make your model use all of them together?

Traceback (most recent call last):                                                                                                                                                      
  File "/home/x/PycharmProjects/MultiResUNet/run_on_2D.py", line 95, in <module>
    model_dir=model_dir2)
  File "/home/x/PycharmProjects/MultiResUNet/mrun_functions.py", line 298, in train_step
    callbacks=[es, TqdmCallback(verbose=1)])
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[10,64,288,288] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/zeros_86}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
         [[metrics/dice_coef/Identity/_2691]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: OOM when allocating tensor with shape[10,64,288,288] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/zeros_86}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Process finished with exit code 1

Edit: In my case, the following seems to help:

from keras.utils import multi_gpu_model
model = multi_gpu_model(model, gpus=3)  # replicate the model across 3 GPUs; each batch is split between them
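
A slightly expanded sketch of that usage (the optimizer, loss, and file name are placeholders, not the actual settings here): keeping a separate reference to the single-GPU template makes it possible to save the weights from it afterwards.

from keras.utils import multi_gpu_model

template = model                            # the already-built single-GPU MultiResUNet
model = multi_gpu_model(template, gpus=3)   # training replica spread across 3 GPUs
model.compile(optimizer='adam', loss='binary_crossentropy')
# ... model.fit(...) as usual; each batch is split across the 3 GPUs ...
template.save_weights('weights.h5')         # save from the template, not the replica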

@saskra

saskra commented Apr 27, 2021

Sometimes I get the above-mentioned error even though get_model_memory_usage told me that it should easily fit. Anyone else?

@nibtehaz
Owner

I'm not sure what may be happening, but maybe some memory is already allocated due to caching on the GPU? You may have a look here:

tensorflow/tensorflow#17048 (comment)
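
A related knob that sometimes helps (an assumption on my part, not necessarily what the linked comment suggests) is letting TensorFlow allocate GPU memory on demand instead of reserving it all up front, with a TF 1.x-style session:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # grow the allocation as needed instead of grabbing all memory
K.set_session(tf.Session(config=config))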

@saskra

saskra commented Apr 28, 2021

Unfortunately, I wouldn't know how to restart the Python interpreter inside a Python loop for leave-one-out cross-validation.

Perhaps the above-mentioned function for calculating the required memory is simply wrong (I haven't found a better one yet, though). When I choose the batch size and image size so that it should just fit in the graphics memory, it basically never works. I always use one of my three graphics cards as a buffer, and even then it doesn't always work.

@nibtehaz
Owner

@saskra, if you are running multiple training sessions in one Python script, like doing LOOCV or 5-fold CV, I follow a kind of hack. I don't know if it is general or not, so please don't quote me on this 😅

I too found that the memory crashes in such cases. So, after a training session has completed, I put the following lines in the code:

import gc
import time

gc.collect()
time.sleep(30)
gc.collect()
time.sleep(30)
gc.collect()

This somewhat frees the cache (at least in my experience). You may try it to see whether it works for you, as it is just kind of a hack.
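
A commonly used alternative between cross-validation folds (sketched here as an assumption, not something stated above) is to tear down the Keras/TensorFlow graph before building the next model:

import gc
from keras import backend as K

del model            # the model object from the fold that just finished
K.clear_session()    # resets the TF graph/session so the next fold starts fresh
gc.collect()         # drop the now-unreferenced Python objects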

@saskra

saskra commented Apr 29, 2021

Thank you, I will try that out! But does the gc actually free GPU memory?

Nevertheless, the calculation of the required memory of the model seems to be wrong, because the crash can also occur on the first run of the loop.

@nibtehaz
Owner

nibtehaz commented May 1, 2021

No @saskra, GC is unlikely to free GPU memory. I think the time.sleep() does most of the trick. Nevertheless, I put it there in case it frees some RAM.
