
ResourceExhaustedError MultiResUNet3D #4

Closed
emanuelolaya opened this issue Jun 22, 2019 · 8 comments

Comments

@emanuelolaya

emanuelolaya commented Jun 22, 2019

When I try to train the MultiResUNet3D model with input shape = (128, 128, 128, 1) and batch size = 1, Keras raises this exception:

ResourceExhaustedError: OOM when allocating tensor with shape[1,128,128,128,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node training/Adadelta/gradients/zeros_70}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node loss/mul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
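
A minimal sketch of following that hint, assuming a TF 1.x-compatible Keras setup (the loss below is a placeholder; the optimizer is Adadelta, as in the trace above):

import tensorflow as tf

# RunOptions with report_tensor_allocations_upon_oom makes TF list the live tensors when an OOM occurs
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Keras with the TF backend forwards extra compile() kwargs to the underlying session calls
model3D.compile(optimizer='adadelta', loss='binary_crossentropy', options=run_opts)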

Also, I used the following function to estimate the GPU memory that Keras needs:

# function taken from:
# https://stackoverflow.com/questions/43137288/how-to-determine-needed-memory-of-keras-model

import numpy as np
from keras import backend as K

def get_model_memory_usage(batch_size, model):
    # memory taken by the output tensor (activations) of every layer
    shapes_mem_count = 0
    for l in model.layers:
        single_layer_mem = 1
        for s in l.output_shape:
            if s is None:
                continue
            single_layer_mem *= s
        shapes_mem_count += single_layer_mem
    # memory taken by the weights
    trainable_count = np.sum([K.count_params(p) for p in set(model.trainable_weights)])
    non_trainable_count = np.sum([K.count_params(p) for p in set(model.non_trainable_weights)])
    # bytes per number, depending on the float precision
    number_size = 4.0
    if K.floatx() == 'float16':
        number_size = 2.0
    if K.floatx() == 'float64':
        number_size = 8.0
    total_memory = number_size * (batch_size * shapes_mem_count + trainable_count + non_trainable_count)
    gbytes = np.round(total_memory / (1024.0 ** 3), 3)
    return gbytes

and the output when I execute the following code:

# where "model3D" is the MultiResUNet3D model.
get_model_memory_usage(1, model3D)  # output: 21.628 "GB"

# where "model2D" is the 2D MultiResUNet model.
get_model_memory_usage(1, model2D)  # output: 0.256 "GB"

Is it normal that the model does not run?
How much GPU memory do I need?

My GPU is a GTX 1080 Ti (11 GB).

@nibtehaz
Owner

In our experiments with the MultiResUNet3D we used 3D MRI images of dimension 80x80x48x4 and batch size = 2. We used a Titan Xp GPU with 12 GB of memory.

It would seem that your input images are too large to fit the model into an 11 GB GPU. Perhaps you can reduce the size of the images, as we did, to overcome the memory constraint, since the 3D model is indeed quite expensive. Alternatively, you can reduce the number of layers and/or kernels to fit the model with the given image size into your GPU. As the get_model_memory_usage() function reports, you would need 21.628 GB of memory to fit the model on your GPU.
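
For instance, a minimal sketch of the first suggestion, downsampling each input volume before training (scipy.ndimage.zoom and the 80x80x48 target are assumptions based on the sizes mentioned above, not code from this repository):

import numpy as np
from scipy.ndimage import zoom

def downsample_volume(vol, target_shape=(80, 80, 48)):
    # resize a single-channel 3D volume (D, H, W) to target_shape via linear interpolation
    factors = [t / s for t, s in zip(target_shape, vol.shape)]
    return zoom(vol, factors, order=1)

vol = np.random.rand(128, 128, 128).astype('float32')  # placeholder volume
print(downsample_volume(vol).shape)  # (80, 80, 48)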

@saskra

saskra commented May 5, 2020

I have a similar problem: even with severely reduced image resolution and batch size, I get 46.409 GB of model memory usage. But I also have three graphics cards with 32 GB of memory each. Do you have a hint on how I can make your model use all of them together?

Traceback (most recent call last):                                                                                                                                                      
  File "/home/x/PycharmProjects/MultiResUNet/run_on_2D.py", line 95, in <module>
    model_dir=model_dir2)
  File "/home/x/PycharmProjects/MultiResUNet/mrun_functions.py", line 298, in train_step
    callbacks=[es, TqdmCallback(verbose=1)])
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/tensorflow_core/python/keras/backend.py", line 3476, in __call__
    run_metadata=self.run_metadata)
  File "/home/x/anaconda3/envs/MultiResUNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[10,64,288,288] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/zeros_86}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
         [[metrics/dice_coef/Identity/_2691]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: OOM when allocating tensor with shape[10,64,288,288] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node training/Adam/gradients/zeros_86}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Process finished with exit code 1

Edit: In my case, the following seems to help:

from keras.utils import multi_gpu_model
model = multi_gpu_model(model, gpus=3)  # replicate the model across 3 GPUs; each batch is split between them
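
A slightly expanded sketch of that usage (the optimizer, loss, and file name are placeholders, not the actual settings here): keeping a separate reference to the single-GPU template makes it possible to save the weights from it afterwards.

from keras.utils import multi_gpu_model

template = model                            # the already-built single-GPU MultiResUNet
model = multi_gpu_model(template, gpus=3)   # training replica spread across 3 GPUs
model.compile(optimizer='adam', loss='binary_crossentropy')
# ... model.fit(...) as usual; each batch is split across the 3 GPUs ...
template.save_weights('weights.h5')         # save from the template, not the replica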

@saskra

saskra commented Apr 27, 2021

Sometimes I get the above-mentioned error even though get_model_memory_usage told me that it should easily fit. Anyone else?

@nibtehaz
Owner

I'm not sure what may be happening, but maybe some memory is already allocated due to caching on the GPU? You may have a look here:

tensorflow/tensorflow#17048 (comment)
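
A related knob that sometimes helps (an assumption on my part, not necessarily what the linked comment suggests) is letting TensorFlow allocate GPU memory on demand instead of reserving it all up front, with a TF 1.x-style session:

import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True    # grow the allocation as needed instead of grabbing all memory
K.set_session(tf.Session(config=config))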

@saskra

saskra commented Apr 28, 2021

Unfortunately, I wouldn't know how to restart the Python interpreter inside a Python loop for leave-one-out cross-validation.

Perhaps the above-mentioned function for calculating the required memory is simply wrong (I haven't found a better one yet, though). When I choose the batch size and image size so that it should just fit in the graphics memory, it basically never works. I always use one of my three graphics cards as a buffer, and even then it doesn't always work.

@nibtehaz
Owner

@saskra, if you are running multiple training sessions in one Python script, like doing LOOCV or 5-fold CV, I follow a kind of hack. I don't know if it is general or not, so please don't quote me on this 😅

I too found that the memory crashes in such cases. So, after a training session has completed, I put the following lines in the code:

import gc
import time

gc.collect()
time.sleep(30)
gc.collect()
time.sleep(30)
gc.collect()

This somewhat frees the cache (at least in my experience). You may try it to see whether it works for you, as it is just kind of a hack.
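
A commonly used alternative between cross-validation folds (sketched here as an assumption, not something stated above) is to tear down the Keras/TensorFlow graph before building the next model:

import gc
from keras import backend as K

del model            # the model object from the fold that just finished
K.clear_session()    # resets the TF graph/session so the next fold starts fresh
gc.collect()         # drop the now-unreferenced Python objects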

@saskra

saskra commented Apr 29, 2021

Thank you, I will try that out! But does the gc actually free GPU memory?

Nevertheless, the calculation of the required memory of the model seems to be wrong, because the crash can also occur on the first run of the loop.

@nibtehaz
Owner

nibtehaz commented May 1, 2021

No @saskra, GC is unlikely to free GPU memory. I think the time.sleep() does most of the trick. Nevertheless, I put it there in case it frees some RAM.
