
probav training memory error #24

Open
robmarkcole opened this issue Aug 12, 2021 · 5 comments

Comments

@robmarkcole

Using Colab Pro with nominally 25 GB, I am still running out of memory at 17 epochs using your probav example notebook. Is there any way to free memory on the fly? I was able to train the TensorFlow RAMS implementation for 50 epochs on Colab Pro.

CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 15.90 GiB total capacity; 14.01 GiB already allocated; 25.75 MiB free; 14.96 GiB reserved in total by PyTorch)
@isaaccorley
Owner

I trained the model in the example on Colab with a Tesla T4, so maybe it gave you a different GPU?

Anyway, you can decrease the batch_size and then increase accumulate_grad_batches to deal with this. Alternatively, you can reduce the num_res_blocks in the model.
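For reference, a rough sketch of what that could look like with PyTorch Lightning (not the exact notebook code; `model` and `train_dataset` stand in for the objects built in the probav example notebook):

```python
# Hypothetical sketch: trade batch size for gradient accumulation so the
# effective batch size stays the same while per-step GPU memory drops.
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# `model` and `train_dataset` are placeholders for the LightningModule and
# dataset created earlier in the probav example notebook.
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)  # was batch_size=4

trainer = pl.Trainer(
    max_epochs=50,
    accumulate_grad_batches=2,  # 2 accumulated batches of 2 ~= one batch of 4
)
trainer.fit(model, train_loader)
```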

@isaaccorley
Owner

Strange. There must be a memory leak somewhere, maybe in PyTorch Lightning? I'll look into this.

@robmarkcole
Author

robmarkcole commented Aug 12, 2021

I had a Tesla P100. Will try your suggestions.

  • Decreasing batch_size from 4 to 2 and doubling (I assume?) accumulate_grad_batches to 2 (default 1) results in an immediate CUDA out of memory error.
  • Reverting accumulate_grad_batches to 1 and keeping the reduced batch size of 2, training proceeds without error but simply stops at 18 epochs with the log Epoch 18, global step 5091: val_loss was not in top 1 - not sure why it stops?

@isaaccorley
Owner

isaaccorley commented Aug 12, 2021

  1. Not sure why you are getting that behavior when changing those params. If anything, that should free up more memory.
  2. There is an early stopping callback; that is likely what is causing this. You can increase the patience param in the callback (see the sketch below).
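Something like this rough sketch, assuming the callback monitors val_loss as the checkpoint log above suggests:

```python
# Hypothetical sketch: raise the early-stopping patience so training is not
# cut off after a few non-improving epochs.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=20)  # raised from 5
trainer = Trainer(max_epochs=50, callbacks=[early_stop])
```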

@robmarkcole
Author

Increasing patience from 5 to 20, I managed to get one additional epoch before the memory error.
