
probav training memory error #24

Open
robmarkcole opened this issue Aug 12, 2021 · 5 comments

Comments

@robmarkcole

Using Colab Pro with nominally 25 GB, I am still running out of memory at 17 epochs using your probav example notebook. Is there any way to free memory on the fly? I was able to train the TensorFlow RAMS implementation for 50 epochs on Colab Pro.

CUDA out of memory. Tried to allocate 46.00 MiB (GPU 0; 15.90 GiB total capacity; 14.01 GiB already allocated; 25.75 MiB free; 14.96 GiB reserved in total by PyTorch)
@isaaccorley
Owner

I trained the model in the example on Colab with a Tesla T4, so maybe it gave you a different GPU?

Anyway, you can decrease the batch_size and then increase accumulate_grad_batches to deal with this. Alternatively, you can reduce the num_res_blocks in the model.
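For reference, a rough sketch of what that could look like with PyTorch Lightning (not the exact notebook code; `model` and `train_dataset` stand in for the objects built in the probav example notebook):

```python
# Hypothetical sketch: trade batch size for gradient accumulation so the
# effective batch size stays the same while per-step GPU memory drops.
import pytorch_lightning as pl
from torch.utils.data import DataLoader

# `model` and `train_dataset` are placeholders for the LightningModule and
# dataset created earlier in the probav example notebook.
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)  # was batch_size=4

trainer = pl.Trainer(
    max_epochs=50,
    accumulate_grad_batches=2,  # 2 accumulated batches of 2 ~= one batch of 4
)
trainer.fit(model, train_loader)
```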

@isaaccorley
Owner

Strange. There must be a memory leak somewhere, maybe in PyTorch Lightning? I'll look into this.

@robmarkcole
Author

robmarkcole commented Aug 12, 2021

I had a Tesla P100. Will try your suggestions.

  • Decreasing batch_size from 4 to 2 and doubling (I assume?) accumulate_grad_batches to 2 (default 1) results in an immediate CUDA out of memory error.
  • Reverting accumulate_grad_batches to 1 and keeping the reduced batch size of 2, training proceeds without error but simply stops at 18 epochs with the log Epoch 18, global step 5091: val_loss was not in top 1 - not sure why it stops?

@isaaccorley
Owner

isaaccorley commented Aug 12, 2021

  1. Not sure why you are getting that behavior when changing those params. If anything, that should free up more memory.
  2. There is an early stopping callback; that is likely what is causing this. You can increase the patience param in the callback (see the sketch below).
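Something like this rough sketch, assuming the callback monitors val_loss as the checkpoint log above suggests:

```python
# Hypothetical sketch: raise the early-stopping patience so training is not
# cut off after a few non-improving epochs.
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=20)  # raised from 5
trainer = Trainer(max_epochs=50, callbacks=[early_stop])
```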

@robmarkcole
Author

Increasing patience from 5 to 20, I managed to get one additional epoch before the memory error.
