CUDA out of memory #180

Closed
karimfayed opened this issue Mar 31, 2021 · 10 comments

@karimfayed

I ran the fine-tuning scripts in a virtual environment and it worked. Later on, I created a new virtual environment, and when I run the model again the following error keeps popping up:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.28 GiB already allocated; 4.55 MiB free; 1.28 GiB reserved in total by PyTorch)

Note: batch size is 1
The fine-tuning script: https://gist.github.com/jiahao87/50cec29725824da7ff6dd9314b53c4b3

@JingqingZ
Collaborator

JingqingZ commented Mar 31, 2021

I didn't see any issue with your code after a quick scan. Some suggestions for your consideration:

  1. A GPU with 16 GB of memory or more is recommended.
  2. Reduce the max input length or max target length to reduce memory cost.
  3. Model parallel: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html (a minimal sketch follows this list)
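
For illustration, a minimal sketch of splitting a model across two GPUs in the spirit of the tutorial linked above; the class name and layer sizes are made up for the example and are not taken from the fine-tuning script:

```python
import torch
import torch.nn as nn

# Toy two-GPU split following the PyTorch model-parallel tutorial pattern.
# Assumes at least two CUDA devices are visible; sizes are illustrative only.
class TwoGPUModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the network lives on GPU 0 ...
        self.part1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
        # ... and the second half on GPU 1, so neither card holds the full model.
        self.part2 = nn.Linear(1024, 1024).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Activations are copied across devices between the two halves.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
out = model(torch.randn(8, 1024))   # output tensor ends up on cuda:1
out.sum().backward()                # gradients flow back across both GPUs
```

Each half of the network then only has to fit on one card, at the cost of copying activations between devices in the forward and backward passes.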

@karrtikiyer

Hi @JingqingZ: if we have the configuration below, basically 8 GPUs with about 12 GB each, would it work, or would we still need to implement model parallel? Or does each GPU need at least 16 GB?
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:17.0 Off |                    0 |
| N/A   36C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:18.0 Off |                    0 |
| N/A   31C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 00000000:00:19.0 Off |                    0 |
| N/A   39C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 00000000:00:1A.0 Off |                    0 |
| N/A   33C    P8    32W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 00000000:00:1B.0 Off |                    0 |
| N/A   39C    P8    25W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 00000000:00:1C.0 Off |                    0 |
| N/A   34C    P8    30W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 00000000:00:1D.0 Off |                    0 |
| N/A   43C    P8    26W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@JingqingZ
Collaborator

The K80 can struggle with this, in my experience.

@karimfayed
Author

  2. Reduce the max input length or max target length to reduce memory cost.

How can I reduce the max input length or max target length? By max input length, do you mean the number of articles in the training dataset?

@karrtikiyer

@JingqingZ: instead of the K80, is there a hardware configuration you would recommend that works well for pegasus-large fine-tuning?

@JingqingZ
Collaborator

  2. Reduce the max input length or max target length to reduce memory cost.

How can I reduce the max input length or max target length? By max input length, do you mean the number of articles in the training dataset?

You may truncate the input text (and target text) to a shorter length, for example 256 tokens for the input text instead of 512 or 1024.
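
As a minimal sketch of that truncation with the Hugging Face tokenizer used by the fine-tuning gist; the 256/64 token limits and the google/pegasus-large checkpoint are assumptions for illustration, not values from the gist:

```python
from transformers import PegasusTokenizer

# Illustrative checkpoint; swap in whatever model the fine-tuning script loads.
tokenizer = PegasusTokenizer.from_pretrained("google/pegasus-large")

articles = ["a long source article ..."]   # placeholder training texts
summaries = ["its reference summary ..."]  # placeholder target texts

# Truncate the source articles to 256 tokens instead of 512/1024.
inputs = tokenizer(articles, truncation=True, max_length=256,
                   padding=True, return_tensors="pt")

# Target summaries are usually short, so an even smaller limit is fine.
targets = tokenizer(summaries, truncation=True, max_length=64,
                    padding=True, return_tensors="pt")

print(inputs["input_ids"].shape, targets["input_ids"].shape)
```

Shorter sequences shrink both the activation memory and the attention cost, which is usually the quickest way to get past an out-of-memory error on a small GPU.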

@JingqingZ
Collaborator

@JingqingZ: instead of the K80, is there a hardware configuration you would recommend that works well for pegasus-large fine-tuning?

A V100 with 16 GB (or 32 GB) works fine for me. Or you may try a TPU v2 or v3.

@karimfayed
Author

  2. Reduce the max input length or max target length to reduce memory cost.

How can I reduce the max input length or max target length? By max input length, do you mean the number of articles in the training dataset?

You may truncate the input text (and target text) to a shorter length, for example 256 tokens for the input text instead of 512 or 1024.

I was going to do that, but I thought I would first try the code on Colab, and it worked great the first time; it only stopped after 1000 epochs because I ran out of disk space. When I tried it again on another day with the same dataset, this error popped up. Is it because the GPUs provided by Colab vary from time to time, or is there something else I'm not seeing?
[Screenshot of the error attached: WhatsApp Image 2021-04-05 at 2 26 49 AM]

@JingqingZ
Collaborator

I am sorry this particular issue is out of the scope of my knowledge.

@karimfayed
Author

I am sorry this particular issue is out of the scope of my knowledge.

Thank you for your previous tip about using 16 GB of memory.
I subscribed to Colab Pro and it worked fine, as it provided enough disk space and a 16 GB GPU.
It turns out free Colab allocates resources according to previous usage and other variables, which makes the process hard to repeat, since you will rarely get the same resources twice in a row.
