Cannot explain recurring OOM error #66

Open
Remorax opened this issue Mar 17, 2023 · 6 comments


Remorax commented Mar 17, 2023

Hi there,

I am trying to use the int8-quantized BLOOM 175B model for inference and am closely following the bloom-accelerate-inference.py script. I have about 1000 prompts for which I need outputs. I use a beam size of 1 (greedy search) and a batch size of 1, since I can't fit more into GPU memory (I have four 80 GB A100 GPUs). max_new_tokens is set to 64.

When running inference on this list of prompts, the script successfully generates outputs for the first few prompts (61 in this case) and then crashes with an OOM error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 79.17 GiB total capacity; 77.63 GiB already allocated; 11.31 MiB free; 77.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Though long prompts often cause OOM, I do not think the length of the current prompt is to blame here. I logged prompt lengths just to make sure: prompts longer than the current one were generated successfully earlier (among the first 61 prompts mentioned above).

I am unable to figure out what the possible reason could be. Any suggestions/ideas?
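
For reference, the max_split_size_mb suggestion in the error message can be tried by setting PYTORCH_CUDA_ALLOC_CONF before the first CUDA allocation. A minimal sketch (the 128 MiB split size is purely an illustrative value, not a recommendation):

    import os
    # Must be set before PyTorch allocates any CUDA memory,
    # e.g. at the very top of the script, before the model is loaded.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value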

@mayank31398 (Collaborator)

Can you provide a bit more detail?
How have you launched the job?
Is this a standalone job or a server deployment using the Makefile?


Remorax commented Mar 20, 2023

Hello, thank you so much for responding. I launch it as a standalone job like this:

CUDA_VISIBLE_DEVICES=0,1,2,3 python ${preprocessing_dir}/query_bloom.py \
    --name bigscience/bloom --dtype int8 \
    --batch_size 1 --num-beams 1 --early-stopping \
    --prompts_file ${results_dir}/prompts.pkl \
    --hypo_file ${results_dir}/hypo.txt

prompts.pkl was created by a previous preprocessing script that works as expected. The only potential issue I could think of is that it generates overly long prompts, but as explained earlier, prompt length does not appear to be the cause, since longer prompts have already been generated successfully (unless there is a memory leak).

I have uploaded query_bloom.py as a gist here. It is based on the bloom-accelerate-inference.py script and is a thin wrapper around it.

Let me know if this suffices!
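
To check the memory-leak hypothesis, a minimal diagnostic would be to log allocator statistics after each prompt. A sketch (prompts, hypos, and generate_one are hypothetical names standing in for whatever per-prompt generation call query_bloom.py actually makes):

    import torch

    for i, prompt in enumerate(prompts):
        hypos.append(generate_one(prompt))  # hypothetical per-prompt generation call
        torch.cuda.synchronize()
        allocated = torch.cuda.memory_allocated() / 2**30  # GiB held by live tensors
        reserved = torch.cuda.memory_reserved() / 2**30    # GiB held by the caching allocator
        print(f"prompt {i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
    # Steady growth in "allocated" across prompts would point to a leak (outputs or
    # caches kept alive); a large reserved-vs-allocated gap would point to fragmentation.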


richarddwang commented Mar 22, 2023

Maybe it is because it was trying to generate too many tokens? Depending on the content of each prompt, it will generate a different number of new tokens.

@mayank31398 (Collaborator)

could be


Remorax commented Mar 22, 2023

Hi @richarddwang, yes, but I do set max_new_tokens to 64 (L20 in the gist), so this does not seem to be the issue.
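
For reference, the corresponding generate call looks roughly like this (a sketch with illustrative variable names; only max_new_tokens=64 and greedy search are taken from the settings described above):

    # Greedy search (num_beams=1, no sampling), capped at 64 new tokens per prompt.
    output_ids = model.generate(**inputs, max_new_tokens=64, num_beams=1, do_sample=False)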

@mayank31398 (Collaborator)

Could be due to a large number of input tokens.
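
If input length does turn out to be the problem, one way to bound it would be to truncate at tokenization time. A sketch (prompt is the raw text of one example; the 1024-token cap is an arbitrary illustrative value):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
    # truncation=True together with max_length caps the number of input tokens per prompt.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)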
