OPT model #165
Comments
Solved in the latest release (v0.5.0). |
@OlivierDehaene thanks for your hard work. I have tested OPT-13B again. It works fine on inference for a few tries, but then causes OOM on the GPU as follows. I loaded with the --quantize option; it occupies 15G of memory on start, but after a few inference calls it causes OOM. I have a 4090 and GPU memory is 24G.
|
Thanks! You can use the benchmarking tool https://github.com/huggingface/text-generation-inference/tree/main/benchmark to help you. |
@OlivierDehaene my launch parameters are as follows.
|
Thank you very much for your work! I tried GPT-J 6B on a 24GB card and it behaves exactly like @lcw99 said.
Would love your advice. |
Are you also using quantization? |
@OlivierDehaene Yes I am using quantization. |
@OlivierDehaene I've tested several cases. OOM occurs only when I use streaming. With streaming, even if there are no inference calls, the GPU keeps running (GPU-util 50%) for a long time after an inference call finishes, and sometimes OOM occurs without any client calls at all. |
Can you both:
|
I have tested on two configurations; both have the same issue.
|
Can you run the following commands:
make install-benchmark
text-generation-launcher --num-shard 2 --quantize --port 8080 --model-id "./Models/GPT-NeoX-20B-instruct-native/checkpoint-100" --max-input-length 1500 --max-total-tokens 2048
and then:
text-generation-benchmark --tokenizer-name EleutherAI/gpt-neox-20b --batch-size 32 --sequence-length 1500 --decode-length 548
If the benchmarking command fails, it means that your setup cannot handle the maximum load you might be sending to it. As I stated above:
It is entirely possible that you don't OOM when the load on the system is low because the batches stay small, and that once usage grows you then OOM. Also, since the sequence lengths are dynamic, one batch of size N with small sequences might go through while another batch of the same size N with longer sequences might fail. That's why you need to make sure that your max_input_length / max_total_tokens / max_batch_size combination works in the worst-case scenario ahead of time.
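To put rough numbers on that worst case, here is a back-of-the-envelope sketch in Python. The layer count and hidden size are the published GPT-NeoX-20B values; an fp16 KV cache and nothing else are assumed, so treat the figures as illustrative only.

# Rough, illustrative worst-case KV-cache estimate for the benchmark above.
# Assumptions: GPT-NeoX-20B dimensions (44 layers, hidden size 6144), fp16 cache;
# weights, activations and framework overhead are not counted.
N_LAYERS = 44
HIDDEN_SIZE = 6144
BYTES_PER_VALUE = 2  # fp16

def kv_cache_bytes(batch_size: int, total_tokens: int) -> int:
    # One key and one value vector per token, per layer.
    per_token = 2 * N_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE
    return batch_size * total_tokens * per_token

for batch_size in (8, 32):
    gib = kv_cache_bytes(batch_size, 1500 + 548) / 1024**3
    print(f"batch {batch_size:>2}: ~{gib:.1f} GiB of KV cache at full length")

# Prints ~16.5 GiB for batch 8 and ~66 GiB for batch 32. Split across two shards
# that is roughly 8 GiB vs 33 GiB per GPU, before the quantized weights are even
# counted, which is consistent with batch 8 fitting and batch 32 OOMing here.

Even if these estimates are off by a constant factor, the point stands: memory grows linearly with batch size times total sequence length, so the worst case has to be budgeted up front.
|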
@OlivierDehaene I've tried your command settings, and I get OOM immediately. It works with a max batch size of 8. I guess that in the streaming case each retrieval of an event generates a batch on the server side, which would mean generate_stream normally needs more GPU memory on the server side than non-streaming generate. Is that right? |
No, they both have the same memory requirements, as generate uses generate_stream in the backend.
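For reference, both entry points can be tried from the Python client in this repo. Below is a minimal sketch that assumes a local server on port 8080; the URL, prompt, and parameters are placeholders to adapt to your setup.

# Minimal sketch using the `text_generation` Python client from this repo.
# The URL, prompt and max_new_tokens are placeholders for illustration only.
from text_generation import Client

client = Client("http://127.0.0.1:8080")

# Non-streaming: the full completion comes back in a single response.
response = client.generate("What is Deep Learning?", max_new_tokens=64)
print(response.generated_text)

# Streaming: tokens arrive one by one, but on the server both requests go
# through the same generation loop, so the memory footprint is the same.
for chunk in client.generate_stream("What is Deep Learning?", max_new_tokens=64):
    if not chunk.token.special:
        print(chunk.token.text, end="", flush=True)
print()
|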
The advice around benchmarking, and in particular setting the batch size, helped me, thanks. I still see the server continue to print tokens (into the log) beyond max_tokens in streaming mode, even though the client for loop is done - I believe @lcw99 mentioned experiencing something similar. Thanks again for your help Olivier. |
I can understand @OlivierDehaene's explanation and I understand the benchmarks. However, in a real situation it is somewhat hard to understand how streaming calls made sequentially, one by one, can trigger GPU OOM when there is more than 15G of GPU memory left. |
What do you mean? |
I have tested under the following conditions.
I've run just one client instance, and I still get OOM sometimes.
I've tried to run OPT-13B. The model is loaded successfully, but at inference time the following error occurred.