crash when output token length is > 64 #633

Open
Edward-Lin opened this issue May 21, 2024 · 4 comments

Labels: ARC (ARC GPU), Crash (Execution crashes), LLM, Windows

Comments

@Edward-Lin

I was testing llama3; when I set max_new_tokens > 64 on a Windows Core Ultra platform (Ultra 9 185, 32 GB), the program crashes.
The code and logs are attached:
log_crash.txt
run_generation_gpu_woq_for_llama.py.txt
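
For context, a minimal sketch of the kind of generation call that fails. This assumes the Hugging Face transformers API plus the ipex XPU build; it is not the attached run_generation_gpu_woq_for_llama.py, and the model id is only a placeholder.

```python
# Minimal sketch, assuming transformers + the intel-extension-for-pytorch XPU build.
# Not the attached script; the model id is a placeholder (the real run uses a WOQ variant).
import torch
import intel_extension_for_pytorch as ipex  # importing registers the "xpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")

# max_new_tokens <= 64 completes on the Core Ultra / Arc setup described above;
# raising it past 64 is what triggers the crash being reported here.
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```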

@huiyan2021 commented May 21, 2024

Hi @Edward-Lin,

Thanks for reporting; I will try to reproduce it first. Could you also collect your environment information by running https://github.com/intel/intel-extension-for-pytorch/blob/main/scripts/collect_env.py and upload the result here? Thanks!

By the way, are you referring to this guide: https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ ?

huiyan2021 self-assigned this May 21, 2024
huiyan2021 added the ARC (ARC GPU), Crash (Execution crashes), Windows, and LLM labels May 21, 2024
@Edward-Lin (Author)

ipex-collect-env.txt
Please refer to the attached file, FYI.
Yes, I've followed the guide at https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ and can generate 32 output tokens, but anything larger crashes.
BTW, chapter 2.1.1 shows how to convert the model using a different branch, but there is nothing explaining which code or branch to use if we don't want to convert the model and just want to run it. It's very confusing.
Thanks,

@zhuyuhua-v

  1. We will investigate further why longer output lengths are causing crashes.
  2. The long conversion time is due to the algorithm used by the quantized model, which needs many iterations to generate accurate quantized data. We are working on enhancing the WoQ LLM solution in ipex so that it can directly load pre-quantized models provided by the community, which will simplify the model quantization process (see the sketch below).
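
For what it's worth, here is a rough sketch of what point 2 could look like from the user's side, written against the Hugging Face transformers interface. This is not the ipex WoQ API; the checkpoint name is a made-up placeholder, and it assumes a matching quantization backend (e.g. auto-gptq) is installed.

```python
# Rough illustration only -- not the ipex WoQ API.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical community pre-quantized checkpoint (placeholder name, not a real repo).
prequantized_id = "some-org/Meta-Llama-3-8B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(prequantized_id)

# Because the checkpoint already ships its quantization_config, no local iterative
# quantization pass runs here; the slow conversion step described above is skipped.
model = AutoModelForCausalLM.from_pretrained(prequantized_id)
```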
