crash when output token length is > 64 #633

Open
Edward-Lin opened this issue May 21, 2024 · 4 comments

Labels: ARC (ARC GPU), Crash (Execution crashes), LLM, Windows

Comments

@Edward-Lin

I was testing llama3; when I set max_new_tokens > 64 on a Windows Core Ultra platform (Ultra 9 185, 32 GB), the program crashes.
The code and logs are attached:
log_crash.txt
run_generation_gpu_woq_for_llama.py.txt
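
For context, a minimal sketch of the kind of generation call that fails. This assumes the Hugging Face transformers API plus the ipex XPU build; it is not the attached run_generation_gpu_woq_for_llama.py, and the model id is only a placeholder.

```python
# Minimal sketch, assuming transformers + the intel-extension-for-pytorch XPU build.
# Not the attached script; the model id is a placeholder (the real run uses a WOQ variant).
import torch
import intel_extension_for_pytorch as ipex  # importing registers the "xpu" device
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("xpu")

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")

# max_new_tokens <= 64 completes on the Core Ultra / Arc setup described above;
# raising it past 64 is what triggers the crash being reported here.
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```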

@huiyan2021 commented May 21, 2024

Hi @Edward-Lin,

Thanks for reporting; I will try to reproduce it first. Could you also collect your environment information by running https://github.com/intel/intel-extension-for-pytorch/blob/main/scripts/collect_env.py and upload the result here? Thanks!

By the way, are you referring to this guide: https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ ?

huiyan2021 self-assigned this May 21, 2024
huiyan2021 added the ARC (ARC GPU), Crash (Execution crashes), Windows, and LLM labels May 21, 2024
@Edward-Lin (Author)

ipex-collect-env.txt
Please refer to the attached file, FYI.
Yes, I've followed the guide at https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ and can generate 32 output tokens, but anything larger crashes.
BTW, chapter 2.1.1 shows how to convert the model using a different branch, but there is nothing explaining which code or branch to use if we don't want to convert the model and just want to run it. It's very confusing.
Thanks,

@zhuyuhua-v

  1. We will investigate further why longer output lengths are causing crashes.
  2. The long conversion time is due to the algorithm used by the quantized model, which needs many iterations to generate accurate quantized data. We are working on enhancing the WoQ LLM solution in ipex so that it can directly load pre-quantized models provided by the community, which will simplify the model quantization process (see the sketch below).
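
For what it's worth, here is a rough sketch of what point 2 could look like from the user's side, written against the Hugging Face transformers interface. This is not the ipex WoQ API; the checkpoint name is a made-up placeholder, and it assumes a matching quantization backend (e.g. auto-gptq) is installed.

```python
# Rough illustration only -- not the ipex WoQ API.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical community pre-quantized checkpoint (placeholder name, not a real repo).
prequantized_id = "some-org/Meta-Llama-3-8B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(prequantized_id)

# Because the checkpoint already ships its quantization_config, no local iterative
# quantization pass runs here; the slow conversion step described above is skipped.
model = AutoModelForCausalLM.from_pretrained(prequantized_id)
```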
