ipex-collect-env.txt
Please refer to the attached file, FYI.
Yes, I've followed the guide at https://intel.github.io/intel-extension-for-pytorch/llm/llama3/xpu/ and can generate 32 output tokens, but anything larger crashes.
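For reference, the flow I'm using boils down to roughly the following (a simplified sketch, not my exact script; the model id and prompt are placeholders):

```python
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device with PyTorch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder; a local path works too

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model = model.to("xpu").eval()

inputs = tokenizer("Tell me about Intel GPUs.", return_tensors="pt").to("xpu")

with torch.no_grad():
    # max_new_tokens=32 completes fine; raising it much beyond that crashes
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```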
BTW, section 2.1.1 shows how to convert the model, using a particular branch, but nothing explains which code or branch to use if we don't want to convert the model and just want to run it. It's very confusing.
Thanks,
We will further investigate why longer output sequences are causing crashes.
The prolonged conversion time comes from the quantization algorithm, which needs many iterations to produce accurate quantized weights. We are working on enhancing the WoQ LLM solution in ipex so that it can directly load pre-quantized models provided by the community, simplifying the quantization step.
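As a toy illustration of why this takes so long (this is not ipex's actual algorithm), compare one-shot round-to-nearest quantization with an accuracy-aware variant that iteratively searches per-channel scales; the search makes many full passes over the weights:

```python
import torch

def rtn_int4(w: torch.Tensor):
    # One-shot round-to-nearest int4, per output channel (int4 range is [-8, 7]).
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q, scale

def search_int4(w: torch.Tensor, n_grid: int = 80):
    # Toy accuracy-aware variant: grid-search a shrink factor per channel that
    # minimizes reconstruction error. Many passes over the weights -> slow.
    best_q, best_scale = rtn_int4(w)
    best_err = ((best_q * best_scale - w) ** 2).sum(dim=1)
    base = w.abs().amax(dim=1, keepdim=True) / 7.0
    for i in range(1, n_grid + 1):
        scale = base * (1.0 - i / (2.0 * n_grid))  # try progressively smaller scales
        q = torch.clamp(torch.round(w / scale), -8, 7)
        err = ((q * scale - w) ** 2).sum(dim=1)
        better = err < best_err
        best_q[better], best_scale[better] = q[better], scale[better]
        best_err = torch.minimum(best_err, err)
    return best_q, best_scale

w = torch.randn(1024, 1024)  # one toy weight matrix; a real LLM has hundreds
q, s = search_int4(w)        # noticeably slower than rtn_int4(w)
```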
I was testing Llama 3 on a Windows Intel Core Ultra platform (Ultra 9 185, 32 GB RAM); whenever I set max_new_tokens > 64, the program crashed.
The code and logs are attached.
log_crash.txt
run_generation_gpu_woq_for_llama.py.txt
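For quick reference, the attached script boils down to roughly the following (a simplified sketch, not the exact file; the loading flags follow intel-extension-for-transformers examples and may differ across versions):

```python
import torch
import intel_extension_for_pytorch as ipex  # enables the "xpu" device
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_4bit=True,   # weight-only int4 quantization
    device_map="xpu",
)

inputs = tokenizer("Hello", return_tensors="pt").to("xpu")
out = model.generate(**inputs, max_new_tokens=65)  # anything > 64 crashes here
print(tokenizer.decode(out[0], skip_special_tokens=True))
```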