I am requesting support for fp16 inference with self-speculative decoding on XPU in the fastchat ipex LLM worker module - it does not appear to be supported currently.
Currently, passing --low-bit "fp16" to ipex_llm.serving.fastchat.ipex_llm_worker results in the following error:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-04-29 07:20:32,788 - INFO - intel_extension_for_pytorch auto imported
2024-04-29 07:20:33 | INFO | model_worker | Loading the model ['neural-chat-7b-v3-3'] on worker 4e2d5da6, worker type: BigDLLLM worker...
2024-04-29 07:20:33 | INFO | model_worker | Using low bit format: fp16, device: xpu
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |
****Usage Error
Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |
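For reference, the worker was launched roughly as follows (a sketch; the model path is a placeholder, and the flag names are taken from the worker's documented options):

```bash
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
    --model-path ./neural-chat-7b-v3-3 \
    --low-bit "fp16" \
    --device "xpu"
```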
Just to provide a bit more information @gc-fu - here the worker passes torch_dtype as "auto", but the fp16 self-speculative decoding example shows that torch_dtype should be set to torch.float16. There are also other parameters in that example that aren't provided when launching via the ipex_llm_worker - specifically "speculative" and "optimize_model". This is why I marked this as a feature request rather than a bug: I assumed this mode just isn't supported yet for the ipex_llm_worker module (it would be nice if it were, though). A rough sketch of how the example loads the model is shown below.
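This is a minimal sketch of the fp16 self-speculative decoding model loading, assuming the standard ipex_llm.transformers API; the model path is a placeholder and the XPU move follows the example referenced above:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Placeholder model path for illustration.
model_path = "./neural-chat-7b-v3-3"

# load_in_low_bit="fp16" requires torch_dtype=torch.float16 (per the error above);
# the example also passes optimize_model=True and speculative=True,
# which the ipex_llm_worker does not currently set.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,   # required when load_in_low_bit="fp16"
    optimize_model=True,         # enable IPEX-LLM model optimizations
    speculative=True,            # enable self-speculative decoding
    trust_remote_code=True,
    use_cache=True,
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```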
@brosenfi
Self-speculative decoding with the fastchat worker will be supported in this PR.
However, the speculative example only supports running on Intel Max GPUs due to memory usage limitations. You can try it on a Max GPU or on CPU later.