I am requesting support for fp16 inference with self-speculative decoding on XPU in the fastchat ipex LLM worker module - it does not appear to be supported currently.
Currently, passing --low-bit "fp16" to ipex_llm.serving.fastchat.ipex_llm_worker results in the following error:
/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-04-29 07:20:32,788 - INFO - intel_extension_for_pytorch auto imported
2024-04-29 07:20:33 | INFO | model_worker | Loading the model ['neural-chat-7b-v3-3'] on worker 4e2d5da6, worker type: BigDLLLM worker...
2024-04-29 07:20:33 | INFO | model_worker | Using low bit format: fp16, device: xpu
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |
****Usage Error
Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |
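For reference, the worker was launched roughly as follows (a sketch; the model path is a placeholder, and the flag names are taken from the worker's documented options):

```bash
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker \
    --model-path ./neural-chat-7b-v3-3 \
    --low-bit "fp16" \
    --device "xpu"
```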
Just to provide a bit more information @gc-fu - here the worker passes torch_dtype as "auto", but the fp16 self-speculative decoding example shows that torch_dtype should be set to torch.float16. There are also other parameters in that example that aren't provided when launching via the ipex_llm_worker - specifically "speculative" and "optimize_model". This is why I marked this as a feature request rather than a bug: I assumed this mode just isn't supported yet for the ipex_llm_worker module (it would be nice if it were, though). A rough sketch of how the example loads the model is shown below.
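This is a minimal sketch of the fp16 self-speculative decoding model loading, assuming the standard ipex_llm.transformers API; the model path is a placeholder and the XPU move follows the example referenced above:

```python
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

# Placeholder model path for illustration.
model_path = "./neural-chat-7b-v3-3"

# load_in_low_bit="fp16" requires torch_dtype=torch.float16 (per the error above);
# the example also passes optimize_model=True and speculative=True,
# which the ipex_llm_worker does not currently set.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,   # required when load_in_low_bit="fp16"
    optimize_model=True,         # enable IPEX-LLM model optimizations
    speculative=True,            # enable self-speculative decoding
    trust_remote_code=True,
    use_cache=True,
)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
```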
@brosenfi
Self-speculative decoding with the fastchat worker will be supported in this PR.
However, the speculative example only supports running on Intel Max GPUs due to memory usage limitations. You can try it on a Max GPU or on CPU later.