
Feature request: Support fp16 with self-speculative decoding on XPU in ipex_llm.serving.fastchat.ipex_llm_worker #10905

Closed
brosenfi opened this issue Apr 28, 2024 · 5 comments

@brosenfi

I am requesting support for fp16 inference with self-speculative decoding on XPU in the fastchat ipex_llm_worker module; this does not appear to be supported at the moment.

Trying to use --low-bit "fp16" with ipex_llm.serving.fastchat.ipex_llm_worker currently results in the following error:

/usr/local/lib/python3.11/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension. If you don't plan on using image functionality from torchvision.io, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have libjpeg or libpng installed before building torchvision from source?
warn(
2024-04-29 07:20:32,788 - INFO - intel_extension_for_pytorch auto imported
2024-04-29 07:20:33 | INFO | model_worker | Loading the model ['neural-chat-7b-v3-3'] on worker 4e2d5da6, worker type: BigDLLLM worker...
2024-04-29 07:20:33 | INFO | model_worker | Using low bit format: fp16, device: xpu
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |

****Usage Error
Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
2024-04-29 07:20:33 | ERROR | ipex_llm.utils.common.log4Error |

***Call Stack
2024-04-29 07:20:33 | ERROR | stderr | Traceback (most recent call last):
2024-04-29 07:20:33 | ERROR | stderr | File "", line 198, in _run_module_as_main
2024-04-29 07:20:33 | ERROR | stderr | File "", line 88, in _run_code
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 326, in <module>
2024-04-29 07:20:33 | ERROR | stderr | worker = BigDLLLMWorker(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/serving/fastchat/ipex_llm_worker.py", line 88, in __init__
2024-04-29 07:20:33 | ERROR | stderr | self.model, self.tokenizer = load_model(
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/loader.py", line 67, in load_model
2024-04-29 07:20:33 | ERROR | stderr | model = model_cls.from_pretrained(model_path, **model_kwargs)
2024-04-29 07:20:33 | ERROR | stderr | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/transformers/model.py", line 294, in from_pretrained
2024-04-29 07:20:33 | ERROR | stderr | invalidInputError(
2024-04-29 07:20:33 | ERROR | stderr | File "/usr/local/lib/python3.11/dist-packages/ipex_llm/utils/common/log4Error.py", line 32, in invalidInputError
2024-04-29 07:20:33 | ERROR | stderr | raise RuntimeError(errMsg)
2024-04-29 07:20:33 | ERROR | stderr | RuntimeError: Please use torch_dtype=torch.float16 when setting load_in_low_bit='fp16'.
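
For context, the constraint that fires here can be sketched as follows. This is only a hypothetical illustration of the check behind the RuntimeError above, not the actual ipex_llm source; the helper name validate_fp16_kwargs is made up for this example.

import torch

# Hypothetical sketch: from_pretrained only accepts load_in_low_bit="fp16"
# together with an explicit torch_dtype=torch.float16.
def validate_fp16_kwargs(load_in_low_bit, torch_dtype):
    if load_in_low_bit == "fp16" and torch_dtype != torch.float16:
        # Same message the library raises via invalidInputError / log4Error
        raise RuntimeError(
            "Please use torch_dtype=torch.float16 when "
            "setting load_in_low_bit='fp16'."
        )

# The worker passes torch_dtype="auto", so the check fails:
# validate_fp16_kwargs("fp16", "auto")  -> RuntimeError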

gc-fu commented Apr 29, 2024

Hi, I am working on reproducing this issue.

@brosenfi

Just to provide a bit more information, @gc-fu: here, torch_dtype is being provided as "auto", but the fp16 example with self-speculative decoding shows that torch_dtype should be set to torch.float16. There are also other parameters in that example which are not provided when launching via the ipex_llm_worker, specifically "speculative" and "optimize_model". This is why I filed this as a feature request rather than a bug; I assumed this mode simply isn't supported yet for the ipex_llm_worker module (it would be nice if it were, though).
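
For reference, this is roughly how the fp16 self-speculative decoding example loads a model on XPU, based on the parameters named above (load_in_low_bit, torch_dtype, optimize_model, speculative); the exact argument set in the official example may differ, and model_path is just a placeholder here.

import torch
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "neural-chat-7b-v3-3"  # placeholder: path or id of the model from the log above

# Sketch of the fp16 + self-speculative load path on XPU
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="fp16",
    torch_dtype=torch.float16,   # required together with load_in_low_bit="fp16"
    optimize_model=True,
    speculative=True,            # enables self-speculative decoding
)
model = model.to("xpu")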

gc-fu commented Apr 29, 2024

Hi, this issue actually contains two parts:

  1. A bug that is caused by using low_bit fp16 in ipex_llm_worker.
  2. Feature request: support ipex_llm_worker with speculative decoding.

The first part has been fixed by PR #10907.

The second part will be supported by @hzjane.

gc-fu removed their assignment Apr 29, 2024

hzjane commented Apr 29, 2024

@brosenfi
Self-speculative decoding with the FastChat worker will be supported in this PR.
However, the speculative decoding example only supports running on Intel Max GPUs due to memory usage limitations, so you can try it on a Max GPU or on CPU later.

gc-fu closed this as completed May 7, 2024
@brosenfi

brosenfi commented May 7, 2024

Thank you @gc-fu
