LLM: Enable Speculative on Fastchat #10909
Conversation
also update https://github.com/intel-analytics/ipex-llm/blob/main/docs/readthedocs/source/doc/LLM/Quickstart/fastchat_quickstart.md to add speculative support
```bash
# Available low_bit formats include bf16 on CPU and fp16 on Max GPU.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative
```
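For context, this worker normally runs as part of a standard FastChat deployment; here is a minimal sketch using stock FastChat commands (the controller and API-server invocations are standard FastChat usage, not part of this PR):

```bash
# Start the FastChat controller, register the ipex-llm speculative worker,
# then expose an OpenAI-compatible API (standard FastChat topology).
python3 -m fastchat.serve.controller &
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative &
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```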
also add gpu example
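For reference, a GPU variant might look like the sketch below; it assumes the worker accepts --device "xpu" and that fp16 is the low-bit format to use on a Max GPU, per the comment in the CPU example:

```bash
# Sketch of a GPU example (assumes --device "xpu" and fp16 low-bit on a Max GPU).
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
```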
The code changes LGTM
Description
Enable Speculative on Fastchat
In the self-speculative case, Llama-7b uses nearly 16.5 GB of memory (fp16 target model plus int4 draft), so it can only run on a Max GPU.
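As a rough sanity check (my arithmetic, not from the PR): Llama-7b has about 6.7B parameters, so the fp16 target weights take roughly 6.7B × 2 bytes ≈ 13.4 GB and the int4 draft roughly 6.7B × 0.5 bytes ≈ 3.4 GB, about 16.8 GB in total, consistent with the ~16.5 GB figure above.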
1. Why the change?
2. User API changes
3. Summary of the change
4. How to test?
5. New dependencies