LLM: Enable Speculative on Fastchat #10909
Conversation
also update https://github.com/intel-analytics/ipex-llm/blob/main/docs/readthedocs/source/doc/LLM/Quickstart/fastchat_quickstart.md to add speculative support
```bash
# Available low_bit formats include bf16 on CPU and fp16 on Max GPU.
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative
```
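For context, this worker normally runs as part of a standard FastChat deployment; here is a minimal sketch using stock FastChat commands (the controller and API-server invocations are standard FastChat usage, not part of this PR):

```bash
# Start the FastChat controller, register the ipex-llm speculative worker,
# then expose an OpenAI-compatible API (standard FastChat topology).
python3 -m fastchat.serve.controller &
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "bf16" --trust-remote-code --device "cpu" --speculative &
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```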
also add gpu example
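For reference, a GPU variant might look like the sketch below; it assumes the worker accepts --device "xpu" and that fp16 is the low-bit format to use on a Max GPU, per the comment in the CPU example:

```bash
# Sketch of a GPU example (assumes --device "xpu" and fp16 low-bit on a Max GPU).
python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path lmsys/vicuna-7b-v1.5 --low-bit "fp16" --trust-remote-code --device "xpu" --speculative
```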
The code changes LGTM
Description
Enable Speculative on Fastchat
In the self-speculative case, Llama-7b uses nearly 16.5 GB of memory (fp16 target model plus int4 draft), so it can only run on a Max GPU.
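As a rough sanity check (my arithmetic, not from the PR): Llama-7b has about 6.7B parameters, so the fp16 target weights take roughly 6.7B × 2 bytes ≈ 13.4 GB and the int4 draft roughly 6.7B × 0.5 bytes ≈ 3.4 GB, about 16.8 GB in total, consistent with the ~16.5 GB figure above.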
1. Why the change?
2. User API changes
3. Summary of the change
4. How to test?
5. New dependencies