2nd latency of llama3-8B-Instruct with int4 & all-in-one tool issue #10926

Open
Fred-cell opened this issue May 5, 2024 · 1 comment
Fred-cell commented May 5, 2024

The 2nd token latency of llama3-8B-Instruct with int4 and bs=1 is larger than with bs=2 (ipex-llm=2.5.0b20240504).

[Screenshots: all-in-one benchmark results for bs=1 and bs=2]

lalalapotter (Contributor) commented

We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
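
For reference, a minimal sketch of the int4-weights-plus-fp16 loading path outside the benchmark harness (assuming an Intel GPU environment with ipex-llm installed; the model path, prompt, and generation settings are placeholders, and exact arguments may differ between ipex-llm versions):

```python
# Sketch only (not the all-in-one script itself): load Llama-3-8B-Instruct with
# INT4 weights while keeping the non-linear layers in fp16 on an Intel GPU.
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder path

# load_in_4bit=True quantizes the linear layers to INT4
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
)
# .half() keeps the remaining (non-linear) layers in fp16, which is what the
# transformer_int4_fp16_gpu test_api in the all-in-one benchmark config selects
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

In the all-in-one benchmark itself, the equivalent change is selecting transformer_int4_fp16_gpu under test_api in the benchmark's config file instead of the plain int4 GPU API.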
