2nd latency of llama3-8B-Instruct with int4 & all-in-one tool issue #10926

Open
Fred-cell opened this issue May 5, 2024 · 1 comment
Fred-cell commented May 5, 2024

The 2nd token latency of llama3-8B-Instruct with int4 and bs=1 is larger than with bs=2 (ipex-llm=2.5.0b20240504).

[Screenshots: all-in-one benchmark results for bs=1 and bs=2]

lalalapotter (Contributor) commented

We have already reproduced the issue and will fix it later. In the meantime, we recommend using fp16 for the non-linear layers: please refer to the all-in-one benchmark scripts and select the transformer_int4_fp16_gpu API.
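
For reference, a minimal sketch of the int4-weights-plus-fp16 loading path outside the benchmark harness (assuming an Intel GPU environment with ipex-llm installed; the model path, prompt, and generation settings are placeholders, and exact arguments may differ between ipex-llm versions):

```python
# Sketch only (not the all-in-one script itself): load Llama-3-8B-Instruct with
# INT4 weights while keeping the non-linear layers in fp16 on an Intel GPU.
import torch
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder path

# load_in_4bit=True quantizes the linear layers to INT4
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    optimize_model=True,
    trust_remote_code=True,
    use_cache=True,
)
# .half() keeps the remaining (non-linear) layers in fp16, which is what the
# transformer_int4_fp16_gpu test_api in the all-in-one benchmark config selects
model = model.half().to("xpu")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
    output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

In the all-in-one benchmark itself, the equivalent change is selecting transformer_int4_fp16_gpu under test_api in the benchmark's config file instead of the plain int4 GPU API.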
