Qwen1.5-7B wrong outputs with 1024 prompts #10354
Comments
@ivy-lv11 pls take a look at this |
transformers: 4.38.1/4.37.0 |
If BF16 output is wrong, you can verify stock pytorch first (without BigDL). |
When using the 2048 prompt (truncated to 1024 tokens) with the original transformers/PyTorch and low_bit removed, the output looks normal.
prompt: 红楼梦 (Dream of the Red Chamber)
|
what is the torch version? torch==2.2.0? |
Yes. |
Removing load_in_low_bit and optimize_model runs FP32. If FP32 gives normal outputs, the issue is likely related to INT4, which can be compared against llama.cpp etc.; BF16 can be compared against native PyTorch BF16 support. |
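To make that comparison concrete, here is a minimal sketch, assuming the bigdl-llm transformers-style wrapper and the load_in_low_bit / optimize_model keyword arguments referenced in this thread; the model path is a placeholder.

```python
import torch
from bigdl.llm.transformers import AutoModelForCausalLM  # assumed bigdl-llm wrapper API

model_path = "Qwen/Qwen1.5-7B-Chat"  # placeholder

# Run under test: INT4 weights plus the optimized forward paths.
model_int4 = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_low_bit="sym_int4", optimize_model=True
)

# Baseline: dropping load_in_low_bit and optimize_model loads plain FP32 weights,
# so a normal output here points the suspicion at the INT4 / optimization path.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
```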
Use transformers and bf16 for the comparison, for example:
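A minimal sketch of a plain Hugging Face BF16 load and generation (no BigDL involved; not the original snippet, and the model path and generation length are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Qwen/Qwen1.5-7B-Chat"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

# Generate with the same prompt used in the report to compare against the low-bit runs.
inputs = tokenizer("红楼梦", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```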
|
After disabling the override of the qwen2 attention forward in convert.py (Qwen1.5 uses the qwen2 model type), a normal answer can be generated on SPR: 两旁是一副对联:\n假作真时真亦假,无为有处有还无。\n二人进了里面,见是一座楼阁,楼内挂着“薄命司”的牌子。士隐抬头一看,见里面挂着许多签,签上写着名字,旁边注着诗句和判词。他见签上有个“甄英莲”的名字,就抽出来看,上面写着:\n娇嫩花朵偏遭风雨,聪明女儿薄命终身。\n原是仙家遗种,却落在草莽人家。生于富贵,却死于贫贱。这是她的命,无可奈何。士隐看了,叹了一口气,把签放下。又见一个签上写着“贾 (output truncated). Need to check what is wrong in qwen2_attention_forward_origin. |
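As a debugging aid for the step described above, a hypothetical helper (not part of BigDL-LLM) that reports which forward each Qwen2 attention module is actually using, so one can confirm whether an override or the stock transformers forward is active:

```python
from transformers.models.qwen2.modeling_qwen2 import Qwen2Attention, Qwen2SdpaAttention

def report_attention_forwards(model):
    """Print the forward bound to every Qwen2 attention module in the model."""
    for name, module in model.named_modules():
        if isinstance(module, (Qwen2Attention, Qwen2SdpaAttention)):
            fwd = module.forward
            # A patched module usually reports the patching package here, while an
            # unpatched one reports transformers.models.qwen2.modeling_qwen2.
            print(name, type(module).__name__,
                  getattr(fwd, "__module__", "?"), getattr(fwd, "__qualname__", "?"))
```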
Tested BigDL-LLM 2.5.0b20240311. Environment:
On Arc the output looks normal.
However, when running on CPU the output still looks abnormal.
|
It is found that the CPU uses a different attention module from the GPU:
Qwen2ForCausalLM(
(model): Qwen2Model(
(embed_tokens): Embedding(151936, 4096)
(layers): ModuleList(
(0-31): 32 x Qwen2DecoderLayer(
(self_attn): Qwen2SdpaAttention(
(q_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
(k_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
(v_proj): LowBitLinear(in_features=4096, out_features=4096, bias=True)
(o_proj): LowBitLinear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): Qwen2RotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
(up_proj): LowBitLinear(in_features=4096, out_features=11008, bias=False)
(down_proj): LowBitLinear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm()
(post_attention_layernorm): Qwen2RMSNorm()
)
)
(norm): Qwen2RMSNorm()
)
(lm_head): LowBitLinear(in_features=4096, out_features=151936, bias=False)
) |
Model architecture:
GPU: uses Qwen2Attention
CPU: uses Qwen2SdpaAttention
|
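For reference, a sketch (assuming the standard attn_implementation argument available in transformers >= 4.36; the model path is a placeholder) of how to check which attention class transformers instantiates and force the eager Qwen2Attention path when comparing runs:

```python
import torch
from transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-7B-Chat"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # request Qwen2Attention instead of Qwen2SdpaAttention
)
# Inspect which attention module was actually built for the first decoder layer.
print(type(model.model.layers[0].self_attn).__name__)
```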
code: all-in-one benchmark, where prompt/2048.txt is replaced with the Chinese prompts below
in-out pair: 1024-128 (the 2048-token prompts are truncated to 1024 tokens)
model: Qwen1.5-7B-Chat
machine: SPR01
prompt: 红楼梦 (Dream of the Red Chamber)
INT4/INT8/BF16 all repeat like:
prompt: 患者 (patient)
INT4 repeats like below, while BF16 and INT8 give no answer:
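To illustrate the "1024-128" in-out pair above, a minimal sketch (model id and prompt-file path are placeholders) of truncating the 2048-token prompt to its first 1024 tokens before 128 new tokens are generated:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B-Chat")  # placeholder model id

# Read the long benchmark prompt and keep only its first 1024 tokens.
with open("prompt/2048.txt", encoding="utf-8") as f:
    text = f.read()
input_ids = tokenizer(text, return_tensors="pt").input_ids[:, :1024]

# Generation then runs on these truncated inputs with max_new_tokens=128.
```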