Loading Qwen-14B-Chat in 8-bit raises RuntimeError: value cannot be converted to type at::Half without overflow #1475
Comments
Try bf16.
I tried bf16 as well, and experimented with quite a few parameter changes, but nothing worked. After some debugging of my own, it may be a data type issue caused by the torch version. I will try switching the torch version later to see whether that resolves it.
I ran into the same problem. Trying bf16 did not help. I hope the project can support fine-tuning of quantized models.
I hit it too: loading Qwen-7B-Chat in 8-bit gives the same error.
@Chen-mingxuan @wrl1224 @HelWireless Give this a try:
I tried pip install bitsandbytes==0.41.1 but it did not help. You can instead load Qwen-7B-Chat-Int8 directly for LoRA training, which works as a substitute.
https://huggingface.co/Qwen/Qwen-7B-Chat-Int8 |
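The workaround above (fine-tuning the pre-quantized Int8 checkpoint instead of 8-bit-loading the fp16 weights) could look roughly like the following sketch. This is a hypothetical illustration, not a command from this issue: it assumes `transformers` and `peft` are installed, and the LoRA hyperparameters (`r`, `lora_alpha`, `lora_dropout`) are arbitrary example values. `c_attn` is the fused QKV projection module name in Qwen's modeling code.

```python
def load_qwen_int8_for_lora():
    """Hypothetical sketch: LoRA fine-tuning on the pre-quantized
    Qwen-7B-Chat-Int8 checkpoint, as suggested in the thread.
    Assumes transformers and peft are installed; hyperparameters
    are illustrative, not taken from the issue."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    model_id = "Qwen/Qwen-7B-Chat-Int8"
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, device_map="auto", trust_remote_code=True
    )
    # "c_attn" is Qwen's fused query/key/value projection; LoRA adapters
    # are attached there while the Int8 base weights stay frozen.
    lora = LoraConfig(
        r=8, lora_alpha=32, lora_dropout=0.05,
        target_modules=["c_attn"], task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora), tokenizer
```

The key point of the workaround is that no `load_in_8bit` conversion happens at load time, so the fp16 overflow path reported in this issue is never exercised.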
Take a look at https://huggingface.co/Qwen/Qwen-7B-Chat/discussions/10; not sure whether it helps.
I modified line 572 of modeling_qwen.py in the model source: attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
Testing shows that -65504.0 is the smallest value that works; going any lower, -65505.0, already raises the error. That corresponds to the minimum of the half-precision float16 format.
Loading Qwen-7B-Chat and Baichuan-13B-Chat in 8-bit both work normally; loading Qwen-14B-Chat currently fails. The error is:
RuntimeError: value cannot be converted to type at::Half without overflow
Hardware is a 4090 GPU, running under WSL2 on Windows 11. The training command is: