
Loading Qwen-14B-Chat in 8-bit fails with RuntimeError: value cannot be converted to type at::Half without overflow #1475

Closed
HelWireless opened this issue Nov 12, 2023 · 10 comments
Labels
good first issue Good for newcomers solved This problem has been already solved.

Comments

@HelWireless

Loading Qwen-7B-Chat and Baichuan-13B-Chat in 8-bit both work fine, but loading Qwen-14B-Chat currently fails. The error is:
RuntimeError: value cannot be converted to type at::Half without overflow


Environment: an RTX 4090 GPU under WSL2 on Windows 11. The training command:

CUDA_VISIBLE_DEVICES=0 python src/train_bash.py \
    --stage sft \
    --model_name_or_path baichuan_model/Qwen-14B-Chat \
    --do_train True \
    --finetuning_type lora \
    --quantization_bit 8 \
    --template qwen \
    --flash_attn False \
    --shift_attn False \
    --dataset_dir data \
    --dataset abc_train_data_v2 \
    --cutoff_len 1024 \
    --learning_rate 5e-05 \
    --num_train_epochs 5.0 \
    --max_samples 100000 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 32 \
    --lr_scheduler_type cosine \
    --max_grad_norm 1.0 \
    --logging_steps 5 \
    --save_steps 100 \
    --warmup_steps 1 \
    --neft_alpha 0 \
    --train_on_prompt False \
    --upcast_layernorm True \
    --lora_rank 12 \
    --lora_dropout 0.1 \
    --lora_target c_attn \
    --resume_lora_training True \
    --output_dir saves/Qwen-14B-Chat/lora/2023-11-12-20-00-00 \
    --fp16 True \
    --plot_loss True 
@hiyouga hiyouga added the pending This problem is yet to be addressed. label Nov 12, 2023
@hiyouga
Owner

hiyouga commented Nov 12, 2023

Try bf16.

@HelWireless
Author

Try bf16.

I tried bf16 as well, along with quite a few other parameter changes, but nothing worked. From my own debugging, it may be a torch version issue, i.e. a data-type problem. I will try switching the torch version next to see whether that fixes it.

@wrl1224

wrl1224 commented Nov 13, 2023

I ran into the same problem, and trying bf16 did not help. I hope the project can support fine-tuning quantized models.

@hiyouga hiyouga added wontfix This will not be worked on and removed pending This problem is yet to be addressed. labels Dec 1, 2023
@hiyouga hiyouga closed this as not planned Won't fix, can't repro, duplicate, stale Dec 1, 2023
@Chen-mingxuan

I hit this too: loading Qwen-7B-Chat in 8-bit gives the same error.

@amulil

amulil commented Dec 5, 2023

@Chen-mingxuan @wrl1224 @HelWireless Try pip install bitsandbytes==0.41.1; after upgrading, the problem went away for me.

@hiyouga hiyouga added good first issue Good for newcomers solved This problem has been already solved. and removed wontfix This will not be worked on labels Dec 5, 2023
@hiyouga hiyouga closed this as completed Dec 5, 2023
@Chen-mingxuan

I tried pip install bitsandbytes==0.41.1, but it did not help. You can instead load Qwen-7B-Chat-Int8 directly for LoRA training, which is a workable substitute.

@Chen-mingxuan

https://huggingface.co/Qwen/Qwen-7B-Chat-Int8
https://huggingface.co/Qwen/Qwen-7B-Chat-Int4

@amulil

amulil commented Dec 6, 2023

Take a look at https://huggingface.co/Qwen/Qwen-7B-Chat/discussions/10; I am not sure whether it helps.

@Chen-mingxuan

@Chen-mingxuan @wrl1224 @HelWireless Try pip install bitsandbytes==0.41.1; after upgrading, the problem went away for me.

I modified line 572 of modeling_qwen.py, changing
attention_mask.masked_fill(~causal_mask, torch.finfo(query.dtype).min)
to
attention_mask.masked_fill(~causal_mask, -1e4)
and training now works. torch.finfo(query.dtype).min is the smallest representable value of the selected dtype. With 8-bit or 4-bit QLoRA this value can be too negative for the half-precision mask, which causes the overflow error; a milder value such as -1e4 avoids it. With -1e10 the same error is still raised.
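The failure and the patch can be reproduced outside the model. Below is a minimal sketch (not the actual modeling_qwen.py code), assuming the attention mask is float16 while query.dtype resolves to float32 under quantized loading:

```python
import torch

# A half-precision attention mask, as used when the model runs in fp16.
attention_mask = torch.zeros(1, 4, dtype=torch.float16)
causal_mask = torch.tensor([[True, True, False, False]])

try:
    # If query.dtype is float32, finfo().min is about -3.4e38, which
    # cannot be represented in float16: this is the reported RuntimeError.
    attention_mask.masked_fill(~causal_mask, torch.finfo(torch.float32).min)
except RuntimeError as err:
    print(err)

# The patched line: -1e4 fits in float16 and is still effectively
# "minus infinity" after softmax, so masked positions get near-zero weight.
patched = attention_mask.masked_fill(~causal_mask, -1e4)
print(patched)
```

Whether the original line overflows depends on which dtype query ends up with under 8-bit loading; on setups where query.dtype is already float16, torch.finfo(...).min is -65504.0 and fits.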

@Chen-mingxuan

After further testing, the smallest value that works is -65504.0; -65505.0 already raises the error. This corresponds to the minimum (most negative finite) value of half-precision float16.
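That boundary matches IEEE-754 half precision, which can be checked without PyTorch using Python's struct module and its half-precision 'e' format. A small sketch:

```python
import struct

# -65504.0 is the most negative finite float16 value: it survives a
# round-trip through the half-precision ('e') format unchanged.
packed = struct.pack('<e', -65504.0)
print(struct.unpack('<e', packed)[0])  # -65504.0

# A value far outside the float16 range, like the -1e10 tried above,
# cannot be packed at all.
try:
    struct.pack('<e', -1e10)
except OverflowError as err:
    print(err)
```

Note that struct rounds borderline values such as -65505.0 to -65504.0 rather than raising, whereas PyTorch's scalar conversion rejects any value outside the exact float16 range, which is why -65505.0 already fails there.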
