Error during merge/validation after LoRA fine-tuning that also trained some layers' parameters #2928

ac-automata · 2024-03-22T07:02:22Z

Reminder

I have read the README and searched the existing issues.

Reproduction

config

{
  "alpha_pattern": {},
  "auto_mapping": null,
  "base_model_name_or_path": "/root/data/Mixtral-HQQ/MixTAO-7Bx2-MoE-v8.1",
  "bias": "none",
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 16,
  "lora_dropout": 0.1,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": [
    "input_layernorm",
    "norm",
    "gate_proj"
  ],
  "peft_type": "LORA",
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "v_proj"
  ],
  "task_type": "CAUSAL_LM",
  "use_dora": false,
  "use_rslora": true
}
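For readers skimming the config: `target_modules` lists the layers that receive LoRA adapters, while `modules_to_save` lists layers that are trained in full and stored alongside the adapter. The sketch below parses a subset of the config above and reports both groups plus the LoRA scaling factor. The helper names `lora_scaling` and `summarize` are hypothetical and not part of PEFT; the scaling rule (`lora_alpha / sqrt(r)` when `use_rslora` is true, otherwise `lora_alpha / r`) follows the rank-stabilized LoRA definition as I understand it from the PEFT documentation.

```python
import json
import math

# Subset of the adapter config shown above (only the fields used here).
ADAPTER_CONFIG = """
{
  "lora_alpha": 16,
  "r": 16,
  "use_rslora": true,
  "target_modules": ["q_proj", "v_proj"],
  "modules_to_save": ["input_layernorm", "norm", "gate_proj"]
}
"""


def lora_scaling(cfg):
    """Return the LoRA scaling factor implied by the config (hypothetical helper)."""
    r, alpha = cfg["r"], cfg["lora_alpha"]
    if cfg.get("use_rslora"):
        # Rank-stabilized LoRA: alpha / sqrt(r)
        return alpha / math.sqrt(r)
    # Standard LoRA: alpha / r
    return alpha / r


def summarize(cfg):
    """Split the config into LoRA-adapted and fully trained module names."""
    return {
        "lora_modules": sorted(cfg.get("target_modules") or []),
        "fully_trained_modules": sorted(cfg.get("modules_to_save") or []),
        "scaling": lora_scaling(cfg),
    }


cfg = json.loads(ADAPTER_CONFIG)
summary = summarize(cfg)
print(summary)
```

The fully trained modules are the ones worth checking first when a dtype error appears, since they are saved and reloaded as ordinary weights rather than as low-rank adapters.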

Error

Traceback (most recent call last):
File "/opt/anaconda3/envs/train/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/opt/anaconda3/envs/train/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/root/data/Mixtral-HQQ/LLaMA-Factory/src/llmtuner/train/tuner.py", line 32, in run_exp
run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
File "/root/data/Mixtral-HQQ/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 91, in run_sft
predict_results = trainer.predict(dataset, metric_key_prefix="predict", **gen_kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 230, in predict
return super().predict(test_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 3309, in predict
output = eval_loop(
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer.py", line 3422, in evaluation_loop
loss, logits, labels = self.prediction_step(model, inputs, prediction_loss_only, ignore_keys=ignore_keys)
File "/root/data/Mixtral-HQQ/LLaMA-Factory/src/llmtuner/train/sft/trainer.py", line 47, in prediction_step
loss, generated_tokens, _ = super().prediction_step( # ignore the returned labels (may be truncated)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/trainer_seq2seq.py", line 296, in prediction_step
generated_tokens = self.model.generate(**generation_inputs, **gen_kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/peft/peft_model.py", line 1148, in generate
outputs = self.base_model.generate(*args, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/generation/utils.py", line 1592, in generate
return self.sample(
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/generation/utils.py", line 2696, in sample
outputs = self(
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/transformers/models/mixtral/modeling_mixtral.py", line 1374, in forward
logits = self.lm_head(hidden_states)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/accelerate/hooks.py", line 166, in new_forward
output = module._old_forward(*args, **kwargs)
File "/opt/anaconda3/envs/train/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: expected scalar type Float but found BFloat16

Expected behavior

No response

System Info

No response

Others

No response

codemayq · 2024-03-22T07:09:09Z

Please check that the precision used for training matches the precision used for merging (both bf16 or both fp16); a mismatch can cause errors.
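One way to act on this advice is to list which parameters are stored in which dtype before calling `predict`. The traceback ends in `F.linear` inside `lm_head`, which suggests the input and the weight there have different dtypes. The sketch below is a hypothetical helper, not part of LLaMA-Factory or PEFT; it assumes each parameter object exposes a `.dtype` attribute, as PyTorch parameters do, so it can be fed `model.named_parameters()`. The stand-in parameters and their dtypes are made up for illustration and do not describe the actual model in this issue.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_params_by_dtype(named_params):
    """Group parameter names by the string form of their dtype."""
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.dtype)].append(name)
    return dict(groups)


def has_mixed_dtypes(named_params):
    """Return True when the parameters use more than one dtype."""
    return len(group_params_by_dtype(named_params)) > 1


# Stand-in objects so the example runs without PyTorch; the names and
# dtypes below are illustrative assumptions, not taken from the real model.
fake_params = [
    ("model.layers.0.self_attn.q_proj.weight", SimpleNamespace(dtype="torch.bfloat16")),
    ("model.norm.weight", SimpleNamespace(dtype="torch.float32")),
    ("lm_head.weight", SimpleNamespace(dtype="torch.float32")),
]

groups = group_params_by_dtype(fake_params)
for dtype, names in groups.items():
    print(dtype, names)
print("mixed:", has_mixed_dtypes(fake_params))
```

With a real model, the call would be `group_params_by_dtype(model.named_parameters())`; if more than one dtype shows up, casting the whole model to a single dtype before inference is one possible remedy to try.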

ac-automata · 2024-03-23T10:12:13Z

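Quoting codemayq: "Please check that the precision used for training matches the precision used for merging (both bf16 or both fp16); a mismatch can cause errors."

Do I need to merge the weights manually? I only load the LoRA weights at prediction time, and bf16 is used throughout.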


codemayq added the "pending" label (This problem is yet to be addressed) on Mar 22, 2024

hiyouga added the "solved" label (This problem has been already solved) and removed the "pending" label (This problem is yet to be addressed) on Mar 23, 2024

hiyouga closed this as completed in 7afbc85 Mar 23, 2024
