Abnormal GPU memory distribution when pretraining codeqwen1.5-7b; OOM after training for a while #3908
Comments
If other models do not show uneven GPU memory distribution under this framework, the model architecture may be the cause.
I ran into a similar problem: LoRA fine-tuning Qwen 14B on a single A400 80G, and it OOMs after a while. In theory, LoRA fine-tuning a 14B model should only need around 40G of GPU memory.
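As a rough back-of-envelope (a sketch under illustrative assumptions: frozen bf16 base weights, roughly 30M trainable LoRA parameters, AdamW optimizer states kept in fp32 for the adapters only), the static footprint of LoRA on a 14B model is indeed well under 40G, which suggests the growth over time comes from activations or fragmentation rather than the weights themselves:

```python
# Back-of-envelope memory estimate for LoRA fine-tuning a 14B model.
# Illustrative assumptions: bf16 base weights (frozen), ~30M trainable
# LoRA parameters (order of magnitude for low-rank adapters), AdamW
# optimizer states in fp32 for the adapter parameters only.
base_params = 14e9
base_weights_gib = base_params * 2 / 2**30  # bf16 -> ~26 GiB

lora_params = 30e6
# adapter weights (bf16) + grads (bf16) + AdamW m and v (fp32 each)
lora_overhead_gib = lora_params * (2 + 2 + 4 + 4) / 2**30  # < 0.4 GiB

print(f"frozen base weights: {base_weights_gib:.1f} GiB")
print(f"LoRA adapter overhead: {lora_overhead_gib:.2f} GiB")
# The remainder is activation memory, which scales with batch size and
# sequence length and can push usage well past the ~40 GiB rule of thumb.
```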
It looks like the Qwen family of models is the hardest hit.
I ran into this problem too: Mistral-7b-instruct-v0.2 OOMs after training for a while on 4×4090, SFT with LoRA.
You can try turning off do_eval.
My do_eval is at its default value, False.
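For reference, a minimal check (assuming a Hugging Face transformers version where do_eval lives on TrainingArguments, as LLaMA-Factory uses) confirms that with no evaluation strategy configured, do_eval really does stay False, so an eval pass is unlikely to be the source of the OOM:

```python
from transformers import TrainingArguments

# With no evaluation strategy given, TrainingArguments keeps do_eval at
# its default of False; it only flips to True when an evaluation
# strategy other than "no" is configured.
args = TrainingArguments(output_dir="out")  # "out" is a placeholder path
assert args.do_eval is False
```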
Reminder
Reproduction
The training framework is LLaMA-Factory-0.7.0.
Expected behavior
codeqwen1.5-7B uses abnormally large GPU memory during continued pretraining, and OOM occurs after training for a while.
System Info
When the OOM first occurred, I was using 2 nodes with 16 GPUs.
Others
I have run many training jobs before; under normal conditions, training a 7B model at this batch size and cutoff_len would not OOM. Also, nvidia-smi shows that the GPU memory allocation is very uneven.
It is not clear yet whether the cause is the training framework or the model architecture; I hope someone can shed light on this.
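To correlate the uneven allocation seen in nvidia-smi with training progress, a simple watcher like the sketch below (using only standard nvidia-smi query flags; the sampling interval is an arbitrary choice) can log per-GPU usage over time and show which rank's memory climbs toward OOM:

```python
import subprocess
import time

# Periodically dump per-GPU memory so the uneven distribution (and any
# steady climb toward OOM) can be matched against training steps.
while True:
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,memory.used,memory.total",
            "--format=csv,noheader",
        ],
        text=True,
    )
    print(out.strip(), flush=True)
    time.sleep(60)  # sampling interval is arbitrary; adjust as needed
```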