CUDA error: too many resources requested for launch (V100, qwen2-vl) #1867
Comments
+1
You can save memory by reducing SIZE_FACTOR=8 and MAX_PIXELS=602112.
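Lowering MAX_PIXELS caps how large each image gets before it is split into vision patches, which directly bounds the vision tokens (and memory) per image. A minimal sketch of what such a cap does, assuming Qwen2-VL-style preprocessing where each side is rounded to a multiple of SIZE_FACTOR × the 14-px patch size; the function name and exact rounding are illustrative, not the library's implementation:

```python
import math

def cap_pixels(height: int, width: int,
               size_factor: int = 8, max_pixels: int = 602112) -> tuple[int, int]:
    """Shrink (height, width) so that height * width <= max_pixels,
    keeping the aspect ratio and rounding each side down to a
    multiple of size_factor * 14 (the ViT patch size)."""
    factor = size_factor * 14  # 8 * 14 = 112 with the settings from this thread
    if height * width > max_pixels:
        scale = math.sqrt(max_pixels / (height * width))
        height = int(height * scale)
        width = int(width * scale)
    # round down to the nearest multiple of `factor`, but keep at least one patch
    height = max(factor, height // factor * factor)
    width = max(factor, width // factor * factor)
    return height, width
```

For example, a 3000×4000 photo would be resized to 672×896 (exactly 602112 pixels), while a 448×448 image already under the cap passes through unchanged.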
After setting SIZE_FACTOR=8 and MAX_PIXELS=602112, the problem still occurs:
Yes, I hit the same problem: same setup, V100 + LoRA. Judging by GPU usage it is not an out-of-memory issue; the job only uses about 18 GB.
I reproduced the same error on a V100 running inference with both the 7B and 2B models. Interestingly, the error has nothing to do with GPU memory; it appears to be caused by the bfloat16 data type. As I recall, the V100 does not support the bf16 format. Inference works normally with --dtype fp32.
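The observation above checks out: bf16 matrix kernels require Ampere (compute capability 8.0) or newer, while the V100 is sm_70. A minimal sketch of choosing a safe dtype up front (the helper name is illustrative; on a real machine you would query `torch.cuda.get_device_capability()` or call `torch.cuda.is_bf16_supported()` and pass `--dtype` accordingly):

```python
def pick_dtype(compute_capability: tuple[int, int]) -> str:
    """Pick a training dtype for a GPU of the given compute capability.

    bf16 needs Ampere (sm_80) or newer; on older cards fall back to fp16,
    since fp32 roughly doubles memory use (as noted later in this thread).
    """
    major, _ = compute_capability
    return "bf16" if major >= 8 else "fp16"

# V100 is sm_70 -> "fp16"; A100 is sm_80 -> "bf16"
```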
Add
--dtype fp32
For fine-tuning, --dtype fp32 will run out of GPU memory... The previous qwen_vl could be fine-tuned with bf16.
OOM occurred even with qwen2-vl-2b using SIZE_FACTOR=8 MAX_PIXELS=602112 --dtype fp32.
CUDA_VISIBLE_DEVICES=0,1,2,3 NPROC_PER_NODE=4 swift sft --sft_type 'full' --dtype 'fp16' --use_liger 'True' --model_id_or_path '/share_data/PRDATA/Qwen2-VL-2B-Instruct/' --template_type 'qwen2-vl' --system 'You are a helpful assistant.' --dataset coco-en-mini --learning_rate '1e-05' --gradient_accumulation_steps '16' --eval_steps '500' --save_steps '500' --eval_batch_size '1' --model_type 'qwen2-vl-2b-instruct' --deepspeed default-zero3 --add_output_dir_suffix False --output_dir /root/output/qwen2-vl-2b-instruct/v16

Full SFT with fp16 works on my 4 V100s. For LoRA I got some strange errors and am waiting for a fix.
On the V100 you don't need to set SIZE_FACTOR or MAX_PIXELS; just setting --dtype fp16 is enough to train.
Can you share the error stack? |
RuntimeError: CUDA error: too many resources requested for launch
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The error above occurred while LoRA fine-tuning qwen2-vl-2b-instruct on a V100.