[Question] OOM on finetuning vicuna-7b llava model on 4*A800 80G, anything wrong with my cfg? #394
Question

Thanks for the great work~

Also, it looks like the A800 cannot enable flash-attn. (error screenshot below)

Comments
It seems that A800 is supported in flash-attn. Please try this script with flash attention and deepspeed.
Yeah, I tried. Error message is the same~
Can you provide the versions of your flash-attn, transformers, accelerate, and PyTorch? Also, is flash attention compiled with the same CUDA version as PyTorch?
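For reference, a minimal sketch of how to gather the versions being asked about (assuming the packages are installed in the current environment; this is an illustrative snippet, not from the repo):

```python
# Print the versions asked about above, plus the CUDA version PyTorch was built
# against, to compare with the CUDA toolkit used when compiling flash-attn.
import importlib.metadata as md
import torch

for pkg in ("flash-attn", "transformers", "accelerate", "deepspeed"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")

print("torch", torch.__version__)
print("torch built with CUDA", torch.version.cuda)
```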
I followed the instructions from the project page. Also, can you please tell me whether flash-attn and deepspeed are a MUST to finetune on 4x A800 80G?
Can you try the versions I'm using? I find that all my current versions are 2.0.4. Also, flash attention is not necessary. You may do gradient accumulation with bs4xaccu4, which can make 7B fit in 8x A100s (maybe 4x as well). But flash-attention brings at least a 2x speedup in my experiments. So spending some time to make flash-attn work should be worth it.
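To make the bs4xaccu4 suggestion concrete, here is the batch-size arithmetic (a sketch; the 4-GPU count matches this issue, the other numbers are the ones suggested above):

```python
# Effective global batch size = per-device batch * accumulation steps * number of GPUs.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
num_gpus = 4  # 4x A800 80G in this issue (8x A100 in the suggestion above)

global_batch = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(global_batch)  # 64; bs1xaccu16 on 4 GPUs gives the same 64 with less activation memory per step
```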
Got it~ Thanks. I don't have the box now. Will report the result back.
BTW, even when I use bs1xaccu16, it was still OOM... so it looks like flash-attn is a MUST.
I tried... no luck. Still the same error:
I don't provide a "--deepspeed /path/to/deepspeed.json" in the run script. That should be OK, right? And the pretrain ckpt comes from a non-flash-attn train.py instead of a flash-attn train_mem.py. Should I re-pretrain everything w/ flash-attn enabled from scratch?
That is not okay. Please use zero3.json or zero2.
That (re-pretraining) is not needed. It is a linear layer, so it will be fine.
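For anyone hitting the same thing, here is a rough sketch of what a ZeRO-3 config can look like when used with the HF Trainer (illustrative values only, not the repo's actual scripts/zero3.json; "auto" lets the Trainer fill in values from its own arguments):

```python
# Illustrative ZeRO-3 config (assumed values); it can be passed as a dict via
# TrainingArguments(deepspeed=...) or saved as JSON and passed on the command
# line via --deepspeed zero3.json.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```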
Wow! deepspeed w/ zero3.json works great~ Thanks for all the quick responses and your amazing work.
3699 iters is 3 epochs already. So you do not need to multiply by 3. ~6 hours is expected.
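A back-of-the-envelope check of that step count (the dataset size and global batch size below are assumptions in the ballpark of the LLaVA instruction-tuning setup, not numbers from this thread):

```python
import math

# Assumed numbers: ~158K instruction-tuning samples, global batch size 128.
samples, global_batch, epochs = 158_000, 128, 3
steps = math.ceil(samples / global_batch) * epochs
print(steps)  # ~3705, roughly the 3699 iters mentioned above
```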
Also, I tried LLaVA, blip2-flant5-xl/xxl, and instructblip-vicuna7b, and found that LLaVA works best for the photos taken by my iPhone. Is it fair to take this conclusion away?
Also, what do you think of ResNet-50/101 as an image encoder? Will it perform similarly to ViT?
@ldfandian Hi, I ran into the issue of the loss not converging well. Can you post your full training log for me?
@ldfandian Btw, are you using gradient checkpointing under deepspeed zero3? In my env here, it seems zero3 conflicts with checkpointing, but zero2 does not.
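For context, a minimal sketch of how gradient checkpointing is typically turned on alongside DeepSpeed via the HF Trainer (the output path and config file names are placeholders; whether ZeRO-3 coexists with checkpointing may depend on library versions, as noted above):

```python
from transformers import TrainingArguments

# Sketch only: output_dir and the deepspeed config path are placeholders.
args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    bf16=True,
    gradient_checkpointing=True,       # the setting being discussed above
    deepspeed="./scripts/zero2.json",  # swap in zero3.json to test the reported conflict
)
```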