
What is the recommended GPU setup for fine-tuning? #23

Closed
fyang7 opened this issue Sep 9, 2023 · 8 comments

Comments

fyang7 commented Sep 9, 2023

I ran into an OOM error with the default setup on 8x A100 using the train.sh script. Could you please share the GPU requirements for fine-tuning?

jquesnelle (Owner) commented

We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)


fyang7 commented Sep 9, 2023

Thanks a lot. To confirm, is that A100 40 GB or 80 GB for the 7b 64k fine-tuning?

bloc97 (Collaborator) commented Sep 9, 2023

It is 8x80GB for 64k context size

sadransh commented Sep 17, 2023

Could you please clarify whether this discussion is about full-parameter tuning or LoRA-based fine-tuning? @bloc97

YL-9 commented Apr 24, 2024

> We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)

I ran finetune.py on 2x A100 GPUs; both GPUs loaded up to about 14 GB of their 80 GB. After processing the first batch, memory usage rose to 77 GB/80 GB, and it then ran OOM at the start of the second batch.
Is this situation normal?
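
For context, a jump from ~14 GB at model load to ~77 GB after the first step is commonly gradients and optimizer state being allocated on the first backward/optimizer step, on top of activations for the long sequence. To confirm where the growth happens, here is a minimal diagnostic sketch (not from this repo; `log_step_memory` is a hypothetical helper you would call once per training step):

```python
import torch

def log_step_memory(step: int, device: int = 0) -> None:
    """Print current and peak GPU memory (GiB) for one training step."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    print(f"step {step}: allocated={allocated:.1f} reserved={reserved:.1f} peak={peak:.1f} GiB")
    # Reset the peak counter so each step reports its own high-water mark.
    torch.cuda.reset_peak_memory_stats(device)
```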

YL-9 commented Apr 26, 2024

> It is 8x80GB for 64k context size

Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

bloc97 (Collaborator) commented Apr 29, 2024

> It is 8x80GB for 64k context size
>
> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

Yes, and if you enable more modern attention partitioning schemes like RingAttention you can train on even longer contexts.
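
For reference, the "total batch size of 64" above is simply the product of the three settings; a quick sanity check in Python (variable names mirror the train.sh arguments quoted above):

```python
# Effective (global) batch size = per-device batch size
#   × number of data-parallel processes
#   × gradient accumulation steps.
batch_size = 1
num_processes = 8
gradient_accumulate_every = 8

effective_batch_size = batch_size * num_processes * gradient_accumulate_every
print(effective_batch_size)  # 64
```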

YL-9 commented May 9, 2024

> It is 8x80GB for 64k context size
>
> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97
>
> Yes, and if you enable more modern attention partitioning schemes like RingAttention you can train on even longer contexts.

OK, thank you!
But did you use --deepspeed, or other methods to reduce GPU memory usage?
I used the default settings of train.sh on 4x A100 GPUs, with batch_size=1, num_processes=4, and gradient_accumulate_every=8, and this setup results in an OOM.
Could you provide the detailed configuration? Thank you so much. @bloc97
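
For anyone hitting the same OOM: one common way to reduce per-GPU memory at long context lengths is DeepSpeed ZeRO stage 3 with CPU offload. The sketch below is not the authors' actual configuration; it assumes a Hugging Face accelerate setup, and the specific values are illustrative only:

```python
# Hypothetical memory-saving setup via accelerate + DeepSpeed ZeRO-3 (not this repo's config).
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                    # shard parameters, gradients, and optimizer states across GPUs
    gradient_accumulation_steps=8,   # matches gradient_accumulate_every above
    offload_optimizer_device="cpu",  # move optimizer states to host RAM
    offload_param_device="cpu",      # optionally offload parameters too (slower, but lower VRAM)
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # placeholders
```

Gradient checkpointing and FlashAttention are the other usual levers if ZeRO alone is not enough.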
