
What is the recommended GPU setup for fine-tuning? #23

Closed
fyang7 opened this issue Sep 9, 2023 · 8 comments

Comments

fyang7 commented Sep 9, 2023

I ran into an OOM error with the default setup on 8x A100 using the train.sh script. Could you please share the GPU requirements for fine-tuning?

jquesnelle (Owner) commented

We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)


fyang7 commented Sep 9, 2023

Thanks a lot. To confirm, is that A100 40 GB or 80 GB for the 7b 64k fine-tuning?

bloc97 (Collaborator) commented Sep 9, 2023

It is 8x80GB for 64k context size

sadransh commented Sep 17, 2023

Could you please clarify whether this discussion is about full-parameter tuning or LoRA-based fine-tuning? @bloc97

YL-9 commented Apr 24, 2024

> We were able to train the 7b 64k model on an 8x A100 node -- all other models unfortunately require a multinode setup. We used 64 GPUs, but I expect 16 would suffice for all other models (7b 128k, 13b 64k, 13b 128k)

I ran finetune.py on 2x A100 GPUs; both GPUs loaded up to about 14 GB of their 80 GB. After processing the first batch, memory usage rose to 77 GB/80 GB, and it then ran OOM at the start of the second batch.
Is this situation normal?
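
For context, a jump from ~14 GB at model load to ~77 GB after the first step is commonly gradients and optimizer state being allocated on the first backward/optimizer step, on top of activations for the long sequence. To confirm where the growth happens, here is a minimal diagnostic sketch (not from this repo; `log_step_memory` is a hypothetical helper you would call once per training step):

```python
import torch

def log_step_memory(step: int, device: int = 0) -> None:
    """Print current and peak GPU memory (GiB) for one training step."""
    gib = 1024 ** 3
    allocated = torch.cuda.memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    print(f"step {step}: allocated={allocated:.1f} reserved={reserved:.1f} peak={peak:.1f} GiB")
    # Reset the peak counter so each step reports its own high-water mark.
    torch.cuda.reset_peak_memory_stats(device)
```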

YL-9 commented Apr 26, 2024

> It is 8x80GB for 64k context size

Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

bloc97 (Collaborator) commented Apr 29, 2024

> It is 8x80GB for 64k context size
>
> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97

Yes, and if you enable more modern attention partitioning schemes like RingAttention you can train on even longer contexts.
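
For reference, the "total batch size of 64" above is simply the product of the three settings; a quick sanity check in Python (variable names mirror the train.sh arguments quoted above):

```python
# Effective (global) batch size = per-device batch size
#   × number of data-parallel processes
#   × gradient accumulation steps.
batch_size = 1
num_processes = 8
gradient_accumulate_every = 8

effective_batch_size = batch_size * num_processes * gradient_accumulate_every
print(effective_batch_size)  # 64
```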

YL-9 commented May 9, 2024

> It is 8x80GB for 64k context size
>
> Can this configuration train with a total batch size of 64 (batch_size=1, num_processes=8, gradient_accumulate_every=8)? @bloc97
>
> Yes, and if you enable more modern attention partitioning schemes like RingAttention you can train on even longer contexts.

OK, thank you!
But did you use --deepspeed, or other methods to reduce GPU memory usage?
I used the default settings of train.sh on 4x A100 GPUs, with batch_size=1, num_processes=4, and gradient_accumulate_every=8, and this setup results in an OOM.
Could you provide the detailed configuration? Thank you so much. @bloc97
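
For anyone hitting the same OOM: one common way to reduce per-GPU memory at long context lengths is DeepSpeed ZeRO stage 3 with CPU offload. The sketch below is not the authors' actual configuration; it assumes a Hugging Face accelerate setup, and the specific values are illustrative only:

```python
# Hypothetical memory-saving setup via accelerate + DeepSpeed ZeRO-3 (not this repo's config).
from accelerate import Accelerator, DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,                    # shard parameters, gradients, and optimizer states across GPUs
    gradient_accumulation_steps=8,   # matches gradient_accumulate_every above
    offload_optimizer_device="cpu",  # move optimizer states to host RAM
    offload_param_device="cpu",      # optionally offload parameters too (slower, but lower VRAM)
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)  # placeholders
```

Gradient checkpointing and FlashAttention are the other usual levers if ZeRO alone is not enough.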
