Unable to train "bigcode/starcoder" model on 80 A100-80GB GPUs using FSDP #1864
Comments
cc @pacman100
Hello, I am able to train Starcoder with 8K seq len on 16 A100 80GB GPUs (2 nodes each having 8 GPUs) + Gradient Checkpointing + Flash Attention V2 without any issues. Code: https://github.com/pacman100/DHS-LLM-Workshop/tree/main/code_assistant/training
Output: (screenshot of the training run not reproduced here)
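As a rough illustration of the setup described above, here is a minimal sketch of loading StarCoder for long-sequence training with gradient checkpointing; the linked repository is the authoritative version, and the model id, dtype, and flags below are assumptions rather than the exact script:

```python
# Minimal, illustrative sketch only; not the linked repo's exact training code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoder"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

model.gradient_checkpointing_enable()  # recompute activations to save memory
model.config.use_cache = False         # KV cache is incompatible with checkpointing
```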
Thanks @pacman100 for the reply.
For extended pretraining, refer to the code: https://github.com/pacman100/DHS-LLM-Workshop/tree/main/personal_copilot/training The rest remains the same except for the dataset and sample creation. The above code uses monkey patching for using Flash V2.
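A rough sketch of what packed sample creation for 8K training can look like (assuming documents are already tokenized; the actual dataset code in the linked repo may differ):

```python
# Concatenate tokenized documents (separated by EOS) and split the stream into
# fixed-length blocks; illustrative packing only, not the repo's exact code.
from typing import List

def pack_sequences(token_lists: List[List[int]], block_size: int = 8192,
                   eos_token_id: int = 0) -> List[List[int]]:
    stream: List[int] = []
    for tokens in token_lists:
        stream.extend(tokens + [eos_token_id])
    n_blocks = len(stream) // block_size
    # The trailing remainder (shorter than block_size) is dropped here;
    # alternatively it can be right-padded, as discussed later in the thread.
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```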
Thanks @pacman100.
Yes, that is what I have tested above.
Yes, that should be fine. For installing Flash V2, refer to: https://github.com/Dao-AILab/flash-attention/tree/main#installation-and-features
No, it was leading to OOM with 8K seq len. DeepSpeed with CPU offloading might work; please test it out and share the results with the community.
Note:
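As an illustration of the DeepSpeed-with-CPU-offloading suggestion above (not from the thread; values are assumptions to be tuned), a ZeRO-3 offload configuration could look roughly like this, expressed as a Python dict:

```python
# Illustrative ZeRO-3 config with parameter and optimizer offload to CPU;
# typically written to a JSON file and passed to DeepSpeed / accelerate.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
}
```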
Thanks @pacman100. Also, maybe I didn't understand the last statement, so asking again: when I create packed sequences I add right padding to the last sequence (whose size is not exactly 8192). I am calling the forward function as follows:
I hope that is fine and that I don't have to remove the padding? Also, your Flash V2 code works if I provide batches as a list of tensors; I don't have to provide them as a list of lists or anything like that. I am asking this because I have seen other Flash V2 support that requires batches as a list of lists and without any padding.
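For reference, a hedged illustration of such a forward call with a right-padded final packed sequence and an attention mask (variable names are illustrative; model and tokenizer are assumed to be already loaded):

```python
import torch

# packed_blocks: list of 1-D LongTensors of length 8192 (only the last one right-padded)
input_ids = torch.stack(packed_blocks)                          # (batch, 8192)
attention_mask = (input_ids != tokenizer.pad_token_id).long()   # 0 on padding positions

labels = input_ids.clone()
labels[attention_mask == 0] = -100   # padding positions are ignored by the loss

outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
```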
@pacman100 I am able to run the training. One more question: can we also use BetterTransformer from Optimum with accelerate? I think it currently only has support for Flash Attention v1.
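For completeness, trying BetterTransformer from Optimum would look roughly like the sketch below (assuming optimum is installed; which attention kernel is actually used depends on the torch and optimum versions):

```python
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoder")
# Swaps supported modules for torch-native scaled-dot-product-attention kernels
model = BetterTransformer.transform(model)
```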
How to use Flash-v2 for fine-tuning?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I am trying to further train the bigcode/starcoder 15-billion-parameter model with an 8k context length on 80 A100-80GB GPUs (10 nodes with 8 GPUs each) using accelerate FSDP. I am using gradient checkpointing and my batch size per device is only 1. Even after using fsdp_backward_prefetch_policy: NO_PREFETCH and fsdp_offload_params: true I am getting the following OOM error:
Following is my FSDP configuration:
I have tried with accelerate==0.20.3 and accelerate==0.21.0. My transformers and PyTorch versions are the following:
transformers 4.29.0 pypi_0 pypi
pytorch 2.0.1 py3.11_cuda11.7_cudnn8.5.0_0 pytorch
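The original configuration file is not reproduced above; as a rough illustration only, the two settings mentioned map onto accelerate's FSDP plugin roughly as follows (an assumption-laden sketch, not the reporter's actual config):

```python
from torch.distributed.fsdp import CPUOffload
from accelerate import Accelerator, FullyShardedDataParallelPlugin

fsdp_plugin = FullyShardedDataParallelPlugin(
    backward_prefetch=None,                        # equivalent to NO_PREFETCH
    cpu_offload=CPUOffload(offload_params=True),   # fsdp_offload_params: true
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```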
During the forward pass the memory on each GPU goes up to roughly 41.3 GB; when the backward pass starts it goes to 77 GB and then the program crashes.
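A small helper of the kind one might use to confirm those per-GPU numbers (illustrative; not part of the original report):

```python
import torch

def log_gpu_mem(tag: str) -> None:
    # Current and peak allocated memory on the local GPU, in GiB.
    alloc = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] allocated={alloc:.1f} GiB, peak={peak:.1f} GiB")
```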
Please let me know if I am missing something here.
Thanks!