How to do QLoRA training with ZeRO-3 on two or more GPUs? #42
Comments
Hi @Di-Zayn, note that you will also need to modify the configuration used for DeepSpeed ZeRO-3, as the one they share is suited for a VM with 8 x A100 80GB, so you may need to add the flags required to load and train in lower precision. I'm not sure how to fine-tune with NF4 under ZeRO-3, but https://www.deepspeed.ai/tutorials/MoQ-tutorial/#deepspeed-configuration-file may be worth checking.
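For reference, the handbook's ZeRO-3 accelerate config looks roughly like the sketch below (values are illustrative, not a verbatim copy of the file in your checkout). The fields most likely to need changing on a smaller setup are num_processes, mixed_precision, and the offload devices:

```yaml
# Sketch of recipes/accelerate_configs/deepspeed_zero3.yaml -- illustrative,
# not a verbatim copy of the handbook file.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                    # ZeRO-3: shard params, grads and optimizer state
  zero3_init_flag: true            # initialize large models directly in sharded form
  zero3_save_16bit_model: true
  offload_optimizer_device: none   # set to "cpu" if GPU memory is tight
  offload_param_device: none       # likewise
mixed_precision: bf16
num_machines: 1
num_processes: 2                   # GPUs per machine; the shared config assumes 8
```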
I'm getting this issue as well (trying QLoRA with ZeRO-3 on 4 GPUs, same error message). @Di-Zayn, were you able to solve it?
I had similar problems, so I switched to the multi_gpu script, set the parameter to use just 2 GPUs, and everything worked fine: https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/multi_gpu.yaml However, with the ZeRO config the starting loss was around 1.7 instead of the 1.4 I got with the multi_gpu script, whether using 1 or 2 GPUs. I never bothered experimenting further with ZeRO, as I got the results I needed with the multi_gpu script.
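For anyone following along, the launch line with the DDP-style multi_gpu config would look roughly like this (paths as in the handbook; --num_processes caps the number of GPUs used and --load_in_4bit=true enables the 4-bit load):

```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/multi_gpu.yaml --num_processes=2 \
  scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true
```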
I was keen on sharding the model across GPUs in order to fit larger models. As an aside, the latest FSDP and QLoRA examples are working for me, and that covers my use case.
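If sharding plus 4-bit is the goal, the FSDP + QLoRA path uses an accelerate FSDP config along these lines (a sketch only; the exact field names depend on your accelerate version, and the handbook may ship its own file):

```yaml
# Sketch of an accelerate FSDP config for QLoRA-style training -- illustrative,
# check your accelerate version for the exact supported fields.
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_sharding_strategy: FULL_SHARD       # shard params, grads and optimizer state
  fsdp_cpu_ram_efficient_loading: true     # load weights on rank 0 only, then shard
  fsdp_offload_params: false
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_use_orig_params: false
mixed_precision: bf16
num_machines: 1
num_processes: 2
```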
I added a 4-bit load to the command for LoRA training with ZeRO-3 on two or more GPUs, to get a mix of QLoRA and ZeRO-3, but the program hit the following error:
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f2ec8daf900>
The command is:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file recipes/accelerate_configs/deepspeed_zero3.yaml --num_processes=2 \
  scripts/run_sft.py recipes/zephyr-7b-beta/sft/config_lora.yaml --load_in_4bit=true
```
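For context, --load_in_4bit=true overrides the recipe config from the command line; the equivalent settings in the recipe YAML would look roughly like this (a sketch only; the field names follow the handbook's model arguments, and the values shown are assumptions, not the shipped recipe):

```yaml
# Hypothetical excerpt of recipes/zephyr-7b-beta/sft/config_lora.yaml with a
# 4-bit load enabled -- a sketch, not the shipped file.
model_name_or_path: mistralai/Mistral-7B-v0.1   # assumed base model for the zephyr SFT recipes
torch_dtype: bfloat16
load_in_4bit: true            # what --load_in_4bit=true sets from the CLI
bnb_4bit_quant_type: nf4      # assumed quantization type
use_peft: true
lora_r: 16                    # illustrative LoRA hyperparameters
lora_alpha: 16
```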