🚀 Feature

Currently, running distributed.sh with ZeRO-3 and FSDP disabled uses quite a lot more VRAM than using accelerate + SFTTrainer natively. I believe this is because each GPU receives its own full copy of the model (which is why each GPU sits at roughly 99% VRAM, as opposed to native accelerate + SFTTrainer, where you can see n-1 GPUs at 0% while one of them is at 100%).
The feature would be an option to run a multi-GPU training script in this manner, which would lower the VRAM required.
For example:

- LLM Studio: 512 seq len, Mistral with FlashAttention-2 and int4 quantization, batch size 1, lora=true: each GPU has approximately 10.6 GB of VRAM occupied.
- SFTTrainer: 512 seq len, Mistral with FlashAttention-2 and nf4 quantization via bitsandbytes, batch size 1, the same LoraConfig, device_map="auto": three GPUs use 2.4 GB of VRAM each and the last GPU uses 3.2 GB, so total VRAM usage is about 10.4 GB (a rough sketch of this setup is below).
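Roughly, the SFTTrainer setup above looks like this (a minimal sketch using the older TRL SFTTrainer keyword arguments; the checkpoint, dataset, and LoRA hyperparameters are illustrative assumptions, not the exact values used):

```python
# Minimal sketch of the SFTTrainer setup described above. The checkpoint,
# dataset, and LoRA hyperparameters are assumptions, not the exact values used.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, TrainingArguments)
from trl import SFTTrainer

model_id = "mistralai/Mistral-7B-v0.1"  # assumed Mistral checkpoint

# nf4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" shards the layers across the available GPUs
# instead of putting a full replica on every GPU (as DDP would).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

peft_config = LoraConfig(  # illustrative LoRA settings
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1),
    train_dataset=load_dataset("timdettmers/openassistant-guanaco", split="train"),  # placeholder dataset
    dataset_text_field="text",
    max_seq_length=512,
    peft_config=peft_config,
)
trainer.train()
```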
Motivation
This would reduce the VRAM requirements in a multi-GPU setup.
I am not sure I really understand the question. If you disable DeepSpeed, you will not do sharding but DDP, and DDP will always keep a full copy of the weights on each GPU.
So if you want sharding, we already support DeepSpeed.
DeepSpeed does not support nf4 quantization, so the VRAM requirements when sharding will be higher there. It would be nice to shard without using DeepSpeed. Native accelerate/transformers can do that by using device_map="auto" or by specifying a device_map. It would be good to have that ability to support the quantization use case.
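For example, something along these lines (a rough sketch; the checkpoint and the per-GPU memory caps are placeholder values):

```python
# Sketch of loading an nf4-quantized model sharded across GPUs without DeepSpeed,
# using native transformers/accelerate. The checkpoint and the per-GPU memory
# caps are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",   # assumed checkpoint
    quantization_config=bnb_config,
    device_map="auto",             # let accelerate place layers across GPUs
    max_memory={0: "3GiB", 1: "3GiB", 2: "3GiB", 3: "4GiB"},  # optional per-GPU caps
)

# Shows which layers ended up on which device, i.e. the resolved device_map.
print(model.hf_device_map)
```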
Accelerate wraps FSDP, DeepSpeed, DDP and so on, so this is probably a duplicate of #631.
Rewriting to use accelerate could be an option. Last time we did that, we ran into issues because it isn't fully customizable, but it might be good to reconsider.