
[FEATURE] add support for multigpu, splitting model across gpus without using deepspeed/fsdp #710

Closed
Quetzalcohuatl opened this issue May 16, 2024 · 3 comments
Labels
type/feature Feature request

Comments

@Quetzalcohuatl
Contributor

🚀 Feature

Currently, when running distributed.sh with ZeRO-3 disabled and FSDP disabled, VRAM usage is quite a lot higher than with native accelerate + SFTTrainer. I believe this is because each GPU receives its own full copy of the model (which is why each GPU is always at about 99% VRAM, as opposed to native accelerate + SFTTrainer, where you can visually see n-1 GPUs at 0% while one of them is at 100%).

The feature would be an option to run multi-GPU training in that manner, i.e. splitting the model across GPUs instead of replicating it. This would lower the VRAM required.

For example:

- LLM Studio: Mistral, 512 sequence length, Flash Attention 2, int4 quantization, batch size 1, lora=true: each GPU has approx. 10.6 GB of VRAM occupied.
- SFTTrainer: Mistral, 512 sequence length, Flash Attention 2, nf4 quantization with bitsandbytes, batch size 1, the same LoraConfig, device_map="auto": three GPUs use 2.4 GB of VRAM each and the last GPU uses 3.2 GB, so total VRAM usage is around 10.4 GB (a rough sketch of this setup is below).
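
For reference, a minimal sketch of the kind of SFTTrainer-style setup meant above. The checkpoint name and LoRA hyperparameters are illustrative assumptions, not taken from the issue:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# nf4 quantization via bitsandbytes, as in the SFTTrainer run described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate split the layers across all visible GPUs
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",              # assumed Mistral checkpoint
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# LoRA settings are illustrative; the issue only says "the same LoraConfig"
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
# model and lora_config would then be passed to an SFTTrainer with a 512 max sequence length
```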

Motivation

This would reduce the VRAM requirements in a multi-GPU setup.

@Quetzalcohuatl added the type/feature Feature request label May 16, 2024
@psinger
Collaborator

psinger commented May 17, 2024

Hi @Quetzalcohuatl -

I am not sure I really understand the question. If you disable deepspeed, you will not do sharding but DDP, and DDP always keeps a full copy of the weights on each GPU.

So if you want sharding, we support deepspeed already.

@Quetzalcohuatl
Contributor Author

Hi @psinger

deepspeed does not support nf4 quantization, so the VRAM requirements when sharding will be higher there. It would be nice to shard without using deepspeed. Native accelerate/transformers can do that by using device_map="auto" or by specifying an explicit device_map, as in the sketch below. It would be good to have that ability to support the quantization use case.
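
A minimal sketch of the accelerate/transformers path referred to here, building an explicit device_map and loading the nf4-quantized model sharded across GPUs. The checkpoint name and per-GPU memory caps are illustrative assumptions:

```python
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint

# Build an explicit device_map without materializing the weights
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    meta_model = AutoModelForCausalLM.from_config(config)
device_map = infer_auto_device_map(
    meta_model,
    max_memory={i: "4GiB" for i in range(torch.cuda.device_count())},  # illustrative caps
)

# Load the nf4-quantized model sharded across GPUs according to that map
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,   # or simply device_map="auto"
)
print(model.hf_device_map)   # shows which layers ended up on which GPU
```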

@pascal-pfeiffer
Collaborator

Accelerate wraps FSDP, deepspeed, DDP and so on. So this is probably a duplicate of #631.

Rewriting to use accelerate could be an option. Last time we tried that, we ran into issues because it isn't fully customizable, but it might be good to reconsider.

Closing as duplicate.

@pascal-pfeiffer closed this as not planned (duplicate) May 19, 2024