Model Parallelism and accelerate's usage of DDP aren't compatible #1368
Comments
Yes, Accelerate does not support DDP with model parallelism. I'm not sure your proposed fix would work, as DDP will all-reduce the gradients across GPUs, yet here the GPUs don't all have the same parameters. For the pipeline parallelism you are trying to achieve, use FSDP or DeepSpeed.
If you make sure each accelerate process gets multiple GPUs, then I think DDP will work as expected: if you have 1 accelerate process, and hence 1 DDP model, per 4 GPUs (for example), then you should get the correct synchronisation. That's what this tutorial implies, for example. In my setup, preparing the model separately does work as intended, with 4 accelerate processes of 3 GPUs each and the model layers split across those 3 GPUs (a sketch of that per-process slicing follows below). I'm launching each of those processes with a separate call, however, so I'm uncertain how you'd do it with a single call on a multi-GPU machine if you wanted 4 processes with 2 GPUs each rather than 8 processes with 1 GPU each. I'll look into trying to get FSDP to work in my setup as well.
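For illustration, here is a minimal sketch of restricting each launched process to its own GPU slice, assuming a torchrun-style launch that sets `LOCAL_RANK`; the group size, memory budget, and checkpoint are placeholders, not values from this thread:

```python
import os
from transformers import AutoModelForCausalLM

# Hypothetical layout: 4 processes x 3 GPUs each on a 12-GPU node.
gpus_per_proc = 3
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
my_gpus = range(local_rank * gpus_per_proc, (local_rank + 1) * gpus_per_proc)

# Restrict automatic layer placement to this process's GPUs: devices absent
# from max_memory are not used by device_map="auto".
max_memory = {i: "40GiB" for i in my_gpus}  # assumed per-GPU budget

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # illustrative checkpoint
    device_map="auto",
    max_memory=max_memory,
)
```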
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger Need your guidance: I want to use
to train a 40B model, but I also want DDP. How should I achieve that? Thanks!
You can use DDP if your model is only on one device, like this.
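For context, the supported single-device-per-process pattern looks roughly like this (a toy sketch; the stand-in model and data would be replaced by real ones):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Toy stand-ins for the real model, optimizer, and data.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))),
    batch_size=8,
)

# prepare() moves the model to this process's single device and wraps it in DDP.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```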
@sgugger Thanks for your fast help. But what if the model is too big for one GPU device?
Then you cannot use DDP +
I feel like you are not listening. You cannot use
Sorry for my misunderstanding, I got your point now.
@sgugger Just to make sure my understanding is correct, can we use
As long as you properly configure DeepSpeed ZeRO-3, you won't need to use
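For reference, a minimal sketch of configuring ZeRO-3 through the HF Trainer with a dict-based config; the specific fields are illustrative, and the "auto" values are resolved by the Trainer's DeepSpeed integration:

```python
from transformers import TrainingArguments

# Sketch only: a minimal ZeRO-3 config. Field choices here are illustrative,
# not a recommendation from this thread.
ds_config = {
    "zero_optimization": {"stage": 3, "overlap_comm": True},
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="out",     # illustrative path
    bf16=True,
    deepspeed=ds_config,  # accepts a dict or a path to a JSON config file
)
```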
Just to document my experience getting DDP + MP (2x2 on 4 GPUs) to work with Accelerate (via the HF Trainer): I modified the current main branch to initialize the DDP model with `device_ids` and `output_device` set to `None`, as described in the PyTorch docs for multi-device modules. Additionally, I had to remove some ValueErrors that are being raised (for no good reason?). I launch two processes with torchrun, each supposed to use 2 GPUs.

My device map:

```python
device_map = {
    'model.embed_tokens': 0, 'model.norm': 2, 'lm_head': 2,
    'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0,
    'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0,
    'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0,
    'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0,
    'model.layers.16': 2, 'model.layers.17': 2, 'model.layers.18': 2, 'model.layers.19': 2,
    'model.layers.20': 2, 'model.layers.21': 2, 'model.layers.22': 2, 'model.layers.23': 2,
    'model.layers.24': 2, 'model.layers.25': 2, 'model.layers.26': 2, 'model.layers.27': 2,
    'model.layers.28': 2, 'model.layers.29': 2, 'model.layers.30': 2, 'model.layers.31': 2,
}
```

I set:

```python
model = transformers.LlamaForCausalLM.from_pretrained(
    model_args.model_name_or_path,
    torch_dtype=torch.bfloat16 if training_args.bf16 else torch.float32,
    device_map=device_map,
)
```

This works for me and produces exactly the same loss curves compared to using only DDP or only MP.
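For reference, the DDP construction change described above amounts to roughly the following (a sketch, not the actual fork diff), assuming the process group was already initialized by torchrun:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# `model` is the device_map-sharded model loaded above. For a module whose
# parameters span several GPUs, the PyTorch docs require both arguments
# below to be None so DDP does not try to move the module to one device.
ddp_model = DDP(model, device_ids=None, output_device=None)
```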
@maxidl can you share your modified code? Curious what those exceptions are that exist for "no good reason" |
@muellerzr I do think these errors are necessary if one does not also modify the DDP construction; in that case they are correct. You can find my fork here: maxidl@332d960. Also, note that I did not run any tests to check whether this breaks other behavior.

Now, why do I think it is useful to have DDP + MP (in the classic pipeline-of-layers way): in my case, I am running GPUs without a fast interconnect (NVLink), which makes FSDP-style training very slow.
Thanks @maxidl. As an approach, here's what the team has decided we will do:
Seem reasonable, @maxidl? And thank you for this reproducer!
Sure, that sounds great. Once the changes are in (no rush with that), I might create a tutorial-style GitHub repo for it and do some benchmarking, to be shared via Twitter (sorry, "X"…).
Sorry to bring this up again, but is it possible to add this functionality as a feature? The background is that we want to tune a 70B or 8x7B model as a teacher. We tried FSDP, but a lot of features are not supported under it, and DeepSpeed is even worse, e.g. nested quantization and sliding-window attention, so the final total saving is actually not that much. The following is my testing code; basically, each node has 8 A100 80GB GPUs, and each training process will take 2 GPUs:
By the way, besides this, is there any other memory optimization approach I can take?
@muellerzr @maxidl, because I loaded the model in 4-bit, I also commented out this line:
I don't know whether there is any bad effect; it starts to train, at least. Could you please elaborate on the potential consequences?
System Info
Information
Tasks
One of the `no_trainer` scripts in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
If I use model parallelism (for example using huggingface parallelize), and I'm using accelerate with a standard multi-GPU environment (that uses DDP), then when I prepare the model I get the following error:
I think this is because of this line:

accelerate/src/accelerate/accelerator.py, line 1222 in 2708c1a

where `device_ids` and `output_device` are set, but they should be `None` if using model parallelism. You should be able to reproduce this on a 4-GPU machine with something like the following:
I'm currently getting around this by wrapping the model in DDP myself with the correct arguments, and then doing `accelerator._models.append(model)`; a sketch of that workaround follows.
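```python
# Sketch of the workaround (not the author's exact code), assuming `model`
# was loaded across multiple GPUs as in the hypothetical snippet above and
# the process group was already initialized by the accelerate launcher.
from torch.nn.parallel import DistributedDataParallel as DDP
from accelerate import Accelerator

accelerator = Accelerator()

# Multi-device-safe DDP construction instead of accelerator.prepare(model):
model = DDP(model, device_ids=None, output_device=None)

# Hypothetical bookkeeping: _models is a private attribute, so this relies
# on Accelerate internals staying stable.
accelerator._models.append(model)
```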
Expected behavior
I'd expect accelerate's usage of DDP to be compatible with naïve model parallelism, as DDP itself supports multi-device modules.
I think the fix would be to adjust accelerate/src/accelerate/accelerator.py (line 1222 in 2708c1a) so that it does not set `device_ids` and `output_device`. I'd be happy to submit a PR to make that change if that seems reasonable. A sketch of what that adjustment could look like is below.
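Hypothetically, the adjusted DDP wrapping inside `Accelerator.prepare_model` could look like this; `is_model_parallel` is a stand-in name for whatever check detects a model whose parameters span several devices:

```python
# Hypothetical sketch of the proposed adjustment (not the actual Accelerate
# source): skip the single-device arguments when the model is sharded.
if is_model_parallel:
    model = torch.nn.parallel.DistributedDataParallel(
        model, device_ids=None, output_device=None
    )
else:
    model = torch.nn.parallel.DistributedDataParallel(
        model,
        device_ids=[self.local_process_index],
        output_device=self.local_process_index,
    )
```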