Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 #2999
Conversation
@tjruwase Hi, do you have any comments on this PR?
@YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.
Strangely, ds_process_group should not pick up global rank 0 when MP > 1; it should be picking up local ranks for initializing the bcast call with MP.
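A minimal sketch of the suggestion above (illustrative only, not DeepSpeed's actual code; assumes a PyTorch version that exposes dist.get_global_rank as a public API):

```python
import torch.distributed as dist

def bcast_from_group_local_rank0(tensor, group):
    # Translate local rank 0 of `group` into its global rank so the broadcast
    # source is always a member of the group, even when MP > 1 splits the
    # world into groups such as [0, 2, 4, ...] and [1, 3, 5, ...].
    src = dist.get_global_rank(group, 0)
    dist.broadcast(tensor, src=src, group=group)
```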
@YizhouZ, do you know why this is not a problem for zero stage 1 or 2?
@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!
@tjruwase Seems like the post_init causes an issue with stage 1 as well (after some tests).
@abhilash1910, I don't think that is possible since this code path is only for zero stage 3. Can you please share more details of what you are seeing, such as a stack trace?
I think so too, this should only be in stage 3. However, I do sometimes see a hang in stage 1 (not the same trace or crash, maybe a separate issue).
Head branch was pushed to by a user without write access
@tjruwase Fixed the failed CI case. Please help to check it. Thank you!
Hi @tjruwase, it seems the current CI failure is not triggered by my changes. I see the previous check passed but the latest one failed, and the only difference is the readme file. The error message also seems unreasonable and not related to my part.
Could you please check it? Thank you!
------update------
@YizhouZ, apologies for the merging delay. I am confident that the CI issues are not due to your PR but due to infrastructure problems. I will ensure this PR is merged, so no need to worry about it. Sorry once again for the delay, we really appreciate your contribution.
Hello, I have another question: in partition_parameters.py, the function apply_with_gather() has similar code, dist.broadcast(param.data, 0, group=param.ds_process_group). Isn't this okay?
Hi, while enabling TensorParallel=2 and ZeroStage3 on multi-node training for Megatron-DeepSpeed, I encountered an error on this bcast:
raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")
RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
With TensorParallel=2, ds_process_group would be [0, 2, 4, ...] or [1, 3, 5, ...], and the latter does not contain global rank 0. So I believe this bcast should use local rank 0 inside the current self.ds_process_group.
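For illustration, a hypothetical standalone reproduction of the group layout described above (the script name, launch command, and backend are assumptions, not Megatron-DeepSpeed code; assumes PyTorch >= 2.0 for dist.get_global_rank). Launch with, e.g., torchrun --nproc_per_node=4 repro.py:

```python
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="gloo")
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    tp = 2  # tensor-parallel degree
    # With TP=2 the groups partition the world as [0, 2, 4, ...] and [1, 3, 5, ...].
    all_group_ranks = [list(range(start, world_size, tp)) for start in range(tp)]
    groups = [dist.new_group(ranks) for ranks in all_group_ranks]  # created collectively on every rank
    my_group = groups[rank % tp]

    t = torch.zeros(1)
    # Hard-coding the source as global rank 0 raises on ranks 1, 3, 5, ...:
    #   RuntimeError: Global rank 0 is not part of group ...
    # dist.broadcast(t, src=0, group=my_group)

    # Broadcasting from the group's local rank 0 works for both groups.
    src = dist.get_global_rank(my_group, 0)
    dist.broadcast(t, src=src, group=my_group)

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```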