Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 #2999

Merged: 39 commits into microsoft:master on May 15, 2023

Conversation

@YizhouZ (Contributor) commented Mar 13, 2023

Hi, while enabling TensorParallel=2 and ZeroStage3 for multi-node training with Megatron-DeepSpeed, I encountered an error on this bcast:
RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
    raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")
With TensorParallel=2, the ds_process_groups would be [0, 2, 4, ...] and [1, 3, 5, ...], and the latter does not contain global rank 0. So I believe this bcast should use the local rank 0 of the current self.ds_process_group.
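For illustration, a minimal sketch of the group layout being described (the rank arithmetic and group construction below are illustrative, not DeepSpeed's actual code, and assume torch.distributed is already initialized):

```python
import torch.distributed as dist

def build_zero_dp_groups(world_size: int, tp_size: int = 2):
    # e.g. world_size = 8, tp_size = 2 -> rank lists [[0, 2, 4, 6], [1, 3, 5, 7]]
    rank_lists = [list(range(tp_rank, world_size, tp_size)) for tp_rank in range(tp_size)]
    # dist.new_group must be called collectively on every rank for every group
    groups = [dist.new_group(ranks) for ranks in rank_lists]
    return rank_lists, groups

# Calling dist.broadcast(param.data, src=0, group=groups[1]) then fails with
# "Global rank 0 is not part of group ..." because src is a *global* rank and
# the group [1, 3, 5, 7] does not contain global rank 0.
```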

@YizhouZ changed the title from "Try to fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2" to "Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2" on Mar 20, 2023
@YizhouZ (Contributor, Author) commented Mar 20, 2023

@tjruwase Hi,
We found a bug in DeepSpeed: when enabling tensor parallel = 2 on Megatron-DeepSpeed 20B with 4 nodes, we hit the error below:
RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
    raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")
In this case, the ds_process_groups would be [0, 2, 4, ...] and [1, 3, 5, ...], and the latter does not contain global rank 0. So we believe this bcast should use the local rank 0 of the current self.ds_process_group rather than global rank 0.
After this fix, parameters are broadcast inside each ds_process_group from its local rank 0 to the rest of the ranks.
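A minimal sketch of the fix described above, assuming the idea is to map the group-local rank 0 back to its global rank before broadcasting (the helper name is illustrative; the actual DeepSpeed change may differ in detail):

```python
import torch.distributed as dist

def broadcast_from_group_rank_zero(tensor, group):
    # torch.distributed.broadcast takes src as a *global* rank, so first map the
    # group-local rank 0 of this ds_process_group back to its global rank.
    # dist.get_global_rank is the public API in recent PyTorch; older releases
    # expose torch.distributed.distributed_c10d._get_global_rank instead.
    src = dist.get_global_rank(group, 0)
    dist.broadcast(tensor, src=src, group=group)
```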

Do you have any comments on this PR?

@tjruwase (Contributor) commented:

@YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.

@YizhouZ (Contributor, Author) commented Mar 21, 2023

> @YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.

@tjruwase Thanks!

@abhilash1910 (Contributor) left a comment

Strangely, ds_process_group should not pick up global rank 0 when MP > 1; it should be picking up local ranks for initializing the bcast call with MP.

@tjruwase (Contributor) commented:

@YizhouZ, do you know why this is not a problem for zero stage 1 or 2?

@YizhouZ (Contributor, Author) commented Apr 14, 2023

> @YizhouZ, do you know why this is not a problem for zero stage 1 or 2?

Hi @tjruwase, only stage 3 triggers this post_init_method; the other stages never reach this code path, as far as my test results show.
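For context, a hedged usage sketch of the stage-3-only construction path being referred to (arguments are minimal and illustrative; it assumes a distributed environment is already set up):

```python
import deepspeed
import torch

# Only ZeRO stage 3 builds the model under zero.Init; its post-init hook
# intercepts each freshly created parameter, partitions it across the
# ds_process_group, and broadcasts it for consistency (the call this PR fixes).
# Stages 1 and 2 construct the model normally and never enter this hook.
with deepspeed.zero.Init(enabled=True):
    model = torch.nn.Linear(1024, 1024)
```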

@tjruwase enabled auto-merge (squash) April 18, 2023 20:13
@YizhouZ (Contributor, Author) commented Apr 26, 2023

@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!

@abhilash1910 (Contributor) commented:

@tjruwase, it seems like post_init causes an issue with stage 1 as well (after some tests).

@tjruwase (Contributor) commented May 3, 2023

@abhilash1910, I don't think that is possible since this code path is only for zero stage 3. Can you please share more details of what you are seeing, such as a stack trace?

@abhilash1910 (Contributor) commented May 3, 2023

I think so too; this should only occur in stage 3. However, I do sometimes see a hang in stage 1 (not the same trace or crash, so it may be a separate issue).
I will revalidate and let you know, @tjruwase.

auto-merge was automatically disabled May 8, 2023 08:03

Head branch was pushed to by a user without write access

@YizhouZ (Contributor, Author) commented May 8, 2023

@tjruwase I fixed the failing CI case. Please help check it. Thank you!

@YizhouZ (Contributor, Author) commented May 9, 2023

Hi @tjruwase, it seems the current CI failure is not triggered by my changes: the previous check passed, the latest one failed, and the only difference between them is the README file. The error message also looks unrelated to my changes:

______________________ TestPipeCifar10.test[topo_config0] ______________________
[gw3] linux -- Python 3.8.8 /tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
Worker 0 hung.
----------------------------- Captured stdout call -----------------------------
[2023-05-08 20:11:57,287] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
----------------------------- Captured stderr call -----------------------------
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Process Process-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/tests/unit/common.py", line 195, in _dist_init
    dist.barrier()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 395, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 214, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2526, in barrier
    work = group.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Could you please check it? Thank you!

------ update ------
I tried this on a CUDA device and could not reproduce the errors; all 3 failed tests passed.

@tjruwase (Contributor) commented May 9, 2023

@YizhouZ, apologies for the merging delay. I am confident that the CI issues are not due to your PR but to infrastructure problems. I will ensure this PR is merged, so no need to worry about it. Sorry once again for the delay; we really appreciate your contribution.

@tjruwase added the merge-queue (PRs ready to merge) label May 13, 2023
@tjruwase merged commit 9f4a876 into microsoft:master May 15, 2023
@zte-tcb commented May 17, 2023

Hello, I have another question: in partition_parameters.py, the function apply_with_gather() contains similar code, dist.broadcast(param.data, 0, group=param.ds_process_group). Isn't that okay?
