Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 #2999

Merged: 39 commits into microsoft:master on May 15, 2023

Conversation

@YizhouZ (Contributor) commented Mar 13, 2023

Hi, while enabling TensorParallel=2 and ZeroStage3 for multi-node training with Megatron-DeepSpeed, I encountered an error on this bcast:
RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
    raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")
With TensorParallel=2, the ds_process_groups would be [0, 2, 4, ...] and [1, 3, 5, ...], and the latter does not contain global rank 0. So I believe this bcast should use the local rank 0 of the current self.ds_process_group.
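For illustration, a minimal sketch of the group layout being described (the rank arithmetic and group construction below are illustrative, not DeepSpeed's actual code, and assume torch.distributed is already initialized):

```python
import torch.distributed as dist

def build_zero_dp_groups(world_size: int, tp_size: int = 2):
    # e.g. world_size = 8, tp_size = 2 -> rank lists [[0, 2, 4, 6], [1, 3, 5, 7]]
    rank_lists = [list(range(tp_rank, world_size, tp_size)) for tp_rank in range(tp_size)]
    # dist.new_group must be called collectively on every rank for every group
    groups = [dist.new_group(ranks) for ranks in rank_lists]
    return rank_lists, groups

# Calling dist.broadcast(param.data, src=0, group=groups[1]) then fails with
# "Global rank 0 is not part of group ..." because src is a *global* rank and
# the group [1, 3, 5, 7] does not contain global rank 0.
```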

@YizhouZ changed the title from "Try to fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2" to "Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2" on Mar 20, 2023
@YizhouZ (Contributor, Author) commented Mar 20, 2023

@tjruwase Hi,
We found a bug in DeepSpeed: when enabling tensor parallel = 2 on Megatron-DeepSpeed 20B with 4 nodes, we hit the error below:
RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
    raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")
In this case, the ds_process_groups would be [0, 2, 4, ...] and [1, 3, 5, ...], and the latter does not contain global rank 0. So we believe this bcast should use the local rank 0 of the current self.ds_process_group rather than global rank 0.
After this fix, parameters are broadcast inside each ds_process_group from its local rank 0 to the rest of the ranks.
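A minimal sketch of the fix described above, assuming the idea is to map the group-local rank 0 back to its global rank before broadcasting (the helper name is illustrative; the actual DeepSpeed change may differ in detail):

```python
import torch.distributed as dist

def broadcast_from_group_rank_zero(tensor, group):
    # torch.distributed.broadcast takes src as a *global* rank, so first map the
    # group-local rank 0 of this ds_process_group back to its global rank.
    # dist.get_global_rank is the public API in recent PyTorch; older releases
    # expose torch.distributed.distributed_c10d._get_global_rank instead.
    src = dist.get_global_rank(group, 0)
    dist.broadcast(tensor, src=src, group=group)
```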

Do you have any comments on this PR?

@tjruwase (Contributor) commented:

@YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.

@YizhouZ (Contributor, Author) commented Mar 21, 2023

> @YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.

@tjruwase Thanks!

@abhilash1910 (Contributor) left a comment

Strangely, ds_process_group should not pick up global rank 0 when MP > 1; it should be picking up local ranks for initializing the bcast call with MP.

@tjruwase (Contributor) commented:

@YizhouZ, do you know why this is not a problem for zero stage 1 or 2?

@YizhouZ (Contributor, Author) commented Apr 14, 2023

> @YizhouZ, do you know why this is not a problem for zero stage 1 or 2?

Hi @tjruwase, only stage 3 triggers this post_init_method; the other stages never reach this code path, as far as my test results show.
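For context, a hedged usage sketch of the stage-3-only construction path being referred to (arguments are minimal and illustrative; it assumes a distributed environment is already set up):

```python
import deepspeed
import torch

# Only ZeRO stage 3 builds the model under zero.Init; its post-init hook
# intercepts each freshly created parameter, partitions it across the
# ds_process_group, and broadcasts it for consistency (the call this PR fixes).
# Stages 1 and 2 construct the model normally and never enter this hook.
with deepspeed.zero.Init(enabled=True):
    model = torch.nn.Linear(1024, 1024)
```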

@tjruwase enabled auto-merge (squash) April 18, 2023 20:13
@YizhouZ (Contributor, Author) commented Apr 26, 2023

@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!

@abhilash1910 (Contributor) commented:

@tjruwase, it seems like post_init causes an issue with stage 1 as well (after some tests).

@tjruwase (Contributor) commented May 3, 2023

@abhilash1910, I don't think that is possible since this code path is only for zero stage 3. Can you please share more details of what you are seeing, such as a stack trace?

@abhilash1910 (Contributor) commented May 3, 2023

I think so too; this should only occur in stage 3. However, I do sometimes see a hang in stage 1 (not the same trace or crash, so it may be a separate issue).
I will revalidate and let you know, @tjruwase.

auto-merge was automatically disabled May 8, 2023 08:03

Head branch was pushed to by a user without write access

@YizhouZ (Contributor, Author) commented May 8, 2023

@tjruwase I fixed the failing CI case. Please help check it. Thank you!

@YizhouZ (Contributor, Author) commented May 9, 2023

Hi @tjruwase, it seems the current CI failure is not triggered by my changes: the previous check passed, the latest one failed, and the only difference between them is the README file. The error message also looks unrelated to my changes:

______________________ TestPipeCifar10.test[topo_config0] ______________________
[gw3] linux -- Python 3.8.8 /tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
Worker 0 hung.
----------------------------- Captured stdout call -----------------------------
[2023-05-08 20:11:57,287] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
----------------------------- Captured stderr call -----------------------------
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Process Process-4:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/tests/unit/common.py", line 195, in _dist_init
    dist.barrier()
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
    return func(*args, **kwargs)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 395, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 214, in barrier
    return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2526, in barrier
    work = group.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Could you please check it? Thank you!

------ update ------
I tried this on a CUDA device and could not reproduce the errors; all 3 failed tests passed.

@tjruwase (Contributor) commented May 9, 2023

@YizhouZ, apologies for the merging delay. I am confident that the CI issues are not due to your PR but to infrastructure problems. I will ensure this PR is merged, so no need to worry about it. Sorry once again for the delay; we really appreciate your contribution.

@tjruwase added the merge-queue (PRs ready to merge) label May 13, 2023
@tjruwase merged commit 9f4a876 into microsoft:master May 15, 2023
@zte-tcb commented May 17, 2023

Hello, I have another question: in partition_parameters.py, the function apply_with_gather() contains similar code, dist.broadcast(param.data, 0, group=param.ds_process_group). Isn't that okay?
