avoid setting device_id to init_process_group #7542

stas00 merged 7 commits into deepspeedai:master

Conversation
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@delock, please help take a look, thanks.
If you undo #7266 you will get a rain of warnings (per rank) from recent PyTorch versions that collectives could be doing the wrong thing and hang - see the PR I linked to. This is not a bug to fix. Some other approach needs to be used to address your need without breaking the main use case. I will defer to @tjruwase on the design. Most likely some flag needs to be passed to decide what to do.
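For readers following along, the pattern #7266 introduced can be sketched roughly like this (simplified, not DeepSpeed's exact code; `resolve_device_id` is a name invented for this sketch): the default process group gets bound to this rank's device via `device_id`, which is what lets recent PyTorch versions route collectives correctly and stop emitting the per-rank warnings.

```python
import os

import torch


def resolve_device_id(local_rank=None):
    """Pick the device this rank's collectives should be bound to."""
    rank = int(os.environ.get("LOCAL_RANK", 0)) if local_rank is None else local_rank
    if torch.cuda.is_available():
        return torch.device("cuda", rank)
    # CPU fallback for illustration only; device_id is meaningful for
    # accelerator-backed process groups.
    return torch.device("cpu")


# In real use this would feed init_process_group, e.g.:
# torch.distributed.init_process_group(backend="nccl",
#                                      device_id=resolve_device_id())
```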
Hi @stas00, I read your PR, and it seems you hit a hang issue when adding
You misunderstood the purpose of #7266.
Now you're requesting that
To meet everybody's needs, I propose adding a new config variable that will control this behavior - by default it should set
@kaixuanliu what will we encounter if we set
@delock it will try to create a new group using the gloo backend on XPU and crash; here is the log:
Wait a sec, are you saying the problem happens when you run on
If so, why not check if the hardware is XPUs and then not set
For example, does the problem go away if you set
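The conditional being floated here could look roughly like the following (a sketch only; `maybe_device_id` is a hypothetical helper invented for illustration, and how the hardware type is detected is left open): pass `device_id` on backends known to handle it, and leave it unset on XPU.

```python
import torch


def maybe_device_id(device_type, local_rank):
    # Hypothetical helper: return a device to pass as device_id on
    # backends known to handle it, or None (i.e. leave device_id
    # unset) when running on XPU.
    if device_type == "xpu":
        return None
    return torch.device(device_type, local_rank)


# Hypothetical call site (hw / rank / backend would come from the runtime):
# dev = maybe_device_id(hw, rank)
# kwargs = {} if dev is None else {"device_id": dev}
# torch.distributed.init_process_group(backend=backend, **kwargs)
```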
@stas00, on CUDA this will not crash, as CUDA also supports
Do you mean you wish to call
After consideration, we think it's best to solve this bug on the PyTorch side. And here we can make a WA to add
I think your proposal is a good start, @kaixuanliu - we can always expand the use case as we go. Does WA stand for workaround? Could you write it out in the comment as a full word, as most readers won't know what WA stands for.
you can run the test
The failure in modal-torch-latest is unrelated - you can ignore it.
In some use cases such as vLLM, we need to create new distributed groups not only on GPU but also on CPU. If we set `device_id` here, it prevents us from creating a distributed group on CPU: [L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230). This PR fixes this bug.

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
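The vLLM pattern the description points at can be reproduced in a single-process sketch (assumptions: gloo-capable PyTorch build; the rendezvous address/port values are arbitrary local choices, not anything from vLLM or DeepSpeed): initialize the default group without `device_id`, then create an extra CPU group with the gloo backend, which is the call that crashed on XPU when the default group was pinned to a device.

```python
import os

import torch.distributed as dist

# Rendezvous settings for a single-process local run (arbitrary values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")

# Default group, initialized *without* device_id - this is what the PR
# restores, so the group is not hard-bound to a GPU/XPU device.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# vLLM-style extra CPU communication group (parallel_state.py creates one
# with backend="gloo" alongside the device group). With device_id set on
# the default group, this second group creation failed on XPU.
cpu_group = dist.new_group(ranks=[0], backend="gloo")
cpu_world = dist.get_world_size(group=cpu_group)

dist.destroy_process_group()
```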