avoid setting device_id to init_process_group #7542

stas00 merged 7 commits into deepspeedai:master

Conversation
Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
@delock, please help take a look, thanks.
If you undo #7266 you will get a rain of warnings (per rank) from recent PyTorch versions that collectives could be doing the wrong thing and hang - see the PR I linked to. This is not a bug to fix. Some other approach needs to be used to address your need without breaking the main use case. I will defer to @tjruwase on the design. Most likely some flag needs to be passed to decide what to do.
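For readers following along, the pattern #7266 introduced can be sketched roughly like this (simplified, not DeepSpeed's exact code; `resolve_device_id` is a name invented for this sketch): the default process group gets bound to this rank's device via `device_id`, which is what lets recent PyTorch versions route collectives correctly and stop emitting the per-rank warnings.

```python
import os

import torch


def resolve_device_id(local_rank=None):
    """Pick the device this rank's collectives should be bound to."""
    rank = int(os.environ.get("LOCAL_RANK", 0)) if local_rank is None else local_rank
    if torch.cuda.is_available():
        return torch.device("cuda", rank)
    # CPU fallback for illustration only; device_id is meaningful for
    # accelerator-backed process groups.
    return torch.device("cpu")


# In real use this would feed init_process_group, e.g.:
# torch.distributed.init_process_group(backend="nccl",
#                                      device_id=resolve_device_id())
```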
Hi @stas00, I read your PR, and it seems you hit a hang issue when adding
You misunderstood the purpose of #7266.
Now you're requesting that
To meet everybody's needs, I propose adding a new config variable that will control this behavior - by default it should set
@kaixuanliu what will we encounter if we set
@delock it will try to create a new group using the gloo backend on XPU and crash; here is the log:
Wait a sec, are you saying the problem happens when you run on
If so, why not check if the hardware is XPUs and then not set
For example, does the problem go away if you set
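The conditional being floated here could look roughly like the following (a sketch only; `maybe_device_id` is a hypothetical helper invented for illustration, and how the hardware type is detected is left open): pass `device_id` on backends known to handle it, and leave it unset on XPU.

```python
import torch


def maybe_device_id(device_type, local_rank):
    # Hypothetical helper: return a device to pass as device_id on
    # backends known to handle it, or None (i.e. leave device_id
    # unset) when running on XPU.
    if device_type == "xpu":
        return None
    return torch.device(device_type, local_rank)


# Hypothetical call site (hw / rank / backend would come from the runtime):
# dev = maybe_device_id(hw, rank)
# kwargs = {} if dev is None else {"device_id": dev}
# torch.distributed.init_process_group(backend=backend, **kwargs)
```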
@stas00, on CUDA this will not crash, as CUDA also supports
Do you mean you wish to call
After consideration, we think it's best to solve this bug on the PyTorch side. And here we can make a WA to add
I think your proposal is a good start, @kaixuanliu - we can always expand the use case as we go. Does WA stand for workaround? Could you write it out in the comment as a full word, as most readers won't know what WA stands for.
you can run the test
The failure in modal-torch-latest is unrelated - you can ignore it.
In some use cases such as vLLM, we need to create new distributed groups not only on GPU but also on CPU. If we set `device_id` here, it prevents us from creating a distributed group on CPU: [L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230). This PR fixes this bug.

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
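The vLLM pattern the description points at can be reproduced in a single-process sketch (assumptions: gloo-capable PyTorch build; the rendezvous address/port values are arbitrary local choices, not anything from vLLM or DeepSpeed): initialize the default group without `device_id`, then create an extra CPU group with the gloo backend, which is the call that crashed on XPU when the default group was pinned to a device.

```python
import os

import torch.distributed as dist

# Rendezvous settings for a single-process local run (arbitrary values).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")

# Default group, initialized *without* device_id - this is what the PR
# restores, so the group is not hard-bound to a GPU/XPU device.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# vLLM-style extra CPU communication group (parallel_state.py creates one
# with backend="gloo" alongside the device group). With device_id set on
# the default group, this second group creation failed on XPU.
cpu_group = dist.new_group(ranks=[0], backend="gloo")
cpu_world = dist.get_world_size(group=cpu_group)

dist.destroy_process_group()
```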