Conversation

@Flakes342
Contributor

Fixes #7535

Description

This PR fixes a bug in inference/engine.py where num_experts (moe_experts) was incorrectly passed as the expert parallel group size (ep_size) when creating expert parallel groups.

Currently:

if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.moe_experts)

This causes invalid behavior whenever num_experts > world_size, because _create_ep_parallel_group expects a group size, not the total number of experts, as pointed out by @Arnoochka.

Root Cause

  • num_experts = the number of experts inside the MoE layer.
  • ep_size = the number of GPUs grouped together for expert parallelism.

These two values were mixed up in the code.
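
To make the distinction concrete, here is a small illustrative calculation; the sizes are hypothetical and not taken from any particular config:

```
# Hypothetical example: 4 GPUs, an MoE layer with 8 experts, expert parallelism across 2 GPUs.
world_size = 4    # total number of ranks (dist.get_world_size())
num_experts = 8   # experts inside the MoE layer
ep_size = 2       # ranks grouped together for expert parallelism

num_ep_groups = world_size // ep_size      # 2 independent expert-parallel groups
experts_per_rank = num_experts // ep_size  # 4 experts hosted on each rank of a group

# Passing num_experts where ep_size is expected breaks this arithmetic:
# world_size // num_experts == 4 // 8 == 0, i.e. no expert-parallel group can be formed.
```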

Fix

Replaced the incorrect call with the proper ep_size argument:

if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)

Additionally, added a safety check in _create_ep_parallel_group to catch invalid configurations:

num_ep_groups = dist.get_world_size() // moe_ep_size
if num_ep_groups == 0:
    raise ValueError(
        f"Invalid ep_size={moe_ep_size} for world_size={dist.get_world_size()}"
    )
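
For context, the pattern here is to split the ranks into world_size // ep_size groups of ep_size ranks each. The sketch below is a simplified illustration of that idea under those assumptions, not the exact DeepSpeed implementation; the helper name create_ep_groups is made up for this example:

```
import torch.distributed as dist

def create_ep_groups(moe_ep_size: int):
    """Illustrative helper (not DeepSpeed's actual code): partition all ranks
    into world_size // moe_ep_size expert-parallel groups."""
    world_size = dist.get_world_size()
    num_ep_groups = world_size // moe_ep_size
    if num_ep_groups == 0:
        raise ValueError(f"Invalid ep_size={moe_ep_size} for world_size={world_size}")

    ep_groups = []
    for i in range(num_ep_groups):
        # Each contiguous chunk of moe_ep_size ranks becomes one expert-parallel group.
        ranks = list(range(i * moe_ep_size, (i + 1) * moe_ep_size))
        # dist.new_group is collective: every rank must call it for every group.
        ep_groups.append(dist.new_group(ranks=ranks))
    return ep_groups
```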

Backward compatibility

  • If a user was already running with ep_size >= num_experts, the old code worked, and it continues to work unchanged.
  • Only the previously broken case (num_experts > world_size) now works correctly; see the quick check below.
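
As a quick sanity check of the before/after behavior, using hypothetical sizes (world size 2, four experts, ep_size 2 in the config):

```
# Hypothetical sizes for illustration only.
world_size, num_experts, ep_size = 2, 4, 2

# Old call site passed num_experts as the group size:
print(world_size // num_experts)  # 0 -> zero expert-parallel groups, i.e. the broken case

# New call site passes the configured ep_size:
print(world_size // ep_size)      # 1 -> one group spanning both ranks, as intended
```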

Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
@Flakes342
Contributor Author

Hi @tohtana, the previous PR was getting very messy with all the unaccounted contributions by other members, so I opened this new, smaller PR with the same changes to keep things consistent.
All tests except nv-mii are passing now; please take a look when you get a chance and let me know. Thanks!

@tohtana tohtana merged commit 8cbbbb5 into deepspeedai:master Sep 10, 2025
12 of 13 checks passed
@tohtana
Collaborator

tohtana commented Sep 10, 2025

Thank you @Flakes342, I merged it.

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this pull request Oct 4, 2025
…e) (deepspeedai#7551)
