[MoE] Fix misuse of num_experts as expert parallel group size (ep_size) #7551
Fixes #7535
## Description
This PR fixes a bug in `inference/engine.py` where `num_experts` (`moe_experts`) was incorrectly passed as the expert parallel group size (`ep_size`) when creating expert parallel groups.
Currently:
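The buggy call looked roughly like the following sketch (reconstructed from the PR description; the surrounding code and the config field names are assumptions, not the exact source):

```python
# Sketch of the buggy call site (field names are assumptions):
if config.moe.enabled and dist.get_world_size() > 1:
    # BUG: moe_experts counts the experts inside the MoE layer, but
    # _create_ep_parallel_group expects the expert parallel group
    # size, i.e. how many GPUs form one expert parallel group.
    self._create_ep_parallel_group(config.moe.moe_experts)
```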
This causes invalid behavior whenever `num_experts > world_size`, because `_create_ep_parallel_group` expects a group size, not the total number of experts, as pointed out by @Arnoochka.
## Root Cause
- `num_experts`: the number of experts inside the MoE layer.
- `ep_size`: how many GPUs to group together for expert parallelism.

These two were mixed up in the code, as the illustration below shows.
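A concrete illustration with hypothetical numbers (not taken from the PR):

```python
world_size  = 4   # total number of GPUs in the job
num_experts = 8   # experts inside the MoE layer
ep_size     = 2   # GPUs grouped together for expert parallelism

# Valid: ep_size tiles the world into world_size // ep_size
# expert parallel groups of ep_size ranks each.
assert world_size % ep_size == 0

# The bug: passing num_experts (8) where ep_size is expected asks
# for process groups of 8 ranks on a 4-rank job, which is impossible.
```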
## Fix
Replaced the incorrect call with the proper `ep_size` argument:
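A sketch of the corrected call, assuming the config exposes the group size as `config.moe.ep_size` (the field name is an assumption):

```python
# Fixed call: pass the expert parallel group size, not num_experts.
if config.moe.enabled and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)
```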
Additionally, added a safety check in `_create_ep_parallel_group` to catch invalid configurations:
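The check could look like this sketch (the exact message and placement are assumptions):

```python
import torch.distributed as dist

def _create_ep_parallel_group(self, ep_size: int):
    """Create expert parallel process groups of ep_size ranks each."""
    world_size = dist.get_world_size()
    # Safety check: ep_size must be a positive divisor of world_size,
    # otherwise the ranks cannot be split into equal-sized EP groups.
    if ep_size <= 0 or world_size % ep_size != 0:
        raise ValueError(
            f"Invalid ep_size={ep_size}: expert parallel group size must "
            f"be a positive divisor of world_size={world_size}.")
    # ... existing group creation logic unchanged ...
```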
## Backward compatibility