Support more MoE expert routing load imbalance patterns#6

Merged
XZman merged 3 commits into main from moe-expert-routing-load-imbalance
Jun 22, 2026

@XZman XZman commented Jun 22, 2026

Collaborator

No description provided.

XZman and others added 3 commits June 19, 2026 21:04
Port the MoE expert load-imbalance features into NeuSim's DeepSeek path:

- MoELLMConfig: add expert_load_imbalance_factor (-1.0 sentinel -> worst-case
  E/K), all_to_all_load_imbalance_aware (default off), and num_worst_case_experts,
  plus get_effective_expert_tokens(), the effective-factor property, and a
  model validator.
- create_all_to_all_op: add receiver_skew to scale the bandwidth-bound ICI time;
  _all_to_all_receiver_skew() derives the dispatch/combine incast skew under
  expert load imbalance.
- create_ffn_deepseek_moe: replace the per-expert loop with a worst-case-device
  model (W hot experts + remaining experts) and apply the all-to-all skew to the
  dispatch/combine exchanges.
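
A minimal sketch of how the `-1.0` sentinel and the effective per-expert token count could fit together. It assumes the factor multiplies the balanced per-expert share `T*K/E`, so the `E/K` worst case sends every token to a hot expert. The function names, the constant name, and the formula are illustrative assumptions, not NeuSim's actual code.

```python
WORST_CASE_SENTINEL = -1.0  # hypothetical name for the -1.0 sentinel


def effective_imbalance_factor(factor: float, num_experts: int, top_k: int) -> float:
    """Resolve the sentinel to the worst-case factor E/K; pass other values through."""
    if factor == WORST_CASE_SENTINEL:
        return num_experts / top_k
    return factor


def hot_expert_tokens(num_tokens: int, num_experts: int, top_k: int, factor: float) -> float:
    """Tokens routed to one hot expert, in real (un-floored) units.

    Assumption: the factor scales the balanced per-expert share T*K/E.
    """
    balanced_share = num_tokens * top_k / num_experts
    return effective_imbalance_factor(factor, num_experts, top_k) * balanced_share


# Example values (E=256, K=8, T=4096) are chosen for illustration only.
print(hot_expert_tokens(4096, 256, 8, 1.0))   # 128.0: the balanced share
print(hot_expert_tokens(4096, 256, 8, -1.0))  # 4096.0: every token hits the hot expert
```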

Diverge intentionally from the trace_util source by computing the skew ratio and
the remaining-expert token split in real (un-floored) units, applying the >=1
floor only to the matmul seqlen. This fixes a decode/small-token over-inflation
(balanced load reported skew 32x at T=1; decode FFN modeled every expert active
instead of the real routed count). Prefill / large-T behavior is unchanged.
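
One way the over-inflation could arise, sketched under assumptions: if per-expert token counts are floored to 1 before the skew ratio is taken, a balanced load at T=1 looks like every expert on the device received a token. With E=256 and K=8 (illustrative values, not stated in the commit) the floored ratio comes out at E/K = 32, while the un-floored ratio stays at 1. The helper names and the exact ratio definition below are hypothetical.

```python
import math


def matmul_seqlen(real_tokens: float) -> int:
    """Apply the >=1 floor only where a matmul shape needs a positive integer."""
    return max(1, math.ceil(real_tokens))


def balanced_device_skew(num_tokens: int, num_experts: int, top_k: int,
                         ep: int, floor_first: bool) -> float:
    """Busiest-device tokens over mean per-device tokens under balanced routing."""
    per_expert = num_tokens * top_k / num_experts
    if floor_first:
        # Assumed old behaviour: floor each expert's share before taking the ratio.
        per_expert = max(1.0, per_expert)
    device_tokens = per_expert * (num_experts / ep)
    mean_device_tokens = num_tokens * top_k / ep
    return device_tokens / mean_device_tokens


# Decode-like case, T=1: flooring first inflates a balanced load.
print(balanced_device_skew(1, 256, 8, 8, floor_first=True))      # 32.0
print(balanced_device_skew(1, 256, 8, 8, floor_first=False))     # 1.0
# Prefill-like case, T=4096: the floor never triggers, so both agree.
print(balanced_device_skew(4096, 256, 8, 8, floor_first=True))   # 1.0
# The matmul shape still gets a positive integer sequence length.
print(matmul_seqlen(1 * 8 / 256))                                # 1
```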

Default config (flags off) leaves the all-to-all latency identical; only the
DeepSeek expert-compute model changes. Regression vs HEAD: DeepSeek decode
~-9% (de-inflation), prefill ~flat; all non-DeepSeek experiments byte-identical.

Adds 22 tests (15 in test_moe_routing.py, 7 MoELLMConfig tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
Four small, behavior-preserving fixes from the xhigh review (no shipped-config
output changes; regression vs the migration commit matches across all 7
experiments):

1. MoELLMConfig.__hash__: include expert_load_imbalance_factor,
   all_to_all_load_imbalance_aware, and num_worst_case_experts so configs that
   generate different op graphs no longer collide on the config hash.

2. Unify the worst-case-device expert count between the skew model and the
   compute model via _num_experts_on_worst_case_device() = ceil(E/EP). Previously
   the skew used E/EP (float) and the compute used E//EP (floor): they disagreed
   when EP did not divide E, dropped the remainder experts, and modeled ZERO MoE
   compute when EP > E. ceil is also the correct count for the busiest device.

3. _validate_expert_load_imbalance_factor: guard num_routed_experts <= 0 and
   num_activated_routed_experts_per_token <= 0 with a clear ValueError before the
   E/K division, instead of a bare ZeroDivisionError at construction.

4. Clarify the num_worst_case_experts docstring: W only sets how many experts
   are hot; the per-hot-expert load is governed by expert_load_imbalance_factor,
   so W=K yields the documented absolute worst case only when f is also at E/K.
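
The disagreement described in item 2 can be reproduced with a few lines. This is an illustrative stand-in for `_num_experts_on_worst_case_device()`, written with integer arithmetic; the example E and EP values are assumptions.

```python
def experts_on_worst_case_device(num_experts: int, ep: int) -> int:
    """ceil(E/EP) without going through floats."""
    return (num_experts + ep - 1) // ep


# Columns: E, EP, float E/EP (old skew model), floor E//EP (old compute model), ceil.
for num_experts, ep in [(256, 8), (256, 6), (8, 16)]:
    print(num_experts, ep, num_experts / ep, num_experts // ep,
          experts_on_worst_case_device(num_experts, ep))
```

When EP divides E all three counts agree (32). For E=256, EP=6 the floor gives 42 and the ceiling 43, and for E=8, EP=16 the floor gives 0 experts, which is the zero-compute case the commit mentions, while the ceiling gives 1.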

Adds 6 tests (hash distinctness, K=0/E=0 guard, ceil helper edges incl. EP>E,
and an EP-indivisible skew anchor).
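
A rough sketch of what items 1 and 3 amount to, using a plain frozen dataclass as a stand-in for the real `MoELLMConfig` (which the commit says uses a model validator). The class name, field defaults, and error messages here are placeholders, not the project's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfigSketch:
    num_routed_experts: int
    num_activated_routed_experts_per_token: int
    expert_load_imbalance_factor: float = -1.0  # sentinel: resolve to E/K
    num_worst_case_experts: int = 1             # placeholder default

    def __post_init__(self) -> None:
        # Fail with a clear message before any E/K division can happen.
        if self.num_routed_experts <= 0:
            raise ValueError("num_routed_experts must be > 0")
        if self.num_activated_routed_experts_per_token <= 0:
            raise ValueError("num_activated_routed_experts_per_token must be > 0")

    def __hash__(self) -> int:
        # Every field that changes the generated op graph is part of the hash.
        return hash((
            self.num_routed_experts,
            self.num_activated_routed_experts_per_token,
            self.expert_load_imbalance_factor,
            self.num_worst_case_experts,
        ))


balanced = MoEConfigSketch(256, 8, expert_load_imbalance_factor=1.0)
worst = MoEConfigSketch(256, 8)  # sentinel factor
print(hash(balanced) != hash(worst))

try:
    MoEConfigSketch(256, 0)
except ValueError as err:
    print(err)
```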

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
Remove the all_to_all_load_imbalance_aware opt-in flag and always apply the
dispatch/combine receiver skew for MoE. The skew still degrades to 1.0 when
expert parallelism is off or the load is balanced (expert_load_imbalance_factor
= 1.0); the default factor sentinel (-1.0) resolves to the E/K worst case, so
the all-to-all path now matches the compute path's default instead of silently
assuming a balanced exchange.

- MoELLMConfig: drop the all_to_all_load_imbalance_aware field (and its hash
  entry); _all_to_all_receiver_skew no longer gates on it.
- Update docstrings/comments and tests accordingly.

Regression vs the prior branch HEAD: change is confined to DeepSeek. Prefill
(bandwidth-bound) dispatch/combine all-to-all rises with the now-applied skew
(EP=2 ~1.16x, EP=4 ~1.48x) -> TTFT ~+1.5%; decode (latency-bound) is unchanged;
all non-DeepSeek experiments are byte-identical.
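
Why a receiver skew moves prefill but not decode can be illustrated with a simple roofline-style model. The `max(latency, bandwidth_time)` form, the function name, and all numbers below are assumptions made for illustration; the commit only states that `receiver_skew` scales the bandwidth-bound ICI time.

```python
def all_to_all_time(bytes_per_device: float, bandwidth: float, latency: float,
                    receiver_skew: float = 1.0) -> float:
    """Hypothetical all-to-all time: the skew scales only the bandwidth term."""
    bandwidth_time = receiver_skew * bytes_per_device / bandwidth
    return max(latency, bandwidth_time)


BANDWIDTH = 1e11  # bytes/s, illustrative
LATENCY = 1e-5    # seconds, illustrative

# Prefill-like exchange: large payload, bandwidth-bound, so the skew shows up.
prefill_base = all_to_all_time(1e9, BANDWIDTH, LATENCY)
prefill_skewed = all_to_all_time(1e9, BANDWIDTH, LATENCY, receiver_skew=1.48)
print(prefill_skewed / prefill_base)

# Decode-like exchange: tiny payload, latency-bound, so the skew is invisible.
decode_base = all_to_all_time(1e4, BANDWIDTH, LATENCY)
decode_skewed = all_to_all_time(1e4, BANDWIDTH, LATENCY, receiver_skew=1.48)
print(decode_skewed == decode_base)
```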

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
@XZman XZman marked this pull request as ready for review June 22, 2026 22:35
@XZman XZman merged commit 58fbd3f into main Jun 22, 2026
6 checks passed
@XZman XZman deleted the moe-expert-routing-load-imbalance branch June 22, 2026 22:39