Support more MoE expert routing load imbalance patterns #6
Merged
Port the MoE expert load-imbalance features into NeuSim's DeepSeek path:

- MoELLMConfig: add expert_load_imbalance_factor (-1.0 sentinel -> worst-case E/K), all_to_all_load_imbalance_aware (default off), and num_worst_case_experts, plus get_effective_expert_tokens(), the effective-factor property, and a model validator.
- create_all_to_all_op: add receiver_skew to scale the bandwidth-bound ICI time; _all_to_all_receiver_skew() derives the dispatch/combine incast skew under expert load imbalance.
- create_ffn_deepseek_moe: replace the per-expert loop with a worst-case-device model (W hot experts + remaining experts) and apply the all-to-all skew to the dispatch/combine exchanges.

Diverge intentionally from the trace_util source by computing the skew ratio and the remaining-expert token split in real (un-floored) units, applying the >=1 floor only to the matmul seqlen. This fixes a decode/small-token over-inflation (balanced load reported skew 32x at T=1; decode FFN modeled every expert active instead of the real routed count). Prefill / large-T behavior is unchanged.

Default config (flags off) leaves the all-to-all latency identical; only the DeepSeek expert-compute model changes.

Regression vs HEAD: DeepSeek decode ~-9% (de-inflation), prefill ~flat; all non-DeepSeek experiments byte-identical.

Adds 22 tests (15 in test_moe_routing.py, 7 MoELLMConfig tests).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
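For readers following the un-floored change, here is a minimal sketch of how the -1.0 sentinel and the ">=1 floor only on the matmul seqlen" rule could interact. The function names, the `SENTINEL` constant, and the E=256, K=8 values are illustrative assumptions, not the NeuSim code; the sketch only shows one way a floor applied before the ratio can turn a balanced load into a 32x skew at T=1.

```python
import math

SENTINEL = -1.0  # hypothetical name for the "use worst case" marker


def effective_imbalance_factor(factor: float, num_experts: int, top_k: int) -> float:
    """Resolve the -1.0 sentinel to the worst-case factor E/K."""
    if factor == SENTINEL:
        return num_experts / top_k
    return factor


def hot_expert_tokens(num_tokens: int, num_experts: int, top_k: int,
                      factor: float = SENTINEL) -> float:
    """Tokens routed to a hot expert, in real (un-floored) units."""
    mean_tokens = num_tokens * top_k / num_experts
    return mean_tokens * effective_imbalance_factor(factor, num_experts, top_k)


def matmul_seqlen(tokens: float) -> int:
    """The >=1 floor is applied only here, when sizing the matmul."""
    return max(1, math.ceil(tokens))


def skew_ratio(hot_tokens: float, mean_tokens: float, floor_first: bool) -> float:
    """Hot-to-mean ratio; floor_first=True mimics flooring before dividing."""
    numerator = max(1.0, hot_tokens) if floor_first else hot_tokens
    return numerator / mean_tokens


# Balanced load (factor 1.0) at T=1 with assumed E=256, K=8.
mean = 1 * 8 / 256
hot = hot_expert_tokens(1, 256, 8, factor=1.0)
print(skew_ratio(hot, mean, floor_first=True))   # inflated ratio
print(skew_ratio(hot, mean, floor_first=False))  # balanced ratio
print(matmul_seqlen(hot))                        # matmul still sees >= 1 token
```

With the default sentinel and a large T, the hot expert receives all T tokens in this sketch, which is consistent with the note that prefill / large-T behavior is unchanged.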
Four small, behavior-preserving fixes from the xhigh review (no shipped-config output changes; regression vs the migration commit matches across all 7 experiments):

1. MoELLMConfig.__hash__: include expert_load_imbalance_factor, all_to_all_load_imbalance_aware, and num_worst_case_experts so configs that generate different op graphs no longer collide on the config hash.
2. Unify the worst-case-device expert count between the skew model and the compute model via _num_experts_on_worst_case_device() = ceil(E/EP). Previously the skew used E/EP (float) and the compute used E//EP (floor): they disagreed when EP did not divide E, dropped the remainder experts, and modeled ZERO MoE compute when EP > E. ceil is also the correct count for the busiest device.
3. _validate_expert_load_imbalance_factor: guard num_routed_experts <= 0 and num_activated_routed_experts_per_token <= 0 with a clear ValueError before the E/K division, instead of a bare ZeroDivisionError at construction.
4. Clarify the num_worst_case_experts docstring: W only sets how many experts are hot; the per-hot-expert load is governed by expert_load_imbalance_factor, so W=K yields the documented absolute worst case only when f is also at E/K.

Adds 6 tests (hash distinctness, K=0/E=0 guard, ceil helper edges incl. EP>E, and an EP-indivisible skew anchor).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
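Fixes 2 and 3 can be illustrated with a short sketch. The helper names below are stand-ins chosen for this example (the real helpers are private and may differ in signature); the point is only the floor-versus-ceil behavior at the edges and the explicit guard before the E/K division.

```python
def num_experts_on_worst_case_device(num_experts: int, ep_degree: int) -> int:
    """ceil(E/EP) using integer arithmetic, so the busiest device never gets 0."""
    if ep_degree <= 0:
        raise ValueError("ep_degree must be positive")
    return -(-num_experts // ep_degree)


def worst_case_factor(num_routed_experts: int, num_activated_per_token: int) -> float:
    """E/K, with a clear ValueError instead of a bare ZeroDivisionError."""
    if num_routed_experts <= 0:
        raise ValueError("num_routed_experts must be positive")
    if num_activated_per_token <= 0:
        raise ValueError("num_activated_routed_experts_per_token must be positive")
    return num_routed_experts / num_activated_per_token


# EP does not divide E: floor drops the remainder experts, ceil keeps them.
print(256 // 3, num_experts_on_worst_case_device(256, 3))
# EP > E: floor models zero experts on the device, ceil models one.
print(8 // 16, num_experts_on_worst_case_device(8, 16))
```

When EP divides E evenly, floor and ceil agree, which matches the note that shipped-config outputs do not change.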
Remove the all_to_all_load_imbalance_aware opt-in flag and always apply the dispatch/combine receiver skew for MoE. The skew still degrades to 1.0 when expert parallelism is off or the load is balanced (expert_load_imbalance_factor = 1.0); the default factor sentinel (-1.0) resolves to the E/K worst case, so the all-to-all path now matches the compute path's default instead of silently assuming a balanced exchange.

- MoELLMConfig: drop the all_to_all_load_imbalance_aware field (and its hash entry); _all_to_all_receiver_skew no longer gates on it.
- Update docstrings/comments and tests accordingly.

Regression vs the prior branch HEAD: change is confined to DeepSeek. Prefill (bandwidth-bound) dispatch/combine all-to-all rises with the now-applied skew (EP=2 ~1.16x, EP=4 ~1.48x) -> TTFT ~+1.5%; decode (latency-bound) is unchanged; all non-DeepSeek experiments are byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y8Ju7ry5zebVoCNLAKBPT8
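A rough sketch of why the skew moves prefill but not decode, under stated assumptions. The commit does not spell out the skew formula or the timing model, so the raw skew is taken as an input here, and the max-of-latency-and-bandwidth cost model is an assumption made for illustration only; both function names are hypothetical.

```python
def gated_receiver_skew(raw_skew: float, ep_degree: int, imbalance_factor: float) -> float:
    """Degrade to 1.0 when expert parallelism is off or the load is balanced."""
    if ep_degree <= 1 or imbalance_factor == 1.0:
        return 1.0
    return max(1.0, raw_skew)


def all_to_all_time(latency_s: float, bandwidth_s: float, receiver_skew: float = 1.0) -> float:
    """Assumed model: only the bandwidth-bound term is scaled by the skew."""
    if receiver_skew < 1.0:
        raise ValueError("receiver_skew must be >= 1.0")
    return max(latency_s, bandwidth_s * receiver_skew)


# The ~1.48x figure for EP=4 is taken from the regression note above.
skew = gated_receiver_skew(1.48, ep_degree=4, imbalance_factor=32.0)

# Latency-bound exchange (decode-like): the skew does not change the result.
print(all_to_all_time(latency_s=10e-6, bandwidth_s=2e-6, receiver_skew=skew))
# Bandwidth-bound exchange (prefill-like): the time scales with the skew.
print(all_to_all_time(latency_s=1e-6, bandwidth_s=10e-6, receiver_skew=skew))
```

Under this assumed model, a latency-bound exchange stays at its latency term as long as the skewed bandwidth term remains below it, which is one way to read "decode (latency-bound) is unchanged".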
No description provided.