Reduce MoE activation memory in DualPipeV#16
Merged
haok1402 merged 1 commit into mlc-ai:main on Apr 13, 2026
Conversation
Code Review
This pull request implements significant memory optimizations and architectural enhancements for the DualPipeV pipeline, including a fused cross-entropy Triton kernel, a memory-efficient padded index gather operator, and a balanced layer partitioning algorithm. It also introduces manual storage management and deferred tensor freeing to reduce peak memory consumption. Review feedback highlights critical safety issues where early storage resizing or deferred freeing could cause crashes when expert parallelism is disabled, and suggests performance refinements for the Triton kernels.
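The fused cross-entropy kernel avoids materializing and saving the full logits tensor by computing the loss in one pass. The math it fuses is the numerically stable form `loss = logsumexp(logits) - logits[target]`. The pure-Python sketch below illustrates that computation only; it is not the Triton kernel from this PR.

```python
import math

def stable_cross_entropy(logits, target):
    """Cross-entropy for a single token in the numerically stable form
    loss = logsumexp(logits) - logits[target].

    A fused kernel computes this (and, for backward, softmax(logits)
    minus the one-hot target) in a single pass over the logits row,
    so the logits never need to be saved for the backward pass.
    Pure-Python illustration of the math, not the Triton implementation.
    """
    m = max(logits)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]
```

With logits `[1.0, 2.0, 3.0]` and target index 2, this returns `log(e^-2 + e^-1 + 1) ≈ 0.4076`.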
Collaborator
The cross-entropy loss with a 4k sequence looks correct after a couple of steps from the released checkpoint.
haok1402 approved these changes on Apr 13, 2026
dbf6819 to a0e7d44
- Remove EpilogOuts/epilog_b (dead code; logits not needed after loss)
- Remove Stage2/Stage4 args and outs (only ctx needed for a2a backward)
- Add padded_index_gather: avoids saving input for backward and reduces CUDA allocator fragmentation via fixed-alignment output buffers
- Free stage-boundary tensor storage early via untyped_storage().resize_(0) after async all-to-all consumers complete (sorted_tokens after Stage 2, gathered_tokens after Stage 3, Stage 3 moe_outs after Stage 4)
- Add fwd_comm_deferred_free to ExecutionCtx for clean deferred freeing in the overlapped path without changing stage function signatures
- Add layer_partition() for memory-aware pipeline stage assignment
- Add per-layer saved-tensor profiling with weight/activation distinction

Co-Authored-By: Hao Kang <89672451+haok1402@users.noreply.github.com>
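The commit does not show the body of `layer_partition()`, only that it does memory-aware pipeline stage assignment. A standard way to balance contiguous layer groups is to minimize the largest per-stage memory sum; the sketch below shows that approach (binary search over the bottleneck value) under the assumption that per-layer costs come from the saved-tensor profiling also added in this commit. All names and the algorithm choice here are illustrative, not the PR's actual implementation.

```python
def layer_partition(costs, num_stages):
    """Split per-layer memory costs into `num_stages` contiguous groups,
    minimizing the largest group sum (balanced partition).

    Hypothetical sketch of a memory-aware stage assignment; the real
    layer_partition() in this PR may use a different strategy.
    Returns a list of (start, end) half-open layer ranges.
    """
    def groups_needed(limit):
        # Greedily pack layers left to right under `limit`.
        groups, current = 1, 0
        for c in costs:
            if c > limit:
                return float("inf")  # a single layer exceeds the limit
            if current + c > limit:
                groups, current = groups + 1, c
            else:
                current += c
        return groups

    # Binary search the smallest per-stage memory bound that fits.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if groups_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid + 1

    # Recover the stage boundaries under the optimal bound `lo`.
    bounds, start, current = [], 0, 0
    for i, c in enumerate(costs):
        if current + c > lo:
            bounds.append((start, i))
            start, current = i, c
        else:
            current += c
    bounds.append((start, len(costs)))
    return bounds
```

For example, `layer_partition([4, 2, 3, 5, 2], 2)` yields `[(0, 3), (3, 5)]`, with a worst-stage cost of 9 instead of the 16 a single stage would carry.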
a0e7d44 to bdd26c9
haok1402 pushed a commit that referenced this pull request on Apr 24, 2026
Route the two GPT-OSS expert GEMMs (gate_up_proj, down_proj) through GroupLinearFunc.apply instead of F.grouped_mm, so the wgrad-delay path added in #28 applies to gpt-oss too. The per-expert bias add stays on the caller side (bias[group_ids]) rather than being folded into the autograd Function: index_add_ for bgrad is cheap compared to the grouped_mm wgrad, so deferring it buys little but complicates both the Function signature and the gpt-oss call sites.

Add test_gpt_oss_experts_weight_grad_store_matches_direct as an integration check: it runs GptOssExperts with WeightGradStore on vs off and asserts that forward output and input/weight/bias gradients all match (weights tightly through deterministic grouped_mm, biases with ~5% bf16 slack since CUDA index_add_ is non-deterministic). It also verifies that gate_up_proj/down_proj grads are deferred before flush/pop while the bias grads remain eager, documenting the split.

Drive-by: fix the stale assertion in test_scatter_for_grouped_gemm that was left over from #16. The scatter output has been rounded up to _GEMM_ALLOC_ALIGNMENT (for CUDA allocator locality) since that commit and the tail rows are zeroed by the kernel, but the test still required out.shape[0] == offs[-1]. Replace it with the actual contract: shape is at least offs[-1], aligned to _GEMM_ALLOC_ALIGNMENT, with the over-allocated tail all-zero.
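The corrected contract for the scatter output hinges on rounding the row count up to a multiple of _GEMM_ALLOC_ALIGNMENT. A minimal sketch of that rounding and the three-part check described above (the alignment value of 16 is an assumption for illustration, not the PR's actual constant):

```python
_GEMM_ALLOC_ALIGNMENT = 16  # assumed value for illustration only

def aligned_rows(n_rows: int, alignment: int = _GEMM_ALLOC_ALIGNMENT) -> int:
    """Round n_rows up to the next multiple of `alignment`.

    Mirrors the shape contract described above: the scatter output has
    at least `offs[-1]` rows, its row count is a multiple of the
    alignment, and the over-allocated tail rows are zero.
    """
    return -(-n_rows // alignment) * alignment

def check_scatter_contract(out_rows, last_off, alignment=_GEMM_ALLOC_ALIGNMENT):
    """Return True if out_rows satisfies the rounded-up shape contract."""
    return out_rows >= last_off and out_rows % alignment == 0
```

For example, with `offs[-1] == 17` and alignment 16, the output carries 32 rows and rows 17..31 are expected to be all-zero.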