# OS INSPIRED COMPLETE KERNEL FUSION

### A Thesis

Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Master of Science

by Osayamen Jonathan Aimuyo August 2025 © 2025 Osayamen Jonathan Aimuyo ALL RIGHTS RESERVED

#### **ABSTRACT**

Distributed Machine Learning (DML) is increasingly recognized as a communication-bound workload, with most existing work aiming to alleviate this bottleneck through communication-computation overlap. However, current approaches—largely reliant on CPU-managed operator scheduling and synchronous communication collectives—leave significant performance on the table.

This thesis identifies, analyzes and addresses the inefficiencies arising from the interplay between CPU-driven collective communication and highly parallel GPU computation. We demonstrate that the standard bulk-synchronous communication model underutilizes GPU interconnect bandwidth and is especially vulnerable to straggler-induced performance degradation. Focusing on dynamic workloads such as Mixture-of-Experts (MoE), we highlight two critical bottlenecks. First, we observe *payload inefficiency* at the application layer, where existing collective primitives force unnecessary padding of GPU network buffers. Second, we expose how CPU-driven execution limits the exploitation of *task locality* and introduces artificial synchronization barriers across distributed GPU tasks.

To overcome these limitations, we propose a model of *complete GPU residency*, where inter-GPU communication is integrated directly into GPU kernels. We realize this vision in *FlashDMoE*: a persistent, in-kernel, actor-style operating system with packet switching that enables complete operator fusion for Distributed MoE (DMoE) into a *single kernel*, the first of its kind. *FlashDMoE* features a modular, message-driven architecture that supports lockless execution

across tens of thousands of GPU threads and across distributed GPUs as well. We demonstrate how *FlashDMoE* addresses the all-to-all communication bottleneck in expert parallelism and enables high-throughput, GPU-initiated communication. Evaluated against state-of-the-art distributed MoE frameworks, *FlashDMoE* achieves up to 15× speedup, establishing a new, scalable design methodology for distributed GPU systems.

### **BIOGRAPHICAL SKETCH**

Osayamen Jonathan Aimuyo is a computer science researcher specializing in distributed and parallel computing systems and algorithms. His work focuses on low-level optimizations for computational and communication bottlenecks in large-scale machine learning training and inference on accelerators. Jonathan earned a BS in computer engineering, summa cum laude, with Tau Beta Pi and Phi Kappa Phi Honors from the University of Texas at Dallas (class of 2023), where he was a Presidential Achievement Scholar and a semifinalist for the national Jack Kent Cooke Transfer Scholarship (2019). He is currently a second-year CS MS student at Cornell University, advised by Dr. Rachee Singh. While at Cornell, he received an Honorable Mention recognition for the National Science Foundation Graduate Research Fellowships Program (NSF GRFP 2025). In industry, he has contributed to large-scale distributed systems through internships at Microsoft ('22 - '24), Chime Financial ('22), and JPMorgan Chase ('20 -'21). After graduating from Cornell, he will intern at NVIDIA with the CUDA Math Libraries team, before beginning his PhD in Computer Science at Stanford University in Fall 2025.



#### **ACKNOWLEDGEMENTS**

First and foremost, I thank God, the Author of all things good, for sustaining me through every moment of this journey. He remained faithful in seasons of clarity and in shadows of doubt. Whatever merit this work holds is but a fragment of the grace I have received.

To my advisor, **Dr. Rachee Singh**, thank you for the trust, the freedom, and the unwavering support! You gave me the space to think creatively, and your mentorship has been a true gift. I am especially grateful for your generosity, both professionally and personally. Your unique willingness to support your students holistically has helped me blossom as a well-rounded researcher. I also appreciate the lab resources (especially the GPUs) you made readily accessible. Much of the preliminary exploration and hypothesis testing for this research took place on them, and they contributed immense value to this thesis.

To my minor field member, **Dr. Giulia Guidi**, thank you for being an excellent mentor and for your generosity with time and resources. Your provision of high-performance computing infrastructure made a tangible difference in this work, and I recognize the immense privilege it is to have had access to such powerful systems.

To my second minor field member, **Dr. Chris De Sa**, I am deeply grateful for the opportunity to have learned Machine Learning Systems (MLSys) from you. Your passion and enthusiasm for the field is infectious, and your teaching is exceptional. If I could take your MLSys class again, I would in a heartbeat!

To the **Christ Chapel Choir ('24/25)**, thank you for being my second family away from home. I joined you not knowing how to read sheet music, yet you welcomed me warmly and gave me room to grow rapidly. That kindness means more to me than I can express. One of my fondest memories was singing Han-

del's *Hallelujah Chorus* together; it was both exhilarating and quite fun! I am especially thankful to Dr. Art Ostrander (Choir Director) for his technical precision and leadership; Carrie Ostrander (Soprano); Dr. Deborah Martin (Alto and pianist); Jyying Juliana Kan (Soprano); Margaret Brodhead (Pianist); Jim and Cindy Van Duren (Bass and Soprano); Amy Blumenthal (Alto); and Debbie Axtell (Alto). You have all helped shape the slightly more competent musician and vocalist I am today.

To my colleagues at Cornell who made this journey richer: the CS MS classes of '24 and '25, Dr. Singh's research group, Julian Bellavita, and the broader systems community, thank you for the fun events, collaboration, and conversation.

Finally, to my parents—your quiet strength and unwavering support have been immeasurably valuable to all my academic endeavors. I would not be here without you. Thank you.

## TABLE OF CONTENTS

|   | _    | graphical Sketch                                       |     |
|---|------|--------------------------------------------------------|-----|
|   |      |                                                        | V   |
|   |      |                                                        | vii |
|   |      | of Tables                                              | i۶  |
|   |      | of Figures                                             | λ   |
|   | List | or rigures                                             | ^   |
| 1 | Intr | oduction                                               | 1   |
|   | 1.1  | Computational Cost Equivalence                         | 2   |
|   | 1.2  | Communication Overheads in Distributed MoE             | 3   |
|   | 1.3  | Kernel Launch Overheads in Distributed MoE             | 4   |
|   | 1.4  | This work's Contributions: DMoE in a single kernel     | 5   |
|   |      | 1.4.1 Warp Specialization and Tile Parallelism         | 6   |
|   |      | 1.4.2 Asynchronous and payload-efficient communication | 7   |
|   |      | 1.4.3 Technical Challenges                             | 8   |
|   |      | 1.4.4 Research Papers                                  | 8   |
|   |      |                                                        |     |
| 2 | Rela | ated Work                                              | 9   |
| 3 | Mot  | rivation                                               | 10  |
|   | 3.1  | Straggler Effect                                       | 10  |
|   | 3.2  |                                                        | 12  |
| 4 | Met  | hod                                                    | 13  |
|   | 4.1  | Actor Model                                            | 14  |
|   | 4.2  | Inter-Actor Interactions                               | 15  |
|   | 4.3  | Tiling                                                 | 16  |
| 5 | Prog | gramming Abstractions                                  | 20  |
|   | 5.1  |                                                        | 20  |
|   | 5.2  |                                                        | 20  |
|   | 5.3  | Symmetric Tensor Layout for Inter-GPU Communication    | 22  |
|   | 5.4  |                                                        | 26  |
| 6 | Eva  | luation                                                | 28  |
|   | 6.1  | Setup                                                  | 28  |
|   | 6.2  | Scalability with Tokens and Experts                    | 29  |
|   | 6.3  | ,                                                      | 30  |
|   | 6.4  | J                                                      | 31  |
|   | 6.5  |                                                        | 32  |
|   | 6.6  |                                                        | 32  |
|   |      |                                                        | 32  |
|   |      | 6.6.2 Results                                          | 33  |

|    | <ul><li>6.7 Memory Overhead</li></ul> |    |
|----|---------------------------------------|----|
| 7  | Limitations and Future Work           | 36 |
| 8  | Conclusion                            | 38 |
| Bi | ibliography                           | 39 |
| A  | FP16 Memory Throughput                | 46 |

## LIST OF TABLES

| 1.1 | <b>Kernel Fusion Comparison.</b> Our method is the first to fully fuse                                        |    |
|-----|---------------------------------------------------------------------------------------------------------------|----|
|     | the DMoE layer into a single GPU kernel. We report GPU operations from detailed profiling with Nsight Systems | 4  |
| 6.1 | Implementation metrics of <i>FlashDMoE</i> using fully inlined NVSHMEM                                        | 28 |
| 6.2 | Memory overhead (tile size $bM = 128$ , $Size(T) = Tokens * 4KB$ ).                                           | 34 |
| 6.3 | Latency (ms) comparison across different numbers of tokens (Figure 6.1a)                                      | 35 |
| 6.4 | Latency (ms) comparison across different numbers of experts (Figure 6.1b)                                     | 35 |
| 6.5 | Latency (ms) comparison across different numbers of GPUs (Figure 6.2).                                        | 35 |

## LIST OF FIGURES

| 1.1 | Transformer blocks (a) without MoE, (b) with MoE, and (c) with distributed MoE and expert parallelism. T, E, and O represent input tokens, experts, and output activations, respectively                                                                                                                                                                                                                                                                                                  | 1  |
|-----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 1.2 | Comparing <i>FlashDMoE</i> with state-of-the-art techniques that either do not overlap communication and computation (left, top) or do some overlap (left, middle). <i>FlashDMoE</i> is a persistent kernel that fuses all computation and communication of the MoE operator (left, bottom). <i>FlashDMoE</i> implements device-initiated computation (gate, expert FFN, scale) and communication tasks (right).                                                                          | Δ  |
| 1.3 | FlashDMoE Actors. The Subscriber observes received remote packets (blob of tiles), decodes them into tasks and notifies the scheduler of the enqueued tasks. We assign three CUDA warps to this role. The Scheduler maintains a ready queue of Processors from which it schedules tasks. The Processors performs the operation encoded in its task descriptor and eagerly communicates                                                                                                    |    |
| 1.4 | its output, if necessary                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 7  |
| 3.1 | Straggler effect of synchronous ALLTOALL. $M \times N$ A100 or V100 denotes $N$ GPUs within a node across $M$ nodes. Every GPU communicates with every other GPU per ALLTOALL step. We capture the distribution of delay induced by stragglers across many steps. <b>Actual Time</b> $t_a$ denotes the fastest kernel execution time across all GPUs, conversely <b>Total Time</b> $t$ is the maximum recorded step time, while $Delay$ is the maximum difference between $t$ and $t_a$ . | 10 |
| 3.2 | Overlapped Schedule (bottom) showing how idle time from the sequential schedule (top) is repurposed for computation. <i>FlashDMoE</i> implements the overlapped schedule                                                                                                                                                                                                                                                                                                                  | 11 |
| 3.3 | Kernel Launch overhead (CUDA API row) juxtaposed with runtime latency. Compared to DeepEP that launches 432 kernels, <i>FlashDMoE</i> launches a single one.                                                                                                                                                                                                                                                                                                                              | 12 |
| 4.1 | FlashDMoE Fused Kernel. Green arrows demonstrate block or warp specialization.                                                                                                                                                                                                                                                                                                                                                                                                            | 13 |

| 4.2                                           | <i>GPU Memory Hierarchy</i> . The inverted pyramid (left) shows the load/store access latency [28, 1, 38]. Table above outlines the capacity for different memory tiers (for A100 GPUs). The shared memory and register capacity are static configurations for <i>FlashDMoE</i> . The right figure shows accessibility scopes: on-chip <b>registers</b> are scoped to a thread; on-chip <b>shared memory</b> is visible to all threads in a block; and off-chip <b>global memory</b> is accessible by all threads on device.  DMoE Functional Dependencies Expressed as a Chain of Actor Interactions. We denote $S_b$ , $S_h$ , and $P$ as the Subscriber, Scheduler and Processor actors, respectively. For any actor $a \in \{S_b, S_b, P\}$ , $a^i$ identifies an actor on GPU $i$ . We define $D_i^j$ as the operator, where GPU $j$ dispatches packets of tiles to GPU $i$ , This diagram expresses task dependencies at the granularity of a tile, namely $GEMM_0$ , $GEMM_1$ , combine and communication produce an output tile. | 15<br>16       |
|-----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------|
|                                               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                |
| <ul><li>5.1</li><li>5.2</li><li>5.3</li></ul> | Task Struct. TaskType $\in \{GEMM_0, GEMM_1, Combine\}$ Symmetric Tensor Layout across 2 Expert-parallel Processes State machine for DMA (top) and RDMA (bottom) communication                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | 21<br>22<br>24 |
| 6.1<br>6.2<br>6.3<br>6.4                      | Scalability with respect to the number of tokens and experts. All experiments use 8 GPUs.  Scalability with respect to the number of GPUs (E = 32, T = 8K).  Factors leading to performance improvement of <i>FlashDMoE</i> .  Multi-node Latency evaluation. Embbeding dimension is 1024 and FFN intermediate size is 4096. We define Maximal Incast Volume (MIV) as the worst case upper bound for data volume that a NIC receives in a single incast occurence.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       | 29<br>30<br>31 |
| A.1                                           | Here, we report the total GPU memory throughput for both FP16 (top) and FP32 (bottom) variants of <i>FlashDMoE</i> . Notably, the FP16 implementation issues approximately 2× more shared memory instructions compared to its FP32 counterpart under identical workloads. We attribute this inefficiency to suboptimal shared memory layouts in <i>FlashDMoE</i> when operating on half-precision data. While this bottleneck is addressable through improved layout strategies, we leave its resolution to future work due to time constraints.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         | 46             |

#### CHAPTER 1

#### **INTRODUCTION**

State-of-the-art large language models (LLMs), including DeepSeek-v3 [10], LLama4 [3], DBRX [42] and Snowflake Arctic [43], have adopted the Mixture-of-Experts (MoE) [44] architecture for its computational efficiency [41] and reliable performance across language modeling tasks [10, 3, 21].



Figure 1.1: Transformer blocks (a) without MoE, (b) with MoE, and (c) with distributed MoE and expert parallelism. T, E, and O represent input tokens, experts, and output activations, respectively.

Depicted in Figure 1.1(a), the conventional Transformer block consists of a self-attention module followed by a feed-forward network (FFN) [49]. In contrast, MoE architectures replace this single FFN with identically sized FFNs, otherwise known as experts, (Figure 1.1(b)). A trainable neural network, known as a gate function, sparsely activates these experts by dynamically routing input tokens to selected experts at runtime. This increase in model parameters (more FFNs) improves model quality without a *corresponding increase in computational cost*.

## 1.1 Computational Cost Equivalence

The preceding claim seems counterintuitive because *shouldn't the increase in the number of experts yield a proportional increase in the model's computational operations?* The answer is no, due to how tokens are *distributed* across experts in comparison to the singular FFN. For example, consider a token matrix T as defined below where S is the sequence length and H the embedding dimension.

$$T \in \mathbb{R}^{S \times H}$$

The typical FFN operator, defined below,

$$FFN(x) = W_2 \cdot \phi(xW_1 + b_1) + b_2 \tag{1.1}$$

comprises two linear transformations on learnable weight matrices  $W_1 \in \mathbb{R}^{H \times P}$ ,  $W_2 \in \mathbb{R}^{P \times H}$  each followed by additions with bias terms  $b_1 \in \mathbb{R}^{1 \times P}$ ,  $b_2 \in \mathbb{R}^{1 \times H}$  and separated by a nonlinear activation  $\phi$  (e.g., GELU [16] or ReLU [30]). Here dimension P is an intermediate projection for the FFN, typically  $P = 4 \cdot H$  [27]. If we define  $\mathcal{F}_{FFN}$  as the Floating Point OPerations (FLOPs) needed to compute a forward pass of the FFN, then using Equation 1.1 we have the resulting expression.

$$\mathcal{F}_{FFN} = \mathcal{F}_{L_0} + \mathcal{F}_{L_1} \tag{1.2}$$

where  $\mathcal{F}_{L_i}$  is the FLOPs cost for computing linear transformation i. These linear transformations are **GE**neral **M**atrix **M**ultiplications (GEMMs). We know that multiplying two matrices of sizes (M, K) and (K, N) demands 2MNK FLOPs, therefore we can expand Equation 1.2 as

$$\mathcal{F}_{FFN} = 2SHP + 2SHP = 4SHP \tag{1.3}$$

An MoE model differs from the dense transformer by *restricting* the number of tokens [25, 11] routed to an FFN (interchangeably called expert). Specifically,

for a model with  $N_e$  experts, each expert has a fixed capacity for tokens  $S_e$  defined as follows

$$S_e = \frac{S}{N_e} \tag{1.4}$$

With the above, we can compute  $\mathcal{F}_{MoE}$ . Intuitively, this quantity would be the aggregate of  $\mathcal{F}_{FFN_j}$  where  $j \in \{0, \dots, N_e - 1\}$ .

$$\mathcal{F}_{MoE} = \sum_{j=0}^{N_e - 1} \mathcal{F}_{FFN_j} \tag{1.5}$$

Observe that  $\mathcal{F}_{FFN_j}$  is derivable from Equation 1.2 by replacing S with  $S_e$ . Applying this observation and evaluating 1.5 gives the below result

$$\mathcal{F}_{MoE} = N_e \cdot 4S_e HP \tag{1.6}$$

Substituting with 1.4, yields the below which proves that the computational cost is equivalent between the MoE and dense transformer models!

$$\mathcal{F}_{MoE} = 4SHP = \mathcal{F}_{FFN} \tag{1.7}$$

This relationship presents empirically as uniform latency when  $N_e$  increases but only till a certain threshold. Exceeding this limit causes the latency to increase proportionally; existing work gives no explanation for this phenomenon, but we hypothesize GPU L1/L2 cache thrashing to be the culprit.

### 1.2 Communication Overheads in Distributed MoE

As MoE model sizes grow, GPU memory constraints prevent hosting all experts on a single device. The standard practice is to distribute experts across multiple GPUs using expert parallelism (EP), which requires the gate function to route tokens via AllToAll communication [10, 43, 42, 46]. Overall, each MoE



Figure 1.2: Comparing *FlashDMoE* with state-of-the-art techniques that either do not overlap communication and computation (left, top) or do some overlap (left, middle). *FlashDMoE* is a persistent kernel that fuses all computation and communication of the MoE operator (left, bottom). *FlashDMoE* implements device-initiated computation (gate, expert FFN, scale) and communication tasks (right).

layer executes two ALLTOALL operations during inference, introducing significant communication overhead. ALLTOALL communication is challenging to optimize on GPU networks and is highly sensitive to straggler delays—a phenomenon where a single *straggler* GPU delays all others from making progress. In practice, these communication operations can account for up to 40% of the total runtime during inference or training [26, 22].

### 1.3 Kernel Launch Overheads in Distributed MoE

| Works                        | Launched GPU Ops |  |
|------------------------------|------------------|--|
| FlashDMoE                    | 1                |  |
| COMET [55]                   | 33               |  |
| Megatron-LM CUTLASS [36, 32] | 85               |  |
| Megatron-LM TE [36, 32]      | 261              |  |
| Megatron-LM + DeepEP [10]    | 432              |  |
| DeepSpeedMoE [41]            | 550              |  |

Table 1.1: **Kernel Fusion Comparison.** Our method is the first to fully fuse the DMoE layer into a single GPU kernel. We report GPU operations from detailed profiling with Nsight Systems.

To mitigate these communication bottlenecks, recent work pipelines computation with communication tasks (Figure 1.2, middle left). However, the effectiveness of these solutions is limited by the overhead of launching many kernels from the CPU. Specifically, MoE layers interleave multiple computation kernels (such as gate and expert computations) and communication operations, forcing the CPU to launch many GPU kernels per forward pass (see Table 1.1). Frequent kernel launches negatively affect performance by: (1) creating non-deterministic kernel start times across GPUs, exacerbating straggler issues; (2) introducing unnecessary synchronization points, causing GPUs to wait on peers or the CPU before proceeding; and (3) incurring repeated global memory round trips at kernel boundaries. Although CUDA graphs [53] can partially mitigate the first issue in static workloads, they are incompatible with MoE's dynamic expert routing patterns. Addressing the remaining issues requires novel solutions, which we provide in this work through complete kernel fusion and asynchronous device-initiated communication.

## 1.4 This work's Contributions: DMoE in a single kernel

To overcome these fundamental inefficiencies in state-of-the-art MoE models, we develop *FlashDMoE*, a novel MoE architecture that integrates all computation and communication tasks into a single persistent GPU kernel, namely a kernel that remains active for the entirety of the MoE operator (Figure 1.2 bottom left). Instead of multiple kernel launches coordinated by the CPU, *FlashDMoE* requires launching only one kernel, significantly reducing the involvement of the CPU in the MoE operator. Within the fused kernel, *FlashDMoE* implements a concurrent-programming model to achieve fine-grained parallelization of com-

putation and communication tasks of the MoE operator. This design enables *FlashDMoE* to efficiently utilize GPU resources by reducing idle time.

## 1.4.1 Warp Specialization and Tile Parallelism.



Figure 1.3: *FlashDMoE* Actors. The Subscriber observes received remote *packets* (blob of tiles), decodes them into *tasks* and notifies the scheduler of the enqueued tasks. We assign three CUDA warps to this role. The Scheduler maintains a *ready queue* of Processors from which it schedules tasks. The Processors performs the operation encoded in its task descriptor and eagerly communicates its output, if necessary.

FlashDMoE implements tile-level parallelism, meaning it partitions input to-ken matrices into smaller, independent units called tiles, which are processed concurrently by GPU thread warps. These warps specialize as processors since they process input to compute the gate function and expert FFNs. A handful of warps perform specialized administrative tasks of (1) scheduling computational tasks by mapping them to warps (scheduler), and (2) communicating with other GPUs (subscriber). This design allows FlashDMoE to dynamically assign tasks to GPU warps based on warp availability and the current workload, ensuring that no warp remains idle while useful work can be done. FlashDMoE selects tile dimensions to maximize GPU arithmetic intensity and minimize register pressure.

## 1.4.2 Asynchronous and payload-efficient communication.



Figure 1.4: Depiction of a Partitioned Global Adress Space (PGAS) [54] across a *team* of **P**rocessing Elements (PEs). The symmetric heap is an unifromly sized memroy region resident on each PE and accessible by every other PE via one-sided **D**irect **M**emory **A**ccess (DMA) or **R**emote DMA (RDMA) operations.

By redesigning the MoE operator from the ground up, *FlashDMoE* resolves fundamental inefficiencies inherent in the conventional MoE execution pipeline. One inefficiency is due to workload imbalance across GPUs due to skewed popularity of experts. Existing implementations [41] indiscriminately perform ALL-TOALL communications, transferring null values to GPUs that host unpopular experts, resulting in wasted communication bandwidth and unnecessary computations on null matrices. *FlashDMoE* introduces *payload-efficient* communication by sending tokens only to GPUs with actively selected experts, thereby conserving both communication and computational resources.

## 1.4.3 Technical Challenges

Realizing the single-kernel design of *FlashDMoE* required solving several technical challenges. *FlashDMoE*'s design is a radical departure from traditional synchronous AllToAll collectives, where GPUs remain idle until the slowest GPU completes its communication. Instead, *FlashDMoE* establishes a global address space via NVSHMEM [37] across all GPUs to achieve asynchronous communication between GPUs while allowing them to continue performing useful computation tasks. To implement device-initiated computation primitives, *FlashDMoE* develops custom high-performance GEMM operations for the MoE operator in CUTLASS [48].

## 1.4.4 Research Papers

This thesis comprises a first-author publication [4] at ACM SIGMETRICS'24 and another first-author submission in review at NeurIPS '25.

#### **CHAPTER 2**

#### **RELATED WORK**

Computation-Communication Overlap. To reduce the communication overheads of synchronization in distributed DNN training, many research efforts have been focused on increasing the overlap of computation and communication. For generic Transformer-based models without MoE layers, many works [20, 52, 7, 39, 24, 47, 29, 50, 40] have provided insights and techniques to partition and schedule computation and communication operations, a imed at finer-grained overlapping. To address the challenges posed by ALLTOALL communication and expert parallelism in MoE training, Tutel [19] and Faster-MoE [14] overlap ALLTOALL with expert computation. Lancet [23] additionally enables both non-MoE computation in forward pass and weight gradient computation in backward pass to be overlapped with ALLTOALL. Despite overlapping, the performance of these approaches is limited in practice due to blocking synchronous collective communication with barriers. In contrast, FlashDMoE fundamentally eliminates the inefficiencies with asynchronous, device-initiated data transfers overlapped with tiled computation all within a single kernel, further differentiating itself from SOTA works [56, 55, 10] who also use this form of kernel-initiated communication but at a coarse-grained granularity and without complete kernel fusion.

#### CHAPTER 3

#### **MOTIVATION**

## 3.1 Straggler Effect



Figure 3.1: Straggler effect of synchronous ALLTOALL.  $M \times N$  A100 or V100 denotes N GPUs within a node across M nodes. Every GPU communicates with every other GPU per ALLTOALL step. We capture the distribution of delay induced by stragglers across many steps. Actual Time  $t_a$  denotes the fastest kernel execution time across all GPUs, conversely Total Time t is the maximum recorded step time, while Delay is the maximum difference between t and  $t_a$ .



Figure 3.2: Overlapped Schedule (bottom) showing how idle time from the sequential schedule (top) is repurposed for computation. *FlashDMoE* implements the overlapped schedule.

ALLTOALL communication as currently used in MoE frameworks is a *synchronous* collective operation among all participating GPUs. In this setting, disparities in processing speeds or kernel scheduling among workers induce a straggler effect detrimental to the collective operation's performance. Specifically, as shown in Figure 3.1, for distributed training of a 1.3B MoE model across 32 A100 GPUs, we see P95 communication performance degradation of **1.32X** when compared to the mean actual kernel time from Figure 3.1b. This performance reduction is rather tame as the underlying hardware is a supercomputer that is well-tuned against "software jitter" [33]. On the other hand, the performance loss is more severe in a single node commodity Virtual Machine (VM)

of 8 V100 GPUs with higher bandwidth, where we observe p95 performance reduction of **11X**. In line with prior work [6, 8] from the HPC community, we argue that obviating the inherent barrier in this synchronous collective communication would allow GPUs to repurpose this observed idle time for useful computation.

### 3.2 Kernel Launch Overheads



Figure 3.3: Kernel Launch overhead (CUDA API row) juxtaposed with runtime latency. Compared to DeepEP that launches 432 kernels, *FlashDMoE* launches a single one.

We compare the kernel launch overheads between *FlashDMoE* and existing baselines. Table 1.1 shows the number of kernel launches during a single forward pass: *FlashDMoE* launches exactly one persistent kernel, while the baselines require up to 550 short-lived kernels. Figure 3.3 visually compares *FlashD-MoE* and DeepEP [10] using CUDA API traces captured by NSight Systems. DeepEP exhibits numerous small CUDA API calls, while *FlashDMoE* maintains high GPU utilization by avoiding launch overhead and synchronization gaps—achieving 93.17% GPU utilization (§6) compared to 20.61% for DeepEP.

#### CHAPTER 4

#### **METHOD**



Figure 4.1: FlashDMoE *Fused Kernel*. Green arrows demonstrate block or warp specialization.

The performance of modern distributed MoE systems suffers from two primary bottlenecks: (1) frequent ALLTOALL collective communication operations on the critical execution path due to expert parallelism, and (2) significant overhead from repeatedly launching multiple computation and communication kernels on the host CPU. To overcome these limitations, we introduce *FlashDMoE*, a fully fused MoE operator implemented as a single persistent GPU kernel. Unlike previous approaches [55, 10, 41, 36, 18, 22, 15, 34, 45, 51, 9], *FlashDMoE* is the first solution to implement a *completely fused Distributed MoE kernel*, eliminating kernel launch overhead entirely by requiring only a single kernel launch (see Table 1.1).

**Algorithm 1:** FlashDMoE Distributed MoE Fused Kernel

```
Input: A, O \in \mathbb{R}^{S \times H}, E \in \mathbb{R}^{L \times H \times P}, N
1 begin
        T, G_{\phi} \leftarrow \mathbf{FusedGate}(A)
        if blockId + 1 < N then
3
            Dispatch(T, A)
4
            processor::start()
5
        else
6
            if warpID == 0 then
7
                 scheduler::start()
 8
             else
9
                 subscriber::start(E, O)
10
             end if
11
        end if
12
13 end
```

### 4.1 Actor Model

The design of FlashDMoE is based on the actor model of concurrent computation [2, 17, 12]. We implement this model by specializing GPU thread blocks and warps into three distinct actor roles: (1) **Processor** (§4), (2) **Subscriber** (§2), and (3) **Scheduler**(§3). The Processor performs compute (GEMMs and element-wise operations) and tile communication. We use CUTLASS [48] as the underlying infrastructure for high-performance BLAS routines and NVSHMEM for kernel-initiated communication [37]. The Subscriber and Scheduler perform administrative functions. Specifically, the Scheduler assigns computational tasks to available thread blocks. One key innovation is making the Scheduler both *multithreaded*, enabling high scheduling throughput, and *work-conserving*, ensuring consistently high GPU SM utilization. On the other hand, the Subscriber subscribes to data communicated over the network. Of the *N* thread blocks on a GPU, we specialize N-1 to adopt the **Processor** role. We specialize the last block for administrative functions. Within this block, we specialize three warps for the

**Subscriber** role and one warp for the **Scheduler** role. This split of thread blocks across actors is intentional: our goal is to use few resources for administrative tasks while reserving bulk of the resources for performing MoE computation tasks. Figure 4.1 summarizes the *FlashDMoE* architecture and its constituent actors, while Algorithm 1 gives a very close translation of the system in code.

### 4.2 Inter-Actor Interactions



Figure 4.2: *GPU Memory Hierarchy*. The inverted pyramid (left) shows the load/store access latency [28, 1, 38]. Table above outlines the capacity for different memory tiers (for A100 GPUs). The shared memory and register capacity are static configurations for *FlashDMoE*. The right figure shows accessibility scopes: on-chip **registers** are scoped to a thread; on-chip **shared memory** is visible to all threads in a block; and off-chip **global memory** is accessible by all threads on device.

FlashDMoE decomposes MoE computation and communication at the granularity of a tile, a statically sized partition of a tensor, to achieve parallel execution and efficient overlap of tasks. Each tile corresponds to a discrete unit of work encapsulated by a task descriptor. The **Subscriber** actor decodes these task descriptors from the remote data packets it receives. Concurrently, the **Scheduler** actor receives notifications about available tasks and batches them for execution

$$D^{j} \xrightarrow{\text{Dispatch packets}} S_{b}^{i} \xrightarrow{\text{Tasks}} S_{h}^{i} \xrightarrow{\text{Tasks}} S_{h}^{i} \xrightarrow{\text{Schedule}} P^{i} \xrightarrow{\text{Tasks}} S_{h}^{i} \xrightarrow{\text{Tasks}} S_{h}^{i} \xrightarrow{\text{Schedule}} P^{i} \xrightarrow{\text{Task}} P^{i} \xrightarrow{\text{Tile}} S_{b}^{j} \xrightarrow{\text{Tasks}} S_{h}^{j} \xrightarrow{\text{Combine}} P^{j}$$

Figure 4.3: DMoE Functional Dependencies Expressed as a Chain of Actor Interactions. We denote  $S_b$ ,  $S_h$ , and P as the Subscriber, Scheduler and Processor actors, respectively. For any actor  $a \in \{S_b, S_b, P\}$ ,  $a^i$  identifies an actor on GPU i. We define  $D_i^j$  as the operator, where GPU j dispatches packets of tiles to GPU i, This diagram expresses task dependencies at the granularity of a tile, namely  $GEMM_0$ ,  $GEMM_1$ , combine and communication produce an output tile.

by the **Processor** actors. **Processor** actors perform computations defined by the tasks, like the feed-forward network (FFN) and expert-combine operations. Figure 4.3 shows the chain of actor interactions in *FlashDMoE* that comprise DMoE.

## 4.3 Tiling

Selecting appropriate tile dimensions in *FlashDMoE* is crucial to ensure efficient GPU utilization. A small-sized tile underutilizes the GPU's tensor cores, while excessively large tiles create register pressure, causing performance-degrading register spills to local memory. After careful parameter sweeps, we choose tile dimensions of (128, 64). Our key insights are: increasing tile width raises the register usage per thread, potentially triggering costly spills; increasing tile height without adjusting thread count increases workload per thread, harming performance. Raising the thread count per block beyond our fixed value of 128 threads reduces the number of concurrent blocks, negatively affecting SM occupancy. Larger thread-block sizes also increase overhead from intra-block synchronization (*\_\_syncthreads(*) barriers), further degrading performance. Thus, our chosen tile dimensions deftly balance these constraints.

### **Algorithm 2:** Subscriber Actor: executed by three warps

```
1 begin
      interrupt \leftarrow GetSharedInterrupt()
2
      flags \leftarrow GetSymmetricFlags()
3
      tQ \leftarrow \mathbf{GetTQ}()
      // Predefined upper bound on the number of tasks.
5
      // Modulated to the actual task count computed
6
      // from dispatch signals received from peer GPUs
7
      taskBound \leftarrow GetTaskBound()
      while AtomicLoad(interrupt) == False do
          // dispatch flags
10
         do in parallel
11
             Visit dispatch flags
             Atomically retrieve signal
13
             if Signal is set and flag is not visited then
                Mark visited
15
                SelfCorrectTaskBound(taskBound, Signal)
16
                Enforce memory consistency before consuming packet
17
                Decode packet into a set of GEMM<sup>0</sup> task descriptors
18
                Write task descriptors to tQ
19
                Notify Scheduler of decoded tasks
20
             end if
21
         end
22
         Advance flags by number of dispatch flags length
23
         Atomically retrieve signal
24
         // combine signals
         do in parallel
26
             Visit combine flags: one per tile
27
             if Signal is set and flag is not visited then
28
                Mark visited
29
                Enforce memory consistency before consuming packet
30
                Decode packet into a set of combine task descriptors
31
                Write task descriptors to tQ
32
                Notify Scheduler of decoded tasks
33
             end if
34
35
         end
      end while
36
37 end
```

### **Algorithm 3:** *Scheduler Actor*: executed by one warp

```
Input: N: Number of processors
1 begin
      scheduled \leftarrow 0
2
      tTB \leftarrow 0
3
      tqS \leftarrow \{\}
4
      pTDB \leftarrow \mathbf{GetProcessorDoorbell}()
      sTDB \leftarrow \mathbf{GetSubscriberDoorbell}()
6
      taskBound \leftarrow GetTaskBound()
7
      tTB \leftarrow AtomicLoad(taskBound)
      // circular buffer ready queue
      rQ \leftarrow \{\}
10
      // Populate ready queue with Processor ids
11
      PopulateRQ(rQ)
12
      while scheduled < tTB do
13
          lt \leftarrow 0
14
          do in parallel
15
              Sweep doorbells and populate task counts into tqS
16
              Aggregate locally observed task counts into lt
17
          end
18
          qS, taskTally \leftarrow 0
19
          // qS is the inclusive output
20
          WarpInclusiveSum(lt, qS, tasktally)
21
          while tasktally > 0 do
22
              Repopulate rQ with ready processor ids
23
              do in parallel
24
                  Starting at rQ[qS] signal processors about tasks from tqS
25
              end
26
          end while
27
          if threadId == 0 then
28
              tTB \leftarrow AtomicLoad(taskBound)
29
30
          tTB \leftarrow \mathbf{WarpBroadcast}(tTB)
31
      end while
32
      InterruptSubscribers()
33
      InterruptProcessors()
35 end
```

### **Algorithm 4:** *Processor Actor*: executed by a block

```
1 begin
      tQ \leftarrow \mathbf{GetTQ}()
2
      signal \leftarrow 0
3
      // shared memory variables
4
      task \leftarrow \{\}
5
      interrupt \leftarrow False
6
      complete \leftarrow False
7
      while interrupt == False do
8
          if warpId == 0 then
             if threadId == 0 then
10
                 await Task From Scheduler ({\it interrupt}, \ {\it signal})
11
                 FencedNotifyRQ(ready)
12
             end if
13
             syncwarp()
14
             warpReadTQ(tQ, signal, task)
15
          end if
16
          syncthreads()
17
          if interrupt == False then
18
             switch task. Type do
19
                 case GEMM<sub>0</sub> do
20
                     // fused GEMM, epilogue and async tile
21
                          staging
                     \mathbf{fGET}(GEMM_0, task)
22
                     if threadId == 0 then
23
                         complete ← NotifyTileCompletion()
24
                     end if
25
                     syncthreads()
26
                     if complete == True then
                        NotifySchedulerNextGEMM(tQ)
28
                     end if
29
                 end case
30
                 case GEMM_1 do
31
                     // fused GEMM, epilogue and async tile
32
                         transfer
                     \mathbf{fGET}(GEMM_1, task)
33
                 end case
34
                 case Combine do
35
                     combine(task)
36
                 end case
37
             end switch
38
          end if
39
      end while
40
41 end
```

#### CHAPTER 5

### PROGRAMMING ABSTRACTIONS

### **5.1** Task

We describe the FFN in §1.1, so here we explicate the *combine* operation. The expert-combine operation, used in architectures like GShard [25] and DeepSeek [10], merges outputs from multiple experts by computing a weighted combination based on their affinity scores:

$$C_i = \sum_{j=1}^k g_{i,e} (5.1)$$

$$\mathbf{h}_i = \sum_{i=1}^k \frac{g_{i,e}}{C_i} \cdot \mathbf{h}_i^k \tag{5.2}$$

Above,  $i \in 0$ , S - 1 represents an input token index,  $e = E_{i,k}$  identifies the k-th expert selected for token i, and  $g_{i,e}$  is the affinity score indicating how relevant expert e is for token i.

### 5.2 Unified Abstraction

We unify the FFN and combine operations under a common abstraction called a *task*. Tasks provide a uniform interface for communicating tile-level work among Subscribers, Schedulers, and Processors. Formally, a task descriptor  $t \in \mathcal{T}$  is defined as a tuple:

$$t = (\mathcal{M}, \star, \phi)$$

where  $\mathcal{M}$  is a set of metadata (such as device ID, tile index),  $\star$  is a binary tensor operation (specifically, matrix multiplication  $\cdot$  or Hadamard product  $\odot$ ), and  $\phi$ 

is an element-wise activation function (e.g., ReLU or identity). We define a task *t* operating on input tensors *A*, *B*, *D*, producing output tensor *C*, as follows:

$$\mathcal{F}_t(A, B, C, D) := C \leftarrow \phi(A \star_t B + D) \tag{5.3}$$

The operator  $\star_t$  (instantiated from  $\star$ ) may behave differently depending on the task metadata  $\mathcal{M}$ , and the result of  $A \star_t B$  is accumulated into D. We provide an example of task metadata in Figure 5.1.

```
1 #define GEMMs 2
 2 struct __align__(16) Task {
       const byte* aData;
 4
       array<const byte*, GEMMs> bData;
 5
       array<br/>byte*, GEMMs> cData;
 6
       array<const byte*, GEMMs> dData;
 7
       byte* rcData;
 8
       uint64_t* flags;
9
       uint M;
10
       uint syncIdx;
11
       uint tileIdx;
12
       uint batchIdx;
13
       uint peerIdx;
14
       uint expertIdx;
       uint isPeerRemote;
15
16
       TaskType taskType;
       uint16_t tileSize;
17
       // Pad till 128-byte cache line
18
19
       uint padding[6] = \{\};
20 }
```

Figure 5.1: *Task Struct*. TaskType  $\in \{GEMM_0, GEMM_1, Combine\}$ 

In practice, we implement each task defined by Equation 5.3 as a *single fused* \_\_\_device\_\_ decorated function which the **Processor** (Algorithm 4) invokes at runtime. Fusion for t entails applying  $\phi$  and the succeeding addition operation to registers storing the results of the binary operator  $\star_t$ . To illustrate its flexibility, we show how the FFN and expert-combine operations can be expressed

using this task framework. Note that we omit the matrix multiplication symbol (·) for simplicity. Also,  $\phi_1$  can be any activation function, while  $\phi_2$  is the identity function. The FFN is expressed as:

$$t_1 = (\mathcal{M}, \cdot, \phi_1), \quad t_2 = (\mathcal{M}, \cdot, \phi_2),$$
 
$$\mathcal{F}_{t_1}(A, B_1, C_1, D_1) := C_1 \leftarrow \phi_1 (AB_1 + D_1),$$
 
$$\mathcal{F}_{t_2}(C_1, B_2, C_2, D_2) := C_2 \leftarrow \phi_2 (C_1B_2 + D_2).$$

Whereas, the expert-combine operation is formalized as:

$$t_3 = (\mathcal{M}, \odot, \phi_2),$$
 
$$\mathcal{F}_{t_3}(A, S, C, C) := C \leftarrow \phi_2 (A \odot S + C).$$

## 5.3 Symmetric Tensor Layout for Inter-GPU Communication



Figure 5.2: Symmetric Tensor Layout across 2 Expert-parallel Processes.

Within a single GPU device, the actors in *FlashDMoE* communicate through the GPU's memory subsystem (see Figure 4.2). Specifically, the Scheduler and Subscriber actors exchange data via fast shared memory, while other actor pairs communicate through global memory. For communication across multiple devices, *FlashDMoE* uses *device-initiated communication*, leveraging the one-sided PGAS (Partitioned Global Address Space) programming model [54]. However, achieving scalable and correct one-sided memory accesses in PGAS without costly synchronization is a known challenge [10, 56]. We address this challenge with a provably correct and scalable solution: a symmetric tensor layout *L*, supporting fully non-blocking memory accesses. We define L as:

## $L \in \mathbb{R}^{P \times R \times B \times E \times C \times H}$

where: P is the expert parallel world size, R identifies communication rounds (two rounds, one for token dispatch and one for combine), B is number of staging buffers, E is the number of local experts, C is the upscaled expert capacity (§5.4) and E is the embedding dimension. Our core insight to enable non-blocking communication is E to enable E specifically, we overprovision memory for the underlying token matrix by at least E or times, where E is the number of communication rounds in the dependency graph, and the factor of 2 accounts for separate buffers for incoming and outgoing data within each communication round. For MoE models, we have E or E at this modest increase in memory usage eliminates the need for synchronization during one-sided data transfers. Figure 5.3 illustrates how cells within this symmetric tensor layout are indexed and used for Direct Memory Access (DMA) and Remote DMA (RDMA) operations. As Theorem 5.3.1 reinforces, this indexing scheme over E is the underlying mechanism that allows for fully non-blocking



Figure 5.3: State machine for DMA (top) and RDMA (bottom) communication.

accesses eliding synchronization because all accesses are write conflict-free.

**Definition 5.3.1.** Define a write as  $w(p_s, p_t, i)$ , where  $p_s$  is the source process and i is an ordered tuple indicating the index coordinates for L residing on the target process  $p_t$ . A write-write conflict occurs when there exist at least two distinct, un-synchronized, concurrent writes  $w_1(p_{s_1}, p_{t_1}, i_1)$  and  $w_2(p_{s_2}, p_{t_2}, i_2)$ , such that  $p_{t_1} = p_{t_2}$  and index coordinates  $i_1 = i_2$  but  $p_{s_1} \neq p_{s_2}$ 

**Definition 5.3.2.** For any source process  $p_s$ , a valid index coordinate i = (p\*, r, b, e, c) satisfies the following:

- 1. For inter-device writes, it must hold that  $p*=p_s$  and b=1. Note this also applies to self-looping writes  $w(p_t, p_t, i)$ .
- 2. For any write  $w(p_s, p_t, i)$ , if b = 0, then  $p_s = p_t$ . This rule describes intra-device staging writes.

### **Theorem 5.3.1.** *L* is write-write conflict-free.

*Proof.* As is the case for typical physical implementations, assume that each index coordinate i maps to a distinct memory segment in L. Next, we show by contradiction that no write-write conflicts can exist when accessing L using valid i. For simplicity, we only include the index coordinates when describing a write. Assume that there exist at least two writes  $w_1(p_{s_1}, p_{t_1}, i_1)$ ,  $w_2(p_{s_2}, p_{t_2}, i_2)$  with  $p_{t_1} = p_{t_2}$  and valid destination coordinates  $i_1, i_2$ , where  $i_1 = i_2$  lexicographically and both are unpacked below.

$$i_1 = (p_1, r_1, b_1, e_1, c_1), i_1 = (p_2, r_2, b_2, e_2, c_2)$$

Note that for the message staging state, even though  $i_1 = i_2$  the resultant memory segments reside in different physical buffers resident in  $p_{s_1}$  and  $p_{s_2}$  respectively. Therefore, for this state, there are no conflicts as intra-process writes always have distinct  $c_j$  coordinates, where  $j \in \{0, C-1\}$ . For inter-process transfers, we have two cases.

Case 1: 
$$p_{s_1} = p_{s_2}$$

Here,  $w_1$  and  $w_2$  are identical operations. This contradicts the definition of a conflict, which requires that  $p_{s_1} \neq p_{s_2}$ . In practice, such repeat writes never even occur.

Case 2: 
$$p_{s_1} \neq p_{s_2}$$

To ensure validity for  $i_1$  and  $i_2$ , it is the case that  $p_1 = p_{s_1}$  and  $p_2 = p_{s_2}$ . However, this implies that  $i_1 \neq i_2$  yielding a contradiction as desired.

To construct L, we start from the original token buffer  $T \in \mathbb{R}^{S \times H}$ , where S is the sequence length and H is the hidden dimension. We first reorganize the sequence dimension S into three sub-dimensions representing the expert capacity (C), local expert slots (E), and the expert parallel world size (W), st:

$$C \cdot E \cdot W = C \cdot E' = S'$$
, where  $S' \ge S$  and  $E' \ge E_W$ 

In the typical case of uniform expert distribution (illustrated in Figure 5.2), we have S' = S and  $E' = E_W$ , where  $E_W$  is the total number of experts in the model. Thus, the size of the token buffer is  $Size(T) = S' \cdot H$ . In Figure 5.2, each cell labeled  $E_i$  (with  $i \in \{0, ..., 3\}$ ) is a matrix of size (C, H). Extending prior work [25, 55], we introduce additional temporal dimensions R (communication rounds) and R (staging buffers). Each communication round has two fixed staging slots: one for outgoing tokens and another for incoming tokens. Each slot, indexed by dimension P, forms a tensor of shape (S', H). Therefore, the tensor size Size(L) is generally at least four times the original token buffer size, becoming exactly four times larger in the case of uniform expert distribution. Empirically (§6.7), we find:

$$Size(L) \approx 4 \cdot Size(T)$$

# 5.4 In-place Padding for Payload Efficiency

Due to the dynamic and uneven distribution of tokens in MoE dispatch [5], GPUs commonly receive fewer tokens than their predefined expert capacity.

Current MoE frameworks [41] typically pad these buffers with null tokens before computation, unnecessarily increasing communication payloads and degrading performance. In contrast, we propose *in-place padding*, performing padding directly within the local symmetric tensor buffers and thus eliminating excess network communication. As we show in Figure 5.2 as a reference, each cell  $E_i$  is sized according to the expert capacity C. We further align this capacity to ensure divisibility by the tile block size bM = 128, guaranteeing safe and aligned memory reads by Processor threads consuming remote tokens. This in-place padding strategy slightly increases the memory footprint of L, as described below:

$$Size(L) \approx \begin{cases} 4 \cdot Size(T), & \frac{S}{E} \ge bM \\ 4 \cdot \frac{bM \cdot E}{S} \cdot Size(T), & \text{otherwise} \end{cases}$$

### **CHAPTER 6**

#### **EVALUATION**

Table 6.1: Implementation metrics of *FlashDMoE* using fully inlined NVSH-MEM.

| Metric                          | Value      |
|---------------------------------|------------|
| Total lines of code (CUDA/C++)  | 6820       |
| Kernel stack frame size         | 0 B        |
| Spill stores (per thread)       | 0          |
| Spill loads (per thread)        | 0          |
| Shared memory usage (per block) | 46 KB      |
| Registers per thread            | 255        |
| Max active blocks per SM        | 2          |
| Compilation time                | 53 seconds |
| Binary size                     | 29 MB      |

We implement (Table 6.1) and evaluate *FlashDMoE* by addressing two main questions: (1) Under what conditions does *FlashDMoE* outperform state-of-theart MoE systems, and why (§6.2 - 6.5)? and (2) How much GPU memory does *FlashDMoE* require for core data structures (§6.7)?

# 6.1 Setup

We run experiments on a RunPod server with 8 × NVIDIA H100 80G GPUs, 125 GB of RAM, and 20 vCPUs. We used PyTorch 2.6.0, CUDA 12.8, and Ubuntu 22.04. All experiments use MoE transformer models configured with 16 attention heads, an embedding dimension of 1024, and an FFN hidden size of 4096. We use top-2 routing with a capacity factor of 1.0. Note all baselines were evaluated using FP16 precision, while *FlashDMoE* was evaluated on FP32 precision, see §7 for more details. We compare *FlashDMoE* against several state-of-the-art MoE systems: (1) **Comet** [55], (2) **FasterMoE** [14], (3) **Megatron-LM** [31], and

(4) **DeepEP** [10]. Comet and DeepEP rely on NVSHMEM for communication, while FasterMoE and Megatron-LM use NCCL. We also evaluate *FlashDMoE* on a multi-node environment and discuss our findings in §6.6.

## 6.2 Scalability with Tokens and Experts



(a) Scaling with the number of tokens (E = 32).



(b) Scaling with the number of experts (T = 8K)

Figure 6.1: Scalability with respect to the number of tokens and experts. All experiments use 8 GPUs.

We first analyze the scalability of *FlashDMoE* in two dimensions: the number of input tokens and the number of experts. Figure 6.1 shows execution time for a single MoE layer on 8 GPUs. *FlashDMoE* outperforms baselines between **1.4X** to **9.5X** when scaling across sequence lengths and between **1.2X** to **15X** across

number of experts. We provide raw performance numbers in §6.8.

When scaling number of tokens (Fig. 6.1a), *FlashDMoE* maintains near-constant execution time (2.07 – 2.33 ms). In contrast, baselines incur significant runtime increases. For example, Comet's runtime grows from 2.9 ms to 19 ms, FasterMoE from 3.85 ms to 12.37 ms, Megatron-LM from 2.59 ms to 10.32 ms, and DeepEP from 3.80 ms to 7.12 ms. *FlashDMoE* avoids these overheads by overlapping computation and communication, achieving stable performance despite higher token counts. When scaling the number of experts (Figure 6.1b), *FlashDMoE*'s runtime increases proportionally from 1.12 ms to 8.20 ms, reflecting higher overhead from routing and processing additional experts. Comet and Megatron-LM exhibit relatively flat runtimes (around 19 ms and 10.4 ms, respectively) but at significantly higher latencies. FasterMoE (7.9–22.65 ms) and DeepEP (6.03–11.3 ms) show increasing runtimes due to overheads from inefficient communication or computation handling.

# 6.3 GPU Scalability



Figure 6.2: Scalability with respect to the number of GPUs (E = 32, T = 8K).

Figure 6.2 keeps tokens per GPU constant while scaling GPU count. FlashD-

*MoE* shows strong scalability, achieving a speedup from 5.99 ms (2 GPUs) to 1.98 ms (4 GPUs) and maintaining performance at 8 GPUs (2.33 ms). This highlights efficient inter-GPU communication and workload distribution. Comet and Megatron-LM degrade notably (Comet: 7.26–19.01 ms; Megatron-LM: 6.59–10.32 ms), indicating communication bottlenecks at scale. FasterMoE shows modest improvements (14.28–12.37 ms), and DeepEP displays a similar pattern to *FlashDMoE* (9.65–7.12 ms) but with consistently higher runtimes.

### 6.4 SM Utilization



(a) Comparison of SM utilization, defined as the ratio of cycles in which SMs have at least one warp in flight to the total number of cycles [35]. Values represent the average SM utilization over 100 iterations. All experiments use T = 8K and E = 64 on two GPUs.



(b) Payload efficiency. The y-axis shows the total number of bits transferred over NVLink with the same configuration. The x-axis shows the total layer execution time over 100 iterations. Both axes are log-scaled.

Figure 6.3: Factors leading to performance improvement of *FlashDMoE*.

Figure 6.3a shows GPU utilization, measured by the average SM utilization (fraction of GPU cycles with active warps [35]). *FlashDMoE* achieves over 90% SM utilization, showing effective fine-grained task scheduling and reduced idle GPU time.

## 6.5 Payload Efficiency

We next evaluate payload efficiency (Figure 6.3b) by measuring the volume of data transferred over NVLink. Traditional MoE frameworks communicate zero-padded tokens, inflating network payloads. In contrast, *FlashDMoE*'s in-place padding approach transmits only active tokens, reducing communication volume substantially and resulting in better runtime performance.

### 6.6 Multi-Node Evaluation

## 6.6.1 Setup

In this experiment, we seek to evaluate *FlashDMoE* in the multi-node setting. We use 4 nodes, where each node comprises 4 A100 GPUs fully interconnected via NVLink. Across nodes, each GPU uses a single NIC providing 25 GB/s of bandwidth. We set the number of experts to be 16 and assign each GPU to host only one, so the number of local experts is 1. Note that we define MIV formally as follows:

$$MIV = \frac{Tokens}{Experts} * local\_experts * precision * hidden\_size * 2 * n_{rg}$$

where  $n_{rg}$  is the number of remote peers and the multiplicative factor of 2 accounts for communication rounds (dispatch and combine).  $n_{rg} = 12$  for this experiment.

### 6.6.2 Results



Figure 6.4: Multi-node Latency evaluation. Embbeding dimension is 1024 and FFN intermediate size is 4096. We define Maximal Incast Volume (MIV) as the worst case upper bound for data volume that a NIC receives in a single incast occurence.

We observe a sublinear increase in latency as we scale the number of tokens. However, we observe at *Tokens* > 2048, that the application fails to terminate due to failure to receive expectant messages. We hypothesize this failure to be due to buffer overflow at the networking hardware layer as is common for applications that generate many and large messages [33] like our system. We note that this failure is addressable by tuning hardware configurations [13] but we consider this exploration as an exercise orthogonal to this work.

## 6.7 Memory Overhead

We measure the GPU memory required for the symmetric tensor L and runtime bookkeeping state of FlashDMoE. Memory overhead depends primarily on the tile size, expert capacity (EC), and the number of experts (E). Table 6.2 summarizes memory overhead under various configurations, confirming that FlashDMoE maintains a modest and predictable memory footprint.

Table 6.2: Memory overhead (tile size bM = 128, Size(T) = Tokens \* 4KB).

| Tokens | Experts | EC   | max(bM, EC) | Bookkeeping (MB) | Size(L) (MB) | Total (MB) |
|--------|---------|------|-------------|------------------|--------------|------------|
| 4K     | 16      | 256  | 256         | 64.57            | 64.00        | 128.57     |
| 4K     | 32      | 128  | 128         | 64.55            | 64.00        | 128.55     |
| 4K     | 64      | 64   | 128         | 128.90           | 128.01       | 256.91     |
| 4K     | 128     | 32   | 128         | 257.96           | 256.02       | 513.98     |
| 8K     | 16      | 512  | 512         | 128.95           | 128.01       | 256.95     |
| 8K     | 32      | 256  | 256         | 128.90           | 128.01       | 256.91     |
| 8K     | 64      | 128  | 128         | 128.90           | 128.01       | 256.91     |
| 8K     | 128     | 64   | 128         | 258.15           | 256.02       | 514.17     |
| 16K    | 16      | 1024 | 1024        | 257.89           | 256.02       | 513.90     |
| 16K    | 32      | 512  | 512         | 257.79           | 256.02       | 513.81     |
| 16K    | 64      | 256  | 256         | 257.80           | 256.02       | 513.81     |
| 16K    | 128     | 128  | 128         | 258.53           | 256.02       | 514.54     |

# 6.8 Numerical Data

Table 6.3: Latency (ms) comparison across different numbers of tokens (Figure 6.1a).

| Tokens | FlashDMoE | Comet | FasterMoE | Megatron | DeepEP |
|--------|-----------|-------|-----------|----------|--------|
| 1024   | 2.07      | 2.87  | 3.85      | 2.59     | 3.80   |
| 2048   | 1.90      | 5.12  | 5.04      | 3.83     | 3.83   |
| 4096   | 2.19      | 9.54  | 7.44      | 5.83     | 4.62   |
| 8192   | 2.33      | 19.01 | 12.37     | 10.32    | 7.12   |

Table 6.4: Latency (ms) comparison across different numbers of experts (Figure 6.1b).

| Experts | FlashDMoE | Comet | FasterMoE | Megatron | DeepEP |
|---------|-----------|-------|-----------|----------|--------|
| 16      | 1.20      | 19.01 | 11.38     | 10.47    | 5.59   |
| 32      | 2.33      | 19.01 | 12.37     | 10.32    | 7.12   |
| 64      | 4.20      | 19.21 | 16.64     | 10.47    | 7.49   |
| 128     | 8.20      | 19.36 | 22.65     | 10.47    | 11.30  |

Table 6.5: Latency (ms) comparison across different numbers of GPUs (Figure 6.2).

| GPUs | FlashDMoE | Comet | FasterMoE | Megatron | DeepEP |
|------|-----------|-------|-----------|----------|--------|
| 2    | 6.00      | 7.26  | 14.28     | 6.59     | 9.65   |
| 4    | 1.98      | 10.98 | 13.59     | 7.82     | 6.72   |
| 8    | 2.33      | 19.01 | 12.37     | 10.32    | 7.12   |

#### CHAPTER 7

### LIMITATIONS AND FUTURE WORK

Despite the performance gains and architectural innovations of *FlashD-MoE*, there are several limitations worth acknowledging—both practical and conceptual—that open the door to future research.

- **Programming complexity.** Developing fully fused, persistent kernels is a non-trivial engineering task. While *FlashDMoE* proves the feasibility and benefit of such kernels, their construction demands deep expertise in GPU architectures, synchronization and distributed protocols, and memory hierarchies. This high barrier to entry limits adoption. Future work may consider compiler-level abstractions or DSLs to democratize this technique.
- FP16 support and shared memory bank conflicts.
- Although modern GPUs natively support half-precision computation, adapting FLASHDMOE to FP16 is non-trivial due to shared memory bank conflicts. These arise from the mismatch between bank granularity and half-precision load-store alignment, leading to degraded memory throughput (§A). Overcoming this will require careful rethinking of shared memory access patterns or coalesced tile layouts specifically optimized for FP16.
- Rigid memory allocation. The current memory allocation scheme used for the symmetric tensor layout assumes a static token distribution across experts. In reality, expert popularity is often skewed and dynamic. Future extensions may explore elastic memory regions or runtime-aware reallocation schemes that adapt based on observed routing patterns.

• Lack of backward pass and training support. While this work focuses on inference, enabling training requires fusing backward computation and gradient communication into the kernel. Supporting this entails non-trivial changes to both memory bookkeeping and task descriptor definitions. Nevertheless, it remains an exciting direction for extending this system to fully support end-to-end training.

### **CHAPTER 8**

#### **CONCLUSION**

This thesis introduces *FlashDMoE*, the first system to fuse the entire Mixture-of-Experts (MoE) operator into a single, persistent GPU kernel. We show that prevailing MoE implementations suffer from two critical inefficiencies: (1) CPU-managed synchronous communication that leads to underutilized interconnects and (2) fragmented execution via multiple GPU kernels, introducing overhead and synchronization delays.

In contrast, *FlashDMoE* embraces a model of GPU autonomy by embedding computation, communication, and scheduling within a unified kernel. It leverages actor-style concurrency, warp specialization, and asynchronous (R)DMA to achieve fine-grained communication–computation overlap.

Our evaluation demonstrates up to **15**× **speedup** over state-of-the-art systems, up to **4**× improved GPU utilization, and greater payload efficiency, especially in dynamic workloads like MoE. *FlashDMoE* challenges the dominant execution paradigms in distributed deep learning and presents a compelling template for building future GPU-native systems.

While several limitations remain—including programming complexity and lack of FP16 support—this work lays the groundwork for a new era of *in-kernel distributed computation*. Future systems may build upon this foundation to enable kernel fusion for entire training pipelines, ushering in a design shift from CPU orchestration to **fully autonomous GPU execution**.

#### **BIBLIOGRAPHY**

- [1] Hamdy Abdelkhalik, Yehia Arafa, Nandakishore Santhi, and Abdel-Hameed Badawy. Demystifying the nvidia ampere architecture through microbenchmarking and instruction-level analysis, 2022.
- [2] Gul A. Agha. Actors: A model of concurrent computation in distributed systems. Technical report, 1985. MIT Artificial Intelligence Laboratory Technical Reports.
- [3] Meta AI. The llama 4 herd: The beginning of a new era of natively multimodal ai innovation, 2025.
- [4] Osayamen J Aimuyo. Aristos: Pipelining one-sided communication in distributed mixture of experts. *SIGMETRICS Perform. Eval. Rev.*, 52(4):3–5, March 2025.
- [5] Quentin Anthony, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. Blackmamba: Mixture of experts for state-space models, 2024.
- [6] C. Bell, D. Bonachea, R. Nishtala, and K. Yelick. Optimizing bandwidth limited problems using one-sided communication and overlap. In *Proceedings 20th IEEE International Parallel & Distributed Processing Symposium*, pages 10 pp.–, 2006.
- [7] Chang Chen, Xiuhong Li, Qianchao Zhu, Jiangfei Duan, Peng Sun, Xingcheng Zhang, and Chao Yang. Centauri: Enabling efficient scheduling for communication-computation overlap in large model training via communication partitioning. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 24)*, pages 178–191, 2024.
- [8] Yuxin Chen, Benjamin Brock, Serban Porumbescu, Aydin Buluc, Katherine Yelick, and John Owens. Atos: A task-parallel gpu scheduler for graph analytics. In *Proceedings of the 51st International Conference on Parallel Processing*, ICPP '22, New York, NY, USA, 2023. Association for Computing Machinery.
- [9] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. In *Advances in Neural Information Processing Systems*, volume 35, pages 16344–16359. Curran Associates, Inc., 2022.

- [10] DeepSeek-AI. Deepseek-v3 technical report, 2025.
- [11] Trevor Gale, Deepak Narayanan, Cliff Young, and Matei Zaharia. Megablocks: Efficient sparse training with mixture-of-experts. In D. Song, M. Carbin, and T. Chen, editors, *Proceedings of Machine Learning and Systems*, volume 5, pages 288–304. Curan, 2023.
- [12] Irene Greif. SEMANTICS OF COMMUNICATING PARALLEL PROCESSES. PhD thesis, Massachusetts Institute of Technology, 1975.
- [13] OpenFabrics Interfaces Working Group. fi\_cxi. https://ofiwg.github.io/libfabric/v1.21.0/man/fi\_cxi.7.html, 2025. [Accessed 23-05-2025].
- [14] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In *Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 22)*, pages 120–134, 2022.
- [15] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. Fastermoe: modeling and optimizing training of large-scale dynamic pre-trained models. In *Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming*, PPoPP '22, page 120–134, New York, NY, USA, 2022. Association for Computing Machinery.
- [16] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus), 2023.
- [17] Carl Hewitt, Peter Bishop, and Richard Steiger. A universal modular actor formalism for artificial intelligence. IJCAI'73, page 235–245, San Francisco, CA, USA, 1973. Morgan Kaufmann Publishers Inc.
- [18] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, HoYuen Chau, Peng Cheng, Fan Yang, Mao Yang, and Yongqiang Xiong. Tutel: Adaptive mixture-of-experts at scale. In D. Song, M. Carbin, and T. Chen, editors, *Proceedings of Machine Learning and Systems*, volume 5, pages 269–287. Curan, 2023.
- [19] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive

- mixture-of-experts at scale. *Proceedings of Machine Learning and Systems* (MLSys 23), 5:269–287, 2023.
- [20] Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, and Olli Saarikivi. Breaking the computation and communication abstraction barrier in distributed machine learning workloads. In *Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 22)*, pages 402–416, 2022.
- [21] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
- [22] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. In P. Gibbons, G. Pekhimenko, and C. De Sa, editors, *Proceedings of Machine Learning and Systems*, volume 6, pages 74–86, 2024.
- [23] Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, and Yida Wang. Lancet: Accelerating mixture-of-experts training via whole graph computation-communication overlapping. In *Proceedings of Machine Learning and Systems (MLSys 24)*, pages 74–86, 2024.
- [24] Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al. {MegaScale}: Scaling large language model training to more than 10,000 {GPUs}. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), pages 745–760, 2024.
- [25] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- [26] Juncai Liu, Jessie Hui Wang, and Yimin Jiang. Janus: A unified distributed training framework for sparse mixture-of-experts models. In *Proceedings of*

- the ACM SIGCOMM 2023 Conference, ACM SIGCOMM '23, page 486–498, New York, NY, USA, 2023. Association for Computing Machinery.
- [27] Xin Lu, Yanyan Zhao, Bing Qin, Liangyu Huo, Qing Yang, and Dongliang Xu. How does architecture influence the base capabilities of pre-trained language models? a case study based on ffn-wider and moe transformers. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, *Advances in Neural Information Processing Systems*, volume 37, pages 87411–87437. Curran Associates, Inc., 2024.
- [28] Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, and Xiaowen Chu. Benchmarking and Dissecting the Nvidia Hopper GPU Architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 656–667, Los Alamitos, CA, USA, May 2024. IEEE Computer Society.
- [29] Kshiteej Mahajan, Ching-Hsiang Chu, Srinivas Sridharan, and Aditya Akella. Better together: Jointly optimizing {ML} collective scheduling and execution planning using {SYNDICATE}. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), pages 809–824, 2023.
- [30] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In *Proceedings of the 27th International Conference on International Conference on Machine Learning*, ICML'10, page 807–814, Madison, WI, USA, 2010. Omnipress.
- [31] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, pages 1–15, 2021.
- [32] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In *Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis*, SC '21, New York, NY, USA, 2021. Association for Computing Machinery.

- [33] NERSC. Network NERSC Documentation. https://docs.nersc.gov/performance/network/, 2025. [Accessed 23-05-2025].
- [34] Xiaonan Nie, Xupeng Miao, Zilong Wang, Zichao Yang, Jilong Xue, Lingxiao Ma, Gang Cao, and Bin Cui. Flexmoe: Scaling large-scale sparse pretrained model training via dynamic device placement. *Proc. ACM Manag. Data*, 1(1), May 2023.
- [35] NVIDIA. NVIDIA Nsight Systems Metrics.
- [36] NVIDIA. Megatron-lm, 2025. v0.11.0.
- [37] NVIDIA. Nvidia openshmem library (nvshmem), 2025. v3.2.5.
- [38] NVIDIA. Ptx isa: Version 8.7, 2025.
- [39] Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, and Matthew D Sinclair. T3: Transparent tracking & triggering for fine-grained overlap of compute & collectives. In *Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 24)*, pages 1146–1164, 2024.
- [40] Kishore Punniyamurthy, Khaled Hamidouche, and Bradford M Beckmann. Optimizing distributed ml communication with fused computation-collective operations. In SC24: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–17, 2024.
- [41] Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale. In *Proceedings of the 39th International Conference on Machine Learning*, volume 162 of *Proceedings of Machine Learning Research*, pages 18332–18346. PMLR, 17–23 Jul 2022.
- [42] Mosaic Research. Introducing dbrx: A new state-of-the-art open llm, 2024.
- [43] Snowflake AI Research. Snowflake arctic: The best llm for enterprise ai efficiently intelligent, truly open, 2024.
- [44] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th Interna-

- tional Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.
- [45] Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. Schemoe: An extensible mixture-of-experts distributed training system with tasks scheduling. In *Proceedings of the Nineteenth European Conference on Computer Systems*, EuroSys '24, page 236–249, New York, NY, USA, 2024. Association for Computing Machinery.
- [46] Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, and Abhinav Bhatele. A hybrid tensor-expert-data parallelism approach to optimize mixture-of-experts training. In *Proceedings of the 37th ACM International Conference on Supercomputing*, ICS '23, page 203–214, New York, NY, USA, 2023. Association for Computing Machinery.
- [47] Weigao Sun, Zhen Qin, Weixuan Sun, Shidi Li, Dong Li, Xuyang Shen, Yu Qiao, and Yiran Zhong. CO2: Efficient distributed training with full communication-computation overlap. In *The Twelfth International Conference on Learning Representations (ICLR 24)*, 2024.
- [48] Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUT-LASS, 2025.
- [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is all you need. In *Advances in Neural Information Processing Systems*, volume 30. Curran Associates, Inc., 2017.
- [50] Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. Harnessing inter-gpu shared memory for seamless moe communication-computation fusion. In *Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming*, pages 170–182, 2025.
- [51] Hulin Wang, Yaqi Xia, Donglin Yang, Xiaobo Zhou, and Dazhao Cheng. Harnessing inter-gpu shared memory for seamless moe communication-computation fusion. In *Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming*, PPoPP '25, page

- 170–182, New York, NY, USA, 2025. Association for Computing Machinery.
- [52] Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake Hechtman, Dehao Chen, Karthik Srinivasa Murthy, Marcello Maggioni, Qiao Zhang, et al. Overlap communication with dependent computation via decomposition in large deep learning models. In *Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 22)*, pages 93–106, 2022.
- [53] Michael Wendt and Joshua Wyatt. Getting started with CUDA graphs. https://developer.nvidia.com/blog/cuda-graphs/, 2019. Accessed: 2024-05-15.
- [54] Katherine Yelick, Dan Bonachea, Wei-Yu Chen, Phillip Colella, Kaushik Datta, Jason Duell, Susan L. Graham, Paul Hargrove, Paul Hilfinger, Parry Husbands, Costin Iancu, Amir Kamil, Rajesh Nishtala, Jimmy Su, Michael Welcome, and Tong Wen. Productivity and performance using partitioned global address space languages. In *Proceedings of the 2007 International Workshop on Parallel Symbolic Computation*, PASCO '07, page 24–32, New York, NY, USA, 2007. Association for Computing Machinery.
- [55] Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu. Comet: Fine-grained computation-communication overlapping for mixture-of-experts. In *MLSys* '25.
- [56] Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu. Triton-distributed: Programming overlapping kernels on distributed ai systems with the triton compiler, 2025.

### APPENDIX A

### **FP16 MEMORY THROUGHPUT**



(a) Memory subsystem throughput for FP16



(b) Memory subsystem throughput for FP32

Figure A.1: Here, we report the total GPU memory throughput for both FP16 (top) and FP32 (bottom) variants of *FlashDMoE*. Notably, the FP16 implementation issues approximately 2× more shared memory instructions compared to its FP32 counterpart under identical workloads. We attribute this inefficiency to suboptimal shared memory layouts in *FlashDMoE* when operating on half-precision data. While this bottleneck is addressable through improved layout strategies, we leave its resolution to future work due to time constraints.