Distributed GEMM #302
Merged
Conversation
force-pushed from 985e5cb to 7e97598
force-pushed from 1491a50 to 8f64897
Hardcode84 (Contributor) reviewed Sep 25, 2025:
Any lit test for codegen changes? Also, please fix merge conflicts.
Contributor: Please fix the DCO and merge conflicts.
force-pushed from f9a9485 to 875041d
Signed-off-by: Sanket Pandit <sanketp@amd.com>
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Side-effect, device constraint on multiple dimensions functional
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
Hardcode84 approved these changes Oct 2, 2025
Megan0704-1 pushed a commit to Megan0704-1/wave that referenced this pull request on Oct 28, 2025:
Enable distributed matrix multiplication operations across multiple GPU devices in wave, where the distribution factor for each dimension is managed through `DeviceConstraint`.

### Key Changes

- In `host_codegen.py`, we split the input tensors across devices based on the device constraint, dispatch computation to each device, and merge results back into the original tensor shape (see the sketch after this list).
- We now use a `HostSignature` class to manage full problem-size buffers (e.g., a 1024x8192 matrix), while `KernelSignature` handles per-device tile buffers (e.g., 512x4096 tiles for a 2x2 distribution). Non-distributed workloads are unaffected: in the absence of a `DeviceConstraint`, `HostSignature` is the same as `KernelSignature`.
- In `host_utils.py`, added functions that split input and output tensors per device and manage tensor distribution across devices using `device_constraint_map`.
- Updated the distributed GEMM template to accept `device_m` and `device_n` parameters, with test coverage for distribution factors from 1x1 up to 4x2 across problem sizes up to 4096x20480x2560.
- Use the `MultiDeviceLaunchable` class to orchestrate execution across multiple GPU devices.

---------

Signed-off-by: Sanket Pandit <sanketp@amd.com>
Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>
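To make the split/dispatch/merge flow above concrete, here is a minimal host-side sketch in plain PyTorch. It is an illustration of the behavior described in the list, not the PR's implementation: the helper names `split_for_devices`, `merge_from_devices`, and `distributed_gemm` are hypothetical, and the real code compiles wave kernels and launches them through `MultiDeviceLaunchable` rather than calling eager matmuls.

```python
import torch

def split_for_devices(t, factor_m, factor_n):
    # Cut a full-problem tensor (the HostSignature shape) into a
    # factor_m x factor_n grid of per-device tiles (the KernelSignature
    # shape), e.g. 1024x8192 -> four 512x4096 tiles for a 2x2 split.
    rows = torch.chunk(t, factor_m, dim=0)
    return [list(torch.chunk(r, factor_n, dim=1)) for r in rows]

def merge_from_devices(tiles):
    # Inverse of split_for_devices: stitch per-device result tiles
    # back into the original full-problem shape.
    return torch.cat([torch.cat(row, dim=1) for row in tiles], dim=0)

def distributed_gemm(a, b, device_m, device_n):
    # A is split along M and B along N; device (i, j) computes the
    # full-K product of A's i-th row block with B's j-th column block.
    a_blocks = torch.chunk(a, device_m, dim=0)
    b_blocks = torch.chunk(b, device_n, dim=1)
    dev_type = "cuda" if torch.cuda.is_available() else "cpu"
    n_dev = max(torch.cuda.device_count(), 1)
    c_tiles = []
    for i in range(device_m):
        row = []
        for j in range(device_n):
            # Round-robin the tiles over the available devices.
            dev = torch.device(dev_type, (i * device_n + j) % n_dev)
            row.append((a_blocks[i].to(dev) @ b_blocks[j].to(dev)).cpu())
        c_tiles.append(row)
    return merge_from_devices(c_tiles)

# 2x2 distribution: each device sees a 512x8192 block of A and an
# 8192x256 block of B, and produces one 512x256 tile of C.
a = torch.randn(1024, 8192)
b = torch.randn(8192, 512)
c = distributed_gemm(a, b, device_m=2, device_n=2)
assert torch.allclose(c, a @ b, rtol=1e-3, atol=1e-3)
```

Note that in this scheme only M and N are distributed; each device keeps the full K extent, so the per-device tiles can be computed independently and merged without any cross-device reduction.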