[LLVMGPU] Fit mma schedules inside shared memory limits #16927
Conversation
compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.cpp
This is a bit tricky to test because ideally KernelConfig won't attempt to create invalid schedules in the first place. However, a simple batch matmul that was known to cause problems in the past should do:
```mlir
hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mfma_layout<F16_16x16x16_F32>, #iree_gpu.mfma_layout<F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none"}>) {
  hal.executable.export public @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
  ^bb0(%arg0: !hal.device):
    %x, %y, %z = flow.dispatch.workgroup_count_from_slice
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32() {
      %cst = arith.constant 0.000000e+00 : f32
      %c129181184 = arith.constant 129181184 : index
      %c18112 = arith.constant 18112 : index
      %c100980224 = arith.constant 100980224 : index
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c129181184) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c18112) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>>
      %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c100980224) : !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>> -> tensor<64x80x1280xf16>
      %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [64, 1280, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>> -> tensor<64x1280x1280xf16>
      %5 = tensor.empty() : tensor<64x80x1280xf32>
      %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      %7 = linalg.batch_matmul ins(%3, %4 : tensor<64x80x1280xf16>, tensor<64x1280x1280xf16>) outs(%6 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      flow.dispatch.tensor.store %7, %2, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : tensor<64x80x1280xf32> -> !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      return
    }
  }
}
```
```
@@ -0,0 +1,31 @@
// RUN: iree-opt --split-input-file \
// RUN:   --iree-codegen-llvmgpu-use-vector-distribution '--pass-pipeline=builtin.module(hal.executable(hal.executable.variant(iree-llvmgpu-select-lowering-strategy, iree-llvmgpu-lower-executable-target, canonicalize)))' %s | FileCheck %s
```
The check-resource-usage pass is actually run during the lowering to LLVM, which this pass pipeline doesn't include. But I'm not sure what else to use here...
This test does confirm the end-to-end working of the fit-shared-memory change, though. Also, without the fit-shared-memory changes this test would not pass the fixed check-resource-usage pass.
Force-pushed 710124a to ce095d1
Force-pushed e6f30fc to 688649a
```
  }
}}
// CHECK-LABEL: .executable.export public @fit_shared_memory_schedule
```
This check doesn't seem to be enough. Is this really testing what you want with this test?
If it doesn't fit, a verifier later on errors out. IIUC, testing that this compiles, together with a verifier tested elsewhere, shows that this works. Although it would also be nice to have the schedule sizes checked.
This patch adds support to check whether a matmul schedule would cause promotion to create allocations that do not fit the shared memory size, and shrinks the MMA schedule if so. The patch also updates the check-resource-usage pass in the LLVMGPU pass pipeline to query the shared memory limit from the target.

---------

Co-authored-by: Quinn Dawkins <quinn.dawkins@gmail.com>
Co-authored-by: Jakub Kuderski <jakub@nod-labs.com>
Signed-off-by: Lubo Litchev <lubol@google.com>