
[LLVMGPU] Fit mma schedules inside shared memory limits #16927

Merged (11 commits) on Apr 11, 2024

Conversation

@Groverkss (Contributor) commented Mar 28, 2024

This patch adds a check for whether a matmul schedule would cause promotion to create allocations that do not fit in shared memory, and shrinks the MMA schedule if so. The patch also updates the check-resource-usage pass in the LLVMGPU pass pipeline to query the shared memory limit from the target.
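In outline, the new logic estimates the shared memory that promoting the matmul operands would allocate and shrinks the schedule until that estimate fits the target's limit. Below is a minimal C++ sketch of that shape, with hypothetical names; the real heuristic lives in IREE's GPU configuration code and is not reproduced here.

#include <cstdint>

// Hypothetical stand-in for an MMA schedule: workgroup tile sizes along
// M/N/K and the element bit widths of the two promoted operands.
struct MmaSchedule {
  int64_t mTile, nTile, kTile;
  int64_t lhsBits, rhsBits; // e.g. 16 for f16
};

// Shared memory needed to promote LHS (mTile x kTile) and RHS (kTile x nTile).
int64_t sharedMemoryUsedInBytes(const MmaSchedule &s) {
  return (s.mTile * s.kTile * s.lhsBits + s.kTile * s.nTile * s.rhsBits) / 8;
}

// Halve tile sizes until the promoted tiles fit in the limit; return false if
// even the smallest schedule does not fit. Halving K first is one possible
// heuristic, not necessarily the order the actual implementation uses.
bool fitScheduleInSharedMemory(MmaSchedule &s, int64_t limitInBytes) {
  while (sharedMemoryUsedInBytes(s) > limitInBytes) {
    if (s.kTile > 1)
      s.kTile /= 2;
    else if (s.nTile > 1)
      s.nTile /= 2;
    else if (s.mTile > 1)
      s.mTile /= 2;
    else
      return false;
  }
  return true;
}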

@Groverkss Groverkss requested a review from kuhar March 28, 2024 19:36
@kuhar (Member) left a comment:

This is a bit tricky to test because, ideally, KernelConfig won't attempt to create invalid schedules in the first place... However, a simple batch matmul that was known to cause problems in the past should do:

hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mfma_layout<F16_16x16x16_F32>, #iree_gpu.mfma_layout<F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none"}>) {
  hal.executable.export public @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
  ^bb0(%arg0: !hal.device):
    %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32() {
      %cst = arith.constant 0.000000e+00 : f32
      %c129181184 = arith.constant 129181184 : index
      %c18112 = arith.constant 18112 : index
      %c100980224 = arith.constant 100980224 : index
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c129181184) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c18112) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>>
      %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c100980224) : !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>> -> tensor<64x80x1280xf16>
      %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [64, 1280, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>> -> tensor<64x1280x1280xf16>
      %5 = tensor.empty() : tensor<64x80x1280xf32>
      %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      %7 = linalg.batch_matmul ins(%3, %4 : tensor<64x80x1280xf16>, tensor<64x1280x1280xf16>) outs(%6 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      flow.dispatch.tensor.store %7, %2, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : tensor<64x80x1280xf32> -> !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      return
    }
  }
}
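For intuition on the numbers (hypothetical tile sizes, not necessarily what KernelConfig picks): promoting an f16 LHS tile of 64x128 and an f16 RHS tile of 128x256 needs (64*128 + 128*256) * 2 bytes = 80 KiB of shared memory, which overflows the 64 KiB per-workgroup limit on gfx94x targets; halving the K tile to 64 brings the total down to 40 KiB, which fits.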

@Groverkss Groverkss marked this pull request as ready for review April 1, 2024 20:20
@@ -0,0 +1,31 @@
// RUN: iree-opt --split-input-file \
// RUN: --iree-codegen-llvmgpu-use-vector-distribution '--pass-pipeline=builtin.module(hal.executable(hal.executable.variant(iree-llvmgpu-select-lowering-strategy, iree-llvmgpu-lower-executable-target, canonicalize)))' %s | FileCheck %s
@Groverkss (Contributor, Author) commented:
The check-resource-usage pass is actually run during lowering to LLVM, which this pass pipeline doesn't include. But I'm not sure what else to use here...

This test does confirm the end-to-end behavior of the shared-memory fitting, though. Also, this test would not pass the fixed check-resource-usage pass without the fit-shared-memory changes.

@Groverkss Groverkss changed the title Fit mma schedules inside shared memory limits [LLVMGPU] Fit mma schedules inside shared memory limits Apr 1, 2024
@Groverkss Groverkss requested a review from kuhar April 4, 2024 15:25
@kuhar kuhar enabled auto-merge (squash) April 11, 2024 20:14
@kuhar kuhar merged commit 94971b4 into iree-org:main Apr 11, 2024
53 checks passed
}
}
}

// CHECK-LABEL: hal.executable.export public @fit_shared_memory_schedule
A contributor commented:

This check doesn't seem to be enough. Is this really testing what you want with this test?

A member replied:

If it doesn't fit, a verifier later on errors out. IIUC, testing that this compiles, together with a verifier tested elsewhere, shows that this works. Although it would also be nice to have the schedule sizes checked.
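For instance, a follow-up check could pin down the chosen tile sizes along these lines (a hypothetical pattern; the exact attribute text and sizes depend on what the lowering strategy prints):

// CHECK: lowering_config = #iree_codegen.lowering_config<tile_sizes = {{\[\[.*\]\]}}>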

LLITCHEV pushed a commit to LLITCHEV/iree that referenced this pull request on Jul 30, 2024, with the same description as above and the following trailers:

Co-authored-by: Quinn Dawkins <quinn.dawkins@gmail.com>
Co-authored-by: Jakub Kuderski <jakub@nod-labs.com>
Signed-off-by: Lubo Litchev <lubol@google.com>