[LLVMGPU] Fit mma schedules inside shared memory limits #16927
Conversation
compiler/src/iree/compiler/Codegen/Common/GPU/GPUHeuristics.cpp
This is a bit tricky to test because ideally KernelConfig won't attempt to create invalid schedules in the first place. However, a simple batch matmul that was known to cause problems in the past should do:
```mlir
hal.executable.variant public @rocm_hsaco_fb target(<"rocm", "rocm-hsaco-fb", {mma_intrinsics = [#iree_gpu.mfma_layout<F16_16x16x16_F32>, #iree_gpu.mfma_layout<F16_32x32x8_F32>], target_arch = "gfx942", ukernels = "none"}>) {
  hal.executable.export public @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) attributes {hal.interface.bindings = [#hal.interface.binding<0, 0>, #hal.interface.binding<0, 1>, #hal.interface.binding<0, 2>]} {
  ^bb0(%arg0: !hal.device):
    %x, %y, %z = flow.dispatch.workgroup_count_from_slice
    hal.return %x, %y, %z : index, index, index
  }
  builtin.module {
    func.func @main$async_dispatch_132_batch_matmul_64x80x1280x1280_f16xf16xf32() {
      %cst = arith.constant 0.000000e+00 : f32
      %c129181184 = arith.constant 129181184 : index
      %c18112 = arith.constant 18112 : index
      %c100980224 = arith.constant 100980224 : index
      %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c129181184) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>>
      %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c18112) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>>
      %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c100980224) : !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x80x1280xf16>> -> tensor<64x80x1280xf16>
      %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0], sizes = [64, 1280, 1280], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<64x1280x1280xf16>> -> tensor<64x1280x1280xf16>
      %5 = tensor.empty() : tensor<64x80x1280xf32>
      %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      %7 = linalg.batch_matmul ins(%3, %4 : tensor<64x80x1280xf16>, tensor<64x1280x1280xf16>) outs(%6 : tensor<64x80x1280xf32>) -> tensor<64x80x1280xf32>
      flow.dispatch.tensor.store %7, %2, offsets = [0, 0, 0], sizes = [64, 80, 1280], strides = [1, 1, 1] : tensor<64x80x1280xf32> -> !flow.dispatch.tensor<writeonly:tensor<64x80x1280xf32>>
      return
    }
  }
}
```
```
@@ -0,0 +1,31 @@
// RUN: iree-opt --split-input-file \
// RUN:   --iree-codegen-llvmgpu-use-vector-distribution '--pass-pipeline=builtin.module(hal.executable(hal.executable.variant(iree-llvmgpu-select-lowering-strategy, iree-llvmgpu-lower-executable-target, canonicalize)))' %s | FileCheck %s
```
The check-resource-usage pass is actually run during the lowering to LLVM, which this pass pipeline doesn't include. But I'm not sure what else to use here...
This test does confirm the end-to-end working of the fit-shared-memory change, though. Also, without the fit-shared-memory changes this test would not pass the fixed check-resource-usage pass.
Force-pushed 710124a to ce095d1
Force-pushed e6f30fc to 688649a
```
  }
}}
// CHECK-LABEL: .executable.export public @fit_shared_memory_schedule
```
This check doesn't seem to be enough. Is this really testing what you want with this test?
If it doesn't fit, a verifier later on errors out. IIUC, testing that this compiles, together with a verifier tested elsewhere, shows that this works. Although it would also be nice to have the schedule sizes checked.
This patch adds support to check whether a matmul schedule would cause promotion to create allocations that do not fit the shared memory size, and shrinks the MMA schedule if so. The patch also updates the check-resource-usage pass in the LLVMGPU pass pipeline to query the shared memory limit from the target.

---------

Co-authored-by: Quinn Dawkins <quinn.dawkins@gmail.com>
Co-authored-by: Jakub Kuderski <jakub@nod-labs.com>
Signed-off-by: Lubo Litchev <lubol@google.com>