Support for data tiling on other (GPU/LLVMGPU/SPIR-V) backends #16933
Thanks @qedawkins for the detailed issue! Nice read! There is a lot to unpack here, and I will read through this a few times, but here are some priors of mine that clash with some points above.
All of the above could be made to work and we can ensure it works as expected, but I think there is a difference on GPUs. We can use packing as a way to move data from global memory to shared memory (that's the first level of packing you describe) and another pack to move data from shared memory to registers (we know we already have the transformations for this because this is what we are using on the AIE backend, and those work as expected, including working with bufferization). I am not saying that strategy is the most efficient (it is unclear to me how effective the use of shared memory this way will be, and how it affects overall performance). This approach will allow …
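To make the two-level idea concrete, here is a minimal sketch of packing once at the workgroup/shared-memory granularity and a second time at the register granularity. The 64x64 workgroup tile and 16x4 register tile are illustrative assumptions, not what any backend actually picks:

```mlir
func.func @two_level_pack(%src: tensor<128x128xf32>) -> tensor<2x2x4x16x16x4xf32> {
  // First-level pack: 64x64 tiles, the granularity staged from global to shared memory.
  %dest0 = tensor.empty() : tensor<2x2x64x64xf32>
  %wg = tensor.pack %src inner_dims_pos = [0, 1] inner_tiles = [64, 64]
      into %dest0 : tensor<128x128xf32> -> tensor<2x2x64x64xf32>
  // Second-level pack: 16x4 sub-tiles within each 64x64 tile, the granularity
  // moved from shared memory into registers.
  %dest1 = tensor.empty() : tensor<2x2x4x16x16x4xf32>
  %reg = tensor.pack %wg inner_dims_pos = [2, 3] inner_tiles = [16, 4]
      into %dest1 : tensor<2x2x64x64xf32> -> tensor<2x2x4x16x16x4xf32>
  return %reg : tensor<2x2x4x16x16x4xf32>
}
```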
Makes sense, I was using "encoding" both as a general term for any attribute describing the layout of a tensor in memory, and as the concrete thing it is today. It would help to decouple those usages of the term, but I agree that in its current state, the propagation of an encoding is not possible.
+1 to this approach, but I see this as orthogonal. What we do at rest without data tiling should be largely the same as what we do with data tiling, should we choose to push on it. I see data tiling as a "last 10%" kind of strategy (especially for the most powerful GPUs) rather than the absolute requirement that it is on CPU. I'm hoping that this issue can be more of a starting point for brainstorming on that last 10% rather than an excuse to not do the work in the default case.
Big +1 to this approach. We should discuss this idea more in a separate issue. It would be a big win to me if the codegen approaches for AIE and GPU end up overlapping a lot too :)
I took a look at different options in the data-tiling path. The constant computation is still hoisted to the initialization stage if we disable const-eval. However, it does not work if we defer the materialization to the very end stage. [1] We can teach the compiler about that case, but there are other problems; we're facing an allocation issue:

```mlir
// -----// IR Dump Before CPUMaterializeUpperBoundTileSize (iree-codegen-cpu-materialize-upper-bound-tile-size) //----- //
// ...
%25:2 = iree_linalg_ext.upper_bound_tile_size tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>> -> index, index
%26 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%25#0, %10]
%27 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%25#1, %11]
%28 = arith.muli %26, %c4 : index
%29 = arith.muli %28, %27 : index
%30 = util.align %19, %c64 : index
%31 = util.align %24, %c64 : index
%32 = arith.addi %30, %31 : index
%33 = util.align %29, %c64 : index
%34 = arith.addi %32, %33 : index
%result_0, %result_timepoint_1 = stream.resource.alloca uninitialized : !stream.resource<transient>{%34} => !stream.timepoint
```

The workaround is running an early materialization pass (i.e., materializing the encodings at the GlobalOptimization level, as is done today).

[1] So one of the tasks is to teach the compiler to hoist "set encodings on constants" to an initializer.
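For reference, the affine maps above compute `(s1 ceildiv s0) * s0`, i.e. they round the dynamic dim `s1` up to a multiple of the upper-bound tile size `s0`. As a worked example with illustrative numbers: with an upper bound of 16 and a dynamic dim of 100, the rounded dim is ceildiv(100, 16) * 16 = 112, and the transient allocation size is 112 * 4 * (the other rounded dim) bytes, with 4 presumably being the f32 element size. All of this arithmetic feeds the `stream.resource.alloca` size, which is why the encoding has to be resolvable by this point.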
Rounding up for now is easiest. The real solution is going to be to use tensor encodings properly - stream is designed to support those even with heterogeneous devices.
That is probably step 1 here then. After discussion offline, early materialization isn't going to work very well on GPU given that we want multi-level packs. If we have the chance, it would be nice to do it the right way.
The first place to start sniffing around would be …

What I'd do, if doing it all incrementally:

phase 1: …

phase 2: …

phase 3: …
Thanks for the tips! I'll start picking this up after my break.
This will be the key I think, because encoding attrs could be op specific (and will likely need to be revisited if we want to do any kind of propagation).
My current thought process is that we will want to plug any kind of external tuning process into the tile sizes selected for an encoding. Going even a step further, on a target like SPIR-V where we have a JIT compiler + specialization constants, the tile sizes we pick for an encoding could be resolved + tuned directly on the vmfb. Or we could pass encodings as a push constant and we could provide some parameterized lookup table for resolving the encodings. Probably far away though, and maybe will never beat baking the tile sizes in at compile time (without another compiler backing it like SPIR-V).
Yeah, having the baked-out stuff will always be best for performance; then it's a tradeoff of compilation time/deployment size as to whether we bake out a lot of variants or leave things dynamic. Classic optimize-for-speed or optimize-for-size style stuff, and something we can use PGO for (knowing the actual values used for any executable/push constant will let us turn those into compile-time constants). If we write things using the queries to start, it'll be push constants, but we have the option of going and turning those queries (for fixed devices) or related arithmetic (the arith on both the query result and dynamic shapes, etc.) into constants and then propagating those into dispatches. I'll also have a pass that hoists const exprs derived from device queries/arith into executable constant blocks, which will become specialization constants in SPIR-V or just constant device-side buffers in CUDA/ROCM (effectively then just push constants but without the per-dispatch overhead of pushing them). Fun things that are all unlocked by making some progress here on getting more IR and less hardcoding :)
Oooh nice, the constant buffer thing sounds cool. Yeah need to make progress here first. |
Thanks for all the details! Yes, phase 1 sounds good to me, and I need to think more about phase 2. I will start picking up the work and make progress towards phase 2.
Replying to this part of the original issue description above:
I still think that a single pack + mmt4d as in CPU should work:
But this is just one tile in a larger tiled matrix, so let's carry on spelling out the layout of subsequent tiles. Each tile comes below the preceding tile and continues the offset increments from where it left off, so if we write the first 4 tiles, we get:
So we see that as long as the number of rows is a multiple of 16, the two layouts coincide. So we can drop the extra level of packing. This is not a coincidence.
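As a worked illustration of the "offsets continue where the previous tile left off" point (assuming a row-major 16 x K0 tile; the exact tile shape in the elided layout above may differ): element (i, j) of tile t sits at linear offset t*16*K0 + i*K0 + j, so tile t ends at offset (t+1)*16*K0 - 1 and tile t+1 starts exactly at (t+1)*16*K0. Consecutive tiles therefore form one contiguous block, which is the property that lets the single-pack layout line up with the multi-level one whenever the row count is a multiple of 16.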
I'm making some progress. In my prototype, I create a SetEncodingHintOnDispatches pass which introduces the `encoding.round_dims_to` attribute on dispatch ops. E.g.,

```mlir
%9:2 = iree_linalg_ext.upper_bound_tile_size tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>> -> index, index
%10 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%9#1, %1]
%11 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%9#0, %0]
%12 = flow.dispatch.workgroups[%9#1, %9#0, %0, %1, %11, %10](%9#1, %9#0, %2, %0, %1, %11, %10) : (index, index, tensor<?x?xf32>{%0, %1}, index, index, index, index) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>{%11, %10} =
(%arg3: index, %arg4: index, %arg5: !flow.dispatch.tensor<readonly:tensor<?x?xf32>>, %arg6: index, %arg7: index, %arg8: index, %arg9: index, %arg10: !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>) {
%cst = arith.constant 0.000000e+00 : f32
%24 = flow.dispatch.workload.ordinal %arg6, 2 : index
%25 = flow.dispatch.workload.ordinal %arg7, 3 : index
%26 = flow.dispatch.workload.ordinal %arg8, 4 : index
%27 = flow.dispatch.workload.ordinal %arg9, 5 : index
%28 = flow.dispatch.tie_shape %arg5 : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%24, %25}
%29 = flow.dispatch.tie_shape %arg10 : !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>{%26, %27}
%30 = flow.dispatch.workload.ordinal %arg3, 0 : index
%31 = flow.dispatch.workload.ordinal %arg4, 1 : index
%32 = flow.dispatch.tensor.load %28, offsets = [0, 0], sizes = [%24, %25], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%24, %25} -> tensor<?x?xf32>
%33 = affine.apply affine_map<()[s0, s1] -> (-s1 + (s1 ceildiv s0) * s0)>()[%30, %25]
%34 = affine.apply affine_map<()[s0, s1] -> (-s1 + (s1 ceildiv s0) * s0)>()[%31, %24]
%padded = tensor.pad %32 low[0, 0] high[%34, %33] {
^bb0(%arg11: index, %arg12: index):
tensor.yield %cst : f32
} : tensor<?x?xf32> to tensor<?x?xf32>
%35 = iree_linalg_ext.set_encoding %padded : tensor<?x?xf32> -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>
flow.dispatch.tensor.store %35, %29, offsets = [0, 0], sizes = [%26, %27], strides = [1, 1] : tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>{%26, %27}
flow.return
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg3, %arg4, %arg5, %arg6, %arg7, %arg8
flow.return %x, %y, %z : index, index, index
}
```

is converted to

```mlir
%9 = flow.dispatch.workgroups[%c16, %0, %1](%c16, %2, %0, %1) : (index, tensor<?x?xf32>{%0, %1}, index, index) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>{%0, %1}
// -------------- This is the new attribute --------------- //
attributes {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} =
(%arg3: index, %arg4: !flow.dispatch.tensor<readonly:tensor<?x?xf32>>, %arg5: index, %arg6: index, %arg7: !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>) {
%cst = arith.constant 0.000000e+00 : f32
%15 = flow.dispatch.workload.ordinal %arg5, 1 : index
%16 = flow.dispatch.workload.ordinal %arg6, 2 : index
%17 = flow.dispatch.workload.ordinal %arg3, 0 : index
%18 = flow.dispatch.tie_shape %arg4 : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%15, %16}
%19 = flow.dispatch.tie_shape %arg7 : !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>{%15, %16}
%20 = flow.dispatch.tensor.load %18, offsets = [0, 0], sizes = [%15, %16], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<?x?xf32>>{%15, %16} -> tensor<?x?xf32>
%21 = affine.apply affine_map<()[s0, s1] -> (-s1 + (s1 ceildiv s0) * s0)>()[%17, %16]
%22 = affine.apply affine_map<()[s0, s1] -> (-s1 + (s1 ceildiv s0) * s0)>()[%17, %15]
%padded = tensor.pad %20 low[0, 0] high[%22, %21] {
^bb0(%arg8: index, %arg9: index):
tensor.yield %cst : f32
} : tensor<?x?xf32> to tensor<?x?xf32>
%23 = iree_linalg_ext.set_encoding %padded : tensor<?x?xf32> -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>
flow.dispatch.tensor.store %23, %19, offsets = [0, 0], sizes = [%15, %16], strides = [1, 1] : tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>> -> !flow.dispatch.tensor<writeonly:tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>>{%15, %16}
flow.return
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_slice %arg3, %arg4, %arg5
flow.return %x, %y, %z : index, index, index
}
```

This allows us to propagate the information to the backend; we can also use the new attribute in the stream pipeline. In the ConvertToStream pass, I teach the ConvertDispatchOp pattern to add the attribute on the converted ops:

```mlir
// -----// IR Dump Before ConvertToStreamPass (iree-stream-conversion) //----- //
util.func public @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%c16 = arith.constant 16 : index
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%2 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}
%3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
%5 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}
%6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index
%7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index
%8 = hal.tensor.import %arg2 "input2" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}
// All the below dispatches have `{encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>}`
%9 = flow.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_0::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_0_set_encoding_LHS_DxD[%c16, %0, %1](%c16, %2, %0, %1) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, tensor<?x?xf32>{%0, %1}, index, index) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%0, %1}
%10 = flow.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_1::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_1_set_encoding_RHS_DxD[%c16, %3, %4](%c16, %5, %3, %4) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, tensor<?x?xf32>{%3, %4}, index, index) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = RHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%3, %4}
%11 = flow.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_2::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_2_set_encoding_RESULT_DxD[%c16, %6, %7](%c16, %8, %6, %7) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, tensor<?x?xf32>{%6, %7}, index, index) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%6, %7}
%12 = flow.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_3::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_3_matmul_DxDxD_f32[%0, %1, %3, %4, %6, %7](%9, %10, %11, %0, %1, %3, %4, %6, %7) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%0, %1}, tensor<?x?xf32, #iree_linalg_ext.encoding<role = RHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%3, %4}, tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%6, %7}, index, index, index, index, index, index) -> %11{%6, %7}
%13 = flow.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_4::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_4_unset_encoding_RESULT_DxD[%6, %7](%12, %6, %7) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%6, %7}, index, index) -> tensor<?x?xf32>{%6, %7}
%14 = hal.tensor.export %13 "output0" : tensor<?x?xf32>{%6, %7} -> !hal.buffer_view
util.return %14 : !hal.buffer_view
}
// -----// IR Dump After ConvertToStreamPass (iree-stream-conversion) //----- //
util.func public @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%c16 = arith.constant 16 : index
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%element_type_f32 = hal.element_type<f32> : i32
%dense_row_major = hal.encoding_type<dense_row_major> : i32
hal.buffer_view.assert<%arg0 : !hal.buffer_view> message("input0") shape([%0, %1]) type(%element_type_f32) encoding(%dense_row_major)
%2 = stream.tensor.sizeof tensor<?x?xf32>{%0, %1} : index
%3 = stream.tensor.import %arg0 : !hal.buffer_view -> tensor<?x?xf32>{%0, %1} in !stream.resource<external>{%2}
%4 = stream.async.transfer %3 : !stream.resource<external>{%2} -> !stream.resource<*>{%2}
%5 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%6 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
%element_type_f32_0 = hal.element_type<f32> : i32
%dense_row_major_1 = hal.encoding_type<dense_row_major> : i32
hal.buffer_view.assert<%arg1 : !hal.buffer_view> message("input1") shape([%5, %6]) type(%element_type_f32_0) encoding(%dense_row_major_1)
%7 = stream.tensor.sizeof tensor<?x?xf32>{%5, %6} : index
%8 = stream.tensor.import %arg1 : !hal.buffer_view -> tensor<?x?xf32>{%5, %6} in !stream.resource<external>{%7}
%9 = stream.async.transfer %8 : !stream.resource<external>{%7} -> !stream.resource<*>{%7}
%10 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index
%11 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index
%element_type_f32_2 = hal.element_type<f32> : i32
%dense_row_major_3 = hal.encoding_type<dense_row_major> : i32
hal.buffer_view.assert<%arg2 : !hal.buffer_view> message("input2") shape([%10, %11]) type(%element_type_f32_2) encoding(%dense_row_major_3)
%12 = stream.tensor.sizeof tensor<?x?xf32>{%10, %11} : index
%13 = stream.tensor.import %arg2 : !hal.buffer_view -> tensor<?x?xf32>{%10, %11} in !stream.resource<external>{%12}
%14 = stream.async.transfer %13 : !stream.resource<external>{%12} -> !stream.resource<*>{%12}
%c0 = arith.constant 0 : index
// note: the below sizeof ops have {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>}.
%15 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%0, %1} {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : index
%16 = stream.async.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_0::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_0_set_encoding_LHS_DxD[%c16, %0, %1](%c16, %4[%c0 to %2 for %2], %0, %1) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, !stream.resource<*>{%2}, index, index) -> !stream.resource<*>{%15}
%c0_4 = arith.constant 0 : index
%17 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = RHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%5, %6} {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : index
%18 = stream.async.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_1::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_1_set_encoding_RHS_DxD[%c16, %5, %6](%c16, %9[%c0_4 to %7 for %7], %5, %6) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, !stream.resource<*>{%7}, index, index) -> !stream.resource<*>{%17}
%c0_5 = arith.constant 0 : index
%19 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%10, %11} {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : index
%20 = stream.async.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_2::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_2_set_encoding_RESULT_DxD[%c16, %10, %11](%c16, %14[%c0_5 to %12 for %12], %10, %11) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (index, !stream.resource<*>{%12}, index, index) -> !stream.resource<*>{%19}
%c0_6 = arith.constant 0 : index
%21 = stream.async.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_3::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_3_matmul_DxDxD_f32[%0, %1, %5, %6, %10, %11](%16[%c0_6 to %15 for %15], %18[%c0_6 to %17 for %17], %20[%c0_6 to %19 for %19], %0, %1, %5, %6, %10, %11) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (!stream.resource<*>{%15}, !stream.resource<*>{%17}, !stream.resource<*>{%19}, index, index, index, index, index, index) -> %20{%19}
%c0_7 = arith.constant 0 : index
%22 = stream.tensor.sizeof tensor<?x?xf32>{%10, %11} {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : index
%23 = stream.async.dispatch @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_4::@matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32_dispatch_4_unset_encoding_RESULT_DxD[%10, %11](%21[%c0_7 to %19 for %19], %10, %11) {encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : (!stream.resource<*>{%19}, index, index) -> !stream.resource<*>{%22}
%24 = stream.async.transfer %23 : !stream.resource<*>{%22} -> !stream.resource<external>{%22}
%25 = stream.tensor.export %24 : tensor<?x?xf32>{%10, %11} in !stream.resource<external>{%22} -> !hal.buffer_view
util.return %25 : !hal.buffer_view
}
```

Now I can start teaching the stream side (e.g., `stream.tensor.sizeof`) about the new attribute.

(The selected IR dumps can be found at https://gist.github.com/hanhanW/8d4f77c6903ca773c6f60098b5e541b1)
Nice progress on decoupling! I'll need to look more tomorrow, but I don't think the layering is quite right here - we don't want encoding attributes on ops but the encoding attribute on tensors (there's literally an encoding attribute on the tensor type for this).

We also need to decouple the exact padding and any per-dispatch information from the head of the pipeline - what we're trying to introduce is an aspect where something "may be" padded but not explicitly specify the padding. That's easiest if we can avoid needing to know the padding for as long as possible. We just need to know something may have padding, and then we can ask what that padding is once we know where the tensor is used (in EncodeHostTensors once affinities are assigned).

(in the multi-device world, all stream ops now have affinities assigned, so ops are like …)
(I'll have to think more in the morning about the heterogeneous cases and how we resolve those, but it's likely to be an SSA value that lets us get the encoding of a tensor and pass that around - so instead of tensor.dim there'd be …)
This is very tricky with what we have today. Encoding is an attribute, not an op, but we want to somehow have the encoding show up in the IR. IR without my changes, after converting to stream:

```mlir
%15:2 = iree_linalg_ext.upper_bound_tile_size tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>> -> index, index
%16 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%15#1, %1]
%17 = affine.apply affine_map<()[s0, s1] -> ((s1 ceildiv s0) * s0)>()[%15#0, %0]
%c0 = arith.constant 0 : index
%18 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>]>>{%17, %16} : index
```

With my changes, the size-calculation logic is dropped. IR dump:

```mlir
%15 = stream.tensor.sizeof
tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], user_indexing_maps = [#map, #map1, #map2]>>{%0, %1}
{encoding.round_dims_to = #iree_codegen.encoding.round_dims_to<16>} : index
```

We can only have one encoding on tensor types today, so I need to attach the information as an attribute. Do you think we should teach it to support this?
you can either teach it to support it (which I think is the easiest/best option - move it out of iree_linalg_ext and into util or something else and let's standardize it for ourselves) or teach the rounding encoding to nest - so the rounding attribute wraps the existing encoding attribute.
I'll try to teach the encoding attribute to support it.
I tried a couple of things and I realized that we can just carry the rounding information on the encoding attribute itself:

```mlir
#iree_linalg_ext.encoding<
role = LHS,
element_types = [f32, f32, f32],
original_type = tensor<?x?xf32>,
user_indexing_maps = [#map, #map1, #map2],
// Below is the new parameter! It can be a list because `role` and `maps` provide enough information
  round_dims_to = 16 : index>
```

The IR after SetEncoding could be cleaner -- no `iree_linalg_ext.upper_bound_tile_size` ops at all:

```mlir
util.func public @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub, iree.reflection = {iree.abi.declaration = "sync func @matmul_accumulate_DYNxDYNxf32_times_DYNxDYNxf32_into_DYNxDYNxf32(%input0: tensor<?x?xf32>, %input1: tensor<?x?xf32>, %input2: tensor<?x?xf32>) -> (%output0: tensor<?x?xf32>)"}} {
%c1 = arith.constant 1 : index
%c0 = arith.constant 0 : index
%0 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[0] : index
%1 = hal.buffer_view.dim<%arg0 : !hal.buffer_view>[1] : index
%2 = hal.tensor.import %arg0 "input0" : !hal.buffer_view -> tensor<?x?xf32>{%0, %1}
%3 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[0] : index
%4 = hal.buffer_view.dim<%arg1 : !hal.buffer_view>[1] : index
%5 = hal.tensor.import %arg1 "input1" : !hal.buffer_view -> tensor<?x?xf32>{%3, %4}
%6 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[0] : index
%7 = hal.buffer_view.dim<%arg2 : !hal.buffer_view>[1] : index
%8 = hal.tensor.import %arg2 "input2" : !hal.buffer_view -> tensor<?x?xf32>{%6, %7}
%9 = iree_linalg_ext.set_encoding %2 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>
%10 = iree_linalg_ext.set_encoding %5 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = RHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>
%11 = iree_linalg_ext.set_encoding %8 : tensor<?x?xf32> -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>
%12 = linalg.matmul ins(%9, %10 : tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>, tensor<?x?xf32, #iree_linalg_ext.encoding<role = RHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>) outs(%11 : tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>) -> tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>
%dim = tensor.dim %8, %c0 : tensor<?x?xf32>
%dim_0 = tensor.dim %8, %c1 : tensor<?x?xf32>
%13 = iree_linalg_ext.unset_encoding %12 : tensor<?x?xf32, #iree_linalg_ext.encoding<role = RESULT, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>> -> tensor<?x?xf32>
%extracted_slice = tensor.extract_slice %13[0, 0] [%dim, %dim_0] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
%14 = hal.tensor.export %extracted_slice "output0" : tensor<?x?xf32>{%6, %7} -> !hal.buffer_view
util.return %14 : !hal.buffer_view
}
```

On the stream side, we have `stream.tensor.sizeof` on a tensor with such an encoding, so the sizeof op carries everything needed to compute the padded size:

```mlir
%15 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16 : index>>{%0, %1} : index
```

On the codegen side, we need to teach the materialization patterns to add padding values to the pack op, which should be easy. I'm going to …
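For reference, the kind of pack the materialization patterns would need to emit for a padded encoding - a minimal sketch, assuming a 16x16 tile and zero as the padding value (not the exact IR the patterns produce):

```mlir
func.func @pack_with_padding(%src: tensor<?x?xf32>) -> tensor<?x?x16x16xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c16 = arith.constant 16 : index
  %cst = arith.constant 0.0 : f32
  // Compute the number of 16x16 tiles, rounding each dim up.
  %d0 = tensor.dim %src, %c0 : tensor<?x?xf32>
  %d1 = tensor.dim %src, %c1 : tensor<?x?xf32>
  %t0 = arith.ceildivui %d0, %c16 : index
  %t1 = arith.ceildivui %d1, %c16 : index
  %dest = tensor.empty(%t0, %t1) : tensor<?x?x16x16xf32>
  // padding_value fills the out-of-bounds region of the trailing tiles.
  %pack = tensor.pack %src padding_value(%cst : f32)
      inner_dims_pos = [0, 1] inner_tiles = [16, 16]
      into %dest : tensor<?x?xf32> -> tensor<?x?x16x16xf32>
  return %pack : tensor<?x?x16x16xf32>
}
```

The `padding_value` is what makes the rounded-up sizes computed by `stream.tensor.sizeof` safe to use: the out-of-bounds region of the trailing tiles is explicitly filled rather than left uninitialized.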
Great! That'll be a good improvement and get us to a place where doing the more sophisticated stuff is just for optimization - I suspect we'll change how things are done so that we can refine this based on device placement to only pad as much as required by the devices particular tensors are produced/consumed on, but that's much easier to reason about when starting from this point. Thanks for puzzling through this :)
I have a chain of PRs that makes this happen (#17055), which depends on some refactoring PRs (i.e., #17040 and #17053). In the prototype:
Some PRs are out for review. If people want to take a look at further changes, see #17055. However, I hit a cyclic dependency issue when doing the refactoring, which happens between …
Here are the lit tests for the `stream.tensor.sizeof` lowering with the new encoding:

```mlir
#map = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d1)>
util.func public @sizeoflhsencoding(%arg0: index, %arg1: index) -> index {
%0 = stream.tensor.sizeof tensor<?x?xf32, #iree_linalg_ext.encoding<role = LHS, element_types = [f32, f32, f32], original_type = tensor<?x?xf32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = 16, 16, 16>>{%arg0, %arg1} : index
util.return %0 : index
}
// CHECK-LABEL: @sizeoflhsencoding
// CHECK-DAG: %[[C4:.+]] = arith.constant 4 : index
// CHECK-DAG: %[[C16:.+]] = arith.constant 16 : index
// CHECK: %[[CEIL_DIV_D0:.+]] = arith.ceildivui %arg0, %[[C16]]
// CHECK: %[[PAD_D0:.+]] = arith.muli %[[CEIL_DIV_D0]], %[[C16]]
// CHECK: %[[CEIL_DIV_D1:.+]] = arith.ceildivui %arg1, %[[C16]]
// CHECK: %[[PAD_D1:.+]] = arith.muli %[[CEIL_DIV_D1]], %[[C16]]
// CHECK: %[[T0:.+]] = arith.muli %[[PAD_D0]], %[[C4]]
// CHECK: %[[T1:.+]] = arith.muli %[[T0]], %[[PAD_D1]]
// CHECK: return %[[T1]]
```
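As a sanity check of what that lowered arithmetic computes (illustrative values): for %arg0 = 100 and %arg1 = 7, the rounded dims are ceildiv(100, 16) * 16 = 112 and ceildiv(7, 16) * 16 = 16, so the computed size is 112 * 4 * 16 = 7168 bytes, the constant 4 being the f32 element size.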
Current overview
This is intended to be a tracking issue for adding support for data tiling on GPU backends. "Data tiling" here is being used to describe strategies for reorganizing the layout of tensors in memory to allow for better access patterns. So far this has been built out only for LLVMCPU; however, the same principles should apply to GPU backends. The general flow for data tiling (targeting mmt4d and optionally ukernels) on CPU is summarized as:
1. Set encodings on the ops of interest (SetEncoding).
   a. Encodings capture how a particular tensor is used. This allows materializing a concrete layout during codegen when we know the target explicitly.
   b. Currently MaterializeHomogenousEncodings will materialize these encodings as an explicit pack at the global optimization level to enable fusion. This breaks point (a) for multi-device.
2. Materialize the encodings into a concrete layout (MaterializeEncoding).
   a. Currently this pass includes a step in the HAL conversion to materialize the encoded tensors on the host side.

Materializing encoded tensors on the host side won't work for GPU without having a forced dependency between codegen backends, which we should avoid propagating.
Hand picked cases of interest
This is a list of a few examples of operations we're interested in focusing on for encodings.
Matmul
The first obvious case is a simple matmul.
For CPU backends, the set-encoding ops will turn into `tensor.pack` ops, the GEMM will turn into a `linalg.mmt4d` op, and the unset-encoding op will turn into `tensor.unpack`. This is essentially packing to the size of the tile processed by a single iteration of the inner hot loop of the GEMM. For certain GPU architectures, we might want to go a step further to match an internal layout of a target intrinsic. Take MFMA for the CDNA3 architecture as an example (see section 7 of this architecture document). For the F16_16x16x16_F32 intrinsic, we might want three levels of packing.

For such cases, a single pack + mmt4d op would not work. This poses a problem for kernel config, as the CPU approach of materialize encoding -> kernel config requires kernel config to relearn what the multi-level packed generic means. Additionally, the approach of materializing packs at the GlobalOptimization level by way of MaterializeHomogenousEncodings makes this even more difficult, as there's no guarantee that some later transformation doesn't mess up the packed generic (also, we might want to materialize more than a single pack, which could cause further problems with fusion).
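For contrast with the multi-level GPU packing, here is a minimal sketch of the single-level CPU-style materialization described above. The 16x16x1 (M0xN0xK0) tile sizes and the static shapes are illustrative assumptions, not what IREE would actually select for a given target:

```mlir
func.func @mmt4d_style(%lhs: tensor<256x512xf32>, %rhs: tensor<512x128xf32>,
                       %acc: tensor<16x8x16x16xf32>) -> tensor<256x128xf32> {
  // Pack LHS (MxK) into M/16 x K x 16 x 1 tiles.
  %lhs_e = tensor.empty() : tensor<16x512x16x1xf32>
  %lhs_p = tensor.pack %lhs inner_dims_pos = [0, 1] inner_tiles = [16, 1]
      into %lhs_e : tensor<256x512xf32> -> tensor<16x512x16x1xf32>
  // Pack RHS (KxN) into N/16 x K x 16 x 1 tiles (note the transposed outer dims).
  %rhs_e = tensor.empty() : tensor<8x512x16x1xf32>
  %rhs_p = tensor.pack %rhs outer_dims_perm = [1, 0] inner_dims_pos = [1, 0]
      inner_tiles = [16, 1]
      into %rhs_e : tensor<512x128xf32> -> tensor<8x512x16x1xf32>
  // Tiled matmul on the packed operands.
  %res = linalg.mmt4d ins(%lhs_p, %rhs_p : tensor<16x512x16x1xf32>, tensor<8x512x16x1xf32>)
      outs(%acc : tensor<16x8x16x16xf32>) -> tensor<16x8x16x16xf32>
  // Unpack the result back to row-major MxN.
  %out_e = tensor.empty() : tensor<256x128xf32>
  %out = tensor.unpack %res inner_dims_pos = [0, 1] inner_tiles = [16, 16]
      into %out_e : tensor<16x8x16x16xf32> -> tensor<256x128xf32>
  return %out : tensor<256x128xf32>
}
```

The question in this section is whether further levels of packing beyond this are needed to match an intrinsic's register layout, which a single pack + mmt4d cannot directly express.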
On GPU backends the cost of encoding ops is much more likely to outweigh the gains from setting encodings, making fusion of encoding setting operations a requirement for good performance, so materializing multi-level packs like this at the GlobalOptimization level is unlikely to work.
Quantized Matmul
We've seen quantization become extremely important to model performance recently; in particular, fusing dequantization (quantization) operations with their consumers (producers) is necessary for good performance.
To set encodings for the consumer matmul (%35), and actually benefit from the encodings, we must propagate the encoding to the dispatch boundary, and thus to the fused dequantization operation. The way that encodings work today, however, requires padding to align the data to the target tile size. For a simple matmul, this works because the padding value is simply the invariant value for an FMA, which is zero. This does not work here though, because we would need to pick a constant padding value for the inputs to the dequantization operation, namely this math:
Which, when taking into account the way that scales and zero points are broadcasted, is not possible. Importantly, however, we don't actually care about doing the correct math to produce `f32 = 0.0` from the dequantization, but rather that the padded values never affect the final result. This could instead be codegen'ed by masking the execution for the innermost tile (and not even bothering to initialize the padding of the pack). This could be done by adding some kind of "poison" bit to an encoding/pack to indicate that the final padded tile gives no guarantee as to the value of the out-of-bounds data, e.g.
Where `_` indicates uninitialized or arbitrarily filled data.
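To spell out why no constant padding value works here (assuming the usual affine dequantization real = (q - zero_point) * scale; the exact formula in the elided snippet above may differ): a padding value q_pad would have to satisfy (q_pad - z_g) * s_g = 0 for every quantization group g, i.e. q_pad = z_g for every group, which has no single solution once the broadcasted zero points differ across groups. Hence the "don't care"/poison approach above.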
Fusion of pack with producers
As noted earlier, pack fusion is likely required to see significant performance gains on GPU backends, given the often comparatively marginal gains from data tiling on desktop/server-grade GPUs and the additional overhead of allocating memory for, and launching, the encoding kernel. Fusing the pack with a consumer defeats the purpose of data tiling, so we need to be able to fuse a pack with its producers. Let's think through what that might look like for a case like this:
In other words, we need to write back a different encoding from the producer matmul, which could include padding. There are three options for this fusion:
1. Write the padding from within the producer dispatch itself:
   a. Dedicate some threads of that matmul kernel to writing back the padding value.
   b. Use a few threads within subgroups writing data "near" the padded values to also write the padding as well as the value computed with that thread.
2. Launch a separate kernel to write the padding.
For architectures like RDNA3 and CDNA1/2/3, which rely on subgroup operations to achieve good performance on matmuls, dedicating some threads of the matmul to writing the padding value is a significant mismatch in resource costs (dedicating threads of a resource-intensive matmul kernel to just writing zeros is unlikely to perform well), so 1a is a bad option. Option 2 is more reasonable and composes nicely with graph-based execution (e.g. HIP graphs / CUDA graphs); however, it requires an extra kernel launch, plus logic to elide the extra kernel in cases where padding is not required. 1b is probably the hardest to implement but makes the most sense, assuming the amount of padding is smaller than the number of threads launched (reasonable in most real-world cases, except for completely arbitrary dynamic shapes that happen to end up being small at runtime, but those should be handled with specialization).
Attention
Attention internally does matrix operations and thus is another good target for data tiling. The current implementation of the attention operation, however, requires exactly 3d inputs and thus does not allow for setting encodings. One option could be to materialize encodings for attention after decomposing the operation, but the different options have not really been explored.
Convolution
Data tiling of convolution seems to primarily be a subset of what is required for matmul, as tiling the image dimensions of the convolution is not possible. TODO: Add more details here as necessary.
Task List
TODO