
Crash in LLVMGPUTileAndDistribute / Tiling interface #12824

Open
nicolasvasilache opened this issue Mar 29, 2023 · 13 comments
Labels
bug 🐞 Something isn't working codegen/nvvm NVVM code generation compiler backend


@nicolasvasilache
Contributor

What happened?

I have been iterating on graph-level transformations and added this repro to iree-samples.

This is dependent on #12822

Running `iree-compile --iree-hal-target-backends=cuda transform_dialect/graph/cuda/bugs/1-ir.mlir` crashes with:

```
libIREECompiler.so.0`std::_Function_handler<llvm::SmallVector<mlir::Value, 4u> (mlir::OpBuilder&, mlir::Operation*), mlir::iree_compiler::populateTilingToInvocationPatterns(mlir::RewritePatternSet&, llvm::SmallVectorImpl<long>&)::$_5>::_M_invoke(__functor=<unavailable>, __args=<unavailable>, __args=<unavailable>) at std_function.h:290:9
    frame #11: 0x00007ffff5298579 libIREECompiler.so.0`tileInterfaceOp(mlir::OpBuilder&, mlir::TilingInterface, mlir::linalg::LinalgTilingOptions const&) [inlined] std::function<llvm::SmallVector<mlir::Value, 4u> (mlir::OpBuilder&, mlir::Operation*)>::operator()(this=0x00007fff483844f8, __args=0x00007fff72ffa728, __args=0x00007fff483006c0) const at std_function.h:591:9
    frame #12: 0x00007ffff5298554 libIREECompiler.so.0`tileInterfaceOp(b=0x00007fff72ffa728, tilableOp=TilingInterface @ 0x00007fff72ff9de0, options=0x00007fff483844f8) at Tiling.cpp:202:7
    frame #13: 0x00007ffff5299e03 libIREECompiler.so.0`mlir::iree_compiler::IREE::LinalgExt::TilingInterfaceBaseTilingPattern::matchAndRewriteBase(this=0x00007fff48384410, tilableOp=TilingInterface @ 0x000016d57f052f40, rewriter=0x00007fff72ffa720, result=0x00007fff72ffa248) const at Tiling.cpp:260:28
    frame #14: 0x00007ffff5299fd1 libIREECompiler.so.0`mlir::iree_compiler::IREE::LinalgExt::TilingInterfaceTilingPattern::matchAndRewrite(this=0x00007fff48384410, tilableOp=TilingInterface @ 0x000016d57f053450, rewriter=0x00007fff72ffa720) const at Tiling.cpp:283:48
    frame #15: 0x00007ffff727f168 libIREECompiler.so.0`mlir::PatternApplicator::matchAndRewrite(this=0x00007fff72ffa8a0, op=<unavailable>, rewriter=0x00007fff72ffa720, canApply=mlir::function_ref<bool (const mlir::Pattern &)> @ 0x000016d57f053b70, onFailure=mlir::function_ref<void (const mlir::Pattern &)> @ 0x00007fff72ffa5f0, onSuccess=mlir::function_ref<mlir::LogicalResult (const mlir::Pattern &)> @ 0x00007fff72ffa600) at PatternApplicator.cpp:200:25
    frame #16: 0x00007ffff724e936 libIREECompiler.so.0`(anonymous namespace)::GreedyPatternRewriteDriver::processWorklist(this=0x00007fff72ffa720) at GreedyPatternRewriteDriver.cpp:229:19
    frame #17: 0x00007ffff724c04d libIREECompiler.so.0`mlir::applyPatternsAndFoldGreedily(mlir::Region&, mlir::FrozenRewritePatternSet const&, mlir::GreedyRewriteConfig) [inlined] (anonymous namespace)::RegionPatternRewriteDriver::simplify(this=0x00007fff72ffa720) && at GreedyPatternRewriteDriver.cpp:458:15
    frame #18: 0x00007ffff724be73 libIREECompiler.so.0`mlir::applyPatternsAndFoldGreedily(region=0x0000555555ef9b20, patterns=0x00007fff72ffab80, config=GreedyRewriteConfig @ 0x00007fff72ffa920) at GreedyPatternRewriteDriver.cpp:487:47
    frame #19: 0x00007ffff422e008 libIREECompiler.so.0`mlir::iree_compiler::(anonymous namespace)::LLVMGPUTileAndDistributePass::runOnOperation() at GreedyPatternRewriteDriver.h:115:15
    frame #20: 0x00007ffff422df66 libIREECompiler.so.0`mlir::iree_compiler::(anonymous namespace)::LLVMGPUTileAndDistributePass::runOnOperation(this=0x00007fff483740e0) at LLVMGPUTileAndDistribute.cpp:266:18
    frame #21: 0x00007ffff72a8085 libIREECompiler.so.0`mlir::detail::OpToOpPassAdaptor::run(mlir::Pass*, mlir::Operation*, mlir::AnalysisManager, bool, unsigned int) at Pass.cpp:482:17
```



@nicolasvasilache
Contributor Author

With the latest integrate, this crash is hiding the error I used to see, which was:

```
  // IREE fails lowering of tensor.pack/unpack ops with:
  // <stdin>:23295:20: error: 'tensor.extract_slice' op unhandled operation when 
  // converting to destination passing style
  // %unpack_3675 = tensor.unpack %7830 inner_dims_pos = [0, 1] 
  //                inner_tiles = [384, 16] into %7826 
  //                : tensor<1x1x384x16xf32> -> tensor<384x2xf32>
```

I expect that fixing the crash will surface that error again; I'll file a separate bug once that is confirmed.

@MaheshRavishankar
Contributor

I'll wait for #12822 to land before looking further.

@nicolasvasilache
Contributor Author

nicolasvasilache commented Mar 30, 2023

This is a simpler repro

@MaheshRavishankar
Contributor

> This is a simpler repro

Looking at the CPU side, I think a missing folder here would help.

```mlir
    %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
    %pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
```

This could just be folded into

```mlir
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
```

Then a lot of the backend code generation works out better. Basically (at least on the CPU side; I haven't looked at the GPU side yet, but it is probably similar), generating code for the fill -> pack pair using tile-and-fuse plus bufferization results in a stack allocation. We can try to address that, but a folder at the graph level makes all of these problems go away. I tried your example with those changes applied manually, and most of the code generation works as expected (see the dump of IR after each pass here: https://gist.github.com/MaheshRavishankar/115fcb30c47135c53a16355ea27387d8).
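
A minimal sketch of what such a graph-level folder could look like, assuming roughly the MLIR C++ APIs of this era; the pattern name and exact accessor spellings are mine, not IREE's actual implementation:

```cpp
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/IR/PatternMatch.h"

using namespace mlir;

// Fold `linalg.fill -> tensor.pack` into a single `linalg.fill` on the pack's
// destination, as described above.
struct FoldFillIntoPackSketch : public OpRewritePattern<tensor::PackOp> {
  using OpRewritePattern<tensor::PackOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(tensor::PackOp packOp,
                                PatternRewriter &rewriter) const override {
    // Only handle pack ops whose source is produced by a linalg.fill.
    auto fillOp = packOp.getSource().getDefiningOp<linalg::FillOp>();
    if (!fillOp)
      return failure();
    Value fillValue = fillOp.getDpsInputOperand(0)->get();
    // The fold is only valid when the padding value (if any) is the same SSA
    // value as the fill value; otherwise the padded region would differ.
    if (Value padValue = packOp.getPaddingValue())
      if (padValue != fillValue)
        return failure();
    // Replace fill -> pack with one fill directly on the pack's destination
    // (e.g. the tensor<8x16x128x256xf32> init in the snippet above).
    rewriter.replaceOpWithNewOp<linalg::FillOp>(
        packOp, ValueRange{fillValue}, ValueRange{packOp.getDest()});
    return success();
  }
};
```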

Some points of interest: after dispatch region formation, this is the IR:

```mlir
// -----// IR Dump After FormDispatchWorkgroups (iree-flow-form-dispatch-workgroups) //----- //
func.func @fill_matmul_static(%arg0: !hal.buffer_view, %arg1: !hal.buffer_view, %arg2: !hal.buffer_view) -> !hal.buffer_view attributes {iree.abi.stub} {
  %c8 = arith.constant 8 : index
  %c63 = arith.constant 63 : index
  %c16 = arith.constant 16 : index
  %c1 = arith.constant 1 : index
  %c128 = arith.constant 128 : index
  %c256 = arith.constant 256 : index
  %c3999 = arith.constant 3999 : index
  %c999 = arith.constant 999 : index
  %0 = hal.tensor.import %arg0 "input 0" : !hal.buffer_view -> tensor<999x1999xf32>
  %1 = hal.tensor.import %arg1 "input 1" : !hal.buffer_view -> tensor<1999x3999xf32>
  %2 = hal.tensor.import %arg2 "input 2" : !hal.buffer_view -> tensor<999x3999xf32>
  %3 = flow.dispatch.workgroups[%c8, %c63](%0) : (tensor<999x1999xf32>) -> tensor<8x63x128x32xf32> =
      (%arg3: !flow.dispatch.tensor<readonly:tensor<999x1999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [999, 1999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x1999xf32>> -> tensor<999x1999xf32>
    %9 = tensor.empty() : tensor<8x63x128x32xf32>
    %pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 32] into %9 : tensor<999x1999xf32> -> tensor<8x63x128x32xf32>
    flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : tensor<8x63x128x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x63x128x32xf32>>
    flow.return
  } count(%arg3: index, %arg4: index) -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
    flow.return %x, %y, %z : index, index, index
  }
  %4 = flow.dispatch.workgroups[%c63, %c16](%1) : (tensor<1999x3999xf32>) -> tensor<63x16x256x32xf32> =
      (%arg3: !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>>, %arg4: !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0], sizes = [1999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<1999x3999xf32>> -> tensor<1999x3999xf32>
    %9 = tensor.empty() : tensor<63x16x256x32xf32>
    %pack = tensor.pack %8 padding_value(%cst : f32) inner_dims_pos = [1, 0] inner_tiles = [256, 32] into %9 : tensor<1999x3999xf32> -> tensor<63x16x256x32xf32>
    flow.dispatch.tensor.store %pack, %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : tensor<63x16x256x32xf32> -> !flow.dispatch.tensor<writeonly:tensor<63x16x256x32xf32>>
    flow.return
  } count(%arg3: index, %arg4: index) -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
    flow.return %x, %y, %z : index, index, index
  }
  %5 = flow.dispatch.workgroups[%c8, %c16, %c128, %c256, %c1, %c1](%3, %4) : (tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) -> tensor<8x16x128x256xf32> =
      (%arg3: !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>) {
    %cst = arith.constant 0.000000e+00 : f32
    %8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 63, 128, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x63x128x32xf32>> -> tensor<8x63x128x32xf32>
    %9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0, 0, 0], sizes = [63, 16, 256, 32], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<63x16x256x32xf32>> -> tensor<63x16x256x32xf32>
    %10 = tensor.empty() : tensor<8x16x128x256xf32>
    %11 = linalg.fill ins(%cst : f32) outs(%10 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
    %12 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d4, d2, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d4, d1, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "reduction", "reduction"]} ins(%8, %9 : tensor<8x63x128x32xf32>, tensor<63x16x256x32xf32>) outs(%11 : tensor<8x16x128x256xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %13 = arith.mulf %in, %in_0 : f32
      %14 = arith.addf %out, %13 : f32
      linalg.yield %14 : f32
    } -> tensor<8x16x128x256xf32>
    flow.dispatch.tensor.store %12, %arg5, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : tensor<8x16x128x256xf32> -> !flow.dispatch.tensor<writeonly:tensor<8x16x128x256xf32>>
    flow.return
  } count(%arg3: index, %arg4: index, %arg5: index, %arg6: index, %arg7: index, %arg8: index) -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6, %arg7, %arg8
    flow.return %x, %y, %z : index, index, index
  }
  %6 = flow.dispatch.workgroups[%c999, %c3999](%5, %2) : (tensor<8x16x128x256xf32>, tensor<999x3999xf32>) -> tensor<999x3999xf32> =
      (%arg3: !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>, %arg4: !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>, %arg5: !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>) {
    %8 = flow.dispatch.tensor.load %arg3, offsets = [0, 0, 0, 0], sizes = [8, 16, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<8x16x128x256xf32>
    %9 = flow.dispatch.tensor.load %arg4, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<999x3999xf32>
    %unpack = tensor.unpack %8 inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %9 : tensor<8x16x128x256xf32> -> tensor<999x3999xf32>
    flow.dispatch.tensor.store %unpack, %arg5, offsets = [0, 0], sizes = [999, 3999], strides = [1, 1] : tensor<999x3999xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
    flow.return
  } count(%arg3: index, %arg4: index) -> (index, index, index) {
    %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4
    flow.return %x, %y, %z : index, index, index
  }
  %7 = hal.tensor.export %6 "output 0" : tensor<999x3999xf32> -> !hal.buffer_view
  return %7 : !hal.buffer_view
}
```

That looks good.

The issue seems to be with the unpack dispatch; something very strange is happening with bufferization. This is the IR for the unpack dispatch before bufferization:

```mlir
module {
  func.func @fill_matmul_static_dispatch_3() {
    %cst = arith.constant 0.000000e+00 : f32
    %c256 = arith.constant 256 : index
    %c128 = arith.constant 128 : index
    %c3999 = arith.constant 3999 : index
    %c999 = arith.constant 999 : index
    %c41287680 = arith.constant 41287680 : index
    %c0 = arith.constant 0 : index
    %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>>
    %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>>
    %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
    %workgroup_id_x = hal.interface.workgroup.id[0] : index
    %workgroup_count_x = hal.interface.workgroup.count[0] : index
    %workgroup_id_y = hal.interface.workgroup.id[1] : index
    %workgroup_count_y = hal.interface.workgroup.count[1] : index
    %3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
    %4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
    scf.for %arg0 = %3 to %c999 step %4 {
      %5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
      %6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
      %7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
      scf.for %arg1 = %6 to %c3999 step %7 {
        %8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
        %9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
        %10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
        %11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
        %12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
        %13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
          %14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
          %15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
            %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
            %17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
            %18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
            %extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
            %19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
            %extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
            %20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
            %inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
            %inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
            scf.yield %inserted_slice_1 : tensor<?x?xf32>
          }
          scf.yield %15 : tensor<?x?xf32>
        }
        flow.dispatch.tensor.store %13, %2, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : tensor<?x?xf32> -> !flow.dispatch.tensor<writeonly:tensor<999x3999xf32>>
      }
    }
    return
  }
}
```

and this is the IR after bufferization:

```mlir
// -----// IR Dump Before CleanupBufferAllocView (iree-codegen-cleanup-buffer-alloc-view) //----- //
func.func @fill_matmul_static_dispatch_3() {
  %cst = arith.constant 0.000000e+00 : f32
  %c256 = arith.constant 256 : index
  %c128 = arith.constant 128 : index
  %c3999 = arith.constant 3999 : index
  %c999 = arith.constant 999 : index
  %c41287680 = arith.constant 41287680 : index
  %c0 = arith.constant 0 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c41287680) flags(ReadOnly) : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %0, 64 : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %1, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
  %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
  memref.assume_alignment %2, 64 : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %3 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_id_y]
  %4 = affine.apply affine_map<()[s0] -> (s0 * 128)>()[%workgroup_count_y]
  scf.for %arg0 = %3 to %c999 step %4 {
    %5 = affine.min affine_map<(d0) -> (-d0 + 999, 128)>(%arg0)
    %6 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_id_x]
    %7 = affine.apply affine_map<()[s0] -> (s0 * 256)>()[%workgroup_count_x]
    scf.for %arg1 = %6 to %c3999 step %7 {
      %8 = affine.min affine_map<(d0) -> (-d0 + 3999, 256)>(%arg1)
      %9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
      %10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
      %subview = memref.subview %0[%9, %10, 0, 0] [1, 1, 128, 256] [1, 1, 1, 1] : memref<8x16x128x256xf32, #hal.descriptor_type<storage_buffer>> to memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
      %subview_0 = memref.subview %1[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
      %alloca = memref.alloca(%5, %8) {alignment = 64 : i64} : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>
      linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%subview_0 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) outs(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      }
      scf.for %arg2 = %c0 to %5 step %c128 {
        %11 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
        scf.for %arg3 = %c0 to %8 step %c256 {
          %12 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg3)[%8]
          %13 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
          %14 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg3)
          %subview_2 = memref.subview %alloca[%arg2, %arg3] [%11, %12] [1, 1] : memref<?x?xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
          %15 = vector.transfer_read %subview[%13, %14, %c0, %c0], %cst {in_bounds = [true, true]} : memref<1x1x128x256xf32, strided<[524288, 32768, 256, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>, vector<128x256xf32>
          vector.transfer_write %15, %subview_2[%c0, %c0] : vector<128x256xf32>, memref<?x?xf32, strided<[?, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
        }
      }
      %subview_1 = memref.subview %2[%arg0, %arg1] [%5, %8] [1, 1] : memref<999x3999xf32, #hal.descriptor_type<storage_buffer>> to memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>
      linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%alloca : memref<?x?xf32, #hal.descriptor_type<storage_buffer>>) outs(%subview_1 : memref<?x?xf32, strided<[3999, 1], offset: ?>, #hal.descriptor_type<storage_buffer>>) {
      ^bb0(%in: f32, %out: f32):
        linalg.yield %in : f32
      }
    }
  }
  return
}
```

So, something went off the rails in bufferization.

@MaheshRavishankar
Contributor

MaheshRavishankar commented Mar 30, 2023

Oh, and `--iree-llvmcpu-fail-on-out-of-bounds-stack-allocation=false` avoids the compilation errors. It compiles, but it takes a really long time because the vectors being used are huge (16x128x256 if I am reading this right). The compilation errors are there to flag things that might be going off the rails, so removing those checks puts you on a bad compilation path.

@allieculp allieculp added codegen/llvm LLVM code generation compiler backend and removed awaiting-triage labels Mar 30, 2023
@hanhanW
Contributor

hanhanW commented Mar 31, 2023

The scf.for ops should not return tensor types. We usually run the hoistRedundantVectorTransfers and hoistRedundantVectorTransfersOnTensor methods to get rid of the tensor types. I tried that, but it doesn't work; the utility can't handle this case at the moment.

The dests of the transfer_write, extract_slice, and insert_slice ops all revolve around the iter_arg %arg5. Can we enhance the hoisting utils to handle this case?

```mlir
        %9 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg0)
        %10 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg1)
        %11 = flow.dispatch.tensor.load %0, offsets = [%9, %10, 0, 0], sizes = [1, 1, 128, 256], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<8x16x128x256xf32>> -> tensor<1x1x128x256xf32>
        %12 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1], sizes = [%5, %8], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<999x3999xf32>> -> tensor<?x?xf32>
        %13 = scf.for %arg2 = %c0 to %5 step %c128 iter_args(%arg3 = %12) -> (tensor<?x?xf32>) {
          %14 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 128)>(%arg2)[%5]
          %15 = scf.for %arg4 = %c0 to %8 step %c256 iter_args(%arg5 = %arg3) -> (tensor<?x?xf32>) {
            %16 = affine.min affine_map<(d0)[s0] -> (-d0 + s0, 256)>(%arg4)[%8]
            %17 = affine.apply affine_map<(d0) -> (d0 floordiv 128)>(%arg2)
            %18 = affine.apply affine_map<(d0) -> (d0 floordiv 256)>(%arg4)
            %extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
            %19 = vector.transfer_read %11[%17, %18, %c0, %c0], %cst {in_bounds = [true, true]} : tensor<1x1x128x256xf32>, vector<128x256xf32>
            %extracted_slice_0 = tensor.extract_slice %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
            %20 = vector.transfer_write %19, %extracted_slice_0[%c0, %c0] : vector<128x256xf32>, tensor<?x?xf32>
            %inserted_slice = tensor.insert_slice %20 into %extracted_slice[0, 0] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
            %inserted_slice_1 = tensor.insert_slice %inserted_slice into %arg5[%arg2, %arg4] [%14, %16] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
            scf.yield %inserted_slice_1 : tensor<?x?xf32>
          }
          scf.yield %15 : tensor<?x?xf32>
        }
```
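
For reference, a minimal sketch of how the two hoisting utilities named above are typically driven from a pass, assuming the MLIR signatures of this era (the wrapper function itself is hypothetical):

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Linalg/Transforms/Hoisting.h"

// Run the hoisting utilities on a function so that redundant
// vector.transfer_read/transfer_write pairs rotating through scf.for
// iter_args (like %arg5 above) can be moved out of the loop nest.
static void runVectorTransferHoisting(mlir::func::FuncOp funcOp) {
  mlir::linalg::hoistRedundantVectorTransfers(funcOp);
  mlir::linalg::hoistRedundantVectorTransfersOnTensor(funcOp);
}
```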

@aaron-schneider

Ping on this P0 bug?

@julianwa julianwa added this to the Sprint: Compilation WS 1 milestone Apr 5, 2023
@MaheshRavishankar MaheshRavishankar added codegen/nvvm NVVM code generation compiler backend and removed codegen/llvm LLVM code generation compiler backend labels Apr 5, 2023
@MaheshRavishankar
Contributor

Trying to narrow the scope of this bug and boil it down to some concrete action items. There are broadly two options for converting a matmul to a shape that is better suited for codegen (an MLIR sketch of both, applied to this issue's matmul, follows the list):

  • Use pad/slice to just change the shape, but not the rank, of the operations.
  • Use pack/unpack to move the operation into a higher-dimensional compute operation.
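
To make the two options concrete, here is a hand-written MLIR sketch of each, applied to the LHS of the matmul in this issue (the 999x1999 operand with its 128x32 tile from the pack dispatch above). The SSA names (%lhs, %init, %cst) and the padded sizes (999 -> 1024 = 8*128, 1999 -> 2016 = 63*32) are mine, for illustration only:

```mlir
// Assumed values for the sketch: %lhs is the tensor<999x1999xf32> input.
%cst = arith.constant 0.000000e+00 : f32
%init = tensor.empty() : tensor<8x63x128x32xf32>

// Option 1: pad/slice -- the shape changes, the rank stays 2.
%lhs_padded = tensor.pad %lhs low[0, 0] high[25, 17] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<999x1999xf32> to tensor<1024x2016xf32>

// Option 2: pack/unpack -- the same padding, but the data is also re-laid-out
// into a higher-rank tensor whose trailing dims are the tile (this matches the
// pack in the first dispatch above).
%lhs_packed = tensor.pack %lhs padding_value(%cst : f32)
    inner_dims_pos = [0, 1] inner_tiles = [128, 32]
    into %init : tensor<999x1999xf32> -> tensor<8x63x128x32xf32>
```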

With both of these options, the end state we should be looking for is four dispatches:

  • One for pack/pad for the LHS
  • One for pack/pad for the RHS
  • One for the contraction-like operation
  • One for the unpack/slice of the result.

To achieve this for the pack/unpack path on the small repro added here (#12824 (comment)), we need to add a folding pattern.

Task 1: Add a folding pattern for

```mlir
%1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<999x3999xf32>) -> tensor<999x3999xf32>
%pack_4 = tensor.pack %1 padding_value(%cst_3 : f32) inner_dims_pos = [0, 1] inner_tiles = [128, 256] into %4 : tensor<999x3999xf32> -> tensor<8x16x128x256xf32>
```

to

```mlir
%fill_folded = linalg.fill ins(%cst : f32) outs(%4 : tensor<8x16x128x256xf32>) -> tensor<8x16x128x256xf32>
```

Even without this we will get four dispatches, but one of them will be a fused fill -> pack dispatch. We could fold that after the fact as well, but without the folding the code generation gets unnecessarily complicated.

To achieve four dispatches on the pad/slice path, we need to make sure that the tensor.pad operation is not generalized into linalg.fill -> tensor.insert_slice. This is WIP since not all backends handle tensor.pad properly, but there is a flag, `--iree-flow-enable-pad-handling`, that prevents this generalization at the Flow level for now, making this a codegen issue.
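
For reference, a hypothetical invocation combining that flag with the repro command from the top of this issue (untested; only the flag spelling is taken from the text above):

```
iree-compile --iree-hal-target-backends=cuda \
  --iree-flow-enable-pad-handling \
  transform_dialect/graph/cuda/bugs/1-ir.mlir
```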

The next steps then are to look at having "good" code-generation for pack or pad.

Task 2: Interaction between pack and conversion to pad.

For the pack operation, if only some dimensions are packed and others are just padded, handling this requires some care. Pack operations are meant to lay the data out such that each "tile" is the working set of the core compute operation. If only the M and N dimensions are packed and K is just padded, then before tiling of the K dimension the pack operation needs to be converted into a pad-like representation. Tiling the inner dimensions of a pack operation goes against the design of the pack operation AFAICS.

@nicolasvasilache
Contributor Author

Thanks @MaheshRavishankar, let's break this down further, as I see some conflation here.

> There are broadly two options for converting a matmul to a shape that is better suited for codegen.

Sure, but this is not the problem that this issue refers to.
The exercise here is not a perf breakdown but solving a bug; how this bug was surfaced is orthogonal IMO.

The issue is: IREE receives the IR above and crashes in IREE::LinalgExt::TilingInterfaceBaseTilingPattern.
My understanding from offline discussion is that the issue happens in the GPU backend but that the same IR goes through the CPU backend fine.

So it seems the bug reduces to:

  • how does the GPU backend use TilingInterfaceBaseTilingPattern on pack/unpack ops differently than the CPU backend does?
  • why does IREE::LinalgExt::TilingInterfaceBaseTilingPattern crash, and can we make it return an actionable error message instead of a stack trace?

Now for the specific points looking ahead:

> Add a folding pattern for ...

Yes, this is a good follow-up once the crash is fixed; it will remove one dispatch once we can run end-to-end and measure.

> The next steps then are to look at having "good" code-generation for pack or pad.
> If only the M and N dimensions are packed and K is just padded, then before tiling of the K dimension the pack operation needs to be converted into a pad-like representation.

Yes, a reasonable codegen strategy for pack and pad seems to be lacking on the GPU side; there are many low-hanging-fruit opportunities. One of the avenues we have indeed been looking at is folding trivial packs into pads, which captures one of the use cases we care about for GPU.

> Tiling the inner dimensions of a pack operation goes against the design of the pack operation AFAICS.

This deserves another full thread.

@MaheshRavishankar
Contributor

> Thanks @MaheshRavishankar, let's break this down further, as I see some conflation here.
>
> > There are broadly two options for converting a matmul to a shape that is better suited for codegen.
>
> Sure, but this is not the problem that this issue refers to. The exercise here is not a perf breakdown but solving a bug; how this bug was surfaced is orthogonal IMO.
>
> The issue is: IREE receives the IR above and crashes in IREE::LinalgExt::TilingInterfaceBaseTilingPattern. My understanding from offline discussion is that the issue happens in the GPU backend but that the same IR goes through the CPU backend fine.
>
> So it seems the bug reduces to:
>
> • how does the GPU backend use TilingInterfaceBaseTilingPattern on pack/unpack ops differently than the CPU backend does?
> • why does IREE::LinalgExt::TilingInterfaceBaseTilingPattern crash, and can we make it return an actionable error message instead of a stack trace?

This part is confusing to me. I don't see any uses of IREE::LinalgExt::TilingInterfaceBaseTilingPattern on the CPU backend. That pattern is really legacy and should not be used. AFAIK the CPU side doesn't use it, and that might be the difference.

@allieculp allieculp removed this from the Sprint: Compilation Q2 2023 milestone Apr 18, 2023
@allieculp

From GPU sync: Nicolas out for 2 weeks, discuss priority offline.

@allieculp

@MaheshRavishankar @mattwalsh We should discuss priority and resourcing for this issue during GPU sync.

@allieculp

From the meeting today: holding this open for Nicolas until he returns to work; not a blocking item.
