[RISC-V] ~38% slowdown on MobileBert after #12894 integrate #12916

Closed
dcaballe opened this issue Apr 4, 2023 · 22 comments
Labels
  • codegen/llvm: LLVM code generation compiler backend
  • codegen/riscv: Code generation targeting RISC-V architecture
  • performance ⚡: Performance/optimization related work across the compiler and runtime

Comments

dcaballe (Contributor) commented Apr 4, 2023

Around 38% slowdown and 38% code size increase on MobileBert after #12894.

dcaballe added the codegen/llvm, awaiting-triage, and codegen/riscv labels on Apr 4, 2023
hanhanW (Contributor) commented Apr 10, 2023

Looks like it also regresses MobileBert on x86 in single-threaded mode:

[benchmark screenshot]

julianwa (Member) commented Apr 10, 2023

@hanhanW has anybody started root-causing this? I just marked it P0

hanhanW (Contributor) commented Apr 10, 2023

People have been busy with many things recently. No one has started on this, AFAIK.

allieculp commented

@mattwalsh for visibility as a P0 issue

allieculp commented

From triage meeting: Assigning to Diego to review when he returns this Thursday.

dcaballe added the performance label on Apr 13, 2023
dcaballe changed the title from "[RISC-V] ~38% slowdown on MobileBert after #12894 integrate" to "[X86][RISC-V] ~38% slowdown on MobileBert after #12894 integrate" on Apr 13, 2023
aaron-schneider commented

@julianwa and @dcaballe I saw a draft of a more complete regression analysis (with a task list of all of the issues filed). Is that ready to promote here?

pzread self-assigned this on Apr 14, 2023
pzread (Contributor) commented Apr 18, 2023

It looks like the regression is due to differences in the imported MLIRs, so the root cause might not be the LLVM bump but rather the TensorFlow bump.

Here are the imported MLIRs of TFLite MobileBertSquad_fp32 before and after the integration:

Before the integration (51dbeb8):
https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4599127855/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

After the integration (31bbc95):
https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4600147158/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

I used the iree-compile built at the integration commit to compile both MLIRs and got a ~38% difference in total dispatch size:

  • With MLIR from 51dbeb8 (the parent commit): 26088 bytes
  • With MLIR from 31bbc95 (the integration commit): 36216 bytes

Build command:

iree-compile input.mlir -o output.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=tosa --iree-llvmcpu-target-triple=riscv64-pc-linux-gnu --iree-llvmcpu-target-cpu=generic-rv64 --iree-llvmcpu-target-abi=lp64d --iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+zvl512b,+v --riscv-v-fixed-length-vector-lmul-max=8 --iree-vm-emit-polyglot-zip=true --iree-llvmcpu-debug-symbols=false -mlir-disable-threading
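
Since the module is built with --iree-vm-emit-polyglot-zip=true, the resulting .vmfb should also be readable as a plain zip archive, so the embedded dispatch sizes can be compared with a small script along these lines (a rough sketch, assuming the polyglot-zip entries cover the embedded executables; not part of the IREE tooling):

import sys
import zipfile

def embedded_sizes(vmfb_path):
    # With --iree-vm-emit-polyglot-zip=true the .vmfb opens as a zip archive;
    # its entries are the module's rodata segments, which for the llvm-cpu
    # backend are dominated by the embedded dispatch executables.
    with zipfile.ZipFile(vmfb_path) as z:
        return {info.filename: info.file_size for info in z.infolist()}

before, after = (embedded_sizes(p) for p in sys.argv[1:3])
total_before, total_after = sum(before.values()), sum(after.values())
print(f"before: {total_before} bytes, after: {total_after} bytes, "
      f"delta: {100.0 * (total_after - total_before) / total_before:+.1f}%")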

pzread (Contributor) commented Apr 18, 2023

Diffing the imported MLIRs, tensorflow/tensorflow@e0e966b looks suspicious: it changes the way the softmax op is converted into TOSA.

I built iree-import-tflite with that commit reverted in TensorFlow, imported the model into MLIR, and compiled it. The total dispatch size returned to 26056 bytes, which supports the guess.

Before that commit, the importer converts the softmax op into:

op1 = tosa.exp(logits)
op2 = tosa.reduce_sum(op1)
op3 = tosa.reciprocal(op2)
op4 = tosa.mul(op1, op3)

After that commit, it generates an extra tosa.reduce_max and tosa.sub:

op1 = tosa.reduce_max(logits)
op2 = tosa.sub(logits, op1)
op3 = tosa.exp(op2)
op4 = tosa.reduce_sum(op3)
op5 = tosa.reciprocal(op4)
op6 = tosa.mul(op3, op5)

I haven't fully understood why they made that change on the TensorFlow side; the reason seems to be correctness.
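
For context (an illustration on my part, not from the TensorFlow change itself): subtracting the row-wise max before the exponential is the standard numerically stable softmax formulation, so the extra reduce_max/sub buys overflow-free results for large logits at the cost of extra work. A minimal NumPy sketch of the difference:

import numpy as np

def softmax_naive(logits):
    e = np.exp(logits)  # overflows to inf for large logits
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(logits):
    # Shift so the largest exponent is 0; mathematically the same result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1000.0, 1001.0, 1002.0]], dtype=np.float32)
print(softmax_naive(x))   # [[nan nan nan]] -- exp(1000) overflows float32
print(softmax_stable(x))  # [[0.09003057 0.24472848 0.66524094]]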

I think that explains the size increase and performance regression. The problem doesn't seem to be in the integration itself; instead, we want to see how to generate better code for that pattern, or generate better TOSA for the softmax op.

pzread (Contributor) commented Apr 18, 2023

I think we can create another issue to improve the performance of the TensorFlow softmax op (and re-prioritize it), then close this issue as intended behavior. @dcaballe, would you mind suggesting the next steps?

dcaballe (Contributor, Author) commented

Thanks for looking into this! Is there an easy way we can measure the performance impact of reverting that commit? (I'm not sure if it's too late to just revert that commit in ToT and create a draft PR to run the CI benchmarks...) That would help us be 100% sure that that commit is solely responsible for the regression. If it's too complicated, forget about it.

The analysis makes sense. If we add extra computation (the reduction + sub ops), the execution time will increase. It's not clear to me whether we are doing anything wrong or it's just the extra computation. Could you do a -mlir-print-ir-after-all and post the IR after vectorization? Perhaps we are distributing these ops across multiple dispatches. That should help with next steps.

For the rationale behind the TOSA change, maybe @rsuderman or @jpienaar can help?

jpienaar (Member) commented

The change was a correctness fix addressing numerical stability, so I doubt we can revert it.

pzread (Contributor) commented Apr 21, 2023

After the integration, the iree-flow-raise-special-ops pass actually recognizes the new (correct) pattern of TOSA ops and transforms it into iree_linalg_ext.softmax. Since all the iree_linalg_ext.softmax ops have the same shape, they turn into a single dispatch:

func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = iree_linalg_ext.softmax dimension(2) ins(%2 : tensor<4x384x384xf32>) outs(%3 : tensor<4x384x384xf32>) -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %4, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

This is then decomposed into:

// -----// IR Dump After DecomposeSoftmax (iree-linalg-ext-decompose-softmax) //----- //
func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = tensor.empty() : tensor<4x384x384xf32>
  %5 = tensor.empty() : tensor<4x384xf32>
  %cst = arith.constant -1.000000e+30 : f32
  %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%2 : tensor<4x384x384xf32>) outs(%6 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.maxf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2, %7 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.subf %in, %in_1 : f32
    %13 = math.exp %12 : f32
    linalg.yield %13 : f32
  } -> tensor<4x384x384xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %9 = linalg.fill ins(%cst_0 : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%8 : tensor<4x384x384xf32>) outs(%9 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.addf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%8, %10 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.divf %in, %in_1 : f32
    linalg.yield %12 : f32
  } -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %11, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

Then it is vectorized into:

----- LinalgStrategyVectorizePass 3437914 -----
func.func @main_dispatch_18() {
  %cst = arith.constant dense<-1.000000e+30> : vector<1x8xf32>
  %cst_0 = arith.constant dense<0.000000e+00> : vector<1x8xf32>
  %c32 = arith.constant 32 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  %c384 = arith.constant 384 : index
  %c4 = arith.constant 4 : index
  %cst_1 = arith.constant 0.000000e+00 : f32
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
  scf.for %arg0 = %2 to %c4 step %3 {
    %4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
    %5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
    scf.for %arg1 = %4 to %c384 step %5 {
      %6 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %7 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %8 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %6) -> (tensor<4x32x384xf32>) {
        %9 = scf.for %arg4 = %c0 to %c32 step %c8 iter_args(%arg5 = %arg3) -> (tensor<4x32x384xf32>) {
          %10 = tensor.empty() : tensor<1x8xf32>
          %11 = vector.transfer_write %cst, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %12 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %11) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.multi_reduction <maxf>, %16, %17 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %19 = vector.transfer_write %18, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %19 : tensor<1x8xf32>
          }
          %13 = vector.transfer_write %cst_0, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %14 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %13) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = arith.subf %16, %19 : vector<1x8x32xf32>
            %22 = math.exp %21 : vector<1x8x32xf32>
            %23 = vector.multi_reduction <add>, %22, %20 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %24 = vector.transfer_write %23, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %24 : tensor<1x8xf32>
          }
          %extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<4x32x384xf32> to tensor<1x8x384xf32>
          %15 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %extracted_slice) -> (tensor<1x8x384xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %14[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = vector.broadcast %20 : vector<1x8xf32> to vector<32x1x8xf32>
            %22 = vector.transpose %21, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %23 = arith.subf %16, %19 : vector<1x8x32xf32>
            %24 = math.exp %23 : vector<1x8x32xf32>
            %25 = arith.divf %24, %22 : vector<1x8x32xf32>
            %26 = vector.transfer_write %25, %arg7[%c0, %c0, %arg6] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x384xf32>
            scf.yield %26 : tensor<1x8x384xf32>
          }
          %inserted_slice = tensor.insert_slice %15 into %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<1x8x384xf32> into tensor<4x32x384xf32>
          scf.yield %inserted_slice : tensor<4x32x384xf32>
        }
        scf.yield %9 : tensor<4x32x384xf32>
      }
      flow.dispatch.tensor.store %8, %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : tensor<4x32x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
    }
  }
  return
}

I also noticed that IREE didn't recognize the previous (incorrect) softmax pattern from TensorFlow, so the compiler didn't generate iree_linalg_ext.softmax and just did normal dispatch formation to fuse those ops: instead of creating one big dispatch for the softmax, it previously generated three dispatches, with the first one also fused with the softmax's producer.

Here are the related dispatches generated from the previous softmax pattern (the first arith.addf is the producer of the softmax op).

%69 = flow.dispatch.region[%65, %66, %67, %68] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_1075, %9 : tensor<4x1x384x384xf32>, tensor<384x1x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.addf %in, %in_7714 : f32
    %2568 = math.exp %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}
%collapsed_1089 = tensor.collapse_shape %69 [[0], [1], [2, 3]] : tensor<4x384x1x384xf32> into tensor<4x384x384xf32>
%70 = tensor.empty() : tensor<4x384xf32>
%71 = linalg.fill ins(%cst_27 : f32) outs(%70 : tensor<4x384xf32>) -> tensor<4x384xf32>
%c1_1090 = arith.constant 1 : index
%c0_1091 = arith.constant 0 : index
%c4_1092 = arith.constant 4 : index
%c1_1093 = arith.constant 1 : index
%72 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1091, %c4_1092, %c1_1093]
%c0_1094 = arith.constant 0 : index
%c384_1095 = arith.constant 384 : index
%c1_1096 = arith.constant 1 : index
%73 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1094, %c384_1095, %c1_1096]
%c0_1097 = arith.constant 0 : index
%c1_1098 = arith.constant 1 : index
%74 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1097, %c1_1090, %c1_1098]
%75 = flow.dispatch.region[%72, %73, %74] -> (tensor<4x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed_1089 : tensor<4x384x384xf32>) outs(%71 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %2567 = arith.addf %in, %out : f32
    linalg.yield %2567 : f32
  } -> tensor<4x384xf32>
  flow.return %2566 : tensor<4x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5
  flow.return %x, %y, %z : index, index, index
}
%c1_1099 = arith.constant 1 : index
%c0_1100 = arith.constant 0 : index
%c4_1101 = arith.constant 4 : index
%c1_1102 = arith.constant 1 : index
%76 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1100, %c4_1101, %c1_1102]
%c0_1103 = arith.constant 0 : index
%c384_1104 = arith.constant 384 : index
%c1_1105 = arith.constant 1 : index
%77 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1103, %c384_1104, %c1_1105]
%c0_1106 = arith.constant 0 : index
%c1_1107 = arith.constant 1 : index
%c1_1108 = arith.constant 1 : index
%78 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1106, %c1_1107, %c1_1108]
%c0_1109 = arith.constant 0 : index
%c384_1110 = arith.constant 384 : index
%c1_1111 = arith.constant 1 : index
%79 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1109, %c384_1110, %c1_1111]
%80 = flow.dispatch.region[%76, %77, %78, %79] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%69, %75 : tensor<4x384x1x384xf32>, tensor<4x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.divf %cst_25, %in_7714 : f32
    %2568 = arith.mulf %in, %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}

> Is there an easy way we can measure the performance impact of reverting that commit?

Since the x86_64 regression isn't significant, I'll test with the RISC-V benchmarks and post the results here.

jpienaar (Member) commented

Interesting, so you are saying the regression is due to the special case now being recognized?

pzread (Contributor) commented Apr 21, 2023

> Is there an easy way we can measure the performance impact of reverting that commit?

Tested on RISC-V: with that TensorFlow commit reverted, the latency is the same as the run before the integration, which confirms that the TensorFlow commit causes the regression.

MaheshRavishankar (Contributor) commented

@dcaballe or @hcindyl for input. Cindy, I am not sure if this is something you have seen/tracked/are concerned about. The regression on RISC-V seems to be because softmax is now fused into a single dispatch, which is actually the current expected state. Some input on how to handle RISC-V would be useful. If such special-case matching is not recommended on RISC-V, we can think about disabling it there. Ideally, though, I would put this into the bucket of RISC-V issues.

pzread (Contributor) commented Apr 21, 2023

> Interesting, so you are saying the regression is due to the special case now being recognized?

Not entirely. I tested with iree-flow-raise-special-ops disabled (so no iree_linalg_ext.softmax). It only improves the binary size a little (36216 -> 31816 bytes, ~13%), because we still need to generate code for the two extra tosa.reduce_max and tosa.sub ops. The performance also only improves a little (~1.1x faster on RISC-V).

I'm not sure how much room is left to improve, but it will need some more investigation if we want to push this further.

dcaballe (Contributor, Author) commented

Great findings! Since the x86 regression seems to be justified by the extra computation, I would suggest that we stop here for now and move this to P2. If you are already running on the FPGA, perhaps it's worth quickly running perf and sharing the archived profiles internally. You can also post the reference commit you are using and we can follow up later. Thanks a lot for the analysis!

hcindyl (Contributor) commented Apr 21, 2023

I would defer this to @dcaballe; however, I am curious to see the perf of other models that contain softmax, such as https://github.com/openxla/iree/blob/ea0b28fe850a06123383cb1cf7671995362c06b1/build_tools/python/e2e_test_framework/models/tflite_models.py#L111-142. Is the hit only in the transformer encoder arch + RISC-V? (From the earlier comment on #12916, it seems x86 also takes a hit.)

allieculp commented

@pzread @dcaballe Any update on this P0?

dcaballe changed the title from "[X86][RISC-V] ~38% slowdown on MobileBert after #12894 integrate" to "[RISC-V] ~38% slowdown on MobileBert after #12894 integrate" on Apr 25, 2023
dcaballe (Contributor, Author) commented

Moving this to P2 as the x86 side is resolved. Only a better understanding of the RISC-V side is pending.

ScottTodd (Member) commented

Nothing actionable here, so closing this (much later).

ScottTodd closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 13, 2024