[RISC-V] ~38% slowdown on MobileBert after #12894 integrate #12916

Closed
dcaballe opened this issue Apr 4, 2023 · 22 comments
Labels
  • codegen/llvm: LLVM code generation compiler backend
  • codegen/riscv: Code generation targeting RISC-V architecture
  • performance ⚡: Performance/optimization related work across the compiler and runtime

Comments

dcaballe (Contributor) commented Apr 4, 2023

Around 38% slowdown and 38% code size increase on MobileBert after #12894.

dcaballe added the codegen/llvm, awaiting-triage, and codegen/riscv labels on Apr 4, 2023
hanhanW (Contributor) commented Apr 10, 2023

Looks like it also regresses MobileBert on x86 in single-threaded mode:

[benchmark screenshot]

julianwa (Member) commented Apr 10, 2023

@hanhanW has anybody started root-causing this? I just marked it P0

hanhanW (Contributor) commented Apr 10, 2023

People have been busy with many things recently. No one has started on this, AFAIK.

allieculp commented

@mattwalsh for visibility as a P0 issue

allieculp commented

From triage meeting: Assigning to Diego to review when he returns this Thursday.

dcaballe added the performance label on Apr 13, 2023
dcaballe changed the title from "[RISC-V] ~38% slowdown on MobileBert after #12894 integrate" to "[X86][RISC-V] ~38% slowdown on MobileBert after #12894 integrate" on Apr 13, 2023
aaron-schneider commented

@julianwa and @dcaballe I saw a draft of a more complete regression analysis (with a task list of all of the issues filed). Is that ready to promote here?

pzread self-assigned this on Apr 14, 2023
pzread (Contributor) commented Apr 18, 2023

It looks like the regression is due to differences in the imported MLIRs, so the root cause might not be the LLVM bump but rather the TensorFlow bump.

Here are the imported MLIRs of TFLite MobileBertSquad_fp32 before and after the integration:

Before the integration (51dbeb8):
https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4599127855/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

After the integration (31bbc95):
https://storage.googleapis.com/iree-github-actions-postsubmit-artifacts/4600147158/1/e2e-test-artifacts/iree_MobileBertSquad_fp32_af95be67ed750381753ca4fe4341853e7e2884d4c4bd1dd9a17d8ceab3ee8882.mlir

I used the iree-compile built at the integration commit to compile both MLIRs and got a ~38% difference in total dispatch size:

  • With MLIR from 51dbeb8 (the parent commit): 26088 bytes
  • With MLIR from 31bbc95 (the integration commit): 36216 bytes

Build command:

iree-compile input.mlir -o output.vmfb --iree-hal-target-backends=llvm-cpu --iree-input-type=tosa --iree-llvmcpu-target-triple=riscv64-pc-linux-gnu --iree-llvmcpu-target-cpu=generic-rv64 --iree-llvmcpu-target-abi=lp64d --iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+zvl512b,+v --riscv-v-fixed-length-vector-lmul-max=8 --iree-vm-emit-polyglot-zip=true --iree-llvmcpu-debug-symbols=false -mlir-disable-threading
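
Since the module is built with --iree-vm-emit-polyglot-zip=true, the resulting .vmfb should also be readable as a plain zip archive, so the embedded dispatch sizes can be compared with a small script along these lines (a rough sketch, assuming the polyglot-zip entries cover the embedded executables; not part of the IREE tooling):

import sys
import zipfile

def embedded_sizes(vmfb_path):
    # With --iree-vm-emit-polyglot-zip=true the .vmfb opens as a zip archive;
    # its entries are the module's rodata segments, which for the llvm-cpu
    # backend are dominated by the embedded dispatch executables.
    with zipfile.ZipFile(vmfb_path) as z:
        return {info.filename: info.file_size for info in z.infolist()}

before, after = (embedded_sizes(p) for p in sys.argv[1:3])
total_before, total_after = sum(before.values()), sum(after.values())
print(f"before: {total_before} bytes, after: {total_after} bytes, "
      f"delta: {100.0 * (total_after - total_before) / total_before:+.1f}%")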

pzread (Contributor) commented Apr 18, 2023

Diffing the imported MLIRs, tensorflow/tensorflow@e0e966b looks suspicious: it changes the way the softmax op is converted into TOSA.

I built iree-import-tflite with that commit reverted in TensorFlow, imported the model into MLIR, and compiled it. The total dispatch size returned to 26056 bytes, which supports the guess.

Before that commit, the importer converts the softmax op into:

op1 = tosa.exp(logits)
op2 = tosa.reduce_sum(op1)
op3 = tosa.reciprocal(op2)
op4 = tosa.mul(op1, op3)

After that commit, it generates an extra tosa.reduce_max and tosa.sub:

op1 = tosa.reduce_max(logits)
op2 = tosa.sub(logits, op1)
op3 = tosa.exp(op2)
op4 = tosa.reduce_sum(op3)
op5 = tosa.reciprocal(op4)
op6 = tosa.mul(op3, op5)

I haven't fully understood why they made that change on the TensorFlow side; the reason seems to be correctness.
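
For context (an illustration on my part, not from the TensorFlow change itself): subtracting the row-wise max before the exponential is the standard numerically stable softmax formulation, so the extra reduce_max/sub buys overflow-free results for large logits at the cost of extra work. A minimal NumPy sketch of the difference:

import numpy as np

def softmax_naive(logits):
    e = np.exp(logits)  # overflows to inf for large logits
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(logits):
    # Shift so the largest exponent is 0; mathematically the same result.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1000.0, 1001.0, 1002.0]], dtype=np.float32)
print(softmax_naive(x))   # [[nan nan nan]] -- exp(1000) overflows float32
print(softmax_stable(x))  # [[0.09003057 0.24472848 0.66524094]]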

I think that explains the size increase and performance regression. The problem doesn't seem to be in the integration itself; instead, we want to see how to generate better code for that pattern, or generate better TOSA for the softmax op.

pzread (Contributor) commented Apr 18, 2023

I think we can create another issue to improve the performance of the TensorFlow softmax op (and re-prioritize it), then close this issue as intended behavior. @dcaballe, would you mind suggesting the next steps?

dcaballe (Contributor, Author) commented

Thanks for looking into this! Is there an easy way we can measure the performance impact of reverting that commit? (I'm not sure if it's too late to just revert that commit in ToT and create a draft PR to run the CI benchmarks...) That would help us be 100% sure that that commit is solely responsible for the regression. If it's too complicated, forget about it.

The analysis makes sense. If we add extra computation (the reduction + sub ops), the execution time will increase. It's not clear to me whether we are doing anything wrong or it's just the extra computation. Could you do a -mlir-print-ir-after-all and post the IR after vectorization? Perhaps we are distributing these ops across multiple dispatches. That should help with next steps.

For the rationale behind the TOSA change, maybe @rsuderman or @jpienaar can help?

jpienaar (Member) commented

The change was a correctness fix addressing numerical stability, so I doubt we can revert it.

pzread (Contributor) commented Apr 21, 2023

After the integration, the iree-flow-raise-special-ops pass actually recognizes the new (correct) pattern of TOSA ops and transforms it into iree_linalg_ext.softmax. Since all the iree_linalg_ext.softmax ops have the same shape, they turn into a single dispatch:

func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = iree_linalg_ext.softmax dimension(2) ins(%2 : tensor<4x384x384xf32>) outs(%3 : tensor<4x384x384xf32>) -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %4, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

This is then decomposed into:

// -----// IR Dump After DecomposeSoftmax (iree-linalg-ext-decompose-softmax) //----- //
func.func @main_dispatch_18() {
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
  %3 = tensor.empty() : tensor<4x384x384xf32>
  %4 = tensor.empty() : tensor<4x384x384xf32>
  %5 = tensor.empty() : tensor<4x384xf32>
  %cst = arith.constant -1.000000e+30 : f32
  %6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%2 : tensor<4x384x384xf32>) outs(%6 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.maxf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2, %7 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.subf %in, %in_1 : f32
    %13 = math.exp %12 : f32
    linalg.yield %13 : f32
  } -> tensor<4x384x384xf32>
  %cst_0 = arith.constant 0.000000e+00 : f32
  %9 = linalg.fill ins(%cst_0 : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
  %10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%8 : tensor<4x384x384xf32>) outs(%9 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %12 = arith.addf %in, %out : f32
    linalg.yield %12 : f32
  } -> tensor<4x384xf32>
  %11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%8, %10 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %12 = arith.divf %in, %in_1 : f32
    linalg.yield %12 : f32
  } -> tensor<4x384x384xf32>
  flow.dispatch.tensor.store %11, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  return
}

Then it is vectorized into:

----- LinalgStrategyVectorizePass 3437914 -----
func.func @main_dispatch_18() {
  %cst = arith.constant dense<-1.000000e+30> : vector<1x8xf32>
  %cst_0 = arith.constant dense<0.000000e+00> : vector<1x8xf32>
  %c32 = arith.constant 32 : index
  %c1 = arith.constant 1 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  %c384 = arith.constant 384 : index
  %c4 = arith.constant 4 : index
  %cst_1 = arith.constant 0.000000e+00 : f32
  %c5308416 = arith.constant 5308416 : index
  %c1769472 = arith.constant 1769472 : index
  %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
  %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
  %workgroup_id_x = hal.interface.workgroup.id[0] : index
  %workgroup_count_x = hal.interface.workgroup.count[0] : index
  %workgroup_id_y = hal.interface.workgroup.id[1] : index
  %workgroup_count_y = hal.interface.workgroup.count[1] : index
  %2 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
  %3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
  scf.for %arg0 = %2 to %c4 step %3 {
    %4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
    %5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
    scf.for %arg1 = %4 to %c384 step %5 {
      %6 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %7 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
      %8 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %6) -> (tensor<4x32x384xf32>) {
        %9 = scf.for %arg4 = %c0 to %c32 step %c8 iter_args(%arg5 = %arg3) -> (tensor<4x32x384xf32>) {
          %10 = tensor.empty() : tensor<1x8xf32>
          %11 = vector.transfer_write %cst, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %12 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %11) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.multi_reduction <maxf>, %16, %17 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %19 = vector.transfer_write %18, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %19 : tensor<1x8xf32>
          }
          %13 = vector.transfer_write %cst_0, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
          %14 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %13) -> (tensor<1x8xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = arith.subf %16, %19 : vector<1x8x32xf32>
            %22 = math.exp %21 : vector<1x8x32xf32>
            %23 = vector.multi_reduction <add>, %22, %20 [2] : vector<1x8x32xf32> to vector<1x8xf32>
            %24 = vector.transfer_write %23, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
            scf.yield %24 : tensor<1x8xf32>
          }
          %extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<4x32x384xf32> to tensor<1x8x384xf32>
          %15 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %extracted_slice) -> (tensor<1x8x384xf32>) {
            %16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
            %17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
            %19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %20 = vector.transfer_read %14[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
            %21 = vector.broadcast %20 : vector<1x8xf32> to vector<32x1x8xf32>
            %22 = vector.transpose %21, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
            %23 = arith.subf %16, %19 : vector<1x8x32xf32>
            %24 = math.exp %23 : vector<1x8x32xf32>
            %25 = arith.divf %24, %22 : vector<1x8x32xf32>
            %26 = vector.transfer_write %25, %arg7[%c0, %c0, %arg6] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x384xf32>
            scf.yield %26 : tensor<1x8x384xf32>
          }
          %inserted_slice = tensor.insert_slice %15 into %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<1x8x384xf32> into tensor<4x32x384xf32>
          scf.yield %inserted_slice : tensor<4x32x384xf32>
        }
        scf.yield %9 : tensor<4x32x384xf32>
      }
      flow.dispatch.tensor.store %8, %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : tensor<4x32x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
    }
  }
  return
}

I also noticed that IREE didn't recognize the previous (incorrect) softmax pattern from TensorFlow, so the compiler didn't generate iree_linalg_ext.softmax and just did normal dispatch formation to fuse those ops: instead of creating one big dispatch for the softmax, it previously generated three dispatches, with the first one also fused with the softmax's producer.

Here are the related dispatches generated from the previous softmax pattern (the first arith.addf is the producer of the softmax op).

%69 = flow.dispatch.region[%65, %66, %67, %68] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_1075, %9 : tensor<4x1x384x384xf32>, tensor<384x1x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.addf %in, %in_7714 : f32
    %2568 = math.exp %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}
%collapsed_1089 = tensor.collapse_shape %69 [[0], [1], [2, 3]] : tensor<4x384x1x384xf32> into tensor<4x384x384xf32>
%70 = tensor.empty() : tensor<4x384xf32>
%71 = linalg.fill ins(%cst_27 : f32) outs(%70 : tensor<4x384xf32>) -> tensor<4x384xf32>
%c1_1090 = arith.constant 1 : index
%c0_1091 = arith.constant 0 : index
%c4_1092 = arith.constant 4 : index
%c1_1093 = arith.constant 1 : index
%72 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1091, %c4_1092, %c1_1093]
%c0_1094 = arith.constant 0 : index
%c384_1095 = arith.constant 384 : index
%c1_1096 = arith.constant 1 : index
%73 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1094, %c384_1095, %c1_1096]
%c0_1097 = arith.constant 0 : index
%c1_1098 = arith.constant 1 : index
%74 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1097, %c1_1090, %c1_1098]
%75 = flow.dispatch.region[%72, %73, %74] -> (tensor<4x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed_1089 : tensor<4x384x384xf32>) outs(%71 : tensor<4x384xf32>) {
  ^bb0(%in: f32, %out: f32):
    %2567 = arith.addf %in, %out : f32
    linalg.yield %2567 : f32
  } -> tensor<4x384xf32>
  flow.return %2566 : tensor<4x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5
  flow.return %x, %y, %z : index, index, index
}
%c1_1099 = arith.constant 1 : index
%c0_1100 = arith.constant 0 : index
%c4_1101 = arith.constant 4 : index
%c1_1102 = arith.constant 1 : index
%76 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1100, %c4_1101, %c1_1102]
%c0_1103 = arith.constant 0 : index
%c384_1104 = arith.constant 384 : index
%c1_1105 = arith.constant 1 : index
%77 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1103, %c384_1104, %c1_1105]
%c0_1106 = arith.constant 0 : index
%c1_1107 = arith.constant 1 : index
%c1_1108 = arith.constant 1 : index
%78 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1106, %c1_1107, %c1_1108]
%c0_1109 = arith.constant 0 : index
%c384_1110 = arith.constant 384 : index
%c1_1111 = arith.constant 1 : index
%79 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1109, %c384_1110, %c1_1111]
%80 = flow.dispatch.region[%76, %77, %78, %79] -> (tensor<4x384x1x384xf32>) {
  %2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%69, %75 : tensor<4x384x1x384xf32>, tensor<4x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
  ^bb0(%in: f32, %in_7714: f32, %out: f32):
    %2567 = arith.divf %cst_25, %in_7714 : f32
    %2568 = arith.mulf %in, %2567 : f32
    linalg.yield %2568 : f32
  } -> tensor<4x384x1x384xf32>
  flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
  %x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
  flow.return %x, %y, %z : index, index, index
}

> Is there an easy way we can measure the performance impact of reverting that commit?

Since the x86_64 regression isn't significant, I'll test with the RISC-V benchmarks and post the results here.

jpienaar (Member) commented

Interesting, so you are saying the regression is due to the special case now being recognized?

pzread (Contributor) commented Apr 21, 2023

> Is there an easy way we can measure the performance impact of reverting that commit?

Tested on RISC-V: with that TensorFlow commit reverted, the latency is the same as the run before the integration, which confirms that the TensorFlow commit causes the regression.

MaheshRavishankar (Contributor) commented

@dcaballe or @hcindyl for input. Cindy, I am not sure if this is something you have seen/tracked/are concerned about. The regression on RISC-V seems to be because softmax is now fused into a single dispatch, which is actually the current expected state. Some input on how to handle RISC-V would be useful. If such special-case matching is not recommended on RISC-V, we can think about disabling it there. Ideally, though, I would put this into the bucket of RISC-V issues.

pzread (Contributor) commented Apr 21, 2023

> Interesting, so you are saying the regression is due to the special case now being recognized?

Not entirely. I tested with iree-flow-raise-special-ops disabled (so no iree_linalg_ext.softmax). It only improves the binary size a little (36216 -> 31816 bytes, ~13%), because we still need to generate code for the two extra tosa.reduce_max and tosa.sub ops. The performance also only improves a little (~1.1x faster on RISC-V).

I'm not sure how much room is left to improve, but it will need some more investigation if we want to push this further.

dcaballe (Contributor, Author) commented

Great findings! Since the x86 regression seems to be justified by the extra computation, I would suggest that we stop here for now and move this to P2. If you are already running on the FPGA, perhaps it's worth quickly running perf and sharing the archived profiles internally. You can also post the reference commit you are using and we can follow up later. Thanks a lot for the analysis!

hcindyl (Contributor) commented Apr 21, 2023

I would defer this to @dcaballe; however, I am curious to see the perf of other models that contain softmax, such as https://github.com/openxla/iree/blob/ea0b28fe850a06123383cb1cf7671995362c06b1/build_tools/python/e2e_test_framework/models/tflite_models.py#L111-142. Is the hit only in the transformer encoder arch + RISC-V? (From the earlier comment on #12916, it seems x86 also takes a hit.)

allieculp commented

@pzread @dcaballe Any update on this P0?

dcaballe changed the title from "[X86][RISC-V] ~38% slowdown on MobileBert after #12894 integrate" to "[RISC-V] ~38% slowdown on MobileBert after #12894 integrate" on Apr 25, 2023
dcaballe (Contributor, Author) commented

Moving this to P2 as the x86 side is resolved. Only a better understanding of the RISC-V side is pending.

ScottTodd (Member) commented

Nothing actionable here, so closing this (much later).

ScottTodd closed this as not planned (won't fix, can't repro, duplicate, stale) on Aug 13, 2024