[RISC-V] ~38% slowdown on MobileBert after #12894 integrate #12916

Around 38% slowdown and 38% code size increase on MobileBert after #12894.
@hanhanW has anybody started root-causing this? I just marked it P0
People are busy on many things recently. No one has started on this, AFAIK.
@mattwalsh for visibility as a P0 issue
From triage meeting: Assigning to Diego to review when he returns this Thursday. |
It looks like the regression is due to differences in the imported MLIRs, so the root cause might not be the LLVM bump but the TensorFlow bump. Here are the imported MLIRs of TFLite MobileBertSquad_fp32 before the integration (51dbeb8) and after the integration (31bbc95). I used the following build command:

```
iree-compile input.mlir -o output.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-input-type=tosa \
  --iree-llvmcpu-target-triple=riscv64-pc-linux-gnu \
  --iree-llvmcpu-target-cpu=generic-rv64 \
  --iree-llvmcpu-target-abi=lp64d \
  --iree-llvmcpu-target-cpu-features=+m,+a,+f,+d,+zvl512b,+v \
  --riscv-v-fixed-length-vector-lmul-max=8 \
  --iree-vm-emit-polyglot-zip=true \
  --iree-llvmcpu-debug-symbols=false \
  -mlir-disable-threading
```
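(Aside: the per-pass IR dumps quoted later in this thread can be reproduced by adding MLIR's IR-printing options to the same invocation; the pass argument below matches the `DecomposeSoftmax` dump shown further down.)

```shell
# Print the IR after the softmax decomposition pass, or use
# --mlir-print-ir-after-all to dump the IR after every pass.
iree-compile input.mlir -o output.vmfb ... \
  --mlir-print-ir-after=iree-linalg-ext-decompose-softmax
```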
Diffing the imported MLIRs, tensorflow/tensorflow@e0e966b looks suspicious: it changes the way the softmax op is converted to TOSA. Before that commit, the importer converts the softmax op into:

```
op1 = tosa.exp(logits)
op2 = tosa.reduce_sum(op1)
op3 = tosa.reciprocal(op2)
op4 = tosa.mul(op1, op3)
```

After that commit, it generates an extra `reduce_max`/`sub` pair:

```
op1 = tosa.reduce_max(logits)
op2 = tosa.sub(logits, op1)
op3 = tosa.exp(op2)
op4 = tosa.reduce_sum(op3)
op5 = tosa.reciprocal(op4)
op6 = tosa.mul(op3, op5)
```

I haven't fully understood why they made that change on the TensorFlow side; the reason seems to be correctness. I think that explains the size increase and the performance regression. The problem doesn't seem to be with the integration itself; instead, we should look at how to generate better code for that pattern, or generate better TOSA for the softmax op.
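For intuition on why TensorFlow added the `reduce_max`/`sub` pair, here is a small NumPy sketch (my own, not code from the issue; the function names are hypothetical) contrasting the two lowerings:

```python
import numpy as np

def softmax_old(logits):
    # Pre-e0e966b lowering: exp -> reduce_sum -> reciprocal -> mul.
    e = np.exp(logits)  # overflows to inf for large logits
    return e * (1.0 / e.sum(axis=-1, keepdims=True))

def softmax_new(logits):
    # Post-e0e966b lowering: the extra reduce_max/sub keeps every
    # exponent <= 0, so exp can never overflow.
    m = logits.max(axis=-1, keepdims=True)
    e = np.exp(logits - m)
    return e * (1.0 / e.sum(axis=-1, keepdims=True))

x = np.array([1000.0, 1001.0, 1002.0])
print(softmax_old(x))  # [nan nan nan]: exp(1000) overflows
print(softmax_new(x))  # [0.090 0.245 0.665]: stable, same math
```

Both compute the same function mathematically; the stable form just pays for one extra reduction and a subtract per element, which lines up with the size and latency increases observed here.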
I think we can create another issue to improve the performance of the TensorFlow softmax op (and re-prioritize), then close this issue as intended behavior. @dcaballe would you mind suggesting the next step?
Thanks for looking into this! Is there an easy way we can measure the performance impact of reverting that commit? (I'm not sure if it's too late to just revert that commit in ToT and create a draft PR to run the CI benchmarks...) That would help us be 100% sure that that commit is the only one responsible for the regression. If it's too complicated, forget about it. The analysis makes sense: if we add extra computation (reduction + sub ops), the execution time will increase. It's not clear to me whether we are doing anything wrong or it's just the extra computation. For the rationale behind the TOSA change, maybe @rsuderman or @jpienaar can help?
The change was a correctness fix addressing numerical stability, so I doubt we can revert it.
After the integration, a pass now matches the softmax pattern and raises it to `iree_linalg_ext.softmax`:

```
func.func @main_dispatch_18() {
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
%3 = tensor.empty() : tensor<4x384x384xf32>
%4 = iree_linalg_ext.softmax dimension(2) ins(%2 : tensor<4x384x384xf32>) outs(%3 : tensor<4x384x384xf32>) -> tensor<4x384x384xf32>
flow.dispatch.tensor.store %4, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
return
}
```

It is then decomposed into:

```
// -----// IR Dump After DecomposeSoftmax (iree-linalg-ext-decompose-softmax) //----- //
func.func @main_dispatch_18() {
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%2 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x384x384xf32>
%3 = tensor.empty() : tensor<4x384x384xf32>
%4 = tensor.empty() : tensor<4x384x384xf32>
%5 = tensor.empty() : tensor<4x384xf32>
%cst = arith.constant -1.000000e+30 : f32
%6 = linalg.fill ins(%cst : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
%7 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%2 : tensor<4x384x384xf32>) outs(%6 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%12 = arith.maxf %in, %out : f32
linalg.yield %12 : f32
} -> tensor<4x384xf32>
%8 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%2, %7 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
^bb0(%in: f32, %in_1: f32, %out: f32):
%12 = arith.subf %in, %in_1 : f32
%13 = math.exp %12 : f32
linalg.yield %13 : f32
} -> tensor<4x384x384xf32>
%cst_0 = arith.constant 0.000000e+00 : f32
%9 = linalg.fill ins(%cst_0 : f32) outs(%5 : tensor<4x384xf32>) -> tensor<4x384xf32>
%10 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%8 : tensor<4x384x384xf32>) outs(%9 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%12 = arith.addf %in, %out : f32
linalg.yield %12 : f32
} -> tensor<4x384xf32>
%11 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>, affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} ins(%8, %10 : tensor<4x384x384xf32>, tensor<4x384xf32>) outs(%4 : tensor<4x384x384xf32>) {
^bb0(%in: f32, %in_1: f32, %out: f32):
%12 = arith.divf %in, %in_1 : f32
linalg.yield %12 : f32
} -> tensor<4x384x384xf32>
flow.dispatch.tensor.store %11, %1, offsets = [0, 0, 0], sizes = [4, 384, 384], strides = [1, 1, 1] : tensor<4x384x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
return
}
```

Then it is vectorized into:

```
----- LinalgStrategyVectorizePass 3437914 -----
func.func @main_dispatch_18() {
%cst = arith.constant dense<-1.000000e+30> : vector<1x8xf32>
%cst_0 = arith.constant dense<0.000000e+00> : vector<1x8xf32>
%c32 = arith.constant 32 : index
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%c0 = arith.constant 0 : index
%c384 = arith.constant 384 : index
%c4 = arith.constant 4 : index
%cst_1 = arith.constant 0.000000e+00 : f32
%c5308416 = arith.constant 5308416 : index
%c1769472 = arith.constant 1769472 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c5308416) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>>
%1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c1769472) : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
%workgroup_id_x = hal.interface.workgroup.id[0] : index
%workgroup_count_x = hal.interface.workgroup.count[0] : index
%workgroup_id_y = hal.interface.workgroup.id[1] : index
%workgroup_count_y = hal.interface.workgroup.count[1] : index
%2 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_id_y]
%3 = affine.apply affine_map<()[s0] -> (s0 * 4)>()[%workgroup_count_y]
scf.for %arg0 = %2 to %c4 step %3 {
%4 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_id_x]
%5 = affine.apply affine_map<()[s0] -> (s0 * 32)>()[%workgroup_count_x]
scf.for %arg1 = %4 to %c384 step %5 {
%6 = flow.dispatch.tensor.load %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
%7 = flow.dispatch.tensor.load %0, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<4x384x384xf32>> -> tensor<4x32x384xf32>
%8 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %6) -> (tensor<4x32x384xf32>) {
%9 = scf.for %arg4 = %c0 to %c32 step %c8 iter_args(%arg5 = %arg3) -> (tensor<4x32x384xf32>) {
%10 = tensor.empty() : tensor<1x8xf32>
%11 = vector.transfer_write %cst, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
%12 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %11) -> (tensor<1x8xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.multi_reduction <maxf>, %16, %17 [2] : vector<1x8x32xf32> to vector<1x8xf32>
%19 = vector.transfer_write %18, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
scf.yield %19 : tensor<1x8xf32>
}
%13 = vector.transfer_write %cst_0, %10[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
%14 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %13) -> (tensor<1x8xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
%19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%20 = vector.transfer_read %arg7[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%21 = arith.subf %16, %19 : vector<1x8x32xf32>
%22 = math.exp %21 : vector<1x8x32xf32>
%23 = vector.multi_reduction <add>, %22, %20 [2] : vector<1x8x32xf32> to vector<1x8xf32>
%24 = vector.transfer_write %23, %arg7[%c0, %c0] {in_bounds = [true, true]} : vector<1x8xf32>, tensor<1x8xf32>
scf.yield %24 : tensor<1x8xf32>
}
%extracted_slice = tensor.extract_slice %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<4x32x384xf32> to tensor<1x8x384xf32>
%15 = scf.for %arg6 = %c0 to %c384 step %c32 iter_args(%arg7 = %extracted_slice) -> (tensor<1x8x384xf32>) {
%16 = vector.transfer_read %7[%arg2, %arg4, %arg6], %cst_1 {in_bounds = [true, true, true]} : tensor<4x32x384xf32>, vector<1x8x32xf32>
%17 = vector.transfer_read %12[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%18 = vector.broadcast %17 : vector<1x8xf32> to vector<32x1x8xf32>
%19 = vector.transpose %18, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%20 = vector.transfer_read %14[%c0, %c0], %cst_1 {in_bounds = [true, true]} : tensor<1x8xf32>, vector<1x8xf32>
%21 = vector.broadcast %20 : vector<1x8xf32> to vector<32x1x8xf32>
%22 = vector.transpose %21, [1, 2, 0] : vector<32x1x8xf32> to vector<1x8x32xf32>
%23 = arith.subf %16, %19 : vector<1x8x32xf32>
%24 = math.exp %23 : vector<1x8x32xf32>
%25 = arith.divf %24, %22 : vector<1x8x32xf32>
%26 = vector.transfer_write %25, %arg7[%c0, %c0, %arg6] {in_bounds = [true, true, true]} : vector<1x8x32xf32>, tensor<1x8x384xf32>
scf.yield %26 : tensor<1x8x384xf32>
}
%inserted_slice = tensor.insert_slice %15 into %arg5[%arg2, %arg4, 0] [1, 8, 384] [1, 1, 1] : tensor<1x8x384xf32> into tensor<4x32x384xf32>
scf.yield %inserted_slice : tensor<4x32x384xf32>
}
scf.yield %9 : tensor<4x32x384xf32>
}
flow.dispatch.tensor.store %8, %1, offsets = [%arg0, %arg1, 0], sizes = [4, 32, 384], strides = [1, 1, 1] : tensor<4x32x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<4x384x384xf32>>
}
}
return
}
```

I also noticed that IREE didn't recognize the previous, incorrect softmax pattern from TensorFlow, so the compiler didn't generate `iree_linalg_ext.softmax` for it. Here are the related dispatches generated from the previous softmax pattern:

```
%69 = flow.dispatch.region[%65, %66, %67, %68] -> (tensor<4x384x1x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d2, d3, d1)>, affine_map<(d0, d1, d2, d3) -> (d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%expanded_1075, %9 : tensor<4x1x384x384xf32>, tensor<384x1x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
^bb0(%in: f32, %in_7714: f32, %out: f32):
%2567 = arith.addf %in, %in_7714 : f32
%2568 = math.exp %2567 : f32
linalg.yield %2568 : f32
} -> tensor<4x384x1x384xf32>
flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
flow.return %x, %y, %z : index, index, index
}
%collapsed_1089 = tensor.collapse_shape %69 [[0], [1], [2, 3]] : tensor<4x384x1x384xf32> into tensor<4x384x384xf32>
%70 = tensor.empty() : tensor<4x384xf32>
%71 = linalg.fill ins(%cst_27 : f32) outs(%70 : tensor<4x384xf32>) -> tensor<4x384xf32>
%c1_1090 = arith.constant 1 : index
%c0_1091 = arith.constant 0 : index
%c4_1092 = arith.constant 4 : index
%c1_1093 = arith.constant 1 : index
%72 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1091, %c4_1092, %c1_1093]
%c0_1094 = arith.constant 0 : index
%c384_1095 = arith.constant 384 : index
%c1_1096 = arith.constant 1 : index
%73 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1094, %c384_1095, %c1_1096]
%c0_1097 = arith.constant 0 : index
%c1_1098 = arith.constant 1 : index
%74 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1097, %c1_1090, %c1_1098]
%75 = flow.dispatch.region[%72, %73, %74] -> (tensor<4x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"]} ins(%collapsed_1089 : tensor<4x384x384xf32>) outs(%71 : tensor<4x384xf32>) {
^bb0(%in: f32, %out: f32):
%2567 = arith.addf %in, %out : f32
linalg.yield %2567 : f32
} -> tensor<4x384xf32>
flow.return %2566 : tensor<4x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5
flow.return %x, %y, %z : index, index, index
}
%c1_1099 = arith.constant 1 : index
%c0_1100 = arith.constant 0 : index
%c4_1101 = arith.constant 4 : index
%c1_1102 = arith.constant 1 : index
%76 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1100, %c4_1101, %c1_1102]
%c0_1103 = arith.constant 0 : index
%c384_1104 = arith.constant 384 : index
%c1_1105 = arith.constant 1 : index
%77 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1103, %c384_1104, %c1_1105]
%c0_1106 = arith.constant 0 : index
%c1_1107 = arith.constant 1 : index
%c1_1108 = arith.constant 1 : index
%78 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1106, %c1_1107, %c1_1108]
%c0_1109 = arith.constant 0 : index
%c384_1110 = arith.constant 384 : index
%c1_1111 = arith.constant 1 : index
%79 = affine.apply affine_map<()[s0, s1, s2] -> ((s1 - s0) ceildiv s2)>()[%c0_1109, %c384_1110, %c1_1111]
%80 = flow.dispatch.region[%76, %77, %78, %79] -> (tensor<4x384x1x384xf32>) {
%2566 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>, affine_map<(d0, d1, d2, d3) -> (d0, d1)>, affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} ins(%69, %75 : tensor<4x384x1x384xf32>, tensor<4x384xf32>) outs(%64 : tensor<4x384x1x384xf32>) {
^bb0(%in: f32, %in_7714: f32, %out: f32):
%2567 = arith.divf %cst_25, %in_7714 : f32
%2568 = arith.mulf %in, %2567 : f32
linalg.yield %2568 : f32
} -> tensor<4x384x1x384xf32>
flow.return %2566 : tensor<4x384x1x384xf32>
} count(%arg3: index, %arg4: index, %arg5: index, %arg6: index) -> (index, index, index) {
%x, %y, %z = flow.dispatch.workgroup_count_from_dag_root %arg3, %arg4, %arg5, %arg6
flow.return %x, %y, %z : index, index, index
}
```
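To make the cost difference concrete, here is a rough NumPy model of the two schedules (my own sketch, not the generated code): the old pattern computes `exp` once but materializes the intermediate tensor between dispatches, while the new fused dispatch reads the input three times and, as the vectorized IR above shows, recomputes `sub`/`exp` in the final normalization loop.

```python
import numpy as np

def softmax_dispatches_old(x):
    # Old pattern, three separate dispatches: exp is computed once,
    # but the exp tensor is written to and read back from memory.
    e = np.exp(x)                                  # dispatch 1: add+exp (fused)
    s = e.sum(axis=-1, keepdims=True)              # dispatch 2: reduce_sum
    return e * (1.0 / s)                           # dispatch 3: reciprocal+mul

def softmax_fused_new(x):
    # New decomposed softmax, single dispatch: three loops over the
    # input, with sub+exp recomputed in the last loop.
    m = x.max(axis=-1, keepdims=True)              # loop 1: reduce_max (extra)
    s = np.exp(x - m).sum(axis=-1, keepdims=True)  # loop 2: sub+exp+reduce_sum
    return np.exp(x - m) / s                       # loop 3: sub+exp again, div

x = np.random.rand(4, 384, 384).astype(np.float32)
assert np.allclose(softmax_dispatches_old(x), softmax_fused_new(x), atol=1e-6)
```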
Since the x86_64 regression isn't significant, I'll test with the RISC-V benchmarks and post the results here.
Interesting, so you are saying the regression is due to the special case now being recognized?
Tested on RISC-V: with that TensorFlow commit reverted, the latency is the same as in the run before the integration. That confirms the TensorFlow commit causes the regression.
@dcaballe or @hcindyl for input. Cindy, I am not sure if this is something you have seen/tracked/are concerned about. The regression on RISC-V seems to be because softmax is now fused into a single dispatch, which is actually the current expected state. Some input on how to handle RISC-V would be useful: if such special-case matching is not recommended on RISC-V, we can think of disabling it there. Ideally, though, I would put this into the bucket of RISC-V issues.
Not totally. Not sure how much room is left to improve, but it will need some more investigation if we want to push this further.
Great findings! Since the x86 regression seems to be justified by the extra computation, I would suggest that we stop here for now and move this to P2. If you are already running on the FPGA, perhaps it's worth quickly running perf and sharing the archived profiles internally. You can also post the reference commit you are using and we can follow up later. Thanks a lot for the analysis!
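For the perf run suggested above, a minimal sketch; the exact `iree-benchmark-module` flags and the input shape are assumptions on my part and may differ by version:

```shell
# Record a profile of the benchmark run, then inspect the hot dispatches.
perf record -g ./iree-benchmark-module \
  --module=output.vmfb --function=main --input=1x384xi32=0  # hypothetical inputs
perf report
```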
I would defer this to @dcaballe; however, I am curious to see the perf of other models that contain softmax, such as https://github.com/openxla/iree/blob/ea0b28fe850a06123383cb1cf7671995362c06b1/build_tools/python/e2e_test_framework/models/tflite_models.py#L111-142. Is the hit only in the transformer encoder arch + RISC-V (from the comment above, it seems x86 also takes a hit)?
Moving this to P2 as the x86 side is resolved. Only a better understanding of the RISC-V side is pending. |
Nothing actionable here, so closing.