[CPU] Understand why IREE is 2x slower than TFLite on ViT INT8 on ARM64 #15399
Note that the compile command with the patch ends up using the following cpu_features:
Interesting. This is a 5-thread benchmark across 2 tiers of cores (taskset 1f0).
On 1 thread with taskset 100, IREE latency is 730ms and TFLite's is 196ms. I ran with 1, 4, and 5 threads using tasksets 100, F0, and 1F0. TFLite:
IREE:
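For reference, those are hex CPU affinity masks: 100 selects one core, f0 selects four, and 1f0 selects five across both tiers (matching the 1/4/5-thread runs above). A minimal way to confirm what a mask resolves to on the device:

```bash
# Print which CPUs a given affinity mask actually selects
adb shell taskset 1f0 cat /proc/self/status | grep Cpus_allowed_list
```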
Do you have a profile handy, to see if we can narrow down the issue to a few dispatches without having to run/reproduce the full thing?
What are the steps to run with TFLite?
I'll try and get Tracy up and running in the next couple of days and circle back with a profile.
Setting up Tracy ended up being straightforward. Here is the trace: https://storage.googleapis.com/iree-shared-files/mariewhite/profiles/vit.tracy. Not sure what to look for here, but it looks like Softmax and 3 matmuls take the longest time.
As far as I can tell, the 3 heaviest matmuls in the profile that you shared, @mariecwhite, use µkernels. At this point I don't know whether our implementation of these kernels is inferior to what TFLite uses or whether the time is just spent differently. Ditto for the softmax. Could you share the repro steps with TFLite and/or a profile for TFLite?
Here is a nice article on performance measurement of TFLite models, including ways to profile them: https://www.tensorflow.org/lite/performance/measurement That article links to the prebuilt benchmark tool for Android ARM: https://storage.googleapis.com/tensorflow-nightly-public/prod/tensorflow/release/lite/tools/nightly/latest/android_aarch64_benchmark_model Once downloaded, push it to the device and run:
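The exact invocation is elided in this rendering; a sketch using the standard benchmark_model flags, with the on-device paths and model filename as assumptions:

```bash
# Push the prebuilt benchmark tool and the model to the device
adb push android_aarch64_benchmark_model /data/local/tmp/
adb push vit_int8.tflite /data/local/tmp/    # model filename is an assumption
adb shell chmod +x /data/local/tmp/android_aarch64_benchmark_model

# Run the benchmark, pinned to the same cores used for the IREE runs
adb shell taskset 1f0 /data/local/tmp/android_aarch64_benchmark_model \
  --graph=/data/local/tmp/vit_int8.tflite --num_threads=5 --num_runs=50
```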
You can also quickly get a profile by adding the op-profiling flag.
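The flag itself is elided above; presumably it is the benchmark tool's op-profiling option, e.g.:

```bash
adb shell /data/local/tmp/android_aarch64_benchmark_model \
  --graph=/data/local/tmp/vit_int8.tflite \
  --num_threads=1 --enable_op_profiling=true
```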
Reporting a few facts here, not much analysis yet:
Interestingly, disabling XNNPack yields pretty much the same performance (176 ms vs. 173 ms; min of 50 runs).
These numbers are with just 1 thread.
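A sketch of the two single-thread runs being compared (model path and run count are assumptions; --use_xnnpack is the standard toggle):

```bash
# With the XNNPACK delegate (the default in recent benchmark builds)
adb shell taskset 100 /data/local/tmp/android_aarch64_benchmark_model \
  --graph=/data/local/tmp/vit_int8.tflite --num_threads=1 --num_runs=50

# Without XNNPACK, falling back to the built-in (ruy-backed) kernels
adb shell taskset 100 /data/local/tmp/android_aarch64_benchmark_model \
  --graph=/data/local/tmp/vit_int8.tflite --num_threads=1 --num_runs=50 \
  --use_xnnpack=false
```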
That's interesting! I saw softmax in the profile of other models but it was too far from the top to be worth investing in. We do know that there are some improvements to be made, though: #15210 (comment) Some of these issues seem specific to IREE, perhaps due to changes needed on the GPU side. We can work on that. Feel free to create issues for this! Regarding matmuls, it's very likely that DT+UK alone is just not enough for these large matmuls. We probably need multiple levels of tiling targeting the different cache levels...
If we're looking for a 2x, ignore the inefficient 4% thing for now? @bjacob knows the XNNPACK and ukernel story well. Unless something has changed, I don't think this is doing anything particularly advanced and probably just needs some catch-up. Any advice on keeping the analysis on the most profitable path, Benoit?
Yes. Since we are looking for a 2x, and the XNNPACK and non-XNNPACK profiles show, respectively, 74% and 63% of time spent in FULLY_CONNECTED, let's focus on that. Getting us this far was what the coarse-grained TFLite "op profiling" was good for. Now that we know we're down to profiling inside matrix multiplication implementations, we need another profiler. You could always skip straight to a general sampling profiler (e.g. simpleperf); otherwise, rebuild TFLite (specifically that benchmark tool) with its matmul-level profiler enabled (see the sketch below).
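The specific build option is elided above; presumably this refers to Ruy's built-in instrumentation profiler, which is normally compiled in through a Bazel define along these lines (treat the exact define as an assumption):

```bash
# Rebuild the benchmark tool with Ruy's instrumentation profiler compiled in
bazel build -c opt --config=android_arm64 --define=ruy_profiler=true \
  //tensorflow/lite/tools/benchmark:benchmark_model
adb push bazel-bin/tensorflow/lite/tools/benchmark/benchmark_model /data/local/tmp/
```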
I have this old script to break down TOSA ops/shapes... here is what it says about this model:
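That script isn't included here; as a hypothetical stand-in, a one-liner like this gives a similar op-count breakdown from a textual MLIR dump of the model (the filename is an assumption):

```bash
# Tally TOSA op occurrences in the textual IR
grep -oE 'tosa\.[a-z_0-9]+' vit_tosa.mlir | sort | uniq -c | sort -rn
```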
So these are pretty normal. On the IREE side:
Also, it's interesting that Ruy is 2x faster than IREE even though Ruy only knows how to use the dotprod instructions, not i8mm.
The assembly: vit_iree_ukernels.s.txt. For Tracy, I have to rework my settings; I don't get the full source mapping.
Here's the tracy profile |
@mariecwhite, could you try to build the Android benchmark utility with the option that @bjacob recommended? I don't know what I'm doing wrong, but I hit an error:
I configured my workspace with the following environment:
I ran into the same error. The configure script does warn that NDK version 26 is not supported. Using NDK version 25 and SDK version 33 works, so I suggest downloading those, running ./configure again, and rebuilding.
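The exact environment used is elided above; a sketch of a setup consistent with this advice (install paths and variable names are assumptions):

```bash
# Point TensorFlow's ./configure at NDK r25 and an SDK with API level 33
export ANDROID_NDK_HOME="$HOME/Android/ndk/25.2.9519653"
export ANDROID_SDK_HOME="$HOME/Android/Sdk"
./configure    # answer "y" to Android workspace configuration, accept API level 33
bazel build -c opt --config=android_arm64 \
  //tensorflow/lite/tools/benchmark:benchmark_model
```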
Ah thanks @mariecwhite! Trying with the older versions. |
Thanks again @mariecwhite, that fixed it. @bjacob here is the output of the benchmarking tool when built with that option. With 1 thread:
And with 5 threads:
@mariecwhite, I just realized that we were not using the same CPU features; mine differ from yours. I tried to use your target features and I hit a compiler error:
Any idea what I may be missing? |
We did find that using some of those target features causes that compiler error.
Yes, avoid those. Thanks for gathering the detailed profiles above. The next step is to compare them to Tracy profiles of how IREE is spending its time. I see the Tracy trace and disassembly that you shared 2 days ago, so the next step is to look into those.
@mariecwhite Could you check whether you have this change (see the update in LowerVectorContract.cpp)?
Having said that, bug reports and repros are very welcome :) |
The input TFLite model has the batch matmul, and so does the converted TOSA IR. Anyhow, looking at the TFLite graph I see only one quantize and one dequantize node for the full graph, whereas in the TOSA input such nodes appear throughout the graph.
Ah thanks! |
TFLite does avoid that -- it keeps i32 accumulators as an internal detail of its fused quantized-matmul-and-rescale kernels (with both the ruy and xnnpack backends), in registers. So that's one way in which TFLite is going to be fundamentally more efficient here. However, that won't explain a 2x difference by itself. I think that the bulk of the 2x gap has to be explained by something else, such as quantized_batch_matmul not being picked up by the above-mentioned rewrite and thus staying on a default naive lowering path.
Yep, it seems like a nice, generally useful PR to write: let QuantizedMatmulToMatmul.cpp handle QuantizedBatchMatmul. I'll let your team handle it and return to other things, but let me know if there's any question. There's a comment at the top linking to the paper explaining the math that's going on here and the idea of that rewrite.
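For reference, a sketch of the identity that rewrite relies on (standard zero-point expansion; the notation here is not from the linked comment): with int8 operands $a$, $b$, zero points $z_a$, $z_b$, and reduction size $K$,

$$
\sum_{k=1}^{K} (a_{ik} - z_a)(b_{kj} - z_b)
= \sum_{k} a_{ik} b_{kj} \;-\; z_b \sum_{k} a_{ik} \;-\; z_a \sum_{k} b_{kj} \;+\; K\, z_a z_b .
$$

So the quantized matmul reduces to a plain integer matmul plus row sums of $a$, column sums of $b$, and a constant term; handling QuantizedBatchMatmul would mean applying the same decomposition independently to each batch slice.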
Agree, but that's something that is likely to be more problematic on GPUs if we end up with unfused rescaling there.
Would anything prevent the fusion (of the consumer rescaling into the producer matmul) on GPU? The only thing preventing that fusion here on CPU is the use of a ukernel for the matmul. Even then, we have a plan for eventually getting the fusion -- get the ukernel bitcode to fully inline (that's already the case on x86, just not yet on Arm; a superficial issue), tile outer loops to size 1 so they vanish, and ensure the consumer rescaling goes in the same dispatch -- at that point, there should be nothing preventing late (post-linking) LLVM IR passes from coalescing the stores of i32 accumulators to memory at the end of the ukernel with their reloads at the start of the rescaling.
Not that I can think of, but I don't know how IREE does fusion :). |
@LLITCHEV, when you're done with your softmax improvements or if the softmax improvements take too long, could you look at this? |
@qcolombet I'll follow up once the softmax is done. Thanks! |
A side note:
The i8mm s8s8s32 ukernel doesn't seem to be effective here. I compiled ViT INT8 with only the +i8mm or +dotprod CPU feature and compared the e2e latencies. dotprod is slightly faster.
This does not correlate with the microbenchmarks:
The TOSA MLIR definitely contains these ops. I tried running with:
To reproduce:
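The commands are elided in this rendering; a sketch of the comparison described above, with flag spellings, the entry-point name, and the input signature as assumptions (they vary across IREE versions):

```bash
# Variant 1: i8mm only
iree-compile tosa.mlirbc -o vit_i8mm.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-triple=aarch64-none-linux-android34 \
  --iree-llvmcpu-target-cpu-features=+i8mm \
  --iree-opt-data-tiling --iree-llvmcpu-enable-ukernels=all

# Variant 2: identical, except --iree-llvmcpu-target-cpu-features=+dotprod

# Benchmark each module on one big core
adb shell taskset 100 /data/local/tmp/iree-benchmark-module \
  --module=/data/local/tmp/vit_i8mm.vmfb --function=main \
  --input=1x3x224x224xi8 --device=local-task
```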
A simpleperf profile is attached, with L1/L2 cache counters. There appear to be two issues:
L1 cache misses are mostly due to the batch matmul dispatches.
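A sketch of the kind of simpleperf run that produces such counters (paths, the entry point, and PMU event names are assumptions; available events depend on the kernel and CPU):

```bash
# Sample cache misses while the IREE benchmark runs on one big core
adb shell simpleperf record -e cache-misses -o /data/local/tmp/perf.data \
  taskset 100 /data/local/tmp/iree-benchmark-module \
    --module=/data/local/tmp/vit.vmfb --function=main \
    --input=1x3x224x224xi8 --device=local-task

# Per-symbol breakdown of where the misses land
adb shell simpleperf report -i /data/local/tmp/perf.data --sort symbol
```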
Yes, that really isn't hard to fix; an existing pattern just needs to be extended to cover the batch case.
What Benoit said. We are not data-tiling the quantized batch matmul.
Filed #16599 for that specifically. Assuming you're not already working on this @mariecwhite, we might staff this on our end -- let's sync there.
I think @LLITCHEV is not currently looking into it, but I'm not sure if he made any progress. Hopefully he can give an update.
Ah - well on our side, @pashu123 has started looking into it. Whoever gets there first :-P |
@dcaballe The priority of this was reduced and I'm looking at something else at the moment. |
Looks like it's over to you @pashu123. Thanks! |
Cool, I have added the support in #16615. Meanwhile, I will be adding regression tests. Thanks.
Closing the issue because the performance concern is addressed. According to the benchmark report, IREE got 2x faster, which should make it competitive with TFLite.
What happened?
On Pixel 8 Pro CPU, IREE latency on ViT is 236ms whereas TFLite is 118ms. Let's understand why.
Steps to reproduce your issue
Download https://storage.googleapis.com/iree-model-artifacts/tflite/tflite_models_1698315913/VIT_CLASSIFICATION_INT8_TFLITE_3X224X224XINT8/tosa.mlirbc
Build a version of IREE with #15387 patched.
Compile for Android
Run on device:
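The exact commands are elided in this rendering; a minimal sketch of the flow, with flag spellings, the entry-point name, and the input signature as assumptions:

```bash
# Compile for Android (aarch64) with data tiling + ukernels enabled
iree-compile tosa.mlirbc -o vit.vmfb \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-triple=aarch64-none-linux-android34 \
  --iree-llvmcpu-target-cpu-features=+dotprod,+i8mm \
  --iree-opt-data-tiling --iree-llvmcpu-enable-ukernels=all

# Push and run on the device
# (assumes iree-benchmark-module was cross-compiled and pushed earlier)
adb push vit.vmfb /data/local/tmp/
adb shell taskset 1f0 /data/local/tmp/iree-benchmark-module \
  --module=/data/local/tmp/vit.vmfb --function=main \
  --input=1x3x224x224xi8 --device=local-task
```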
What component(s) does this issue relate to?
Compiler
Version information
d32d8ce
Additional context
No response