Simplify quantized multiplier #227
Sorry that I didn't see this earlier! Thanks for the PR. I need to ponder this and get back to you. @jdduke @talumbau @silvasean any opinion?
I've been discussing with folks at ARM for a while about the rescale operation and the single vs. double rounding. Some additional context is the issue here: I confirmed that this is actually the rescale that was in the x86 kernels before I changed it over to be the same as the ARM kernels. Generally, I am in favor of all of the following having the same rescale operation:
AFAICT it is superior from an accuracy perspective to use single rather than double rounding. If this is correct, then I would be in favor of all of the above going to single rounding. This change would be significant, so most of my efforts around this have been related to testing issues on the TF Lite side.
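To make the accuracy point concrete, here is a minimal Python sketch comparing the two schemes on one input. It is a simplified model, not ruy's code: the function names are hypothetical, both stages use round-half-up, and saturation and the symmetric handling of negative ties in the old reference are left out.

```python
def rounding_right_shift(x: int, shift: int) -> int:
    """Round-half-up arithmetic right shift; shift == 0 returns x unchanged."""
    if shift == 0:
        return x
    return (x + (1 << (shift - 1))) >> shift


def double_rounding(x: int, multiplier: int, k: int) -> int:
    """Round once after the Q31 multiply, then round again after shifting by k."""
    high = rounding_right_shift(x * multiplier, 31)  # first rounding
    return rounding_right_shift(high, k)             # second rounding


def single_rounding(x: int, multiplier: int, k: int) -> int:
    """Round exactly once, over the total shift of 31 + k bits."""
    return rounding_right_shift(x * multiplier, 31 + k)


# Hypothetical input: accumulator 3, multiplier 0.5 in Q31, extra right shift of 2.
x, multiplier, k = 3, 1 << 30, 2
exact = x * multiplier / 2 ** (31 + k)      # 0.375
double = double_rounding(x, multiplier, k)  # 1.5 rounds to 2, then 0.5 rounds to 1
single = single_rounding(x, multiplier, k)  # 0.375 rounds to 0

print(exact, double, single)
```

In this case the first rounding pushes 1.5 up to 2, which creates a tie for the second rounding, so the double-rounded result lands 0.625 away from the exact value while the single-rounded result stays within 0.5.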
Thanks again for filing this PR. I have a few comments/questions.
Finally, regarding your above question:
Not really; ruy's test framework isn't exercising that very well. TFLite's tests (which exercise ruy as a back-end) add some test coverage. Once we have narrowed down exactly what code change we want to make, we can discuss what kind of ad-hoc testing might be needed for a change like this.
Hi @bjacob, Thanks for your comments on this PR. In your question (2) based on comment (1) the description of replacing double rounding by single rounding is referring to the base C implementation in apply_multiplier.cc where the two rounding operations (SaturatingRoundingDoublingHighMul and RoundingRightShift) are replaced by a single rounding operation (addition of 'round' to result and then shift right by total_shift). We think the benefits of this are twofold:
So, regarding your point (3), we are switching the base C implementation (in apply_multiplier.cc) from two rounding operations to one, and I think this part is similar to the conversation in tensorflow/tensorflow#25087. What we show in this patch is that it is possible to modify the optimized Arm code to match the single-rounding base C implementation proposed in apply_multiplier.cc. Although, as you point out, there are still two roundings (meaning 1) occurring in this optimized code, they combine to give the same effect as a single rounding (as in the C code). This is because the first rounding (meaning 1) is a truncating right shift, and truncating right shifts can be combined: ((x>>a)>>b) = x>>(a+b). So the single rounding (x+(1<<(30+k)))>>(31+k), where total_shift=31+k, can be expressed as ((x>>31) + (1<<(k-1)))>>k without affecting the result when k>=1. Regarding point (4), the reason for ensuring the right shift is greater than or equal to 1 is that the equation above, (x+(1<<(30+k)))>>(31+k) = ((x>>31) + (1<<(k-1)))>>k, is only true for k>=1. When k=0 the left side would round to nearest but the right side would just truncate (the rounding right shift instruction takes 1<<(k-1) to be zero when k is 0; it does not add any rounding offset). We don't think forcing the right shift k>=1 is an issue in practice, but as you say it is a difference from the accumulator bound for shift>=0. I hope the above helps to clarify, but please let us know if you have any more questions or comments. Thanks.
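The identity above can be checked numerically. The following Python sketch is only an illustration with hypothetical function names and hand-picked sample values (64-bit products near multiples of 2^31, positive and negative). It relies on Python's `>>` being an arithmetic (flooring) shift on negative integers.

```python
def single_round(x: int, k: int) -> int:
    """(x + (1 << (30 + k))) >> (31 + k): one rounding over the total shift."""
    return (x + (1 << (30 + k))) >> (31 + k)


def truncate_then_round(x: int, k: int) -> int:
    """((x >> 31) + (1 << (k - 1))) >> k, with a zero rounding offset when k == 0."""
    offset = (1 << (k - 1)) if k >= 1 else 0
    return ((x >> 31) + offset) >> k


# Products q * 2^31 + r, with low parts r on both sides of the halfway point 2^30.
LOW_PARTS = (0, 1, (1 << 30) - 1, 1 << 30, (1 << 30) + 1, (1 << 31) - 1)
SAMPLES = [q * (1 << 31) + r for q in range(-64, 65) for r in LOW_PARTS]


def count_mismatches(k: int) -> int:
    """Number of samples on which the two formulations disagree for this k."""
    return sum(1 for x in SAMPLES if single_round(x, k) != truncate_then_round(x, k))


for k in range(0, 9):
    print(f"k={k}: {count_mismatches(k)} mismatches out of {len(SAMPLES)}")
```

On these samples the two forms agree for every k from 1 to 8 and disagree only for k = 0, where the truncate-then-round form has no rounding offset left to add. For example, x = 3 << 30 (1.5 in units of 2^31) gives 2 with the single rounding but 1 with the truncating form.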
For what it's worth, independently from the people in this PR, we've been using this new scheme on Cortex-M and RISC-V microcontrollers successfully for a while, where it too resolves to more efficient instructions than the old scheme (while also being more accurate). This is in TFLite Micro, but the code and issue were essentially the same.
Thank you so much, Dominic and Tom, for the explanations, which addressed multiple misunderstandings on my part. I hadn't properly read the patch regarding Thanks for the explanation on how the switch to the truncating Overall this PR looks great. The merge conflict is easy: I've added comments to the local functions in
And please update that referred-to comment in Lines 65 to 73 in 2887692
While we're on the topic of comments: this sentence,
deserves some expansion, since it's the key to the whole approach and it contains a non-trivial idea (that SQDMULH is actually better than SQRDMULH because its truncating behavior is that of a truncating right shift, which allows perfect combining with another truncating right shift). This change introduces off-by-one test errors on x86: x86 behavior is unchanged but the reference code is changed. As @talumbau said above, this is OK and we'll want to update the x86 code to match. In the interim, I made PR #251 to relax the tolerance; I'll merge it just before yours. Finally, there is one compilation issue which prevented me from trying your code on Android NDK r21d (default clang-based toolchain, clang version 9.0.8):
Could you take care of it?
Alter sequence to a single rounded scaling with normal rounded shift. Double rounding and symmetric rounding are removed compared to reference. Double rounding seems unnecessary and can complicate implementations. Moreover, symmetric rounding also adds implementation complexity. For NEON the new sequence can be translated to VQDMULH + VRSHR.
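As a rough illustration of the NEON mapping, the sketch below models the two instructions on a single 32-bit lane in Python and compares them with a single-rounded reference. The function names are hypothetical, the models follow the documented instruction semantics as I understand them (saturating doubling multiply returning the truncated high half, then a round-half-up shift right by an immediate of 1 to 32), and none of this is the kernel code from the patch.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def vqdmulh_s32(a: int, b: int) -> int:
    """Model of VQDMULH.S32 on one lane: saturate((2 * a * b) >> 32), truncating."""
    return max(INT32_MIN, min(INT32_MAX, (2 * a * b) >> 32))


def vrshr_s32(v: int, n: int) -> int:
    """Model of VRSHR.S32 on one lane: round-half-up shift right by n in 1..32."""
    if not 1 <= n <= 32:
        raise ValueError("immediate shift must be in 1..32")
    return (v + (1 << (n - 1))) >> n


def neon_rescale(acc: int, multiplier: int, k: int) -> int:
    """Truncating high multiply followed by one rounding shift, as in the commit text."""
    return vrshr_s32(vqdmulh_s32(acc, multiplier), k)


def reference_rescale(acc: int, multiplier: int, k: int) -> int:
    """Single rounding over the total shift of 31 + k bits."""
    return (acc * multiplier + (1 << (30 + k))) >> (31 + k)


# Hypothetical inputs: multiplier 0.5 in Q31 and an extra right shift of 3.
for acc in (1000, -1000, 7, -7):
    print(acc, neon_rescale(acc, 1 << 30, 3), reference_rescale(acc, 1 << 30, 3))
```

The only place where this model saturates is INT32_MIN times INT32_MIN, whose doubled high half does not fit in 32 bits.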
- Elaborate on the proposed approach
- Fix Android compilation issue by using MVN instead of MOVI
Hello @bjacob, Thanks for your comments! Hope this helps,
Thanks! I gave it a try and there is a test failure on Android/aarch64 in the tests where the destination type is Here are my steps to run all tests on Android/aarch64 with cmake/ctest assuming that you have an Android device connected to your workstation and accessible via
Result:
Running one with verbose output:
output:
Hello @bjacob, Thanks for your prompt response. I did test the dot-product kernels but might have missed the i16 output kernels. I will have a look as soon as possible. Apologies for any inconvenience.
No problem at all! Note that the same kernels handle all output types; different output types correspond to different conditional branches only at the very end of each kernel, and moreover, the rescaling arithmetic which you perform on accumulators still represented in EDIT: one possibility for why
NEON dot-product kernels were conditionally applying the left shift only when `multiplier_exponent` was greater than zero. The new approach alters the way shifts are calculated, thus removing these conditional paths. Moreover, no path seems to use this flag, so we extracted it from the common logic and always set it to true.
@bjacob the issue was related to the left shift that was conditionally applied only when Hope this resolves the issues on your side as well.
Thanks! Now all tests pass on my side on arm64, arm32 and x86-64.
We're getting there. Looks like only 1 google test needed a trivial relaxation. Also submitted PR #251. Doing another round now ...
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227 PiperOrigin-RevId: 385218562
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227 PiperOrigin-RevId: 385232964