
Simplify quantized multiplier #227

Closed · wants to merge 3 commits
Conversation

GeorgeARM
Contributor

Alter sequence to a single rounded scaling with normal rounded shift.
Double rounding and symmetric rounding are removed compared to
reference. Double rounding seems unnecessary and can complicate
implementations. Moreover, symmetric rounding also adds implementation
complexity.

For NEON the new sequence can be translated to VQDMULH + VRSHR.
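
In scalar code, the proposed single-rounding rescale amounts to a 64-bit multiply, one added rounding offset, and one right shift. A minimal sketch (the function and variable names are mine, not the PR's), assuming a Q0.31 multiplier and a non-positive exponent:

```cpp
#include <cstdint>

// Hypothetical sketch of the single-rounding scheme described above:
// computes round_to_nearest(acc * multiplier * 2^exponent), where
// multiplier_fixedpoint is a Q0.31 fixed-point value and exponent <= 0,
// so total_shift = 31 - exponent >= 31.
std::int32_t SingleRoundingRescale(std::int32_t acc,
                                   std::int32_t multiplier_fixedpoint,
                                   int exponent) {
  const int total_shift = 31 - exponent;
  const std::int64_t scaled =
      static_cast<std::int64_t>(acc) * multiplier_fixedpoint;
  const std::int64_t round = std::int64_t{1} << (total_shift - 1);
  return static_cast<std::int32_t>((scaled + round) >> total_shift);
}
```

For example, with acc=1, multiplier_fixedpoint=1<<30 (i.e. 0.5) and exponent=-1, the exact value is 0.25, which correctly rounds to 0.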

@google-cla google-cla bot added the cla: yes label Nov 30, 2020
@GeorgeARM
Contributor Author

GeorgeARM commented Nov 30, 2020

CC: @jdduke , @talumbau

Tests seem to pass on our side.
Most of them, though, operate on negative shifts; is there a way to test a range of positive shifts through your test framework as well?

Moreover, apply_multiplier_test passes with minor changes for a corner-case scenario.

@bjacob
Contributor

bjacob commented Jan 19, 2021

Sorry that I didn't see this earlier! Thanks for the PR. I need to ponder this and get back to you. @jdduke @talumbau @silvasean any opinion?

@talumbau
Collaborator

I've been discussing the rescale operation and single vs. double rounding with folks at ARM for a while. Some additional context is in this issue:

tensorflow/tensorflow#25087

I confirmed that this is actually the rescale that was in the x86 kernels before I changed it over to be the same as the ARM kernels. Generally, I am in favor of all of the following having the same rescale operation:

  • Ruy ARM kernels
  • Ruy x86 kernels
  • Ruy reference code
  • TF Lite reference code

AFAICT it is superior from an accuracy perspective to use the single versus the double rounding. If this is correct, then I would be in favor of all of the above going to the single rounding. This change would be significant so most of my efforts around this have been related to testing issues on the TF Lite side.

@bjacob
Contributor

bjacob commented Jan 20, 2021

Thanks again for filing this PR. I have a few comments/questions.

  1. Comment on terminology: the term "rounding" seems to have two different meanings here:

    1. Meaning 1: "rounding" means snapping a non-representable value to some nearby representable value, as opposed to an exact operation.
      1. Example: fixed-point multiplications SQDMULH and SQRDMULH are both "rounding" in this sense, as opposed to plain integer multiplications (MUL) which are exact.
    2. Meaning 2: in the context of ARM ISAs, "rounding" means rounding to nearest with ties broken upwards, as opposed to rounding downwards (to negative infinity).
      1. Example: SQRDMULH is rounding in this sense as opposed to SQDMULH. Likewise, SRSHL as opposed to SSHL.
  2. Question, following the above Comment: when in this PR we talk about replacing a double rounding by a single rounding, are we talking about the switching from SQRDMULH to SQDMULH? If so, since that means that the rounding that we are removing is a Meaning 2 rounding, not a Meaning 1 rounding, and we still have two Meaning 1 roundings (SQDMULH and SRSHL are both rounding in Meaning 1), what is the rationale for this change - why is SQDMULH+SRSHL an improvement over SQRDMULH+SRSHL? (This is a real question --- I'm not implying here that this change isn't useful --- but I'm pointing out that if it is useful it must be for a non-trivial reason, which I would like to understand, and which might be either mathematical or empirical.)

  3. Since tensorflow/tensorflow#25087 ("Two roundings in MultiplyByQuantizedMultiplier (TFLite) leads to inconsistent result with tensorflow") was mentioned above, I would like to check that we are on the same page: that issue was about switching from 2 roundings to 1 rounding in the Meaning 1, since it was implementing a scalar primitive performing both the multiplication and the shift in exact integer arithmetic before rounding at the end. I'm referring specifically to this comment's SaturatingRoundingDoublingHighMulWithShift function. This is unlike the present code diff, which is (assuming we agreed on the above point) about switching from 2 roundings to 1 rounding in the Meaning 2, while keeping 2 roundings in Meaning 1. Do we agree that these are two quite different conversations?

  4. Looking at the present code diff, I was intrigued by the change to how bit shifts amounts are computed: the new code is putting the constant -1 in register v8, then does (in a context where v14 holds the exponent): smin v11.4s, v8.4s, v14.4s and sub v12.4s, v14.4s, v11.4s. So it is setting: v11 = min(-1, exponent) and v12 = exponent - v11. Consider the case where exponent>=0. Then this sets v11 = -1 and v12 = exponent+1. As v12 is used as the left shift amount before the fixed-point multiplication, this means that any positive exponent is increased by +1. We never defined exactly the allowed value bounds for int32 accumulators and for positive exponents, so I can't say that this would push any real case over the overflow limit, but I just wanted to point this out. At least I would like to understand what is the rationale for adding +1 to left shifts and then subsequently performing a right shift by 1? Does it serve an accuracy purpose and how does it relate to the switch of fixed-point multiplication from SQRDMULH to SQDMULH? Is there a mathematical principle to this or is this empirical?
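For reference, the shift computation described in point 4 can be written out in scalar form as follows (a sketch; the v-register names come from the asm in the diff, the C names are mine):

```cpp
#include <algorithm>

// v11 = min(-1, exponent): the right-shift amount is -v11, always >= 1.
// v12 = exponent - v11: the left-shift amount, i.e. exponent + 1 whenever
// exponent >= 0, and 0 whenever exponent < 0.
void ComputeShiftAmounts(int exponent, int* left_shift, int* right_shift) {
  const int v11 = std::min(-1, exponent);
  *right_shift = -v11;
  *left_shift = exponent - v11;
}
```

So exponent=2 gives left_shift=3 and right_shift=1, while exponent=-3 gives left_shift=0 and right_shift=3: any non-negative exponent is increased by one and then compensated by a right shift of 1.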

Finally, regarding your above question:

Most of them operate though on negative shifts, is there a way to test a range of positive shifts through your test framework as well?

Not really --- ruy's test framework isn't exercising that very well. TFLite's tests (which exercise ruy as a back-end) add some test coverage. Once we have narrowed down exactly what code change we want to make, we can discuss what kind of ad-hoc testing might be needed for a change like this.

@dominicsymes

Hi @bjacob,

Thanks for your comments on this PR.

In your question (2) based on comment (1) the description of replacing double rounding by single rounding is referring to the base C implementation in apply_multiplier.cc where the two rounding operations (SaturatingRoundingDoublingHighMul and RoundingRightShift) are replaced by a single rounding operation (addition of 'round' to result and then shift right by total_shift). We think the benefits of this are twofold:

  • In terms of implementation, it can be implemented as a 32x32->64 bit multiply, add and right shift. These are standard operations available in programming languages and processors and so should enable efficient bit-exact implementations in software or hardware to the proposed C implementation in apply_multiplier.cc.
  • In terms of mathematical result, the double round can introduce a bias, meaning that the rounding is not to the nearest value. This is, I think, the issue commented on by @Tombana in tensorflow/tensorflow#25087, where @Tombana gives the example of n+0.25 rounding to n+1.0 for n=0,1,2. If the parameters to MultiplyByQuantizedMultiplier() are x=1, quantized_multiplier=1<<30, shift=-1, then the correct mathematical result is 0.25, but in the original code SaturatingRoundingDoublingHighMul() will calculate (1*(1<<30) + (1<<30))>>31 = 1 and then RoundingRightShift() will calculate (1+1)>>1 = 1. The double rounding introduces an upward bias of 0.25 for positive values. More generally, for shift=-k the original code introduces an upward bias of 0.5/(1<<k) for positive values.
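
The bias is easy to reproduce in scalar code. A sketch of both schemes, simplified to non-negative inputs (the helper names are mine; the real gemmlowp helpers also handle sign-dependent nudges and saturation):

```cpp
#include <cstdint>

// Old scheme (simplified, non-negative x only): an SQRDMULH-style rounding
// high multiply, followed by a rounding right shift. Requires right_shift >= 1.
std::int32_t DoubleRound(std::int32_t x, std::int32_t mult, int right_shift) {
  const std::int64_t ab = static_cast<std::int64_t>(x) * mult;
  const std::int32_t high =
      static_cast<std::int32_t>((ab + (std::int64_t{1} << 30)) >> 31);
  return (high + (1 << (right_shift - 1))) >> right_shift;
}

// Proposed scheme: a single rounding offset and a single right shift.
std::int32_t SingleRound(std::int32_t x, std::int32_t mult, int right_shift) {
  const std::int64_t ab = static_cast<std::int64_t>(x) * mult;
  const int total_shift = 31 + right_shift;
  return static_cast<std::int32_t>(
      (ab + (std::int64_t{1} << (total_shift - 1))) >> total_shift);
}
```

For x=1, mult=1<<30, right_shift=1 (i.e. shift=-1), the exact value is 0.25: DoubleRound returns 1 while SingleRound returns 0.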

So, regarding your point (3), we are switching the base C implementation (in apply_multiplier.cc) from two rounding operations to one rounding, and I think this part is similar to the conversation in tensorflow/tensorflow#25087. What we show in this patch is that it is possible to modify the optimized Arm code implementation to match the single-rounding base C implementation proposed in apply_multiplier.cc. Although, as you point out, there are still two roundings (meaning 1) occurring in this optimized code, they combine to give the same effect as a single round (as in the C code). This is because the first rounding (meaning 1) is a truncating right shift, and truncating right shifts can be combined: ((x>>a)>>b) = x>>(a+b). So the single round (x+(1<<(30+k)))>>(31+k), where total_shift=31+k, can be expressed as ((x>>31) + (1<<(k-1)))>>k without affecting the result when k>=1.
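
The combining identity can be checked by brute force over a range of inputs; a quick sketch (assuming >> is an arithmetic shift for negative operands, which C++20 guarantees and mainstream compilers already provide):

```cpp
#include <cstdint>

// Verifies (x + (1 << (30+k))) >> (31+k) == ((x >> 31) + (1 << (k-1))) >> k,
// the identity that lets the truncating SQDMULH high-multiply combine with a
// rounding right shift into a single round. Holds for all x when k >= 1.
bool IdentityHolds(std::int64_t x, int k) {
  const std::int64_t lhs = (x + (std::int64_t{1} << (30 + k))) >> (31 + k);
  const std::int64_t rhs = ((x >> 31) + (std::int64_t{1} << (k - 1))) >> k;
  return lhs == rhs;
}

bool CheckIdentityOverRange() {
  for (int k = 1; k <= 8; ++k) {
    // Sample the 64-bit product range with an arbitrary odd stride.
    for (std::int64_t x = -(std::int64_t{1} << 40);
         x <= (std::int64_t{1} << 40); x += 977777) {
      if (!IdentityHolds(x, k)) return false;
    }
  }
  return true;
}
```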

Regarding point (4), the reason for ensuring the right shift is greater or equal to 1 is because the equation above, (x+(1<<(30+k)))>>(31+k) = ((x>>31) + (1<<(k-1)))>>k, is only true for k>=1. When k=0 the left side would round to nearest but the right side would just truncate (the rounding right shift instruction takes 1<<(k-1) to be zero when k is 0 - it does not add any rounding offset). We don't think forcing the right shift k>=1 is an issue in practice, but as you say it is a difference to the accumulator bound for shift>=0.

I hope the above helps to clarify but please let us know if you have any more questions or comments. Thanks.

@Tombana

Tombana commented Jan 27, 2021

For what it's worth, independently from the people in this PR, we've been using this new scheme successfully for a while on Cortex-M and RISC-V microcontrollers, where it too resolves to more efficient instructions than the old scheme (while also being more accurate). This is in TFLite Micro, but the code and the issue were essentially the same.

bjacob added a commit to bjacob/ruy that referenced this pull request Feb 2, 2021
@bjacob
Contributor

bjacob commented Feb 2, 2021

Thank you so much, Dominic and Tom, for the explanations, which addressed multiple misunderstandings on my part.

I hadn't properly read the patch regarding MultiplyByQuantizedMultiplier - sorry.

Thanks for the explanation of how the switch to the truncating SQDMULH enables bit-exact agreement with the reference code, thanks to truncating right shifts being combinable -- not only was this a key thing I needed in order to understand this PR, but it is a really interesting general idea that I'm thankful to have learned from your explanation.

Overall this PR looks great. The merge conflict is easy: I've added comments to the local functions in apply_multiplier.cc which you are deleting. My only request there would be please preserve this part of the comment which I just added:

// Warning: this code is not meant to be bit-exact-normative.
// Please refer to the class comment of ruy::MulParams, in mul_params.h.

And please update that referred-to comment in mul_params.h to match your new code --- hint: it was going into too much detail, so it's perhaps best to update it by trimming it.

ruy/ruy/mul_params.h

Lines 65 to 73 in 2887692

// In the latter case (DstScalar integral and narrower than std::int32_t),
// the multiplier effect on accumulator values is as follows:
//
// 1. Left shift by max(0, multiplier_exponent).
// 2. Fixed-point multiplication by multiplier_fixedpoint in Q0.31 format.
// 3. Rounding right shift by max(0, -multiplier_exponent).
//
// Reference code for this can be found in the implementation of
// ruy::ApplyMultiplier. If you look there, you'll find warnings like this:

While we're on the topic of comments: this sentence,

`On NEON this can be translated to a SQDMULH + rounding shift right sequence`

deserves some expansion, since it's the key to the whole approach and it contains a non-trivial idea (that SQDMULH is actually better than SQRDMULH because its truncating behavior is that of a truncating right shift, which allows perfect combining with another truncating right shift).

This change introduces off-by-one test errors on x86 --- x86 behavior is unchanged but the reference code is changed. As @talumbau said above, this is OK and we'll want to update the x86 code to match. In the interim, I made PR #251 to relax tolerance; I'll merge it just before yours.

Finally, there is one compilation issue which prevented me from trying your code on Android NDK r21d (default clang-based toolchain, clang version 9.0.8):

ruy/kernel_arm64.cc:403:10: error: immediate must be an integer in range [0, 255].
        "movi v8.4s, #-1\n"
         ^
<inline asm>:170:13: note: instantiated into assembly here

Could you take care of it?

- Elaborate on the proposed approach
- Fix Android compilation issue by using MVN instead of MOVI
@GeorgeARM
Contributor Author

Hello @bjacob ,

Thanks for your comments!
Updated the patch as per your suggestions.
Finally, I did build with r21 and tested the patch without any issues on my side.

Hope this helps,
George

@bjacob
Contributor

bjacob commented Feb 5, 2021

Thanks! I gave it a try and there is a test failure on Android/aarch64, only in the tests where the destination type is int16. Moreover, as we can see in the test output below, only the Dotprod kernels are affected, and the effect of the bug is output values off by 2x, as if a shift amount were off by 1.

Here are my steps to run all tests on Android/aarch64 with cmake/ctest assuming that you have an Android device connected to your workstation and accessible via adb.

mkdir ruy-android-build
cd ruy-android-build
# -G Ninja is entirely optional, should reproduce in the same way with default Makefile generator
# The precise ANDROID_PLATFORM shouldn't play any role either -
# just happens to be the latest in NDK r21d which I'm using.
# ~/android-ndk-r21d/ is just where I've untarred the Android NDK.
cmake ../ruy -G Ninja \
   -DCMAKE_TOOLCHAIN_FILE=~/android-ndk-r21d/build/cmake/android.toolchain.cmake \
   -DANDROID_ABI=arm64-v8a \
   -DANDROID_PLATFORM=android-29
cmake --build . -j12
ctest -j12

Result:

The following tests FAILED:
	 20 - ruy_test_fast_u8_u8_i32_i16 (Failed)
	 26 - ruy_test_slow_u8_u8_i32_i16 (Failed)

Running one with verbose output:

ctest -R ruy_test_fast_u8_u8_i32_i16 --verbose

output:

20: Error: Not all paths agree. 
20: These paths all agree with each other: kStandardCpp, kInternalStandardCppVariant1, kInternalStandardCppVariant2, kInternalStandardCppVariant3, kNeon/kA55ish, kNeon/kGeneric, kNeon, kReference, but disagree with the rest.
20: These paths all agree with each other: kNeonDotprod/kA55ish, kNeonDotprod/kGeneric, kNeonDotprod, but disagree with the rest.
20: Shape: rows = 1, cols = 1, depth = 1
20: Stats of the good result matrix: (median = -809, mean = -809, min = -809, max = -809)
20: Stats of the bad result matrix:  (median = -404, mean = -404, min = -404, max = -404)
20: Errors found in ALL rows.
20: Errors found in ALL cols.
20: The first error occurs at row 0, col 0
20: Good value: -809
20: Bad value : -404
20: Region of Good result matrix around first error:
20: 
20: -809 
20: 
20: Region of Bad result matrix around first error:
20: 
20: -404 

@GeorgeARM
Contributor Author

GeorgeARM commented Feb 5, 2021

Hello @bjacob ,

Thanks for your prompt response. I did test the dot-product kernels but might have missed the i16 output ones. Will have a look as soon as possible.

Apologies for any inconvenience.

@bjacob
Contributor

bjacob commented Feb 5, 2021

No problem at all! Note that the same kernels handle all output types: different output types correspond to different conditional branches only at the very end of each kernel, and the rescaling arithmetic, performed on accumulators still represented as int32, is shared (the branching for different output types happens after that point). So it's not obvious to me how you could have an int16-specific bug there, but maybe comparison with the kNeon kernels, which don't have the bug, will shed light.

EDIT: one possibility for why only int16 is affected might be that int16 results in higher shift amounts, in the sense of "more likely to be a left shift, as opposed to a right shift". For example, if you had a bug specific to the case where the exponent is positive, or zero, then that could manifest itself only in int16 tests.

NEON dot-product kernels were conditionally applying the left shift
only when `multiplier_exponent` was greater than zero.
The new approach alters the way shifts are calculated thus removing
these conditional paths.

Moreover, no path seems to use this flag, so we extracted it from the
common logic and always set it to true.
@GeorgeARM
Contributor Author

@bjacob the issue was related to the left shift that was conditionally applied only when RUY_ASM_FLAG_NEEDS_LEFT_SHIFT was set, which was true either in the case of per-channel quantization or when multiplier_exponent was greater than 0.
As the logic for computing the shifts has changed, I removed these conditional paths from the NEON dot-product kernels, which, as far as I can tell, was the only place this flag was used.

Hope this resolves the issues on your side as well.

@bjacob
Contributor

bjacob commented Feb 8, 2021

Thanks! Now all tests pass on my side on arm64, arm32 and x86-64.
RUY_ASM_FLAG_NEEDS_LEFT_SHIFT served the purpose of skipping the left shifts when they were not needed; however, with current workloads now all using per-channel quantization, the opportunity to skip the left shifts was already going away, as you found.
This is looking ready as far as I can see. Now I'm going to pull it into Google's internal mega-repository. This process might take a few days for this PR specifically: since it changes quantized inference behavior, it might break (possibly overly strict) tests somewhere in Google. To prevent that, I'm going to request a full run of all of Google's tests before we actually merge this. The iteration cycle on this is ~1 day and I might need to iterate a few times (hopefully just adjusting tests).

copybara-service bot pushed a commit that referenced this pull request Feb 8, 2021
Closes #251

COPYBARA_INTEGRATE_REVIEW=#251 from bjacob:relax c8d2cf9
PiperOrigin-RevId: 356340585
@bjacob
Contributor

bjacob commented Feb 8, 2021

We're getting there. Looks like only 1 google test needed a trivial relaxation. Also submitted PR #251. Doing another round now ...

@copybara-service copybara-service bot closed this in be760b6 Feb 9, 2021
copybara-service bot pushed a commit to google/XNNPACK that referenced this pull request Jul 16, 2021
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227

PiperOrigin-RevId: 385218562
copybara-service bot pushed a commit to google/XNNPACK that referenced this pull request Jul 16, 2021
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227

PiperOrigin-RevId: 385232964