
Simplify quantized multiplier #227

Closed · wants to merge 3 commits
Conversation

GeorgeARM
Contributor

Alter sequence to a single rounded scaling with normal rounded shift.
Double rounding and symmetric rounding are removed compared to
reference. Double rounding seems unnecessary and can complicate
implementations. Moreover, symmetric rounding also adds implementation
complexity.

For NEON the new sequence can be translated to VQDMULH + VRSHR.
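
In scalar code, the proposed single-rounding rescale amounts to a 64-bit multiply, one added rounding offset, and one right shift. A minimal sketch (the function and variable names are mine, not the PR's), assuming a Q0.31 multiplier and a non-positive exponent:

```cpp
#include <cstdint>

// Hypothetical sketch of the single-rounding scheme described above:
// computes round_to_nearest(acc * multiplier * 2^exponent), where
// multiplier_fixedpoint is a Q0.31 fixed-point value and exponent <= 0,
// so total_shift = 31 - exponent >= 31.
std::int32_t SingleRoundingRescale(std::int32_t acc,
                                   std::int32_t multiplier_fixedpoint,
                                   int exponent) {
  const int total_shift = 31 - exponent;
  const std::int64_t scaled =
      static_cast<std::int64_t>(acc) * multiplier_fixedpoint;
  const std::int64_t round = std::int64_t{1} << (total_shift - 1);
  return static_cast<std::int32_t>((scaled + round) >> total_shift);
}
```

For example, with acc=1, multiplier_fixedpoint=1<<30 (i.e. 0.5) and exponent=-1, the exact value is 0.25, which correctly rounds to 0.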

@google-cla google-cla bot added the cla: yes label Nov 30, 2020
@GeorgeARM
Contributor Author

GeorgeARM commented Nov 30, 2020

CC: @jdduke , @talumbau

Tests seem to pass on our side.
Most of them, though, operate on negative shifts; is there a way to test a range of positive shifts through your test framework as well?

Moreover, apply_multiplier_test passes with minor changes for a corner-case scenario.

@bjacob
Contributor

bjacob commented Jan 19, 2021

Sorry that I didn't see this earlier! Thanks for the PR. I need to ponder this and get back to you. @jdduke @talumbau @silvasean any opinion?

@talumbau
Collaborator

I've been discussing the rescale operation and single vs. double rounding with folks at ARM for a while. Some additional context is in this issue:

tensorflow/tensorflow#25087

I confirmed that this is actually the rescale that was in the x86 kernels before I changed it over to be the same as the ARM kernels. Generally, I am in favor of all of the following having the same rescale operation:

  • Ruy ARM kernels
  • Ruy x86 kernels
  • Ruy reference code
  • TF Lite reference code

AFAICT it is superior from an accuracy perspective to use the single versus the double rounding. If this is correct, then I would be in favor of all of the above going to the single rounding. This change would be significant so most of my efforts around this have been related to testing issues on the TF Lite side.

@bjacob
Contributor

bjacob commented Jan 20, 2021

Thanks again for filing this PR. I have a few comments/questions.

  1. Comment on terminology: the term "rounding" seems to have two different meanings here:

    1. Meaning 1: "rounding" means snapping a non-representable value to some nearby representable value, as opposed to an exact operation.
      1. Example: fixed-point multiplications SQDMULH and SQRDMULH are both "rounding" in this sense, as opposed to plain integer multiplications (MUL) which are exact.
    2. Meaning 2: in the context of ARM ISAs, "rounding" means rounding to nearest with ties broken upwards, as opposed to rounding downwards (to negative infinity).
      1. Example: SQRDMULH is rounding in this sense as opposed to SQDMULH. Likewise, SRSHL as opposed to SSHL.
  2. Question, following the above Comment: when in this PR we talk about replacing a double rounding by a single rounding, are we talking about the switching from SQRDMULH to SQDMULH? If so, since that means that the rounding that we are removing is a Meaning 2 rounding, not a Meaning 1 rounding, and we still have two Meaning 1 roundings (SQDMULH and SRSHL are both rounding in Meaning 1), what is the rationale for this change - why is SQDMULH+SRSHL an improvement over SQRDMULH+SRSHL? (This is a real question --- I'm not implying here that this change isn't useful --- but I'm pointing out that if it is useful it must be for a non-trivial reason, which I would like to understand, and which might be either mathematical or empirical.)

  3. Since tensorflow/tensorflow#25087 ("Two roundings in MultiplyByQuantizedMultiplier (TFLite) leads to inconsistent result with tensorflow") was mentioned above, I would like to check that we are on the same page: that issue was about switching from 2 roundings to 1 rounding in the Meaning 1, since it was implementing a scalar primitive performing both the multiplication and the shift in exact integer arithmetic before rounding at the end. I'm referring specifically to this comment's SaturatingRoundingDoublingHighMulWithShift function. This is unlike the present code diff, which is (assuming we agreed on the above point) about switching from 2 roundings to 1 rounding in the Meaning 2, while keeping 2 roundings in Meaning 1. Do we agree that these are two quite different conversations?

  4. Looking at the present code diff, I was intrigued by the change to how bit shifts amounts are computed: the new code is putting the constant -1 in register v8, then does (in a context where v14 holds the exponent): smin v11.4s, v8.4s, v14.4s and sub v12.4s, v14.4s, v11.4s. So it is setting: v11 = min(-1, exponent) and v12 = exponent - v11. Consider the case where exponent>=0. Then this sets v11 = -1 and v12 = exponent+1. As v12 is used as the left shift amount before the fixed-point multiplication, this means that any positive exponent is increased by +1. We never defined exactly the allowed value bounds for int32 accumulators and for positive exponents, so I can't say that this would push any real case over the overflow limit, but I just wanted to point this out. At least I would like to understand what is the rationale for adding +1 to left shifts and then subsequently performing a right shift by 1? Does it serve an accuracy purpose and how does it relate to the switch of fixed-point multiplication from SQRDMULH to SQDMULH? Is there a mathematical principle to this or is this empirical?
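For reference, the shift computation described in point 4 can be written out in scalar form as follows (a sketch; the v-register names come from the asm in the diff, the C names are mine):

```cpp
#include <algorithm>

// v11 = min(-1, exponent): the right-shift amount is -v11, always >= 1.
// v12 = exponent - v11: the left-shift amount, i.e. exponent + 1 whenever
// exponent >= 0, and 0 whenever exponent < 0.
void ComputeShiftAmounts(int exponent, int* left_shift, int* right_shift) {
  const int v11 = std::min(-1, exponent);
  *right_shift = -v11;
  *left_shift = exponent - v11;
}
```

So exponent=2 gives left_shift=3 and right_shift=1, while exponent=-3 gives left_shift=0 and right_shift=3: any non-negative exponent is increased by one and then compensated by a right shift of 1.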

Finally, regarding your above question:

Most of them operate though on negative shifts, is there a way to test a range of positive shifts through your test framework as well?

Not really --- ruy's test framework isn't exercising that very well. TFLite's tests (which exercise ruy as a back-end) add some test coverage. Once we have narrowed down exactly what code change we want to make, we can discuss what kind of ad-hoc testing might be needed for a change like this.

@dominicsymes

Hi @bjacob,

Thanks for your comments on this PR.

In your question (2) based on comment (1) the description of replacing double rounding by single rounding is referring to the base C implementation in apply_multiplier.cc where the two rounding operations (SaturatingRoundingDoublingHighMul and RoundingRightShift) are replaced by a single rounding operation (addition of 'round' to result and then shift right by total_shift). We think the benefits of this are twofold:

  • In terms of implementation, it can be implemented as a 32x32->64 bit multiply, add and right shift. These are standard operations available in programming languages and processors and so should enable efficient bit-exact implementations in software or hardware to the proposed C implementation in apply_multiplier.cc.
  • In terms of mathematical result, the double round can introduce a bias, meaning that the rounding is not to the nearest value. This is, I think, the issue commented on by @Tombana in tensorflow/tensorflow#25087, where @Tombana gives the example of n+0.25 rounding to n+1.0 for n=0,1,2. If the parameters to MultiplyByQuantizedMultiplier() are x=1, quantized_multiplier=1<<30, shift=-1, then the correct mathematical result is 0.25, but in the original code SaturatingRoundingDoublingHighMul() will calculate (1*(1<<30) + (1<<30))>>31 = 1 and then RoundingRightShift() will calculate (1+1)>>1 = 1. The double rounding introduces an upward bias of 0.25 for positive values. More generally, for shift=-k the original code introduces an upward bias of 0.5/(1<<k) for positive values.
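
The bias is easy to reproduce in scalar code. A sketch of both schemes, simplified to non-negative inputs (the helper names are mine; the real gemmlowp helpers also handle sign-dependent nudges and saturation):

```cpp
#include <cstdint>

// Old scheme (simplified, non-negative x only): an SQRDMULH-style rounding
// high multiply, followed by a rounding right shift. Requires right_shift >= 1.
std::int32_t DoubleRound(std::int32_t x, std::int32_t mult, int right_shift) {
  const std::int64_t ab = static_cast<std::int64_t>(x) * mult;
  const std::int32_t high =
      static_cast<std::int32_t>((ab + (std::int64_t{1} << 30)) >> 31);
  return (high + (1 << (right_shift - 1))) >> right_shift;
}

// Proposed scheme: a single rounding offset and a single right shift.
std::int32_t SingleRound(std::int32_t x, std::int32_t mult, int right_shift) {
  const std::int64_t ab = static_cast<std::int64_t>(x) * mult;
  const int total_shift = 31 + right_shift;
  return static_cast<std::int32_t>(
      (ab + (std::int64_t{1} << (total_shift - 1))) >> total_shift);
}
```

For x=1, mult=1<<30, right_shift=1 (i.e. shift=-1), the exact value is 0.25: DoubleRound returns 1 while SingleRound returns 0.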

So, regarding your point (3), we are switching the base C implementation (in apply_multiplier.cc) from two rounding operations to one rounding, and I think this part is similar to the conversation in tensorflow/tensorflow#25087. What we show in this patch is that it is possible to modify the optimized Arm code implementation to match the single-rounding base C implementation proposed in apply_multiplier.cc. Although, as you point out, there are still two roundings (meaning 1) occurring in this optimized code, they combine to give the same effect as a single round (as in the C code). This is because the first rounding (meaning 1) is a truncating right shift, and truncating right shifts can be combined: ((x>>a)>>b) = x>>(a+b). So the single round (x+(1<<(30+k)))>>(31+k), where total_shift=31+k, can be expressed as ((x>>31) + (1<<(k-1)))>>k without affecting the result when k>=1.
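
The combining identity can be checked by brute force over a range of inputs; a quick sketch (assuming >> is an arithmetic shift for negative operands, which C++20 guarantees and mainstream compilers already provide):

```cpp
#include <cstdint>

// Verifies (x + (1 << (30+k))) >> (31+k) == ((x >> 31) + (1 << (k-1))) >> k,
// the identity that lets the truncating SQDMULH high-multiply combine with a
// rounding right shift into a single round. Holds for all x when k >= 1.
bool IdentityHolds(std::int64_t x, int k) {
  const std::int64_t lhs = (x + (std::int64_t{1} << (30 + k))) >> (31 + k);
  const std::int64_t rhs = ((x >> 31) + (std::int64_t{1} << (k - 1))) >> k;
  return lhs == rhs;
}

bool CheckIdentityOverRange() {
  for (int k = 1; k <= 8; ++k) {
    // Sample the 64-bit product range with an arbitrary odd stride.
    for (std::int64_t x = -(std::int64_t{1} << 40);
         x <= (std::int64_t{1} << 40); x += 977777) {
      if (!IdentityHolds(x, k)) return false;
    }
  }
  return true;
}
```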

Regarding point (4), the reason for ensuring the right shift is greater or equal to 1 is because the equation above, (x+(1<<(30+k)))>>(31+k) = ((x>>31) + (1<<(k-1)))>>k, is only true for k>=1. When k=0 the left side would round to nearest but the right side would just truncate (the rounding right shift instruction takes 1<<(k-1) to be zero when k is 0 - it does not add any rounding offset). We don't think forcing the right shift k>=1 is an issue in practice, but as you say it is a difference to the accumulator bound for shift>=0.

I hope the above helps to clarify but please let us know if you have any more questions or comments. Thanks.

@Tombana

Tombana commented Jan 27, 2021

For what it's worth, independently from the people in this PR, we've been using this new scheme successfully for a while on Cortex-M and RISC-V microcontrollers, where it too resolves to more efficient instructions than the old scheme (while also being more accurate). This is in TFLite Micro, but the code and the issue were essentially the same.

bjacob added a commit to bjacob/ruy that referenced this pull request Feb 2, 2021
@bjacob
Contributor

bjacob commented Feb 2, 2021

Thank you so much, Dominic and Tom, for the explanations, which addressed multiple misunderstandings on my part.

I hadn't properly read the patch regarding MultiplyByQuantizedMultiplier - sorry.

Thanks for the explanation of how the switch to the truncating SQDMULH enables bit-exact agreement with the reference code, thanks to truncating right shifts being combinable -- not only was this a key thing I needed in order to understand this PR, but it is a really interesting general idea that I'm thankful to have learned from your explanation.

Overall this PR looks great. The merge conflict is easy: I've added comments to the local functions in apply_multiplier.cc which you are deleting. My only request there would be please preserve this part of the comment which I just added:

// Warning: this code is not meant to be bit-exact-normative.
// Please refer to the class comment of ruy::MulParams, in mul_params.h.

And please update that referred-to comment in mul_params.h to match your new code --- hint: it was going into too much detail, so it's perhaps best to update it by trimming it.

ruy/ruy/mul_params.h

Lines 65 to 73 in 2887692

// In the latter case (DstScalar integral and narrower than std::int32_t),
// the multiplier effect on accumulator values is as follows:
//
// 1. Left shift by max(0, multiplier_exponent).
// 2. Fixed-point multiplication by multiplier_fixedpoint in Q0.31 format.
// 3. Rounding right shift by max(0, -multiplier_exponent).
//
// Reference code for this can be found in the implementation of
// ruy::ApplyMultiplier. If you look there, you'll find warnings like this:

While we're on the topic of comments: this sentence,

`On NEON this can be translated to a SQDMULH + rounding shift right sequence`

deserves some expansion, since it's the key to the whole approach and it contains a non-trivial idea (that SQDMULH is actually better than SQRDMULH because its truncating behavior is that of a truncating right shift, which allows perfect combining with another truncating right shift).

This change introduces off-by-one test errors on x86 --- x86 behavior is unchanged but the reference code is changed. As @talumbau said above, this is OK and we'll want to update the x86 code to match. In the interim, I made PR #251 to relax tolerance; I'll merge it just before yours.

Finally, there is one compilation issue which prevented me from trying your code on Android NDK r21d (default clang-based toolchain, clang version 9.0.8):

ruy/kernel_arm64.cc:403:10: error: immediate must be an integer in range [0, 255].
        "movi v8.4s, #-1\n"
         ^
<inline asm>:170:13: note: instantiated into assembly here

Could you take care of it?

- Elaborate on the proposed approach
- Fix Android compilation issue by using MVN instead of MOVI
@GeorgeARM
Contributor Author

Hello @bjacob ,

Thanks for your comments!
Updated the patch as per your suggestions.
Finally, I did build with r21 and tested the patch without any issues on my side.

Hope this helps,
George

@bjacob
Contributor

bjacob commented Feb 5, 2021

Thanks! I gave it a try and there is a test failure on Android/aarch64, only in the tests where the destination type is int16. Moreover, as we can see in the test output below, only the Dotprod kernels are affected, and the effect of the bug is output values off by 2x, as if a shift amount were off by 1.

Here are my steps to run all tests on Android/aarch64 with cmake/ctest assuming that you have an Android device connected to your workstation and accessible via adb.

mkdir ruy-android-build
cd ruy-android-build
# -G Ninja is entirely optional, should reproduce in the same way with default Makefile generator
# The precise ANDROID_PLATFORM shouldn't play any role either -
# just happens to be the latest in NDK r21d which I'm using.
# ~/android-ndk-r21d/ is just where I've untarred the Android NDK.
cmake ../ruy -G Ninja \
   -DCMAKE_TOOLCHAIN_FILE=~/android-ndk-r21d/build/cmake/android.toolchain.cmake \
   -DANDROID_ABI=arm64-v8a \
   -DANDROID_PLATFORM=android-29
cmake --build . -j12
ctest -j12

Result:

The following tests FAILED:
	 20 - ruy_test_fast_u8_u8_i32_i16 (Failed)
	 26 - ruy_test_slow_u8_u8_i32_i16 (Failed)

Running one with verbose output:

ctest -R ruy_test_fast_u8_u8_i32_i16 --verbose

output:

20: Error: Not all paths agree. 
20: These paths all agree with each other: kStandardCpp, kInternalStandardCppVariant1, kInternalStandardCppVariant2, kInternalStandardCppVariant3, kNeon/kA55ish, kNeon/kGeneric, kNeon, kReference, but disagree with the rest.
20: These paths all agree with each other: kNeonDotprod/kA55ish, kNeonDotprod/kGeneric, kNeonDotprod, but disagree with the rest.
20: Shape: rows = 1, cols = 1, depth = 1
20: Stats of the good result matrix: (median = -809, mean = -809, min = -809, max = -809)
20: Stats of the bad result matrix:  (median = -404, mean = -404, min = -404, max = -404)
20: Errors found in ALL rows.
20: Errors found in ALL cols.
20: The first error occurs at row 0, col 0
20: Good value: -809
20: Bad value : -404
20: Region of Good result matrix around first error:
20: 
20: -809 
20: 
20: Region of Bad result matrix around first error:
20: 
20: -404 

@GeorgeARM
Contributor Author

GeorgeARM commented Feb 5, 2021

Hello @bjacob ,

Thanks for your prompt response. I did test the dot-product kernels but might have missed the i16 output ones. Will have a look as soon as possible.

Apologies for any inconvenience.

@bjacob
Contributor

bjacob commented Feb 5, 2021

No problem at all! Note that the same kernels handle all output types: different output types correspond to different conditional branches only at the very end of each kernel, and the rescaling arithmetic, performed on accumulators still represented as int32, is shared (the branching for different output types happens after that point). So it's not obvious to me how you could have an int16-specific bug there, but maybe comparison with the kNeon kernels, which don't have the bug, will shed light.

EDIT: one possibility for why only int16 is affected might be that int16 results in higher shift amounts, in the sense of "more likely to be a left shift, as opposed to a right shift". For example, if you had a bug specific to the case where the exponent is positive, or zero, then that could manifest itself only in int16 tests.

NEON dot-product kernels were conditionally applying the left shift
only when `multiplier_exponent` was greater than zero.
The new approach alters the way shifts are calculated thus removing
these conditional paths.

Moreover, no path seems to use this flag, so we extracted it from the
common logic and always set it to true.
@GeorgeARM
Contributor Author

@bjacob the issue was related to the left shift that was conditionally applied only when RUY_ASM_FLAG_NEEDS_LEFT_SHIFT was set, which was true either in the case of per-channel quantization or when multiplier_exponent was greater than 0.
As the logic for computing the shifts has changed, I removed these conditional paths from the NEON dot-product kernels, which, as far as I can tell, was the only place this flag was used.

Hope this resolves the issues on your side as well.

@bjacob
Contributor

bjacob commented Feb 8, 2021

Thanks! Now all tests pass on my side on arm64, arm32 and x86-64.
RUY_ASM_FLAG_NEEDS_LEFT_SHIFT served the purpose of skipping the left shifts when they were not needed; however, with current workloads now all using per-channel quantization, the opportunity to skip the left shifts was already going away, as you found.
This is looking ready as far as I can see. Now I'm going to pull it into Google's internal mega-repository. This process might take a few days for this PR specifically: since it changes quantized inference behavior, it might break (possibly overly strict) tests somewhere in Google. To prevent that, I'm going to request a full run of all of Google's tests before we actually merge this. The iteration cycle on this is ~1 day and I might need to iterate a few times (hopefully just adjusting tests).

copybara-service bot pushed a commit that referenced this pull request Feb 8, 2021
Closes #251

COPYBARA_INTEGRATE_REVIEW=#251 from bjacob:relax c8d2cf9
PiperOrigin-RevId: 356340585
@bjacob
Contributor

bjacob commented Feb 8, 2021

We're getting there. Looks like only 1 google test needed a trivial relaxation. Also submitted PR #251. Doing another round now ...

@copybara-service copybara-service bot closed this in be760b6 Feb 9, 2021
copybara-service bot pushed a commit to google/XNNPACK that referenced this pull request Jul 16, 2021
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227

PiperOrigin-RevId: 385218562
copybara-service bot pushed a commit to google/XNNPACK that referenced this pull request Jul 16, 2021
Evaluate Ruy requantization schema for ARM NEON, suggested in google/ruy#227

PiperOrigin-RevId: 385232964