Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: Faster divisibility check for constants #30282

Open
TuomLarsen opened this Issue Feb 17, 2019 · 17 comments

Comments

Projects
None yet
6 participants
@TuomLarsen
Copy link

TuomLarsen commented Feb 17, 2019

What version of Go are you using (go version)?

1.11

What operating system and processor architecture are you using (go env)?

amd64

What did you do?

Please consider the following code which tests whether the given number is divisible by 19:

func Divisible19(x uint64) bool {
    return x%19 == 0
}

What did you expect to see?

It could be rewritten as:

func Divisible19(x uint64) bool {
    return x*0x86bca1af286bca1b <= 0x0d79435e50d79435
}

which compiles to:

MOVQ    "".X+8(SP), AX
MOVQ    $-8737931403336103397, CX
IMULQ   CX, AX
MOVQ    $970881267037344821, CX
CMPQ    AX, CX

What did you see instead?

Instead, the original code compiles to:

MOVQ    $-5825287602224068931, AX
MOVQ    "".X+8(SP), CX
MULQ    CX
ADDQ    CX, DX
RCRQ    $1, DX
SHRQ    $4, DX
LEAQ    (DX)(DX*8), BX
LEAQ    (DX)(BX*2), DX
CMPQ    CX, DX

Similar situation happens for other constants as well.

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Feb 17, 2019

Dup of #15806? I started on that but got distracted. My initial implementation was broken and I never returned to it. Feel free to pick it up.

@josharian josharian added this to the Unplanned milestone Feb 17, 2019

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Feb 17, 2019

cc @randall77 as FYI

@josharian josharian changed the title Faster divisibility check for constants cmd/compile: Faster divisibility check for constants Feb 17, 2019

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Feb 17, 2019

Also possibly related:
https://lemire.me/blog/2019/02/08/faster-remainders-when-the-divisor-is-a-constant-beating-compilers-and-libdivide/ which I haven’t fully read and digested yet.

@TuomLarsen

This comment has been minimized.

Copy link
Author

TuomLarsen commented Feb 17, 2019

@josharian Interestingly, @bmkessler has a library implementing the above paper.

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Feb 17, 2019

Nice!

@bmkessler, I’d be happy to help you find your way around if you’d like to contribute that to the compiler. It might also be helpful in a few parts of the runtime, IIRC.

@bmkessler

This comment has been minimized.

Copy link
Contributor

bmkessler commented Feb 22, 2019

Yeah, I thought about opening an issue and submitting the changes for that library. The benefit for that method is ~25% on 32-bit divisions/mods by constants for amd64, which I don't think is a very common case.

The main complication for submitting this change, is that the rewrite rules are at the generic level currently e.g. https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/gen/generic.rules#L957-L981 and following which are only broken out by register size and useHmul. The speed up of Lemire's method depends on the relative difference in using Hmul64 versus the additional shifts/average in the current method. So if the generic rewrite rules are going to be modified, you'd need to test on other architectures. I haven't tested any other architecture, but Lemire notes that 32-bit division and mod are slower on arm and power using that method. One benefit to Lemire's method is that the rule is considerably simpler than the Granlund-Montgomery method because there is just a single case that doesn't depend on the constant divisor. However, there would also be two methods, since the current method would continue to be employed for all the other divisions, so I am not sure if the extra complexity is worth it.

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Feb 22, 2019

It’s easy enough to condition generic rules based on architecture, and they have minimal compiler performance impact.

It’s also pretty straightforward to check how many 32 bit divides there are in std; I vaguely recall that there are more than one would suspect.

In short, I’d be interested in going down this road at least a little. Particularly if #15806 got fixed along the way. :) But I’ll defer to Keith in the end.

@randall77

This comment has been minimized.

Copy link
Contributor

randall77 commented Feb 25, 2019

I'm ok with all of this if someone wants to implement it, and it really is faster.
My only demand is that we need good comments (like the ones in magic.go) and good tests.

@josharian

This comment has been minimized.

@bmkessler

This comment has been minimized.

Copy link
Contributor

bmkessler commented Mar 6, 2019

I implemented just the division portion of the rules based on Lemire's method and uint32 is faster on my amd64 machine, but int32 is slower.

name              old time/op  new time/op  delta
DivconstI32-4     1.52ns ± 0%  1.77ns ± 0%  +16.71%  (p=0.016 n=4+5)
DivconstU32-4     1.95ns ± 0%  1.38ns ± 0%  -29.09%  (p=0.000 n=5+4)

I can push that change. I didn't add the mod rule because I wasn't sure how to handle

q, r = x/d, x%d

currently the rules rewrite that as the equivalent of

q, r = x/d, x - (x/d)*d

So the x/d can be used in both calculations. If a different mod rule is put in place, there will be duplicate calculation for this case (three multiplications instead of two), but standalone mods would be faster as they are just two multiplications and no subtraction. So I am not sure if a stand alone mod rule is useful or if the case of division and mod can be handled some straightforward way. Just adding this div rule already speeds up the mod decently though.

ModconstU32-4     2.42ns ± 0%  2.07ns ± 0%  -14.46%  (p=0.008 n=5+5)
@TuomLarsen

This comment has been minimized.

Copy link
Author

TuomLarsen commented Mar 6, 2019

Thank you for looking into this! But, and I apologize if I misunderstood, the issue is about divisibility check by constant (i.e. x%d == 0), rather than division by constant (i.e. x%d). The former seem to be simpler and for small d (perhaps even as large as uint32?) it should be possible to test uint64%d == 0 with just one multiplication and compare, as shown at vert first comment. But again, maybe I misunderstood?

@bmkessler

This comment has been minimized.

Copy link
Contributor

bmkessler commented Mar 6, 2019

Sorry, I should have been more clear. It is possible to implement x%d == 0 in a single multiplication and compare as you mention for unsigned odd divisors of any size. For even divisors, there is an additional rotate required in the Granlund-Montgomery method and signed divisors require an additional add constant in both cases. (Hacker's Delight 10-17 is a better expanded reference than the note in the original paper).

As an initial update, I started to implement all of the changes from the Lemire paper for division, modulus and divisibility check on 32-bit integers, which are faster and don't involve any separate cases depending on the divisor (same magic numbers), but wasn't sure how to proceed for mod and divisibility since the current rules already rewrite Mod into a division. It looks like @josharian ran into the same issue on #15806 where he ended up using matching against the already expanded form of the Mod. Maybe he can help sort out how best to set up the rules and we can update the divisibility check for all the sizes of integers.

@josharian

This comment has been minimized.

Copy link
Contributor

josharian commented Mar 6, 2019

I’m afk for a bit. But...would something along the lines of https://go-review.googlesource.com/c/go/+/129379 work for you?

@randall77

This comment has been minimized.

Copy link
Contributor

randall77 commented Mar 6, 2019

I suspect it is often the case that when you do x%c == 0, you'll also be doing x/c. So I think we need some way to determine you're not doing the latter before we optimize the former.

It might be a good idea to grovel through some Go code to see how correct my suspicion is. If it is 99% correct, the x%c==0 optimization is ~useless. If it is only 10% correct, maybe we just do the optimization regardless. If it is in between, we might need to detect the divide somehow.

Do x/c and x%c==0 use the results of the same multiplies? Those are the most expensive sub-ops, so if they are shared that might be good enough.

@TuomLarsen

This comment has been minimized.

Copy link
Author

TuomLarsen commented Mar 6, 2019

@randall77 I know about this, which is x%c == 0 without the division.

@bmkessler

This comment has been minimized.

Copy link
Contributor

bmkessler commented Mar 7, 2019

@josharian Thanks! Disabling the Mod rule on the opt pass allowed the divisibility rule to be applied. @randall77 I miscounted instructions earlier, the multiplies use different constants (modular inverse for the divisibility check), but even when both x/d and x%d==0 are needed that is still only two multiplies in the old and new code. I benchmarked the divisibility check in both cases and the pure check x%d == 0 is about 35% faster on amd64 for uint64 and the case with both x/d and x%d == 0 is only 3% slower, so it seems like the divisibility check is a worthwhile optimization as long as the cases where we need both x/d and x%d == 0 are < 92% of the time.

name                     old time/op  new time/op  delta
DivisibleconstU64-4      3.00ns ± 0%  1.92ns ± 0%  -35.96%  (p=0.000 n=5+4)
DivisibleWDivconstU64-4  3.46ns ± 1%  3.56ns ± 1%   +2.89%  (p=0.008 n=5+5)

In addition to the isPrime above, the only other raw uses of numeric constants for x%n == 0 other than n a power of two in the standard library seem to be runtime and time/isLeap neither of which have a division with the same constant. There are a three uses of x%d != 0 as well.

I can push something up for review after fleshing out the tests and comments.

@gopherbot

This comment has been minimized.

Copy link

gopherbot commented Mar 18, 2019

Change https://golang.org/cl/168037 mentions this issue: cmd/compile: add unsigned divisibility rules

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.