math/bits: add extended precision Add, Sub, Mul, Div #24813
MulWW - this is same as math/big.mulWW |
This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic? |
arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> (EXTRconst [c] x y). If other targets also have EXTRconst this can be generic. |
I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64. |
Intel should have SHRD.
Maybe create something new? like math/uint128 |
Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise? |
are you suggesting a builtin [u]int128 type like how GCC does: https://godbolt.org/g/o6ka5e ? |
That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these. |
If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.) |
(and maybe signed versions?) Even The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.) |
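As a minimal sketch of the kind of multiword shift expression meant here, assuming a 128-bit value held as a (hi, lo) pair of uint64 and a shift count below 64. The helper name shr128 is hypothetical, and whether a given compiler version recognises this pattern and emits SHRD/EXTR is not something this sketch shows.

```go
package main

import "fmt"

// shr128 shifts the 128-bit value (hi, lo) right by c bits, for 0 <= c < 64.
// Hypothetical helper for illustration; not part of any standard package.
func shr128(hi, lo uint64, c uint) (rhi, rlo uint64) {
	// The bits shifted out of hi become the top bits of the low word.
	// Go defines a shift by 64 or more as 0, so c == 0 is also handled.
	rlo = lo>>c | hi<<(64-c)
	rhi = hi >> c
	return rhi, rlo
}

func main() {
	hi, lo := shr128(0x1, 0x0, 4)
	fmt.Printf("%#x %#x\n", hi, lo) // prints "0x0 0x1000000000000000"
}
```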
I think AddQQ and AddQW have uint128 inputs, not uint64. Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better. |
If we have 64x64 -> 128, we might probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures this would be trivial. |
Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:
compiles to
|
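For illustration, the widening pattern being discussed looks roughly like the following in plain Go. This is a sketch with a hypothetical function name, not the snippet from the comment above; the claim that 386 lowers it to a single widening multiply comes from the thread and is not verified here.

```go
package main

import "fmt"

// mul32 returns the full 64-bit product of two 32-bit values.
// Both operands are widened first, so the multiply cannot overflow.
func mul32(x, y uint32) uint64 {
	return uint64(x) * uint64(y)
}

func main() {
	// (2^32 - 1)^2 = 2^64 - 2^33 + 1
	fmt.Println(mul32(0xffffffff, 0xffffffff) == 0xfffffffe00000001) // prints "true"
}
```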
32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void. |
You are right. 32x32->64 can already be handled. |
previous conversation about uint128: #9455 |
I like an idea about exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about right package. IIRC there was Gophercon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others.
|
I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this:
Yes, AddQQ takes two uint128 i.e two pairs of uint64 and outputs the result to uint128(hi, lo) and AddQW takes a uint128 and uint64 and outputs the result to uint128(hi, lo). |
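The AddQQ/AddQW semantics described above can be sketched on top of bits.Add64, the add-with-carry function that math/bits ended up providing. The names addQQ and addQW and their exact signatures are assumptions for illustration; the thread does not fix them.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addQQ adds two 128-bit values given as (hi, lo) pairs.
// It returns the 128-bit sum and the carry out of the high word.
// Hypothetical helper; the signature is an assumption.
func addQQ(xhi, xlo, yhi, ylo uint64) (hi, lo, carry uint64) {
	var c uint64
	lo, c = bits.Add64(xlo, ylo, 0)     // low words, no carry in
	hi, carry = bits.Add64(xhi, yhi, c) // high words plus carry from low
	return hi, lo, carry
}

// addQW adds a single 64-bit word to a 128-bit value.
func addQW(xhi, xlo, y uint64) (hi, lo, carry uint64) {
	return addQQ(xhi, xlo, 0, y)
}

func main() {
	hi, lo, carry := addQQ(0, ^uint64(0), 0, 1)
	fmt.Println(hi, lo, carry) // prints "1 0 0"
}
```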
I am strongly in favor of doing this in some form. If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here. For crypto, my dream API would give me direct mappings to the basic |
Change https://golang.org/cl/106376 mentions this issue: |
Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies. |
I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything. I really do not want to see math/bits turn into a junk drawer. The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation. |
I don't see a need for both |
Porting musl's fma implementation to Go required Mul64 as well. Here is a possible full-width multiply implementation.
|
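As a stand-in for what a portable full-width multiply can look like, here is a sketch that splits each operand into 32-bit halves, in the style of "Hacker's Delight". It is not the implementation from the comment above; the name mulFull is hypothetical, and main cross-checks it against bits.Mul64 from the standard library.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulFull returns the 128-bit product of x and y as (hi, lo),
// using only 64-bit arithmetic on 32-bit halves.
func mulFull(x, y uint64) (hi, lo uint64) {
	const mask32 = 1<<32 - 1
	x0, x1 := x&mask32, x>>32
	y0, y1 := y&mask32, y>>32

	w0 := x0 * y0
	t := x1*y0 + w0>>32 // cannot overflow: at most (2^32-1)^2 + 2^32 - 1
	w1 := t & mask32
	w2 := t >> 32
	w1 += x0 * y1

	hi = x1*y1 + w2 + w1>>32
	lo = x * y // low 64 bits wrap naturally
	return hi, lo
}

func main() {
	ok := true
	for _, v := range [][2]uint64{
		{0, 0}, {1, ^uint64(0)}, {^uint64(0), ^uint64(0)},
		{0xdeadbeefcafebabe, 0x123456789abcdef1},
	} {
		hi, lo := mulFull(v[0], v[1])
		wantHi, wantLo := bits.Mul64(v[0], v[1])
		ok = ok && hi == wantHi && lo == wantLo
	}
	fmt.Println(ok) // prints "true"
}
```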
I have a prototype of the extended precision operations over at github.com/smasher164/extprec. All the implementations are adapted from "Hacker's Delight", and the relevant references are provided. However, it still needs real test cases in order to hammer out any possible bugs. I've only tested it with random hand-picked values. Edit: Completed unit tests! |
Change https://golang.org/cl/123157 mentions this issue: |
The above CL ports the go versions of these functions from math/big since those are previously tested. Additional SSA rules will still need to be implemented as this just handles the simple Mul and Div cases for amd64 and arm64 that are used in math/big. |
Really tiny nitpick. Wouldn't it make more sense to have |
Change https://golang.org/cl/129415 mentions this issue: |
Port math/big pure go versions of add-with-carry, subtract-with-borrow, full-width multiply, and full-width divide.

Updates golang#24813

Change-Id: Ifae5d2f6ee4237137c9dcba931f69c91b80a4b1c
Add SSA rules to intrinsify Mul/Mul64 (AMD64 and ARM64) and Div/Div64 (AMD64). SSA rules for other functions and architectures are left as a future optimization. Benchmark results on AMD64/ARM64 before and after SSA implementation are below.

amd64
name      old time/op  new time/op  delta
Add-4     1.82ns ± 2%  1.89ns ± 6%  ~        (p=0.167 n=5+5)
Add32-4   1.78ns ± 7%  1.80ns ± 9%  ~        (p=0.690 n=5+5)
Add64-4   1.82ns ± 3%  1.93ns ±23%  ~        (p=0.810 n=5+5)
Sub-4     1.85ns ± 4%  1.91ns ± 6%  ~        (p=0.246 n=5+5)
Sub32-4   1.82ns ± 4%  1.87ns ± 8%  ~        (p=0.730 n=5+5)
Sub64-4   1.91ns ± 7%  1.85ns ± 4%  ~        (p=0.341 n=5+5)
Mul-4     11.7ns ± 5%  1.8ns ± 0%   -84.72%  (p=0.000 n=5+4)
Mul32-4   1.60ns ± 2%  1.61ns ± 6%  ~        (p=0.651 n=5+5)
Mul64-4   7.13ns ± 9%  2.08ns ±11%  -70.88%  (p=0.008 n=5+5)
Div-4     59.5ns ± 4%  49.2ns ± 5%  -17.43%  (p=0.008 n=5+5)
Div32-4   18.3ns ± 3%  18.2ns ± 1%  ~        (p=0.333 n=5+5)
Div64-4   58.6ns ±13%  48.5ns ± 5%  -17.24%  (p=0.008 n=5+5)

arm64
name      old time/op  new time/op  delta
Add-96    5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Add32-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Add64-96  5.52ns ± 0%  5.51ns ± 0%  ~        (p=0.444 n=5+5)
Sub-96    5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Sub32-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Sub64-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Mul-96    34.6ns ± 0%  5.0ns ± 0%   -85.52%  (p=0.008 n=5+5)
Mul32-96  4.51ns ± 0%  4.51ns ± 0%  ~        (all equal)
Mul64-96  21.1ns ± 0%  5.0ns ± 0%   -76.26%  (p=0.008 n=5+5)
Div-96    64.7ns ± 0%  64.7ns ± 0%  ~        (all equal)
Div32-96  17.0ns ± 0%  17.0ns ± 0%  ~        (all equal)
Div64-96  53.1ns ± 0%  53.1ns ± 0%  ~        (all equal)

Updates golang#24813

Change-Id: I9bda6d2102f65cae3d436a2087b47ed8bafeb068
I was curious about this, so I cherry-picked the relevant CLs on top of current Go master and have some preliminary results from my Ed25519 field arithmetic. Rough benchmarks on a Kaby Lake laptop (i7-7560U @ 2.40GHz):
If anyone else is interested in playing with this, it's this code and this Go. |
Port math/big pure go versions of add-with-carry, subtract-with-borrow, full-width multiply, and full-width divide.

Updates #24813

Change-Id: Ifae5d2f6ee4237137c9dcba931f69c91b80a4b1c
Reviewed-on: https://go-review.googlesource.com/123157
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
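A short usage sketch of the four 64-bit operations as they appear in math/bits (Add64, Sub64, Mul64, Div64). The helper mulAddDiv is hypothetical: it computes x*y + r as a 128-bit value and divides it by y again, which should give back x and r whenever r < y.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulAddDiv computes (x*y + r) / y and (x*y + r) % y using 128-bit
// intermediates. It assumes r < y; otherwise the quotient may not fit
// in 64 bits and bits.Div64 panics.
func mulAddDiv(x, y, r uint64) (quo, rem uint64) {
	hi, lo := bits.Mul64(x, y)        // full 128-bit product
	lo, carry := bits.Add64(lo, r, 0) // add r to the low word
	hi, _ = bits.Add64(hi, 0, carry)  // propagate the carry
	return bits.Div64(hi, lo, y)      // requires y > hi
}

func main() {
	x, y, r := uint64(0xdeadbeefcafebabe), uint64(0x123456789abcdef1), uint64(42)
	quo, rem := mulAddDiv(x, y, r)
	fmt.Println(quo == x && rem == r) // prints "true"

	// Sub64 mirrors Add64: 0 - 1 wraps around and reports a borrow.
	diff, borrow := bits.Sub64(0, 1, 0)
	fmt.Println(diff == ^uint64(0), borrow) // prints "true 1"
}
```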
@gtank your

	// mul64 adds the 128-bit product x*y to the accumulator (accHi, accLo).
	func mul64(accLo, accHi, x, y uint64) (ol, oh uint64) {
		oh, ol = bits.Mul64(x, y) // Mul64 returns (hi, lo)
		var carry uint64
		ol, carry = bits.Add64(ol, accLo, 0)
		oh, _ = bits.Add64(oh, accHi, carry) // carry out of the high word is dropped
		return
	}

at some point that should get turned into ADD+ADC, which would probably get you another boost compared to assembly. |
Add SSA rules to intrinsify Mul/Mul64 (AMD64 and ARM64). SSA rules for other functions and architectures are left as a future optimization. Benchmark results on AMD64/ARM64 before and after SSA implementation are below.

amd64
name      old time/op  new time/op  delta
Add-4     1.78ns ± 0%  1.85ns ±12%  ~        (p=0.397 n=4+5)
Add32-4   1.71ns ± 1%  1.70ns ± 0%  ~        (p=0.683 n=5+5)
Add64-4   1.80ns ± 2%  1.77ns ± 0%  -1.22%   (p=0.048 n=5+5)
Sub-4     1.78ns ± 0%  1.78ns ± 0%  ~        (all equal)
Sub32-4   1.78ns ± 1%  1.78ns ± 0%  ~        (p=1.000 n=5+5)
Sub64-4   1.78ns ± 1%  1.78ns ± 0%  ~        (p=0.968 n=5+4)
Mul-4     11.5ns ± 1%  1.8ns ± 2%   -84.39%  (p=0.008 n=5+5)
Mul32-4   1.39ns ± 0%  1.38ns ± 3%  ~        (p=0.175 n=5+5)
Mul64-4   6.85ns ± 1%  1.78ns ± 1%  -73.97%  (p=0.008 n=5+5)
Div-4     57.1ns ± 1%  56.7ns ± 0%  ~        (p=0.087 n=5+5)
Div32-4   18.0ns ± 0%  18.0ns ± 0%  ~        (all equal)
Div64-4   56.4ns ±10%  53.6ns ± 1%  ~        (p=0.071 n=5+5)

arm64
name      old time/op  new time/op  delta
Add-96    5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Add32-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Add64-96  5.52ns ± 0%  5.51ns ± 0%  ~        (p=0.444 n=5+5)
Sub-96    5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Sub32-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Sub64-96  5.51ns ± 0%  5.51ns ± 0%  ~        (all equal)
Mul-96    34.6ns ± 0%  5.0ns ± 0%   -85.52%  (p=0.008 n=5+5)
Mul32-96  4.51ns ± 0%  4.51ns ± 0%  ~        (all equal)
Mul64-96  21.1ns ± 0%  5.0ns ± 0%   -76.26%  (p=0.008 n=5+5)
Div-96    64.7ns ± 0%  64.7ns ± 0%  ~        (all equal)
Div32-96  17.0ns ± 0%  17.0ns ± 0%  ~        (all equal)
Div64-96  53.1ns ± 0%  53.1ns ± 0%  ~        (all equal)

Updates #24813

Change-Id: I9bda6d2102f65cae3d436a2087b47ed8bafeb068
Reviewed-on: https://go-review.googlesource.com/129415
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
Change https://golang.org/cl/138917 mentions this issue: |
Add SSA rules to intrinsify Mul/Mul64 on ppc64x.

benchmark          old ns/op  new ns/op  delta
BenchmarkMul-40    8.80       0.93       -89.43%
BenchmarkMul32-40  1.39       1.39       +0.00%
BenchmarkMul64-40  5.39       0.93       -82.75%

Updates #24813

Change-Id: I6e95bfbe976a2278bd17799df184a7fbc0e57829
Reviewed-on: https://go-review.googlesource.com/138917
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
Should this issue be closed now that CL 123157 has been merged in? |
Maybe? @randall77, are there plans to intrinsify more of this? |
Only Mul has been intrinsified so far. |
Okay, closing. |
This method was using a handwritten long multiplication of uint64s. Since implementation of golang#24813 we can remove it and replace it by Mul64 from math/bits. This brings a small speedup for 64-bit platforms. Benchmarks on Haswell Celeron 2955U.

benchmark                               old ns/op  new ns/op  delta
BenchmarkAppendFloat/Decimal-2          127        127        +0.00%
BenchmarkAppendFloat/Float-2            340        317        -6.76%
BenchmarkAppendFloat/Exp-2              258        233        -9.69%
BenchmarkAppendFloat/NegExp-2           256        231        -9.77%
BenchmarkAppendFloat/Big-2              402        375        -6.72%
BenchmarkAppendFloat/BinaryExp-2        113        114        +0.88%
BenchmarkAppendFloat/32Integer-2        125        125        +0.00%
BenchmarkAppendFloat/32ExactFraction-2  274        249        -9.12%
BenchmarkAppendFloat/32Point-2          339        317        -6.49%
BenchmarkAppendFloat/32Exp-2            255        229        -10.20%
BenchmarkAppendFloat/32NegExp-2         254        229        -9.84%
BenchmarkAppendFloat/64Fixed1-2         165        154        -6.67%
BenchmarkAppendFloat/64Fixed2-2         184        176        -4.35%
BenchmarkAppendFloat/64Fixed3-2         168        158        -5.95%
BenchmarkAppendFloat/64Fixed4-2         187        177        -5.35%
BenchmarkAppendFloat/Slowpath64-2       84977      84883      -0.11%
Change https://golang.org/cl/157717 mentions this issue: |
This method was using a handwritten long multiplication of uint64s. Since implementation of #24813 we can remove it and replace it by Mul64 from math/bits. This brings a small speedup for 64-bit platforms. Benchmarks on Haswell Celeron 2955U.

benchmark                               old ns/op  new ns/op  delta
BenchmarkAppendFloat/Decimal-2          127        127        +0.00%
BenchmarkAppendFloat/Float-2            340        317        -6.76%
BenchmarkAppendFloat/Exp-2              258        233        -9.69%
BenchmarkAppendFloat/NegExp-2           256        231        -9.77%
BenchmarkAppendFloat/Big-2              402        375        -6.72%
BenchmarkAppendFloat/BinaryExp-2        113        114        +0.88%
BenchmarkAppendFloat/32Integer-2        125        125        +0.00%
BenchmarkAppendFloat/32ExactFraction-2  274        249        -9.12%
BenchmarkAppendFloat/32Point-2          339        317        -6.49%
BenchmarkAppendFloat/32Exp-2            255        229        -10.20%
BenchmarkAppendFloat/32NegExp-2         254        229        -9.84%
BenchmarkAppendFloat/64Fixed1-2         165        154        -6.67%
BenchmarkAppendFloat/64Fixed2-2         184        176        -4.35%
BenchmarkAppendFloat/64Fixed3-2         168        158        -5.95%
BenchmarkAppendFloat/64Fixed4-2         187        177        -5.35%
BenchmarkAppendFloat/Slowpath64-2       84977      84883      -0.11%

Change-Id: If05784e856289b3b7bf136567882e7ee10234756
Reviewed-on: https://go-review.googlesource.com/c/go/+/157717
Run-TryBot: Brad Fitzpatrick <bradfitz@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Robert Griesemer <gri@golang.org>
This is a proposal to speed up crypto/poly1305 using [u]int128 and multiword arithmetic. Arm64 has instructions that multiply two 64-bit registers and write the 128-bit result to two 64-bit registers. It also has instructions for multiword addition. I have implemented some of these operations and intrinsified them in https://go-review.googlesource.com/c/go/+/106376, which improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). I have added these intrinsics for arm64 in the poly1305 package, but they might benefit other platforms as well. I am seeking advice on the design of this implementation. Is the poly1305 package the right place to have these intrinsics, or should they go in math/big or math/bits? This might also be a use case for #9455