New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

math/bits: add extended precision Add, Sub, Mul, Div #24813

Closed
bmakam-qdt opened this Issue Apr 11, 2018 · 73 comments

Comments

Projects
None yet
@bmakam-qdt
Contributor

bmakam-qdt commented Apr 11, 2018

This is a proposal to speed up crypto/poly1305 using [u]int128 and multiword arithmetic. Arm64 has instructions that let you multiply two 64-bit registers to two 64-bit registers. It also has instructions for multiword addition. I have implemented some of these intrinsics and intrinsified them in https://go-review.googlesource.com/c/go/+/106376 which improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). I have added these intrinsics for arm64 in poly1305 package but they might benefit other platforms as well. I am seeking advice on the design of this implementation. Is poly1305 package the right place to have these intrinsics or should they go in math/big or math/bit? This might also be a use case for #9455

@gopherbot gopherbot added this to the Proposal milestone Apr 11, 2018

@gopherbot gopherbot added the Proposal label Apr 11, 2018

@FiloSottile FiloSottile changed the title from proposal: improve crypto/poly1305 using multiword arithmetic to proposal: intrinsic multiword arithmetic Apr 11, 2018

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Apr 11, 2018

Member

amd64 also has these instructions, and access to them is one of the major blockers to writing performant crypto in Go instead of assembly.

I think the best place for them would be math/bits. Can you give us a preview of the exported APIs?

/cc @vkrasnov @gtank (they probably have opinions)

Member

FiloSottile commented Apr 11, 2018

amd64 also has these instructions, and access to them is one of the major blockers to writing performant crypto in Go instead of assembly.

I think the best place for them would be math/bits. Can you give us a preview of the exported APIs?

/cc @vkrasnov @gtank (they probably have opinions)

@mvdan

This comment has been minimized.

Show comment
Hide comment
@mvdan

mvdan Apr 11, 2018

Member

/cc @mundaym @TocarIP too, perhaps

Member

mvdan commented Apr 11, 2018

/cc @mundaym @TocarIP too, perhaps

@bmakam-qdt

This comment has been minimized.

Show comment
Hide comment
@bmakam-qdt

bmakam-qdt Apr 11, 2018

Contributor

MulWW - this is same as math/big.mulWW
AddQQ - adds two uint128. Each uint128 is represented as a pair of uint64 (hi, lo) returns uint128(hi,lo)
AddQW - adds a uint128 and uint64 returns uint128(hi, lo)
ShrQ44 - shiftright uint128 with constant 44 returns uint64
ShrQ42 - shiftright uint128 with constant 42 returns uint64
The arm64 assembly for these APIs are in https://go-review.googlesource.com/c/go/+/106376/3/src/vendor/golang_org/x/crypto/poly1305/asm_arm64.s
ShrQ44 and ShrQ42 can be combined into one API I think.

Contributor

bmakam-qdt commented Apr 11, 2018

MulWW - this is same as math/big.mulWW
AddQQ - adds two uint128. Each uint128 is represented as a pair of uint64 (hi, lo) returns uint128(hi,lo)
AddQW - adds a uint128 and uint64 returns uint128(hi, lo)
ShrQ44 - shiftright uint128 with constant 44 returns uint64
ShrQ42 - shiftright uint128 with constant 42 returns uint64
The arm64 assembly for these APIs are in https://go-review.googlesource.com/c/go/+/106376/3/src/vendor/golang_org/x/crypto/poly1305/asm_arm64.s
ShrQ44 and ShrQ42 can be combined into one API I think.

@vkrasnov

This comment has been minimized.

Show comment
Hide comment
@vkrasnov

vkrasnov Apr 11, 2018

Contributor

This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic?

Contributor

vkrasnov commented Apr 11, 2018

This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic?

@bmakam-qdt

This comment has been minimized.

Show comment
Hide comment
@bmakam-qdt

bmakam-qdt Apr 11, 2018

Contributor

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

Contributor

bmakam-qdt commented Apr 11, 2018

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Apr 11, 2018

Member

I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64.

Member

FiloSottile commented Apr 11, 2018

I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64.

@vkrasnov

This comment has been minimized.

Show comment
Hide comment
@vkrasnov

vkrasnov Apr 11, 2018

Contributor

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

Intel should have SHRD.

I don't like the idea of polluting math/big

Maybe create something new? like math/uint128

Contributor

vkrasnov commented Apr 11, 2018

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

Intel should have SHRD.

I don't like the idea of polluting math/big

Maybe create something new? like math/uint128

@mvdan

This comment has been minimized.

Show comment
Hide comment
@mvdan

mvdan Apr 11, 2018

Member

Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise?

Member

mvdan commented Apr 11, 2018

Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise?

@bmakam-qdt

This comment has been minimized.

Show comment
Hide comment
@bmakam-qdt

bmakam-qdt Apr 11, 2018

Contributor

are you suggesting a builtin [u]int128 type like how GCC does: https://godbolt.org/g/o6ka5e ?

Contributor

bmakam-qdt commented Apr 11, 2018

are you suggesting a builtin [u]int128 type like how GCC does: https://godbolt.org/g/o6ka5e ?

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Apr 11, 2018

Member

@mvdan Not really. The high bits of uint64 * uint64 are simply discarded. We would need uint128 in the language. (#9455)

Member

FiloSottile commented Apr 11, 2018

@mvdan Not really. The high bits of uint64 * uint64 are simply discarded. We would need uint128 in the language. (#9455)

@vkrasnov

This comment has been minimized.

Show comment
Hide comment
@vkrasnov

vkrasnov Apr 11, 2018

Contributor

That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these.

Contributor

vkrasnov commented Apr 11, 2018

That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these.

@josharian

This comment has been minimized.

Show comment
Hide comment
@josharian

josharian Apr 11, 2018

Contributor

If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.)

Contributor

josharian commented Apr 11, 2018

If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.)

@randall77

This comment has been minimized.

Show comment
Hide comment
@randall77

randall77 Apr 11, 2018

Contributor

math/bits seems like the right place.
I think you only need the following building blocks:

func Mul(x, y uint64) (uint64, uint64) // 64x64 -> 128 multiply (low, then high)
func Add(x, y uint64) (uint64, uint64) // 64x64 -> 65 add (low, then high)

(and maybe signed versions?)
Everything else can be written in Go.

Even Add is not really necessary, but might make it easier for the compiler to detect the carry. (In Go, the carry is low_result < low_x || low_result < low_y or something.)

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

Contributor

randall77 commented Apr 11, 2018

math/bits seems like the right place.
I think you only need the following building blocks:

func Mul(x, y uint64) (uint64, uint64) // 64x64 -> 128 multiply (low, then high)
func Add(x, y uint64) (uint64, uint64) // 64x64 -> 65 add (low, then high)

(and maybe signed versions?)
Everything else can be written in Go.

Even Add is not really necessary, but might make it easier for the compiler to detect the carry. (In Go, the carry is low_result < low_x || low_result < low_y or something.)

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Apr 11, 2018

Member

I think AddQQ and AddQW have uint128 inputs, not uint64.

Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better.

Member

FiloSottile commented Apr 11, 2018

I think AddQQ and AddQW have uint128 inputs, not uint64.

Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better.

@cherrymui

This comment has been minimized.

Show comment
Hide comment
@cherrymui

cherrymui Apr 12, 2018

Contributor

If we have 64x64 -> 128, we might probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures this would be trivial.

Contributor

cherrymui commented Apr 12, 2018

If we have 64x64 -> 128, we might probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures this would be trivial.

@randall77

This comment has been minimized.

Show comment
Hide comment
@randall77

randall77 Apr 12, 2018

Contributor

Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:

func f(x, y uint32) (uint32, uint32) {
	z := uint64(x) * uint64(y)
	return uint32(z), uint32(z >> 32)
}

compiles to

	0x0012 00018 (/home/khr/go/tmp1.go:3)	MOVL	"".y+8(SP), AX
	0x0016 00022 (/home/khr/go/tmp1.go:3)	MOVL	"".x+4(SP), CX
	0x001a 00026 (/home/khr/go/tmp1.go:4)	MULL	CX
	0x001c 00028 (/home/khr/go/tmp1.go:4)	MOVL	AX, "".~r2+12(SP)
	0x0020 00032 (/home/khr/go/tmp1.go:4)	MOVL	DX, "".~r3+16(SP)
Contributor

randall77 commented Apr 12, 2018

Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:

func f(x, y uint32) (uint32, uint32) {
	z := uint64(x) * uint64(y)
	return uint32(z), uint32(z >> 32)
}

compiles to

	0x0012 00018 (/home/khr/go/tmp1.go:3)	MOVL	"".y+8(SP), AX
	0x0016 00022 (/home/khr/go/tmp1.go:3)	MOVL	"".x+4(SP), CX
	0x001a 00026 (/home/khr/go/tmp1.go:4)	MULL	CX
	0x001c 00028 (/home/khr/go/tmp1.go:4)	MOVL	AX, "".~r2+12(SP)
	0x0020 00032 (/home/khr/go/tmp1.go:4)	MOVL	DX, "".~r3+16(SP)
@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Apr 12, 2018

Member

32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void.

Member

FiloSottile commented Apr 12, 2018

32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void.

@cherrymui

This comment has been minimized.

Show comment
Hide comment
@cherrymui

cherrymui Apr 12, 2018

Contributor

You are right. 32x32->64 can already be handled.

Contributor

cherrymui commented Apr 12, 2018

You are right. 32x32->64 can already be handled.

@ericlagergren

This comment has been minimized.

Show comment
Hide comment
@ericlagergren

ericlagergren Apr 12, 2018

Contributor

previous conversation about uint128: #9455

Contributor

ericlagergren commented Apr 12, 2018

previous conversation about uint128: #9455

@TocarIP

This comment has been minimized.

Show comment
Hide comment
@TocarIP

TocarIP Apr 12, 2018

Contributor

I like an idea about exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about right package. IIRC there was Gophercon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others.
This should also help hash/fnv128, which currently does ugly stuff like this:

                 s1l := (s[1] & 0xffffffff) * prime128Lower
                 s1h := (s[1] >> 32) * prime128Lower
                 s0l := (s[0]&0xffffffff)*prime128Lower + (s[1]&0xffffffff)<<prime128Shift
                 s0h := (s[0]>>32)*prime128Lower + (s[1]>>32)<<prime128Shift
                 // Carries
                 s1h += s1l >> 32
                 s0l += s1h >> 32
                 s0h += s0l >> 32
                 // Update the values
                 s[1] = (s1l & 0xffffffff) + (s1h << 32)
                 s[0] = (s0l & 0xffffffff) + (s0h << 32)
Contributor

TocarIP commented Apr 12, 2018

I like an idea about exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about right package. IIRC there was Gophercon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others.
This should also help hash/fnv128, which currently does ugly stuff like this:

                 s1l := (s[1] & 0xffffffff) * prime128Lower
                 s1h := (s[1] >> 32) * prime128Lower
                 s0l := (s[0]&0xffffffff)*prime128Lower + (s[1]&0xffffffff)<<prime128Shift
                 s0h := (s[0]>>32)*prime128Lower + (s[1]>>32)<<prime128Shift
                 // Carries
                 s1h += s1l >> 32
                 s0l += s1h >> 32
                 s0h += s0l >> 32
                 // Update the values
                 s[1] = (s1l & 0xffffffff) + (s1h << 32)
                 s[0] = (s0l & 0xffffffff) + (s0h << 32)
@bmakam-qdt

This comment has been minimized.

Show comment
Hide comment
@bmakam-qdt

bmakam-qdt Apr 12, 2018

Contributor

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this:
ShrQconst(42, z[5], z[4]) -> (z5<<22 | z4>>42)

I think AddQQ and AddQW have uint128 inputs, not uint64.

Yes, AddQQ takes two uint128 i.e two pairs of uint64 and outputs the result to uint128(hi, lo) and AddQW takes a uint128 and uint64 and outputs the result to uint128(hi, lo).

Contributor

bmakam-qdt commented Apr 12, 2018

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this:
ShrQconst(42, z[5], z[4]) -> (z5<<22 | z4>>42)

I think AddQQ and AddQW have uint128 inputs, not uint64.

Yes, AddQQ takes two uint128 i.e two pairs of uint64 and outputs the result to uint128(hi, lo) and AddQW takes a uint128 and uint64 and outputs the result to uint128(hi, lo).

@gtank

This comment has been minimized.

Show comment
Hide comment
@gtank

gtank Apr 12, 2018

I am strongly in favor of doing this in some form.

If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here.

For crypto, my dream API would give me direct mappings to the basic func: uint64 x uint64 -> uint64 x uint64 functions (mulq, addq, etc for amd64), then have smart pattern matching do the carry/flag/pipeline-aware instruction optimizations when possible (think add/adc, or the newer mulx/adcx/adox). There's an example of this being a huge win in math/big already and Intel wrote a guide for how to apply those instructions to patterns that are common in crypto code.

gtank commented Apr 12, 2018

I am strongly in favor of doing this in some form.

If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here.

For crypto, my dream API would give me direct mappings to the basic func: uint64 x uint64 -> uint64 x uint64 functions (mulq, addq, etc for amd64), then have smart pattern matching do the carry/flag/pipeline-aware instruction optimizations when possible (think add/adc, or the newer mulx/adcx/adox). There's an example of this being a huge win in math/big already and Intel wrote a guide for how to apply those instructions to patterns that are common in crypto code.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Apr 13, 2018

Change https://golang.org/cl/106376 mentions this issue: cmd/compile: add intrinsics for multiword arithmetic

gopherbot commented Apr 13, 2018

Change https://golang.org/cl/106376 mentions this issue: cmd/compile: add intrinsics for multiword arithmetic

@bmakam-qdt

This comment has been minimized.

Show comment
Hide comment
@bmakam-qdt

bmakam-qdt Apr 13, 2018

Contributor

Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies.

Contributor

bmakam-qdt commented Apr 13, 2018

Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies.

@jimmyfrasche

This comment has been minimized.

Show comment
Hide comment
@jimmyfrasche

jimmyfrasche Apr 13, 2018

Member

I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything.

I really do not want to see math/bits turn into a junk drawer.

The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation.

Member

jimmyfrasche commented Apr 13, 2018

I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything.

I really do not want to see math/bits turn into a junk drawer.

The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation.

@randall77

This comment has been minimized.

Show comment
Hide comment
@randall77

randall77 Apr 13, 2018

Contributor

I don't see a need for both AddQW and AddQQ. AddQW(x1, x0, y) = AddQQ(x1, x0, 0, y). The compiler should have no problem constant folding the 0. Just have a single Add.
In any case, I don't like Q and W. What do they stand for? (quadword and word? That makes no sense here.)

Contributor

randall77 commented Apr 13, 2018

I don't see a need for both AddQW and AddQQ. AddQW(x1, x0, y) = AddQQ(x1, x0, 0, y). The compiler should have no problem constant folding the 0. Just have a single Add.
In any case, I don't like Q and W. What do they stand for? (quadword and word? That makes no sense here.)

@FiloSottile

This comment has been minimized.

Show comment
Hide comment
@FiloSottile

FiloSottile Jun 18, 2018

Member

@rasky Do you have a use case for the signed versions?

Member

FiloSottile commented Jun 18, 2018

@rasky Do you have a use case for the signed versions?

@rasky

This comment has been minimized.

Show comment
Hide comment
@rasky

rasky Jun 18, 2018

Member

Any case where signed multiplication is used in computer science could benefit from having a stable low/hi API. I sometimes work with geometry and signal processing, for instance, and on ARMs you don't always have a FPU so using fixed points is an option. With a 32-bit fixed point, you need a 32x32->64 multiplication.

Member

rasky commented Jun 18, 2018

Any case where signed multiplication is used in computer science could benefit from having a stable low/hi API. I sometimes work with geometry and signal processing, for instance, and on ARMs you don't always have a FPU so using fixed points is an option. With a 32-bit fixed point, you need a 32x32->64 multiplication.

@ianlancetaylor

This comment has been minimized.

Show comment
Hide comment
@ianlancetaylor

ianlancetaylor Jun 18, 2018

Contributor

Proposal accepted per @rsc's comment #24813 (comment) -- iant for @golang/proposal-review

Contributor

ianlancetaylor commented Jun 18, 2018

Proposal accepted per @rsc's comment #24813 (comment) -- iant for @golang/proposal-review

@smasher164

This comment has been minimized.

Show comment
Hide comment
@smasher164

smasher164 Jul 8, 2018

Contributor

Porting musl's fma implementation to Go required Mul64 as well. Here is a possible full-width multiply implementation.
Edit: Link to musl libc

// Mul returns the high and low bits from the product of x and y.
func Mul(x, y uint) (hi, lo uint) {
	if UintSize == 32 {
		hi32, lo32 := Mul32(uint32(x), uint32(y))
		return uint(hi32), uint(lo32)
	}
	hi64, lo64 := Mul64(uint64(x), uint64(y))
	return uint(hi64), uint(lo64)
}

// Mul32 returns the high and low 32 bits from the 64-bit product of x and y.
func Mul32(x, y uint32) (hi, lo uint32) {
	z := uint64(x) * uint64(y)
	return uint32(z >> 32), uint32(z)
}

// Mul64 returns the high and low 64 bits from the 128-bit product of x and y.
func Mul64(x, y uint64) (hi, lo uint64) {
	xlo := uint64(uint32(x))
	xhi := x >> 32
	ylo := uint64(uint32(y))
	yhi := y >> 32

	t1 := xlo * ylo
	t2 := xlo*yhi + xhi*ylo
	t3 := xhi * yhi
	lo = t1 + (t2 << 32)
	hi = t3 + (t2 >> 32)
	if t1 > lo {
		hi += 1
	}
	return hi, lo
}
Contributor

smasher164 commented Jul 8, 2018

Porting musl's fma implementation to Go required Mul64 as well. Here is a possible full-width multiply implementation.
Edit: Link to musl libc

// Mul returns the high and low bits from the product of x and y.
func Mul(x, y uint) (hi, lo uint) {
	if UintSize == 32 {
		hi32, lo32 := Mul32(uint32(x), uint32(y))
		return uint(hi32), uint(lo32)
	}
	hi64, lo64 := Mul64(uint64(x), uint64(y))
	return uint(hi64), uint(lo64)
}

// Mul32 returns the high and low 32 bits from the 64-bit product of x and y.
func Mul32(x, y uint32) (hi, lo uint32) {
	z := uint64(x) * uint64(y)
	return uint32(z >> 32), uint32(z)
}

// Mul64 returns the high and low 64 bits from the 128-bit product of x and y.
func Mul64(x, y uint64) (hi, lo uint64) {
	xlo := uint64(uint32(x))
	xhi := x >> 32
	ylo := uint64(uint32(y))
	yhi := y >> 32

	t1 := xlo * ylo
	t2 := xlo*yhi + xhi*ylo
	t3 := xhi * yhi
	lo = t1 + (t2 << 32)
	hi = t3 + (t2 >> 32)
	if t1 > lo {
		hi += 1
	}
	return hi, lo
}
@smasher164

This comment has been minimized.

Show comment
Hide comment
@smasher164

smasher164 Jul 9, 2018

Contributor

I have a prototype of the extended precision operations over at github.com/smasher164/extprec. All the implementations are adapted from "Hacker's Delight", and the relevant references are provided.

However, it still needs real test cases in order to hammer out any possible bugs. I've only tested it with random hand-picked values.

Edit: Completed unit tests!

Contributor

smasher164 commented Jul 9, 2018

I have a prototype of the extended precision operations over at github.com/smasher164/extprec. All the implementations are adapted from "Hacker's Delight", and the relevant references are provided.

However, it still needs real test cases in order to hammer out any possible bugs. I've only tested it with random hand-picked values.

Edit: Completed unit tests!

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Jul 10, 2018

Change https://golang.org/cl/123157 mentions this issue: math/bits: add extended precision Add, Sub, Mul, Div

gopherbot commented Jul 10, 2018

Change https://golang.org/cl/123157 mentions this issue: math/bits: add extended precision Add, Sub, Mul, Div

@bmkessler

This comment has been minimized.

Show comment
Hide comment
@bmkessler

bmkessler Jul 10, 2018

Contributor

The above CL ports the go versions of these functions from math/big since those are previously tested. Additional SSA rules will still need to be implemented as this just handles the simple Mul and Div cases for amd64 and arm64 that are used in math/big.

Contributor

bmkessler commented Jul 10, 2018

The above CL ports the go versions of these functions from math/big since those are previously tested. Additional SSA rules will still need to be implemented as this just handles the simple Mul and Div cases for amd64 and arm64 that are used in math/big.

@TuomLarsen

This comment has been minimized.

Show comment
Hide comment
@TuomLarsen

TuomLarsen Aug 3, 2018

Really tiny nitpick. Wouldn't it make more sense to have func Div(hi, lo, y uint), instead of func Div(hi, lo, x uint)? Because in all other functions, x is the first operand and y the second one. hi and lo are 128-bits of the first operand so maybe an y would be more descriptive.

TuomLarsen commented Aug 3, 2018

Really tiny nitpick. Wouldn't it make more sense to have func Div(hi, lo, y uint), instead of func Div(hi, lo, x uint)? Because in all other functions, x is the first operand and y the second one. hi and lo are 128-bits of the first operand so maybe an y would be more descriptive.

@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Aug 15, 2018

Change https://golang.org/cl/129415 mentions this issue: cmd/compile: intrinsify math/bits.Mul and Div

gopherbot commented Aug 15, 2018

Change https://golang.org/cl/129415 mentions this issue: cmd/compile: intrinsify math/bits.Mul and Div

gtank added a commit to gtank/go that referenced this issue Sep 3, 2018

math/bits: add extended precision Add, Sub, Mul, Div
Port math/big pure go versions of add-with-carry, subtract-with-borrow,
full-width multiply, and full-width divide.

Updates golang#24813

Change-Id: Ifae5d2f6ee4237137c9dcba931f69c91b80a4b1c

gtank added a commit to gtank/go that referenced this issue Sep 3, 2018

cmd/compile: intrinsify math/bits.Mul and Div
Add SSA rules to intrinsify Mul/Mul64 (AMD64 and ARM64) and Div/Div64 (AMD64).
SSA rules for other functions and architectures are left as a future
optimization.  Benchmark results on AMD64/ARM64 before and after SSA
implementation are below.

amd64
name     old time/op  new time/op  delta
Add-4    1.82ns ± 2%  1.89ns ± 6%     ~     (p=0.167 n=5+5)
Add32-4  1.78ns ± 7%  1.80ns ± 9%     ~     (p=0.690 n=5+5)
Add64-4  1.82ns ± 3%  1.93ns ±23%     ~     (p=0.810 n=5+5)
Sub-4    1.85ns ± 4%  1.91ns ± 6%     ~     (p=0.246 n=5+5)
Sub32-4  1.82ns ± 4%  1.87ns ± 8%     ~     (p=0.730 n=5+5)
Sub64-4  1.91ns ± 7%  1.85ns ± 4%     ~     (p=0.341 n=5+5)
Mul-4    11.7ns ± 5%   1.8ns ± 0%  -84.72%  (p=0.000 n=5+4)
Mul32-4  1.60ns ± 2%  1.61ns ± 6%     ~     (p=0.651 n=5+5)
Mul64-4  7.13ns ± 9%  2.08ns ±11%  -70.88%  (p=0.008 n=5+5)
Div-4    59.5ns ± 4%  49.2ns ± 5%  -17.43%  (p=0.008 n=5+5)
Div32-4  18.3ns ± 3%  18.2ns ± 1%     ~     (p=0.333 n=5+5)
Div64-4  58.6ns ±13%  48.5ns ± 5%  -17.24%  (p=0.008 n=5+5)

arm64
name      old time/op  new time/op  delta
Add-96    5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Add32-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Add64-96  5.52ns ± 0%  5.51ns ± 0%     ~     (p=0.444 n=5+5)
Sub-96    5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Sub32-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Sub64-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Mul-96    34.6ns ± 0%   5.0ns ± 0%  -85.52%  (p=0.008 n=5+5)
Mul32-96  4.51ns ± 0%  4.51ns ± 0%     ~     (all equal)
Mul64-96  21.1ns ± 0%   5.0ns ± 0%  -76.26%  (p=0.008 n=5+5)
Div-96    64.7ns ± 0%  64.7ns ± 0%     ~     (all equal)
Div32-96  17.0ns ± 0%  17.0ns ± 0%     ~     (all equal)
Div64-96  53.1ns ± 0%  53.1ns ± 0%     ~     (all equal)

Updates golang#24813

Change-Id: I9bda6d2102f65cae3d436a2087b47ed8bafeb068
@gtank

This comment has been minimized.

Show comment
Hide comment
@gtank

gtank Sep 3, 2018

I was curious about this, so I cherry-picked the relevant CLs on top of current Go master and have some preliminary results from my Ed25519 field arithmetic. bits.Mul64 got me about halfway to the speed of an assembly implementation without much effort. I suspect I could make the pure Go a bit faster as well, but haven't tried yet.

Rough benchmarks on a Kaby Lake laptop (i7-7560U @ 2.40GHz):

$ benchstat old32 bits64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  33.9µs ± 2%  -30.22%  (p=0.004 n=5+6)
Signing-4        49.6µs ± 1%  34.8µs ± 2%  -29.84%  (p=0.002 n=6+6)
Verification-4    133µs ± 1%    93µs ± 3%  -29.81%  (p=0.002 n=6+6)
$ benchstat bits64 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  33.9µs ± 2%  24.9µs ± 0%  -26.36%  (p=0.001 n=6+7)
Signing-4        34.8µs ± 2%  25.9µs ± 0%  -25.70%  (p=0.001 n=6+7)
Verification-4   93.2µs ± 3%  60.5µs ± 0%  -35.06%  (p=0.002 n=6+6)
$ benchstat old32 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  24.9µs ± 0%  -48.61%  (p=0.003 n=5+7)
Signing-4        49.6µs ± 1%  25.9µs ± 0%  -47.87%  (p=0.001 n=6+7)
Verification-4    133µs ± 1%    61µs ± 0%  -54.42%  (p=0.002 n=6+6)

If anyone else is interested in playing with this, it's this code and this Go.

gtank commented Sep 3, 2018

I was curious about this, so I cherry-picked the relevant CLs on top of current Go master and have some preliminary results from my Ed25519 field arithmetic. bits.Mul64 got me about halfway to the speed of an assembly implementation without much effort. I suspect I could make the pure Go a bit faster as well, but haven't tried yet.

Rough benchmarks on a Kaby Lake laptop (i7-7560U @ 2.40GHz):

$ benchstat old32 bits64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  33.9µs ± 2%  -30.22%  (p=0.004 n=5+6)
Signing-4        49.6µs ± 1%  34.8µs ± 2%  -29.84%  (p=0.002 n=6+6)
Verification-4    133µs ± 1%    93µs ± 3%  -29.81%  (p=0.002 n=6+6)
$ benchstat bits64 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  33.9µs ± 2%  24.9µs ± 0%  -26.36%  (p=0.001 n=6+7)
Signing-4        34.8µs ± 2%  25.9µs ± 0%  -25.70%  (p=0.001 n=6+7)
Verification-4   93.2µs ± 3%  60.5µs ± 0%  -35.06%  (p=0.002 n=6+6)
$ benchstat old32 amd64
name             old time/op  new time/op  delta
KeyGeneration-4  48.5µs ± 0%  24.9µs ± 0%  -48.61%  (p=0.003 n=5+7)
Signing-4        49.6µs ± 1%  25.9µs ± 0%  -47.87%  (p=0.001 n=6+7)
Verification-4    133µs ± 1%    61µs ± 0%  -54.42%  (p=0.002 n=6+6)

If anyone else is interested in playing with this, it's this code and this Go.

gopherbot pushed a commit that referenced this issue Sep 11, 2018

math/bits: add extended precision Add, Sub, Mul, Div
Port math/big pure go versions of add-with-carry, subtract-with-borrow,
full-width multiply, and full-width divide.

Updates #24813

Change-Id: Ifae5d2f6ee4237137c9dcba931f69c91b80a4b1c
Reviewed-on: https://go-review.googlesource.com/123157
Reviewed-by: Robert Griesemer <gri@golang.org>
Run-TryBot: Robert Griesemer <gri@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@rasky

This comment has been minimized.

Show comment
Hide comment
@rasky

rasky Sep 15, 2018

Member

@gtank your mul64 function (which is actually a mad64 AFAICT) should use bits.Add to implement double-word add:

func mul64(accLo, accHi, x, y uint64) (ol, oh uint64) {
 	oh, ol = bits.Mul64(x, y)
        var carry uint64
        ol, carry = bits.Add64(ol, accLo, 0)
        oh, _ = bits.Add64(oh, accHi, carry)
 	return
 }

at some point that should get turned into ADD+ADC, which would probably get you another boost compared to assembly.

Member

rasky commented Sep 15, 2018

@gtank your mul64 function (which is actually a mad64 AFAICT) should use bits.Add to implement double-word add:

func mul64(accLo, accHi, x, y uint64) (ol, oh uint64) {
 	oh, ol = bits.Mul64(x, y)
        var carry uint64
        ol, carry = bits.Add64(ol, accLo, 0)
        oh, _ = bits.Add64(oh, accHi, carry)
 	return
 }

at some point that should get turned into ADD+ADC, which would probably get you another boost compared to assembly.

gopherbot pushed a commit that referenced this issue Sep 26, 2018

cmd/compile: intrinsify math/bits.Mul
Add SSA rules to intrinsify Mul/Mul64 (AMD64 and ARM64).
SSA rules for other functions and architectures are left as a future
optimization.  Benchmark results on AMD64/ARM64 before and after SSA
implementation are below.

amd64
name     old time/op  new time/op  delta
Add-4    1.78ns ± 0%  1.85ns ±12%     ~     (p=0.397 n=4+5)
Add32-4  1.71ns ± 1%  1.70ns ± 0%     ~     (p=0.683 n=5+5)
Add64-4  1.80ns ± 2%  1.77ns ± 0%   -1.22%  (p=0.048 n=5+5)
Sub-4    1.78ns ± 0%  1.78ns ± 0%     ~     (all equal)
Sub32-4  1.78ns ± 1%  1.78ns ± 0%     ~     (p=1.000 n=5+5)
Sub64-4  1.78ns ± 1%  1.78ns ± 0%     ~     (p=0.968 n=5+4)
Mul-4    11.5ns ± 1%   1.8ns ± 2%  -84.39%  (p=0.008 n=5+5)
Mul32-4  1.39ns ± 0%  1.38ns ± 3%     ~     (p=0.175 n=5+5)
Mul64-4  6.85ns ± 1%  1.78ns ± 1%  -73.97%  (p=0.008 n=5+5)
Div-4    57.1ns ± 1%  56.7ns ± 0%     ~     (p=0.087 n=5+5)
Div32-4  18.0ns ± 0%  18.0ns ± 0%     ~     (all equal)
Div64-4  56.4ns ±10%  53.6ns ± 1%     ~     (p=0.071 n=5+5)

arm64
name      old time/op  new time/op  delta
Add-96    5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Add32-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Add64-96  5.52ns ± 0%  5.51ns ± 0%     ~     (p=0.444 n=5+5)
Sub-96    5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Sub32-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Sub64-96  5.51ns ± 0%  5.51ns ± 0%     ~     (all equal)
Mul-96    34.6ns ± 0%   5.0ns ± 0%  -85.52%  (p=0.008 n=5+5)
Mul32-96  4.51ns ± 0%  4.51ns ± 0%     ~     (all equal)
Mul64-96  21.1ns ± 0%   5.0ns ± 0%  -76.26%  (p=0.008 n=5+5)
Div-96    64.7ns ± 0%  64.7ns ± 0%     ~     (all equal)
Div32-96  17.0ns ± 0%  17.0ns ± 0%     ~     (all equal)
Div64-96  53.1ns ± 0%  53.1ns ± 0%     ~     (all equal)

Updates #24813

Change-Id: I9bda6d2102f65cae3d436a2087b47ed8bafeb068
Reviewed-on: https://go-review.googlesource.com/129415
Run-TryBot: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
@gopherbot

This comment has been minimized.

Show comment
Hide comment
@gopherbot

gopherbot Oct 2, 2018

Change https://golang.org/cl/138917 mentions this issue: cmd/compile: instrinsify math/bits.Mul on ppc64x

gopherbot commented Oct 2, 2018

Change https://golang.org/cl/138917 mentions this issue: cmd/compile: instrinsify math/bits.Mul on ppc64x

gopherbot pushed a commit that referenced this issue Oct 2, 2018

cmd/compile: instrinsify math/bits.Mul on ppc64x
Add SSA rules to intrinsify Mul/Mul64 on ppc64x.

benchmark             old ns/op     new ns/op     delta
BenchmarkMul-40       8.80          0.93          -89.43%
BenchmarkMul32-40     1.39          1.39          +0.00%
BenchmarkMul64-40     5.39          0.93          -82.75%

Updates #24813

Change-Id: I6e95bfbe976a2278bd17799df184a7fbc0e57829
Reviewed-on: https://go-review.googlesource.com/138917
Reviewed-by: Lynn Boger <laboger@linux.vnet.ibm.com>
@smasher164

This comment has been minimized.

Show comment
Hide comment
@smasher164

smasher164 Oct 17, 2018

Contributor

Should this issue be closed now that CL 123157 has been merged in?

Contributor

smasher164 commented Oct 17, 2018

Should this issue be closed now that CL 123157 has been merged in?

@bradfitz

This comment has been minimized.

Show comment
Hide comment
@bradfitz

bradfitz Oct 17, 2018

Member

Maybe? @randall77, are there plans to intrinsify more of this?

Member

bradfitz commented Oct 17, 2018

Maybe? @randall77, are there plans to intrinsify more of this?

@randall77

This comment has been minimized.

Show comment
Hide comment
@randall77

randall77 Oct 17, 2018

Contributor

Only Mul has been intrinsified so far.
I think we want Add/Sub intrinsified also (or better, recognize the Go code and generate the add-with-carry instruction).
That said, those tasks don't need to live in this issue, which is more about API. If concerned people can open issues for their architecture, we can close this issue.

Contributor

randall77 commented Oct 17, 2018

Only Mul has been intrinsified so far.
I think we want Add/Sub intrinsified also (or better, recognize the Go code and generate the add-with-carry instruction).
That said, those tasks don't need to live in this issue, which is more about API. If concerned people can open issues for their architecture, we can close this issue.

@bradfitz

This comment has been minimized.

Show comment
Hide comment
@bradfitz

bradfitz Oct 18, 2018

Member

Okay, closing.

Member

bradfitz commented Oct 18, 2018

Okay, closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment