Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

Closed
josharian opened this issue Jun 29, 2016 · 11 comments

Comments

@josharian
Copy link
Contributor

commented Jun 29, 2016

See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ for details.

Marking as Go2, since the compatibility guarantee prevents us from changing the output of Int31n given a seed.

We can still use this in the runtime, though; there are several places where we do fastrand1() % n. I will send a CL for this for Go 1.8 and have it update this issue.

cc @dgryski

@josharian josharian added the Go2 label Jun 29, 2016
@josharian josharian added this to the Unplanned milestone Jun 29, 2016
@josharian josharian self-assigned this Jun 29, 2016
@robpike

This comment has been minimized.

Copy link
Contributor

commented Jun 29, 2016

If we are going to change this package, it has more severe problems. https://go-review.googlesource.com/#/c/10161/ isn't ready but it is based on a much better generator, and we should use the Go 2 opportunity, should it arise, to replace the core algorithm.

@josharian

This comment has been minimized.

Copy link
Contributor Author

commented Jun 29, 2016

Here's the obvious fastrandn for the runtime package

func fastrandn(n uint32) uint32 {
    // See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/.
    // Note that we use >> 31 instead of >> 32 because fastrand only produces 31 bit numbers.
    return uint32((uint64(n) * uint64(fastrand())) >> 31)
}

Unfortunately, the additional function call (instead of an inline %) more than obviates the performance gains. This means implementing fastrandn in assembly and copy/pasting the core of fastrand, which is a pity.

Results on amd64, with an assembly implementation:

name old time/op new time/op delta
FastrandnNonConstant-8 5.90ns ±18% 4.62ns ±13% -21.63% (p=0.008 n=5+5)
FastrandnConstantNonPowerOfTwo-8 5.24ns ± 8% 5.14ns ±25% ~ (p=0.052 n=24+25)
FastrandnConstantPowerOfTwo-8 3.49ns ±11% 4.65ns ± 4% +33.47% (p=0.000 n=22+23)

Worth doing for the non-constant case. This might get delayed going in, though, since I lack hardware (ppc, mips, s390x) to test asm implementations on other platforms.

@josharian

This comment has been minimized.

Copy link
Contributor Author

commented Jun 29, 2016

@robpike yes. The number of "can't fix this because compatibility" issues for math/rand is sad. (And I still don't really understand why it was ok to fix #16124.) I wish there was a way to fix them all in the stdlib without waiting for Go2. Oh well.

@lemire

This comment has been minimized.

Copy link

commented Sep 9, 2016

TensorFlow adopted this technique for better hashing performance : tensorflow/tensorflow@a47a300

@robpike

This comment has been minimized.

Copy link
Contributor

commented Sep 10, 2016

@josharian Well, I could prioritize the replacement package that's already started and we can put it somewhere supported and encourage people to go there instead.

@josharian

This comment has been minimized.

Copy link
Contributor Author

commented Sep 10, 2016

@robpike I think getting the replacement package wrapped up would be a service to the community. As for where to put it, how about go4.org? cc @bradfitz

My CL to use this approach in the runtime is still in my queue.

@josharian

This comment has been minimized.

Copy link
Contributor Author

commented Jan 5, 2017

Using this technique in the runtime will be a lot nicer once CL 34782 lands in 1.9.

@gopherbot

This comment has been minimized.

Copy link

commented Feb 13, 2017

CL https://golang.org/cl/36932 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 14, 2017
This occurs a fair amount in the runtime for non-power-of-two n.
Use an alternative, faster formulation.

name           old time/op  new time/op  delta
Fastrandn/2-8  4.45ns ± 2%  2.09ns ± 3%  -53.12%  (p=0.000 n=14+14)
Fastrandn/3-8  4.78ns ±11%  2.06ns ± 2%  -56.94%  (p=0.000 n=15+15)
Fastrandn/4-8  4.76ns ± 9%  1.99ns ± 3%  -58.28%  (p=0.000 n=15+13)
Fastrandn/5-8  4.96ns ±13%  2.03ns ± 6%  -59.14%  (p=0.000 n=15+15)

name                    old time/op  new time/op  delta
SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

Updates #16213

Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
Reviewed-on: https://go-review.googlesource.com/36932
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
josharian added a commit to josharian/go that referenced this issue Feb 14, 2017
This occurs a fair amount in the runtime for non-power-of-two n.
Use an alternative, faster formulation.

name           old time/op  new time/op  delta
Fastrandn/2-8  4.45ns ± 2%  2.09ns ± 3%  -53.12%  (p=0.000 n=14+14)
Fastrandn/3-8  4.78ns ±11%  2.06ns ± 2%  -56.94%  (p=0.000 n=15+15)
Fastrandn/4-8  4.76ns ± 9%  1.99ns ± 3%  -58.28%  (p=0.000 n=15+13)
Fastrandn/5-8  4.96ns ±13%  2.03ns ± 6%  -59.14%  (p=0.000 n=15+15)

name                    old time/op  new time/op  delta
SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

Updates golang#16213

Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
@rsc rsc changed the title math/rand: speed up Int31n with multiply/shift instead of modulo proposal: math/rand: speed up Int31n with multiply/shift instead of modulo Jun 17, 2017
@gopherbot gopherbot added the Proposal label Jun 17, 2017
@gopherbot

This comment has been minimized.

Copy link

commented Jul 29, 2017

Change https://golang.org/cl/51891 mentions this issue: math/rand: add Shuffle

josharian added a commit to josharian/go that referenced this issue Aug 12, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes golang#20480
Updates golang#16213
Updates golang#21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
@gopherbot

This comment has been minimized.

Copy link

commented Aug 15, 2017

Change https://golang.org/cl/55972 mentions this issue: math/rand: make Perm match Shuffle

gopherbot pushed a commit that referenced this issue Sep 8, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes #20480
Updates #16213
Updates #21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
Reviewed-on: https://go-review.googlesource.com/51891
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Sep 8, 2017
Perm and Shuffle are fundamentally doing the same work.
This change makes Perm's algorithm match Shuffle's.
In addition to allowing developers to switch more
easily between the two methods, it affords a nice speed-up:

name      old time/op  new time/op  delta
Perm3-8   75.7ns ± 1%  51.8ns ± 1%  -31.59%  (p=0.000 n=9+8)
Perm30-8   610ns ± 1%   405ns ± 1%  -33.67%  (p=0.000 n=9+9)

This change alters the output from Perm,
given the same Source and seed.
This is a change from Go 1.0 behavior.
This necessitates updating the regression test.

This also changes the number of calls made to the Source
during Perm, which changes the output of the math/rand examples.

This also slightly perturbs the output of Perm,
nudging it out of the range currently accepted by TestUniformFactorial.
However, it is complete unclear that the helpers relied on
by TestUniformFactorial are correct. That is #21211.
This change updates checkSimilarDistribution to respect
closeEnough for standard deviations, which makes the test pass.
The whole situation is muddy; see #21211 for details.

There is an alternative implementation of Perm
that avoids initializing m, which is more similar
to the existing implementation, plus some optimizations:

func (r *Rand) Perm(n int) []int {
	m := make([]int, n)
	max31 := n
	if n > 1<<31-1-1 {
		max31 = 1<<31 - 1 - 1
	}
	i := 1
	for ; i < max31; i++ {
		j := r.int31n(int32(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	for ; i < n; i++ {
		j := r.Int63n(int64(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	return m
}

This is a tiny bit faster than the implementation
actually used in this change:

name      old time/op  new time/op  delta
Perm3-8   51.8ns ± 1%  50.3ns ± 1%  -2.83%  (p=0.000 n=8+9)
Perm30-8   405ns ± 1%   394ns ± 1%  -2.66%  (p=0.000 n=9+8)

However, 3% in performance doesn't seem worth
having the two algorithms diverge,
nor the reduced readability of this alternative.

Updates #16213.

Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781
Reviewed-on: https://go-review.googlesource.com/55972
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
josharian added a commit to josharian/go that referenced this issue Nov 13, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes golang#20480
Updates golang#16213
Updates golang#21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
josharian added a commit to josharian/go that referenced this issue Nov 13, 2017
Perm and Shuffle are fundamentally doing the same work.
This change makes Perm's algorithm match Shuffle's.
In addition to allowing developers to switch more
easily between the two methods, it affords a nice speed-up:

name      old time/op  new time/op  delta
Perm3-8   75.7ns ± 1%  51.8ns ± 1%  -31.59%  (p=0.000 n=9+8)
Perm30-8   610ns ± 1%   405ns ± 1%  -33.67%  (p=0.000 n=9+9)

This change alters the output from Perm,
given the same Source and seed.
This is a change from Go 1.0 behavior.
This necessitates updating regress test.

This also changes the number of calls made to the Source
during Perm, which changes the output of the math/rand examples.

This also slightly perturbs the output of Perm,
nudging it out of the range currently accepted by TestUniformFactorial.
However, it is complete unclear that the helpers relied on
by TestUniformFactorial are correct. That is golang#21211.
This change updates checkSimilarDistribution to respect
closeEnough for standard deviations, which makes the test pass.
The whole situation is muddy; see golang#21211 for details.


There is an alternative implementation of Perm
that avoids initializing m, which is more similar
to the existing implementation, plus some optimizations:

func (r *Rand) Perm(n int) []int {
	m := make([]int, n)
	max31 := n
	if n > 1<<31-1-1 {
		max31 = 1<<31 - 1 - 1
	}
	i := 1
	for ; i < max31; i++ {
		j := r.int31n(int32(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	for ; i < n; i++ {
		j := r.Int63n(int64(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	return m
}

This is a tiny bit faster than the implementation
actually used in this change:

name      old time/op  new time/op  delta
Perm3-8   51.8ns ± 1%  50.3ns ± 1%  -2.83%  (p=0.000 n=8+9)
Perm30-8   405ns ± 1%   394ns ± 1%  -2.66%  (p=0.000 n=9+8)

However, 3% in performance doesn't seem worth
having the two algorithms diverge,
nor the reduced readability of this alternative.


Updates golang#16213.

Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 9, 2018

Closing this proposal in favor of #21835.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
5 participants
You can’t perform that action at this time.