Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: math/rand: speed up Int31n with multiply/shift instead of modulo #16213

Closed
josharian opened this issue Jun 29, 2016 · 11 comments
Closed
Labels
FrozenDueToAge Proposal v2 A language change or incompatible library change
Milestone

Comments

@josharian
Copy link
Contributor

See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ for details.

Marking as Go2, since the compatibility guarantee prevents us from changing the output of Int31n given a seed.

We can still use this in the runtime, though; there are several places where we do fastrand1() % n. I will send a CL for this for Go 1.8 and have it update this issue.

cc @dgryski

@josharian josharian added the v2 A language change or incompatible library change label Jun 29, 2016
@josharian josharian added this to the Unplanned milestone Jun 29, 2016
@josharian josharian self-assigned this Jun 29, 2016
@robpike
Copy link
Contributor

robpike commented Jun 29, 2016

If we are going to change this package, it has more severe problems. https://go-review.googlesource.com/#/c/10161/ isn't ready but it is based on a much better generator, and we should use the Go 2 opportunity, should it arise, to replace the core algorithm.

@josharian
Copy link
Contributor Author

Here's the obvious fastrandn for the runtime package

func fastrandn(n uint32) uint32 {
    // See http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/.
    // Note that we use >> 31 instead of >> 32 because fastrand only produces 31 bit numbers.
    return uint32((uint64(n) * uint64(fastrand())) >> 31)
}

Unfortunately, the additional function call (instead of an inline %) more than obviates the performance gains. This means implementing fastrandn in assembly and copy/pasting the core of fastrand, which is a pity.

Results on amd64, with an assembly implementation:

name old time/op new time/op delta
FastrandnNonConstant-8 5.90ns ±18% 4.62ns ±13% -21.63% (p=0.008 n=5+5)
FastrandnConstantNonPowerOfTwo-8 5.24ns ± 8% 5.14ns ±25% ~ (p=0.052 n=24+25)
FastrandnConstantPowerOfTwo-8 3.49ns ±11% 4.65ns ± 4% +33.47% (p=0.000 n=22+23)

Worth doing for the non-constant case. This might get delayed going in, though, since I lack hardware (ppc, mips, s390x) to test asm implementations on other platforms.

@josharian
Copy link
Contributor Author

@robpike yes. The number of "can't fix this because compatibility" issues for math/rand is sad. (And I still don't really understand why it was ok to fix #16124.) I wish there was a way to fix them all in the stdlib without waiting for Go2. Oh well.

@lemire
Copy link

lemire commented Sep 9, 2016

TensorFlow adopted this technique for better hashing performance : tensorflow/tensorflow@a47a300

@robpike
Copy link
Contributor

robpike commented Sep 10, 2016

@josharian Well, I could prioritize the replacement package that's already started and we can put it somewhere supported and encourage people to go there instead.

@josharian
Copy link
Contributor Author

@robpike I think getting the replacement package wrapped up would be a service to the community. As for where to put it, how about go4.org? cc @bradfitz

My CL to use this approach in the runtime is still in my queue.

@josharian
Copy link
Contributor Author

Using this technique in the runtime will be a lot nicer once CL 34782 lands in 1.9.

@gopherbot
Copy link

CL https://golang.org/cl/36932 mentions this issue.

gopherbot pushed a commit that referenced this issue Feb 14, 2017
This occurs a fair amount in the runtime for non-power-of-two n.
Use an alternative, faster formulation.

name           old time/op  new time/op  delta
Fastrandn/2-8  4.45ns ± 2%  2.09ns ± 3%  -53.12%  (p=0.000 n=14+14)
Fastrandn/3-8  4.78ns ±11%  2.06ns ± 2%  -56.94%  (p=0.000 n=15+15)
Fastrandn/4-8  4.76ns ± 9%  1.99ns ± 3%  -58.28%  (p=0.000 n=15+13)
Fastrandn/5-8  4.96ns ±13%  2.03ns ± 6%  -59.14%  (p=0.000 n=15+15)

name                    old time/op  new time/op  delta
SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

Updates #16213

Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
Reviewed-on: https://go-review.googlesource.com/36932
Run-TryBot: Josh Bleecher Snyder <josharian@gmail.com>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Keith Randall <khr@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
josharian added a commit to josharian/go that referenced this issue Feb 14, 2017
This occurs a fair amount in the runtime for non-power-of-two n.
Use an alternative, faster formulation.

name           old time/op  new time/op  delta
Fastrandn/2-8  4.45ns ± 2%  2.09ns ± 3%  -53.12%  (p=0.000 n=14+14)
Fastrandn/3-8  4.78ns ±11%  2.06ns ± 2%  -56.94%  (p=0.000 n=15+15)
Fastrandn/4-8  4.76ns ± 9%  1.99ns ± 3%  -58.28%  (p=0.000 n=15+13)
Fastrandn/5-8  4.96ns ±13%  2.03ns ± 6%  -59.14%  (p=0.000 n=15+15)

name                    old time/op  new time/op  delta
SelectUncontended-8     33.7ns ± 2%  33.9ns ± 2%  +0.70%  (p=0.000 n=49+50)
SelectSyncContended-8   1.68µs ± 4%  1.65µs ± 4%  -1.54%  (p=0.000 n=50+45)
SelectAsyncContended-8   282ns ± 1%   277ns ± 1%  -1.50%  (p=0.000 n=48+43)
SelectNonblock-8        5.31ns ± 1%  5.32ns ± 1%    ~     (p=0.275 n=45+44)
SelectProdCons-8         585ns ± 3%   577ns ± 2%  -1.35%  (p=0.000 n=50+50)
GoroutineSelect-8       1.59ms ± 2%  1.59ms ± 1%    ~     (p=0.084 n=49+48)

Updates golang#16213

Change-Id: Ib555a4d7da2042a25c3976f76a436b536487d5b7
@rsc rsc changed the title math/rand: speed up Int31n with multiply/shift instead of modulo proposal: math/rand: speed up Int31n with multiply/shift instead of modulo Jun 17, 2017
@gopherbot
Copy link

Change https://golang.org/cl/51891 mentions this issue: math/rand: add Shuffle

josharian added a commit to josharian/go that referenced this issue Aug 12, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes golang#20480
Updates golang#16213
Updates golang#21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
@gopherbot
Copy link

Change https://golang.org/cl/55972 mentions this issue: math/rand: make Perm match Shuffle

gopherbot pushed a commit that referenced this issue Sep 8, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes #20480
Updates #16213
Updates #21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
Reviewed-on: https://go-review.googlesource.com/51891
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Sep 8, 2017
Perm and Shuffle are fundamentally doing the same work.
This change makes Perm's algorithm match Shuffle's.
In addition to allowing developers to switch more
easily between the two methods, it affords a nice speed-up:

name      old time/op  new time/op  delta
Perm3-8   75.7ns ± 1%  51.8ns ± 1%  -31.59%  (p=0.000 n=9+8)
Perm30-8   610ns ± 1%   405ns ± 1%  -33.67%  (p=0.000 n=9+9)

This change alters the output from Perm,
given the same Source and seed.
This is a change from Go 1.0 behavior.
This necessitates updating the regression test.

This also changes the number of calls made to the Source
during Perm, which changes the output of the math/rand examples.

This also slightly perturbs the output of Perm,
nudging it out of the range currently accepted by TestUniformFactorial.
However, it is complete unclear that the helpers relied on
by TestUniformFactorial are correct. That is #21211.
This change updates checkSimilarDistribution to respect
closeEnough for standard deviations, which makes the test pass.
The whole situation is muddy; see #21211 for details.

There is an alternative implementation of Perm
that avoids initializing m, which is more similar
to the existing implementation, plus some optimizations:

func (r *Rand) Perm(n int) []int {
	m := make([]int, n)
	max31 := n
	if n > 1<<31-1-1 {
		max31 = 1<<31 - 1 - 1
	}
	i := 1
	for ; i < max31; i++ {
		j := r.int31n(int32(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	for ; i < n; i++ {
		j := r.Int63n(int64(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	return m
}

This is a tiny bit faster than the implementation
actually used in this change:

name      old time/op  new time/op  delta
Perm3-8   51.8ns ± 1%  50.3ns ± 1%  -2.83%  (p=0.000 n=8+9)
Perm30-8   405ns ± 1%   394ns ± 1%  -2.66%  (p=0.000 n=9+8)

However, 3% in performance doesn't seem worth
having the two algorithms diverge,
nor the reduced readability of this alternative.

Updates #16213.

Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781
Reviewed-on: https://go-review.googlesource.com/55972
Run-TryBot: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
josharian added a commit to josharian/go that referenced this issue Nov 13, 2017
Shuffle uses the Fisher-Yates algorithm.

Since this is new API, it affords us the opportunity
to use a much faster Int31n implementation that mostly avoids division.
As a result, BenchmarkPerm30ViaShuffle is
about 30% faster than BenchmarkPerm30,
despite requiring a separate initialization loop
and using function calls to swap elements.

Fixes golang#20480
Updates golang#16213
Updates golang#21211

Change-Id: Ib8956c4bebed9d84f193eb98282ec16ee7c2b2d5
josharian added a commit to josharian/go that referenced this issue Nov 13, 2017
Perm and Shuffle are fundamentally doing the same work.
This change makes Perm's algorithm match Shuffle's.
In addition to allowing developers to switch more
easily between the two methods, it affords a nice speed-up:

name      old time/op  new time/op  delta
Perm3-8   75.7ns ± 1%  51.8ns ± 1%  -31.59%  (p=0.000 n=9+8)
Perm30-8   610ns ± 1%   405ns ± 1%  -33.67%  (p=0.000 n=9+9)

This change alters the output from Perm,
given the same Source and seed.
This is a change from Go 1.0 behavior.
This necessitates updating regress test.

This also changes the number of calls made to the Source
during Perm, which changes the output of the math/rand examples.

This also slightly perturbs the output of Perm,
nudging it out of the range currently accepted by TestUniformFactorial.
However, it is complete unclear that the helpers relied on
by TestUniformFactorial are correct. That is golang#21211.
This change updates checkSimilarDistribution to respect
closeEnough for standard deviations, which makes the test pass.
The whole situation is muddy; see golang#21211 for details.


There is an alternative implementation of Perm
that avoids initializing m, which is more similar
to the existing implementation, plus some optimizations:

func (r *Rand) Perm(n int) []int {
	m := make([]int, n)
	max31 := n
	if n > 1<<31-1-1 {
		max31 = 1<<31 - 1 - 1
	}
	i := 1
	for ; i < max31; i++ {
		j := r.int31n(int32(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	for ; i < n; i++ {
		j := r.Int63n(int64(i + 1))
		m[i] = m[j]
		m[j] = i
	}
	return m
}

This is a tiny bit faster than the implementation
actually used in this change:

name      old time/op  new time/op  delta
Perm3-8   51.8ns ± 1%  50.3ns ± 1%  -2.83%  (p=0.000 n=8+9)
Perm30-8   405ns ± 1%   394ns ± 1%  -2.66%  (p=0.000 n=9+8)

However, 3% in performance doesn't seem worth
having the two algorithms diverge,
nor the reduced readability of this alternative.


Updates golang#16213.

Change-Id: I11a7441ff8837ee9c241b4c88f7aa905348be781
@ianlancetaylor
Copy link
Contributor

Closing this proposal in favor of #21835.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge Proposal v2 A language change or incompatible library change
Projects
None yet
Development

No branches or pull requests

5 participants