
Speed improvement to np.random.choice under replace=False and p=None #9855

Closed
wants to merge 13 commits

Conversation

ryanpeach
Copy link

@ryanpeach ryanpeach commented Oct 12, 2017

This is a fringe but also pretty common use case for me. I want to choose a small sample of a large array, without replacement. The way this is currently written is O(n), where n is your array size. For small samples from large arrays, however, this is very slow. The following algorithm's best performance is O(k), where k is your sample size. The worst case is unfortunately unbounded: there is always a probability that the random sample will continue to select the same values every time.

However, the probability that it will work in O(k) time is the following:

$P(O(k)) = \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot ... \cdot \frac{n-k}{n} = \frac{n!}{n^k}$

Let's say we want it to choose this algorithm if the probability of O(k) time is greater than 0.99, and call that threshold z.

$P(O(k)) > z = 0.99$

That probability will be such that:

$n! < z \cdot n^k$

where z = 0.99.

Let's cancel z out instead and say $z \approx 1$. Thus, this algorithm should run whenever:

$n! < n^k$

Now, I haven't tested this code as implemented in numpy, and I know running math.factorial is going to break at large numbers. Is there any way to simplify this inequality further? Does anyone want to dispute my math?
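For concreteness, here is a minimal pure-Python sketch of the rejection idea (illustrative only, not the code in this PR; the function name is made up):

    import numpy as np

    def rejection_choice(a, k, rng=np.random):
        # Draw indices uniformly with replacement and redraw on duplicates.
        # Expected O(k) draws when k << n; the worst case is unbounded.
        a = np.asarray(a)
        n = len(a)
        picked = []
        seen = set()
        while len(picked) < k:
            i = rng.randint(0, n)  # uniform over [0, n)
            if i not in seen:      # reject repeats and try again
                seen.add(i)
                picked.append(i)
        return a[picked]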

@rkern
Copy link
Member

rkern commented Oct 12, 2017

This general approach has been discussed previously and is blocked by stream-backwards compatibility issues until we figure out a good API for people to select newer (or older) implementations of these methods: #2764

To your question about math.factorial(), there is an implementation of loggam() (the natural logarithm of the gamma function) in distributions.c for similar purposes. It's currently declared static for local use, but it certainly could be exposed for use in mtrand.pyx.
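In pure Python, math.lgamma serves the same purpose; a sketch of the n! < n^k check done in log space to avoid overflow (the function name is illustrative):

    from math import lgamma, log

    def factorial_lt_power(n, k):
        # n! < n**k  <=>  log(n!) < k * log(n), and log(n!) == lgamma(n + 1).
        # Valid for n > 1, where log(n) > 0.
        return lgamma(n + 1) < k * log(n)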

@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

I don't know how to prove this, but all my graphs show that the boundary of $n! < n^k$ is roughly linear so long as $0 < k < n$. It could be divergent at high values, though. I'd like a proof, but I bet there's some easy solution like $1 < n/k < y$ or something. I just want the switching to be non-arbitrary.

@ryanpeach
Copy link
Author

I have left my complete math and code snippet on the linked issue #2764

Thanks for your help.

Fixed my math with a better conditional, proven in my pull request.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

Rest of my proof here:

$log_n(n!)<log_n(n^k)$

$log_n(n!)<k$

$log(\Gamma (1 + n))/log(n)<k$

Stirling's approximation says:

$log(\Gamma (1 + n)) = ((1+n)-1/2) \cdot log(1+n) - (1+n) + (1/2) \cdot log(2\pi)$

$= (n+1/2) \cdot log(1+n)- n + (1/2) \cdot log(2\pi) - 1$

So to wrap it all up, our new code should run whenever:

$\frac{(n+1/2) \cdot log(1+n)- n + (1/2) \cdot log(2\pi) - 1}{log(n)}<k$

Implemented in code.
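For reference, a sketch of that switching condition in Python (illustrative only; the PR itself implements this in mtrand.pyx, and the function name here is made up):

    from math import log, pi

    def stirling_threshold(n):
        # Stirling's approximation of log(Gamma(1 + n)), divided by log(n):
        # the new algorithm should run whenever k exceeds this value.
        return ((n + 0.5) * log(1 + n) - n + 0.5 * log(2 * pi) - 1) / log(n)

    # e.g. take the O(k) path when k > stirling_threshold(n)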

@ryanpeach
Copy link
Author

Sorry, just realized the inequality flipped.

Missed a flipped inequality
Fixed some basic bugs.
@ryanpeach
Copy link
Author

The unit test that fails due to a mismatch in arrays is clearly working; the result just isn't what's expected. I call shuffle, so the in-order sorting of a length-3 array is most likely just a coincidence.

@eric-wieser
Copy link
Member

The unit test that fails due to a mismatch in arrays is clearly working; the result just isn't what's expected

This is exactly the "stream-backwards compatibility" issue that @rkern is talking about. Your function may be correct, but given the same random.seed, it produces a different shuffle order.

This is only fine if we allow users to request the original shuffle order somehow.

Added @bashtage's variation on my algorithm, using windowed randints of decreasing size. Should give a significant speed boost.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

Ok, so I've added a legacy option at @eric-wieser's request and made a speed-improving tweak at @bashtage's suggestion. Should we add a deprecation warning as well, so we can remove the legacy option in future versions? What about a warning for code that runs the new algorithm, saying that order may change?
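A Python-level sketch of the windowed-randint variation (a partial Fisher-Yates shuffle; the actual change lives in mtrand.pyx, and these names are illustrative):

    import numpy as np

    def tail_sample(a, k, rng=np.random):
        # Partial Fisher-Yates: k swaps using randints drawn from windows
        # of decreasing size; the last k slots become the sample.
        a = np.array(a)  # work on a copy
        n = len(a)
        for i in range(n - 1, n - k - 1, -1):
            j = rng.randint(0, i + 1)  # uniform over [0, i]
            a[i], a[j] = a[j], a[i]
        return a[n - k:]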

Found out I hadn't added the denominator to the equation. I also approximated a constant for better speed.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

BTW, it seems that this condition is true, in general, for ratios of (n, k) around {(10, 6.3), (100, 78.8), (1000, 855.8)}. That's pretty good! But we may want to timeit at these high values. I'd almost be inclined to restrict it to the inverse of that, such that k << n, because of the Python-level implementation.

Below this number, the formula explodes to infinity or is less than 1.
Not sure if checking that the version is greater than 0 is the right test to make, but I think this is all that's needed.
@ryanpeach
Copy link
Author

Now using self.version as per #5911

Fixed some math, made it cleaner by using loggam. Not sure whether loggam is included, however.
@ryanpeach
Copy link
Author

I just discovered my math was quite a bit off. $P(O(k))$ actually equals $\frac{n!}{n^k (n-k-1)!}$. To save us all some time, the derivation comes out to $\log_n(n!) > k + \log_n((n-k-1)!)$. Using the loggam function makes this cleaner, subtracting 1 from each and dividing by the base.

@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

That should be much better in terms of maintaining k << n. Hopefully it compiles; loggam may not be imported, or the numbers may be mistyped.

https://www.wolframalpha.com/input/?i=plot+(LogGamma%5Bn%2B1%5D-LogGamma%5Bn-k%5D)%2Flog%5Bn%5D+%3E+k+%7C+0%3Cx%3C10

@rkern
Copy link
Member

rkern commented Oct 12, 2017

Your vote with respect to stream compatibility is noted, but we get many more complaints when we break the stream. People rely on their seeded simulations reproducing the exact same results. Our longstanding (and thoroughly-discussed) policy for numpy.random is that we will break the stream if there is a correctness bug that needs to be fixed, but not in the case of performance improvements.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html

Note that if you want to use self.version, then you need to base your PR off of the branch in #5911, not master.

@ryanpeach
Copy link
Author

Ok, I'll rebase the pull request once I work everything out in terms of this code on my local machine.

@bashtage
Copy link
Contributor

I wouldn't rebase on that. Unfortunately, there isn't a clear path right now to multiple versions. The correct method (IMO), which would be to swap classes so that nested classes could be used, isn't really possible, since it isn't possible to override __new__ in Cython classes. This method isn't perfect, and in particular it may not even work for set_state. The other alternatives so far may get very messy over time.

You may want to finish the PR and then just leave it open so that if the version problem is fixed then this can be incorporated.

@ghost
Copy link

ghost commented Oct 17, 2017

They're completely different random generators though; they should have completely different classes. Swapping yourself out for a different class when you get passed an argument is just odd.

Raising an error every major version when someone wants to use a state from 1.N-1 doesn't seem like a good API.

This is in fact the best possible API:

    generator = RandomStateV1()
    generator.beta()  # or whatever

It allows progress. It is explicit. People get the generator they want, and they know it won't change. But we can add a RandomState() function for backwards compatibility.
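A sketch of that compatibility shim (hypothetical names, following the idea above):

    def RandomState(*args, **kwargs):
        # Keep old code working: alias the current default generator version.
        return RandomStateV1(*args, **kwargs)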

@ghost
Copy link

ghost commented Oct 17, 2017

This would avoid ending up on RandomState27 since there would be only one RandomState

Also, do you realize what this means? It means that there will be something like this:

    if self.version == 1:
        ...
    elif self.version == 2:
        ...
    elif self.version == 3:
        ...
    elif self.version == 27:
        ...

That is horrible for maintainability. If you have 27 different classes, you can still maintain them much more simply than a function with 27 different branches.

@bashtage
Copy link
Contributor

That is horrible for maintainability.

Versioning of individual random generators would happen at the function level, not at the extension level.

Having a path that ends up with RandomStateXX may become prohibitive in terms of build time and library size, since I think Cython treats each version as a standalone extension from the POV of the C code output.

@bashtage
Copy link
Contributor

If you want to see the size cost of multiple RandomStates, see the file sizes here: https://pypi.org/project/randomstate/#description

There are 10 RandomStates, one for each different core PRNG, and the library is nearly the size of NumPy (bigger in some cases).

@ghost
Copy link

ghost commented Oct 17, 2017

Within the current RandomState structure I think it might make sense to use a dictionary to hold versioned functions.

Actually, I read this again, and it's not a bad idea. But what would the public API look like?
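One hedged sketch of the dictionary idea (purely hypothetical, not an agreed design):

    class RandomState:
        def __init__(self, seed=None, version=2):
            self.version = version

        def _choice_v1(self, a, size, replace, p):
            ...  # legacy, stream-compatible implementation

        def _choice_v2(self, a, size, replace, p):
            ...  # faster implementation from this PR

        # version -> implementation, instead of a 27-branch if/elif chain
        _choice_impls = {1: _choice_v1, 2: _choice_v2}

        def choice(self, a, size=None, replace=True, p=None):
            return self._choice_impls[self.version](self, a, size, replace, p)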

@ryanpeach
Copy link
Author

ryanpeach commented Oct 18, 2017

Definitions

  • n is the length of our array
  • k is our sample size
  • P(O(k)) is the probability that our algorithm will run in O(k) time.
  • z is the percentage confidence desired in P(O(k)) in order to select it to run.

$\log_n(x) = \frac{\ln(x)}{\ln(n)}$

$\Gamma(x+1) = x!$

Proof

$P(O(k)) = \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot ... \cdot \frac{n-k}{n} = \frac{n!}{n^{k+1}(n-k-1)!}$

$P(O(k)) = \frac{n!}{n^{k+1}(n-k-1)!} > z$

$n! > z \cdot n^{k+1}(n-k-1)!$

$\log_n(n!) > \log_n(z \cdot n^{k+1}(n-k-1)!)$

$\log_n(n!) > \log_n(n^{k+1}) + \log_n((n-k-1)!) + \log_n(z)$

$\log_n(n!) - \log_n((n-k-1)!) > k + \frac{\ln(z)}{\ln(n)} + 1$

$\log_n(\Gamma(n+1)) - \log_n(\Gamma(n-k)) > k + \frac{\ln(z)}{\ln(n)} + 1$

$\frac{\ln(\Gamma(n+1))}{\ln(n)} - \frac{\ln(\Gamma(n-k))}{\ln(n)} > k + \frac{\ln(z)}{\ln(n)} + 1$

Finally:

$\frac{\ln(\Gamma(n+1)) - \ln(\Gamma(n-k))}{\ln(n)} > k + \frac{\ln(z)}{\ln(n)} + 1$

Plot (z = 0.9)

Wolfram Alpha

https://www.wolframalpha.com/input/?i=plot+(LogGamma%5Bn%2B1%5D-LogGamma%5Bn-k%5D)%2Flog%5Bn%5D+%3E+k+%2B+log_n(.9)+%2B+1
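In Python, with math.lgamma standing in for the C loggam, the final criterion reads (a sketch; names are illustrative, and it requires k < n):

    from math import lgamma, log

    def use_fast_path(n, k, z=0.99):
        # (ln Gamma(n+1) - ln Gamma(n-k)) / ln(n) > k + ln(z)/ln(n) + 1
        lhs = (lgamma(n + 1) - lgamma(n - k)) / log(n)
        return lhs > k + log(z) / log(n) + 1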

@ryanpeach
Copy link
Author

ryanpeach commented Oct 18, 2017

I'll admit I must not understand the math, because z can cancel at 1.0, but how could it really, given that there's always a possibility of repeated selection for k > 1? Something is wrong.

NVM, the +1 fixed it. So now z doesn't cancel; we need to set it to a confidence, like 99% or something.

These make the selection tunable by some confidence value in the probability that the algorithm will run in O(k) time.
@Eight1911
Copy link

At this point, do we have a consensus on what the new API should be? I'd be willing to take up the work if we have a finalized structure for the API.

@rkern
Copy link
Member

rkern commented Jan 21, 2018

No, there isn't. We're discussing the subject on the mailing list right now.

https://mail.python.org/pipermail/numpy-discussion/2018-January/077608.html

If there's a greater than 50% chance of it being a better algorithm, then it's worth trying, right? Because the expected value is better.
@ryanpeach
Copy link
Author

Could we maybe just make a flag for this? Make it default False?

@mattip
Copy link
Member

mattip commented Oct 16, 2018

This PR should be part of the random number improvement NEP 19; specifically, it should be part of the bashtage/randomgen project, which will eventually be vendored or copied into NumPy.

@mattip mattip added the 57 - Close? label Oct 16, 2018
@ryanpeach
Copy link
Author

I'm glad this policy has been changed; it was always a bit silly to begin with. Versioning and/or dockerizing is the only way to maintain perfect state compatibility. Even so, a lot of engineers are using seeds in their code as a way to maintain reproducibility, in violation of basic scientific principles (in my own field of data science, random seeds prevent truly valid cross-validation over multiple runs; I assume this principle holds for most sciences). A game like Minecraft that depends on seeds for its save state should not rely on stream stability either; it should keep a record, because, as we said before, it's not realistic for us to maintain stability between versions and between machines.

Random numbers should be random.

bashtage added a commit to bashtage/randomgen that referenced this pull request Apr 10, 2019
Improve choice when the array dim is large but the number of elements needed is small

xref numpy/numpy#9855
bashtage added a commit to bashtage/randomgen that referenced this pull request Apr 13, 2019
Add tail shuffle to choice and rule to switch between tail and Floyd

xref numpy/numpy#5299
xref numpy/numpy#2764
xref numpy/numpy#9855
bashtage added a commit to bashtage/numpy that referenced this pull request Apr 13, 2019
Improve performance in all cases
Large improvement when size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
@bashtage
Copy link
Contributor

I think the algorithm in #13163 is strictly faster than this method and always completes in O(size) ops.

xref #13164

@ryanpeach
Copy link
Author

I do as well. Please disregard this PR. It was made in an attempt to provide a switchable algorithm, which could handle past-version compatibility. As that is being removed, I don't think it's useful.

@ryanpeach ryanpeach closed this Apr 15, 2019
mattip pushed a commit to mattip/numpy that referenced this pull request May 20, 2019
Improve performance in all cases
Large improvement when size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
Labels
03 - Maintenance · 51 - In progress · 57 - Close? · component: numpy.random

7 participants