
Speed improvement to np.random.choice under replace=False and p=None #9855

Closed
wants to merge 13 commits

Conversation

ryanpeach
Copy link

@ryanpeach ryanpeach commented Oct 12, 2017

This is a fringe but also pretty common use case for me. I want to choose a small sample of a large array, without replacement. The way this is currently written is O(n), where n is your array size. For small samples from large arrays, however, this is very slow. The following algorithm's best performance is O(k), where k is your sample size. The worst case is unfortunately unbounded: there is always a probability that the random sample will continue to select the same values every time.

However, the probability that it will work in O(k) time is the following:

$P(O(k)) = \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot ... \cdot \frac{n-k}{n} = \frac{n!}{n^k}$

Let's say we want it to choose this algorithm if the probability of O(k) time is greater than 0.99, and call that threshold z.

$P(O(k)) > z = 0.99$

That probability will be such that:

$n! < z \cdot n^k$

where z = 0.99.

Let's cancel z out instead and say $z \approx 1$. Thus, this algorithm should run whenever:

$n! < n^k$

Now, I haven't tested this code as implemented in numpy, and I know running math.factorial is going to break at large numbers. Is there any way to simplify this inequality further? Does anyone want to dispute my math?
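For concreteness, here is a minimal pure-Python sketch of the rejection idea (illustrative only, not the code in this PR; the function name is made up):

    import numpy as np

    def rejection_choice(a, k, rng=np.random):
        # Draw indices uniformly with replacement and redraw on duplicates.
        # Expected O(k) draws when k << n; the worst case is unbounded.
        a = np.asarray(a)
        n = len(a)
        picked = []
        seen = set()
        while len(picked) < k:
            i = rng.randint(0, n)  # uniform over [0, n)
            if i not in seen:      # reject repeats and try again
                seen.add(i)
                picked.append(i)
        return a[picked]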

@rkern
Copy link
Member

rkern commented Oct 12, 2017

This general approach has been discussed previously and is blocked by stream-backwards compatibility issues until we figure out a good API for people to select newer (or older) implementations of these methods: #2764

To your question about math.factorial(), there is an implementation of loggam() (the natural logarithm of the gamma function) in distributions.c for similar purposes. It's currently declared static for local use, but it certainly could be exposed for use in mtrand.pyx.
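In pure Python, math.lgamma serves the same purpose; a sketch of the n! < n^k check done in log space to avoid overflow (the function name is illustrative):

    from math import lgamma, log

    def factorial_lt_power(n, k):
        # n! < n**k  <=>  log(n!) < k * log(n), and log(n!) == lgamma(n + 1).
        # Valid for n > 1, where log(n) > 0.
        return lgamma(n + 1) < k * log(n)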

@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

I don't know how to prove this, but all my graphs show that the boundary of $n! < n^k$ is roughly linear so long as $0 < k < n$. It could be divergent at high values, though. I'd like a proof, but I bet there's some easy solution like $1 < n/k < y$ or something. I just want the switching to be non-arbitrary.

@ryanpeach
Copy link
Author

I have left my complete math and code snippet on the linked issue #2764

Thanks for your help.

Fixed my math with a better conditional, proven in my pull request.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

Rest of my proof here:

$log_n(n!)<log_n(n^k)$

$log_n(n!)<k$

$log(\Gamma (1 + n))/log(n)<k$

Stirling's approximation says:

$log(\Gamma (1 + n)) = ((1+n)-1/2) \cdot log(1+n) - (1+n) + (1/2) \cdot log(2\pi)$

$= (n+1/2) \cdot log(1+n)- n + (1/2) \cdot log(2\pi) - 1$

So to wrap it all up, our new code should run whenever:

$\frac{(n+1/2) \cdot log(1+n)- n + (1/2) \cdot log(2\pi) - 1}{log(n)}<k$

Implemented in code.
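For reference, a sketch of that switching condition in Python (illustrative only; the PR itself implements this in mtrand.pyx, and the function name here is made up):

    from math import log, pi

    def stirling_threshold(n):
        # Stirling's approximation of log(Gamma(1 + n)), divided by log(n):
        # the new algorithm should run whenever k exceeds this value.
        return ((n + 0.5) * log(1 + n) - n + 0.5 * log(2 * pi) - 1) / log(n)

    # e.g. take the O(k) path when k > stirling_threshold(n)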

@ryanpeach
Copy link
Author

Sorry, just realized the inequality flipped.

Missed a flipped inequality
Fixed some basic bugs.
@ryanpeach
Copy link
Author

The unit test that fails due to a mismatch in arrays is clearly working; the result just isn't what's expected. I call shuffle, so the in-order sorting of a length-3 array is most likely just a coincidence.

@eric-wieser
Copy link
Member

The unit test that fails due to a mismatch in arrays is clearly working; the result just isn't what's expected

This is exactly the "stream-backwards compatibility" issue that @rkern is talking about. Your function may be correct, but given the same random.seed, it produces a different shuffle order.

This is only fine if we allow users to request the original shuffle order somehow.

Added @bashtage's variation on my algorithm, using windowed randints of decreasing size. Should give a significant speed boost.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

Ok, so I've added a legacy option at @eric-wieser's request and made a speed-improving tweak at @bashtage's suggestion. Should we add a deprecation warning as well, so we can remove the legacy option in future versions? What about a warning for code that runs the new algorithm, saying that order may change?
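A Python-level sketch of the windowed-randint variation (a partial Fisher-Yates shuffle; the actual change lives in mtrand.pyx, and these names are illustrative):

    import numpy as np

    def tail_sample(a, k, rng=np.random):
        # Partial Fisher-Yates: k swaps using randints drawn from windows
        # of decreasing size; the last k slots become the sample.
        a = np.array(a)  # work on a copy
        n = len(a)
        for i in range(n - 1, n - k - 1, -1):
            j = rng.randint(0, i + 1)  # uniform over [0, i]
            a[i], a[j] = a[j], a[i]
        return a[n - k:]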

Found out I hadn't added the denominator to the equation. I also approximated a constant for better speed.
@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

BTW, it seems that this condition is true, in general, for ratios of (n, k) around {(10, 6.3), (100, 78.8), (1000, 855.8)}. That's pretty good! But we may want to timeit at these high values. I'd almost be inclined to restrict it to the inverse of that, such that k << n, because of the Python-level implementation.

Below this number, the formula explodes to infinity or is less than 1.
Not sure if checking that the version is greater than 0 is the right test to make, but I think this is all that's needed.
@ryanpeach
Copy link
Author

Now using self.version as per #5911

Fixed some math, made it cleaner by using loggam. Not sure whether loggam is included, however.
@ryanpeach
Copy link
Author

I just discovered my math was quite a bit off. $P(O(k))$ actually equals $\frac{n!}{n^k (n-k-1)!}$. To save us all some time, the derivation comes out to $\log_n(n!) > k + \log_n((n-k-1)!)$. Using the loggam function makes this cleaner, subtracting 1 from each and dividing by the base.

@ryanpeach
Copy link
Author

ryanpeach commented Oct 12, 2017

That should be much better in terms of maintaining k << n. Hopefully it compiles; loggam may not be imported, or the numbers may be mistyped.

https://www.wolframalpha.com/input/?i=plot+(LogGamma%5Bn%2B1%5D-LogGamma%5Bn-k%5D)%2Flog%5Bn%5D+%3E+k+%7C+0%3Cx%3C10

@rkern
Copy link
Member

rkern commented Oct 12, 2017

Your vote with respect to stream compatibility is noted, but we get many more complaints when we break the stream. People rely on their seeded simulations reproducing the exact same results. Our longstanding (and thoroughly-discussed) policy for numpy.random is that we will break the stream if there is a correctness bug that needs to be fixed, but not in the case of performance improvements.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html

Note that if you want to use self.version, then you need to base your PR off of the branch in #5911, not master.

@ryanpeach
Copy link
Author

Ok, I'll rebase the pull request once I work everything out in terms of this code on my local machine.

@bashtage
Copy link
Contributor

I wouldn't rebase on that. Unfortunately, there isn't a clear path right now to multiple versions. The correct method (IMO), which would be to swap classes so that nested classes could be used, isn't really possible, since it isn't possible to override __new__ in Cython classes. This method isn't perfect, and in particular it may not even work for set_state. The other alternatives so far may get very messy over time.

You may want to finish the PR and then just leave it open so that if the version problem is fixed then this can be incorporated.

@ghost
Copy link

ghost commented Oct 17, 2017

They're completely different random generators though; they should have completely different classes. Swapping yourself out for a different class when you get passed an argument is just odd.

Raising an error every major version when someone wants to use a state from 1.N-1 doesn't seem like a good API.

This is in fact the best possible API:

    generator = RandomStateV1()
    generator.beta()  # or whatever

It allows progress. It is explicit. People get the generator they want, and they know it won't change. But we can add a RandomState() function for backwards compatibility.
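A sketch of that compatibility shim (hypothetical names, following the idea above):

    def RandomState(*args, **kwargs):
        # Keep old code working: alias the current default generator version.
        return RandomStateV1(*args, **kwargs)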

@ghost
Copy link

ghost commented Oct 17, 2017

This would avoid ending up on RandomState27 since there would be only one RandomState

Also, do you realize what this means? It means that there will be something like this:

    if self.version == 1:
        ...
    elif self.version == 2:
        ...
    elif self.version == 3:
        ...
    elif self.version == 27:
        ...

That is horrible for maintainability. If you have 27 different classes, you can still maintain them much more simply than a function with 27 different branches.

@bashtage
Copy link
Contributor

That is horrible for maintainability.

Versioning of individual random generators would happen at the function level, not at the extension level.

Having a path that ends up with RandomStateXX may become prohibitive in terms of build time and library size, since I think Cython treats each version as a standalone extension from the POV of the C code output.

@bashtage
Copy link
Contributor

If you want to see the size cost of multiple RandomStates, see the file sizes here: https://pypi.org/project/randomstate/#description

There are 10 RandomStates, one for each different core PRNG, and the library is nearly the size of NumPy (bigger in some cases).

@ghost
Copy link

ghost commented Oct 17, 2017

Within the current RandomState structure I think it might make sense to use a dictionary to hold versioned functions.

Actually, I read this again, and it's not a bad idea. But what would the public API look like?
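One hedged sketch of the dictionary idea (purely hypothetical, not an agreed design):

    class RandomState:
        def __init__(self, seed=None, version=2):
            self.version = version

        def _choice_v1(self, a, size, replace, p):
            ...  # legacy, stream-compatible implementation

        def _choice_v2(self, a, size, replace, p):
            ...  # faster implementation from this PR

        # version -> implementation, instead of a 27-branch if/elif chain
        _choice_impls = {1: _choice_v1, 2: _choice_v2}

        def choice(self, a, size=None, replace=True, p=None):
            return self._choice_impls[self.version](self, a, size, replace, p)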

@ryanpeach
Copy link
Author

ryanpeach commented Oct 18, 2017

Definitions

  • n is the length of our array
  • k is our sample size
  • P(O(k)) is the probability that our algorithm will run in O(k) time.
  • z is the percentage confidence desired in P(O(k)) in order to select it to run.

$\log_n(x) = \frac{\ln(x)}{\ln(n)}$

$\Gamma(x+1) = x!$

Proof

$P(O(k)) = \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot ... \cdot \frac{n-k}{n} = \frac{n!}{n^{k+1}(n-k-1)!}$

$P(O(k)) = \frac{n!}{n^{k+1}(n-k-1)!} > z$

$n! > z \cdot n^{k+1}(n-k-1)!$

$\log_n(n!) > \log_n(z \cdot n^{k+1}(n-k-1)!)$

$\log_n(n!) > \log_n(n^{k+1}) + \log_n((n-k-1)!) + \log_n(z)$

$\log_n(n!) - \log_n((n-k-1)!) > k + \frac{\ln(z)}{\ln(n)} + 1$

$\log_n(\Gamma(n+1)) - \log_n(\Gamma(n-k)) > k + \frac{\ln(z)}{\ln(n)} + 1$

$\frac{\ln(\Gamma(n+1))}{\ln(n)} - \frac{\ln(\Gamma(n-k))}{\ln(n)} > k + \frac{\ln(z)}{\ln(n)} + 1$

Finally:

$\frac{\ln(\Gamma(n+1)) - \ln(\Gamma(n-k))}{\ln(n)} > k + \frac{\ln(z)}{\ln(n)} + 1$

Plot (z = 0.9)

Wolfram Alpha

https://www.wolframalpha.com/input/?i=plot+(LogGamma%5Bn%2B1%5D-LogGamma%5Bn-k%5D)%2Flog%5Bn%5D+%3E+k+%2B+log_n(.9)+%2B+1
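In Python, with math.lgamma standing in for the C loggam, the final criterion reads (a sketch; names are illustrative, and it requires k < n):

    from math import lgamma, log

    def use_fast_path(n, k, z=0.99):
        # (ln Gamma(n+1) - ln Gamma(n-k)) / ln(n) > k + ln(z)/ln(n) + 1
        lhs = (lgamma(n + 1) - lgamma(n - k)) / log(n)
        return lhs > k + log(z) / log(n) + 1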

@ryanpeach
Copy link
Author

ryanpeach commented Oct 18, 2017

I'll admit I must not understand the math, because z can cancel at 1.0, but how could it really, given that there's always a possibility of repeated selection for k > 1? Something is wrong.

NVM, the +1 fixed it. So now z doesn't cancel; we need to set it to a confidence, like 99% or something.

These make the selection tunable by some confidence value in the probability that the algorithm will run in O(k) time.
@Eight1911
Copy link

At this point, do we have a consensus on what the new API should be? I'd be willing to take up the work if we have a finalized structure for the API.

@rkern
Copy link
Member

rkern commented Jan 21, 2018

No, there isn't. We're discussing the subject on the mailing list right now.

https://mail.python.org/pipermail/numpy-discussion/2018-January/077608.html

If there's a greater than 50% chance of it being a better algorithm, then it's worth trying, right? Because the expected value is better.
@ryanpeach
Copy link
Author

Could we maybe just make a flag for this? Make it default False?

@mattip
Copy link
Member

mattip commented Oct 16, 2018

This PR should be part of the random number improvement NEP 19; specifically, it should be part of the bashtage/randomgen project, which will eventually be vendored or copied into NumPy.

@mattip mattip added the 57 - Close? label Oct 16, 2018
@ryanpeach
Copy link
Author

I'm glad this policy has been changed; it was always a bit silly to begin with. Versioning and/or dockerizing is the only way to maintain perfect state compatibility. Even so, a lot of engineers are using seeds in their code as a way to maintain reproducibility, in violation of basic scientific principles (in my own field of data science, random seeds prevent truly valid cross-validation over multiple runs; I assume this principle holds for most sciences). A game like Minecraft that depends on seeds for its save state should not rely on stream stability either; it should keep a record, because, as we said before, it's not realistic for us to maintain stability between versions and between machines.

Random numbers should be random.

bashtage added a commit to bashtage/randomgen that referenced this pull request Apr 10, 2019
Improve choice when the array dim is large but the number of elements needed is small

xref numpy/numpy#9855
bashtage added a commit to bashtage/randomgen that referenced this pull request Apr 13, 2019
Add tail shuffle to choice and rule to switch between tail and Floyd

xref numpy/numpy#5299
xref numpy/numpy#2764
xref numpy/numpy#9855
bashtage added a commit to bashtage/numpy that referenced this pull request Apr 13, 2019
Improve performance in all cases
Large improvement when size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
@bashtage
Copy link
Contributor

I think the algorithm in #13163 is strictly faster than this method and always completes in O(size) ops.

xref #13164

@ryanpeach
Copy link
Author

I do as well. Please disregard this PR. It was made in an attempt to provide a switchable algorithm, which could handle past-version compatibility. As that is being removed, I don't think it's useful.

@ryanpeach ryanpeach closed this Apr 15, 2019
mattip pushed a commit to mattip/numpy that referenced this pull request May 20, 2019
Improve performance in all cases
Large improvement when size is small

xref numpy#5299
xref numpy#2764
xref numpy#9855
xref numpy#7810
Labels
03 - Maintenance · 51 - In progress · 57 - Close? · component: numpy.random

7 participants