Speed improvement to np.random.choice under replace=False and p=None #9855
Conversation
This is a fringe but also pretty common use-case for me: I want to choose a small sample of a large array, without replacement. The way this is currently written is O(n). The algorithm I propose, however, has a best-case performance of O(k), where k is the sample size. The worst case is unfortunately unbounded: there is always some probability that the random sample keeps selecting the same value. The probability that it finishes in O(k) time, however, is n/n * (n-1)/n * (n-2)/n * ... * (n-k)/n == n!/n^k. Let's say we want to choose this algorithm when the probability of O(k) time is greater than 0.99. That probability will be such that n! < z*n^k, where z = 0.99. Let's cancel z out instead and say z = 1 (approximately). Thus, this algorithm should run whenever n! < n^k.
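A minimal Python sketch of the approach as I read the description above (mine, not the PR's actual C-level implementation): draw indices with replacement and reject duplicates, which takes O(k) expected time when k is much smaller than n.

```python
import random

def rejection_sample(n, k):
    """Pick k distinct indices from range(n) by rejecting repeats."""
    seen = set()
    while len(seen) < k:
        seen.add(random.randrange(n))  # a collision simply retries
    out = list(seen)
    random.shuffle(out)  # set order isn't uniformly random; shuffle it
    return out

print(rejection_sample(1_000_000, 5))
```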
This general approach has been discussed previously and is blocked by stream-backwards-compatibility issues until we figure out a good API for people to select newer (or older) implementations of these methods: #2764. To your question about …
I don't know how to prove this, but all my graphs show that the n! < n^k boundary is roughly linear, so long as 0 < k < n. Could be divergent at high values, though. I'd like a proof, but I bet there's some easy condition like 1 < n/k < y or something. I just want the switching to be non-arbitrary.
I have left my complete math and code snippet on the linked issue #2764. Thanks for your help.
Fixed my math with a better conditional, proven in my pull request.
Rest of my proof here. Stirling's approximation says: [equation image not preserved]. So to wrap it all up, our new code should run whenever: [equation image not preserved]. Implemented in code.
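The equation images did not survive, but the step is reconstructible; a hedged LaTeX sketch (my reconstruction, assuming the simplified condition n! < n^k from the opening comment, not necessarily the exact form posted):

```latex
% Reconstruction (not the poster's exact rendering): apply Stirling's
% approximation \ln n! \approx n \ln n - n to the condition n! < n^k.
n! < n^k
\iff \ln n! < k \ln n
\iff n \ln n - n \lesssim k \ln n
\iff k \gtrsim n - \frac{n}{\ln n}
```

This is at least roughly consistent with the crossover pairs quoted later in the thread (k ≈ 78.3 at n = 100, k ≈ 855.2 at n = 1000).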
Sorry, just realized the inequality flipped.
Missed a flipped inequality
Fixed some basic bugs.
The unit test that fails due to a mismatch in arrays is clearly working; its output just isn't what's expected. I call shuffle, so the in-order sorting of a length-3 array is most likely just a coincidence.
This is exactly the "stream-backwards compatibility" issue that @rkern is talking about. Your function may be correct, but given the same seed it produces a different stream. This is only fine if we allow users to request the original shuffle order somehow.
Added @bashtage's variation on my algorithm using windowed randints of reducing size. Should give a significant speed boost.
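A sketch (mine) of one way to read "windowed randints of reducing size": a sparse partial Fisher-Yates shuffle, where each draw comes from the shrinking window [i, n) and no rejection is needed.

```python
import random

def partial_fisher_yates(n, k):
    """Return k distinct indices from range(n) in O(k) time and space."""
    swaps = {}  # sparse view of an implicitly shuffled range(n)
    out = []
    for i in range(k):
        j = random.randrange(i, n)   # window [i, n) shrinks each step
        out.append(swaps.get(j, j))  # take the value currently at slot j
        swaps[j] = swaps.get(i, i)   # move slot i's value into slot j
    return out
```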
Ok, so I've added a legacy option at @eric-wieser's request and made a speed-improving tweak at @bashtage's suggestion. Should we add a deprecation warning as well, so we can remove the legacy option in future versions? What about a warning for code that runs the new algorithm, saying that the order may change?
Found out I hadn't added the denominator to the equation. I also approximated a constant for better speed.
BTW, it seems that this condition is true, in general, for ratios of (n, k) around {(10, 6.3), (100, 78.8), (1000, 855.8)}. That's pretty good! But we may want to timeit at these high values. I'd almost be inclined to restrict it to the inverse of that, such that k << n, because of the Pythonic implementation.
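A throwaway harness for that timeit (it only times the current np.random.choice as a baseline; a candidate implementation would be swapped into the lambda):

```python
import timeit
import numpy as np

# Crossover ratios quoted above, rounded to integers.
for n, k in [(10, 6), (100, 79), (1000, 856)]:
    t = timeit.timeit(
        lambda: np.random.choice(n, k, replace=False), number=1000
    )
    print(f"n={n:5d}, k={k:3d}: {t:.4f}s per 1000 calls")
```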
Below this number, the formula explodes to infinity or is less than 1.
Not sure if version greater than 0 is the right check to make, but I think this is all that's needed.
Now using self.version as per #5911
Fixed some math; made it cleaner by using loggam. Not sure whether loggam is included here, however.
I just discovered my math was quite a bit off. P(O(k)) actually equals n!/(n^k * (n-k-1)!). To save us all some time, the derivation comes out to log_n(n!) > k + log_n((n-k-1)!). Using the loggam function makes this cleaner, subtracting 1 from each and dividing by the base.
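A hedged Python rendering of that condition, with math.lgamma standing in for loggam (for integer x, lgamma(x) == log((x-1)!)):

```python
from math import lgamma, log

def fast_path_ok(n, k):
    """Test log_n(n!) > k + log_n((n-k-1)!), per the comment above.

    Assumes 1 <= k <= n - 1 so that lgamma(n - k) is defined.
    """
    return lgamma(n + 1) / log(n) > k + lgamma(n - k) / log(n)
```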
That should be much better in terms of maintaining k << n. Hopefully it compiles; loggam may not be imported, or the numbers may be mistyped.
Your vote with respect to stream compatibility is noted, but we get many more complaints when we break the stream. People rely on their seeded simulations reproducing the exact same results. Our longstanding (and thoroughly-discussed) policy is documented here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.RandomState.html Note that if you want to use …
Ok, I'll rebase the pull request once I work everything out in terms of this code on my local machine.
I wouldn't rebase on that. Unfortunately there isn't a clear path right now to multiple versions. The correct method (IMO) would be to swap classes so that nested classes could be used, but that isn't really possible since you can't override __new__ in Cython classes. This method isn't perfect, and in particular it may not even work for set_state. The other alternatives so far may get very messy over time. You may want to finish the PR and then just leave it open, so that if the version problem is fixed then this can be incorporated.
They're completely different random generators though; they should have completely different classes. Swapping yourself out for a different class when you get passed an argument is just odd.
This is in fact the best possible API:

```python
generator = RandomStateV1()
generator.beta()  # or whatever
```

It allows progress. It is explicit. People get the generator they want, and they know it won't change. But we can add a …
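To make the shape of that proposal concrete, a hypothetical sketch (the V1/V2 names come from this thread; none of this is a real NumPy API):

```python
class RandomStateV1:
    def choice(self, n, k):
        ...  # original O(n) algorithm; stream frozen forever

class RandomStateV2(RandomStateV1):
    def choice(self, n, k):
        ...  # new O(k) algorithm; a different stream, opted into by class

generator = RandomStateV1()  # explicit: this stream will never change
```

For what it's worth, NEP 19 later landed NumPy somewhere similar: a frozen legacy RandomState alongside a new Generator class.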
Also, do you realize what this means? It means that there will be something like this: [code example not preserved]
That is horrible for maintainability. If you have 27 different classes, you can still maintain them much more simply than a function with 27 different branches.
Versioning of individual random generators would happen at the function level, not at the extension level. Having a path that ends up with …
If you want to see the size cost of multiple … There are 10 …
Actually, I read this again, and it's not a bad idea. But what would the public API look like?
Definitions / Proof / Finally / Plot (z=.9) [the comment's equations and plot images are not preserved]
I’ll admit I must not understand the math, because z can cancel at 1.0, but how could it really, given that there's always the possibility of a repeated selection at k > 1? Something is wrong. NVM, the +1 fixed it. So now z doesn't cancel; we need to set it to a confidence, like 99% or something.
These make the selection tunable by some confidence value in the probability that the algorithm will run in O(k) time.
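Concretely, the tunable check might look like this (a sketch using the thread's P(O(k)) = n!/((n-k-1)! * n^k), taken to log space; z is the confidence knob):

```python
from math import lgamma, log

def confident_fast_path(n, k, z=0.99):
    """True when P(finishing in O(k) time) exceeds the confidence z."""
    return lgamma(n + 1) - lgamma(n - k) - k * log(n) > log(z)
```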
At this point, do we have a consensus on what the new API should be? I'd be willing to take up the work if we have a finalized structure for the API.
No, there isn't. We're discussing the subject on the mailing list right now. https://mail.python.org/pipermail/numpy-discussion/2018-January/077608.html
If there's a greater than 50% chance of it being the better algorithm, then it's worth trying, right? Because the expected value is better.
Could we maybe just add a flag for this, and make it default to False?
This PR should be part of the random number improvement NEP 19; specifically, it should be part of the bashtage/randomgen project, which will eventually be vendored or copied into NumPy.
I'm glad this policy has been changed; it was always a bit silly to begin with. Versioning and/or dockerizing is the only way to maintain perfect state compatibility. Even so, a lot of engineers are using seeds in their code as a way to control variability, in violation of basic scientific principles (in my own field of data science, random seeds prevent truly valid cross-validation over multiple runs; I assume this principle holds for most sciences). A game like Minecraft that depends on seeds for its save state should not rely on stream stability; it should keep a record, because, as we said before, it's not realistic to maintain stability between versions and between machines. Random numbers should be random.
Improve choice when the array dim is large but the number of elements needed is small xref numpy/numpy#9855
Add tail shuffle to choice and rule to switch between tail and Floyd xref numpy/numpy#5299 xref numpy/numpy#2764 xref numpy/numpy#9855
Improve performance in all cases; large improvement when size is small xref numpy#5299 xref numpy#2764 xref numpy#9855 xref numpy#7810
I do as well. Please disregard this PR. It was made in an attempt to provide a switchable algorithm, which could handle past-version compatibility. As that is being removed, I don't think it's useful.
This is a fringe but also pretty common use-case for me. I want to choose a small sample of a large array, without replacement. The way this is currently written is O(n), where n is your array size; with large arrays, even a small sample is slow. The following algorithm's best performance is O(k), where k is your sample size. The worst case is unfortunately unbounded: there is always a probability that the random sample will continue to select the same values every time.
However, the probability that it will work in O(k) time is the following:
n/n * (n-1)/n * (n-2)/n * ... * (n-k)/n == n!/n^k
Let's say we want it to choose this algorithm if the probability of O(k) time is greater than 0.99, and call that z.
That probability will be such that n! < z*n^k, where z = 0.99.
Let's cancel z out instead and say z = 1 (approximately). Thus, this algorithm should run whenever:
n! < n^k
Now, I haven't tested this code implemented in numpy, and I know running math.factorial is going to break at large numbers. Any way to simplify this inequality further? Does anyone want to dispute my math?
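On the math.factorial worry: comparing in log space sidesteps computing n! entirely (a sketch; math.lgamma(n + 1) equals log(n!) for integer n):

```python
from math import lgamma, log

def simple_condition(n, k):
    """Log-space test of n! < n^k, i.e. lgamma(n + 1) < k * log(n)."""
    return lgamma(n + 1) < k * log(n)
```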