numpy.random.choice is super slow for choosing a single element (~100x slower) #11476
All of your time is being lost to the expensive conversion of `seen` to an array:

```
In [1]: import random   # numpy assumed already imported as np

In [2]: seen = [[1, 2, 3] * i for i in range(100)]

In [3]: %timeit random.choice(seen)
1.09 µs ± 6.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit np.random.choice(seen)
570 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: seen_a = np.array(seen)  # this is super slow, because numpy is not meant for this type of input

In [6]: %timeit np.random.choice(seen_a)
2.33 µs ± 68 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [7]: %timeit random.choice(seen_a)
1.13 µs ± 5.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
```
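(A minimal sketch of the workaround implied by the timings above, using the same jagged `seen` list: draw an index with NumPy's RNG and index the Python list directly, so no array conversion ever happens. The `default_rng` usage is illustrative, not part of the original comment.)

```python
import numpy as np

# Same jagged list of lists as in the transcript above.
seen = [[1, 2, 3] * i for i in range(100)]

# np.random.choice(seen) pays for converting `seen` to an (object) array
# on every call. Drawing an index instead leaves the Python list untouched:
rng = np.random.default_rng(0)
chosen = seen[rng.integers(len(seen))]  # O(1), no conversion

print(len(chosen))  # a multiple of 3, since element i has length 3*i
```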
I see. Thanks a lot for the insightful answer. I guess it would be helpful if NumPy threw a warning or something similar in such cases. But I now clearly understand the reason.
@karttikeya when would this give a warning? i.e. what size array?
There's talk of warning whenever numpy tries to convert a jagged array like that, and ends up choosing an object dtype.
@eric-wieser what do you mean by 'jagged'?
Incidentally, those "jagged" arrays are extremely common in computational geometry since collections of polygons invariably tend to have different sizes (edge counts, etc.). |
Just hit this too. I think NumPy should either fail cleanly on non-array inputs or offer acceptable performance. I volunteer to come up with a PR if you agree.
@jonashaag, I'm pretty sure in newer versions of numpy my example above would emit a DeprecationWarning, due to the accidental creation of a jagged array. So no further action is needed here. |
Sorry, I'm talking about flat lists:

```
>>> import time
>>> import numpy as np
>>> import random
>>> stuff = range(100_000)
>>> s = time.time(); np.random.choice(stuff); time.time() - s
0.06497526168823242
>>> s = time.time(); random.choice(stuff); time.time() - s
0.0004982948303222656
```

I think it's reasonable to try to use NumPy for selecting samples from Python lists, for example if you are using a NumPy RNG everywhere in your code and want to be able to work with non-arrays using the same RNG. And I'd expect NumPy not to perform unnecessarily badly.
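(A sketch of the contrast being described, with illustrative names: `np.random.choice(stuff)` converts the whole range to an array first, while passing the integer population size, or drawing an index yourself, skips that conversion. Per the NumPy docs, an integer argument `n` is sampled as if from `arange(n)`.)

```python
import numpy as np

stuff = range(100_000)  # a plain Python sequence, not an array

# np.random.choice(stuff) materializes `stuff` into an array on every call.
# Two alternatives for a single draw that avoid converting the sequence:
pick_a = np.random.choice(100_000)  # int argument: sampled as arange(n)
pick_b = stuff[np.random.default_rng().integers(len(stuff))]  # index draw

print(0 <= pick_a < 100_000, 0 <= pick_b < 100_000)
```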
I have a loop that reads a file line by line and processes it, which among other things picks a random sample from all the lines seen up to that point, stored in `seen`, where each element of `seen` is itself a list of varying size. Previously I was using

```
chosen = numpy.random.choice(seen)
```

with which the loop processing started at ~5000 it/sec and smoothly decreased to ~150 it/sec. Replacing it with

```
chosen = seen[random.randint(0, len(seen) - 1)]
```

gives me a consistent ~16500 it/sec, as measured by the `tqdm` package.

While this is such a simple tweak, I am bewildered by this annoying glitch and wasted a lot of time profiling my code trying to find the bottleneck. I wanted to check with the devs why this is happening and whether it can be corrected in future versions.
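(The same index-based workaround can be written with NumPy's newer `Generator` API. A minimal sketch with a hypothetical `process_line` step standing in for the reporter's file loop:)

```python
import numpy as np

rng = np.random.default_rng()
seen = []  # grows as the file is read; elements are lists of varying size

def process_line(line):
    """Hypothetical per-line step: store the tokens, then sample one
    previously seen line by index (O(1), no array conversion)."""
    seen.append(line.split())
    return seen[rng.integers(len(seen))]

sample = process_line("alpha beta gamma")
print(sample)
```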