Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

Closed
karttikeya opened this issue Jul 2, 2018 · 11 comments
Closed

Comments

@karttikeya
Copy link

I have a loop for reading a file line by line and processing it which among other things picks a random sample from all the lines seen till the point, say stored in seen[] , where each element of seen[] is itself a list with varying sizes.

Now, before I was using chosen = numpy.random.choice(seen) with which the loop processing started with ~5000 it/sec and smoothly decreased to ~150it/sec . Replacing that by chosen = seen[random.randint(0,len(seen)-1)] gives me a consistent ~16500it/sec as found out by the tqdm() package.

While this is such a simple tweak, I am bewildered by this annoying glitch and wasted a lot of time profiling my code trying to find the bottleneck. I wished to check with the devs why is this happening and if this can be corrected in the future versions.

@eric-wieser
Copy link
Member

Is seen a list or a numpy array? If it's a list, you should probably just use random.choice, not np.random.choice which will convert it to an array before doing anything else.

@eric-wieser
Copy link
Member

eric-wieser commented Jul 2, 2018

All of your time is being lost to the expensive conversion of seen into an array:

In [1]: import random
In [2]: seen = [[1, 2, 3] * i for i in range(100)]

In [3]: %timeit random.choice(seen)
1.09 µs ± 6.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit np.random.choice(seen)
570 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: seen_a = np.array(seen)  # this is super slow, because numpy is not meant for this type of input

In [7]: %timeit random.choice(seen_a)
1.13 µs ± 5.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit np.random.choice(seen_a)
2.33 µs ± 68 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@karttikeya
Copy link
Author

I see. Thanks a lot for the insightful answer. I guess it would be helpful if numpy throws a warning or something similar in such cases. But I now clearly understand the reason.

@anthonyshibitov
Copy link

@karttikeya when would this give a warning? i.e. what size array?

@eric-wieser
Copy link
Member

There's talk of warning whenever numpy tries to convert a jagged array like that, and ends up chosing dtype=object

@anthonyshibitov
Copy link

@eric-wieser what do you mean by 'jagged'?

@mattip
Copy link
Member

mattip commented Jul 3, 2018

seen is a list of lists, each internal list can have a different length, so if you would draw them you could get this which looks "jagged"

[ [1, 2, 3],
  [1, 2, 3, 1, 2, 3],
  [1, 2, 3],
  [1, 2, 3, 1, 2, 3],
  [1, 2, 3],
  [1, 2, 3],
]

@tylerjereddy
Copy link
Contributor

Incidentally, those "jagged" arrays are extremely common in computational geometry since collections of polygons invariably tend to have different sizes (edge counts, etc.).

@jonashaag
Copy link

Just hit this too. I think NumPy should be either failing to work on non-arrays, or have acceptable performance. I volunteer to come up with a PR if you agree.

@eric-wieser
Copy link
Member

eric-wieser commented Oct 24, 2020

@jonashaag, I'm pretty sure in newer versions of numpy my example above would emit a DeprecationWarning, due to the accidental creation of a jagged array. So no further action is needed here.

@jonashaag
Copy link

Sorry, I'm talking about flat lists:

>>> import time
>>> import numpy as np
>>> import random
>>> stuff = range(100_000)
>>> s = time.time(); np.random.choice(stuff); time.time() - s
0.06497526168823242
>>> s = time.time(); random.choice(stuff); time.time() - s
0.0004982948303222656

I think it's reasonable to try to use NumPy for selecting samples from Python lists, for example if you are using a NumPy RNG everywhere in the code and want to be able to work with non-arrays using the same RNG. And I'd expect NumPy to not perform unnecessarily bad.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants