numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

karttikeya · 2018-07-02T05:58:38Z

I have a loop for reading a file line by line and processing it which among other things picks a random sample from all the lines seen till the point, say stored in seen[] , where each element of seen[] is itself a list with varying sizes.

Now, before I was using chosen = numpy.random.choice(seen) with which the loop processing started with ~5000 it/sec and smoothly decreased to ~150it/sec . Replacing that by chosen = seen[random.randint(0,len(seen)-1)] gives me a consistent ~16500it/sec as found out by the tqdm() package.

While this is such a simple tweak, I am bewildered by this annoying glitch and wasted a lot of time profiling my code trying to find the bottleneck. I wished to check with the devs why is this happening and if this can be corrected in the future versions.

The text was updated successfully, but these errors were encountered:

eric-wieser · 2018-07-02T06:22:38Z

Is seen a list or a numpy array? If it's a list, you should probably just use random.choice, not np.random.choice which will convert it to an array before doing anything else.

eric-wieser · 2018-07-02T06:26:29Z

All of your time is being lost to the expensive conversion of seen into an array:

In [1]: import random
In [2]: seen = [[1, 2, 3] * i for i in range(100)]

In [3]: %timeit random.choice(seen)
1.09 µs ± 6.94 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit np.random.choice(seen)
570 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [5]: seen_a = np.array(seen)  # this is super slow, because numpy is not meant for this type of input

In [7]: %timeit random.choice(seen_a)
1.13 µs ± 5.19 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [6]: %timeit np.random.choice(seen_a)
2.33 µs ± 68 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

karttikeya · 2018-07-02T21:08:45Z

I see. Thanks a lot for the insightful answer. I guess it would be helpful if numpy throws a warning or something similar in such cases. But I now clearly understand the reason.

anthonyshibitov · 2018-07-03T06:45:46Z

@karttikeya when would this give a warning? i.e. what size array?

eric-wieser · 2018-07-03T06:58:20Z

There's talk of warning whenever numpy tries to convert a jagged array like that, and ends up chosing dtype=object

anthonyshibitov · 2018-07-03T07:09:13Z

@eric-wieser what do you mean by 'jagged'?

mattip · 2018-07-03T16:42:37Z

seen is a list of lists, each internal list can have a different length, so if you would draw them you could get this which looks "jagged"

[ [1, 2, 3],
  [1, 2, 3, 1, 2, 3],
  [1, 2, 3],
  [1, 2, 3, 1, 2, 3],
  [1, 2, 3],
  [1, 2, 3],
]

tylerjereddy · 2018-07-14T13:20:24Z

Incidentally, those "jagged" arrays are extremely common in computational geometry since collections of polygons invariably tend to have different sizes (edge counts, etc.).

jonashaag · 2020-10-24T14:44:47Z

Just hit this too. I think NumPy should be either failing to work on non-arrays, or have acceptable performance. I volunteer to come up with a PR if you agree.

eric-wieser · 2020-10-24T19:01:14Z

@jonashaag, I'm pretty sure in newer versions of numpy my example above would emit a DeprecationWarning, due to the accidental creation of a jagged array. So no further action is needed here.

jonashaag · 2020-10-24T19:59:13Z

Sorry, I'm talking about flat lists:

>>> import time
>>> import numpy as np
>>> import random
>>> stuff = range(100_000)
>>> s = time.time(); np.random.choice(stuff); time.time() - s
0.06497526168823242
>>> s = time.time(); random.choice(stuff); time.time() - s
0.0004982948303222656

I think it's reasonable to try to use NumPy for selecting samples from Python lists, for example if you are using a NumPy RNG everywhere in the code and want to be able to work with non-arrays using the same RNG. And I'd expect NumPy to not perform unnecessarily bad.

mattip added component: numpy.random component: benchmarks labels Jul 19, 2018

mattip closed this as completed Oct 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

karttikeya commented Jul 2, 2018

eric-wieser commented Jul 2, 2018

eric-wieser commented Jul 2, 2018 •

edited

karttikeya commented Jul 2, 2018

anthonyshibitov commented Jul 3, 2018

eric-wieser commented Jul 3, 2018

anthonyshibitov commented Jul 3, 2018

mattip commented Jul 3, 2018

tylerjereddy commented Jul 14, 2018

jonashaag commented Oct 24, 2020

eric-wieser commented Oct 24, 2020 •

edited

jonashaag commented Oct 24, 2020

numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

numpy.random.choice is super slow for choosing a single element (~100x slower) #11476

Comments

karttikeya commented Jul 2, 2018

eric-wieser commented Jul 2, 2018

eric-wieser commented Jul 2, 2018 • edited

karttikeya commented Jul 2, 2018

anthonyshibitov commented Jul 3, 2018

eric-wieser commented Jul 3, 2018

anthonyshibitov commented Jul 3, 2018

mattip commented Jul 3, 2018

tylerjereddy commented Jul 14, 2018

jonashaag commented Oct 24, 2020

eric-wieser commented Oct 24, 2020 • edited

jonashaag commented Oct 24, 2020

eric-wieser commented Jul 2, 2018 •

edited

eric-wieser commented Oct 24, 2020 •

edited