In [1]:
_="""
I'm considering a code simplification: removing the hash-collision slowdown prevention in FrozenHashIndex.
What's the worst that could happen?

Scenario it averts:
- Two distinct values have the same hash
- One value is held by many objects, the other by only a few
- When we look up the "few-object" value, we find it by hash and then have to crawl past every object belonging to the "many-object" value.

Question: How bad is that scenario in a typical case with a few million objects?
The data is in a numpy array. Can a vectorized (SIMD) scan do it in nanoseconds, or at most a few microseconds?

To refine the question further:
 - We have a large numpy array of objects, grouped by value
 - How fast can we find the group of objects with the same value?
"""

In [2]:
import numpy as np

In [4]:
arr = np.array([1]*(10**6)+['abc']*(10**2), dtype='O')
arr

array([1, 1, 1, ..., 'abc', 'abc', 'abc'], dtype=object)

In [11]:
%%timeit -n 5 -r 5
np.where(arr == 'abc')
# Yeowch! Milliseconds, not microseconds. Not that!

14.1 ms ± 653 µs per loop (mean ± std. dev. of 5 runs, 5 loops each)
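Because the array is grouped by value, the indices returned by the comparison form one contiguous run. A small sketch, assuming that grouping holds, of turning the match into a slice: it still pays for the full elementwise scan, so it does not address the timing above, but it shows what "finding the group" amounts to. The names `idx`, `start`, `stop`, and `group` are illustrative.

```python
import numpy as np

arr = np.array([1] * (10**6) + ['abc'] * (10**2), dtype='O')

# Elementwise comparison over the object array, then the matching positions.
idx = np.flatnonzero(arr == 'abc')

# Grouped by value, so the matches are a single run: first and last index suffice.
start, stop = int(idx[0]), int(idx[-1]) + 1

# Basic slicing returns a view, so no objects are copied.
group = arr[start:stop]
print(start, stop, len(group))
```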


In [64]:
%%timeit -n 5 -r 5
for index, x in np.ndenumerate(arr):
    if x == 'abc':
        break
# Oh no no no. The element-by-element Python loop is more than 10x slower still.
# OK, we're keeping the collision handling as is.

194 ms ± 2.67 ms per loop (mean ± std. dev. of 5 runs, 5 loops each)
