-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
Description
A potential performance issue relating to hash collisions from distinct NaNs was recently reported and fixed in core Python. In brief, the fact that all NaNs hash to 0
makes it easy to accidentally invoke quadratic-time behaviour from NaNs going into a set or dictionary. The fix involved changing the hash of a float
NaN to match the usual object hash.
The core issue affects NumPy, too (and in fact one of the examples given in the Python bug report to demonstrate the issue used np.float64
rather than float
); NumPy may want to consider changing the hash of floating-point NaNs, both to address the potential performance issue, and to remain compatible with Python.
Reproducing code example:
The following code takes several hours to execute on my machine:
>>> import collections, numpy as np
>>> x = np.full(1000000, np.nan)
>>> c = collections.Counter(x)
The cause is that (a) the float64
NaN objects realised from the array all have distinct id
, and so are considered unequal for the purposes of dict and set membership, together with (b) all of those NaNs have the same hash value of 0
. Then we get a perfect storm of hash collisions while building the dictionary underlying the Counter
object.
NumPy/Python version information:
>>> import sys, numpy; print(numpy.__version__, sys.version)
1.20.2 3.9.4 (default, Apr 6 2021, 11:23:37)
[Clang 11.0.3 (clang-1103.0.32.62)]