Skip to content

Possible excessive hash collisions from hash of nanΒ #18833

@mdickinson

Description

@mdickinson

A potential performance issue relating to hash collisions from distinct NaNs was recently reported and fixed in core Python. In brief, the fact that all NaNs hash to 0 makes it easy to accidentally invoke quadratic-time behaviour from NaNs going into a set or dictionary. The fix involved changing the hash of a float NaN to match the usual object hash.

The core issue affects NumPy, too (and in fact one of the examples given in the Python bug report to demonstrate the issue used np.float64 rather than float); NumPy may want to consider changing the hash of floating-point NaNs, both to address the potential performance issue, and to remain compatible with Python.

Reproducing code example:

The following code takes several hours to execute on my machine:

>>> import collections, numpy as np
>>> x = np.full(1000000, np.nan)
>>> c = collections.Counter(x)

The cause is that (a) the float64 NaN objects realised from the array all have distinct id, and so are considered unequal for the purposes of dict and set membership, together with (b) all of those NaNs have the same hash value of 0. Then we get a perfect storm of hash collisions while building the dictionary underlying the Counter object.

NumPy/Python version information:

>>> import sys, numpy; print(numpy.__version__, sys.version)
1.20.2 3.9.4 (default, Apr  6 2021, 11:23:37) 
[Clang 11.0.3 (clang-1103.0.32.62)]

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions