np.nan appears multiple times in sets #9358

Naereen · 2017-07-04T11:46:39Z

Hi,
Here is a weird example, on Ubuntu 17.04 with Python 3.5.3 and Numpy 1.13.0:

In [0]: import numpy as np
In [1]: set([np.nan] * 5)
Out[1]: {nan}                        # OK
In [2]: set(np.array([np.nan] * 5))    
Out[2]: {nan, nan, nan, nan, nan}    # <--- ???
In [3]: set([10] * 5)
Out[3]: {10}
In [4]: set(np.array([10] * 5))    
Out[4]: {10}                         # only for np.nan ??

I don't understand how {nan, nan, nan, nan, nan} is even possible. The array was constructed with np.nan replicated 5 times (and the same happens with np.full(5, np.nan)).

I searched on StackOverflow and on the documentation, but cannot find any explanation for this weird behavior.
Thanks in advance if I'm just misunderstanding something!
But in case it's a bug, I can try to help (if a maintainer points me in the right direction).

pv · 2017-07-04T11:48:19Z

nan != nan as per IEEE

Naereen · 2017-07-04T11:48:59Z

OK, but why set([np.nan] * 5) = {nan} or set([np.nan, np.nan]) = {nan} then?

eric-wieser · 2017-07-04T11:50:18Z

Because:

>>> x = [float('nan')] * 5   # np.nan is just defined as `float('nan')`
>>> x[0] is x[1]
True
>>> set(x)
{nan}

>>> y = [float('nan') for i in range(5)]  # this is essentially what np.array ends up doing
>>> y[0] is y[1]
False
>>> set(y)
{nan, nan, nan, nan, nan}

Also, the above clearly demonstrates that this is not under numpy's control; it is the behavior of Python itself.

Naereen · 2017-07-04T11:54:54Z

OK, sorry for reporting this here.
I still don't get the difference in behavior, but alright.

Sorry for wasting your time.

eric-wieser · 2017-07-04T11:56:02Z

In principle, this could be fixed in CPython by having float('nan') and float(float('nan')) both return the same nan object. In practice, though, this would be incorrect: there are 16,777,214 different NaN bit patterns in float32 alone (and 2^53 - 2 in the 64-bit doubles behind Python floats), and this would make them all the same.
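To make the "many different NaNs" point concrete, here is a rough sketch that builds two NaNs with different bit patterns and counts the NaN encodings for both float widths. The helper names are made up for illustration, and it assumes the platform uses IEEE 754 doubles for Python floats (true on all common platforms).

```python
import math
import struct

def double_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 double (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def bits_from_double(value: float) -> int:
    """Return the 64-bit pattern behind a Python float (hypothetical helper)."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

# Two quiet NaNs that differ only in their payload bits.
quiet_nan = double_from_bits(0x7FF8000000000000)
payload_nan = double_from_bits(0x7FF8000000000001)

print(math.isnan(quiet_nan), math.isnan(payload_nan))
print(hex(bits_from_double(quiet_nan)), hex(bits_from_double(payload_nan)))

# A NaN has an all-ones exponent and a nonzero mantissa, with either sign bit.
float32_nans = 2 * (2**23 - 1)   # 23 mantissa bits
float64_nans = 2 * (2**52 - 1)   # 52 mantissa bits
print(float32_nans, float64_nans)
```

Collapsing every NaN into one object would throw away those payload bits, which is the objection raised above.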

seberg · 2017-07-04T12:25:41Z

I doubt you can do that. In principle, I think you could also raise an error when you try to hash a NaN, but I am not sure that is any better either. Anyway, it is an old issue, which I believe the Python developers have discussed more than once. Any serious new discussion would probably have to be on python-ideas.

@Naereen the difference is that Python optimizes the equality check by first doing an `is` check in many cases. For the purposes of dicts and sets (and some other cases, such as comparison of tuples/lists), the same object will therefore always be considered equal to itself, which is wrong for NaN.
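A minimal sketch of that identity shortcut, using only built-in floats (no numpy needed). The variable names are just for illustration:

```python
# Containers treat two elements as "the same" when `x is y or x == y`,
# so identity is checked before the IEEE 754 comparison is ever consulted.
nan_a = float("nan")
nan_b = float("nan")          # a second, distinct NaN object

shared = [nan_a] * 5          # five references to one object
distinct = [nan_a, nan_b]     # two separate objects

print(nan_a == nan_a)         # False: NaN never equals itself by value
print(nan_a in shared)        # True: found by identity
print(nan_b in shared)        # False: different object, and == fails
print(len(set(shared)))       # 1
print(len(set(distinct)))     # 2
```

This is the same split as in the original report: `[np.nan] * 5` repeats one object, while iterating over an array produces a fresh float object for each element.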

Naereen · 2017-07-04T12:28:37Z

@seberg OK I get it.

Thanks for your (incredibly) quick response guys!

scott-lydon · 2020-04-19T19:31:26Z

We can add this to the list of why I 'low-key' hate python sometimes.

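Here is my current workaround.

import math

# `row` and `mySet` come from the surrounding code (not shown).
# Store every NaN as the string 'nan' so the set keeps a single entry.
if math.isnan(row[3]):
    mySet.add('nan')
else:
    mySet.add(row[3])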

seberg · 2020-04-19T20:05:38Z

You can even use np.nan itself, that way you are sure to always add the same NaN and things happen to work by definition (or at least by definition for all practical purposes).
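As a rough sketch of that idea: map every NaN to one shared object before adding it to the set. `CANONICAL_NAN` and `add_value` are hypothetical names, and a plain `float("nan")` stands in for np.nan here so the example runs without numpy; np.nan could be used in its place.

```python
import math

# Hypothetical module-level constant; np.nan could play the same role.
CANONICAL_NAN = float("nan")

def add_value(target: set, value) -> None:
    """Add value to target, replacing any float NaN with one shared object."""
    if isinstance(value, float) and math.isnan(value):
        value = CANONICAL_NAN
    target.add(value)

seen = set()
for v in [1.0, float("nan"), float("nan"), 2.0, float("nan")]:
    add_value(seen, v)

print(len(seen))  # 3: 1.0, 2.0 and a single NaN
```

Unlike storing the string 'nan', this keeps the set's elements numeric. It relies on the identity check described earlier in the thread, so it works for all practical purposes rather than by any guarantee about NaN equality.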

charris · 2020-04-19T20:05:48Z

Might be worth reporting this upstream to Python.

NVM, this seems to be a case of np.nan not being the same as python nan.

In [2]: set((np.nan, np.nan))                                                   
Out[2]: {nan}

seberg · 2020-04-19T21:12:14Z

@charris I think this is one of the things python-ideas probably discusses every 2 years, and nobody really cares enough, or just accepts it as "well you work with NaN expect strangeness". I expect there is a python bug open somewhere.

Unless we want to go as far as digging up the old discussions and writing a PEP with a solution (whatever that is), I doubt it can go anywhere. There is probably no good solution: disabling hash(np.nan) would break legitimate but strange uses, and anything else would require a notion of "equivalent within sets but not equal". So, to be honest, even with serious effort it is likely that there simply is no solution.

eric-wieser closed this as completed Jul 4, 2017

eric-wieser mentioned this issue Jun 8, 2018

Testing.assert_equals bug with nan and sets #11272

Closed

caspervdw mentioned this issue Mar 14, 2020

ENH: Implement hash, eq, neq pygeos/pygeos#102

Merged

alexhlim mentioned this issue Jul 31, 2020

BUG: Index.get_indexer_non_unique misbehaves with multiple nan pandas-dev/pandas#35498

Merged
