np.nan appears multiple times in sets #9358

Closed
Naereen opened this issue Jul 4, 2017 · 11 comments

@Naereen
Contributor

Naereen commented Jul 4, 2017

Hi,
Here is a weird example, on Ubuntu 17.04 with Python 3.5.3 and Numpy 1.13.0:

In [0]: import numpy as np
In [1]: set([np.nan] * 5)
Out[1]: {nan}                        # OK
In [2]: set(np.array([np.nan] * 5))    
Out[2]: {nan, nan, nan, nan, nan}    # <--- ???
In [3]: set([10] * 5)
Out[3]: {10}
In [4]: set(np.array([10] * 5))    
Out[4]: {10}                         # only for np.nan ??

I don't understand how {nan,nan,nan,nan,nan} is even possible. The array was constructed with np.nan replicated 5 times (and the same works with np.full(5, np.nan)).

I searched on StackOverflow and in the documentation, but cannot find any explanation for this weird behavior.
Thanks in advance if I'm just misunderstanding something!
But in case it's a bug, I can try to help (if a maintainer points me in the right direction).

@pv
Member

pv commented Jul 4, 2017 via email

@Naereen
Contributor Author

Naereen commented Jul 4, 2017

OK, but then why does set([np.nan] * 5) give {nan}, and set([np.nan, np.nan]) give {nan} as well?

@eric-wieser
Member

eric-wieser commented Jul 4, 2017

Because:

>>> x = [float('nan')] * 5   # np.nan is just defined as `float('nan')`
>>> x[0] is x[1]
True
>>> set(x)
{nan}

>>> y = [float('nan') for i in range(5)]  # this is essentially what np.array ends up doing
>>> y[0] is y[1]
False
>>> set(y)
{nan, nan, nan, nan, nan}

Also, the above clearly demonstrates that this is not under NumPy's control; it is the behavior of Python itself.

@Naereen
Contributor Author

Naereen commented Jul 4, 2017

OK, sorry for reporting this here.
I still don't get the difference in behavior, but alright.

Sorry for taking up your time.

@eric-wieser
Member

eric-wieser commented Jul 4, 2017

In principle, this could be fixed in CPython by having float('nan') and float(float('nan')) both return the same nan object. In practice though, this would be incorrect: there are 16,777,214 distinct NaN values in float32 alone (and far more in the float64 that Python uses), and this would make them all the same.
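To make the count concrete: a NaN is any IEEE 754 value whose exponent bits are all ones and whose mantissa is non-zero, with either sign, so the totals follow from the field widths. The sketch below computes them and builds two NaNs with different payloads. The helper name float_from_bits is hypothetical, and whether a payload survives a round trip through a Python float is platform-dependent, so the code only checks that both values are NaN.

```python
import math
import struct

# NaN encodings: exponent all ones, non-zero mantissa, either sign bit.
float32_nans = 2 * (2**23 - 1)  # 23 mantissa bits in single precision
float64_nans = 2 * (2**52 - 1)  # 52 mantissa bits in double precision


def float_from_bits(bits: int) -> float:
    """Hypothetical helper: reinterpret a 64-bit integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


# Two quiet NaNs that differ only in their payload bits.
plain_nan = float_from_bits(0x7FF8000000000000)
payload_nan = float_from_bits(0x7FF8000000000001)

print(float32_nans, float64_nans)
print(math.isnan(plain_nan), math.isnan(payload_nan))
```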

@seberg
Member

seberg commented Jul 4, 2017

I doubt you can do that. In principle, I think you could also raise an error when you try to hash a NaN, but I am not sure that is any better. Anyway, it is an old issue, which I believe the Python developers have discussed more than once. Any serious new discussion would probably have to happen on python-ideas.

@Naereen the difference is that Python optimizes the equality check by first doing an identity (is) check in many cases, so for the purposes of dicts and sets (and some other cases, such as comparison of tuples/lists) the same object is always considered equal to itself, which is wrong for NaN.
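The identity shortcut described above can be observed with plain Python floats, no NumPy required. This is a small sketch: containers behave roughly as if they tested `x is y or x == y`, so one NaN object is "equal" to itself inside a container, while two separate NaN objects are not.

```python
nan_a = float("nan")
nan_b = float("nan")  # a second, distinct NaN object

# Direct comparison follows IEEE 754: NaN is never equal, even to itself.
self_equal = nan_a == nan_a

# Containers check identity first, then fall back to ==.
same_in_list = nan_a in [nan_a]          # identity matches
other_in_list = nan_b in [nan_a]         # no identity, and == fails
lists_equal_same = [nan_a] == [nan_a]    # element-wise identity matches
lists_equal_other = [nan_a] == [nan_b]   # element-wise comparison fails

# Sets deduplicate the same object but keep distinct NaN objects apart.
set_same = {nan_a, nan_a}
set_distinct = {nan_a, nan_b}

print(self_equal, same_in_list, other_in_list)
print(len(set_same), len(set_distinct))
```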

@Naereen
Contributor Author

Naereen commented Jul 4, 2017

@seberg OK I get it.

Thanks for your (incredibly) quick response guys!

@scott-lydon

scott-lydon commented Apr 19, 2020

We can add this to the list of why I 'low-key' hate python sometimes.

Here is my current workaround.

import math

if math.isnan(row[3]): 
    mySet.add('nan')
else: 
    mySet.add(row[3])

@seberg
Member

seberg commented Apr 19, 2020

You can even use np.nan itself; that way you are sure to always add the same NaN object, and things happen to work by definition (or at least by definition for all practical purposes).
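A minimal sketch of that idea, using only the standard library: every NaN is mapped to one shared object before it goes into the set, so the identity check deduplicates it. The names CANONICAL_NAN, canonicalize, and rows are hypothetical; with NumPy installed, np.nan could serve as the shared object instead.

```python
import math

# Hypothetical shared sentinel; np.nan could play this role.
CANONICAL_NAN = float("nan")


def canonicalize(value):
    """Return the shared NaN object for any NaN input, else the value itself."""
    try:
        is_nan = math.isnan(value)
    except TypeError:
        # Non-numeric values (e.g. strings) cannot be NaN.
        is_nan = False
    return CANONICAL_NAN if is_nan else value


# Each float("nan") call below creates a separate NaN object.
rows = [float("nan"), 1.5, float("nan"), 1.5, "text", float("nan")]
unique_values = {canonicalize(v) for v in rows}

print(len(unique_values))
```

Unlike replacing NaN with the string 'nan', this keeps the set's elements numeric, so later calls to math.isnan on them still work.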

@charris
Member

charris commented Apr 19, 2020

Might be worth reporting this upstream to Python.

NVM, this seems to be a case of np.nan not being the same as python nan.

In [2]: set((np.nan, np.nan))                                                   
Out[2]: {nan}

@seberg
Member

seberg commented Apr 19, 2020

@charris I think this is one of the things python-ideas probably discusses every 2 years, and nobody really cares enough, or people just accept it as "well, you work with NaN, expect strangeness". I expect there is a Python bug open somewhere.

Unless we want to go as far as digging up some of those discussions and writing a PEP with a solution (whatever that is), I doubt it can go anywhere. And there is probably no good solution, since disabling hash(np.nan) would break legitimate but strange uses, and otherwise you would require a notion of "equivalent within sets but not equal". So to be honest, I doubt this can go anywhere without serious effort, and even then it is likely there simply is no solution.
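The notion of "equivalent within sets but not equal" can be approximated in user code today by wrapping values in a key type with its own equality and hash. This is only an illustration of the concept under that assumption, not a proposal from the thread; the class name NanEqualKey is hypothetical.

```python
import math


class NanEqualKey:
    """Hypothetical wrapper whose NaNs all hash and compare equal to each other."""

    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def _is_nan(self):
        return isinstance(self.value, float) and math.isnan(self.value)

    def __eq__(self, other):
        if not isinstance(other, NanEqualKey):
            return NotImplemented
        if self._is_nan() and other._is_nan():
            return True  # set-level equivalence, unlike float ==
        return self.value == other.value

    def __hash__(self):
        # All NaNs share one hash so they land in the same bucket.
        return hash("nan-key") if self._is_nan() else hash(self.value)


values = [float("nan"), float("nan"), 1.0, 1.0]
unique_keys = {NanEqualKey(v) for v in values}
unwrapped = [key.value for key in unique_keys]

print(len(unique_keys))
```

The wrapped floats themselves keep IEEE 754 semantics: unwrapping a NaN and comparing it with == still returns False.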
