pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119
import numpy as np
import pandas.core.algorithms as algos
algos.isin([np.nan], [np.nan])
However, I would expect the result to be
which is the case for example for
I haven't debugged it yet, but the issue is probably similar to #21866 (with the difference that there is no special handling in
The right fix would probably be to fix the behavior of the hash table rather than to implement workarounds.
I have used
However, to be precise, my example corresponds to:
Or using float('nan')
Probably, this is the case because
Not a very stable approach in any case, though...
Also, I would prefer to be consistent with the sane pandas float64 behavior rather than with the somewhat erratic Python set behavior.
NaN is never equal to itself. The fact that `dictionary[my_nan]` gives a different result to `dictionary[float('nan')]` is because Python's hashtable first checks for object identity. If this doesn't turn up anything, then it falls back to equality. `np.nan` happens to be a singleton, but IIRC, this isn't part of the documented API. So I guess my recommendation would be to not try to look up NaNs :)…
On Mon, Jul 30, 2018 at 8:59 AM gfyoung wrote: The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself. cc @jreback: do we have precedent for treating np.nan as equal to itself?
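The identity-then-equality lookup described above can be reproduced in plain Python. This is only a minimal sketch: the names `my_nan`, `other_nan` and `lookup` are made up for illustration, and it exercises a built-in `dict`, not pandas' hash table.

```python
# Illustrative names only; this uses a plain Python dict, not pandas internals.
my_nan = float("nan")      # one particular NaN object
other_nan = float("nan")   # a second, distinct NaN object

# NaN never compares equal, not even to itself.
nan_equals_itself = (my_nan == my_nan)

lookup = {my_nan: "found"}

# Same object as the stored key: the identity check succeeds
# before equality is ever consulted.
same_object_hit = my_nan in lookup

# Different object with a NaN value: identity fails, and the
# fallback equality check fails as well.
other_object_hit = other_nan in lookup

print(nan_equals_itself, same_object_hit, other_object_hit)
# prints: False True False
```

So whether a NaN is "found" depends on which object is passed in, not on its value.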
In order to have a consistent behavior, a hash-map/hash-set requires (among other things) that the relation
However, for floats with ieee-754-standard
There are multiple ways to extend ieee-754-
Pandas opted for the second. Thus the behavior of
So having a different behavior for nans as Python-objects is at least surprising.
There are actually two different (even if somewhat related) issues:
So I assume there is a bug somewhere, which results in different nan-objects arriving in the hash-table, see #22160.
The Python way is to say that nans are not important enough to deserve special treatment, for which all other objects/classes would pay in performance.
I'm not sure pandas can say nans are not important enough (it is probably the most frequently used value:)), so adding special handling for nans could be worth it, in order to be consistent with the behavior of
but this is obviously not my decision to make.
My proposal (i.e. a starting point for discussion) is in PR #22207: if both objects are floats, check whether they are both nans, the same way it is done for float64. The hash function in use already has the necessary behavior, so nothing needs to change there.
It doesn't prevent users from defining custom classes that behave like nan and shooting themselves in the foot; it is probably too much to ask to handle such cases as well.
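The comparison rule in this proposal can be sketched in pure Python. This is not the code from the PR: `NanAwareKey` and `isin_sketch` are hypothetical names, and the wrapper forces a shared hash for NaNs itself because this sketch does not rely on how any particular Python version hashes NaN.

```python
import math


class NanAwareKey:
    """Hypothetical wrapper: all float NaNs hash and compare alike."""

    __slots__ = ("value",)

    def __init__(self, value):
        self.value = value

    def _is_nan(self):
        return isinstance(self.value, float) and math.isnan(self.value)

    def __hash__(self):
        # Give every NaN the same hash so they land in the same bucket.
        return 0 if self._is_nan() else hash(self.value)

    def __eq__(self, other):
        if not isinstance(other, NanAwareKey):
            return NotImplemented
        # The proposed rule: two float NaNs count as equal.
        if self._is_nan() and other._is_nan():
            return True
        # Everything else keeps its ordinary equality.
        return self.value == other.value


def isin_sketch(values, candidates):
    """Return, for each value, whether it occurs in candidates."""
    table = {NanAwareKey(c) for c in candidates}
    return [NanAwareKey(v) in table for v in values]


result = isin_sketch([float("nan"), 1.0, "a"], [float("nan"), "a"])
print(result)  # prints: [True, False, True]
```

The extra check only runs for floats, so other hashable objects keep their ordinary equality semantics.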
As for performance, I could not see any regression (I added some additional performance tests myself); the results are:
Btw, the (new) test above shows how easy it is to trigger the worst-case behavior of
My takeaways from it:
This would also fix #22148.
Probably worth mentioning:
IIRC the user could create another
Enforcing a singleton would probably be a cleaner solution than trying to handle this esoteric possibility in the hash map.