pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

realead · 2018-07-29T21:06:54Z

Example

import numpy as np
import pandas.core.algorithms as algos
algos.isin([np.nan], [np.nan])

Problem description

results in

array([False], dtype=bool)

Expected Output

However, I would expect the result to be

array([True], dtype=bool)

which is the case for example for

algos.isin(np.array([np.nan]), [np.nan])

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.28.3
numpy: 1.13.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

realead · 2018-07-29T21:13:11Z

I didn't debug it yet, but the issue is probably similar to #21866 (with the difference, that there is no special handling in isin() in the case of Python-objects as there is in the case of float64).

The right fix would be probably to fix the behavior of the hash-table and not to try to implement workarounds.

TomAugspurger · 2018-07-30T11:14:09Z

Does this result in a higher-level bug for you? This is the same behavior as Python

In [1]: float('nan') in {float('nan')}
Out[1]: False

realead · 2018-07-30T11:43:28Z

@TomAugspurger

I have used isin directly. I guess there are some workarounds in place for Series, which catch this corner case, when Series.isin(...) is used - at least I was not able to reproduce this error without much trying.

However, to be precise, my example corresponds to:

>>> import numpy as np
>>> np.nan in set([np.nan])
True

Or using float('nan')

>>> my_nan=float("nan")
>>> my_nan in {my_nan}
True

Probably, this is the case because == uses is at first and only than the equal operator of floats - in your example two different nan-objects are created.

Not very stable approach in anycase though...

Also, I would prefer to be consistent with the sane pandas float64-behavior than with to some degree irratic python-set behavior.

gfyoung · 2018-07-30T13:59:40Z

The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself.

cc @jreback : do we have precedent for treating np.nan as equal to itself?

TomAugspurger · 2018-07-30T14:04:40Z

NaN is never equal to itself. The fact that `dictionary[my_nan]` gives a different result to `dictionary[float('nan')]` is because Python's hashtable first checks for object identity. If this doesn't turn up anything, then it falls back to equality. `np.nan` happens to be a singleton, but IIRC, this isn't part of the documented API. So I guess my recommendation would be to not try to look up NaNs :)

…

On Mon, Jul 30, 2018 at 8:59 AM gfyoung ***@***.***> wrote: The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself. cc @jreback <https://github.com/jreback> : do we have precedent for treating np.nan as equal to itself? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#22119 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABQHIguBHXPNv7lVWgDoUL4U9BkNLNf2ks5uLxFTgaJpZM4VlkUO> .

realead · 2018-07-30T14:10:17Z

@gfyoung pd.unique() treats all nans as being the same:

>>> import pandas as pd
>>> pd.unique([np.nan, -np.nan, float('nan'), -float('nan')])
array([nan])

realead · 2018-07-30T20:37:01Z

In order to have a consistent behavior, a hash-map/hash-set requires (among other things) that the relation == is an equivalence relation. That means in particular that x==x holds for every x.

However, for floats with ieee-754-standard == isn't an equivalence relation, because nan!=nan, thus in order to get meaningful behavior with a hash-map/set either it must be ensured, that nans are not used or the relation == must be redefined.

There are multiple ways to extend ieee-754-==-relation to an equivalence relation:

compare nans bitwise (every nan becomes it own equivalence-class)
all nans are "the same", i.e. from the same equivalence-class
anything between 1. and 2.

Pandas opted for the second. Thus the behavior of np.unique():

>>> pd.unique([np.nan, -np.nan, float('nan'), -float('nan')])
 array([nan])

the isin():

>>> algos.isin(np.array([np.nan]), [-np.nan]))
array([ True], dtype=bool)

So having a different behavior for nans as Python-objects is at least surprising.

realead · 2018-08-01T16:01:52Z

There are actually two different (even if somewhat related) issues:

algos.isin(np.array([np.nan]), [np.nan]) with the same object np.nan.

There were discussions some time ago in the Python community https://bugs.python.org/issue4296#msg75735 and it has been settled that a in {a} should be true even for nans.

The way the hash-table is implemented: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/klib/khash_python.h#L45 - it uses PyObject_RichCompareBool which does short-circuit for the same object.

So I assume there is a bug somewhere, which results in different nan-objects arriving in the hash-table, see #22160.

algos.isin([float('nan')], [float('nan')]) with different nan-objects

The Python-way is to say, that nans are not important enough to have a special treatment for them, for which all other objects/classes would pay in performance.

I'm not sure pandas can say nans are not important enough (it is probably the most frequently used value:)), so adding special handling for nans could be worth it, in order to be consistent with the behavior of

algos.isin(np.array([float('nan')]), np.array([float('nan')]))

but this is obviously not my decision to make.

realead · 2018-08-03T11:29:03Z

Actually, #22148 is at its core the 2. part of this issue: Do we consider

pd.Series([float('nan')], dtype='object').isin([np.nan])

ie. for different nan-objects to be True or False (right now it is False)?

realead · 2018-08-05T21:25:09Z

My proposal (i.e. starting point for a discussion) is in PR #22207 : If both objects are floats, then check whether they are both nans - the same way it is done for float64. The used hash-function has the necessary behavior, so there is no need to change something.

It doesn't prevent the user to define custom classes with behaviors similar to nan and shoot themselves in the foot - it is probably too much ask for handling such cases as well.

As to performance, I could not see any worsening (added some additional performance tests myself), the results are:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms
....                                                           
       before           after         ratio
     [d30c4a06]       [e583a0c9]
-      2.14±0.04s         672±10μs     0.00  series_methods.IsInForObjects.time_isin_nans

Btw, the )new) test above shows, how easy it is to trigger the wost case behavior of O(n^2) for the current implementation: just hash different nan-objects. They all have hash-value = 0 and all are not equal to each other, thus O(n) is needed to add an element into the hashtable with n keys.

My take aways from it:

the performance penalty isn't severe (if one calculates a hash, one also can stomach to check whether a value is nan)
The current implementation has a performance problem for the cases of different nan-objects.
It will not be easy to achieve consistency for the whole project, the different way numpy treats the "sameness" of nans is not easy to overcome. However, the consistency in handling float64 and float-object for this very method is feasible.

This would also fix #22148.

realead · 2018-08-10T12:46:23Z

Probably worth mentioning: pd.NaT is such an object with pd.NaT == pd.NaT being False.

However, pd.NaT is a singleton and thus the shorcut-evaluation pd.NaT is pd.NaT in the hash-map is enough to save us from troubles.

IIRC the user could create another pd.NaT, but then well, if one messes with private stuff...

Enforcing singleton would be probably a cleaner solution than trying to fix this esoteric possibility in the hash-map.

gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jul 30, 2018

WillAyd mentioned this issue Aug 1, 2018

Reading nans in object column #22148

Closed

realead mentioned this issue Aug 1, 2018

unexpected behavior of pd.core.algorithms._ensure_data() #22160

Closed

realead mentioned this issue Aug 5, 2018

BUG: treat nan-objects the same way float64-nans are treated - all na… #22207

Merged

5 tasks

jreback added this to the 0.24.0 milestone Aug 8, 2018

jreback closed this as completed in #22207 Aug 9, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

realead commented Jul 29, 2018

INSTALLED VERSIONS

realead commented Jul 29, 2018

TomAugspurger commented Jul 30, 2018

realead commented Jul 30, 2018 •

edited

gfyoung commented Jul 30, 2018

TomAugspurger commented Jul 30, 2018 via email

realead commented Jul 30, 2018

realead commented Jul 30, 2018

realead commented Aug 1, 2018 •

edited

realead commented Aug 3, 2018

realead commented Aug 5, 2018 •

edited

realead commented Aug 10, 2018

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

Comments

realead commented Jul 29, 2018

Example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

realead commented Jul 29, 2018

TomAugspurger commented Jul 30, 2018

realead commented Jul 30, 2018 • edited

gfyoung commented Jul 30, 2018

TomAugspurger commented Jul 30, 2018 via email

realead commented Jul 30, 2018

realead commented Jul 30, 2018

realead commented Aug 1, 2018 • edited

realead commented Aug 3, 2018

realead commented Aug 5, 2018 • edited

realead commented Aug 10, 2018

Output of `pd.show_versions()`

realead commented Jul 30, 2018 •

edited

realead commented Aug 1, 2018 •

edited

realead commented Aug 5, 2018 •

edited