Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

Closed
realead opened this issue Jul 29, 2018 · 11 comments · Fixed by #22207
Closed

pd.core.algorithms.isin() doesn't handle nan correctly if it is a Python-object #22119

realead opened this issue Jul 29, 2018 · 11 comments · Fixed by #22207
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@realead
Copy link
Contributor

realead commented Jul 29, 2018

Example

import numpy as np
import pandas.core.algorithms as algos
algos.isin([np.nan], [np.nan])

Problem description

results in

array([False], dtype=bool)

Expected Output

However, I would expect the result to be

array([True], dtype=bool)

which is the case for example for

algos.isin(np.array([np.nan]), [np.nan])

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-53-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.28.3
numpy: 1.13.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@realead
Copy link
Contributor Author

realead commented Jul 29, 2018

I didn't debug it yet, but the issue is probably similar to #21866 (with the difference, that there is no special handling in isin() in the case of Python-objects as there is in the case of float64).

The right fix would be probably to fix the behavior of the hash-table and not to try to implement workarounds.

@TomAugspurger
Copy link
Contributor

Does this result in a higher-level bug for you? This is the same behavior as Python

In [1]: float('nan') in {float('nan')}
Out[1]: False

@realead
Copy link
Contributor Author

realead commented Jul 30, 2018

@TomAugspurger

I have used isin directly. I guess there are some workarounds in place for Series, which catch this corner case, when Series.isin(...) is used - at least I was not able to reproduce this error without much trying.

However, to be precise, my example corresponds to:

>>> import numpy as np
>>> np.nan in set([np.nan])
True

Or using float('nan')

>>> my_nan=float("nan")
>>> my_nan in {my_nan}
True

Probably, this is the case because == uses is at first and only than the equal operator of floats - in your example two different nan-objects are created.

Not very stable approach in anycase though...

Also, I would prefer to be consistent with the sane pandas float64-behavior than with to some degree irratic python-set behavior.

@gfyoung gfyoung added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jul 30, 2018
@gfyoung
Copy link
Member

gfyoung commented Jul 30, 2018

The fact that we're getting inconsistent results depending on the np.nan "wrapper" (ndarray vs list) looks weird in itself.

cc @jreback : do we have precedent for treating np.nan as equal to itself?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jul 30, 2018 via email

@realead
Copy link
Contributor Author

realead commented Jul 30, 2018

@gfyoung pd.unique() treats all nans as being the same:

>>> import pandas as pd
>>> pd.unique([np.nan, -np.nan, float('nan'), -float('nan')])
array([nan])

@realead
Copy link
Contributor Author

realead commented Jul 30, 2018

In order to have a consistent behavior, a hash-map/hash-set requires (among other things) that the relation == is an equivalence relation. That means in particular that x==x holds for every x.

However, for floats with ieee-754-standard == isn't an equivalence relation, because nan!=nan, thus in order to get meaningful behavior with a hash-map/set either it must be ensured, that nans are not used or the relation == must be redefined.

There are multiple ways to extend ieee-754-==-relation to an equivalence relation:

  1. compare nans bitwise (every nan becomes it own equivalence-class)
  2. all nans are "the same", i.e. from the same equivalence-class
  3. anything between 1. and 2.

Pandas opted for the second. Thus the behavior of np.unique():

>>> pd.unique([np.nan, -np.nan, float('nan'), -float('nan')])
 array([nan])

the isin():

>>> algos.isin(np.array([np.nan]), [-np.nan]))
array([ True], dtype=bool)

So having a different behavior for nans as Python-objects is at least surprising.

@realead
Copy link
Contributor Author

realead commented Aug 1, 2018

There are actually two different (even if somewhat related) issues:

  1. algos.isin(np.array([np.nan]), [np.nan]) with the same object np.nan.

    There were discussions some time ago in the Python community https://bugs.python.org/issue4296#msg75735 and it has been settled that a in {a} should be true even for nans.

    The way the hash-table is implemented: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/klib/khash_python.h#L45 - it uses PyObject_RichCompareBool which does short-circuit for the same object.

So I assume there is a bug somewhere, which results in different nan-objects arriving in the hash-table, see #22160.

  1. algos.isin([float('nan')], [float('nan')]) with different nan-objects

The Python-way is to say, that nans are not important enough to have a special treatment for them, for which all other objects/classes would pay in performance.

I'm not sure pandas can say nans are not important enough (it is probably the most frequently used value:)), so adding special handling for nans could be worth it, in order to be consistent with the behavior of

algos.isin(np.array([float('nan')]), np.array([float('nan')]))

but this is obviously not my decision to make.

@realead
Copy link
Contributor Author

realead commented Aug 3, 2018

Actually, #22148 is at its core the 2. part of this issue: Do we consider

pd.Series([float('nan')], dtype='object').isin([np.nan])

ie. for different nan-objects to be True or False (right now it is False)?

@realead
Copy link
Contributor Author

realead commented Aug 5, 2018

My proposal (i.e. starting point for a discussion) is in PR #22207 : If both objects are floats, then check whether they are both nans - the same way it is done for float64. The used hash-function has the necessary behavior, so there is no need to change something.

It doesn't prevent the user to define custom classes with behaviors similar to nan and shoot themselves in the foot - it is probably too much ask for handling such cases as well.

As to performance, I could not see any worsening (added some additional performance tests myself), the results are:

asv continuous -f 1.01 upstream/master HEAD -b ^series_methods -b ^algorithms
....                                                           
       before           after         ratio
     [d30c4a06]       [e583a0c9]
-      2.14±0.04s         672±10μs     0.00  series_methods.IsInForObjects.time_isin_nans

Btw, the )new) test above shows, how easy it is to trigger the wost case behavior of O(n^2) for the current implementation: just hash different nan-objects. They all have hash-value = 0 and all are not equal to each other, thus O(n) is needed to add an element into the hashtable with n keys.

My take aways from it:

  • the performance penalty isn't severe (if one calculates a hash, one also can stomach to check whether a value is nan)
  • The current implementation has a performance problem for the cases of different nan-objects.
  • It will not be easy to achieve consistency for the whole project, the different way numpy treats the "sameness" of nans is not easy to overcome. However, the consistency in handling float64 and float-object for this very method is feasible.

This would also fix #22148.

@jreback jreback added this to the 0.24.0 milestone Aug 8, 2018
@realead
Copy link
Contributor Author

realead commented Aug 10, 2018

Probably worth mentioning: pd.NaT is such an object with pd.NaT == pd.NaT being False.

However, pd.NaT is a singleton and thus the shorcut-evaluation pd.NaT is pd.NaT in the hash-map is enough to save us from troubles.

IIRC the user could create another pd.NaT, but then well, if one messes with private stuff...

Enforcing singleton would be probably a cleaner solution than trying to fix this esoteric possibility in the hash-map.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants