BUG: pd.NA is not compatible with searchsorted #30944

Open
jschendel opened this issue Jan 12, 2020 · 9 comments
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jschendel
Member

Code Sample, a copy-pastable example if possible

On master, passing pd.NA as an input to searchsorted fails, and calling searchsorted on an array containing pd.NA also fails:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '1.0.0rc0+15.g4e2546d89'

In [2]: s = pd.Series([-1, 1, 3, 5])

In [3]: arr_pd_na = pd.array([0, 1, 2, pd.NA])

In [4]: s.searchsorted(arr_pd_na)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [5]: s.searchsorted(pd.NA)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [6]: arr_pd_na.searchsorted(10)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

Note that the np.nan equivalent works fine:

In [7]: arr_np_nan = np.array([0, 1, 2, np.nan])

In [8]: s.searchsorted(arr_np_nan)
Out[8]: array([1, 1, 2, 4])

In [9]: s.searchsorted(np.nan)
Out[9]: 4

In [10]: arr_np_nan.searchsorted(10)
Out[10]: 3

This has downstream effects on anything that relies on searchsorted, e.g. pd.cut, which has the same failing behavior as above for pd.NA but succeeds for np.nan:

In [11]: pd.cut(arr_pd_na, bins=3)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [12]: pd.cut(arr_np_nan, bins=3)
Out[12]: 
[(-0.002, 0.667], (0.667, 1.333], (1.333, 2.0], NaN]
Categories (3, interval[float64]): [(-0.002, 0.667] < (0.667, 1.333] < (1.333, 2.0]]

Problem description

pd.NA is not compatible with searchsorted.

Expected Output

I'd expect the output for the pd.NA operations above to match the output of the equivalent np.nan operations.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 4e2546d
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.14-041914-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+15.g4e2546d89
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.0
hypothesis : 4.36.2
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
pytest : 5.2.0
s3fs : 0.3.4
scipy : 1.3.1
sqlalchemy : 1.3.8
tables : 3.5.1
tabulate : None
xarray : 0.13.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.46.0

@jschendel jschendel added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jan 12, 2020
@jschendel jschendel added this to the 1.0.0 milestone Jan 12, 2020
@jschendel
Member Author

Marked the milestone as 1.0.0 because it'd be nice to fix this before the release, but I'm not sure if this should actually be a blocker for the release.

@jorisvandenbossche
Member

It would indeed be nice to at least solve things like pd.cut for 1.0, as this was working for Int64 dtype before.
One option for a "quick" fix might be to convert the integer array to a float array at the beginning of the cut (and related) method. That should give the same result as before I think.
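
The "quick" fix described above can be sketched from the user side as follows. This is only an illustration under assumptions, not the change that was made inside pd.cut: it converts the nullable integer array to float64 with `to_numpy`, replacing pd.NA with np.nan, before binning. The variable names `as_float` and `binned` are made up for the example.

```python
import numpy as np
import pandas as pd

arr_pd_na = pd.array([0, 1, 2, pd.NA])

# Assumed workaround: convert to float64 and map pd.NA to np.nan,
# so that no pd.NA value ever reaches the underlying searchsorted call.
as_float = arr_pd_na.to_numpy(dtype="float64", na_value=np.nan)

binned = pd.cut(as_float, bins=3)
print(list(binned.codes))  # [0, 1, 2, -1]; -1 marks the missing entry
```

The cost is a full copy into a float array, which is the performance concern raised later in the thread.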

Longer term: I don't think it is easy to fix the searchsorted directly, as here it is a numpy call, where the passed integer array gets converted to an object numpy array (at least if we don't want to change the coercing behaviour of IntegerArray and the comparison and boolean behaviour of pd.NA).
We probably need to make a "mask-aware" version of our algorithms like cut. Because in principle, pd.cut simply propagates NAs in the input to the output, so they don't need to be passed through the full binning (for which searchsorted is used).
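
A minimal sketch of what a "mask-aware" cut could look like, assuming the approach described above: compute the NA mask, bin only the valid values, and scatter the resulting codes back with -1 (the Categorical code for a missing value) at the masked positions. The helper name `cut_mask_aware` is hypothetical and is not part of pandas; the case where every value is NA is not handled.

```python
import numpy as np
import pandas as pd


def cut_mask_aware(values, bins):
    """Hypothetical helper: bin the non-NA entries and propagate NA."""
    arr = pd.array(values)
    mask = np.asarray(pd.isna(arr), dtype=bool)

    # Only the valid values go through the binning (and thus searchsorted).
    valid = arr[~mask].to_numpy(dtype="float64")
    binned = pd.cut(valid, bins=bins)

    # Scatter the codes back; masked positions keep the missing code -1.
    codes = np.full(len(arr), -1, dtype=binned.codes.dtype)
    codes[~mask] = binned.codes
    return pd.Categorical.from_codes(codes, dtype=binned.dtype)


result = cut_mask_aware(pd.array([0, 1, 2, pd.NA]), bins=3)
print(list(result.codes))  # [0, 1, 2, -1]
```

Because the bin edges are derived from the valid values only, the categories should match what pd.cut produces for the equivalent np.nan input.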

@souvik3333
Contributor

Hi, can I work on this?

@souvik3333
Contributor

take

@souvik3333
Contributor

souvik3333 commented Jan 18, 2020

@jschendel Is this issue still occurring? I tried

import numpy as np
import pandas as pd

print(pd.__version__)
arr_np_nan = np.array([0, 1, 2, np.nan])
a=pd.cut(arr_np_nan, bins=3)
print(a)

and the result is

0.26.0.dev0+1351.g70a083f04
[(-0.002, 0.667], (0.667, 1.333], (1.333, 2.0], NaN]
Categories (3, interval[float64]): [(-0.002, 0.667] < (0.667, 1.333] < (1.333, 2.0]]

Seems like only s.searchsorted(pd.NA) is giving output as

TypeError: boolean value of NA is ambiguous

Should I follow what @jorisvandenbossche said and convert the integer array to a float array in the searchsorted-related methods?

@jschendel
Member Author

Yes, this is specifically an issue with pd.NA. There is no issue with np.nan.

I'm a little hesitant to coerce integer array to float array due to the likely performance hits, but it could be fine as a short-term fix.

The searchsorted call here is to numpy but we have our own internal algos.searchsorted that we could make mask-aware, and then just ensure that all of our internal searchsorted calls go through algos.searchsorted and not directly to numpy. This would require some care to do in a way that minimizes any performance hits though. The advantage here is that it seems like this would allow us to get by without needing to rewrite algos like cut since the machinery used in them would be mask-aware.
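
To make the mask-aware idea concrete, here is a rough sketch under stated assumptions. It is not pandas' internal algos.searchsorted; the names `_split_valid` and `searchsorted_mask_aware` are hypothetical. It assumes any NA values in the searched array sit at the end (as NaN does in a sorted float array), searches only the valid values, and places NA needles where numpy would place np.nan. Going through object arrays keeps the example short but is not how a performance-sensitive version would be written.

```python
import numpy as np
import pandas as pd


def _split_valid(values):
    """Return (float64 values with NaN at NA positions, boolean NA mask)."""
    obj = np.atleast_1d(np.asarray(values, dtype=object))
    mask = np.asarray(pd.isna(obj), dtype=bool)
    out = np.full(obj.shape, np.nan, dtype="float64")
    out[~mask] = obj[~mask].astype("float64")
    return out, mask


def searchsorted_mask_aware(haystack, needles, side="left"):
    """Hypothetical sketch: searchsorted that never compares against pd.NA."""
    hay, hay_mask = _split_valid(haystack)
    hay_valid = hay[~hay_mask]  # assumes NA entries are sorted to the end

    is_scalar = np.asarray(needles, dtype=object).ndim == 0
    vals, mask = _split_valid(needles)

    # NA needles go after all valid values, mirroring how NaN sorts last.
    na_pos = len(hay_valid) if side == "left" else len(hay)
    result = np.full(vals.shape, na_pos, dtype=np.intp)
    result[~mask] = np.searchsorted(hay_valid, vals[~mask], side=side)
    return result[0] if is_scalar else result


s = pd.Series([-1, 1, 3, 5])
arr_pd_na = pd.array([0, 1, 2, pd.NA])

print(searchsorted_mask_aware(s, arr_pd_na))   # [1 1 2 4]
print(searchsorted_mask_aware(s, pd.NA))       # 4
print(searchsorted_mask_aware(arr_pd_na, 10))  # 3
```

These three calls correspond to the failing cases In [4] to In [6] from the report, and the printed values match the np.nan results in Out[8] to Out[10].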

@TomAugspurger
Contributor

What needs to be done here for 1.0.0? Just fix the regression in pd.cut(pd.array([1, 2, None]), 2)?

@jbrockmendel
Member

possibly related: i tried adding name=pd.NA in tm.makeDateIndex and it broke the world

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 24, 2020
xref pandas-dev#30944.
I think this doesn't close it, since only the pd.cut component
is fixed.
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 27, 2020
xref pandas-dev#30944.
I think this doesn't close it, since only the pd.cut component
is fixed.
@TomAugspurger
Contributor

I'm going to move this off 1.0.0. I think that .searchsorted(NA) not working will be a known limitation.

The fix for cut(IntegerArray) is targeted for 1.0.0.

@TomAugspurger TomAugspurger modified the milestones: 1.0.0, Contributions Welcome Jan 27, 2020
@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 27, 2020
@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Jan 30, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022