BUG: pd.NA is not compatible with searchsorted #30944

jschendel · 2020-01-12T18:12:25Z

Code Sample, a copy-pastable example if possible

On master, passing pd.NA to searchsorted fails, and calling searchsorted on an array containing pd.NA also fails:

In [1]: import numpy as np; import pandas as pd; pd.__version__
Out[1]: '1.0.0rc0+15.g4e2546d89'

In [2]: s = pd.Series([-1, 1, 3, 5])

In [3]: arr_pd_na = pd.array([0, 1, 2, pd.NA])

In [4]: s.searchsorted(arr_pd_na)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [5]: s.searchsorted(pd.NA)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [6]: arr_pd_na.searchsorted(10)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

Note that the np.nan equivalent works fine:

In [7]: arr_np_nan = np.array([0, 1, 2, np.nan])

In [8]: s.searchsorted(arr_np_nan)
Out[8]: array([1, 1, 2, 4])

In [9]: s.searchsorted(np.nan)
Out[9]: 4

In [10]: arr_np_nan.searchsorted(10)
Out[10]: 3

This has downstream effects on anything that relies on searchsorted, e.g. pd.cut, which has the same failing behavior as above for pd.NA but succeeds for np.nan:

In [11]: pd.cut(arr_pd_na, bins=3)
---------------------------------------------------------------------------
TypeError: boolean value of NA is ambiguous

In [12]: pd.cut(arr_np_nan, bins=3)
Out[12]: 
[(-0.002, 0.667], (0.667, 1.333], (1.333, 2.0], NaN]
Categories (3, interval[float64]): [(-0.002, 0.667] < (0.667, 1.333] < (1.333, 2.0]]

Problem description

pd.NA is not compatible with searchsorted.

Expected Output

I'd expect the output for the pd.NA operations above to match the output of the equivalent np.nan operations.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 4e2546d
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.14-041914-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+15.g4e2546d89
numpy : 1.17.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.6.0.post20191030
Cython : 0.29.13
pytest : 5.2.0
hypothesis : 4.36.2
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
pytest : 5.2.0
s3fs : 0.3.4
scipy : 1.3.1
sqlalchemy : 1.3.8
tables : 3.5.1
tabulate : None
xarray : 0.13.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1
numba : 0.46.0


jschendel · 2020-01-12T18:14:48Z

Marked the milestone as 1.0.0 because it'd be nice to fix this before the release but not sure if this should actually be a blocker for the release.

jorisvandenbossche · 2020-01-12T19:28:09Z

It would indeed be nice to at least solve things like pd.cut for 1.0, as this was working for Int64 dtype before.
One option for a "quick" fix might be to convert the integer array to a float array at the beginning of the cut (and related) method. That should give the same result as before I think.
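
A minimal sketch of what that "quick" conversion could look like from the caller's side, using the arrays from the report. It assumes that turning NA into NaN up front is acceptable and that the data is numeric; it is a workaround illustration, not the change that was made inside pandas.

```python
import numpy as np
import pandas as pd

# Nullable Int64 array from the original report.
arr_pd_na = pd.array([0, 1, 2, pd.NA])

# Assumption: replacing NA with NaN is acceptable for this use case.
as_float = arr_pd_na.to_numpy(dtype="float64", na_value=np.nan)

s = pd.Series([-1, 1, 3, 5])
positions = s.searchsorted(as_float)  # same call as In [8], on the converted array
binned = pd.cut(as_float, bins=3)     # same call as In [12], on the converted array

print(positions)
print(binned)
```

After the conversion both calls take the np.nan code path shown in the report, so the missing entry ends up past the last position in searchsorted and as NaN in the cut result.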

Longer term: I don't think it is easy to fix the searchsorted directly, as here it is a numpy call, where the passed integer array gets converted to an object numpy array (at least if we don't want to change the coercing behaviour of IntegerArray and the comparison and boolean behaviour of pd.NA).
We probably need to make a "mask-aware" version of our algorithms like cut. Because in principle, pd.cut simply propagates NAs in the input to the output, so they don't need to be passed through the full binning (for which searchsorted is used).
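
To illustrate the "mask-aware" idea, here is a rough sketch in which only the non-missing values go through the binning and the NA positions are filled back in afterwards. The helper name `cut_skipping_na` is hypothetical and not part of pandas; the sketch assumes numeric input and does not reflect how pandas implements cut internally.

```python
import numpy as np
import pandas as pd


def cut_skipping_na(values, bins):
    """Hypothetical helper: bin only non-missing entries, keep NA slots missing."""
    values = pd.array(values)
    mask = np.asarray(pd.isna(values), dtype=bool)

    # Only the valid entries are passed through the binning (and searchsorted).
    valid = values[~mask].to_numpy(dtype="float64")
    binned_valid = pd.cut(valid, bins=bins)

    # Code -1 marks a missing value in a Categorical.
    codes = np.full(len(values), -1, dtype=binned_valid.codes.dtype)
    codes[~mask] = binned_valid.codes
    return pd.Categorical.from_codes(codes, dtype=binned_valid.dtype)


result = cut_skipping_na(pd.array([0, 1, 2, pd.NA]), bins=3)
print(result)
```

With an integer `bins`, the bin edges here are computed from the valid values only, which is also what happens when NaN is present in a float input.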

souvik3333 · 2020-01-14T07:05:38Z

Hi, can I work on this?

souvik3333 · 2020-01-16T11:35:31Z

take

souvik3333 · 2020-01-18T16:07:03Z

@jschendel Is this issue still occurring? I tried

import numpy as np
import pandas as pd

print(pd.__version__)
arr_np_nan = np.array([0, 1, 2, np.nan])
a=pd.cut(arr_np_nan, bins=3)
print(a)

and the result is

0.26.0.dev0+1351.g70a083f04
[(-0.002, 0.667], (0.667, 1.333], (1.333, 2.0], NaN]
Categories (3, interval[float64]): [(-0.002, 0.667] < (0.667, 1.333] < (1.333, 2.0]]

Seems like only s.searchsorted(pd.NA) is giving output as

TypeError: boolean value of NA is ambiguous

Should I follow what @jorisvandenbossche said and update integer array to float array in searchsorted related methods?

jschendel · 2020-01-21T05:23:05Z

Yes, this is specifically an issue with pd.NA. There is no issue with np.nan.

I'm a little hesitant to coerce the integer array to a float array because of the likely performance hit, but it could be fine as a short-term fix.

The searchsorted call here goes to numpy, but we have our own internal algos.searchsorted that we could make mask-aware, and then just ensure that all of our internal searchsorted calls go through algos.searchsorted and not directly to numpy. This would require some care to minimize any performance hit, though. The advantage is that this would seem to let us get by without rewriting algos like cut, since the machinery they use would be mask-aware.
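
As a rough illustration of what a mask-aware searchsorted could do, the sketch below takes the needle values and a boolean mask separately, searches only the unmasked needles with numpy, and assigns masked needles the position past the end. The function name `masked_searchsorted` is hypothetical, and placing NA last is an assumption chosen to mirror the np.nan results in the report; this is not the pandas algos.searchsorted implementation.

```python
import numpy as np


def masked_searchsorted(sorted_values, needles, needle_mask, side="left"):
    """Hypothetical sketch: searchsorted that skips masked (missing) needles."""
    sorted_values = np.asarray(sorted_values)
    needles = np.asarray(needles)
    needle_mask = np.asarray(needle_mask, dtype=bool)

    # Assumption: missing needles sort after everything, as np.nan does.
    out = np.full(needles.shape, len(sorted_values), dtype=np.intp)
    out[~needle_mask] = np.searchsorted(
        sorted_values, needles[~needle_mask], side=side
    )
    return out


haystack = np.array([-1, 1, 3, 5])
needles = np.array([0, 1, 2, 0])  # last slot is a placeholder hidden by the mask
mask = np.array([False, False, False, True])

positions = masked_searchsorted(haystack, needles, mask)
print(positions)
```

Because the masked needles never reach numpy, no comparison against pd.NA takes place, which is the step that raises the TypeError in the report.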

TomAugspurger · 2020-01-22T14:18:40Z

What needs to be done here for 1.0.0? Just fix the regression in pd.cut(pd.array([1, 2, None]), 2)?

jbrockmendel · 2020-01-22T17:02:46Z

possibly related: i tried adding name=pd.NA in tm.makeDateIndex and it broke the world

xref pandas-dev#30944. I think this doesn't close it, since only the pd.cut component is fixed.

TomAugspurger · 2020-01-27T15:10:25Z

I'm going to move this off 1.0.0; I think .searchsorted(NA) not working will be a known limitation.

The fix for cut(IntegerArray) is targeted for 1.0.0.

jschendel added the Bug and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Jan 12, 2020

jschendel added this to the 1.0.0 milestone Jan 12, 2020

jschendel mentioned this issue Jan 12, 2020

Unexpected behavior in cut() with nullable Int64 dtype #30787

Closed

github-actions bot assigned souvik3333 Jan 16, 2020

TomAugspurger mentioned this issue Jan 22, 2020

RLS: 1.0.0 #27492

Closed

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 24, 2020

BUG: Handle IntegerArray in pd.cut

e6ec3b2

xref pandas-dev#30944. I think this doesn't close it, since only the pd.cut component is fixed.

TomAugspurger mentioned this issue Jan 24, 2020

BUG: Handle IntegerArray in pd.cut #31290

Merged

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jan 27, 2020

BUG: Handle IntegerArray in pd.cut

8654a56

xref pandas-dev#30944. I think this doesn't close it, since only the pd.cut component is fixed.

TomAugspurger modified the milestones: 1.0.0, Contributions Welcome Jan 27, 2020

jorisvandenbossche added the ExtensionArray (extending pandas with custom dtypes or arrays) label Jan 27, 2020

jorisvandenbossche mentioned this issue Jan 28, 2020

Handle ExtensionArrays in cut #31389

Open

jorisvandenbossche added the NA - MaskedArrays (related to pd.NA and nullable extension arrays) label Jan 30, 2020

jbrockmendel mentioned this issue Dec 18, 2021

ROADMAP: Consistent missing value handling with new NA scalar #28095

Open

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
