[WIP]: Indexing with BooleanArray propagates NA #30265

TomAugspurger · 2019-12-13T20:46:44Z

This implements indexing a Series with a BooleanArray for some
dtypes. NA values in the mask propagate. It's not clear that we want to propagate NAs.

This required updating each EA's __getitem__ to correctly propagate
NA. Doing the same for ndarray-backed Series is possible, but will require
some additional hacking and we first need to determine the result dtype.

No need to review in detail. The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.

xref #28778, and #28778 (comment) in particular.

(copied from #28778 (comment))

On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype on the output of

In [3]: s = pd.Series([1, 2, 3])

In [4]: mask = pd.array([True, False, None])

In [5]: s[mask]

Would that be an Int64 dtype, with the last value being NA? Would we cast to float64 and use np.nan? object-dtype?

And what if the original series was float dtype? A float-dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.

I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now?

What about cases when we are able to index without changing the dtype? Those would be

Indexing with a BooleanArray with no missing values.
Indexing a dtype that supports missing values (datetime, timedelta, string, object, Int64, boolean...). Basically anything but NumPy's bool, int float.

IMO, which shouldn't have value-dependent behavior, so if

>>> pd.Series([1, 2])[pd.array([True, None])

raises, then so should pd.Series([1, 2])[pd.array([True, False]) (no missing values).

I think supporting 2 is fine, since it just depends on the dtypes.

>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, None])]
Series([1, NA], dtype="Int64")
>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, False])]
Series([1], dtype="Int64")

This implements indexing a Series with a BooleanArray for some dtypes. NA values in the mask propagate. This required updating each EA's `__getitem__` to correctly propagate NA. This isn't an option for ndarray-backed Series, which don't understand BooleanArray. No need to review in detail. The primary motivation for pushing this up is to better understand what indexing with NA values will look like in practice.

TomAugspurger · 2019-12-16T15:59:22Z

The primary motivation for pushing this up
is to better understand what indexing with NA values will look like in
practice.

Any thoughts on this point @jorisvandenbossche, @jreback, et. al? In #28778 (comment) Joris said that

If not raising an error, skipping (interpret NA as False) feels more intuitive to me than propagating NAs (and thus introducing NAs in the filtered array). That's also what SQL and tidyverse do (base R propagates, see above)

I think that the additional dtype for Series([1, 2])[True, False]) being int, but Series([1, 2])[True, NA]) being unknown makes the case for raising even stronger.

So my proposal is to

Have indexing with NA values raises with a nice error message.
Implement BooleanArray.fillna() to make the common case of indexing with a BooleanArray easier.

At some point, we can discuss having NA values propagate or drop, but lets get some user feedback first.

TomAugspurger · 2019-12-16T21:40:46Z

Closing this, since I don't think we'll want to implement NA propagation.

Dr-Irv · 2019-12-16T22:18:58Z

So my proposal is to

Have indexing with NA values raises with a nice error message.

Implement BooleanArray.fillna() to make the common case of indexing with a BooleanArray easier.

I think this is the right thing to do. IMHO, if there is a pd.NA value in a BooleanArray, then we raise to force the user to think about the issue and how they want missing values treated.

Maybe a related issue - what if someone puts pd.NA in an index and then uses .loc, passing pd.NA in the parameters to .loc?

jorisvandenbossche · 2019-12-18T14:26:54Z

I think that the additional dtype for Series([1, 2])[True, False]) being int, but Series([1, 2])[True, NA]) being unknown makes the case for raising even stronger.

Note that is not necessarily an argument for raising. Skipping also preserves the dtype.

Maybe a related issue - what if someone puts pd.NA in an index and then uses .loc, passing pd.NA in the parameters to .loc?

For now, the EAs are not yet enabled in Index (#22861), but this is a good question for when looking into that (and a chance to re-evaluate how it now happens with NaN!).

TomAugspurger mentioned this pull request Dec 13, 2019

DISCUSS: boolean dtype with missing value support #28778

Closed

TomAugspurger changed the title ~~[WIP]: Indexing with BooleanArray~~ [WIP]: Indexing with BooleanArray propagates NA Dec 16, 2019

TomAugspurger closed this Dec 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP]: Indexing with BooleanArray propagates NA #30265

[WIP]: Indexing with BooleanArray propagates NA #30265

TomAugspurger commented Dec 13, 2019 •

edited

Loading

TomAugspurger commented Dec 16, 2019

TomAugspurger commented Dec 16, 2019

Dr-Irv commented Dec 16, 2019

jorisvandenbossche commented Dec 18, 2019

[WIP]: Indexing with BooleanArray propagates NA #30265

[WIP]: Indexing with BooleanArray propagates NA #30265

Conversation

TomAugspurger commented Dec 13, 2019 • edited Loading

TomAugspurger commented Dec 16, 2019

TomAugspurger commented Dec 16, 2019

Dr-Irv commented Dec 16, 2019

jorisvandenbossche commented Dec 18, 2019

TomAugspurger commented Dec 13, 2019 •

edited

Loading