
DISCUSS: boolean dtype with missing value support #28778

Closed
jorisvandenbossche opened this issue Oct 3, 2019 · 23 comments
Labels: API Design, Missing-data, Needs Discussion

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 3, 2019

Part of the discussion on missing value handling in #28095, detailed proposal at https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB.

If we go for a new NA value, we also need to decide the behaviour of this value in comparison operations. Consequently, we also need to decide on the behaviour of boolean values with missing data in logical and indexing operations.
So let's use this issue for that part of the discussion.

Some aspects of this:

  • Behaviour in comparison operations: currently np.nan compares unequal (value == np.nan -> False, value > np.nan -> False), but we could also propagate missing values (value == NA -> NA, ...)
  • Behaviour in logical operations: currently we always return False for | or & with missing data. But we could also use a "three-valued logic" like Julia and SQL (this has, e.g., NA | True = True and NA & True = NA).
  • Behaviour in indexing: currently you cannot do boolean indexing with a boolean series with missing values (which is object dtype right now). Do we want to change this? For example, interpret it as False (i.e. not select it)?
    (TODO: should check how other languages do this)

Julia has a nice documentation page explaining how they support missing values; the ideas above largely match that approach.

Besides those behavioural API discussions, we also need to decide on how to approach this technically (boolean ExtensionArray with boolean numpy array + mask for missing values?) Shall we discuss that here as well, or keep that separate?
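To make the "boolean numpy array + mask" idea concrete, here is a minimal sketch (a hypothetical class, not an actual pandas API; the mask is True where a value is missing):

import numpy as np

class MaskedBoolArray:
    # Hypothetical sketch: boolean payload plus a validity mask.
    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype=bool)  # payload (undefined where masked)
        self.mask = np.asarray(mask, dtype=bool)      # True -> value is missing

    def __repr__(self):
        items = ["NA" if m else str(v) for v, m in zip(self.values, self.mask)]
        return f"MaskedBoolArray([{', '.join(items)}])"

arr = MaskedBoolArray([True, False, True], [False, True, False])
print(arr)  # MaskedBoolArray([True, NA, True])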

cc @pandas-dev/pandas-core

@jorisvandenbossche
Member Author

(Also, if we don't go for a new NA value, a boolean ExtensionArray with missing data support could still be interesting, but in that case it's probably harder to change the behaviour compared to what we currently have with np.nan.)

@WillAyd
Member

WillAyd commented Oct 3, 2019

I think I prefer what R and Julia do here but would be curious to hear counter arguments in support of existing behavior.

Just to clarify: you think any operation where one of the operands is NA should return NA, right? But something like isna would return True in the presence of an NA? (I'm assuming that from how Julia handles its === operator.)

@jbrockmendel
Member

Behaviour in logical operations: currently we always return False for | or & with missing data.

This is not accurate. These ops are basically nothing but corner cases, a handful of which do three-valued logic. That's before considering DataFrame, which only sometimes behaves like Series.

I'll elaborate later this afternoon.

@jorisvandenbossche
Member Author

would be curious to hear counter arguments in support of existing behavior.

@WillAyd I think the existing behaviour is mainly a consequence of using np.nan (for which the existing behaviour makes sense). An argument for keeping the existing behaviour would be that we have done it like that for a long time.

Just to clarify, you think any operation where one of the operands is NA should return NA right?

For comparisons yes, for logical operations it depends. I pasted below the more elaborate explanation with code examples that I wrote in the proposal on hackmd.

But something like isna would return True in the presence of an NA?

Yes, that wouldn't change compared to the current behaviour, I think.
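For example:

>>> pd.isna(pd.NA)
True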


Behaviour in comparison operations

In numerical operations, NA propagates (see also above). But for boolean operations the situation is less clear. Currently, we use the behaviour of np.nan for missing values in pandas. This means:

>>> np.nan == 1
False
>>> np.nan < 1
False
>>> np.nan != 1
True

However, a missing value could also propagate:

>>> pd.NA == 1
NA
>>> pd.NA < 1
NA
>>> pd.NA != 1
NA

This is for example what Julia and R do.

Boolean data type with missing values and logical operations

If we propagate NA in comparison operations (see above), the consequence is that you end up with boolean masks with missing values. This means that we need to support a boolean dtype with NA support, and define the behaviour in logical operations and indexing.

  • What to return in logical operations? (e.g. True & NA)
  • How to handle NAs in indexing operations? (raise an error, or treat as False?)

Currently, the logical operations are not defined very consistently. On Series/DataFrame they mostly return False, and for scalars they are not defined at all:

>>> pd.Series([True, False, np.nan]) & True
0     True
1    False
2    False
dtype: bool

>>> pd.Series([True, False, np.nan]) | True
0     True
1     True
2    False
dtype: bool

>>> np.nan & True
TypeError: unsupported operand type(s) for &: 'float' and 'bool'

For those logical operations, Julia, R and SQL use "three-valued logic" (missing values propagate only when logically required). See https://docs.julialang.org/en/v1/manual/missing/index.html for a good explanation. This would give:

>>> pd.Series([True, False, pd.NA]) & True
0     True
1    False
2       NA
dtype: bool

>>> pd.NA & True
NA

>>> pd.NA & False
False

>>> pd.NA | True
True

>>> pd.NA | False
NA

@jorisvandenbossche
Member Author

jorisvandenbossche commented Oct 4, 2019

For the question around the indexing behaviour with boolean values in the presence of NAs, I think there are 3 options:

  • Raise an exception
  • Don't include in the output (interpret NA as False in the filtering operation)
  • Propagate NAs (NA in mask gives NA in output)

I looked at some other languages / libraries that deal with this.

Postgres (SQL) filters only where True (thus interprets NULL as False in the filtering operation):

CREATE TABLE test_types (
    col1    integer,
    col2    integer
);

INSERT INTO test_types VALUES (1, 1);
INSERT INTO test_types VALUES (2, NULL);
INSERT INTO test_types VALUES (3, 3);
test_db=# SELECT col1, col2 > 2 AS mask FROM test_types;
 col1 | mask 
------+------
    1 | f
    2 | 
    3 | t
(3 rows)

test_db=# SELECT * FROM test_types WHERE col2 > 2;
 col1 | col2 
------+------
    3 |    3
(1 row)

In R, it depends on the function. dplyr's filter drops NAs: "Unlike base subsetting with [, rows where the condition evaluates to NA are dropped" (from https://dplyr.tidyverse.org/reference/filter.html). Example:

> df <- tibble(col1 = c(1L, 2L, 3L), col2 = c(1L, NA, 3L))
> df
# A tibble: 3 x 2
   col1  col2
  <int> <int>
1     1     1
2     2    NA
3     3     3
> df %>% mutate(mask = col2 > 2)
# A tibble: 3 x 3
   col1  col2 mask 
  <int> <int> <lgl>
1     1     1 FALSE
2     2    NA NA   
3     3     3 TRUE 
> df %>% filter(col2 > 2)
# A tibble: 1 x 2
   col1  col2
  <int> <int>
1     3     3

In base R, on the other hand, subsetting propagates NAs ("a missing value in the index always yields a missing value in the output", from https://adv-r.hadley.nz/subsetting.html):

> x <- c(1, 2, 3)
> mask <- c(FALSE, NA, TRUE)
> x[mask]
[1] NA  3

Julia currently raises an error (I wasn't sure whether this is on purpose or just not yet implemented; EDIT: based on https://julialang.org/blog/2018/06/missing this seems to be on purpose):

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3

julia> mask = [false missing true]
1×3 Array{Union{Missing, Bool},2}:
 false  missing  true

julia> arr[mask]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
 [1] checkindex(::Type{Bool}, ::Base.OneTo{Int64}, ::Missing) at ./abstractarray.jl:504
 [2] checkindex at ./abstractarray.jl:519 [inlined]
 [3] checkbounds at ./abstractarray.jl:434 [inlined]
 [4] checkbounds at ./abstractarray.jl:449 [inlined]
 [5] _getindex at ./multidimensional.jl:596 [inlined]
 [6] getindex(::Array{Int64,2}, ::Array{Union{Missing, Bool},2}) at ./abstractarray.jl:905
 [7] top-level scope at none:0

Apache Arrow C++ (pyarrow) currently has the same behaviour as base R (propagating):

In [2]: import pyarrow as pa 

In [4]: arr = pa.array([1, 2, 3])

In [5]: mask = pa.array([False, None, True])

In [6]: mask
Out[6]: 
<pyarrow.lib.BooleanArray object at 0x7fa0f1b9f768>
[
  false,
  null,
  true
]

In [7]: arr.filter(mask) 
Out[7]: 
<pyarrow.lib.Int64Array object at 0x7fa0f1b9fd68>
[
  null,
  3
]

@jorisvandenbossche added the API Design, Missing-data and Needs Discussion labels on Oct 7, 2019
@jorisvandenbossche
Member Author

(comment from the hackmd copied here)

[@Dr-Irv] I think you will have to be careful when people mix numpy and pandas operations. Does (pd.Series([pd.NA], dtype=int) + np.array([0])).values return an array of np.nan or a pandas typed array with pd.NA inside?

In our ops code, pandas objects always have "priority" over numpy arrays. So if you do series + array, the result is a Series (and hence, in your example, will contain pd.NA).
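For illustration (a sketch using the nullable Int64 dtype; reprs approximate):

>>> s = pd.Series([1, pd.NA], dtype="Int64")
>>> s + np.array([10, 10])  # the Series takes priority, so the result is a Series
0      11
1    <NA>
dtype: Int64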

But it's certainly true that when performing similar operations on the equivalent numpy arrays, you can get different results, especially if np.nan and pd.NA behave differently. That is a clear drawback of choosing such different behaviour.

Hypothetical example:

>>> pd.Series([1, 2, pd.NA]) == 2
0    False
1     True
2       NA
dtype: bool

>>> np.asarray(pd.Series([1, 2, pd.NA])) == 2
array([False,  True, False])

But this also relates to the question of how we convert to numpy (which hasn't really been discussed yet). By default, if there are NAs, we could also convert to object dtype (like we do now for IntegerArray), preserving pd.NA, and then you wouldn't get this different behaviour. It could then be an option for the user to ask for a conversion to np.nan (to get a non-object float array), but getting something different from pd.NA would be an explicit request by the user.
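For illustration, this is roughly how the two conversion paths look with the nullable Int64 dtype (the to_numpy keyword shown is the API that later shipped in pandas 1.0, used here for concreteness):

>>> import numpy as np
>>> import pandas as pd
>>> arr = pd.array([1, 2, pd.NA], dtype="Int64")

>>> np.asarray(arr)  # default: object dtype, pd.NA preserved
array([1, 2, <NA>], dtype=object)

>>> arr.to_numpy(dtype="float64", na_value=np.nan)  # explicit opt-in to np.nan
array([ 1.,  2., nan])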

@jbrockmendel
Member

I think I promised to offer examples of weird edge cases and then this got lost in my inbox while travelling. Is that still something that would be useful?

@xhochy
Contributor

xhochy commented Oct 8, 2019

I recently made a prototype BooleanArray that handles missing values with the current pandas logic: https://uwekorn.com/2019/09/02/boolean-array-with-missings.html. It shouldn't be hard to adapt it to produce results in Julia/Kleene logic, and also to implement other operations like | on top of that.
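As an illustration of that adaptation, here is a minimal sketch of a Kleene | on (values, mask) pairs, where mask is True for missing entries (a hypothetical helper, not the prototype's actual code):

import numpy as np

def kleene_or(left_values, left_mask, right_values, right_mask):
    # Kleene logic: True | NA -> True, False | NA -> NA, NA | NA -> NA.
    # The result is a known True wherever either side is a known True;
    # it is missing wherever either side is missing and no known True exists.
    result = (left_values & ~left_mask) | (right_values & ~right_mask)
    mask = (left_mask | right_mask) & ~result
    return result, mask

# left = [True, False, NA], right = [NA, NA, NA]
values, mask = kleene_or(
    np.array([True, False, False]), np.array([False, False, True]),
    np.array([False, False, False]), np.array([True, True, True]),
)
# values -> [True, False, False], mask -> [False, True, True]
# i.e. True | NA = True, False | NA = NA, NA | NA = NA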

@xhochy
Contributor

xhochy commented Oct 8, 2019

(extracted from #28095 (comment))

(NA == NA) = ?

In this case, I would also expect NA, since NA is typically interpreted as unknown (in the discussion here it's really just a synonym for missing, but it makes the intuition a bit better): the outcome of comparing two unknown values is also unknown.

Naming

In general, I like the Julia documentation, but I would prefer that we stick to the more widely known terminology for three-valued logic, namely Kleene logic. With that, we have a good theoretical foundation to refer to for what a computation should return, and it might make communication with e.g. the database community a lot easier.


@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 14, 2019

Is there more feedback on this?
For the three items, I think:

  • For the comparison ops, we can go with "propagating NAs"
  • For the logical ops, we can go with the Kleene (three-valued) logic
  • For the indexing I am less sure:
    • Raising an error might be the most conservative option. Users can always first explicitly do fillna(True/False) on their mask (see the sketch after this list). But at the same time, this might become tedious if you want the same value (False?) in 90% of the cases.
    • If not raising an error, skipping (interpreting NA as False) feels more intuitive to me than propagating NAs (and thus introducing NAs in the filtered array). That's also what SQL and the tidyverse do (base R propagates, see above).
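For reference, the explicit fillna pattern from the first bullet would look like this (a sketch using the nullable boolean dtype):

>>> s = pd.Series([1, 2, 3])
>>> mask = pd.array([True, None, True], dtype="boolean")

>>> s[mask.fillna(False)]  # explicit: treat NA as "don't select"
0    1
2    3
dtype: int64

>>> s[mask.fillna(True)]   # explicit: treat NA as "select"
0    1
1    2
2    3
dtype: int64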

@jbrockmendel
Member

If not raising an error, skipping (interpret NA as False) feels more intuitive than propagating NAs (and thus introducing NAs in the filtered array)

My understanding is that in (nearly?) every other situation, pd.NA refuses to cast to bool?

@jorisvandenbossche
Member Author

bool(pd.NA) raises an error in the current PR, yes (which means it raises in e.g. "if expr" when expr evaluates to pd.NA). But that can also be discussed.
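For reference, the behaviour in the PR looks like this (error message shown for illustration):

>>> bool(pd.NA)
Traceback (most recent call last):
  ...
TypeError: boolean value of NA is ambiguous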

@TomAugspurger
Contributor

TomAugspurger commented Dec 13, 2019

(discuss in #30265)

On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype of the output of

In [3]: s = pd.Series([1, 2, 3])

In [4]: mask = pd.array([True, False, None])

In [5]: s[mask]

Would that be an Int64 dtype, with the last value being NA? Would we cast to float64 and use np.nan? object-dtype?

And what if the original series was float dtype? A float-dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.

I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now?

What about cases where we are able to index without changing the dtype? Those would be

  1. Indexing with a BooleanArray with no missing values.
  2. Indexing a dtype that supports missing values (datetime, timedelta, string, object, Int64, boolean...). Basically anything but NumPy's bool, int, and float.

IMO, we shouldn't have value-dependent behavior, so if

>>> pd.Series([1, 2])[pd.array([True, None])]

raises, then so should pd.Series([1, 2])[pd.array([True, False])] (no missing values).

I think supporting 2 is fine, since it just depends on the dtypes.

>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, None])]
Series([1, NA], dtype="Int64")
>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, False])]
Series([1], dtype="Int64")

@TomAugspurger
Contributor

Pushed a prototype up for discussion at #30265. Let's move the indexing discussion over there.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Dec 16, 2019

Implementation complexities aside, I am not sure we would actually want such propagation of missing values.
My plan was to post a summary of our chat discussion here, but I never got to it, sorry about that. I seem to remember that propagation was actually the least favoured option of the three (our notes also don't say much...).

I think the main take-away of the discussion was that there is no clear "best" option.
Raising an error is the most "conservative", in the sense that we will never do the wrong thing automatically: the user always needs to specify with fillna(True/False) what they want. This is very explicit, but can also get annoying if in 95% of the cases you always want fillna(False).
Skipping the NAs (which means doing a fillna(False) implicitly) might be what most people want or expect most of the time. But it is less explicit, and when you didn't expect it or wanted something else, it can be very confusing that it happened automatically.

@TomAugspurger
Contributor

Implementation complexities aside, I am not sure we would actually want such propagation of missing values.

That's my conclusion in #30265 (comment) as well.

@jorisvandenbossche
Member Author

A note for those who followed the discussion here: the specific issue of indexing (masking) with booleans in the presence of missing values has come up again in #31503.
We originally decided in this issue to raise for now, as the most conservative option, so we could re-evaluate later. In the linked issue, the consensus seems to be that we should go with the "skipping" behaviour (i.e. interpret NA as False) instead.

@johentsch

johentsch commented Mar 29, 2020

The main inconvenience of the new behaviour, IMHO, is that code written for prior versions is extremely likely to need a lot of fillna() calls added in order to run on pandas 1.0.

@jorisvandenbossche
Member Author

@laserjeyes note that in the meantime, NAs are treated as False when filtering, which should normally lessen the need for fillna() (see #31591).
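Concretely, with that change:

>>> s = pd.Series([1, 2, 3])
>>> mask = pd.array([True, None, True], dtype="boolean")
>>> s[mask]  # NA entries in the mask are treated as False
0    1
2    3
dtype: int64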
