
DISCUSS: boolean dtype with missing value support #28778

Closed
jorisvandenbossche opened this issue Oct 3, 2019 · 23 comments
Labels: API Design, Missing-data, Needs Discussion

@jorisvandenbossche
Member

jorisvandenbossche commented Oct 3, 2019

Part of the discussion on missing value handling in #28095, detailed proposal at https://hackmd.io/@jorisvandenbossche/Sk0wMeAmB.

If we go for a new NA value, we also need to decide the behaviour of this value in comparison operations. Consequently, we also need to decide on the behaviour of boolean values with missing data in logical and indexing operations.
So let's use this issue for that part of the discussion.

Some aspects of this:

  • Behaviour in comparison operations: currently np.nan compares unequal (value == np.nan -> False, value > np.nan -> False), but we could also propagate missing values (value == NA -> NA, ...)
  • Behaviour in logical operations: currently we always return False for | or & with missing data. But we could also use a "three-valued logic" like Julia and SQL (this has, e.g., NA | True = True and NA & True = NA).
  • Behaviour in indexing: currently you cannot do boolean indexing with a boolean series with missing values (which is object dtype right now). Do we want to change this? For example, interpret it as False (i.e. not select it)?
    (TODO: should check how other languages do this)

Julia has a nice documentation page explaining how they support missing values; the ideas above largely match that approach.

Besides those behavioural API discussions, we also need to decide on how to approach this technically (boolean ExtensionArray with boolean numpy array + mask for missing values?) Shall we discuss that here as well, or keep that separate?
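To make the "boolean numpy array + mask" idea concrete, here is a minimal sketch (a hypothetical class, not an actual pandas API; the mask is True where a value is missing):

import numpy as np

class MaskedBoolArray:
    # Hypothetical sketch: boolean payload plus a validity mask.
    def __init__(self, values, mask):
        self.values = np.asarray(values, dtype=bool)  # payload (undefined where masked)
        self.mask = np.asarray(mask, dtype=bool)      # True -> value is missing

    def __repr__(self):
        items = ["NA" if m else str(v) for v, m in zip(self.values, self.mask)]
        return f"MaskedBoolArray([{', '.join(items)}])"

arr = MaskedBoolArray([True, False, True], [False, True, False])
print(arr)  # MaskedBoolArray([True, NA, True])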

cc @pandas-dev/pandas-core

@jorisvandenbossche
Member Author

(Also, if we don't go for a new NA value, a boolean ExtensionArray with missing data support could still be interesting, but in that case it's probably harder to change the behaviour compared to what we currently have with np.nan.)

@WillAyd
Member

WillAyd commented Oct 3, 2019

I think I prefer what R and Julia do here but would be curious to hear counter arguments in support of existing behavior.

Just to clarify: you think any operation where one of the operands is NA should return NA, right? But something like isna would return True in the presence of an NA? (I'm assuming that from how Julia handles its === operator.)

@jbrockmendel
Member

Behaviour in logical operations: currently we always return False for | or & with missing data.

This is not accurate. These ops are basically nothing but corner cases, a handful of which do three-valued logic. That's before considering DataFrame, which only sometimes behaves like Series.

I'll elaborate later this afternoon.

@jorisvandenbossche
Member Author

would be curious to hear counter arguments in support of existing behavior.

@WillAyd I think the existing behaviour is mainly a consequence of using np.nan (for which the existing behaviour makes sense). An argument for keeping the existing behaviour would be that we have done it like that for a long time.

Just to clarify, you think any operation where one of the operands is NA should return NA right?

For comparisons yes, for logical operations it depends. I pasted below the more elaborate explanation with code examples that I wrote in the proposal on hackmd.

But something like isna would return True in the presence of an NA?

Yes, that wouldn't change compared to the current behaviour, I think.
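For example:

>>> pd.isna(pd.NA)
True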


Behaviour in comparison operations

In numerical operations, NA propagates (see also above). But for boolean operations the situation is less clear. Currently, we use the behaviour of np.nan for missing values in pandas. This means:

>>> np.nan == 1
False
>>> np.nan < 1
False
>>> np.nan != 1
True

However, a missing value could also propagate:

>>> pd.NA == 1
NA
>>> pd.NA < 1
NA
>>> pd.NA != 1
NA

This is for example what Julia and R do.

Boolean data type with missing values and logical operations

If we propagate NA in comparison operations (see above), the consequence is that you end up with boolean masks with missing values. This means that we need to support a boolean dtype with NA support, and define the behaviour in logical operations and indexing.

  • What to return in logical operations? (e.g. True & NA)
  • How to handle NAs in indexing operations? (raise an error, or treat as False?)

Currently, the logical operations are not defined very consistently. On Series/DataFrame they mostly return False, and for scalars they are not defined at all:

>>> pd.Series([True, False, np.nan]) & True
0     True
1    False
2    False
dtype: bool

>>> pd.Series([True, False, np.nan]) | True
0     True
1     True
2    False
dtype: bool

>>> np.nan & True
TypeError: unsupported operand type(s) for &: 'float' and 'bool'

For those logical operations, Julia, R and SQL use "three-valued logic" (missing values propagate only when logically required). See https://docs.julialang.org/en/v1/manual/missing/index.html for a good explanation. This would give:

>>> pd.Series([True, False, pd.NA]) & True
0     True
1    False
2       NA
dtype: bool

>>> pd.NA & True
NA

>>> pd.NA & False
False

>>> pd.NA | True
True

>>> pd.NA | False
NA

@jorisvandenbossche
Member Author

jorisvandenbossche commented Oct 4, 2019

For the question around the indexing behaviour with boolean values in the presence of NAs, I think there are 3 options:

  • Raise an exception
  • Don't include in the output (interpret NA as False in the filtering operation)
  • Propagate NAs (NA in mask gives NA in output)

I looked at some other languages / libraries that deal with this.

Postgres (SQL) filters only where True (thus interprets NULL as False in the filtering operation):

CREATE TABLE test_types (
    col1    integer,
    col2    integer
);

INSERT INTO test_types VALUES (1, 1);
INSERT INTO test_types VALUES (2, NULL);
INSERT INTO test_types VALUES (3, 3);
test_db=# SELECT col1, col2 > 2 AS mask FROM test_types;
 col1 | mask 
------+------
    1 | f
    2 | 
    3 | t
(3 rows)

test_db=# SELECT * FROM test_types WHERE col2 > 2;
 col1 | col2 
------+------
    3 |    3
(1 row)

In R, it depends on the function. dplyr's filter drops NAs: "Unlike base subsetting with [, rows where the condition evaluates to NA are dropped" (from https://dplyr.tidyverse.org/reference/filter.html). Example:

> df <- tibble(col1 = c(1L, 2L, 3L), col2 = c(1L, NA, 3L))
> df
# A tibble: 3 x 2
   col1  col2
  <int> <int>
1     1     1
2     2    NA
3     3     3
> df %>% mutate(mask = col2 > 2)
# A tibble: 3 x 3
   col1  col2 mask 
  <int> <int> <lgl>
1     1     1 FALSE
2     2    NA NA   
3     3     3 TRUE 
> df %>% filter(col2 > 2)
# A tibble: 1 x 2
   col1  col2
  <int> <int>
1     3     3

In base R, on the other hand, subsetting propagates NAs ("a missing value in the index always yields a missing value in the output", from https://adv-r.hadley.nz/subsetting.html):

> x <- c(1, 2, 3)
> mask <- c(FALSE, NA, TRUE)
> x[mask]
[1] NA  3

Julia currently raises an error (I wasn't sure whether this is on purpose or just not yet implemented; EDIT: based on https://julialang.org/blog/2018/06/missing this seems to be on purpose):

julia> arr = [1 2 3]
1×3 Array{Int64,2}:
 1  2  3

julia> mask = [false missing true]
1×3 Array{Union{Missing, Bool},2}:
 false  missing  true

julia> arr[mask]
ERROR: ArgumentError: unable to check bounds for indices of type Missing
Stacktrace:
 [1] checkindex(::Type{Bool}, ::Base.OneTo{Int64}, ::Missing) at ./abstractarray.jl:504
 [2] checkindex at ./abstractarray.jl:519 [inlined]
 [3] checkbounds at ./abstractarray.jl:434 [inlined]
 [4] checkbounds at ./abstractarray.jl:449 [inlined]
 [5] _getindex at ./multidimensional.jl:596 [inlined]
 [6] getindex(::Array{Int64,2}, ::Array{Union{Missing, Bool},2}) at ./abstractarray.jl:905
 [7] top-level scope at none:0

Apache Arrow C++ (pyarrow) currently has the same behaviour as base R (propagating):

In [2]: import pyarrow as pa 

In [4]: arr = pa.array([1, 2, 3])

In [5]: mask = pa.array([False, None, True])

In [6]: mask
Out[6]: 
<pyarrow.lib.BooleanArray object at 0x7fa0f1b9f768>
[
  false,
  null,
  true
]

In [7]: arr.filter(mask) 
Out[7]: 
<pyarrow.lib.Int64Array object at 0x7fa0f1b9fd68>
[
  null,
  3
]

@jorisvandenbossche added the API Design, Missing-data and Needs Discussion labels on Oct 7, 2019
@jorisvandenbossche
Member Author

(comment from the hackmd copied here)

[@Dr-Irv] I think you will have to be careful when people mix numpy and pandas operations. Does (pd.Series([pd.NA], dtype=int) + np.array([0])).values return an array of np.nan or a pandas typed array with pd.NA inside?

In our ops code, pandas objects always have "priority" over numpy arrays. So if you do series + array, the result is a Series (and hence, in your example, will contain pd.NA).
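For illustration (a sketch using the nullable Int64 dtype; reprs approximate):

>>> s = pd.Series([1, pd.NA], dtype="Int64")
>>> s + np.array([10, 10])  # the Series takes priority, so the result is a Series
0      11
1    <NA>
dtype: Int64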

But it's certainly true that when performing similar operations on the equivalent numpy arrays, you can get different results, especially if np.nan and pd.NA behave differently. That is a clear drawback of choosing such different behaviour.

Hypothetical example:

>>> pd.Series([1, 2, pd.NA]) == 2
0    False
1     True
2       NA
dtype: bool

>>> np.asarray(pd.Series([1, 2, pd.NA])) == 2
array([False,  True, False])

But this also relates to the question of how we convert to numpy (which hasn't really been discussed yet). By default, if there are NAs, we could also convert to object dtype (like we do now for IntegerArray), preserving pd.NA, and then you wouldn't get this different behaviour. It could then be an option for the user to ask for a conversion to np.nan (to get a non-object float array), but getting something different from pd.NA would be an explicit request by the user.
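For illustration, this is roughly how the two conversion paths look with the nullable Int64 dtype (the to_numpy keyword shown is the API that later shipped in pandas 1.0, used here for concreteness):

>>> import numpy as np
>>> import pandas as pd
>>> arr = pd.array([1, 2, pd.NA], dtype="Int64")

>>> np.asarray(arr)  # default: object dtype, pd.NA preserved
array([1, 2, <NA>], dtype=object)

>>> arr.to_numpy(dtype="float64", na_value=np.nan)  # explicit opt-in to np.nan
array([ 1.,  2., nan])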

@jbrockmendel
Member

I think I promised to offer examples of weird edge cases and then this got lost in my inbox while travelling. Is that still something that would be useful?

@xhochy
Contributor

xhochy commented Oct 8, 2019

I recently made a prototype BooleanArray that handles missing values with the current pandas logic: https://uwekorn.com/2019/09/02/boolean-array-with-missings.html. It shouldn't be hard to adapt it to produce results in Julia/Kleene logic, and also to implement other operations like | on top of that.
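As an illustration of that adaptation, here is a minimal sketch of a Kleene | on (values, mask) pairs, where mask is True for missing entries (a hypothetical helper, not the prototype's actual code):

import numpy as np

def kleene_or(left_values, left_mask, right_values, right_mask):
    # Kleene logic: True | NA -> True, False | NA -> NA, NA | NA -> NA.
    # The result is a known True wherever either side is a known True;
    # it is missing wherever either side is missing and no known True exists.
    result = (left_values & ~left_mask) | (right_values & ~right_mask)
    mask = (left_mask | right_mask) & ~result
    return result, mask

# left = [True, False, NA], right = [NA, NA, NA]
values, mask = kleene_or(
    np.array([True, False, False]), np.array([False, False, True]),
    np.array([False, False, False]), np.array([True, True, True]),
)
# values -> [True, False, False], mask -> [False, True, True]
# i.e. True | NA = True, False | NA = NA, NA | NA = NA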

@xhochy
Contributor

xhochy commented Oct 8, 2019

(extracted from #28095 (comment))

(NA == NA) = ?

In this case, I would also expect NA, since NA is typically interpreted as unknown (in the discussion here it's really just a synonym for missing, but it makes the intuition a bit better): the outcome of comparing two unknown values is also unknown.

Naming

In general, I like the Julia documentation, but I would prefer that we stick to the more widely known terminology for three-valued logic, namely Kleene logic. With that, we have a good theoretical foundation to refer to for what a computation should return, and it might make communication with e.g. the database community a lot easier.


@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 14, 2019

Is there more feedback on this?
For the three items, I think:

  • For the comparison ops, we can go with "propagating NAs"
  • For the logical ops, we can go with the Kleene (three-valued) logic
  • For the indexing I am less sure:
    • Raising an error might be the most conservative option. Users can always first explicitly do fillna(True/False) on their mask (see the sketch after this list). But at the same time, this might become tedious if you want the same value (False?) in 90% of the cases.
    • If not raising an error, skipping (interpreting NA as False) feels more intuitive to me than propagating NAs (and thus introducing NAs in the filtered array). That's also what SQL and the tidyverse do (base R propagates, see above).
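For reference, the explicit fillna pattern from the first bullet would look like this (a sketch using the nullable boolean dtype):

>>> s = pd.Series([1, 2, 3])
>>> mask = pd.array([True, None, True], dtype="boolean")

>>> s[mask.fillna(False)]  # explicit: treat NA as "don't select"
0    1
2    3
dtype: int64

>>> s[mask.fillna(True)]   # explicit: treat NA as "select"
0    1
1    2
2    3
dtype: int64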

@jbrockmendel
Member

If not raising an error, skipping (interpret NA as False) feels more intuitive than propagating NAs (and thus introducing NAs in the filtered array)

My understanding is that in (nearly?) every other situation, pd.NA refuses to cast to bool?

@jorisvandenbossche
Member Author

bool(pd.NA) raises an error in the current PR, yes (which means it raises in e.g. "if expr" when expr evaluates to pd.NA). But that can also be discussed.
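For reference, the behaviour in the PR looks like this (error message shown for illustration):

>>> bool(pd.NA)
Traceback (most recent call last):
  ...
TypeError: boolean value of NA is ambiguous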

@TomAugspurger
Contributor

TomAugspurger commented Dec 13, 2019

(discuss in #30265)

On indexing, propagating NAs presents some challenging dtype casting issues. For example, what is the dtype of the output of

In [3]: s = pd.Series([1, 2, 3])

In [4]: mask = pd.array([True, False, None])

In [5]: s[mask]

Would that be an Int64 dtype, with the last value being NA? Would we cast to float64 and use np.nan? object-dtype?

And what if the original series was float dtype? A float-dtype with NaN is about the only option we have right now, since we lack a float64 array that can hold NA.

I don't think that an indexing operation that introduces missing values should change the dtype of the array. I don't think anyone would realistically want that. So... do we just raise for now?

What about cases where we are able to index without changing the dtype? Those would be

  1. Indexing with a BooleanArray with no missing values.
  2. Indexing a dtype that supports missing values (datetime, timedelta, string, object, Int64, boolean...). Basically anything but NumPy's bool, int, and float.

IMO, we shouldn't have value-dependent behavior, so if

>>> pd.Series([1, 2])[pd.array([True, None])]

raises, then so should pd.Series([1, 2])[pd.array([True, False])] (no missing values).

I think supporting 2 is fine, since it just depends on the dtypes.

>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, None])]
Series([1, NA], dtype="Int64")
>>> pd.Series([1, 2], dtype="Int64")[pd.array([True, False])]
Series([1], dtype="Int64")

@TomAugspurger
Contributor

Pushed a prototype up for discussion at #30265. Let's move the indexing discussion over there.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Dec 16, 2019

Implementation complexities aside, I am not sure we would actually want such propagation of missing values.
My plan was to post a summary of our chat discussion here, but I never got to it, sorry about that. I seem to remember that propagation was actually the least favoured option of the three (our notes also don't say much...).

I think the main take-away of the discussion was that there is no clear "best" option.
Raising an error is the most "conservative", in the sense that we will never do the wrong thing automatically: the user always needs to specify with fillna(True/False) what they want. This is very explicit, but can also get annoying if in 95% of the cases you always want fillna(False).
Skipping the NAs (which means doing a fillna(False) implicitly) might be what most people want or expect most of the time. But it is less explicit, and when you didn't expect it or wanted something else, it can be very confusing that it happened automatically.

@TomAugspurger
Contributor

Implementation complexities aside, I am not sure we would actually want such propagation of missing values.

That's my conclusion in #30265 (comment) as well.

@jorisvandenbossche
Member Author

A note for those who followed the discussion here: the specific issue of indexing (masking) with booleans in the presence of missing values has come up again in #31503.
We originally decided in this issue to raise for now, as the most conservative option, so we could re-evaluate later. In the linked issue, the consensus seems to be that we should go with the "skipping" behaviour (i.e. interpret NA as False) instead.

@johentsch

johentsch commented Mar 29, 2020

The main inconvenience of the new behaviour, IMHO, is that code written for prior versions is extremely likely to need a lot of fillna() calls added in order to run on pandas 1.0.

@jorisvandenbossche
Member Author

@laserjeyes note that in the meantime, NAs are treated as False when filtering, which should normally lessen the need for fillna() (see #31591).
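Concretely, with that change:

>>> s = pd.Series([1, 2, 3])
>>> mask = pd.array([True, None, True], dtype="boolean")
>>> s[mask]  # NA entries in the mask are treated as False
0    1
2    3
dtype: int64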
