API: any/all in context of boolean dtype with missing values #29686

Closed
jorisvandenbossche opened this issue Nov 18, 2019 · 11 comments · Fixed by #30062
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action

@jorisvandenbossche
Member

jorisvandenbossche commented Nov 18, 2019

In the new missing values support, and especially while implementing the BooleanArray (#29555), the question comes up: what should any and all do in presence of missing values?

edit from Tom: Here's a proposed table of behavior

| case | input | output |
|---|---|---|
| 1 | all([True, NA], skipna=False) | NA |
| 2 | all([False, NA], skipna=False) | False |
| 3 | all([NA], skipna=False) | NA |
| 4 | all([], skipna=False) | True |
| 5 | any([True, NA], skipna=False) | True |
| 6 | any([False, NA], skipna=False) | NA |
| 7 | any([NA], skipna=False) | NA |
| 8 | any([], skipna=False) | False |

| case | input | output |
|---|---|---|
| 9 | all([True, NA], skipna=True) | True |
| 10 | all([False, NA], skipna=True) | False |
| 11 | all([NA], skipna=True) | True |
| 12 | all([], skipna=True) | True |
| 13 | any([True, NA], skipna=True) | True |
| 14 | any([False, NA], skipna=True) | False |
| 15 | any([NA], skipna=True) | False |
| 16 | any([], skipna=True) | False |
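
The sixteen cases can be written down as a small pure-Python sketch. This is only an illustration of the proposed rules, not the pandas implementation: `NA` is a hypothetical sentinel standing in for `pd.NA`, and `kleene_any` / `kleene_all` are names made up for this example.

```python
# Hypothetical stand-in for pd.NA; any unique sentinel works for this sketch.
NA = object()


def _drop_na(values, skipna):
    values = list(values)
    return [v for v in values if v is not NA] if skipna else values


def kleene_any(values, skipna=True):
    """Return True, False, or NA for an OR-style reduction."""
    values = _drop_na(values, skipna)
    if any(v is True for v in values):
        return True   # one True decides the result, whatever an NA hides
    if any(v is NA for v in values):
        return NA     # no True seen, but an NA could still be True
    return False      # only False values, or nothing left to reduce


def kleene_all(values, skipna=True):
    """Return True, False, or NA for an AND-style reduction."""
    values = _drop_na(values, skipna)
    if any(v is False for v in values):
        return False  # one False decides the result
    if any(v is NA for v in values):
        return NA     # no False seen, but an NA could still be False
    return True       # only True values, or nothing left to reduce


# (function, input, skipna, expected) for cases 1-16 of the table
CASES = [
    (kleene_all, [True, NA], False, NA),
    (kleene_all, [False, NA], False, False),
    (kleene_all, [NA], False, NA),
    (kleene_all, [], False, True),
    (kleene_any, [True, NA], False, True),
    (kleene_any, [False, NA], False, NA),
    (kleene_any, [NA], False, NA),
    (kleene_any, [], False, False),
    (kleene_all, [True, NA], True, True),
    (kleene_all, [False, NA], True, False),
    (kleene_all, [NA], True, True),
    (kleene_all, [], True, True),
    (kleene_any, [True, NA], True, True),
    (kleene_any, [False, NA], True, False),
    (kleene_any, [NA], True, False),
    (kleene_any, [], True, False),
]

# Rows of the table that the sketch disagrees with (expected to be empty).
mismatches = [c for c in CASES if c[0](c[1], skipna=c[2]) is not c[3]]
print(len(mismatches))
```

With skipna=True the NA values are dropped first, so the result is always a plain boolean; only skipna=False can return NA.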

Some context:

Currently, if you have bools with NaNs, you end up with an object dtype, and the behaviour of any/all with object dtype has all kinds of corner cases. @xhochy recently opened #27709 for this (but I am opening a new issue since I want to focus here on the behaviour in boolean dtype; the behaviour in object dtype might still deviate).

The documentation of any says (https://dev.pandas.io/docs/reference/api/pandas.Series.any.html)

Return whether any element is True, potentially over an axis.

Returns False unless there at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).

...

skipna : bool, default True
Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

and similar for all (https://dev.pandas.io/docs/reference/api/pandas.Series.all.html).

Default behaviour with skipna=True

In case of some NAs and some True/False values, I think the behaviour is clear: any/all are reductions, and in pandas we use skipna=True by default for reductions.

So you get something like this:
(I am still using np.nan here as missing value, since the pd.NA PR is not yet merged / combined with the BooleanArray PR; but let's focus on return value)

In [2]: pd.Series([True, False, np.nan]).any() 
Out[2]: True

In [3]: pd.Series([True, False, np.nan]).all()
Out[3]: False

In [4]: pd.Series([True, True, np.nan]).all() 
Out[4]: True

(although when interpreting NA as "unknown", it might look a bit strange to return True in the last case since the NA might still be True or False)

Behaviour for all-NA in case of skipna=True

This is a case that is described in the current docs: "If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column", and is indeed consistent with skipping all NAs -> any/all of empty set.
And then, we follow numpy's behaviour (False for any, True for all):

In [8]: np.array([], dtype=bool).any() 
Out[8]: False

In [9]: np.array([], dtype=bool).all()
Out[9]: True

(although I don't find this necessarily very intuitive, this seems more a consequence of the algorithm starting with a base "identity" value of False/True for any/all)
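
One way to see where the empty-input results come from is to write any/all as folds. The sketch below uses only the standard library; `any_via_fold` and `all_via_fold` are illustrative names, not NumPy or pandas functions.

```python
from functools import reduce
import operator


def any_via_fold(values):
    # OR-fold starting from False, the identity of OR (False | x == x).
    return reduce(operator.or_, values, False)


def all_via_fold(values):
    # AND-fold starting from True, the identity of AND (True & x == x).
    return reduce(operator.and_, values, True)


# With no elements, the fold returns its starting value unchanged.
print(any_via_fold([]), all_via_fold([]))

# The identities also keep reductions consistent when a list is split:
left, right = [True, True], []
print(all_via_fold(left + right) == (all_via_fold(left) and all_via_fold(right)))
```

Python's built-in any([]) and all([]) return False and True for the same reason.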

Behaviour with skipna=False

Here comes the trickier part. Currently, with object dtype, we have some buggy behaviour (see #27709), and it depends on the order of the values and which missing value (np.nan or None) is used.

With BooleanArray we won't have this problem (there is only a single NA + we don't need to rely on numpy's buggy object dtype behaviour). But I am not sure we should follow what is currently in the docs:

If skipna is False, then NA are treated as True, because these are not equal to zero.

This follows from numpy's behaviour with floats:

In [10]: np.array([0, np.nan]).any()
Out[10]: True

and while this might make sense in float context, I am not sure we should follow this behaviour and our docs and do:

>>> pd.Series([False, pd.NA], dtype="boolean").any()
True

I think this should rather give False or NA instead of True.
While for object dtype it might make sense to align the behaviour with float (as argued in #27709 (comment)), for a boolean dtype we can probably use the behaviour we defined for NA in logical operations (eg False | NA = NA, so in that case, the above should give NA).
But are we ok with any/all not returning a boolean in this case? (note, you only have this if someone specifically set skipna=False)
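
For reference, the float behaviour that the current docs follow comes down to NaN being truthy. A minimal sketch with plain Python floats (no NumPy needed) shows the same effect:

```python
nan = float("nan")

# Truthiness of a float means "not equal to zero", and NaN is not equal to zero.
print(nan != 0)         # True
print(bool(nan))        # True

# So a truthiness-based any/all treats NaN like True.
print(any([0.0, nan]))  # True
print(all([1.0, nan]))  # True
```

This is the sense in which "NA are treated as True" for floats; it says nothing about whether the missing value is actually True or False.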

@jorisvandenbossche jorisvandenbossche added ExtensionArray Extending pandas with custom dtypes or arrays. Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Needs Discussion Requires discussion from core team before further action labels Nov 18, 2019
@TomAugspurger
Contributor

Agreed that the skipna=True case looks fine as is.

For skipna=False, I think that the presence of any NA would make the result NA, though it'd be good to survey what other systems do here.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 18, 2019

> For skipna=False, I think that the presence of any NA would make the result NA, though it'd be good to survey what other systems do here.

So if we follow the NA behaviour for logical operations as discussed in #28778 (and implemented for pd.NA in #29597), this will mostly result in NA, but sometimes can also result in True, eg for:

>>> pd.Series([True, pd.NA], dtype="boolean").any(skipna=False)
True

since there is already one True, the result can always be True regardless of whether the NA is actually True or False.
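
The "regardless of whether the NA is actually True or False" argument can be checked by brute force: replace every NA with both possible values and see whether the reduction always agrees. This is a sketch for illustration only (it is exponential in the number of NAs); `NA` is a hypothetical sentinel standing in for `pd.NA`, and `resolve` is a made-up helper name.

```python
from itertools import product

NA = object()  # hypothetical stand-in for pd.NA


def resolve(values, reducer):
    """Apply `reducer` under every replacement of NA by True/False.

    Returns the common result if all replacements agree, otherwise NA.
    """
    values = list(values)
    na_positions = [i for i, v in enumerate(values) if v is NA]
    outcomes = set()
    for fill in product([True, False], repeat=len(na_positions)):
        candidate = list(values)
        for pos, val in zip(na_positions, fill):
            candidate[pos] = val
        outcomes.add(reducer(candidate))
    return outcomes.pop() if len(outcomes) == 1 else NA


print(resolve([True, NA], any) is True)   # already decided by the True
print(resolve([False, NA], any) is NA)    # depends on what the NA hides
print(resolve([True, NA], all) is NA)     # depends on what the NA hides
print(resolve([False, NA], all) is False) # already decided by the False
```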

@TomAugspurger
Contributor

Should that example have a skipna=False?

I guess for any that makes sense. For .all, Series([True, pd.NA], dtype="boolean").all(skipna=False) would be NA?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 18, 2019

> Should that example have a skipna=False?

Yes, updated

> For .all, Series([True, pd.NA], dtype="boolean").all(skipna=False) would be NA?

Yes, that would be consistent with the logical op behaviour (True & NA = NA)

@TomAugspurger
Contributor

TomAugspurger commented Nov 18, 2019

Thanks. So, just to make sure, can you verify this table, and add it to the original post if it's correct (and eventually the docs)?

| case | input | output |
|---|---|---|
| 1 | all([True, NA], skipna=False) | NA |
| 2 | all([False, NA], skipna=False) | False |
| 3 | all([NA], skipna=False) | NA |
| 4 | all([], skipna=False) | True |
| 5 | any([True, NA], skipna=False) | True |
| 6 | any([False, NA], skipna=False) | NA |
| 7 | any([NA], skipna=False) | NA |
| 8 | any([], skipna=False) | False |

| case | input | output |
|---|---|---|
| 9 | all([True, NA], skipna=True) | True |
| 10 | all([False, NA], skipna=True) | False |
| 11 | all([NA], skipna=True) | True |
| 12 | all([], skipna=True) | True |
| 13 | any([True, NA], skipna=True) | True |
| 14 | any([False, NA], skipna=True) | False |
| 15 | any([NA], skipna=True) | False |
| 16 | any([], skipna=True) | False |

@jorisvandenbossche
Member Author

Thanks for that overview! And yes, that seems correct based on my understanding.

This also seems to be what R is doing: https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/any (except that their default "skipna" is the opposite: na.rm=FALSE is the default)

@TomAugspurger
Contributor

OK. I'm happy to deviate from R in the default skipna.

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 18, 2019

> I'm happy to deviate from R in the default skipna.

And that's the case for all our reductions anyway

@WillAyd
Member

WillAyd commented Nov 19, 2019

Sorry, I'm trying to read through all of the related items but not seeing it: what did we ultimately decide for comparison operators? IMO any should match whatever "or" does and all should match whatever "and" does.

@jorisvandenbossche
Member Author

The logical operations (and, or) are being discussed in #28778 (and implemented for pd.NA in #29597).
The table that Tom made with the behaviour for all possible cases of any/all matches the current conclusion / implementation in those issues regarding logical operations ("Kleene logic" or "three-valued logic", giving eg True | NA == True, True & NA == NA and False & NA == False; see the bottom of this comment: #28778 (comment))

@WillAyd
Member

WillAyd commented Nov 19, 2019

Gotcha thanks! So yea I think I agree with Tom's table then - any / all should follow the logic rules of OR / AND respectively across all elements
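
The idea that any / all are OR / AND applied across all elements can be sketched as a fold over binary Kleene operators. This is an illustration under assumptions, not the pandas code: `NA` is a hypothetical sentinel standing in for `pd.NA`, and the function names are made up for this example.

```python
from functools import reduce

NA = object()  # hypothetical stand-in for pd.NA


def kleene_or(a, b):
    if a is True or b is True:
        return True   # True | anything == True
    if a is NA or b is NA:
        return NA     # False | NA == NA, NA | NA == NA
    return False


def kleene_and(a, b):
    if a is False or b is False:
        return False  # False & anything == False
    if a is NA or b is NA:
        return NA     # True & NA == NA, NA & NA == NA
    return True


def any_skipna_false(values):
    # Fold OR across the elements, starting from OR's identity (False).
    return reduce(kleene_or, values, False)


def all_skipna_false(values):
    # Fold AND across the elements, starting from AND's identity (True).
    return reduce(kleene_and, values, True)


print(any_skipna_false([True, NA]) is True)
print(any_skipna_false([False, NA]) is NA)
print(all_skipna_false([True, NA]) is NA)
print(all_skipna_false([False, NA]) is False)
```

Starting each fold from the operator's identity element also gives the empty-input rows of the table (False for any, True for all).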


4 participants