any/all reductions on boolean object-typed Series #27709

Open
xhochy opened this issue Aug 2, 2019 · 5 comments

Comments

@xhochy
Contributor

commented Aug 2, 2019

While implementing a boolean-based ExtensionArray, I stumbled on the case that boolean arrays with missing values (which can only be object-typed in pandas) show what looks like undefined behaviour in pandas reductions with skipna=False.

According to the docstring of Series.any(skipna=False), the following cases should all return True:

pd.Series([False, None]).any(skipna=False)
# None
pd.Series([None, False]).any(skipna=False)
# False
pd.Series([False, np.nan]).any(skipna=False)
# nan
pd.Series([np.nan, False]).any(skipna=False)
# nan

Whereas when you do the same operation on float columns the behaviour is as documented:

pd.Series([np.nan, 0.]).any(skipna=False)
# True
pd.Series([0, np.nan]).any(skipna=False)
# True
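
The float results follow from NaN being truthy in Python: only a float equal to zero is falsy. A minimal sketch using built-ins only (no pandas involved), on the assumption that the float path ends up testing plain truthiness:

```python
nan = float("nan")

# NaN is not equal to 0.0, so it counts as truthy.
nan_is_truthy = bool(nan)

# With NaN truthy, any() over floats is True whenever a NaN is present.
float_results = [any([nan, 0.0]), any([0.0, nan])]
print(nan_is_truthy, float_results)
```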

As I have not found a unit test for the above-mentioned case with a boolean object column, I suspect that this is undefined behaviour rather than intended.

Three solutions come to my mind:

  1. Document this behaviour in the Series.any() docstring.
  2. Align the behaviour of pd.Series(booleans, dtype=object).any(…) with pd.Series(booleans, dtype=object).astype(float).any(…).
  3. Raise an error when calling any/all on a mixed-typed boolean series.
@jorisvandenbossche

Member

commented Aug 2, 2019

There is an open issue about this already, I think (will try to look it up). IIRC, the bottom line is that this is numpy behaviour.
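
One way to see why the object-typed results look the way they do: as far as I understand it, a logical-or reduction over Python objects behaves like chaining Python's `or`, which returns one of its operands instead of a bool. The sketch below is plain Python, not numpy or pandas code, and `or_reduce` is a hypothetical helper name; it reproduces the four results reported at the top of this issue under that assumption.

```python
import math
from functools import reduce


def or_reduce(values):
    """Fold values with Python's `or`, which returns an operand, not a bool."""
    return reduce(lambda a, b: a or b, values)


nan = float("nan")

r1 = or_reduce([False, None])  # False or None -> None
r2 = or_reduce([None, False])  # None or False -> False
r3 = or_reduce([False, nan])   # False or nan -> nan
r4 = or_reduce([nan, False])   # nan is truthy, so `or` stops at nan
print(r1, r2, r3, r4)
```

The order dependence between `[False, None]` and `[None, False]` comes from `or` returning the last operand when every operand is falsy.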

@xhochy

Contributor Author

commented Aug 2, 2019

@jorisvandenbossche Thanks! I wasn't able to find that one. So the best solution is to add an example to the docs then?

@jorisvandenbossche

Member

commented Aug 2, 2019

See eg #12863 (I seem to remember another issue where I participated, but can't find anything)

@jorisvandenbossche

Member

commented Aug 2, 2019

> So the best solution is to add an example to the docs then?

Not fully sure. From quickly reading #12863, it seems the idea is that this could be fixed in pandas. And there are also open PRs on the numpy side.

@xhochy

Contributor Author

commented Aug 2, 2019

I've read through the open and closed PRs and issues and am still confused. The issues were in general more about supporting any/all on object columns of any type, not only bool. I'm being a bit more specific here and only asking about booleans.

# These return True
any([np.nan, False]) 
any([False, np.nan])
# These return False 
any([None, False])
any([False, None])

The above operations yield different results, whereas according to our documentation we would want True as the result for all of them. Most of the issues argue that pandas/numpy should align with the built-in Python behaviour, but that alignment would no longer hold if every case returned True.

I would therefore actually adjust the code according to option 2, but this would be a behaviour-breaking change:

> Align the behaviour of pd.Series(booleans, dtype=object).any(…) with pd.Series(booleans, dtype=object).astype(float).any(…).
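
To make option 2 concrete, here is a rough pure-Python sketch of the semantics it would imply: missing entries (None or NaN) count as truthy, as they would after a cast to float. The helper names `is_missing`, `any_keep_na` and `all_keep_na` are made up for illustration and are not pandas API.

```python
def is_missing(value):
    """True for None and for float NaN (NaN is the only float unequal to itself)."""
    return value is None or (isinstance(value, float) and value != value)


def any_keep_na(values):
    """any() where missing entries count as truthy, as after a float cast."""
    return any(is_missing(v) or bool(v) for v in values)


def all_keep_na(values):
    """all() under the same rule: only real falsy values make it False."""
    return all(is_missing(v) or bool(v) for v in values)


nan = float("nan")
any_results = [
    any_keep_na([False, None]),
    any_keep_na([None, False]),
    any_keep_na([False, nan]),
    any_keep_na([nan, False]),
]
print(any_results)
```

With this rule the result no longer depends on element order or on whether the missing marker is None or NaN, at the cost of diverging from the built-in `any([None, False])`.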
