Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: bool(pd.NA) #38224

Open
jreback opened this issue Dec 2, 2020 · 12 comments
Open

API: bool(pd.NA) #38224

jreback opened this issue Dec 2, 2020 · 12 comments
Labels
API Design Ice Cream Agreement Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays

Comments

@jreback
Copy link
Contributor

jreback commented Dec 2, 2020

xref #38102 (comment)

for np.nan we define bool(np.nan)

In [189]: 1 or np.nan                                                                                                                                                                     
Out[189]: 1

In [190]: np.nan or 1                                                                                                                                                                     
Out[190]: nan

however we don't do this for pd.NA

>>> pd.NA or 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
    raise TypeError("boolean value of NA is ambiguous")
TypeError: boolean value of NA is ambiguous

I think this is pretty odd and not very useful.

@jreback jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Dec 2, 2020
@jreback jreback added this to the 1.3 milestone Dec 2, 2020
@jreback
Copy link
Contributor Author

jreback commented Dec 2, 2020

cc @pandas-dev/pandas-core if anyone has a reference to this original discussion.

@shoyer
Copy link
Member

shoyer commented Dec 2, 2020

bool(nan) is dictated by Python, not NumPy:

>>> bool(float('nan'))
True
>>> type(np.nan)
float

I doubt it was picked very intentionally, though.

In my opinion, the current behavior for pd.NA is better, especially in the context of a missing value for boolean dtypes.

@TomAugspurger
Copy link
Contributor

Agreed with Stephan. Even if we had a typed pd.NA[T], for non-bool types I think raising would still be appropriate since most (all?) types will have a Falsey value, like 0 for integers.

Since Python requires that bool() return a True or False, raising is the best option.

@jorisvandenbossche jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Dec 2, 2020
@jorisvandenbossche
Copy link
Member

As for the history of it / original discussion: this behaviour was already included in the PR originally adding pd.NA (#29597), but apart from one comment of Tom (#29597 (comment)), there was not much discussion about this specific aspect (also not in the main NA issue #28095).

Personally, I would rather say that the behaviour of bool(np.nan) can be unexpected for users (intuitively, on first thought, I would expect it to return False).
But that probably also proves the point: IMO there is no clear "best" return value (should it be True or False?), so raising the error is then a "refuse to guess".

A general interpretation of pd.NA is that it represents "unknown" values. But so if the value is unknown, it is also unknown whether it is truthy or not. So this way, I think we can certainly give a reasonable explanation for the behaviour.

Yes, it will be annoying to deal with in some cases. But at least it ensures you deal with it explicitly.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Dec 2, 2020

But that probably also proves the point: IMO there is no clear "best" return value (should it be True or False?), so raising the error is then a "refuse to guess".

A general interpretation of pd.NA is that it represents "unknown" values. But so if the value is unknown, it is also unknown whether it is truthy or not. So this way, I think we can certainly give a reasonable explanation for the behaviour.

Yes, it will be annoying to deal with in some cases. But at least it ensures you deal with it explicitly.

I really support this reasoning. pd.NA means "missing". You know nothing. It ain't true. It ain't false. It's not an integer, a string, or a float. It's nebulous.

@jbrockmendel
Copy link
Member

I agree with everything said so far about the intended semantics. But I'm also concerned about the raising example from the OP. I think we're going to be chasing down corner cases for years:

ser = pd.Series(range(3), name=pd.NA)

>>> ser == ser
TypeError: boolean value of NA is ambiguous

ser.name = None
ser[0] = pd.NA
>>> ser == ser    #  <- don't we want res[0] to be pd.NA?
0    False
1     True
2     True
dtype: bool

Caveat: I have not had occasion to use pd.NA in my own work, so my experience with it is exclusively in debugging problems it has caused, which is bound to jaundice my view.

@bashtage
Copy link
Contributor

bashtage commented Dec 2, 2020

@jbrockmendel That seems right to me. NaNs are not equal either. Missing values should also be not equal. in the sense that "objects that are not equal" are False.

@jorisvandenbossche
Copy link
Member

But I'm also concerned about the raising example from the OP

The specific example that sparked the discussion was fill_value or np.nan. But note that (for this very specific case of course) this is not a good idea anyway, because that would not work as expected for fill_value being 0, or False, or .. (anything considered "falsey"). So having pd.NA raise here might actually be a good thing, to not silently do the potentially wrong thing.

ser = pd.Series(range(3), name=pd.NA)
>>> ser == ser
TypeError: boolean value of NA is ambiguous

Note that this is actually also buggy for np.nan as name. It doesn't raise (which is of course much better, to be clear ;)), but it also doesn't propagate as expected. There is actually a TODO about that in the code:

if a.name == b.name:
return a.name
else:
# TODO: what if they both have np.nan for their names?
return None

So also for np.nan being not equal to itself, we need special handling in certain places (again, the fact that it doesn't raise (in this specific example) is friendlier for the user. But to find those cases for us as developer, having it raise can actually help).

There will certainly be a good amount of "chasing down" those cases where we need special NA handling, that will be inevitable I think (but thanks to your refactorings, the example of "name resolution" for ops is now much more centralized, which should make it easier to fix this)

>>> ser == ser # <- don't we want res[0] to be pd.NA?

Indeed, that should be NA instead of False. But because of the assignment, you actually have an object dtype Series, and it is a known issue that NA in object type doesn't yet work: #32931

@simonjayhawkins
Copy link
Member

removing milestone

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@jorisvandenbossche jorisvandenbossche added this to the 2.0 milestone Jun 11, 2021
@jbrockmendel
Copy link
Member

Request for clarification: is bool(pd.NA) raising intrinsically inseparable from having pd.NA propagate in ops?

@jorisvandenbossche
Copy link
Member

Can you clarify your question? Has it mentioned before that those two issues are inseparable? I think it are two aspects of the NA design, and this issue is about the bool(NA) behaviour.

@jbrockmendel
Copy link
Member

Can you clarify your question?

At first glance I think it would be possible to change bool(pd.NA) behavior without changing anything else, but I could also imagine that introducing other inconsistencies. Not advocating a change, just wanted to confirm that the design decision is orthogonal to the other behaviors.

@mroeschke mroeschke removed this from the 2.0 milestone Jan 13, 2023
@jbrockmendel jbrockmendel added the Ice Cream Agreement Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Ice Cream Agreement Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

No branches or pull requests

9 participants