API: bool(pd.NA) #38224

jreback · 2020-12-02T02:06:07Z

for np.nan we define bool(np.nan)

In [189]: 1 or np.nan                                                                                                                                                                     
Out[189]: 1

In [190]: np.nan or 1                                                                                                                                                                     
Out[190]: nan

however we don't do this for pd.NA

>>> pd.NA or 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/missing.pyx", line 360, in pandas._libs.missing.NAType.__bool__
    raise TypeError("boolean value of NA is ambiguous")
TypeError: boolean value of NA is ambiguous

I think this is pretty odd and not very useful.

The text was updated successfully, but these errors were encountered:

jreback · 2020-12-02T02:06:32Z

cc @pandas-dev/pandas-core if anyone has a reference to this original discussion.

shoyer · 2020-12-02T02:11:23Z

bool(nan) is dictated by Python, not NumPy:

>>> bool(float('nan'))
True
>>> type(np.nan)
float

I doubt it was picked very intentionally, though.

In my opinion, the current behavior for pd.NA is better, especially in the context of a missing value for boolean dtypes.

TomAugspurger · 2020-12-02T02:17:24Z

Agreed with Stephan. Even if we had a typed pd.NA[T], for non-bool types I think raising would still be appropriate since most (all?) types will have a Falsey value, like 0 for integers.

Since Python requires that bool() return a True or False, raising is the best option.

jorisvandenbossche · 2020-12-02T07:30:07Z

As for the history of it / original discussion: this behaviour was already included in the PR originally adding pd.NA (#29597), but apart from one comment of Tom (#29597 (comment)), there was not much discussion about this specific aspect (also not in the main NA issue #28095).

Personally, I would rather say that the behaviour of bool(np.nan) can be unexpected for users (intuitively, on first thought, I would expect it to return False).
But that probably also proves the point: IMO there is no clear "best" return value (should it be True or False?), so raising the error is then a "refuse to guess".

A general interpretation of pd.NA is that it represents "unknown" values. But so if the value is unknown, it is also unknown whether it is truthy or not. So this way, I think we can certainly give a reasonable explanation for the behaviour.

Yes, it will be annoying to deal with in some cases. But at least it ensures you deal with it explicitly.

Dr-Irv · 2020-12-02T14:58:24Z

But that probably also proves the point: IMO there is no clear "best" return value (should it be True or False?), so raising the error is then a "refuse to guess".

A general interpretation of pd.NA is that it represents "unknown" values. But so if the value is unknown, it is also unknown whether it is truthy or not. So this way, I think we can certainly give a reasonable explanation for the behaviour.

Yes, it will be annoying to deal with in some cases. But at least it ensures you deal with it explicitly.

I really support this reasoning. pd.NA means "missing". You know nothing. It ain't true. It ain't false. It's not an integer, a string, or a float. It's nebulous.

jbrockmendel · 2020-12-02T15:11:48Z

I agree with everything said so far about the intended semantics. But I'm also concerned about the raising example from the OP. I think we're going to be chasing down corner cases for years:

ser = pd.Series(range(3), name=pd.NA)

>>> ser == ser
TypeError: boolean value of NA is ambiguous

ser.name = None
ser[0] = pd.NA
>>> ser == ser    #  <- don't we want res[0] to be pd.NA?
0    False
1     True
2     True
dtype: bool

Caveat: I have not had occasion to use pd.NA in my own work, so my experience with it is exclusively in debugging problems it has caused, which is bound to jaundice my view.

bashtage · 2020-12-02T15:18:21Z

@jbrockmendel That seems right to me. NaNs are not equal either. Missing values should also be not equal. in the sense that "objects that are not equal" are False.

jorisvandenbossche · 2020-12-02T21:11:27Z

But I'm also concerned about the raising example from the OP

The specific example that sparked the discussion was fill_value or np.nan. But note that (for this very specific case of course) this is not a good idea anyway, because that would not work as expected for fill_value being 0, or False, or .. (anything considered "falsey"). So having pd.NA raise here might actually be a good thing, to not silently do the potentially wrong thing.

ser = pd.Series(range(3), name=pd.NA)
>>> ser == ser
TypeError: boolean value of NA is ambiguous

Note that this is actually also buggy for np.nan as name. It doesn't raise (which is of course much better, to be clear ;)), but it also doesn't propagate as expected. There is actually a TODO about that in the code:

pandas/pandas/core/ops/common.py

Lines 115 to 119 in 4c8d66e

    
           if a.name == b.name: 
        
               return a.name 
        
           else: 
        
               # TODO: what if they both have np.nan for their names? 
        
               return None

So also for np.nan being not equal to itself, we need special handling in certain places (again, the fact that it doesn't raise (in this specific example) is friendlier for the user. But to find those cases for us as developer, having it raise can actually help).

There will certainly be a good amount of "chasing down" those cases where we need special NA handling, that will be inevitable I think (but thanks to your refactorings, the example of "name resolution" for ops is now much more centralized, which should make it easier to fix this)

>>> ser == ser # <- don't we want res[0] to be pd.NA?

Indeed, that should be NA instead of False. But because of the assignment, you actually have an object dtype Series, and it is a known issue that NA in object type doesn't yet work: #32931

simonjayhawkins · 2021-06-11T12:22:47Z

removing milestone

jbrockmendel · 2021-12-17T21:25:28Z

Request for clarification: is bool(pd.NA) raising intrinsically inseparable from having pd.NA propagate in ops?

jorisvandenbossche · 2021-12-17T21:31:47Z

Can you clarify your question? Has it mentioned before that those two issues are inseparable? I think it are two aspects of the NA design, and this issue is about the bool(NA) behaviour.

jbrockmendel · 2021-12-17T21:36:13Z

Can you clarify your question?

At first glance I think it would be possible to change bool(pd.NA) behavior without changing anything else, but I could also imagine that introducing other inconsistencies. Not advocating a change, just wanted to confirm that the design decision is orthogonal to the other behaviors.

jreback added Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate API Design labels Dec 2, 2020

jreback added this to the 1.3 milestone Dec 2, 2020

jorisvandenbossche added the NA - MaskedArrays Related to pd.NA and nullable extension arrays label Dec 2, 2020

simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021

jorisvandenbossche added this to the 2.0 milestone Jun 11, 2021

jorisvandenbossche mentioned this issue Dec 21, 2022

BUG: or, and and not operators not correctly implemented for pd.NA #49828

Open

3 tasks

mroeschke removed this from the 2.0 milestone Jan 13, 2023

jbrockmendel added the Ice Cream Agreement Issues that would be addressed by the Ice Cream Agreement from the Aug 2023 sprint label Oct 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: bool(pd.NA) #38224

API: bool(pd.NA) #38224

jreback commented Dec 2, 2020

jreback commented Dec 2, 2020

shoyer commented Dec 2, 2020

TomAugspurger commented Dec 2, 2020

jorisvandenbossche commented Dec 2, 2020

Dr-Irv commented Dec 2, 2020

jbrockmendel commented Dec 2, 2020

bashtage commented Dec 2, 2020 •

edited

Loading

jorisvandenbossche commented Dec 2, 2020

simonjayhawkins commented Jun 11, 2021

jbrockmendel commented Dec 17, 2021

jorisvandenbossche commented Dec 17, 2021

jbrockmendel commented Dec 17, 2021

API: bool(pd.NA) #38224

API: bool(pd.NA) #38224

Comments

jreback commented Dec 2, 2020

jreback commented Dec 2, 2020

shoyer commented Dec 2, 2020

TomAugspurger commented Dec 2, 2020

jorisvandenbossche commented Dec 2, 2020

Dr-Irv commented Dec 2, 2020

jbrockmendel commented Dec 2, 2020

bashtage commented Dec 2, 2020 • edited Loading

jorisvandenbossche commented Dec 2, 2020

simonjayhawkins commented Jun 11, 2021

jbrockmendel commented Dec 17, 2021

jorisvandenbossche commented Dec 17, 2021

jbrockmendel commented Dec 17, 2021

bashtage commented Dec 2, 2020 •

edited

Loading