BUG `pd.NA` not treated correctly in `where` and `mask` operations #53124

Charlie-XIAO · 2023-05-07T00:58:11Z

Closes BUG: NA value doesn't match mask condition, still masked #52955
Tests added and passed
All code checks passed
Added an entry in the latest doc/source/whatsnew/v2.1.0.rst file

Suppose we have

>>> ser = pd.Series([1, 2, pd.NA], name="int64_col", dtype=pd.Int64Dtype())
>>> ser
0       1
1       2
2    <NA>
Name: int64_col, dtype: Int64

In the above example, if we use a condition such as ser % 2 == 1, then there will be pd.NA in cond. I'm not sure which of the following would be the desired behavior: (1) an entry propagates through both where and mask (except for some really special cases) if cond evaluates to pd.NA, (2) we raise an error if the cond of any entry evaluates to pd.NA (in other words, users should call fillna themselves in advance), or (3) we provide an additional keyword for users to specify how they want to treat entries for which cond evaluates to pd.NA. (Or maybe none of the above is the desired behavior, I'm not sure about that.)

This PR is currently implementing the first approach. Please let me know if maintainers prefer some other approaches.

To be more specific

(1)

>>> ser.mask(ser % 2 == 1, 0)  # cond evaluates to pd.NA for row 2, so that entry propagates through mask
0       0
1       2
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.where(ser % 2 == 1, 0)  # Row 2 propagates through where as well
0       1
1       0
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.mask(ser ** 0 == 1, 0)  # cond evaluates to True for pd.NA here, so row 2 does not propagate through mask
0    0
1    0
2    0
Name: int64_col, dtype: Int64

(2)

I don't think this is the right way to go. This can affect the behavior of the following:

>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0  # If we take Approach 2, then this will raise an error
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

(3)

Provide a new keyword that defaults to True.

>>> ser.mask(ser % 2 == 1, 0, cond_na=False)  # If cond evaluates to pd.NA, treat it as False, so row 2 is not replaced
0       0
1       2
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.where(ser % 2 == 1, 0, cond_na=False)  # Similar to above
0    1
1    0
2    0
Name: int64_col, dtype: Int64

topper-123 · 2023-05-07T09:13:43Z

The problem is also present if we operate using a BooleanArray instead of a Series:

>>> arr = ser.array
>>> arr
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
>>> ser.mask(arr % 2 == 1, 0)
0    0
1    2
2    0
dtype: Int64

Can you fix this case also?

Charlie-XIAO · 2023-05-07T13:56:02Z

@topper-123 Thanks for your review. I have pushed a fix for the BooleanArray case.

topper-123

Looks good, just a couple of changes.

pandas/core/generic.py

pandas/tests/frame/indexing/test_mask.py

Charlie-XIAO · 2023-05-08T07:05:14Z

@topper-123 I'm a little confused about the following case:

>>> df = pd.DataFrame([[1, pd.NA], [pd.NA, 2]], dtype=pd.Int64Dtype())
>>> df
      0     1
0     1  <NA>
1  <NA>     2
>>> df.mask(df[0] % 2 == 1, 0)
      0  1
0     0  0
1  <NA>  2

Is this really the desired behavior? Here the case is:

>>> df[0] % 2 == 1
0    True
1    <NA>

The first row has cond=True so it gets masked. The second row has cond=pd.NA so it propagates through the mask operation. If this is indeed the desired behavior, I think I should reword the sentence "pd.NA propagates in mask and where operations". Otherwise, can you suggest what the correct output should be?

Thank you very much!

(PS: I will assume that the above is the correct behavior for now.)

topper-123 · 2023-05-08T12:05:44Z

The new behavior looks correct to me:

for mask: change where cond is True, i.e. don't change where it is False or NA
for where: change where cond is False, i.e. don't change where it is True or NA

I think rewording could make it clearer, would be good if you'd update a bit.
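The two rules above are consistent with each other because negating a nullable boolean leaves NA as NA. A minimal sketch, assuming the "leave NA entries unchanged" reading is emulated explicitly with fillna rather than relying on any particular pandas version:

```python
import pandas as pd

cond = pd.array([True, False, pd.NA], dtype="boolean")
inverted = ~cond  # Kleene negation: [False, True, <NA>], NA stays NA

ser = pd.Series([1, 2, 3], dtype="Int64")

# mask changes entries where cond is True; NA is filled with False (no change).
via_mask = ser.mask(pd.Series(cond).fillna(False), 0)

# where changes entries where cond is False; NA is filled with True (no change).
via_where = ser.where(pd.Series(inverted).fillna(True), 0)
```

Both calls replace only the first entry, so masking on cond and keeping on its negation agree once NA is mapped to "don't change" on each side.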

jbrockmendel · 2023-05-08T15:35:15Z

pandas/core/generic.py

@@ -9869,6 +9869,8 @@ def _where(
        # align the cond to same shape as myself
        cond = common.apply_if_callable(cond, self)
        if isinstance(cond, NDFrame):
+            # GH #52955: if cond is NA, element propagates in mask and where
+            cond = cond.fillna(True)


jbrockmendel · 2023-05-08

has the option of just raising on NAs been discussed? seems ambiguous and a general PITA.

Charlie-XIAO · 2023-05-08

If you mean raising in where and mask, no, we haven't discussed that yet. If you mean raising in _where, I think this is not desired, since then the following will not work:

>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0  # This will raise an error, which I assume is undesired
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

jbrockmendel · 2023-05-08

i would just have that raise too, yes.

Charlie-XIAO · 2023-05-08

@jbrockmendel I think the above code snippet actually works for versions v2.0.x, do we really want to change its behavior? @topper-123 I think we may need further discussion about the desired behavior of _where, i.e., propagate or raise. I will postpone the rewording mentioned in #53124 (comment) until maintainers reach an agreement.

topper-123 · 2023-05-08

IMO we should accept BooleanArrays (and Series/DataFrame containing BooleanArrays/ArrowArray[bool]) as conditional here. I think it will be surprising if those data structures work in loc and not here.

Does similar functionality raise in any other methods? I don't recall any.

Charlie-XIAO · 2023-06-11

Hi @jbrockmendel any updates on this?

github-actions · 2023-07-13T00:06:21Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Charlie-XIAO · 2023-07-16T09:35:03Z

I'm still interested in working on this, but maintainers have not reached an agreement yet.

jbrockmendel · 2023-07-28T23:10:42Z

@MarcoGorelli @phofl @mroeschke what would you expect from each of obj.mask(cond_frame, other), obj.where(cond_frame, other), and df[cond_frame] = other when cond_frame is nullable bool and contains NAs? I say we should raise in all three (with deprecation cycle where necessary)

mroeschke · 2023-07-31T22:19:52Z

Makes sense to raise to me

Charlie-XIAO · 2023-08-01T03:39:30Z

Sure, I will make the change soon.

phofl · 2023-08-01T08:00:41Z

Pretty sure that we changed this to be used as False before 2.0 came out, that's a bit annoying

Charlie-XIAO · 2023-08-29T17:00:44Z

Sorry for the late follow-up @jbrockmendel @mroeschke. I have made the suggested changes: now obj.mask(cond_frame, other) and obj.where(cond_frame, other), when cond_frame contains NA values, raise ValueError. Not sure if what I've done is the desired behavior.

There seems to be a lot more to do since, as @phofl has also mentioned, NA has been treated as False in nullable boolean arrays since 1.0.2. There will be more code to change (updating error messages and updating tests), but I just want to make sure I'm on the right track. (See also #31591 and What's new 1.0.2)
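For readers who want the "raise on NA" behavior regardless of what pandas does, a user-side guard is one option. This is a sketch only: `require_no_na` is a hypothetical helper and its error message is illustrative, not the one used in this PR.

```python
import numpy as np
import pandas as pd


def require_no_na(cond):
    """Hypothetical guard (not a pandas API): reject conditions containing NA."""
    if np.asarray(pd.isna(cond)).any():
        raise ValueError(
            "cond contains NA values; fill them explicitly, e.g. cond.fillna(False)"
        )
    return cond


ser = pd.Series([1, 2, pd.NA], dtype="Int64")
cond = ser % 2 == 1  # nullable boolean: [True, False, <NA>]

message = None
try:
    ser.mask(require_no_na(cond), 0)
except ValueError as exc:
    message = str(exc)  # the guard fires before mask is called

# After an explicit fill the guard passes and mask behaves unambiguously.
result = ser.mask(require_no_na(cond.fillna(False)), 0)
```

The guard accepts a Series, DataFrame, or extension array, since `pd.isna` handles all three and the result is converted to a NumPy array before the check.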

jbrockmendel · 2023-08-31T17:11:50Z

So at the sprint we decided a long-term plan where pd.NA would be treated as false in these cases. I'm not sure if there is a plan for how to get there. Apologies for the indecisiveness.

mroeschke · 2023-11-07T00:48:14Z

Thanks for the PR, but it appears this probably needs more discussion on the issue before proceeding with a solution here. Closing for now, but happy for you to engage in the discussion there.

Charlie-XIAO added 2 commits May 6, 2023 23:06

make NA propagate where and mask operations

89c0f3d

changelog added

321147f

topper-123 added the Bug and NA - MaskedArrays (Related to pd.NA and nullable extension arrays) labels May 7, 2023

fix when using boolean arrays

36bbe16

topper-123 reviewed May 8, 2023

Charlie-XIAO and others added 2 commits May 8, 2023 07:21

added tests, reword NA propagates -> if cond=NA then element propagates

e2216cb

Merge branch 'main' into na-masked-unexp

5a45a29

jbrockmendel reviewed May 8, 2023

Charlie-XIAO and others added 5 commits May 8, 2023 17:05

avoid multiple fillna when unnecessary

9875669

Merge branch 'main' into na-masked-unexp

8381aba

Merge branch 'main' into na-masked-unexp

5a41560

Merge branch 'main' into na-masked-unexp

8af09df

Merge branch 'main' into na-masked-unexp

3859bff

github-actions bot added the Stale label Jul 13, 2023

Charlie-XIAO added 2 commits July 14, 2023 16:35

Merge remote-tracking branch 'upstream/main' into na-masked-unexp

c1d43c8

Merge branch 'na-masked-unexp' of https://github.com/Charlie-XIAO/pandas into na-masked-unexp

c542727

Merge branch 'main' into na-masked-unexp

8140c5b

Merge branch 'main' into na-masked-unexp

a2151be

Merge remote-tracking branch 'upstream/main' into na-masked-unexp

6f90c1c

Charlie-XIAO added 4 commits August 29, 2023 18:36

Merge branch 'na-masked-unexp' of https://github.com/Charlie-XIAO/pandas into na-masked-unexp

394d4bb

Merge remote-tracking branch 'upstream/main' into na-masked-unexp

1cc6208

raise in where and mask if cond is nullable bool with NAs

09f62bc

Merge remote-tracking branch 'upstream/main' into na-masked-unexp

b55f411

Charlie-XIAO added 2 commits August 30, 2023 11:17

remove conflicting (?) test and improve message

cbbd866

Merge remote-tracking branch 'upstream/main' into na-masked-unexp

3a34a85

mroeschke closed this Nov 7, 2023

Charlie-XIAO deleted the na-masked-unexp branch November 7, 2023 00:55
