Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG pd.NA not treated correctly in where and mask operations #53124

Closed
wants to merge 21 commits into from

Conversation

Charlie-XIAO
Copy link
Contributor

@Charlie-XIAO Charlie-XIAO commented May 7, 2023

Suppose we have

>>> ser = pd.Series([1, 2, pd.NA], name="int64_col", dtype=pd.Int64Dtype())
>>> ser
0       1
1       1
2    <NA>
Name: int64_col, dtype: Int64

In the above example, if use a condition such as ser % 2 == 1, then there will be pd.NA in cond. I'm not sure which of the following would be the desired behavior: (1) an entry propagates through both where and mask (expect for some really special cases) if cond evaluates to pd.NA, (2) we raise an error message if the cond of any entry evaluates to pd.NA (in other words, users should fillna themselves in advance), or (3) provide an additional keyword for users to specify how they want to treat entries for which cond evaluates. (Or maybe none of the above is the desired behavior, I'm not sure about that.)

This PR is currently implementing the first approach. Please let me know if maintainers prefer some other approaches.

To be more specific

(1)

>>> ser.mask(ser % 2 == 1, 0)  # The 2nd row evaluates to pd.NA, thus propagates through mask
0       0
1       2
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.where(ser % 2 == 1, 0)  # The 2nd row propagates through where as well
0       1
1       0
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.mask(ser ** 0 == 1, 0)  # cond evaluates to True for pd.NA here, so 2nd row does not propagate through mask
0    0
1    0
2    0
Name: int64_col, dtype: Int64

(2)

I don't think this is the right way to go. This can affect the behavior of the following:

>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0  # If we take Approach 2, then this will raise an error
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

(3)

Provide a new keyword that defaults to True.

>>> ser.mask(ser % 2 == 1, 0, cond_na=False)  # If cond evaluates to pd.NA, treat it as False, so 2nd row is not replaced
0       0
1       2
2    <NA>
Name: int64_col, dtype: Int64
>>> ser.where(ser % 2 == 1, 0, cond_na=False)  # Similar to above
0    1
1    0
2    0
Name: int64_col, dtype: Int64

@topper-123
Copy link
Contributor

topper-123 commented May 7, 2023

The problem is also present if we operate using a BooleanArray instead of a Series:

>>> arr = ser.array
>>> arr
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
>>> ser.mask(arr % 2 == 1, 0]
0    0
1    2
2    0
dtype: Int64

Can you fix this case also?

@topper-123 topper-123 added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels May 7, 2023
@Charlie-XIAO
Copy link
Contributor Author

@topper-123 Thanks for your review. I have pushed a fix when using BooleanArray.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a couple of changes.

pandas/core/generic.py Outdated Show resolved Hide resolved
pandas/tests/frame/indexing/test_mask.py Outdated Show resolved Hide resolved
@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented May 8, 2023

@topper-123 I'm a little bit of confused about the following case:

>>> df = pd.DataFrame([[1, pd.NA], [pd.NA, 2]], dtype=pd.Int64Dtype())
>>> df
      0     1
0     1  <NA>
1  <NA>     2
>>> df.mask(df[0] % 2 == 1, 0)
      0  1
0     0  0
1  <NA>  2

Is this really the desired behavior? Here the case is:

>>> df[0] % 2 == 1
0    True
1    <NA>

The first row has cond=True so it gets masked. The second row has cond=pd.NA so it propagates through the mask operation. If this is indeed the desired behavior, I think I should reword the sentence pd.NA propagates in mask and where operations. Otherwise, can you suggest what should be the correct output?

Thank you very much!

(PS: I will assume that the above is the correct behavior for now.)

@topper-123
Copy link
Contributor

@topper-123 I'm a little bit of confused about the following case:

>>> df = pd.DataFrame([[1, pd.NA], [pd.NA, 2]], dtype=pd.Int64Dtype())
>>> df
      0     1
0     1  <NA>
1  <NA>     2
>>> df.mask(df[0] % 2 == 1, 0)
      0  1
0     0  0
1  <NA>  2

Is this really the desired behavior? Here the case is:

>>> df[0] % 2 == 1
0    True
1    <NA>

The first row has cond=True so it gets masked. The second row has cond=pd.NA so it propagates through the mask operation. If this is indeed the desired behavior, I think I should reword the sentence pd.NA propagates in mask and where operations. Otherwise, can you suggest what should be the correct output?

Thank you very much!

(PS: I will assume that the above is the correct behavior for now.)

The new behavior looks correct to me:

  1. for mask: change where cond is True, i.e. don't change where it is False or NA
  2. for where: change where cond is False, i.e. don't change where it is True or NA

I think rewording could make it clearer, would be good if you'd update a bit.

@@ -9869,6 +9869,8 @@ def _where(
# align the cond to same shape as myself
cond = common.apply_if_callable(cond, self)
if isinstance(cond, NDFrame):
# GH #52955: if cond is NA, element propagates in mask and where
cond = cond.fillna(True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

has the option of just raising on NAs been discussed? seems ambiguous and a general PITA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are saying raising in where and mask, no we haven't discussed yet. If you are saying raising in _where, I think this is not desired since then, the following will not work:

>>> df = pd.DataFrame(np.random.random((3, 3)), dtype=pd.Float64Dtype())
>>> df[0][0] = pd.NA
>>> df
          0         1         2
0      <NA>  0.609241  0.419094
1  0.274784  0.342904  0.026101
2  0.670259  0.218889  0.177126
>>> df[df >= 0.5] = 0  # This will raise an error, which I assume is undesired
>>> df
          0         1         2
0      <NA>       0.0  0.419094
1  0.274784  0.342904  0.026101
2       0.0  0.218889  0.177126

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would just have that raise too, yes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel I think the above code snippet actually works for versions v2.0.x, do we really want to change its behavior? @topper-123 I think we may need further discussion about the desired behavior of _where, i.e., propagate or raise. I will postpone the rewording mentioned in #53124 (comment) until maintainers reach an agreement.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should accept BooleanArrays (and Series/DataFrame containing BooleanArrays/ArrowArray[bool]) as conditional here. I think it will be surprising if those data structure work in loc and not here.

Do similar functionality raise in any other methods? I don't recall any.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jbrockmendel any updates on this?

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 13, 2023
@Charlie-XIAO
Copy link
Contributor Author

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

I'm still interested in working on this, but maintainers have not reached an agreement yet.

@jbrockmendel
Copy link
Member

I'm still interested in working on this, but maintainers have not reached an agreement yet.

@MarcoGorelli @phofl @mroeschke what would you expect from each of obj.mask(cond_frame, other), obj.where(cond_frame, other), and df[cond_frame] = other when cond_frame is nullable bool and contains NAs? I say we should raise in all three (with deprecation cycle where necessary)

@mroeschke
Copy link
Member

Makes sense to raise to me

@Charlie-XIAO
Copy link
Contributor Author

Sure, I will make the change soon.

@phofl
Copy link
Member

phofl commented Aug 1, 2023

Pretty sure that we changed this to be used as False before 2.0 came out, that's a bit annoying

@Charlie-XIAO
Copy link
Contributor Author

Charlie-XIAO commented Aug 29, 2023

Sorry for the late follow-up @jbrockmendel @mroeschke. I have made the suggested changes: now obj.mask(cond_frame, other) and obj.where(cond_frame, other), when cond_frame contains NA values, raise ValueError. Not sure if what I've done is the desired behavior.

There seems to be a lot more to do since as @phofl has also mentioned, NA has been used as False in nullable boolean arrays since 1.0.2. There will be more codes to change (updating error messages and updating tests), but I just want to make sure I'm on the right track. (See also #31591 and What's new 1.0.2)

@jbrockmendel
Copy link
Member

So at the sprint we decided a long-term plan where pd.NA would be treated as false in these cases. I'm not sure if there is a plan for how to get there. Apologies for the indecisiveness.

@mroeschke
Copy link
Member

Thanks for the PR, but appears this issue probably needs more discussion on the issue before proceeding with a solution here. Closing for now, but happy for you to engage in the discussion there

@mroeschke mroeschke closed this Nov 7, 2023
@Charlie-XIAO Charlie-XIAO deleted the na-masked-unexp branch November 7, 2023 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: NA value doesn't match mask condition, still masked
5 participants