
WIP: dtype._is_valid_na_for_dtype #51378

Closed · wants to merge 1 commit

Conversation

jbrockmendel (Member):

Some dtypes (pyarrow/nullable mostly) need to treat NaN (and maybe None) differently from pd.NA. Motivating bug (which this PR does not address):

>>> arr = pd.array(range(5)) / 0
>>> np.nan in arr  # <- wrong: arr contains a real NaN (from 0/0), so this should be True
False
What this does fix: arr[2] = np.nan now sets NaN instead of pd.NA, and setting np.nan into an IntegerArray/BooleanArray raises. (None still casts to pd.NA.)
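
A hypothetical session illustrating the intended semantics (variable names are for illustration only, not part of the PR):

>>> arr = pd.array([1.0, 2.0, 3.0], dtype="Float64")
>>> arr[2] = np.nan   # with this PR: stores a real NaN rather than <NA>
>>> arr[2] = None     # still converts to <NA>
>>> iarr = pd.array([1, 2, 3], dtype="Int64")
>>> iarr[2] = np.nan  # with this PR: raises, since Int64 data cannot hold NaN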

This breaks a bunch of tests. I'd like feedback before I go track those down. cc @jorisvandenbossche

Tangentially related #27825

Diff excerpt under review (the tail of the new base-class method):

        The base class implementation considers any scalar recognized by pd.isna
        to be equivalent.
        """
        return libmissing.checknull(value)
jbrockmendel (Member Author) commented on the diff:

could require that pd.NA always be recognized?
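
For context, a stricter per-dtype override might look roughly like this (a hypothetical sketch, not taken from the PR diff):

    def _is_valid_na_for_dtype(self, value) -> bool:
        # hypothetical override for a nullable dtype where NaN is a real
        # float value: only pd.NA (and None) count as this dtype's NA
        return value is pd.NA or value is None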

mroeschke added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) on Feb 14, 2023
jorisvandenbossche (Member):

> What this does fix: arr[2] = np.nan now sets NaN instead of pd.NA, and setting np.nan into an IntegerArray/BooleanArray raises. (None still casts to pd.NA.)

I think this part requires some more discussion, as it is not necessarily a "fix".
This is generally a complex situation with no ideal solution, but for now the conversions to nullable dtypes have been implemented to treat NaN as missing on conversion/input (i.e. when converting from python/numpy data to a nullable-typed series/array).

Take casting a default numpy float64 to nullable Float64: that astype case certainly makes sense. But we currently also have (whether you like it or not):

>>> pd.array([np.nan], dtype="Float64")
<FloatingArray>
[<NA>]
Length: 1, dtype: Float64

And BTW, pyarrow does something similar if the input data is a pandas object, i.e. pa.array(pd.Series([np.nan])) converts the NaN to a null.
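
For reference, a quick way to check that pyarrow behavior (assumes pyarrow is installed; repr abbreviated):

>>> import pyarrow as pa
>>> pa.array(pd.Series([np.nan]))
<pyarrow.lib.DoubleArray object at 0x...>
[
  null
]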

So maybe we should also change the pd.array(..) example, or maybe not. There is certainly something to be said for being able to set an actual NaN via arr[2] = np.nan (otherwise such a result is quite difficult to achieve), but you could also expect NaN to be treated the same in both cases (as input to pd.array, and as the right-hand-side value in a setitem).

Further, no longer treating NaN as NA in nullable float arrays would be a big (breaking) change, since a lot of code uses setitem with NaN expecting it to introduce missing values. So if we want to change this, I think it is something we should deprecate over a long period.

@jbrockmendel
Copy link
Member Author

Agreed that the constructor behavior is a difficult (frustrating!) API problem, and there is a natural link between constructor and __setitem__ behavior. I don't think the desired __contains__ behavior is ambiguous though?

One option to fix the __contains__ behavior is to just patch the relevant EA subclass __contains__ methods. Other ideas?
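
A minimal sketch of such a patch for the masked arrays, assuming the BaseMaskedArray internals (_data holding the values, _mask marking NA positions):

    def __contains__(self, key) -> bool:
        if isinstance(key, float) and np.isnan(key):
            if self.dtype.kind != "f":
                return False  # integer/boolean data cannot hold NaN
            # match only real NaN values, not masked (pd.NA) positions
            return bool(np.isnan(self._data[~self._mask]).any())
        return super().__contains__(key)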

There are a couple other places I intend to use something like dtype._is_valid_na_for_dtype: MultiIndex._get_loc_single_level_index and a few places in the indexing engines that use checknull. If we don't go down this path, I'll go back to the drawing board.

jorisvandenbossche (Member):

> I don't think the desired __contains__ behavior is ambiguous though?

Yes, I think that one is good to fix (one can probably argue there is still some ambiguity, but it's much more of a corner case, and a situation where the user is explicitly checking for that specific value, so it makes sense to be stricter).

I think fixing this locally in the relevant __contains__ implementation/overrides sounds good.

> There are a couple other places I intend to use something like dtype._is_valid_na_for_dtype: MultiIndex._get_loc_single_level_index and a few places in the indexing engines that use checknull.

For indexing, we might also want to distinguish between pd.NA and np.nan? (this is for things like ser.loc[np.nan] ?)
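
A hypothetical illustration of that question, assuming an index that could hold NaN and pd.NA as distinct values (today both collapse to <NA>):

>>> idx = pd.Index([np.nan, pd.NA], dtype="Float64")  # hypothetical
>>> ser = pd.Series([1.0, 2.0], index=idx)
>>> ser.loc[np.nan]   # should this match only the NaN entry?
>>> ser.loc[pd.NA]    # ...and this only the <NA> entry?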

github-actions (Contributor):

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jbrockmendel (Member Author):

Long-term I think we need something like this. For now I'll make a PR that patches the relevant __contains__ method.

Labels: Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Stale
3 participants