
WIP: dtype._is_valid_na_for_dtype #51378

Closed · wants to merge 1 commit

Conversation

jbrockmendel (Member):

Some dtypes (pyarrow/nullable mostly) need to treat NaN (and maybe None) differently from pd.NA. Motivating bug (which this PR does not address):

>>> arr = pd.array(range(5)) / 0
>>> np.nan in arr  # <- wrong: arr contains a real NaN (from 0/0), so this should be True
False
What this does fix: arr[2] = np.nan now sets NaN instead of pd.NA, and setting np.nan into an IntegerArray/BooleanArray raises. (None still casts to pd.NA.)
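
A hypothetical session illustrating the intended semantics (variable names are for illustration only, not part of the PR):

>>> arr = pd.array([1.0, 2.0, 3.0], dtype="Float64")
>>> arr[2] = np.nan   # with this PR: stores a real NaN rather than <NA>
>>> arr[2] = None     # still converts to <NA>
>>> iarr = pd.array([1, 2, 3], dtype="Int64")
>>> iarr[2] = np.nan  # with this PR: raises, since Int64 data cannot hold NaN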

This breaks a bunch of tests. I'd like feedback before I go track those down. cc @jorisvandenbossche

Tangentially related #27825

Diff excerpt under review (the tail of the new base-class method):

        The base class implementation considers any scalar recognized by pd.isna
        to be equivalent.
        """
        return libmissing.checknull(value)
jbrockmendel (Member Author) commented on the diff:

could require that pd.NA always be recognized?
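
For context, a stricter per-dtype override might look roughly like this (a hypothetical sketch, not taken from the PR diff):

    def _is_valid_na_for_dtype(self, value) -> bool:
        # hypothetical override for a nullable dtype where NaN is a real
        # float value: only pd.NA (and None) count as this dtype's NA
        return value is pd.NA or value is None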

mroeschke added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) on Feb 14, 2023
jorisvandenbossche (Member):

> What this does fix: arr[2] = np.nan now sets NaN instead of pd.NA, and setting np.nan into an IntegerArray/BooleanArray raises. (None still casts to pd.NA.)

I think this part requires some more discussion, as it is not necessarily a "fix".
This is generally a complex situation with no ideal solution, but for now the conversions to nullable dtypes have been implemented to treat NaN as missing on conversion/input (i.e. when converting from python/numpy data to a nullable-typed series/array).

Take casting a default numpy float64 to nullable Float64: that astype case certainly makes sense. But we currently also have (whether you like it or not):

>>> pd.array([np.nan], dtype="Float64")
<FloatingArray>
[<NA>]
Length: 1, dtype: Float64

And BTW, pyarrow does something similar if the input data is a pandas object, i.e. pa.array(pd.Series([np.nan])) converts the NaN to a null.
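
For reference, a quick way to check that pyarrow behavior (assumes pyarrow is installed; repr abbreviated):

>>> import pyarrow as pa
>>> pa.array(pd.Series([np.nan]))
<pyarrow.lib.DoubleArray object at 0x...>
[
  null
]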

So maybe we should also change the pd.array(..) example, or maybe not. There is certainly something to be said for being able to set an actual NaN via arr[2] = np.nan (otherwise such a result is quite difficult to achieve), but you could also expect NaN to be treated the same in both cases (as input to pd.array, and as the right-hand-side value in a setitem).

Further, no longer treating NaN as NA in nullable float arrays would be a big (breaking) change, since a lot of code uses setitem with NaN expecting it to introduce missing values. So if we want to change this, I think it is something we should deprecate over a long period.

@jbrockmendel
Copy link
Member Author

Agreed that the constructor behavior is a difficult (frustrating!) API problem, and there is a natural link between constructor and __setitem__ behavior. I don't think the desired __contains__ behavior is ambiguous though?

One option to fix the __contains__ behavior is to just patch the relevant EA subclass __contains__ methods. Other ideas?
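
A minimal sketch of such a patch for the masked arrays, assuming the BaseMaskedArray internals (_data holding the values, _mask marking NA positions):

    def __contains__(self, key) -> bool:
        if isinstance(key, float) and np.isnan(key):
            if self.dtype.kind != "f":
                return False  # integer/boolean data cannot hold NaN
            # match only real NaN values, not masked (pd.NA) positions
            return bool(np.isnan(self._data[~self._mask]).any())
        return super().__contains__(key)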

There are a couple other places I intend to use something like dtype._is_valid_na_for_dtype: MultiIndex._get_loc_single_level_index and a few places in the indexing engines that use checknull. If we don't go down this path, I'll go back to the drawing board.

jorisvandenbossche (Member):

> I don't think the desired __contains__ behavior is ambiguous though?

Yes, I think that one is good to fix (one can probably argue there is still some ambiguity, but it's much more of a corner case, and a situation where the user is explicitly checking for that specific value, so it makes sense to be stricter).

I think fixing this locally in the relevant __contains__ implementation/overrides sounds good.

> There are a couple other places I intend to use something like dtype._is_valid_na_for_dtype: MultiIndex._get_loc_single_level_index and a few places in the indexing engines that use checknull.

For indexing, we might also want to distinguish between pd.NA and np.nan? (this is for things like ser.loc[np.nan] ?)
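
A hypothetical illustration of that question, assuming an index that could hold NaN and pd.NA as distinct values (today both collapse to <NA>):

>>> idx = pd.Index([np.nan, pd.NA], dtype="Float64")  # hypothetical
>>> ser = pd.Series([1.0, 2.0], index=idx)
>>> ser.loc[np.nan]   # should this match only the NaN entry?
>>> ser.loc[pd.NA]    # ...and this only the <NA> entry?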

github-actions (Contributor):

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

jbrockmendel (Member Author):

Long-term I think we need something like this. For now I'll make a PR that patches the relevant __contains__ method.

Labels: Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Stale
3 participants