DISC: path to nullable-by-default integers and floats #58243

Open
jbrockmendel opened this issue Apr 12, 2024 · 3 comments

Labels: Bug, Needs Triage (Issue that has not been reviewed by a pandas team member)

jbrockmendel (Member) commented Apr 12, 2024

I've been giving some thought to how we can move towards having nullable integer/bool dtypes by default (from the Ice Cream Agreement last August).

Terminology note: I am using "nullable" to mean "supports some missing sentinel without taking a stance on what that sentinel is or what semantics it has"

On the user end, I think it will need to be opt-in for a while. This can mirror the pyarrow-hybrid string future option. In the medium term, we can implement hybrid Integer/Boolean dtype/EAs that use nan as their sentinel. This will minimize the behavior changes users see and avoid introducing mixed-propagation behavior. A subsequent deprecation cycle can move to all-propagating.
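
For reference, a minimal sketch of the two propagation behaviours involved, using today's dtypes (pandas 2.x) for illustration; a nan-sentinel hybrid dtype would presumably keep the first behaviour while still allowing a missing value in integer/bool columns:

```python
import numpy as np
import pandas as pd

# NaN-based semantics (numpy float64 today): comparisons involving the
# missing value simply give False, and the result is a plain bool array.
s_nan = pd.Series([1.0, np.nan, 3.0])
print(s_nan > 1)          # [False, False, True], dtype: bool

# NA-based semantics (today's opt-in masked Int64): the missing value
# propagates into the comparison result, which becomes nullable "boolean".
s_na = pd.Series([1, None, 3], dtype="Int64")
print(s_na > 1)           # [False, <NA>, True], dtype: boolean
```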

Open Questions

  • Do we disallow numpy int/bool dtypes entirely?
  • Lots of users have legacy code that says dtype=np.int64; do we warn/raise, or map that to the future dtype (assuming the user has opted in)?
  • Similarly, what if they do df.dtypes == np.int64?

Now that I write that out, I'm talking myself into being strict on this front and avoiding headaches down the road.
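
For the df.dtypes question specifically, a short sketch of what legacy checks see today versus with the masked dtype, to make the strict-vs-lenient trade-off concrete:

```python
import numpy as np
import pandas as pd

s_np = pd.Series([1, 2, 3], dtype=np.int64)   # numpy-backed int64
s_ea = pd.Series([1, 2, 3], dtype="Int64")    # masked extension dtype

print(s_np.dtype == np.int64)   # True  -- what legacy checks rely on
print(s_ea.dtype == np.int64)   # False -- Int64Dtype is not a numpy dtype
print(s_ea.dtype == "Int64")    # True  -- string aliases still match

# If dtype=np.int64 were silently mapped to the future dtype, checks like
# `df.dtypes == np.int64` would quietly start returning False unless the
# comparison is special-cased as well -- the headache being weighed here.
```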

Thoughts?

cc @jorisvandenbossche @phofl

jorisvandenbossche (Member) commented

The migration path and behaviour impact of moving towards the nullable dtypes is something that will have to be described and decided in the future PDEP. So good to start discussing that here!

With the recent experience of the opt-in options pd.options.mode.copy_on_write and pd.options.future.infer_string to enable previews of future changes, I would also go for a similar mechanism for the new nullable dtypes.
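
For reference, the existing preview switches look like this in pandas 2.x; a nullable-dtypes switch could follow the same pattern (the last option name below is purely hypothetical and does not exist today):

```python
import pandas as pd

# Existing opt-in previews of future behaviour (pandas 2.x):
pd.set_option("mode.copy_on_write", True)
pd.set_option("future.infer_string", True)

# A hypothetical switch for nullable-by-default dtypes along the same lines:
# pd.set_option("future.nullable_dtypes", True)   # does not exist today
```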

> In the medium-term, we can implement hybrid Integer/Boolean dtype/EAs that use nan as their sentinel. This will minimize the behavior changes users see and avoids introducing mixed-propagation behavior.

How would that avoid mixed-propagation behaviour? Do you mean in the case that we would only add integer/boolean dtypes with NA, and not yet for other types? (in which case using NaN instead of NA indeed keeps things more consistent)
(in my mind if we start adding an option to enable NA-nullable dtypes, it should be available for all dtypes, in which case you wouldn't have mixed propagation)

> Do we disallow numpy int/bool dtypes entirely?

That's a good question ;), and indeed something we need to discuss more. I personally think there is value in 1) having all dtypes/arrays stored in a pandas object be an ExtensionDtype / ExtensionArray (for internal consistency, but also for users, such that e.g. .dtype always returns a pandas dtype), and 2) having all supported dtypes also support missing values.
But even if we want 1) (only EAs internally) while being OK with a variant (or backend) of our integer dtype that does not support missing values and maps 1:1 to a numpy array, we could have something that is essentially our NumpyExtensionArray, but more official. That could then go beyond just integer/boolean and allow the user to store any numpy array unchanged in a pandas Series (e.g. also datetime64[D], or numpy's fixed-width string and bytes dtypes, which we convert to object dtype now).
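
For context, such a wrapper already exists internally and is exposed as pd.arrays.NumpyExtensionArray (named PandasArray before pandas 2.1); a rough sketch of what a "more official" version would build on:

```python
import numpy as np
import pandas as pd

# Wrap a plain numpy array behind the ExtensionArray interface:
arr = pd.arrays.NumpyExtensionArray(np.array([1, 2, 3], dtype=np.int64))
print(type(arr).__name__)   # NumpyExtensionArray
print(arr.dtype)            # int64 (an ExtensionDtype wrapping np.dtype("int64"))

# Today the Series constructor unwraps this back into a plain numpy-backed
# column; the idea above is essentially a user-facing, first-class version.
```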

Personally I don't think allowing that generally will be useful for the large majority of our users, and it would only be confusing (and in theory it is relatively straightforward to write your own EA to store one of the numpy dtypes we don't handle natively). Specifically for int/bool, what would be the value of allowing it? It would feel a bit like having the ability to say that a certain column can have no nulls (like the nullability flag databases often have for a column in the schema).

> Now that I write that out, I'm talking myself into being strict on this front and avoiding headaches down the road.

On the other hand, it will make the upgrade path a lot smoother if we still allow using numpy dtypes wherever a user specifies a dtype (and automatically translate that to the equivalent pandas dtype).
At least initially we will have to do that anyway, I think, potentially with a warning (as a way to let users test and upgrade to the new dtypes).

jorisvandenbossche (Member) commented

The discussed changes with nullable dtypes are indeed going to have a big impact on users. I personally find it hard to judge whether it will be easier or harder for users to update if we slice it into two separate changes (each change is smaller, but you have (somewhat related) breaking changes twice).

But I thought it might be a good exercise to think through all the potential areas of direct user impact of making this change. That is something we will have to do anyway to document for users (and for the PDEP, to decide whether we are OK with such a set of changes), and it might also give a better idea of the kind of changes we are talking about and how they would be separated if doing the change in two steps.

A first attempt at listing the user impact of moving to nullable extension dtypes:

  • ExtensionArray/Dtype instead of numpy - the underlying values are no longer a numpy array but a pandas ExtensionArray. In general this is mostly hidden from the user, but I think the biggest impact would come from the dtype object:
    • Conversion to numpy (__array__, to_numpy) could change, but I think we are probably planning to keep this the same, even with nullable dtypes. So this might not have a direct visible impact.
      • For example, for the masked dtypes we initially used object dtype with NA, but now for Int64 with NAs we return a float64 array with NaN. For boolean we still return object dtype, but that is consistent with a numpy bool Series into which NaNs get introduced (e.g. because of reindexing), which also becomes object dtype right now.
      • The less visible impact for users of course is that this conversion can be more expensive if there are nulls
    • The .values attribute no longer returns a numpy array but an EA? But we could also decide to keep .values returning the numpy array that __array__ would return, and point users to .array for the EA. We have done this in the past for other dtypes as well: e.g. with a datetime-tz series, .values still returns the numpy array that lost the tz information (this would essentially keep .values as a kind of alias for to_numpy()).
    • The .dtype attribute is no longer a numpy dtype:
      • This can have a significant impact when users pass this dtype object into a context that expects a numpy dtype
      • Other usage patterns can probably be smoothed out by still accepting numpy dtypes for dtype specification in pandas and auto-translating that to our own dtype (e.g. dtype=np.int64), and potentially by allowing them in comparisons with our own dtypes (e.g. the df.dtypes == np.int64 mentioned in the top post)
      • In theory we could also (initially) continue using numpy dtypes in the user interface, similarly to how we still use np.dtype("datetime64[ns]") while under the hood storing the values in a DatetimeArray EA. But this inconsistent situation of an EA without an EA dtype also complicates things and is probably not something we want to have / keep?
  • Supporting missing values in new dtypes (integer/bool) (regardless of using NaN or NA as sentinel)
    • The data types in your DataFrame can change because dtypes are no longer promoted to accommodate missing values (i.e. integer columns now stay integer instead of being cast to float, and similarly bool no longer becomes object); see the first sketch after this list
      • People will have written code that assumes / deals with this casting
    • You can get pandas series/columns with a bool dtype with missing values.
      • I think for usage within pandas this will not have that much impact, because for example in indexing, summing, .. with your mask, pandas will handle that case and typically give the same result (for example, in an indexing context the NaN/NA will be considered False, keeping most cases behaving the same)
      • When the boolean array (which can now contain missing values) is used in a numpy context, however, you can run into errors because numpy will no longer see it as boolean?
      • However, not that many cases will see new missing values in a boolean dtype (where they were not present before), because a typical case will come from the NA propagation in comparisons, which falls under the next point.
  • Behaviour changes with pd.NA nullable semantics (see the second sketch after this list)
    • The scalar you get back (from indexing, result of reduction, ..) is now pd.NA instead of np.nan
      • If we make bool(pd.NA) evaluate to False instead of raising an error, part of the impact will be alleviated
      • Code that expects exactly np.nan (e.g. using np.isnan, or using it as input for a function that can handle NaN but not pd.NA) will be impacted
    • Different propagation of NA in comparison operations (i.e. propagate instead of giving False)
      • This will give rise to more cases of boolean arrays with missing values. As mentioned above, I think usage within pandas will not be impacted that much (since pandas can deal with such data), but when the boolean array is used in a numpy context you can run into errors, because converting our boolean array with missing values to a numpy array will not produce a numpy bool array.
    • Kleene-logic in logical operations (and, or)
      • I suppose this will not have too much breaking impact in practice (it also only comes into play if you have missing values in the boolean mask, so the impact will come from that, not from the Kleene logic)
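
To make the second group concrete, a small sketch using the behaviour of today's opt-in masked dtypes (pandas 2.x): dtypes are preserved when missing values appear, and boolean masks can now carry missing values:

```python
import pandas as pd

# Missing values no longer change the column dtype:
s_np = pd.Series([1, 2, 3], dtype="int64")
s_ea = pd.Series([1, 2, 3], dtype="Int64")
print(s_np.reindex(range(4)).dtype)   # float64 -- cast to fit the NaN
print(s_ea.reindex(range(4)).dtype)   # Int64   -- stays integer, the hole is <NA>

# A boolean mask can now hold missing values; within pandas NA acts as False
# when indexing, but exporting to a plain numpy bool array needs a fill value:
mask = pd.array([True, None, False], dtype="boolean")
print(s_ea[mask])                                   # selects only the first row
print(mask.to_numpy(dtype=bool, na_value=False))    # [ True False False]
# mask.to_numpy(dtype=bool) without na_value raises a ValueError
```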

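Second, the pd.NA semantics themselves, again shown with the current opt-in dtypes:

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype="Int64")

# The missing scalar you get back is pd.NA rather than np.nan:
print(s[1])            # <NA>
print(pd.isna(s[1]))   # True -- pd.isna handles both sentinels

# Comparisons propagate NA instead of returning False:
print(s == 1)          # [True, <NA>, False], dtype: boolean

# Kleene logic: the result is only NA when it cannot be decided:
print(pd.NA | True)    # True
print(pd.NA & False)   # False
print(pd.NA & True)    # <NA>

# bool(pd.NA) currently raises TypeError, which is what code like
# `if value == expected:` trips over once the comparison returns NA.
```
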
Other things?

jbrockmendel (Member, Author) commented

> How would that avoid mixed-propagation behaviour? Do you mean in the case that we would only add integer/boolean dtypes with NA, and not yet for other types? (in which case using NaN instead of NA indeed keeps things more consistent)
> (in my mind if we start adding an option to enable NA-nullable dtypes, it should be available for all dtypes, in which case you wouldn't have mixed propagation)

I see three main paths:

  1. Swap over to NA-nullable everywhere
  2. Swap int/bool over to nan-nullable
  3. status quo: don't do any of this

I've landed on 2) as my preferred option because:

a) it is a much smaller user-facing change than 1
b) it gets the large majority of the actual benefits:
  i) the long-requested support for a missing-value sentinel in int/bool dtypes
  ii) it makes it easier to then deprecate all the places where we do silent casting (setitem cases not covered by PDEP6, #53802, #53868, #53910), with accompanying code simplifications (see the sketch below)
c) while I hope the Ice Cream Agreement lays to rest the nan-vs-NA topic, I suspect that a cadre of pro-distinguish enthusiasts will not let it go. That seems likely to bog down option 1, while option 2 could move forward.
d) two deprecation cycles give users a chance to weigh in on more specific issues.

(mostly c and b)
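
To make b.ii concrete, a sketch of the kind of silent casting that becomes deprecatable once the default integer dtype can hold a missing value (behaviour shown against pandas 2.x with the existing opt-in Int64; whether the first assignment warns depends on the exact version):

```python
import numpy as np
import pandas as pd

# Setting a missing value into a numpy int64 column changes its dtype:
s_np = pd.Series([1, 2, 3], dtype="int64")
s_np.iloc[0] = np.nan
print(s_np.dtype)    # float64 -- the column was upcast to fit the NaN

# With a nullable integer column there is nothing to upcast to:
s_ea = pd.Series([1, 2, 3], dtype="Int64")
s_ea.iloc[0] = pd.NA
print(s_ea.dtype)    # Int64 -- dtype preserved, the value is <NA>
```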
