ENH: New Name for "numpy_nullable" dtype_backend #59032

Open
1 of 3 tasks
WillAyd opened this issue Jun 17, 2024 · 17 comments

@WillAyd
Member

WillAyd commented Jun 17, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Many I/O methods today accept a "numpy_nullable" argument for the dtype_backend= parameter. While historically our extension arrays exclusively used NumPy, this is no longer true with the string dtype so the name "numpy_nullable" is a misnomer.
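For context, a minimal sketch of what the current option does, using `read_csv` on an in-memory CSV. The sample data is made up for illustration; only the integer column is asserted on here, since the resulting string dtype can differ between pandas versions.

```python
import io

import pandas as pd

# Made-up sample data: column "a" has a missing value in the second row.
csv_text = "a,b\n1,x\n,y\n"

# Default behavior: the missing value forces the integer column to float64/NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# With dtype_backend="numpy_nullable": a nullable extension dtype with pd.NA.
nullable_df = pd.read_csv(io.StringIO(csv_text), dtype_backend="numpy_nullable")

print(default_df.dtypes)
print(nullable_df.dtypes)
```

The second frame keeps `a` as a nullable integer column rather than upcasting it to float.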

Feature Description

To make for a less confusing API, I would suggest adding "pandas_nullable" or maybe even just "pandas" as an argument. This can have the exact same behavior as "numpy_nullable" today but abstracts and corrects the semantics. "numpy_nullable" can be slowly deprecated over time
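One way the alias and gradual deprecation could look is sketched below. This is only an illustration: the helper `resolve_dtype_backend` is hypothetical and does not exist in pandas, `"pandas_nullable"` is just one of the names under discussion, and the choice of `FutureWarning` is an assumption.

```python
import warnings

# Hypothetical mapping: old spelling -> proposed new spelling.
_DEPRECATED_ALIASES = {"numpy_nullable": "pandas_nullable"}
# Hypothetical set of accepted values after the rename.
_VALID_BACKENDS = {"pandas_nullable", "pyarrow"}


def resolve_dtype_backend(name: str) -> str:
    """Map a user-supplied dtype_backend value to its canonical name."""
    if name in _DEPRECATED_ALIASES:
        new_name = _DEPRECATED_ALIASES[name]
        warnings.warn(
            f"dtype_backend={name!r} is deprecated, use {new_name!r} instead",
            FutureWarning,
            stacklevel=2,
        )
        return new_name
    if name not in _VALID_BACKENDS:
        raise ValueError(f"invalid dtype_backend: {name!r}")
    return name


# The old name keeps working, but emits a warning and resolves to the new one.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    resolved = resolve_dtype_backend("numpy_nullable")

print(resolved, len(caught))
```

Both spellings would behave identically during the deprecation period; only the warning differs.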

Alternative Solutions

n/a

Additional Context

dtype_backend="pandas" would also make for a smoother transition into the logical type system proposed as part of PDEP-13 #58455

...but even if that PDEP is not accepted, I still see value in changing the value "numpy_nullable" to something else

@WillAyd added the Enhancement and Needs Triage labels and removed the Needs Triage label on Jun 17, 2024
@WillAyd
Member Author

WillAyd commented Jun 17, 2024

@jorisvandenbossche maybe a good follow up to the discussion we had as part of PDEP-14

@WillAyd
Member Author

WillAyd commented Jul 11, 2024

@pandas-dev/pandas-core this wasn't major enough to include as part of PDEP-14, but I think it is a logical follow-up to clean up semantics. Curious what others may think

@Dr-Irv
Contributor

Dr-Irv commented Jul 11, 2024

I think it should be pandas_nullable. Keeps options open with respect to the whole pd.NA/np.nan discussion

@chaarvii
Contributor

Hey! I’d like to work on this

@chaarvii
Contributor

Take

@WillAyd
Member Author

WillAyd commented Aug 1, 2024

Any other team feedback on this? I think it would be good to use the new name starting with 3.0

@simonjayhawkins
Member

We have pandas.api.types.pandas_dtype where we "Convert input into a pandas only dtype object ..." and this returns np.dtype or a pandas dtype.

Given that the term “pandas dtype” already has a precedent, using dtype_backend="pandas" would indeed align well with existing conventions. It provides clarity and maintains consistency.
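The precedent mentioned above can be seen with a short sketch: `pandas_dtype` returns either a NumPy dtype or a pandas extension dtype depending on the input. The three input strings are chosen purely for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import pandas_dtype

# A plain NumPy dtype string comes back as np.dtype.
numpy_result = pandas_dtype("int64")

# The capitalized spelling resolves to the nullable (masked) extension dtype.
nullable_result = pandas_dtype("Int64")

# Some non-nullable defaults are pandas extension dtypes as well.
categorical_result = pandas_dtype("category")

for result in (numpy_result, nullable_result, categorical_result):
    print(type(result).__name__, repr(result))
```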

@WillAyd
Member Author

WillAyd commented Aug 1, 2024

I also have a slight preference for pandas because it is shorter, and I don't see us ever introducing a non-nullable type system, so "_nullable" is superfluous

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 1, 2024

On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.

So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..
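The overlap described here can be checked with a small sketch. It only inspects class relationships, so it should not need pyarrow installed (an assumption: `pd.ArrowDtype` is importable without pyarrow, even though instantiating it requires the library).

```python
import pandas as pd
from pandas.api.extensions import ExtensionDtype

# Each of these is a pandas ExtensionDtype subclass, regardless of "backend".
checks = {
    "Int64Dtype (nullable/masked)": issubclass(pd.Int64Dtype, ExtensionDtype),
    "ArrowDtype (pyarrow backend)": issubclass(pd.ArrowDtype, ExtensionDtype),
    "CategoricalDtype (default, non-nullable)": issubclass(
        pd.CategoricalDtype, ExtensionDtype
    ),
    "DatetimeTZDtype (default, non-nullable)": issubclass(
        pd.DatetimeTZDtype, ExtensionDtype
    ),
}

for name, is_extension in checks.items():
    print(f"{name}: {is_extension}")
```

All four report as pandas extension dtypes, which is why "pandas" on its own does not uniquely identify the masked nullable types.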

@jbrockmendel
Member

masked

@WillAyd
Member Author

WillAyd commented Aug 1, 2024

> On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.

That's true as a matter of implementation, but I don't think end users are going to know that

@Dr-Irv
Contributor

Dr-Irv commented Aug 1, 2024

> So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..

I did suggest pandas_nullable above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use pandas_nullable, it's clear that we are storing a pandas rep of missing values in the backend. I'm concerned that just using pandas could prevent some other usage that we don't see now, but want to introduce in the future.

@WillAyd
Member Author

WillAyd commented Aug 1, 2024

That's a fair point, though I'm not sure that adding _nullable prevents that. I think that would only prevent an issue if we decided to offer non-nullable types

@Dr-Irv
Contributor

Dr-Irv commented Aug 1, 2024

> That's a fair point, though I'm not sure that adding _nullable prevents that. I think that would only prevent an issue if we decided to offer non-nullable types

Or offer something else that we can't foresee today

@simonjayhawkins
Member

> On the other hand, when specifying dtype_backend="pyarrow", you also get back a "pandas dtype" in that sense (i.e. a pandas ExtensionDtype subclass). And at the same time, some of the non-nullable default dtypes we have are also pandas dtypes.
>
> So I don't think dtype_backend="pandas" is an ideal naming, but I also don't have any better suggestion ..

PyArrow types indeed are pandas extension types, enhancing the functionality of the base PyArrow library to suit our use case of backing DataFrames or Series.

We don't always rigidly adhere to the behavior of NumPy arrays for a Series with a NumPy dtype. We allow expansion, upcasting, and other conversions that may diverge from NumPy behavior, even though we return a NumPy type as the dtype.

But I see no problems when we use the terms "pyarrow" or "numpy" when we talk about the backend. So it would seem reasonable to me to use the term "pandas" to describe the pandas nullable extension types.
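As a small illustration of the upcasting point above, a sketch comparing how a NumPy-backed integer Series and a nullable one react when a missing value is introduced by `reindex`. The data is made up.

```python
import pandas as pd

# NumPy-backed int64: introducing a missing row upcasts the Series to float64.
numpy_backed = pd.Series([1, 2], dtype="int64")
widened = numpy_backed.reindex([0, 1, 2])

# Nullable Int64: the dtype is preserved and the gap is filled with pd.NA.
nullable = pd.Series([1, 2], dtype="Int64")
masked = nullable.reindex([0, 1, 2])

print(widened.dtype, masked.dtype)
```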

> I did suggest pandas_nullable above. I think I may have been the one to introduce the word "nullable" into our lexicon. So if we use pandas_nullable, it's clear that we are storing a pandas rep of missing values in the backend. I'm concerned that just using pandas could prevent some other usage that we don't see now, but want to introduce in the future.

The dtype_backend argument is forward-thinking, enabling early adoption of experimental data types that aren't currently the default.

Presently, the available options for dtype_backend in I/O methods and .convert_dtypes are limited to 'numpy_nullable' and 'pyarrow'.

If we aim to allow users to continue using legacy types even when nullable types become the default, introducing an additional argument makes sense.

Considering package names, options like pyarrow, pandas, and numpy would be meaningful, clear, concise, and consistent choices?
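For reference, a minimal sketch of the `.convert_dtypes` usage mentioned above with the current option name. The input values are made up; `"pyarrow"` is omitted so the sketch does not depend on pyarrow being installed.

```python
import pandas as pd

# Whole-number floats with a missing value, stored as float64/NaN.
raw = pd.Series([1.0, 2.0, None])

# Opt in to the nullable extension types using today's spelling.
converted = raw.convert_dtypes(dtype_backend="numpy_nullable")

print(raw.dtype, "->", converted.dtype)
```

Under the naming discussed in this thread, only the string passed to `dtype_backend` would change; the resulting dtype would stay the same.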

@WillAyd
Member Author

WillAyd commented Aug 3, 2024

I'm on board with what @simonjayhawkins is suggesting - pyarrow, pandas, and numpy as arguments reflect the core of the type system evolution, even if they may not be 100% technically accurate

@WillAyd
Member Author

WillAyd commented Aug 3, 2024

If we do decide on those terms, I also wonder if we should change the default value of None to "numpy"
