
API: Standardize ExtensionDtype subtype name #22224

Open
TomAugspurger opened this issue Aug 6, 2018 · 15 comments
Labels
API - Consistency Internal Consistency of API/Behavior API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@TomAugspurger
Contributor

We have a few extension arrays that build off one or more ndarrays. You'll often want to get the underlying NumPy dtype for that array.

  • IntegerArray uses .type
  • IntervalArray uses .subtype

IntervalArray can't use .type, since that has to be Interval. So IntegerArray will need to alias subtype to type.

Are we happy with the name subtype?
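For concreteness, the split between the two names looks like this (a sketch against a recent pandas release; exact attribute availability has varied across versions):

```python
import pandas as pd

# IntervalDtype: `.type` is reserved for the scalar class (Interval),
# so the underlying NumPy dtype of the bounds lives on `.subtype`.
dtype = pd.IntervalDtype("int64")
print(dtype.type)     # <class 'pandas._libs.interval.Interval'>
print(dtype.subtype)  # int64
```
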

@TomAugspurger TomAugspurger added API Design ExtensionArray Extending pandas with custom dtypes or arrays. labels Aug 6, 2018
@jorisvandenbossche
Member

+1 for subtype

@jorisvandenbossche
Member

IntegerArray actually has numpy_dtype
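That is, the nullable integer dtype exposes the NumPy dtype it is backed by under a third name (sketch, assuming a recent pandas):

```python
import pandas as pd

# Int64Dtype: the backing NumPy dtype is exposed as `numpy_dtype`,
# not `subtype`.
int_dtype = pd.Int64Dtype()
print(int_dtype.numpy_dtype)  # int64
```
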

@jorisvandenbossche jorisvandenbossche added this to the 0.24.0 milestone Aug 22, 2018
@TomAugspurger TomAugspurger added the Blocker Blocking issue or pull request for an upcoming release label Oct 8, 2018
@TomAugspurger
Contributor Author

We also set subdtype on the base PandasExtensionDtype, but don't use it anywhere. That seems to be mirroring https://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.subdtype.html, which I don't really understand.

We should first agree on what exactly this thing (call it subtype for now) is supposed to mean.

  • IntervalDtype: the dtype used for the left and right arrays
  • SparseDtype: the dtype of the non-fill-value values
  • IntegerDtype: the dtype that would be used, if only NumPy supported nullable integer arrays
  • PeriodDtype: could be int for the ordinals
  • DatetimeTZDtype: could be datetime64[ns]

So something like "the type this would correspond to in NumPy, if NumPy supported this directly"?

Given all that, I think I like numpy_type as the best name.

@jorisvandenbossche
Member

So something like "the type this would correspond to in NumPy, if NumPy supported this directly"?

But this is not what you listed for all dtypes, so maybe we should not have a single name. E.g. for intervals, even if NumPy supported an interval dtype, it would not be the same dtype as that of the interval bounds (the same holds for period).
And for categorical, you have the dtype of the codes, and the dtype of the categories.

So maybe we shouldn't standardize too much. numpy_dtype could also be kept for #22791

@TomAugspurger
Contributor Author

TomAugspurger commented Oct 8, 2018 via email

@jreback jreback modified the milestones: 0.24.0, 0.25.0 Jan 4, 2019
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.25.0, 0.24.0 Jan 4, 2019
@jorisvandenbossche
Member

Let's do this for 0.24.0? Otherwise we need to start deprecating things if we still want to change it after the initial release, which is a bit stupid for this.

I don't think it is much work, it is mainly deciding what we want to do

@TomAugspurger
Contributor Author

TomAugspurger commented Jan 4, 2019

Yeah, not much work, aside from the hard problem of "what do we want". I haven't really come to a decision on what's best here.

What we can relatively easily do for 0.24 is decide if anything should be assigned a common name

  1. IntervalDtype.subtype
  2. SparseDtype.subtype
  3. Int*Dtype.numpy_dtype
  4. CategoricalDtype.codes.dtype or CategoricalDtype.categories.dtype

I'm really not sure here... Maybe they're all OK as is.
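The four spellings in the list above can be seen side by side (illustrative sketch against a recent pandas; note that CategoricalDtype has no single attribute for this, you go through the array's codes/categories):

```python
import pandas as pd
import numpy as np

# 1. IntervalDtype: .subtype
assert pd.IntervalDtype("float64").subtype == np.dtype("float64")

# 2. SparseDtype: .subtype (dtype of the non-fill-value values)
assert pd.SparseDtype(np.float64).subtype == np.dtype("float64")

# 3. Int64Dtype: .numpy_dtype
assert pd.Int64Dtype().numpy_dtype == np.dtype("int64")

# 4. Categorical: two candidate dtypes, codes vs. categories
cat = pd.Categorical(["a", "b", "a"])
print(cat.codes.dtype)       # int8
print(cat.categories.dtype)  # object
```
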

@jorisvandenbossche
Member

In the meantime, the PandasDtype (for PandasArray holding numpy array) also has a numpy_dtype attribute.

IntervalDtype has a subtype (but also subdtype, which is defined in PandasExtensionDtype; not sure what that is for. For numpy compat?).

I'm really not sure here... Maybe they're all OK as is.

I think the main question might be what we want to do with #22791. Because a name there might clash with some of the names used here.

@jreback
Contributor

jreback commented Jan 4, 2019

yeah I ran into this when fixing up to use pandas_dtype. We also have a .base attribute that needs some rationalization.

We ought to make these NotImplemented on the base dtype to catch errors.

I think subtype is ok where it is now for Sparse, maybe add to Categorical in the same capacity?
numpy_dtype is also pretty clear.

I would allow having one of [subtype|numpy_dtype] for dtypes.

@jorisvandenbossche
Member

@TomAugspurger about your list above: I think SparseDtype actually uses subtype and not numpy_dtype?

@TomAugspurger
Contributor Author

Yes, sorry. Updated.

@TomAugspurger
Contributor Author

Does anyone have a concrete proposal for what to update, if anything? I've struggled to form an opinion here.

@TomAugspurger TomAugspurger modified the milestones: 0.24.0, 0.24.1 Jan 20, 2019
@TomAugspurger TomAugspurger modified the milestones: 0.24.1, 0.24.2 Jan 30, 2019
@TomAugspurger
Contributor Author

Pushing to 0.24.2.

@jreback jreback modified the milestones: 0.24.2, 0.25.0 Mar 3, 2019
@TomAugspurger TomAugspurger removed Blocker Blocking issue or pull request for an upcoming release labels Jun 20, 2019
@jreback jreback modified the milestones: 0.25.0, 1.0 Jun 28, 2019
@TomAugspurger TomAugspurger modified the milestones: 1.0, Contributions Welcome Dec 30, 2019
@jbrockmendel jbrockmendel added the API - Consistency Internal Consistency of API/Behavior label Sep 21, 2020
@JBGreisman
Contributor

JBGreisman commented Jun 9, 2021

To follow up on this old thread, is there a currently supported way to get numpy-compatible dtype inference for an ExtensionDtype?

It seems that one should be able to define a custom ExtensionDtype._get_common_dtype() method. This ends up working in some cases where a DataFrame is populated with a non-unique set of custom ExtensionDtypes. Unfortunately, if one has a DataFrame with homogeneous ExtensionDtypes these lines get hit:

# workaround for find_common_type([np.dtype('datetime64[ns]')] * 2)
# => object
if all(is_dtype_equal(first, t) for t in types[1:]):
    return first

This in turn causes the BlockManager to fall back to a numpy object dtype:

# TODO: https://github.com/pandas-dev/pandas/issues/22791
# Give EAs some input on what happens here. Sparse needs this.
if isinstance(dtype, SparseDtype):
    dtype = dtype.subtype
elif isinstance(dtype, ExtensionDtype):
    dtype = np.dtype("object")
elif is_dtype_equal(dtype, str):
    dtype = np.dtype("object")

Are there any thoughts on modifying that isinstance(dtype, ExtensionDtype) clause to do something like the following, so that an ExtensionDtype can avoid falling back to an object dtype in numpy?

elif isinstance(dtype, ExtensionDtype):
    if hasattr(dtype, "numpy_dtype"):
        dtype = dtype.numpy_dtype
    elif hasattr(dtype, "subtype"):
        dtype = dtype.subtype
    else:
        dtype = np.dtype("object")

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@jbrockmendel
Member

The issue here is standardizing idiosyncratic naming across different internal dtypes, not defining an API we expect 3rd parties to implement? If so, I'm really not bothered by the existing names not matching. They mean different things for different dtypes right?

The closest thing I have to an opinion is maybe they should be privatized?
