Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF/EA-API: EA constructor without dtype specified #56430

Open
jbrockmendel opened this issue Dec 10, 2023 · 4 comments
Open

REF/EA-API: EA constructor without dtype specified #56430

jbrockmendel opened this issue Dec 10, 2023 · 4 comments
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Dec 10, 2023

TLDR: we should make dtype required in EA._from_sequence and implement a new EA constructor for flavor-preserving inference.

ATM dtype is not required in EA._from_sequence. The behavior-- and more importantly the usage-- when it is not specified is not standardized. In many cases it does some kind of inference, but how much inference varies.

Most of the places where we don't pass a dtype are aimed at some type of dtype-flavor-retention. e.g. we did some type of operation starting with a pyarrow/masked/sparse dtype and we want the result.dtype to still be pyarrow/masked/sparse, but not necessarily the same exact dtype. The main examples that come to mind are maybe_cast_pointwise_result, MaskedArray._maybe_mask_result.

The main other place where we call _from_sequence without a dtype is pd.array. With a little bit of effort I'm pretty sure we can start passing dtypes there.

cc @jorisvandenbossche

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 10, 2023
@jorisvandenbossche
Copy link
Member

In my mind, _from_sequence already is the "constructor for flavor-preserving inference".

I understand there are multiple use cases, but that can be served by a single method depending on whether a dtype is passed or not? That feels quite clear to me: when a dtype is passed, this is honored, and otherwise the dtype is inferred from the data (with the constraint of that it has to be a dtype supported by the calling class).

The main examples that come to mind are maybe_cast_pointwise_result, MaskedArray._maybe_mask_result.

In MaskedArray._maybe_mask_result, we actually don't use _from_sequence, but the main Array class constructors (but also without specifying a dtype)

@jorisvandenbossche jorisvandenbossche added API Design ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 11, 2023
@jbrockmendel
Copy link
Member Author

In MaskedArray._maybe_mask_result, we actually don't use _from_sequence, but the main Array class constructors (but also without specifying a dtype)

Correct. My point is that MaskedArray subclasses use a different pattern to achieve the same result. The datetimelike EAs have their own special-casing. If it is feasible (which im not ready to claim), then it would be preferable to have a single shared pattern for these.

I understand there are multiple use cases, but that can be served by a single method depending on whether a dtype is passed or not?

Certainly possible. On the margin I'd prefer the cases where we intentionally want dtype inference to be more explicit. I'm spending some time this week tracking down just where those cases are.

@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Dec 15, 2023
@jbrockmendel
Copy link
Member Author

jbrockmendel commented Dec 17, 2023

I've spent some time tracking down the places where we don't pass a dtype to from_sequence:

  • maybe_cast_pointwise_result; specifically aimed at dtype flavor-retention
  • lib.maybe_convert_objects
  • via DatetimeIndex/TimedeltaIndex/PeriodIndex via helpers _from_sequence_not_strict/ibid/period_array
  • in pd.array (removed by REF: use maybe_convert_objects in pd.array #56484)
  • ArrowExtensionArray._groupby_op (PR making this explicit coming soon)
  • in EA._quantile (AFAICT this is just for Sparse; MaskedArray uses _maybe_mask_result for partial inference)
  • DatetimeLikeArrayMixin._validate_listlike
    • Recall this is used for casting other in e.g.
      • self == other
      • self[seq] == other
      • dti.get_indexer(other)
    • This can almost be replaced by maybe_convert_objects, but that fails on cases where other is 2D (need maybe_convert_objects to support 2D, which may be worth pursuing)
  • DatetimeLikeArrayMixin.isin
    • I think this is just needed for string cases which are deprecated, so this should be removable in 3.0.
  • DatetimeArray._add_offset (PR making this explicit coming soon [i think this hides a coincidental bug])
  • BaseGrouper.agg_series (REF: avoid special-casing in agg_series #56540)
  • A couple of fastparquet tests fails if we assert dtype is not None; looks like both in DatetimeArray._from_sequence.

Also tracking down the various patterns we use for flavor-preserving-partial-inference:

  • MaskedArray._maybe_mask_result (has mask keyword, would complicate sharing)
  • NumpyExtensionArray._wrap_ndarray_result (can return TimedeltaArray)
  • NumpyExtensionArray._from_backing_data

Other places where we have special-casing for Masked/Arrow dtypes related to flavor-retention:

  • GroupBy.count has special-casing for Masked and Arrow dtypes, could also handle Sparse if we felt like it (Sparse actually fails entirely for unrelated reasons)
  • Groupby.quantile has it for Masked, not Arrow

I expect there are more that I have missed, will update here as I find them.

@jbrockmendel
Copy link
Member Author

In my mind, _from_sequence already is the "constructor for flavor-preserving inference".

Re-reading, I think I missed an important point: a big part of the relevant use case is having a BooleanArray method that returns a FloatingArray/IntegerArray etc. (this example could also be addressed by condensing these classes down to just MaskedArray). xref #58258

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

2 participants