Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reductions for ExtensionArray #22346

Closed
TomAugspurger opened this issue Aug 14, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@TomAugspurger
Copy link
Contributor

commented Aug 14, 2018

Creeping up to this in #22345

A few questions

  1. What should this look like for EA authors? What helpers can / should we provide?
  2. How does this affect users? Specifically, assuming IntegerArray implements reductions, what's the dtype here?
In [11]: df = pd.DataFrame({"A": ['a', 'b', 'a'], "B": pd.core.arrays.IntegerArray([1, 2, 3])})

In [12]: df.groupby("A").B.min().dtype

Is it int64, or should we preserve Int64?

We can also discuss transforms ( cumulative operations like cumsum, maybe things like .shift).

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2018

To kick things off, one option is to make EA authors implement a _reduce like we do internally. That would require them adding things like .min, .max, .etc directly to their EA. This seems reasonable, but may be a bit overly restrictive.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 14, 2018

(We also have #22224)

So basically, for Series, the reduction operations call the following function (where op is eg nanops.nansum in case of Series.sum()):

pandas/pandas/core/series.py

Lines 3244 to 3266 in cf70d11

def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None,
filter_type=None, **kwds):
"""
perform a reduction operation
if we have an ndarray as a value, then simply perform the operation,
otherwise delegate to the object
"""
delegate = self._values
if isinstance(delegate, np.ndarray):
# Validate that 'axis' is consistent with Series's single axis.
if axis is not None:
self._get_axis_number(axis)
if numeric_only:
raise NotImplementedError('Series.{0} does not implement '
'numeric_only.'.format(name))
with np.errstate(all='ignore'):
return op(delegate, skipna=skipna, **kwds)
return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
numeric_only=numeric_only,
filter_type=filter_type, **kwds)

So with this existing infrastructure, it is indeed an option to let EA's implement _reduce.

Internally, such a _reduce basically always simply checks if the operation is defined as a method and calls that, eg for Categorical:

def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None,
filter_type=None, **kwds):
""" perform the reduction type operation """
func = getattr(self, name, None)
if func is None:
msg = 'Categorical cannot perform the operation {op}'
raise TypeError(msg.format(op=name))
return func(numeric_only=numeric_only, **kwds)

That would require them adding things like .min, .max, .etc directly to their EA. This seems reasonable, but may be a bit overly restrictive.

@TomAugspurger What do you mean exactly with restrictive? (note that using _reduce actually does not require them to implement all methods directly, since they can also simply return directly the result from _reduce) Is it int64, or should we preserve Int64?

We could also go the current numpy way of checking if the object has a similarly named method (so which is basically what the internal _reduce implementations do, but we could simply do that as the EA protocol instead of going through _reduce)?

What should this look like for EA authors? What helpers can / should we provide?

I am not sure that we can provide helpers? I think the actual implementation (or which of the reductions work) will be rather EA dependent.

How does this affect users? Specifically, assuming IntegerArray implements reductions, what's the dtype here? ...

I think, if the IntegerArray implements the asked reduction, is should definitely be Int64 and not int64 (so preserving the dtype). Is there a reason not to do that?
(of course, depending on the reduction, the dtype might change, eg for mean -> float, but that is up to the EA implementation?)

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

commented Aug 14, 2018

I think, if the IntegerArray implements the asked reduction, is should definitely be Int64 and not int64 (so preserving the dtype). Is there a reason not to do that?

Ah, in case of groupby, you need to gather together the scalar results, so in that case it is indeed not that straightforward, as you then need to know the desired dtype .. (which might depend on the actual reduction type)

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Aug 14, 2018

What do you mean exactly with restrictive? (note that using _reduce actually does not require them to implement all methods directly, since they can also simply return directly the result from _reduce)

Ah, by restrictive I meant it requires them to implement .min, etc. directly on their EA. But as you explain, they can do whatever they want in _reduce, so ignore that comment.

I am not sure that we can provide helpers?

I vaguely had in mind something like Categorical._reduce, so that EAs mirroring NumPy and defining reductions would work "for free". But that's maybe not a good idea.

you need to gather together the scalar results

Right. I think for most reductions, the result will be an instance of ExtensionDtype.type (reductions are an algebra). But is that true for all of them? If so, then we should be fine with saying that reductions preserve dtype in groupby. We should also look at window...

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

jreback added a commit to jreback/pandas that referenced this issue Sep 23, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 5, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 6, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2018

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.