Reductions for ExtensionArray #22346

TomAugspurger · 2018-08-14T16:46:27Z

Creeping up to this in #22345

A few questions

What should this look like for EA authors? What helpers can / should we provide?
How does this affect users? Specifically, assuming IntegerArray implements reductions, what's the dtype here?

In [11]: df = pd.DataFrame({"A": ['a', 'b', 'a'], "B": pd.core.arrays.IntegerArray([1, 2, 3])})

In [12]: df.groupby("A").B.min().dtype

Is it int64, or should we preserve Int64?

We can also discuss transforms ( cumulative operations like cumsum, maybe things like .shift).

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2018-08-14T16:47:50Z

To kick things off, one option is to make EA authors implement a _reduce like we do internally. That would require them adding things like .min, .max, .etc directly to their EA. This seems reasonable, but may be a bit overly restrictive.

jorisvandenbossche · 2018-08-14T19:27:08Z

(We also have #22224)

So basically, for Series, the reduction operations call the following function (where op is eg nanops.nansum in case of Series.sum()):

pandas/pandas/core/series.py

Lines 3244 to 3266 in cf70d11

    
               def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None, 
        
                           filter_type=None, **kwds): 
        
                   """ 
        
                   perform a reduction operation 
        
                   if we have an ndarray as a value, then simply perform the operation, 
        
                   otherwise delegate to the object 
        
                   """ 
        
                   delegate = self._values 
        
                   if isinstance(delegate, np.ndarray): 
        
                       # Validate that 'axis' is consistent with Series's single axis. 
        
                       if axis is not None: 
        
                           self._get_axis_number(axis) 
        
                       if numeric_only: 
        
                           raise NotImplementedError('Series.{0} does not implement ' 
        
                                                     'numeric_only.'.format(name)) 
        
                       with np.errstate(all='ignore'): 
        
                           return op(delegate, skipna=skipna, **kwds) 
        
                   return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna, 
        
                                           numeric_only=numeric_only, 
        
                                           filter_type=filter_type, **kwds)

So with this existing infrastructure, it is indeed an option to let EA's implement _reduce.

Internally, such a _reduce basically always simply checks if the operation is defined as a method and calls that, eg for Categorical:

pandas/pandas/core/arrays/categorical.py

Lines 2077 to 2084 in cf70d11

    
           def _reduce(self, op, name, axis=0, skipna=True, numeric_only=None, 
        
                       filter_type=None, **kwds): 
        
               """ perform the reduction type operation """ 
        
               func = getattr(self, name, None) 
        
               if func is None: 
        
                   msg = 'Categorical cannot perform the operation {op}' 
        
                   raise TypeError(msg.format(op=name)) 
        
               return func(numeric_only=numeric_only, **kwds)

That would require them adding things like .min, .max, .etc directly to their EA. This seems reasonable, but may be a bit overly restrictive.

@TomAugspurger What do you mean exactly with restrictive? (note that using _reduce actually does not require them to implement all methods directly, since they can also simply return directly the result from _reduce) Is it int64, or should we preserve Int64?

We could also go the current numpy way of checking if the object has a similarly named method (so which is basically what the internal _reduce implementations do, but we could simply do that as the EA protocol instead of going through _reduce)?

What should this look like for EA authors? What helpers can / should we provide?

I am not sure that we can provide helpers? I think the actual implementation (or which of the reductions work) will be rather EA dependent.

How does this affect users? Specifically, assuming IntegerArray implements reductions, what's the dtype here? ...

I think, if the IntegerArray implements the asked reduction, is should definitely be Int64 and not int64 (so preserving the dtype). Is there a reason not to do that?
(of course, depending on the reduction, the dtype might change, eg for mean -> float, but that is up to the EA implementation?)

jorisvandenbossche · 2018-08-14T19:28:07Z

I think, if the IntegerArray implements the asked reduction, is should definitely be Int64 and not int64 (so preserving the dtype). Is there a reason not to do that?

Ah, in case of groupby, you need to gather together the scalar results, so in that case it is indeed not that straightforward, as you then need to know the desired dtype .. (which might depend on the actual reduction type)

TomAugspurger · 2018-08-14T19:36:58Z

What do you mean exactly with restrictive? (note that using _reduce actually does not require them to implement all methods directly, since they can also simply return directly the result from _reduce)

Ah, by restrictive I meant it requires them to implement .min, etc. directly on their EA. But as you explain, they can do whatever they want in _reduce, so ignore that comment.

I am not sure that we can provide helpers?

I vaguely had in mind something like Categorical._reduce, so that EAs mirroring NumPy and defining reductions would work "for free". But that's maybe not a good idea.

you need to gather together the scalar results

Right. I think for most reductions, the result will be an instance of ExtensionDtype.type (reductions are an algebra). But is that true for all of them? If so, then we should be fine with saying that reductions preserve dtype in groupby. We should also look at window...

closes pandas-dev#21789 closes pandas-dev#22346

TomAugspurger added Groupby API Design Numeric Operations Arithmetic, Comparison, and Logical operations labels Aug 14, 2018

TomAugspurger added this to the 0.24.0 milestone Aug 14, 2018

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

ENH: add groupby & reduce support to EA

5c55011

closes pandas-dev#21789 closes pandas-dev#22346

jreback mentioned this issue Sep 19, 2018

ENH: add groupby & reduce support to EA #22762

Merged

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

ENH: add groupby & reduce support to EA

067335e

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Sep 19, 2018

ENH: add groupby & reduce support to EA

27ef5a7

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Sep 23, 2018

ENH: add groupby & reduce support to EA

2ce2fed

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 1, 2018

ENH: add groupby & reduce support to EA

00ad792

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 3, 2018

ENH: add groupby & reduce support to EA

130a7db

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 5, 2018

ENH: add groupby & reduce support to EA

7cf0725

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 6, 2018

ENH: add groupby & reduce support to EA

aed316f

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2018

ENH: add groupby & reduce support to EA

3d7e093

closes pandas-dev#21789 closes pandas-dev#22346

jreback added a commit to jreback/pandas that referenced this issue Oct 11, 2018

ENH: add groupby & reduce support to EA

90fb20e

closes pandas-dev#21789 closes pandas-dev#22346

jorisvandenbossche closed this as completed in #22762 Oct 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reductions for ExtensionArray #22346

Reductions for ExtensionArray #22346

TomAugspurger commented Aug 14, 2018

TomAugspurger commented Aug 14, 2018

jorisvandenbossche commented Aug 14, 2018

jorisvandenbossche commented Aug 14, 2018

TomAugspurger commented Aug 14, 2018

Reductions for ExtensionArray #22346

Reductions for ExtensionArray #22346

Comments

TomAugspurger commented Aug 14, 2018

TomAugspurger commented Aug 14, 2018

jorisvandenbossche commented Aug 14, 2018

jorisvandenbossche commented Aug 14, 2018

TomAugspurger commented Aug 14, 2018