DOC: Dict of Dicts for renaming Groupby Aggregations #9052

Closed
TomAugspurger opened this Issue Dec 10, 2014 · 12 comments

Comments

Projects
None yet
6 participants
Contributor

TomAugspurger commented Dec 10, 2014

I didn't realize this was possible, and didn't see it in the docs.

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'a', 'b'], 'C': [3, 4, 5]})
df.groupby('B').agg({'A': {'mean1': 'mean', 'med1': 'median'}, 'C': {'mean2': 'mean', 'med2': 'median'}})

TomAugspurger added this to the 0.16.0 milestone Dec 10, 2014

Contributor

jreback commented Dec 10, 2014

xref is #8593 (which would replace / enhance this)

Thanks for the tip. Didn't realize this was possible either, this will save me from building my multicolumns "by hand".

@jreback are you planning any API change for 0.16.0 on this? #8593 does not seem to interfere with this behaviour, but maybe a deeper change is planned?

I'd rather not rely on this if it's not tested atm. Or would you accept a test for this?

Contributor

jreback commented Dec 11, 2014

@Gimli510 this IS implemented. Its basically the same as the following (except the name determination is slightly different).

In [5]: df.groupby('B').agg({'A': ['mean','median'], 'C': ['mean','median']})
Out[5]: 
     A           C       
  mean median mean median
B                        
a  1.5    1.5  3.5    3.5
b  3.0    3.0  5.0    5.0

I haven't carefully looked thru, but I suspect their is at least 1 tests. Though would for sure accept a PR which makes these tests more prominent (e.g. test_agg_api or something).

pd.Summary will enhance this API, the existing will remain.

@jreback jreback modified the milestone: 0.16.0, Next Major Release Mar 6, 2015

jreback referenced this issue May 11, 2015

Open

API: specification of functions in .agg #8593

0 of 3 tasks complete
Contributor

jreback commented Nov 12, 2015

from mailing list

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})

In [3]: 

In [3]: grouped = df.groupby(['A', 'B'])

In [4]: grouped[['D','C']].agg({'r':np.sum, 'r2':np.mean})
Out[4]: 
                    r        r2
A   B                          
bar one   D -0.460078 -0.460078
          C  0.798220  0.798220
    three D  1.599986  1.599986
          C -0.554798 -0.554798
    two   D  0.124900  0.124900
          C  0.084758  0.084758
foo one   D -0.466082 -0.233041
          C -0.585512 -0.292756
    three D -0.184726 -0.184726
          C  0.130756  0.130756
    two   D -1.985586 -0.992793
          C  1.275138  0.637569

In [5]: grouped[['D','C']].agg({'r': { 'C' : np.sum }, 'r2' : { 'D' : np.mean }})
Out[5]: 
                    r        r2
                    C         D
A   B                          
bar one   D -0.460078 -0.460078
          C  0.798220  0.798220
    three D  1.599986  1.599986
          C -0.554798 -0.554798
    two   D  0.124900  0.124900
          C  0.084758  0.084758
foo one   D -0.466082 -0.233041
          C -0.585512 -0.292756
    three D -0.184726 -0.184726
          C  0.130756  0.130756
    two   D -1.985586 -0.992793
          C  1.275138  0.637569

In [6]: grouped[['D','C']].agg([np.sum, np.mean])
Out[6]: 
                  D                   C          
                sum      mean       sum      mean
A   B                                            
bar one   -0.460078 -0.460078  0.798220  0.798220
    three  1.599986  1.599986 -0.554798 -0.554798
    two    0.124900  0.124900  0.084758  0.084758
foo one   -0.466082 -0.233041 -0.585512 -0.292756
    three -0.184726 -0.184726  0.130756  0.130756
    two   -1.985586 -0.992793  1.275138  0.637569

with a trivial patch

diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index add5080..b885b6f 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2837,9 +2837,6 @@ class NDFrameGroupBy(GroupBy):
             keys = []
             if self._selection is not None:
                 subset = obj
-                if isinstance(subset, DataFrame):
-                    raise NotImplementedError("Aggregating on a DataFrame is "
-                                              "not supported")

                 for fname, agg_how in compat.iteritems(arg):
                     colg = SeriesGroupBy(subset, selection=self._selection,

of course need some tests......

I do not really like this:

In [3]: grouped = df.groupby(['A', 'B'])

In [4]: grouped[['D','C']].agg({'r':np.sum, 'r2':np.mean})
Out[4]: 
                    r        r2
A   B                          
bar one   D -0.460078 -0.460078
          C  0.798220  0.798220
    three D  1.599986  1.599986
...

The fact that C and D end up as rows, feels so different as all other groupby things, that I would not add it to the API. It is rather easy to get with a stack after groupby if you want it that way.

Also, it is a bit strange that it works on grouped[['C', 'D']], but not on grouped itself, while it both are DataFrameGroupby objects (but this is also inconsistent with current master). I would expect both to be the same, but that is not the case now:

In [27]: grouped[['C', 'D']].agg({'r':np.sum, 'r2':np.mean})
...
NotImplementedError: Aggregating on a DataFrame is not supported

In [28]: grouped.agg({'r':np.sum, 'r2':np.mean})
...
KeyError: 'r'
In [21]: grouped[['D','C']].agg({'D':np.sum, 'C':np.mean})
...
NotImplementedError: Aggregating on a DataFrame is not supported

In [22]: grouped.agg({'D':np.sum, 'C':np.mean})
Out[22]:
                  C         D
A   B
bar one    1.249205 -1.576279
    three -0.262759 -0.352865
    two    1.151419 -0.670436
foo one   -0.259004 -0.135123
    three  0.588044 -0.523053
    two    0.817821  2.095902

So apparantly grouped and grouped[['C', 'D']] is not the same, but this difference has never occured to me (I though it was the same just like df and df[['C', 'D']] is the same for a frame with those two columns). So I think this difference is also too subtle to base different behaviour upon.

I think the most important is to have a clear set of rules how the (nested) dicts are interpreted. Eg:

  • For a SeriesGroupby:
    • list: applies functions, and different column names are inferred from function
      • eg grouped['C'].agg(['mean', 'std']))
    • flat dict: different columns names are taken from the dict keys
      • eg grouped['C'].agg({'C_mean': 'mean', 'C_std': 'std'}))
    • nested dict (with lists/dicts as elements): not allowed
  • For a DataFrameGroupby:
    • list: functions are applied to each column, column names are inferred from functions, end up with MultiIndex
      • eg grouped[['C', 'D']].agg(['mean', 'std'])
    • flat dict: to specify a different (set of) function(s) to be applied to each column
      • eg grouped.agg({'C': 'mean', 'D': ['mean', 'std']})
      • and I think grouped[['C', 'D']].agg({'C': 'mean', 'D': ['mean', 'std']}) should be equivalent
    • nested dict: first level is column to which function is applied, second level can be used to give custom names to columns (current behaviour)
      • eg grouped.agg({'C': {'mean1': 'mean', 'med1': 'median'}, 'D': {'mean2': 'mean', 'med2': 'median'}})
      • and I think grouped[['C', 'D']].agg(..) should be equivalent.

The above are the current rules (as far as I know from using it, I don't know if everything is explicitly meant/tested/documented to be the rules).
You can do almost everything, but indeed the case for a DataFrameGroupby of "apply several functions to all columns but with custom name" is not really easy. Because the first level of the dict is interpreted as the column names, you now have to repeat this twice:

In [48]: grouped.agg({'C': {'r':np.sum, 'r2':np.mean},'D': {'r':np.sum, 'r2':np.
mean}})
Out[48]:
                  C                   D
                  r        r2         r        r2
A   B
bar one    1.249205  1.249205 -1.576279 -1.576279
    three -0.262759 -0.262759 -0.352865 -0.352865
    two    1.151419  1.151419 -0.670436 -0.670436
foo one   -0.518008 -0.259004 -0.135123 -0.067562
    three  0.588044  0.588044 -0.523053 -0.523053
    two    1.635643  0.817821  2.095902  1.047951

But allowing to let the first level of the dict to be the custom names instead of column names (as implemented in #11603) to ease this case:

In [130]: grouped[['D','C']].agg({'r':np.sum, 'r2':np.mean})     ## not behaviour of master
Out[130]:
                    r        r2
A   B
bar one   D -0.435276 -0.435276
          C  0.602266  0.602266
    three D -2.090016 -2.090016
          C -1.138887 -1.138887
    two   D -1.012663 -1.012663
          C  1.069958  1.069958
foo one   D -0.609795 -0.304898
          C -0.812805 -0.406402
    three D -1.957863 -1.957863
          C  0.655829  0.655829
    two   D  1.139243  0.569621
          C -0.327944 -0.163972

seems like a potential rabbit hole to me .. (possible conflicts between existing column names / custom names, behaviour dependent on the presence of a column with a certain name, ..)

So I would rather vote to keep the rules as above, and remove the distinction between grouped and grouped[['C', 'D']].
I agree that it makes the specified use case of above (and of https://groups.google.com/forum/#!topic/pydata/nXutBGUDEYw) more cumbersome, but I do not directly see a way to do this cleanly

cc @jreback @TomAugspurger @sinhrks @shoyer

@jreback jreback added a commit to jreback/pandas that referenced this issue Dec 19, 2015

@jreback jreback BUG/API: consistency in .agg with nested dicts #9052 36fb835

jreback closed this in #11603 Dec 19, 2015

Contributor

jreback commented Dec 19, 2015

acutally not closing this

jreback reopened this Dec 19, 2015

jreback referenced this issue Jan 23, 2016

Closed

Refactored Resample API breaking change #11841

2 of 2 tasks complete

@jreback jreback added a commit to jreback/pandas that referenced this issue Feb 2, 2016

@jreback jreback API: add doc examples for #9052 e243f18

jreback closed this in 1dc49f5 Feb 2, 2016

Contributor

xflr6 commented Feb 14, 2016

The following raises SpecificationError in 0.18.0, although there is no ambiguity (SeriesGroupby):

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.arange(8)})

grouped = df.groupby(['A', 'B'])

grouped['D'].agg({'D': np.sum, 'result2': np.mean})

Is this intended or a bug (I'd prefer to be able to reuse the series column name)?

This should work (it is also a regression, as it worked before).
I think this should work because for a SeriesGroupBy, the dict keys can/should always be interpreted as new column names, and not to select existing columns names.

@jreback jreback added a commit to jreback/pandas that referenced this issue Feb 15, 2016

@jreback jreback BUG: addtl fix for compat summary of groupby/resample with dicts
closes #9052
66c23aa
Contributor

jreback commented Feb 15, 2016

@xflr6

this is fixed in #12329

In [3]: grouped['D'].agg({'D': np.sum, 'result2': np.mean})
Out[3]: 
           result2  D
A   B                
bar one          1  1
    three        3  3
    two          5  5
foo one          3  6
    three        7  7
    two          3  6

Note that this works as well, though maybe not as to the users intent (e.g. the C is exactly a label here, nothing to do with the actual aggregation columns.

In [4]: grouped['D'].agg({'D': np.sum, 'c': np.mean})
Out[4]: 
           C  D
A   B          
bar one    1  1
    three  3  3
    two    5  5
foo one    3  6
    three  7  7
    two    3  6

jreback closed this in cac5f8b Feb 15, 2016

arita37 commented Jan 28, 2017

To reference on complex groupby:
We have sometimes 2 dimensionnal data like
date, user_id, val1, val2, val3

and need to transform into 'groupby' :
user_id_, mycol1, mycol2,..

Usually, this is done by

for x in user_id_list : 
   dfi= df[ df.user_id= x] 
   user_dict[x]['mycol1']=  myfun(dfi)
   user_dict[x]['mycol2']=  myfun2(dfi)

Is there a way to this kind of complex and generic grouping in groupby pandas ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment