DEPR: deprecate relableling dicts in groupby.agg #15931

Merged
4 commits merged on Apr 13, 2017

Contributor

@jreback jreback commented Apr 7, 2017

precursor to #14668

This is basically in the whatsnew, but:

In [1]: df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
   ...:                    'B': range(5),
   ...:                    'C': range(5)})
   ...: df
   ...: 
Out[1]: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  2  3  3
4  2  4  4

This is good; multiple aggregations on a dataframe with a dict-of-lists

In [2]: df.groupby('A').agg({'B': ['sum', 'max'],
   ...:                      'C': ['count', 'min']})
   ...: 
Out[2]: 
    B         C    
  sum max count min
A                  
1   3   2     3   0
2   7   4     2   3

This is a dict on a grouped Series -> deprecated

In [3]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dictionary on a Series for aggregation
is deprecated and will be removed in a future version
Out[3]: 
   foo
A     
1    3
2    2

Furthermore, this has to go as well: a nested dict that does renaming.
Note that once we fix #4160 (renaming with a level), the following becomes almost trivial to rename in-line.

In [4]: df.groupby('A').agg({'B': {'foo': ['sum', 'max']},
   ...:                      'C': {'bar': ['count', 'min']}})
FutureWarning: using a dictionary on a Series for aggregation
is deprecated and will be removed in a future version
Out[4]: 
  foo       bar    
  sum max count min
A                  
1   3   2     3   0
2   7   4     2   3

Note: I will fix this message (as it doesn't actually apply here)

@jreback jreback added the Deprecate (Functionality to remove in pandas), Groupby, and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Apr 7, 2017
@jreback jreback added this to the 0.20.0 milestone Apr 7, 2017
@jreback
Contributor Author

jreback commented Apr 7, 2017

@codecov

codecov bot commented Apr 7, 2017

Codecov Report

Merging #15931 into master will decrease coverage by 0.02%.
The diff coverage is 73.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15931      +/-   ##
==========================================
- Coverage   91.03%      91%   -0.03%     
==========================================
  Files         145      145              
  Lines       49587    49636      +49     
==========================================
+ Hits        45141    45171      +30     
- Misses       4446     4465      +19
Flag Coverage Δ
#multiple 88.77% <73.33%> (-0.03%) ⬇️
#single 40.53% <5.33%> (-0.04%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 95.54% <100%> (ø) ⬆️
pandas/types/cast.py 85.11% <20%> (-0.63%) ⬇️
pandas/core/base.py 92.32% <71.42%> (-3.19%) ⬇️
pandas/core/common.py 91.03% <0%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7b8a6b1...ff1a5f6. Read the comment docs.

@jreback jreback changed the title from "DEPR: deprecate relabling dictionarys in groupby.agg" to "DEPR: deprecate relableling dicts in groupby.agg" Apr 7, 2017
@jreback
Contributor Author

jreback commented Apr 9, 2017

@jorisvandenbossche @chris-b1 @shoyer thoughts? (Chris, this was discussed at the meeting, to try to reduce the number of cases that we would handle.) The idea is that we go ahead with this deprecation, then merge .agg (which will also have the same deprecation; or maybe I can just raise since that is new functionality, but that comes afterwards anyhow).

@chris-b1
Contributor

chris-b1 commented Apr 9, 2017

This seems like a reasonable deprecation, the current behavior is probably too overloaded and hard to think about.

Might put the recommended way in the deprecation message? Would also be nice to have #4160 in 0.20, so the DataFrame case is more consistent.

@jreback
Contributor Author

jreback commented Apr 9, 2017

yes going to try to fix #4160 before the release as well.

@jreback
Contributor Author

jreback commented Apr 12, 2017

@jorisvandenbossche @TomAugspurger @chris-b1 @shoyer if you'd have a look. going to merge later today.

Contributor

@TomAugspurger TomAugspurger left a comment

1) We are deprecating passing a dict to a grouped/rolled/resampled ``Series``. This allowed
one to ``rename`` the resulting aggregation, but this had a completely different
meaning than passing a dictionary to a grouped ``DataFrame``, which accepts column-to-aggregations.
2) We are deprecating passing a dict-of-dict to a grouped/rolled/resampled ``DataFrame`` in a similar manner.
Contributor

dict-of-dicts

Contributor Author

done

.. code-block:: ipython

In [6]: df.groupby('A').B.agg({'foo': 'count'})
FutureWarning: using a dictionary on a Series for aggregation
Contributor

Reminder to update this with the new FutureWarning if we change the message

Contributor Author

done

.. ipython:: python

r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})
r.columns = r.columns.set_levels(['foo', 'bar'], level=0)
Contributor

I think .rename works here as well

In [11]: r.rename(columns={"B": "foo", "C": "bar"})
Out[11]:
  foo       bar
  sum max count min
A
1   3   2     3   0
2   7   4     2   3

though I didn't realize .rename worked like that on MI. I thought you'd get tuples.

Member

Yes, for the first level this indeed works.

Contributor Author

hmm that's nice actually.
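
The `.rename` behaviour discussed in this thread can be checked with a short script. This is only a sketch that reuses the `df` from the PR description; the variable names `r` and `renamed` are illustrative and not part of the PR.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# Aggregate with a dict-of-lists (the form that is not deprecated)
r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})

# Relabel the top column level; the columns stay a MultiIndex
renamed = r.rename(columns={'B': 'foo', 'C': 'bar'})
print(type(renamed.columns).__name__)
print(renamed.columns.tolist())
```

The labels are replaced wherever they match, and the second level (`sum`, `max`, `count`, `min`) is left untouched.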

Member

@jorisvandenbossche jorisvandenbossche left a comment

(potentially different) aggregations.

However, ``.agg(..)`` can *also* accept a dict that allows 'renaming' of the result columns. This is a complicated and confusing syntax, as well as not consistent
between ``Series`` and ``DataFrame``. We are deprecating this 'renaming' functionarility.
Member

typo in functionarility

Contributor Author

done

.. ipython:: python

df.groupby('A').agg({'B': ['sum', 'max'],
'C': ['count', 'min']})
Member

You might do the simpler thing of a dict of scalars instead of a dict of lists (to not complicate the example).

Eg

In [29]: df.groupby('A').agg({'B': 'sum','C': 'min'})
Out[29]: 
   C  B
A      
1  0  3
2  3  7

Then it contrasts even more with the Series one, where the example is also a dict of scalars.

Contributor Author

sure


.. ipython:: python

df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})
Member

df.groupby('A').B.agg('count').rename('foo') would actually be simpler .. (but is not exactly equivalent to the dict case)

Contributor Author

yes, there are a myriad of options!
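
To make the "not exactly equivalent" remark concrete, here is a small sketch comparing the two spellings on the `df` from the PR description. The names `as_frame` and `as_series` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# List form: agg returns a DataFrame, so the column is renamed via columns=
as_frame = df.groupby('A').B.agg(['count']).rename(columns={'count': 'foo'})

# Scalar form: agg returns a Series, and rename() with a scalar sets its name
as_series = df.groupby('A').B.agg('count').rename('foo')

print(type(as_frame).__name__, type(as_series).__name__)
```

Both hold the same counts; the list form matches the shape the deprecated dict form produced (a one-column DataFrame), while the scalar form yields a named Series.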

.. ipython:: python

r = df.groupby('A').agg({'B': ['sum', 'max'], 'C': ['count', 'min']})
r.columns = r.columns.set_levels(['foo', 'bar'], level=0)
Member

Yes, for the first level this indeed works.

'C': range(5)})

with tm.assert_produces_warning(FutureWarning,
check_stacklevel=False) as w:
Member

Is it giving an error without check_stacklevel=False?

Contributor Author

yeah I never check the stacklevel, too hard to get it exactly right.

df.groupby('A').agg({'B': {'foo': ['sum', 'max']},
'C': {'bar': ['count', 'min']}})
assert "using a dict with renaming" in str(w[0].message)

Member

It is tested in other cases as well, but since this is a test specifically for the deprs, maybe also add the case of df.groupby('A')[['B', 'C']].agg({'ma': 'max'}) (then you have the different 'cases' that raise deprecation here)

Contributor Author

added

def name(self):
def _selection_name(self):
""" return a name for myself; this would ideally be the 'name' property, but
we cannot conflict with the Series.name property which can be set """
Member

This explanation is not fully clear to me (but maybe I am not familiar enough with the groupby codebase)

Contributor Author

yeah this is purely internal.

("using a dict on a Series for aggregation\n"
"is deprecated and will be removed in a future "
"version"),
FutureWarning, stacklevel=7)
Member

Using python 3.5, this needed to be 3 instead of 7 for the example of the whatsnew docs (but may depend on the code path taken up to this function)

Contributor Author

yeah I have no idea what these should be. I think I was playing with these when testing.

Contributor Author

the real issue is that the agg evaluation is recursive (somewhat), so I could figure it out I suppose but...

Member

@jorisvandenbossche jorisvandenbossche Apr 13, 2017

I see you changed them all to 4, that is indeed good for the others, but for this one you need 3 for the case in those tests.

(patch based on the previous state of this PR, the first two are already OK):

diff --git a/pandas/core/base.py b/pandas/core/base.py
index 25ebb1d..451c773 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -502,7 +502,7 @@ pandas.DataFrame.%(name)s
                             ("using a dict with renaming "
                              "is deprecated and will be removed in a future "
                              "version"),
-                            FutureWarning, stacklevel=3)
+                            FutureWarning, stacklevel=4)
 
                 arg = new_arg
 
@@ -516,7 +516,7 @@ pandas.DataFrame.%(name)s
                         ("using a dict with renaming "
                          "is deprecated and will be removed in a future "
                          "version"),
-                        FutureWarning, stacklevel=3)
+                        FutureWarning, stacklevel=4)
 
             from pandas.tools.concat import concat
 
diff --git a/pandas/core/groupby.py b/pandas/core/groupby.py
index 978f444..5591ce4 100644
--- a/pandas/core/groupby.py
+++ b/pandas/core/groupby.py
@@ -2843,7 +2843,7 @@ class SeriesGroupBy(GroupBy):
                     ("using a dict on a Series for aggregation\n"
                      "is deprecated and will be removed in a future "
                      "version"),
-                    FutureWarning, stacklevel=7)
+                    FutureWarning, stacklevel=3)
 
             columns = list(arg.keys())
             arg = list(arg.items())

The above gives correct warnings for the three cases in this explicit test for deprecation warnings.

So I would try to remove check_stacklevel=False for those three (and leave it for all the others, since with some other code paths it might indeed be different).
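
Why the right `stacklevel` depends on the code path can be seen with the standard library alone. The sketch below does not use pandas; `_inner`, `public_api` and `user_code` are made-up names standing in for an internal helper, a public method and the caller.

```python
import warnings


def _inner():
    # stacklevel=1 would blame this line, 2 would blame public_api(),
    # and 3 blames whoever called public_api()
    warnings.warn("deprecated", FutureWarning, stacklevel=3)


def public_api():
    _inner()


def user_code():
    public_api()  # the warning should be attributed to this line


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    user_code()

print(caught[0].category.__name__, caught[0].lineno)
```

If another internal layer were added between `public_api` and `_inner`, the same `stacklevel=3` would point one frame too deep, which is the situation described for the recursive agg evaluation.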


raise ValueError("cannot perform both aggregation "
"and transformation operations "
"simultaneously")
Member

Is this a new error message? (and in case so, just checking if there is a test added for it?)

Contributor Author

these are used in the .agg changes (which are on top of this). They aren't used in this PR, but were in the same file so I left them.

@jreback
Contributor Author

jreback commented Apr 12, 2017

ok docs are updated. note that in the post-PR #14668 I already rewrote a lot of the docs (and links) for agg/transform.

@jreback jreback merged commit 1c4dacb into pandas-dev:master Apr 13, 2017
@jreback
Contributor Author

jreback commented Apr 13, 2017

merging, will fix up any additional comments in #14668

@zertrin
Contributor

zertrin commented Jun 13, 2017

Hi, sorry for digging this up, but even though I understand the rationale for the deprecation, and after reading the What's New and the documentation, I still don't see how to replace the following use case.

(The documentation only covers the simple cases where one either applies exactly one aggregator per column, or the same set of aggregators over all columns, but not the case where different sets of aggregators are applied to different columns):

Input Dataframe:

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6))
  cat  distance  energy
0   A      1.20    1.80
1   A      1.50    1.95
2   A      1.74    2.04
3   B      0.82    1.25
4   B      1.01    1.60
5   C      0.60    1.01

Cool aggregation and rename in one step (but DEPRECATED):

mydf_agg = mydf.groupby('cat').agg({
    'energy': {'energy_sum': 'sum'},
    'distance': {
        'distance_sum': 'sum',
        'distance_mean': 'mean',
    },
})

Resulting in a MultiIndex columns

        energy     distance              
    energy_sum distance_sum distance_mean
cat                                      
A         5.79         4.44         1.480
B         2.85         1.83         0.915
C         1.01         0.60         0.600

Just have to drop the upper level to get to my resulting dataframe with the renamed columns:

mydf_agg.columns = mydf_agg.columns.droplevel(level=0)
     energy_sum  distance_sum  distance_mean
cat                                         
A          5.79          4.44          1.480
B          2.85          1.83          0.915
C          1.01          0.60          0.600

Of course this is a toy example; in a typical use case there can be many more columns/aggregator functions.

So my question is: could you provide an example of the currently recommended way to achieve the exact same result (last DataFrame) in the case where different sets of aggregators are applied to different columns?

@zertrin
Contributor

zertrin commented Jun 13, 2017

Oh i see that that was originally documented, but subsequently simplified:
ff1a5f6#diff-52364fb643114f3349390ad6bcf24d8fL521

However by trying this approach, I'm still blocked:

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})

    energy distance       
       sum      sum   mean
cat                       
A     5.79     4.44  1.480
B     2.85     1.83  0.915
C     1.01     0.60  0.600

But then, how can I rename with a mapping of (level0 + level1 --> final_name) like this:

{
    'energy.sum': 'energy_sum',
    'distance.sum': 'distance_sum',
    'distance.mean': 'distance_mean',
}

Or even better, by using some kind of callable like this:

def rename_mapping(level0, level1):
    return level0 + '_' + level1
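
One possible way to apply either kind of mapping is a sketch like the following. It relies only on the fact that iterating over MultiIndex columns yields `(level0, level1)` tuples; the mapping is keyed by tuples rather than dotted strings, and the names `mapping`, `explicit_names` and `flat_names` are illustrative.

```python
import pandas as pd

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
    },
    index=range(6))

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})


def rename_mapping(level0, level1):
    return level0 + '_' + level1


# Variant 1: explicit mapping keyed by the (level0, level1) tuples
mapping = {
    ('energy', 'sum'): 'energy_sum',
    ('distance', 'sum'): 'distance_sum',
    ('distance', 'mean'): 'distance_mean',
}
explicit_names = [mapping[col] for col in mydf_agg2.columns]

# Variant 2: a callable applied to each tuple
flat_names = [rename_mapping(*col) for col in mydf_agg2.columns]

mydf_agg2.columns = flat_names
print(mydf_agg2.columns.tolist())
```

Because the mapping is keyed by the column tuples, it does not depend on the order in which the aggregated columns come out.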

@zertrin
Contributor

zertrin commented Jun 13, 2017

Sorry for the spam (this is the last one) but I just found an interesting discussion and solutions here: https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function (don't look at the accepted answer)

In particular, the missing piece of information for me was the existence of the df.columns.ravel() method.

newidx = []
for (n1,n2) in mydf_agg.columns.ravel():
    newidx.append("%s_%s" % (n1,n2))
mydf_agg.columns=newidx

More generally, I think it is good to leave a link to this Stack Overflow thread here, since after seeing the deprecation message, this GitHub pull request is one of the first places to look for solutions (after the docs and What's New).

Maybe some of Joel Ostblom's answer and/or of Gadi Oron's answer could make their way into the docs as an example for all of us that relied previously on this relabeling functionality with .agg() ?

In particular, with this deprecation, the use of lambda functions in .agg() is directly impacted (cf Joel Ostblom's answer above) and could warrant a notice in the docs.

@jreback
Contributor Author

jreback commented Jun 13, 2017

@zertrin if you want to show a more extended / complex example in the docs that would be great. push up a PR and will comment.

@garfieldthecat

Please, please, pretty please, do NOT deprecate this. Not only is removing backward compatibility always an issue, and one of the key obstacles in the adoption of Python for data science - it makes it way more cumbersome to run what should be extremely banal, i.e. a groupby where different aggregate functions are applied to different columns (sum of x, avg of x, min of y, etc), and where you have the explicit need to rename the resulting field (e.g. sum_x won't do). The way you are going, you are forcing people to rename fields manually after the groupby - surely this is as non-pythonic as it gets?

I do not understand in what way removing this feature would possibly clean anything up, or make anything clearer. How would you answer this question now? https://stackoverflow.com/questions/32374620/python-pandas-applying-different-aggregate-functions-to-different-columns

How would you recommend rewriting this very simple and IMHO pythonically elegant line of code?

df.groupby('qtr').agg({"realgdp": {"mean_gdp": "Mean GDP", "std_gdp": "STD of GDP"},
                       "unemp": {"mean_unemp": "Mean unemployment"}})

@TomAugspurger
Contributor

I assume you meant

In [21]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
    ...:                        "unemp": {"mean_unemp": "mean"}})
    ...:
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4139: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[21]:
     realgdp                  unemp
    mean_gdp     std_gdp mean_unemp
qtr
1     1692.0  115.258405       4.95
2     1658.8         NaN       5.60
3     1723.0         NaN       4.60
4     1753.9         NaN       4.20

The recommendation in the whatsnew gets you most of the way there:

In [30]: r = df.groupby("qtr").agg({"realgdp": ['mean', 'std'], "unemp": ['mean']})

In [32]: r
Out[32]:
    realgdp             unemp
       mean         std  mean
qtr
1    1692.0  115.258405  4.95
2    1658.8         NaN  5.60
3    1723.0         NaN  4.60
4    1753.9         NaN  4.20

Does that work for you? At this point I typically get rid of the MI in the columns, since I find them awkward to work with.

@garfieldthecat

How would you recommend renaming the columns?
If I just do columns.droplevel(0), I end up with multiple columns sharing the same name, as the same aggregate function applies to multiple columns.
I could do something like
r.columns = [' '.join(col).strip() for col in r.columns.values]

so that the fields become: [ "x sum", "x min", "y sum"] etc. (or whatever the aggregate functions were)
and take it from here, but it is still longer and more cumbersome than my previous approach.

**Can someone please, please, please remind me why this is being deprecated?

I see the downsides, I do not see any upside!**

Removing backward compatibility should always be a last resort. Doing so when the new approach becomes way longer and more convoluted, well, it just beggars belief!

@zertrin
Contributor

zertrin commented Oct 12, 2017

At this point I typically get rid of the MI in the columns, since I find them awkward to work with.

Yeah, this is one of the issues with this change: it makes something that was trivially simple before much harder to do now. Especially when applying the same aggregate over many columns, you can't just drop the first level of the MI.

In summary, this is what this change results in:

Before:

straightforward, quite easy to understand, flexible (I have the choice for the name of the columns)

mydf_agg = mydf.groupby('cat').agg({
    'energy': {'total_energy': 'sum'},
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
    },
})
# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

Result:

    total_energy total_distance average_distance
cat
A           5.79           4.44            1.480
B           2.85           1.83            0.915
C           1.01           0.60            0.600

After:

No way of really customizing the column names after aggregation, the best we can get is some combination of original column name and aggregate function's name:

mydf_agg2 = mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
})
mydf_agg2.columns = ['_'.join(col) for col in mydf_agg2.columns]

Result:

     energy_sum  distance_sum  distance_mean
cat
A          5.79          4.44          1.480
B          2.85          1.83          0.915
C          1.01          0.60          0.600

Note that I couldn't really choose the name of the resulting columns.... If I want to, I need to find another way of replacing the name. Like a mapping like this (which is annoying to write):

mydf_agg2.rename(columns={"energy_sum": "total_energy", "distance_sum": "total_distance", "distance_mean": "average_distance"}, inplace=True)

Now we finally get the same result as before, just in a longer and more complicated way...


And another annoying issue with this change: when using custom aggregation callables:

  • Before there's no issue since I could specify the destination's column name myself.
  • Now I can't do it that easily since the destination column name is based on the aggregate callable's name and I need to make sure that my custom aggregation callable has a meaningful __name__ attribute... which isn't necessarily the case with partial or lambda functions, for example.
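
As an illustration of the `__name__` point, the sketch below attaches names to a lambda and a partial before passing them to `.agg()` as a list. The helper `named` is hypothetical (it is not part of pandas), and `np.percentile` stands in for any custom aggregator.

```python
from functools import partial

import numpy as np
import pandas as pd

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
    })


def named(func, name):
    """Hypothetical helper: set __name__ so agg() can label the column."""
    func.__name__ = name
    return func


# A lambda is otherwise called '<lambda>'; a partial has no __name__ at all
p17 = named(lambda x: np.percentile(x, 17), 'energy_p17')
p98 = named(partial(np.percentile, q=98), 'energy_p98')

result = mydf.groupby('cat')['energy'].agg([p17, p98])
print(result.columns.tolist())
```

This works, but the output name is still defined away from the place where the aggregator is listed, which is the drawback raised in this comment.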

@TomAugspurger
Contributor

Can someone please, please, please remind me why this is being deprecated?

I see the downsides, I do not see any upside!

I'm assuming you saw the release note with the deprecation? A nested dictionary meant we had two behaviors for the renaming, either selecting columns, or assigning names.

Thanks for the thoughtful writeup @zertrin. It sounds like the main difficulty is with the renaming. Would something like

mydf.groupby('cat').agg({
    'energy': 'sum',
    'distance': ['sum', 'mean'],
}).collapse_levels(columns="_")  # ['_'.join(col) for col in df.columns]

Work for you? That's when the "default" names are OK. For non-default names, maybe something like

mydf.groupby('cat').agg({
    ...
}).relabel(columns=['c1', 'c2', 'c3'])
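
Neither `collapse_levels` nor `relabel` exists as a DataFrame method; they are proposals in this comment. A rough approximation with plain functions and `DataFrame.pipe` might look like the sketch below, where both helper functions and their signatures are assumptions made for illustration.

```python
import pandas as pd


def collapse_levels(frame, sep="_"):
    """Hypothetical helper: join MultiIndex column levels into flat names."""
    out = frame.copy()
    out.columns = [sep.join(map(str, col)) for col in out.columns]
    return out


def relabel(frame, columns):
    """Hypothetical helper: replace the column labels positionally."""
    out = frame.copy()
    out.columns = list(columns)
    return out


mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
    })

agg = mydf.groupby('cat').agg({'energy': 'sum', 'distance': ['sum', 'mean']})

flat = agg.pipe(collapse_levels, sep="_")                # default-style names
custom = agg.pipe(relabel, columns=['c1', 'c2', 'c3'])   # positional names
print(flat.columns.tolist(), custom.columns.tolist())
```

The positional variant only stays correct while the order of the aggregated columns is known.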

@garfieldthecat

I'm assuming you saw the release note with the deprecation? A nested dictionary meant we had two behaviors for the renaming, either selecting columns, or assigning names.

Do you mean these notes? #14668
I read them but I'm not sure I understood the reason. Could you maybe try to re-explain?

@garfieldthecat

Also, another problem with renaming is if you use more than one lambda function on the same column, e.g. to calculate the % of both sum(x) and count(x). In this case, you'd end up with multiple columns having the same name: "x_lambda". Quite a mess! You could rename the columns based on their position rather than their names, but it's extremely cumbersome and un-pythonic. All of this would, of course, be avoided by not deprecating.
Or is there maybe a better way I am missing?

@zertrin
Contributor

zertrin commented Oct 12, 2017

@TomAugspurger Thanks for proposing alternatives; however, these fail to tackle the real issue.

The problem is not whether or not it is possible to do the renaming and how. The answer to that is yes it's possible and your proposed solutions do not bring more value than the other example above (like assigning directly to mydf.columns), except the burden of adding two more methods to the already long list of methods of the DataFrame class.

The real issue is that this change forces us to separate the place where the renaming is defined from the definition of the corresponding aggregate function.

Semantically this is really annoying, because now we have to keep track of two lists and keep them in sync when we want to add another aggregate column. We must track down in which order the new aggregate column will land and where in the renaming list to update after adding one more aggregate function...

So in a nutshell:

  • Before, column renaming and the definition of the operation were together, so they are naturally in sync.
  • Now, first you define the aggregate callable, and afterward you have to rename and be very careful about the resulting column order.

And no one has even begun to address the issue of using custom aggregates. Because these custom callables may have the same __name__ attribute and this results in an exception (partial functions inherit the name of the parent function, and one cannot define it at creation, and all lambda functions are named <lambda> and this is worse because afaik there's no way to define the name of a lambda).

Thus this is a backward incompatible change, and this one has no easy workaround. (There exist tricky workarounds to add the __name__ attribute.)

Slightly extended example from above with lambda and partial:

(please note, this is a crafted example for the purpose of demonstrating the problem, but all of the demonstrated issues here did bite me in real life since the change)

Before:

easy and works as expected

from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': percentile17,
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
        'distance_mad': smrb.mad,
        'distance_mad_c1': mad_c1,
    },
})

results in

          energy                             distance
    total_energy energy_p98 energy_p17 total_distance average_distance distance_mad distance_mad_c1
cat
A           5.79     2.0364     1.8510           4.44            1.480     0.355825           0.240
B           2.85     1.5930     1.3095           1.83            0.915     0.140847           0.095
C           1.01     1.0100     1.0100           0.60            0.600     0.000000           0.000

and all that is left is:

# get rid of the first MultiIndex level in a pretty straightforward way
mydf_agg.columns = mydf_agg.columns.droplevel(level=0)

After

from functools import partial

import numpy as np
import statsmodels.robust as smrb

percentile17 = lambda x: np.percentile(x, 17)
mad_c1 = partial(smrb.mad, c=1)

mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        'sum',
        lambda x: np.percentile(x, 98),
        percentile17,
    ],
    'distance': [
        'sum',
        'mean',
        smrb.mad,
        mad_c1,
    ],
})

The above breaks because the lambda functions will all result in columns named <lambda> which results in

SpecificationError: Function names must be unique, found multiple named <lambda>

Backward incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the lambda x: np.percentile(x, 98) from above, we get the same issue with the partial function which inherits the function name from the original function:

SpecificationError: Function names must be unique, found multiple named mad

Finally, after overwriting the __name__ attribute of the partial (mad_c1.__name__ = 'mad_c1') we get:

    energy          distance
       sum <lambda>      sum   mean       mad mad_c1
cat
A     5.79   1.8510     4.44  1.480  0.355825  0.240
B     2.85   1.3095     1.83  0.915  0.140847  0.095
C     1.01   1.0100     0.60  0.600  0.000000  0.000

with still the renaming to deal with.

@zertrin
Contributor

zertrin commented Nov 2, 2017

@TomAugspurger @jreback do we need to open a separate issue to get this deprecation being reconsidered with all the new facts summarized above that were not initially considered when deciding this?

@shoyer
Member

shoyer commented Nov 2, 2017

If you feel strongly about this, then yes, a new issue would be appropriate. I agree that this API is not as expressive as what we had before, but the behavior we had before for .agg() was inconsistent and could not be explained with a simple set of rules. Please read the full discussion on #14668 for context.

I would be interested to see proposals for alternative APIs that solve your use-case without the full complexity of the deprecated GroupBy.agg() API. For example, one solution might be to handle the deprecated behavior (dict-of-dict) with a new dedicated method.
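
To give a feel for what such a dedicated method could do, here is a minimal sketch written as a standalone function. The name `agg_with_names` and its signature are hypothetical, not a pandas API and not what was eventually proposed; it simply runs each named aggregator on its own and concatenates the results.

```python
import numpy as np
import pandas as pd


def agg_with_names(grouped, spec):
    """Hypothetical helper: spec maps column -> {output_name: aggregator}."""
    pieces = [
        grouped[column].agg(func).rename(name)
        for column, named_funcs in spec.items()
        for name, func in named_funcs.items()
    ]
    # Every piece is a Series indexed by the group keys
    return pd.concat(pieces, axis=1)


mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6],
    })

result = agg_with_names(mydf.groupby('cat'), {
    'energy': {
        'total_energy': 'sum',
        'energy_p98': lambda x: np.percentile(x, 98),
        'energy_p17': lambda x: np.percentile(x, 17),
    },
    'distance': {
        'total_distance': 'sum',
        'average_distance': 'mean',
    },
})
print(result.columns.tolist())
```

Since each aggregator is applied separately and named explicitly, two lambdas on the same column do not clash, at the cost of one pass over the groups per output column.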

@TomAugspurger
Contributor

TomAugspurger commented Nov 2, 2017 via email

@jaron-hivery

@zertrin did you open a new issue for this discussion?

@zertrin
Contributor

zertrin commented Nov 14, 2017

Not yet, have been pretty busy lately, but have a draft. Will try to finish it very soon.

@zertrin
Contributor

zertrin commented Nov 19, 2017

@jaron-hivery the new issue is #18366

@pirsquared

I'm sure @zertrin saw an email but I provided an easy recipe to produce the same results using existing API.

#18366 (comment)

gfyoung added a commit to forking-repos/pandas that referenced this pull request Oct 28, 2018
Grouped, rolled, and resampled Series / DataFrames
will now disallow dicts / nested dicts respectively
as parameters to aggregation (was deprecated before).

xref pandas-devgh-15931.
Labels
Deprecate (Functionality to remove in pandas), Groupby, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Development

Successfully merging this pull request may close these issues.

ENH/BUG: Rename of MultiIndex DataFrames does not work
9 participants