passing a function to DataFrameGroupBy.agg - confusing documentation/behaviour #15304

@colinmorris

Problem description

For clarity, I'll split this into 3 sub-issues related to DataFrameGroupBy.agg and its documentation.

1. Not clear what kind of custom function should be provided

The docstring on DataFrameGroupBy.agg says that you can pass in a function, but it's unclear what that function should expect to receive as input, and how what it returns relates to the return value of agg.

After some poking around, it seems to me that the typical pattern is to pass a function that takes a Series and returns a scalar. The DataFrame returned by agg then has the property that aggregated.loc[group_foo, col_bar] is the result of calling the function on the Series formed by column col_bar, restricted to the rows belonging to group_foo.

If that is the expected behaviour, it should be explained in the docs.
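
To make the pattern described above concrete, here is a minimal sketch. The helper name value_range and the sample data are made up for illustration. It recomputes one cell of the aggregated frame by hand and compares it with what agg returns.

```python
import pandas as pd

# Small frame with one grouping column ("a") and two value columns.
df = pd.DataFrame({
    "a": [1, 0, 0, 0],
    "b": [0.5, 1.0, 2.0, 4.0],
    "c": [10.0, 20.0, 30.0, 50.0],
})
g = df.groupby("a")


def value_range(series):
    # Hypothetical helper: takes one column of one group (a Series)
    # and returns a single scalar.
    return series.max() - series.min()


aggregated = g.agg(value_range)

# Recompute the cell for group 0, column "b" directly from the rows
# of df that belong to that group.
manual = value_range(df.loc[df["a"] == 0, "b"])
cell_matches = aggregated.loc[0, "b"] == manual
print(aggregated)
print(cell_matches)
```

If the two values agree for every (group, column) pair, that supports the reading that the function is applied once per column per group when no extra arguments are passed.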

2. Passing an arbitrary kwarg changes the function behaviour

I stumbled on this weirdness while trying to figure out how agg worked:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.rand(4,2), columns=['b', 'c'])
>>> df['a'] = [1, 0, 0, 0]
>>> g = df.groupby('a')
>>> g.agg(lambda x: np.product(x.shape))
     b    c
a          
0  3.0  3.0
1  1.0  1.0
>>> g.agg(lambda x, foo=0: np.product(x.shape), foo=0)
   b  c
a      
0  9  9
1  3  3

In other words, by passing in any meaningless kwarg, my function is now called ngroups times with a dataframe for each group, rather than being called with a series ngroups * ncolumns times.
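
One way to check what the function actually receives is to record the type of its argument on each call. The sketch below does that with a hypothetical make_probe helper. It is only a diagnostic, and what it prints for the kwarg case may differ between pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8.0).reshape(4, 2), columns=["b", "c"])
df["a"] = [1, 0, 0, 0]
g = df.groupby("a")


def make_probe(log):
    # Hypothetical helper: builds an aggregation function that appends
    # the type name of whatever pandas passes in to `log`.
    def probe(x, foo=0):
        log.append(type(x).__name__)
        # len() is valid for both a Series and a DataFrame, so the
        # probe runs whichever one it is given.
        return len(x)
    return probe


types_without_kwarg = []
g.agg(make_probe(types_without_kwarg))

types_with_kwarg = []
g.agg(make_probe(types_with_kwarg), foo=0)

print("no kwarg:  ", sorted(set(types_without_kwarg)))
print("with kwarg:", sorted(set(types_with_kwarg)))
```

On the version reported in this issue, I would expect the second line to show DataFrame rather than Series, matching the output above.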

Also, when called in this way, it seems my function can return a list or series of length ncolumns which gets expanded (this doesn't work in the non-kwarg version):

>>> g.agg(lambda x, foo=0: [20, 17], foo=0)
     b   c
a        
0  20  17
1  20  17

3. Note on numpy special casing is confusing

The docstring has this note:

Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).

This is confusing because passing in lambda x: np.mean(x) does seem to give the 'right' answer (the same one as passing lambda x: np.mean(x, axis=0), or np.mean, or 'mean'). This is true whether or not I throw in a kwarg.
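
For reference, the distinction the note draws only shows up on 2-D input. The sketch below uses made-up sample values and is not a statement about how pandas implements the special casing. It shows that np.mean with and without axis=0 differ on a 2-D array but agree on a single column, which may be part of why the lambda version looks 'right' when the function is called once per column.

```python
import numpy as np
import pandas as pd

arr_2d = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# Default NumPy behaviour: one scalar over every element.
mean_all = np.mean(arr_2d)

# Along axis=0: one value per column.
mean_cols = np.mean(arr_2d, axis=0)

# On a single column (1-D), there is only one axis, so both
# spellings reduce to the same scalar.
col = pd.Series([1.0, 3.0])
mean_series_default = np.mean(col)
mean_series_axis0 = np.mean(col, axis=0)

print(mean_all, mean_cols, mean_series_default, mean_series_axis0)
```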

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 32
OS: Linux
OS-release: 3.13.0-107-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 25.1.0
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 1.5.7
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.1.3
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.8.0
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: None
pandas_datareader: None

Metadata

Labels: Apply (Apply, Aggregate, Transform, Map), Docs, Groupby
