Description
Problem description
For clarity, I'll split this into 3 sub-issues related to DataFrameGroupBy.agg and its documentation.
1. Not clear what kind of custom function should be provided
The docstring on DataFrameGroupBy.agg says that you can pass in a function, but it's unclear what that function should expect to receive as input, and how what it returns relates to the return value of agg.
After some poking around, it seems to me that the typical pattern is to pass a function that takes a Series and returns a scalar, with the DataFrame returned by agg having the property that aggregated.loc[group_foo, col_bar] is the result of calling the function on the Series that is the column col_bar for rows belonging to group_foo.
If that is the expected behaviour, it should be explained in the docs.
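To make the pattern I'm describing concrete, here is a small sketch. The data and the value_range helper are made up for illustration; the check at the end is just my reading of the behaviour, not something the docs state.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 0, 0, 0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})
g = df.groupby("a")

seen_types = []  # records what agg actually hands to the function


def value_range(col):
    # Hypothetical reducer: expects one column of one group, returns a scalar.
    seen_types.append(type(col).__name__)
    return col.max() - col.min()


aggregated = g.agg(value_range)

# Each cell should match calling the function by hand on that group's column.
for group_key in aggregated.index:
    for col_name in aggregated.columns:
        by_hand = df.loc[df["a"] == group_key, col_name]
        assert aggregated.loc[group_key, col_name] == by_hand.max() - by_hand.min()

print(aggregated)
print(set(seen_types))
```

If the pattern holds, every cell passes the by-hand check and the function only ever sees Series objects.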
2. Passing an arbitrary kwarg changes the function behaviour
I stumbled on this weirdness while trying to figure out how agg worked:
>>> df = pd.DataFrame(np.random.rand(4,2), columns=['b', 'c'])
>>> df['a'] = [1, 0, 0, 0]
>>> g = df.groupby('a')
>>> g.agg(lambda x: np.product(x.shape))
     b    c
a
0  3.0  3.0
1  1.0  1.0
>>> g.agg(lambda x, foo=0: np.product(x.shape), foo=0)
   b  c
a
0  9  9
1  3  3
In other words, by passing in any meaningless kwarg, my function is now called ngroups times with a dataframe for each group, rather than being called with a series ngroups * ncolumns times.
Also, when called in this way, it seems my function can return a list or series of length ncolumns, which gets expanded (this doesn't work in the non-kwarg version):
>>> g.agg(lambda x, foo=0: [20, 17], foo=0)
    b   c
a
0  20  17
1  20  17
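For anyone hitting the same thing, a possible way to sidestep the switch is to bind the keyword in a closure so that agg itself receives no extra kwargs. This is only a sketch: it assumes the code path changes only when agg is handed extra args or kwargs, which is what the output above suggests. The scaled_count helper and its scale parameter are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 0, 0, 0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})
g = df.groupby("a")


def scaled_count(col, scale=1):
    # Hypothetical reducer with an extra keyword: number of rows times scale.
    return len(col) * scale


# No kwargs reach agg here, so the function is called once per column per group.
plain = g.agg(scaled_count)

# The keyword is bound inside the lambda instead of being passed through agg.
bound = g.agg(lambda col: scaled_count(col, scale=10))

print(plain)
print(bound)
```

Here bound should simply be plain multiplied by 10, with the function still receiving one column at a time.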
3. Note on numpy special casing is confusing
The docstring has this note:
Numpy functions mean/median/prod/sum/std/var are special cased so the default behavior is applying the function along axis=0 (e.g., np.mean(arr_2d, axis=0)) as opposed to mimicking the default Numpy behavior (e.g., np.mean(arr_2d)).
Which is confusing because passing in lambda x: np.mean(x) does seem to give the 'right' answer (the same one as passing lambda x: np.mean(x, axis=0), or np.mean, or 'mean'). This is true whether or not I throw in a kwarg.
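A quick way to check the equivalence for the non-kwarg case, again on made-up data. My guess at the explanation (an assumption, not something the docstring says) is that the function receives a 1-D Series, so axis=0 is the only axis there is and the default NumPy behaviour cannot differ.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1, 0, 0, 0],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})
g = df.groupby("a")

by_name = g.agg("mean")
by_lambda = g.agg(lambda x: np.mean(x))
by_lambda_axis0 = g.agg(lambda x: np.mean(x, axis=0))

# All three spellings should produce the same per-group, per-column means.
assert np.allclose(by_name.to_numpy(), by_lambda.to_numpy())
assert np.allclose(by_name.to_numpy(), by_lambda_axis0.to_numpy())

print(by_name)
```

If that reading is right, the note only matters when the function is handed something 2-D, and the docstring should say when that happens.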
Output of pd.show_versions()
pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 25.1.0
Cython: None
numpy: 1.12.0
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.1
matplotlib: 2.0.0
openpyxl: 1.5.7
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.1.3
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: 0.8.0
pymysql: None
psycopg2: None
jinja2: 2.9.4
boto: None
pandas_datareader: None