Inconsistent return type when grouping dates by frequency with custom reduction function #11742

Closed
stephen-hoover opened this issue Dec 2, 2015 · 3 comments
@stephen-hoover (Contributor)

If I group a DataFrame by a column of dates, the return type varies depending on whether I just group or also specify a frequency in the Grouper.

Grouping without a frequency returns a DataFrame when the applied function returns a labeled Series, and a Series when it returns a scalar:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], 'value': [10, 13]})

In [3]: def sumfunc(x):
   ...:     return pd.Series([x['value'].sum()], ('sum',))
   ...: 

In [4]: df.groupby(pd.Grouper(key='date')).apply(sumfunc)
Out[4]: 
            sum
date           
10/10/2000   10
11/10/2000   13

In [5]: type(df.groupby(pd.Grouper(key='date')).apply(sumfunc))
Out[5]: pandas.core.frame.DataFrame

In [17]: df.groupby(pd.Grouper(key='date')).apply(lambda x: x.value.sum())
Out[17]: 
date
2000-10-10    10
2000-11-10    13
dtype: int64

In [18]: type(df.groupby(pd.Grouper(key='date')).apply(lambda x: x.value.sum()))
Out[18]: pandas.core.series.Series

If I apply a frequency in the Grouper, I instead get a Series with a MultiIndex when the function returns a labeled Series, and a TypeError when it returns a scalar.

In [6]: df['date'] = pd.to_datetime(df['date'])

In [7]: df.groupby(pd.Grouper(freq='M', key='date')).apply(sumfunc)
Out[7]: 
date           
2000-10-31  sum    10
2000-11-30  sum    13
dtype: int64

In [8]: type(df.groupby(pd.Grouper(freq='M', key='date')).apply(sumfunc))
Out[8]: pandas.core.series.Series

In [16]: df.groupby(pd.Grouper(freq='M', key='date')).apply(lambda x: x.value.sum())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-16-ad73d0ebc475> in <module>()
----> 1 df.groupby(pd.Grouper(freq='M', key='date')).apply(lambda x: x.value.sum())

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/core/groupby.py in apply(self, func, *args, **kwargs)
    713         # ignore SettingWithCopy here in case the user mutates
    714         with option_context('mode.chained_assignment',None):
--> 715             return self._python_apply_general(f)
    716 
    717     def _python_apply_general(self, f):

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/core/groupby.py in _python_apply_general(self, f)
    720 
    721         return self._wrap_applied_output(keys, values,
--> 722                                          not_indexed_same=mutated)
    723 
    724     def aggregate(self, func, *args, **kwargs):

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/core/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
   3253             # Handle cases like BinGrouper
   3254             return self._concat_objects(keys, values,
-> 3255                                         not_indexed_same=not_indexed_same)
   3256 
   3257     def _transform_general(self, func, *args, **kwargs):

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/core/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
   1271                 group_names = self.grouper.names
   1272                 result = concat(values, axis=self.axis, keys=group_keys,
-> 1273                                 levels=group_levels, names=group_names)
   1274             else:
   1275 

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/tools/merge.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    810                        keys=keys, levels=levels, names=names,
    811                        verify_integrity=verify_integrity,
--> 812                        copy=copy)
    813     return op.get_result()
    814 

/Users/shoover/.py35/lib/python3.5/site-packages/pandas/tools/merge.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    866         for obj in objs:
    867             if not isinstance(obj, NDFrame):
--> 868                 raise TypeError("cannot concatenate a non-NDFrame object")
    869 
    870             # consolidate

TypeError: cannot concatenate a non-NDFrame object

Since in this example assigning dates to months leaves the groups unchanged, I would have expected identical results whether or not I set freq='M'. I'm guessing the difference is that freq='M' causes an extra groupby to happen under the hood, yes? What I expected when I ran into this was for pd.Grouper(freq='M', key='date') to do a single groupby, combining rows whose dates fall in the same month.
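
A possible workaround (an editor's sketch, not from the original report) is to group on an explicitly derived month key rather than passing freq= to the Grouper. Assuming the only goal is one group per calendar month, this stays on the plain-groupby path and should keep the DataFrame return type:

import pandas as pd

df = pd.DataFrame({'date': ['10/10/2000', '11/10/2000'], 'value': [10, 13]})
df['date'] = pd.to_datetime(df['date'])

def sumfunc(x):
    return pd.Series([x['value'].sum()], ('sum',))

# Group on a derived month key instead of passing freq= to the Grouper.
# This is a single groupby over the month labels, so a Series-returning
# function still comes back as a DataFrame.
monthly = df.groupby(df['date'].dt.to_period('M')).apply(sumfunc)
print(type(monthly))  # <class 'pandas.core.frame.DataFrame'>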

Pandas version:

In [9]: pd.__version__
Out[9]: '0.17.1+22.g0c43fcc'
@jreback (Contributor) commented Dec 2, 2015

I guess. This is a quite tricky code path; you're welcome to take a stab at making them consistent.

Keep in mind that .apply may not always be able to do the same thing, since it has to infer return shapes and such.

You should avoid custom functions like this anyway, as they are non-performant.
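
For illustration (an editor's sketch, not part of the original comment), the scalar case in the report can be written with the built-in sum, which goes through the groupby aggregation machinery rather than .apply, so no return-shape inference is involved:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
                   'value': [10, 13]})

# Built-in reduction on the grouped column: the sum is computed per
# monthly bin without going through .apply.
df.groupby(pd.Grouper(freq='M', key='date'))['value'].sum()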

@jreback jreback added Groupby Dtype Conversions Unexpected or buggy dtype conversions API Design Resample resample method labels Dec 2, 2015
@jreback jreback added this to the Next Major Release milestone Dec 2, 2015
@jreback (Contributor) commented Dec 2, 2015

xref #9867

@stephen-hoover (Contributor, Author)

I might be able to take a look at this over Christmas, but I think I'll be too busy before then.

I wouldn't use a custom function for something like sum, but sometimes I have aggregations which aren't built-in.
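
As an illustration of that kind of case (again a sketch added in editing, not from the thread), a reduction with no built-in equivalent can at least be routed through .agg rather than .apply, assuming the aggregation path behaves consistently for this Grouper in the pandas version in use:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['10/10/2000', '11/10/2000']),
                   'value': [10, 13]})

# A reduction with no built-in equivalent (a hypothetical sum of squares),
# expressed through .agg so each group is collapsed to a scalar without
# using .apply.
df.groupby(pd.Grouper(freq='M', key='date'))['value'].agg(lambda v: (v ** 2).sum())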
