BUG: RecursionError using agg on a resampled SeriesGroupBy #42905

Closed
2 tasks done
manoelpqueiroz opened this issue Aug 5, 2021 · 6 comments · Fixed by #43410
Labels
Apply, Bug, Groupby, Regression, Resample
@manoelpqueiroz

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.


When you mix resample with groupby and try to use the agg method to supply multiple functions to either a DataFrameGroupBy or SeriesGroupBy, Python suddenly exits without even raising an error.

I first thought I was running into this because I was supplying a single column expecting a DataFrame with multiple columns, but I can confirm this happens to me whether I provide a column (variable b) or apply the method to the entire GroupBy (variable c):

Code Sample

import pandas as pd

a = pd.DataFrame({
    'class': {
        0: 'beta', 1: 'alpha', 2: 'alpha', 3: 'gaga', 4: 'beta', 5: 'gaga',
        6: 'beta', 7: 'gaga', 8: 'beta', 9: 'gaga', 10: 'alpha', 11: 'beta',
        12: 'alpha', 13: 'gaga', 14: 'alpha'},
    'value': {
        0: 69, 1: 33, 2: 40, 3: 2, 4: 36, 5: 40, 6: 48, 7: 84, 8: 77, 9: 22,
        10: 55, 11: 82, 12: 37, 13: 88, 14: 41},
    'date': {
        0: pd.Timestamp('2021-02-28 00:00:00'),
        1: pd.Timestamp('2021-11-30 00:00:00'),
        2: pd.Timestamp('2021-02-28 00:00:00'),
        3: pd.Timestamp('2021-04-30 00:00:00'),
        4: pd.Timestamp('2021-02-28 00:00:00'),
        5: pd.Timestamp('2021-04-30 00:00:00'),
        6: pd.Timestamp('2021-07-31 00:00:00'),
        7: pd.Timestamp('2021-01-31 00:00:00'),
        8: pd.Timestamp('2021-01-31 00:00:00'),
        9: pd.Timestamp('2021-01-31 00:00:00'),
        10: pd.Timestamp('2021-04-30 00:00:00'),
        11: pd.Timestamp('2021-10-31 00:00:00'),
        12: pd.Timestamp('2021-09-30 00:00:00'),
        13: pd.Timestamp('2021-04-30 00:00:00'),
        14: pd.Timestamp('2021-05-31 00:00:00')}})

# This will exit Python
b = a\
    .set_index('date')\
    .groupby('class')\
    .resample('M')['value']\
    .agg(['sum', 'size'])

# Omitting the column selection will ALSO make Python exit
c = a\
    .set_index('date')\
    .groupby('class')\
    .resample('M')\
    .agg(['sum', 'size'])

Problem description

I'm not sure whether this method is supported for DatetimeIndexResamplerGroupby objects, but the attribute exists; accessing it without calling it gives:

<bound method Resampler.aggregate of <pandas.core.resample.DatetimeIndexResamplerGroupby object at 0x00000163B22B0100>>

Also, while the problem arises with either a Series or a DataFrame, given that using agg with multiple functions on a SeriesGroupBy will correctly create a DataFrame, I would expect the same to happen when resampling with timestamps:

In [1]: a.groupby('class')['value'].agg(['sum', 'size'])
Out[1]:
       sum  size
class
alpha  206     5
beta   312     5
gaga   236     5

Expected Output

                  sum  size
class date
alpha 2021-02-28   40     1
      2021-03-31    0     0
      2021-04-30   55     1
      2021-05-31   41     1
      2021-06-30    0     0
      2021-07-31    0     0
      2021-08-31    0     0
      2021-09-30   37     1
      2021-10-31    0     0
      2021-11-30   33     1
beta  2021-01-31   77     1
      2021-02-28  105     2
      2021-03-31    0     0
      2021-04-30    0     0
      2021-05-31    0     0
      2021-06-30    0     0
      2021-07-31   48     1
      2021-08-31    0     0
      2021-09-30    0     0
      2021-10-31   82     1
gaga  2021-01-31  106     2
      2021-02-28    0     0
      2021-03-31    0     0
      2021-04-30  130     3

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.9.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : pt_BR.cp1252

pandas : 1.3.1
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.2.1
setuptools : 49.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : 3.5.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.24.1
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : None
odfpy : None
openpyxl : 3.0.6
pandas_gbq : None
pyarrow : None
pyxlsb : 1.0.8
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

manoelpqueiroz added the Bug and Needs Triage labels Aug 5, 2021
@sebasv
Contributor

sebasv commented Aug 5, 2021

I just ran your example and got

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-1-7682602f0189> in <module>
     27 
     28 # This will exit Python
---> 29 b = a\
     30     .set_index('date')\
     31     .groupby('class')\

~/anaconda3/lib/python3.8/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
    332     def aggregate(self, func, *args, **kwargs):
    333 
--> 334         result = ResamplerWindowApply(self, func, args=args, kwargs=kwargs).agg()
    335         if result is None:
    336             how = func

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg(self)
    162         elif is_list_like(arg):
    163             # we require a list, but not a 'str'
--> 164             return self.agg_list_like()
    165 
    166         if callable(arg):

~/anaconda3/lib/python3.8/site-packages/pandas/core/apply.py in agg_list_like(self)
    353                 colg = obj._gotitem(col, ndim=1, subset=selected_obj.iloc[:, index])
    354                 try:
--> 355                     new_res = colg.aggregate(arg)
    356                 except (TypeError, DataError):
    357                     pass

... last 3 frames repeated, from the frame below ...

~/anaconda3/lib/python3.8/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
    332     def aggregate(self, func, *args, **kwargs):
    333 
--> 334         result = ResamplerWindowApply(self, func, args=args, kwargs=kwargs).agg()
    335         if result is None:
    336             how = func

RecursionError: maximum recursion depth exceeded while calling a Python object

The problem doesn't happen when I do

c = a\
    .set_index('date')\
    .groupby('class')\
    .resample('M') \
    .sum()

or

c = a\
    .set_index('date')\
    .groupby('class')\
    .resample('M') \
    .agg('count')
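Since single aggregations work, one possible stopgap is to compute each aggregation on its own and join the results. The sketch below is only an illustration under some assumptions: it resamples each class separately with a plain Series resampler rather than going through the groupby resampler, it uses a shortened version of the data from the report, and the helper names month_end_alias and sum_and_size are made up for this example. The alias helper is there because newer pandas versions prefer "ME" over "M" for month-end frequency.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def month_end_alias():
    """Return "ME" where pandas accepts it, otherwise the older "M" alias."""
    try:
        to_offset("ME")
        return "ME"
    except ValueError:
        return "M"


def sum_and_size(df, freq):
    """Resample each class separately, one aggregation at a time."""
    parts = {}
    for name, group in df.set_index("date").groupby("class"):
        resampled = group["value"].resample(freq)
        # Two single aggregations, joined side by side as columns.
        parts[name] = pd.concat(
            {"sum": resampled.sum(), "size": resampled.size()}, axis=1
        )
    # Stack the per-class frames under a (class, date) MultiIndex.
    return pd.concat(parts, names=["class", "date"])


# Shortened, illustrative subset of the data in the report.
df = pd.DataFrame(
    {
        "class": ["beta", "alpha", "beta", "gaga", "gaga", "beta"],
        "value": [69, 40, 36, 2, 40, 48],
        "date": pd.to_datetime(
            [
                "2021-02-28",
                "2021-02-28",
                "2021-02-28",
                "2021-04-30",
                "2021-04-30",
                "2021-07-31",
            ]
        ),
    }
)

result = sum_and_size(df, month_end_alias())
print(result)
```

Each class keeps its own date range, as in the expected output of the report, and months without rows show a sum and size of 0.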

mroeschke added the Apply, Groupby, and Resample labels and removed the Needs Triage label Aug 21, 2021
simonjayhawkins changed the title from "BUG: Using agg on a resampled SeriesGroupBy exits Python without traceback" to "BUG: RecursionError using agg on a resampled SeriesGroupBy" Aug 25, 2021
@simonjayhawkins
Member

On pandas 1.2.5, the first example (b=...) gives the expected output. The second example (c=...) raises "RecursionError: maximum recursion depth exceeded in __instancecheck__", which differs from the error on master: "RecursionError: maximum recursion depth exceeded while calling a Python object".

I'll label as a regression for now pending further investigation.

simonjayhawkins added the Regression label Aug 25, 2021
simonjayhawkins added this to the 1.3.3 milestone Aug 25, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Aug 25, 2021
@simonjayhawkins
Member

First bad commit: [212323f] BUG: DataFrame.agg and apply with 'size' returns a scalar (#39935)

After this commit, the error was:

2021-08-25T11:17:29.5737203Z Traceback (most recent call last):
2021-08-25T11:17:29.5738655Z   File "../42905.py", line 64, in <module>
2021-08-25T11:17:29.5739824Z     result = grp.resample("M")["value"].agg(["sum", "size"])
2021-08-25T11:17:29.5741294Z   File "/home/runner/work/pandas/pandas/pandas/core/resample.py", line 344, in aggregate
2021-08-25T11:17:29.5742977Z     result = ResamplerWindowApply(self, func, args=args, kwargs=kwargs).agg()
2021-08-25T11:17:29.5744583Z   File "/home/runner/work/pandas/pandas/pandas/core/apply.py", line 196, in agg
2021-08-25T11:17:29.5745877Z     return self.agg_list_like(_axis=_axis)
2021-08-25T11:17:29.5747209Z   File "/home/runner/work/pandas/pandas/pandas/core/apply.py", line 358, in agg_list_like
2021-08-25T11:17:29.5748120Z     new_res = colg.aggregate(a)
2021-08-25T11:17:29.5749318Z   File "/home/runner/work/pandas/pandas/pandas/core/resample.py", line 344, in aggregate
2021-08-25T11:17:29.5750342Z     result = ResamplerWindowApply(self, func, args=args, kwargs=kwargs).agg()
2021-08-25T11:17:29.5751265Z   File "/home/runner/work/pandas/pandas/pandas/core/apply.py", line 188, in agg
2021-08-25T11:17:29.5752011Z     result = self.maybe_apply_str()
2021-08-25T11:17:29.5752754Z   File "/home/runner/work/pandas/pandas/pandas/core/apply.py", line 514, in maybe_apply_str
2021-08-25T11:17:29.5753474Z     value = obj.shape[self.axis]
2021-08-25T11:17:29.5754218Z   File "/home/runner/work/pandas/pandas/pandas/core/resample.py", line 166, in __getattr__
2021-08-25T11:17:29.5755240Z     return object.__getattribute__(self, attr)
2021-08-25T11:17:29.5756835Z AttributeError: 'DatetimeIndexResamplerGroupby' object has no attribute 'shape'

cc @rhshadrach

@rhshadrach
Member

Thanks @manoelpqueiroz for the report. My git bisect lands on the same commit that @simonjayhawkins reported, but I don't quite understand it yet. On master, BaseGroupBy._selected_obj is called with self._selection set to None and so returns a DataFrame, whereas on 1.2.x the selection is set and it returns a Series. It appears to me that the current implementation on master would work as expected if this method returned a Series.

I ran a bisect to see where this method started to return a DataFrame with:

a = pd.DataFrame(
    {
        'class': ['beta'],
        'value': [69],
        'date': [pd.Timestamp('2021-02-28 00:00:00')],
    }
)
ndim = a.set_index('date').groupby('class').resample('M')['value']._selected_obj.ndim
assert ndim == 1

and found that the first bad commit is a222322, cc @jbrockmendel
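To make the connection between a lost selection and the RecursionError more concrete, here is a toy model in plain Python. It is not pandas code: ToyResampler, select, keep_selection and recurses_forever are hypothetical names, and the model only assumes that a frame-like object aggregates column by column while a Series-like object is the base case. If the per-column child forgets which column it was given, it looks frame-like again and the loop never bottoms out.

```python
class ToyResampler:
    """Hypothetical stand-in for a resampler; NOT the pandas implementation."""

    def __init__(self, columns, selection=None):
        self.columns = list(columns)
        self.selection = selection  # None means "whole frame"

    @property
    def ndim(self):
        # 1 when a single column is selected (Series-like), else 2.
        return 1 if self.selection is not None else 2

    def select(self, col, keep_selection=True):
        # keep_selection=False mimics a child that forgets its column.
        return ToyResampler(self.columns, col if keep_selection else None)

    def aggregate(self, funcs, keep_selection=True):
        if self.ndim == 1:
            # Base case: one labelled result per function.
            return {f: f"{f}({self.selection})" for f in funcs}
        # Frame-like case: aggregate column by column.
        return {
            col: self.select(col, keep_selection).aggregate(funcs, keep_selection)
            for col in self.columns
        }


def recurses_forever(resampler, funcs):
    """Return True if aggregating with a lost selection hits RecursionError."""
    try:
        resampler.aggregate(funcs, keep_selection=False)
    except RecursionError:
        return True
    return False


toy = ToyResampler(["value"])
print(toy.aggregate(["sum", "size"]))
print(recurses_forever(toy, ["sum", "size"]))
```

With the selection kept, the per-column child is Series-like and the call returns. With the selection dropped, every child is frame-like again, which resembles the repeated aggregate / agg / agg_list_like frames in the traceback above.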

@jbrockmendel
Member

This is pretty nasty. Best guess is that in GroupByMixin._gotitem when we do groupby = self._groupby[key] and catch IndexError, we are catching cases that we shouldn't be.
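The general hazard behind that guess can be sketched in plain Python. The classes and functions below (ToyGroupBy, gotitem_broad, gotitem_narrow) are hypothetical and are not the pandas code or the eventual fix; they only assume that selecting a column on an already-selected object raises IndexError, and that a caller uses that exception as a signal to fall back to the unmodified object.

```python
class ToyGroupBy:
    """Hypothetical groupby-like object; NOT the pandas implementation."""

    def __init__(self, columns, selection=None):
        self.columns = list(columns)
        self.selection = selection

    def __getitem__(self, key):
        if self.selection is not None:
            # The one case the caller intends to tolerate.
            raise IndexError(f"Column(s) {self.selection} already selected")
        # Integer keys are positional, so a bad position raises IndexError too.
        name = self.columns[key] if isinstance(key, int) else key
        if name not in self.columns:
            raise KeyError(name)
        return ToyGroupBy(self.columns, selection=name)


def gotitem_broad(groupby, key):
    # Treats every IndexError as "already selected", including unrelated ones.
    try:
        return groupby[key]
    except IndexError:
        return groupby


def gotitem_narrow(groupby, key):
    # Checks the tolerated condition explicitly and lets other errors surface.
    if groupby.selection is not None:
        return groupby
    return groupby[key]


gb = ToyGroupBy(["value"])
print(gotitem_broad(gb, 3).selection)    # None: the bad key was silently ignored
print(gotitem_narrow(gb, "value").selection)
```

In the broad version, an out-of-range key comes back as an object with no selection at all and no error, which is the kind of silently dropped selection that could then feed a column-by-column aggregation loop.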

@rhshadrach
Member

Thanks @jbrockmendel - that was pretty much it. I think I have a good resolution here, PR going up shortly.

6 participants