Rolling groupby should not maintain the by column in the resulting DataFrame #14013

Closed
chrisaycock opened this Issue Aug 16, 2016 · 6 comments

Comments

@chrisaycock
Contributor

chrisaycock commented Aug 16, 2016

I found another oddity while digging through #13966.

Begin with the initial DataFrame in that issue:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

Save the grouping:

In [215]: g = df.groupby('A')

Compute the rolling sum:

In [216]: r = g.rolling(4)

In [217]: r.sum()
Out[217]:
         A      B
A
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
  5    4.0   14.0
  6    4.0   18.0
  7    4.0   22.0
  8    4.0   26.0
  9    4.0   30.0
...    ...    ...
2 30   8.0  114.0
  31   8.0  118.0
3 32   NaN    NaN
  33   NaN    NaN
  34   NaN    NaN
  35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

It maintains the by column (A)! That column should not be in the resulting DataFrame.

It gets weirder if I compute the sum over the entire grouping and then redo the rolling calculation on the same r object. Now the by column is gone, as expected:

In [218]: g.sum()
Out[218]:
     B
A
1  190
2  306
3  284

In [219]: r.sum()
Out[219]:
          B
A
1 0     NaN
  1     NaN
  2     NaN
  3     6.0
  4    10.0
  5    14.0
  6    18.0
  7    22.0
  8    26.0
  9    30.0
...     ...
2 30  114.0
  31  118.0
3 32    NaN
  33    NaN
  34    NaN
  35  134.0
  36  138.0
  37  142.0
  38  146.0
  39  150.0

[40 rows x 1 columns]

So the grouping summation has some sort of side effect.
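One way to sidestep the state-dependent output is to select the value column explicitly before rolling. This is only a sketch of a workaround, not a fix for the underlying behavior, and whether the unselected form keeps the by column depends on the pandas version in use:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# Selecting 'B' up front means the result can only contain 'B',
# whatever selection state the GroupBy object happens to hold.
rolled = df.groupby('A')['B'].rolling(4).sum()

# The result is a Series indexed by (A, original row label).
print(rolled.head())
```

Because the selection is explicit, the output does not change if g.sum() is called beforehand.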

@chrisaycock

Contributor

chrisaycock commented Aug 16, 2016

A little note while digging through more code: _convert_grouper in groupby.py has:

    if isinstance(grouper, dict):
        ...
    elif isinstance(grouper, Series):
        ...
    elif isinstance(grouper, (list, Series, Index, np.ndarray)):
        ...
    else:
        ...

The grouper is compared to Series twice, so the Series entry in the third branch can never match. I will fix this when I clean everything up.
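To illustrate why the second check is redundant, here is a small stand-alone sketch. The function which_branch is a hypothetical stand-in that only mirrors the branch order quoted above; it is not the actual pandas code:

```python
import numpy as np
import pandas as pd


def which_branch(grouper):
    """Hypothetical helper: report which branch of the chain is taken."""
    if isinstance(grouper, dict):
        return "dict"
    elif isinstance(grouper, pd.Series):
        return "series"
    # A Series never reaches this point, so listing Series here is dead code.
    elif isinstance(grouper, (list, pd.Series, pd.Index, np.ndarray)):
        return "list-like"
    else:
        return "other"


print(which_branch(pd.Series([1, 2])))
print(which_branch([1, 2]))
```

Dropping Series from the tuple in the third branch would leave the behavior unchanged.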

@chrisaycock

Contributor

chrisaycock commented Aug 16, 2016

I can fix the issue if I set the group selection:

g._set_group_selection()

I think we need to call this function at the start of .rolling().

Seems similar to #12839

@jreback

Contributor

jreback commented Aug 17, 2016

This is defined behavior, in that it is identical to .apply on the groupby.

In [10]: df.groupby('A').rolling(4).sum()
Out[10]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

In [11]: df.groupby('A').rolling(4).apply(lambda x: x.sum())
Out[11]: 
         A      B
A                
1 0    NaN    NaN
  1    NaN    NaN
  2    NaN    NaN
  3    4.0    6.0
  4    4.0   10.0
...    ...    ...
3 35  12.0  134.0
  36  12.0  138.0
  37  12.0  142.0
  38  12.0  146.0
  39  12.0  150.0

[40 rows x 2 columns]

You can look back at the issues; IIRC @jorisvandenbossche and I had a long conversation about this.

@chrisaycock

Contributor

chrisaycock commented Aug 17, 2016

Hmm:

In [617]: df.groupby('A').sum()
Out[617]:
     B
A
1  190
2  306
3  284

In [618]: df.groupby('A').apply(lambda x: x.sum())
Out[618]:
    A    B
A
1  20  190
2  24  306
3  24  284

In addition to .rolling() and .apply(), .ohlc() and .expanding() keep the by column following a .groupby().
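To check which operations keep the by column on a given installation, a short script along these lines may help. It only reports the resulting columns and makes no claim about what any particular pandas version returns; the dictionary name columns_by_op is just illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

g = df.groupby('A')

# Run a few groupby operations and record the columns each one returns.
results = {
    'sum': g.sum(),
    'rolling(4).sum': g.rolling(4).sum(),
    'expanding().sum': g.expanding().sum(),
}
columns_by_op = {name: list(res.columns) for name, res in results.items()}

for name, cols in columns_by_op.items():
    print(f"{name}: {cols}")
```

If 'A' shows up in the rolling or expanding output but not in the plain sum, the installation exhibits the inconsistency described here.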

@jreback

Contributor

jreback commented Sep 1, 2016

On reread, this should be consistent, so I'm marking it as a bug.
It probably should not include the grouping column/level, even though apply does.

@ohadle

ohadle commented Feb 15, 2017

A similar thing happens when grouping by an index level: the level is duplicated in the result.

import numpy as np
import pandas
from pandas import Timestamp

c = pandas.DataFrame({u'ul_payload': {('a', Timestamp('2016-11-01 06:15:00')): 5, ('a', Timestamp('2016-11-01 07:45:00')): 8, ('a', Timestamp('2016-11-01 09:00:00')): 9, ('a', Timestamp('2016-11-01 07:15:00')): 6, ('a', Timestamp('2016-11-01 07:30:00')): 7, ('a', Timestamp('2016-11-01 06:00:00')): 4}, u'dl_payload': {('a', Timestamp('2016-11-01 06:15:00')): 15, ('a', Timestamp('2016-11-01 07:45:00')): 18, ('a', Timestamp('2016-11-01 09:00:00')): 19, ('a', Timestamp('2016-11-01 07:15:00')): 16, ('a', Timestamp('2016-11-01 07:30:00')): 17, ('a', Timestamp('2016-11-01 06:00:00')): 14}})

In [27]: c
Out[27]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00          14           4
  2016-11-01 06:15:00          15           5
  2016-11-01 07:15:00          16           6
  2016-11-01 07:30:00          17           7
  2016-11-01 07:45:00          18           8
  2016-11-01 09:00:00          19           9

In [29]: c.groupby(level=0).rolling(window=3).agg(np.sum)
Out[29]:
                         dl_payload  ul_payload
a a 2016-11-01 06:00:00         NaN         NaN
    2016-11-01 06:15:00         NaN         NaN
    2016-11-01 07:15:00        45.0        15.0
    2016-11-01 07:30:00        48.0        18.0
    2016-11-01 07:45:00        51.0        21.0
    2016-11-01 09:00:00        54.0        24.0

But not with group_keys=False:

In [48]: c.groupby(level=0, group_keys=False).rolling(window=3).agg(np.sum)
Out[48]:
                       dl_payload  ul_payload
a 2016-11-01 06:00:00         NaN         NaN
  2016-11-01 06:15:00         NaN         NaN
  2016-11-01 07:15:00        45.0        15.0
  2016-11-01 07:30:00        48.0        18.0
  2016-11-01 07:45:00        51.0        21.0
  2016-11-01 09:00:00        54.0        24.0
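When the group key is already an index level, another option is to drop any extra leading levels after the fact. The following is a minimal sketch on a shortened, made-up version of the frame above; it compares the number of index levels before and after, so it should be a no-op on versions that do not duplicate the level:

```python
import pandas as pd

# Small stand-in for the frame above: one group key plus timestamps.
idx = pd.MultiIndex.from_product(
    [["a"], pd.to_datetime(["2016-11-01 06:00", "2016-11-01 06:15",
                            "2016-11-01 07:15"])]
)
c = pd.DataFrame({"dl_payload": [14, 15, 16],
                  "ul_payload": [4, 5, 6]}, index=idx)

rolled = c.groupby(level=0).rolling(window=2).sum()

# Drop however many leading levels the groupby prepended, if any.
extra = rolled.index.nlevels - c.index.nlevels
if extra > 0:
    rolled = rolled.droplevel(list(range(extra)))

print(rolled)
```

This keeps the original (key, timestamp) index shape without relying on how group_keys is handled.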

@jreback jreback modified the milestones: Next Minor Release, 0.20.0 Mar 29, 2017

@jreback jreback added the Prio-high label Mar 29, 2017

@jreback jreback modified the milestones: Interesting Issues, Next Major Release Nov 26, 2017

@jreback jreback added this to Bug in Interesting Things Nov 26, 2017

@WillAyd WillAyd referenced this issue May 9, 2018

Merged

Consistent Return Structure for Rolling Apply #20984

4 of 4 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 9, 2018