Time-based .rolling() fails with .groupby() #13966

Closed
chrisaycock opened this Issue Aug 11, 2016 · 8 comments

Contributor

chrisaycock commented Aug 11, 2016 edited

Starting with this example:

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

I can easily compute the rolling mean by identifier:

In [20]: df.groupby('A').rolling(4).B.mean()
Out[20]:
A
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
   5      3.5
   6      4.5
   7      5.5
   8      6.5
   9      7.5
         ...
2  30    28.5
   31    29.5
3  32     NaN
   33     NaN
   34     NaN
   35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64
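A quick sketch (on a recent pandas, where the groupby-rolling result carries a `(group, original index)` MultiIndex) confirming that the window really is applied per group:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

result = df.groupby('A').rolling(4)['B'].mean()

# each value is the mean of the last 4 rows *within the same group*:
# at row 9 (group 1) that is mean(6, 7, 8, 9) == 7.5
assert result.loc[(1, 9)] == 7.5
# the window restarts per group, so each group's first 3 rows are NaN
assert np.isnan(result.loc[(3, 34)])
```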

Now I want to add a timestamp column:

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

The timestamps are ordered within each identifier, but pandas complains:

In [25]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: B must be monotonic
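The error is about global ordering: a time-based window requires the `on` column to be monotonic over the whole frame, even though here it is monotonic within each group. A quick check, using `Series.is_monotonic_increasing`:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# B restarts at 09:30:00 where group 2 begins, so the global check fails...
assert not df['B'].is_monotonic_increasing
# ...even though B is monotonic within every group
assert df.groupby('A')['B'].apply(lambda s: s.is_monotonic_increasing).all()
```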

Re-sorting leads to a different error:

In [26]: df.sort_values('B', inplace=True)

In [27]: df.groupby('A').rolling('4s', on='B').C.mean()
...
ValueError: invalid on specified as B, must be a column (if DataFrame) or None

But we know that these column names are valid:

In [28]: df.rolling('4s', on='B').C.mean()
Out[28]:
0      0.000000
20    10.000000
1      7.000000
21    10.500000
2      8.800000
22    11.000000
3      9.857143
23    11.500000
4     10.857143
24    12.500000
        ...
35    24.714286
15    23.500000
36    25.714286
16    24.500000
37    26.714286
17    25.500000
38    27.714286
18    26.500000
19    25.857143
39    27.500000
Name: C, dtype: float64
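Note that this whole-frame rolling is not a substitute for the grouped version: its window ignores `A` entirely and mixes rows from different identifiers. A sketch (using a stable sort so tied timestamps keep their original order):

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)}).sort_values('B', kind='mergesort')

result = df.rolling('4s', on='B')['C'].mean()

# at 09:30:01 the window spans identifiers 1 and 2:
# C values 0, 20, 1, 21 -> mean 10.5
assert result.loc[21] == 10.5
```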

It seems like a bug that time-based .rolling() does not work with .groupby().

Agree that this should work!

Is there a good workaround?
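One workaround that should hold up: push the rolling inside a `groupby(...).apply`, so the monotonicity check only ever sees one group's timestamps at a time. A sketch:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# roll within each group, where B *is* monotonic
result = df.groupby('A').apply(lambda g: g.rolling('4s', on='B')['C'].mean())

# group 1, row 4: the 4s window covers 09:30:01-09:30:04 -> mean(1, 2, 3, 4) == 2.5
assert result.loc[(1, 4)] == 2.5
# group 3, last row: mean(36, 37, 38, 39) == 37.5
assert result.loc[(3, 39)] == 37.5
```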

@semio semio added a commit to semio/ddf_utils that referenced this issue Dec 13, 2016

@semio semio new procedure: windows
related issue: #25

Note: there is currently a bug using groupby with rolling on a specific column, so
we are not using the `on` parameter in rolling.
pandas-dev/pandas#13966
3b14540

Is there any way to do time-aware rolling with groupby for now, before the new pandas release?

Contributor

jreback commented Dec 20, 2016

@ShashankBharadwaj this is not fixed, so new release or not is not going to matter.

jreback added this to the Next Major Release milestone Dec 20, 2016

zscholl commented Apr 20, 2017 edited

I stumbled on this yesterday as I was trying to solve the same problem.

I found a workaround; it's definitely not efficient, but it works.

import pandas as pd
import numpy as np

%load_ext watermark
%watermark -v -m -p pandas,numpy
CPython 3.5.1
IPython 4.2.0

pandas 0.19.2
numpy 1.11.0

compiler   : MSC v.1900 64 bit (AMD64)
system     : Windows
release    : 7
machine    : AMD64
processor  : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
CPU cores  : 8
interpreter: 64bit

# load up the example dataframe
dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

# sort on the datetime column
df.sort_values('B', inplace=True)

#group it
grouped = df.groupby('A')

# temporary list to hold the per-group frames
frames = []

for group in grouped.groups:
    # copy so the assignment below doesn't trigger SettingWithCopyWarning
    frame = grouped.get_group(group).copy()
    frame['avg'] = frame.rolling('4s', on='B').C.mean()
    frames.append(frame)

pd.concat(frames)
Out[18]: 
    A                   B   C   avg
0   1 2016-01-01 09:30:00   0   0.0
1   1 2016-01-01 09:30:01   1   0.5
2   1 2016-01-01 09:30:02   2   1.0
3   1 2016-01-01 09:30:03   3   1.5
4   1 2016-01-01 09:30:04   4   2.5
5   1 2016-01-01 09:30:05   5   3.5
6   1 2016-01-01 09:30:06   6   4.5
7   1 2016-01-01 09:30:07   7   5.5
8   1 2016-01-01 09:30:08   8   6.5
9   1 2016-01-01 09:30:09   9   7.5
10  1 2016-01-01 09:30:10  10   8.5
11  1 2016-01-01 09:30:11  11   9.5
12  1 2016-01-01 09:30:12  12  10.5
13  1 2016-01-01 09:30:13  13  11.5
14  1 2016-01-01 09:30:14  14  12.5
15  1 2016-01-01 09:30:15  15  13.5
16  1 2016-01-01 09:30:16  16  14.5
17  1 2016-01-01 09:30:17  17  15.5
18  1 2016-01-01 09:30:18  18  16.5
19  1 2016-01-01 09:30:19  19  17.5
20  2 2016-01-01 09:30:00  20  20.0
21  2 2016-01-01 09:30:01  21  20.5
22  2 2016-01-01 09:30:02  22  21.0
23  2 2016-01-01 09:30:03  23  21.5
24  2 2016-01-01 09:30:04  24  22.5
25  2 2016-01-01 09:30:05  25  23.5
26  2 2016-01-01 09:30:06  26  24.5
27  2 2016-01-01 09:30:07  27  25.5
28  2 2016-01-01 09:30:08  28  26.5
29  2 2016-01-01 09:30:09  29  27.5
30  2 2016-01-01 09:30:10  30  28.5
31  2 2016-01-01 09:30:11  31  29.5
32  3 2016-01-01 09:30:12  32  32.0
33  3 2016-01-01 09:30:13  33  32.5
34  3 2016-01-01 09:30:14  34  33.0
35  3 2016-01-01 09:30:15  35  33.5
36  3 2016-01-01 09:30:16  36  34.5
37  3 2016-01-01 09:30:17  37  35.5
38  3 2016-01-01 09:30:18  38  36.5
39  3 2016-01-01 09:30:19  39  37.5

Hope this helps anyone in the meantime before a bug fix is provided. I haven't contributed to pandas yet, but having used it so much, maybe it's about time :)

Contributor

chrisaycock commented Apr 20, 2017

@zscholl Thanks for your workaround. A bug fix would be great if you're offering.

zscholl commented Apr 20, 2017

I'll take a look at it in the coming weeks, @chrisaycock and see what I can do!

Contributor

jreback commented Apr 20, 2017

So this actually works on master now; fixed by #15694 (which makes sure that sorting a MultiIndex actually sorts it; previously only lexsortedness was guaranteed, not monotonic levels).

So this just needs tests.

In [7]: pd.options.display.max_rows=12

In [8]: pd.__version__
Out[8]: '0.19.0+829.gb17e286'

In [9]: dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
   ...: df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
   ...:                    'B': np.concatenate((dates, dates)),
   ...:                    'C': np.arange(40)})
   ...: 

In [10]: df.groupby('A').rolling('4s', on='B').C.mean()
Out[10]: 
A  B                  
1  2016-01-01 09:30:00     0.0
   2016-01-01 09:30:01     0.5
   2016-01-01 09:30:02     1.0
   2016-01-01 09:30:03     1.5
   2016-01-01 09:30:04     2.5
   2016-01-01 09:30:05     3.5
                          ... 
3  2016-01-01 09:30:14    33.0
   2016-01-01 09:30:15    33.5
   2016-01-01 09:30:16    34.5
   2016-01-01 09:30:17    35.5
   2016-01-01 09:30:18    36.5
   2016-01-01 09:30:19    37.5
Name: C, Length: 40, dtype: float64
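On a build with the fix (pandas 0.20 or later), the result can be sanity-checked against the values above; a sketch of the kind of test this needs:

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start='2016-01-01 09:30:00', periods=20, freq='s')
df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.concatenate((dates, dates)),
                   'C': np.arange(40)})

result = df.groupby('A').rolling('4s', on='B')['C'].mean()

# spot-check against the session output above
assert result.iloc[0] == 0.0     # group 1 @ 09:30:00
assert result.iloc[4] == 2.5     # group 1 @ 09:30:04
assert result.iloc[-1] == 37.5   # group 3 @ 09:30:19

# cross-check against a manual per-group computation
manual = pd.concat(g.rolling('4s', on='B')['C'].mean() for _, g in df.groupby('A'))
assert np.allclose(result.to_numpy(), manual.to_numpy())
```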

@jreback jreback added a commit to jreback/pandas that referenced this issue Apr 22, 2017

@jreback jreback BUG: groupby-rolling with a timedelta
closes #13966
xref to #15130, closed by #15175
e28d07e

@jreback jreback modified the milestone: 0.20.0, Next Major Release Apr 22, 2017

jreback closed this in #16091 Apr 22, 2017

@jreback jreback added a commit that referenced this issue Apr 22, 2017

@jreback jreback BUG: groupby-rolling with a timedelta (#16091)
closes #13966
xref to #15130, closed by #15175
f0bd908
