BUG: resample closed='left' not binning correctly. #4197

nehalecky · 2013-07-10T23:22:24Z

related: http://stackoverflow.com/questions/21329425/resampling-a-pandas-dataframe-with-loffset-introduces-an-additional-offset-of-an

Hey pandas team. Sorry to have gone MIA the past week, super busy with work. I promise (and look forward to) contributing more soon. :)

Still, I wanted to note that I came across what I believe to a be a bug in resample() when trying to change the interval of the binning with closed='left'. I know that there have been a few changes to the resample() API since Wes' book, however, I don't believe they changed this functionality, but I have been wrong before :)

Bug can be reproduced using the example from Wes' book, generating 12 mins of data like:

In [3]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [4]: ts = pd.Series(np.arange(12), index=rng)
In [5]: ts
Out[5]: 
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

we can do a simple resample to 5 mins like:

In [6]: ts.resample('5min', how='sum')
Out[6]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

For my use, I need this resampling to be 'backwards looking' so that the summations at each resampled timestamp include the previous 4 minutes. Documentation (and Wes' book) suggest this is achieved by binning with closed='left', however, this results in the same output as above:

In [7]: ts.resample('5min', how='sum', closed='left')
Out[8]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

I was looking for the following result (note that the first timestamp is at 00:05:00 and with hanging data dropped):

2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
Freq: 5T, dtype: int64

I am able to generate this by combining loffset='5min' and then slicing into the resultant Series to remove the:

In [10]: ts.resample('5min', how='sum', closed='left', loffset='5min')[:-1]
Out[10]: 
2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
Freq: 5T, dtype: int64

but this is hardly ideal as it's not known in advance if time series ends with a timestamp that resolves equally to the final timestamp of the resampling procedure!

Apologies if I am missing something—any thoughts, help or guidance is welcomed!
Thanks so much.

The text was updated successfully, but these errors were encountered:

TomAugspurger · 2013-07-11T02:29:52Z

Passing label=right will take care of gettings the timestamps: sums aligned correctly. That might be a bit cleaner than passing the loffset.

It does seem like having a parameter to only include full ranges when upsampling. Anyone se a reason not to?

An easy, but very wasteful way to check is by resampling by count and checking if it's equal to 5.

In [45]: cts = ts.resample('5min', label='right', how='count')

In [46]: cts
Out[46]: 
2000-01-01 00:05:00    5
2000-01-01 00:10:00    5
2000-01-01 00:15:00    2
dtype: int64

In [48]: full = cts[cts == 5].index

In [49]: full
Out[49]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:05:00, 2000-01-01 00:10:00]
Length: 2, Freq: None, Timezone: None

In [53]: ts.resample('5min', label='right', how='sum').ix[full]
Out[53]: 
2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
dtype: int64

nehalecky · 2013-07-11T03:34:12Z

Hey @TomAugspurger, thanks for the feedback.

I've been digging a little bit more into this and it seems that the logic applied with resample of a DatetimeIndex is based in how the frequency of resample is anchored to the time range. I'll try and explain:

Again, we can create a time series, however, now with a starting time at 00:09:00:

In [5]: rng = pd.date_range('1/1/2000 00:09:00', periods=12, freq='T')
In [6]: ts = pd.Series(np.arange(12), index=rng)
In [7]: ts
Out[7]: 
2000-01-01 00:09:00     0
2000-01-01 00:10:00     1
2000-01-01 00:11:00     2
2000-01-01 00:12:00     3
2000-01-01 00:13:00     4
2000-01-01 00:14:00     5
2000-01-01 00:15:00     6
2000-01-01 00:16:00     7
2000-01-01 00:17:00     8
2000-01-01 00:18:00     9
2000-01-01 00:19:00    10
2000-01-01 00:20:00    11
Freq: T, dtype: int64

Resampling in a frequency that, in reference to the initial timestamp, can equally divide the time period from the start of the hour (i.e., 00:00:00), works well:

In [9]: ts.resample('3T')
Out[9]: 
2000-01-01 00:09:00     1
2000-01-01 00:12:00     4
2000-01-01 00:15:00     7
2000-01-01 00:18:00    10
Freq: 3T, dtype: int64

And note that, this is equal to the resample with count=left:

In [10]: ts.resample('3T', closed='left')
Out[10]: 
2000-01-01 00:09:00     1
2000-01-01 00:12:00     4
2000-01-01 00:15:00     7
2000-01-01 00:18:00    10
Freq: 3T, dtype: int64

However, when one applies a frequency that doesn't equally divide the time range, the anchored reference to the start of the hour resolves to an initial and ending timestamp that did not exist in the initial DataFrame:

In [12]: ts.resample('4T')
Out[12]: 
2000-01-01 00:08:00     1.0
2000-01-01 00:12:00     4.5
2000-01-01 00:16:00     8.5
2000-01-01 00:20:00    11.0
Freq: 4T, dtype: float64

Even passing a closed='left' does not resolve this:

In [13]: ts.resample('4T', closed='left')
Out[13]: 
2000-01-01 00:08:00     1.0
2000-01-01 00:12:00     4.5
2000-01-01 00:16:00     8.5
2000-01-01 00:20:00    11.0
Freq: 4T, dtype: float64

Your suggestion of passing a label='right' results in shift that includes an additional timestamp that did not exist in the initial DataFrame (demonstrating that the binning logic is over-reaching with certain parameter combinations?):

In [15]: ts.resample('4T', label='right')
Out[15]: 
2000-01-01 00:12:00     1.0
2000-01-01 00:16:00     4.5
2000-01-01 00:20:00     8.5
2000-01-01 00:24:00    11.0
Freq: 4T, dtype: float64

My personal opinion is that, when downsampling, timestamps should NOT be inferred that lie outside of the initial time range resampled (thus ensuring that we aren't creating data that does not exist), regardless of what slice is passed to resample(). Perhaps I am completely missing how this is managed in pandas. If so, I would love to know what I do not know.

Again, thank you all for all of your great work! Really appreciated! :)

TomAugspurger · 2013-07-11T04:12:59Z

If you're interested in looking at the code, most of the logic is in /pandas/tseries/resample.py. I've got to look a bit closer before I can say anything with confidence, but here are a couple thoughts.

You're bringing up a good point about creating timestamps new maybe being a bad idea when downsampling. This may be up to the user though. They can control where the timestamps start by using base:

In [140]: ts.resample('4T', base=1)
Out[140]: 
2000-01-01 00:09:00    1.5
2000-01-01 00:13:00    5.5
2000-01-01 00:17:00    9.5
Freq: 4T, dtype: float64

Does that look right to you?

We could make this more convenient by allowing base to optionally take "start" or "end". If it's "start" is passed then we can infer that the correct base should be ts.index[0].minute % 4 = 1.

mroeschke · 2020-03-31T03:46:40Z

I agree that I think base is the proper way to handle this. Changing the resample behavior after all this time might be to much of a change. Additionally #31809 will provide an easier way to specify where the origin timestamp should start at. Closing.

nehalecky · 2020-03-31T06:49:24Z

Thanks @mroeschke! Look forward to checking out enhancements to resample provided by #31809. Keep it up, pandas team!

jreback mentioned this issue Jan 14, 2014

Inconsistent behaviour in resample between daily and weekly #5937

Closed

jreback mentioned this issue Jan 25, 2014

BUG: df.groupby(TimeGrouper(freq='48min', closed='right')) #6085

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Apr 4, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015

mroeschke closed this as completed Mar 31, 2020

hasB4K mentioned this issue Apr 11, 2020

ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: resample closed='left' not binning correctly. #4197

BUG: resample closed='left' not binning correctly. #4197

nehalecky commented Jul 10, 2013

TomAugspurger commented Jul 11, 2013

nehalecky commented Jul 11, 2013

TomAugspurger commented Jul 11, 2013

mroeschke commented Mar 31, 2020

nehalecky commented Mar 31, 2020

BUG: resample closed='left' not binning correctly. #4197

BUG: resample closed='left' not binning correctly. #4197

Comments

nehalecky commented Jul 10, 2013

TomAugspurger commented Jul 11, 2013

nehalecky commented Jul 11, 2013

TomAugspurger commented Jul 11, 2013

mroeschke commented Mar 31, 2020

nehalecky commented Mar 31, 2020