Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: resample closed='left' not binning correctly. #4197

Closed
nehalecky opened this issue Jul 10, 2013 · 5 comments · Fixed by #31809
Closed

BUG: resample closed='left' not binning correctly. #4197

nehalecky opened this issue Jul 10, 2013 · 5 comments · Fixed by #31809
Labels
Datetime Datetime data dtype Indexing Related to indexing on series/frames, not to indexes themselves Resample resample method

Comments

@nehalecky
Copy link
Contributor

related: http://stackoverflow.com/questions/21329425/resampling-a-pandas-dataframe-with-loffset-introduces-an-additional-offset-of-an

Hey pandas team. Sorry to have gone MIA the past week, super busy with work. I promise (and look forward to) contributing more soon. :)

Still, I wanted to note that I came across what I believe to a be a bug in resample() when trying to change the interval of the binning with closed='left'. I know that there have been a few changes to the resample() API since Wes' book, however, I don't believe they changed this functionality, but I have been wrong before :)

Bug can be reproduced using the example from Wes' book, generating 12 mins of data like:

In [3]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [4]: ts = pd.Series(np.arange(12), index=rng)
In [5]: ts
Out[5]: 
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64

we can do a simple resample to 5 mins like:

In [6]: ts.resample('5min', how='sum')
Out[6]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

For my use, I need this resampling to be 'backwards looking' so that the summations at each resampled timestamp include the previous 4 minutes. Documentation (and Wes' book) suggest this is achieved by binning with closed='left', however, this results in the same output as above:

In [7]: ts.resample('5min', how='sum', closed='left')
Out[8]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64

I was looking for the following result (note that the first timestamp is at 00:05:00 and with hanging data dropped):

2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
Freq: 5T, dtype: int64

I am able to generate this by combining loffset='5min' and then slicing into the resultant Series to remove the:

In [10]: ts.resample('5min', how='sum', closed='left', loffset='5min')[:-1]
Out[10]: 
2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
Freq: 5T, dtype: int64

but this is hardly ideal as it's not known in advance if time series ends with a timestamp that resolves equally to the final timestamp of the resampling procedure!

Apologies if I am missing something—any thoughts, help or guidance is welcomed!
Thanks so much.

@TomAugspurger
Copy link
Contributor

Passing label=right will take care of gettings the timestamps: sums aligned correctly. That might be a bit cleaner than passing the loffset.

It does seem like having a parameter to only include full ranges when upsampling. Anyone se a reason not to?

An easy, but very wasteful way to check is by resampling by count and checking if it's equal to 5.

In [45]: cts = ts.resample('5min', label='right', how='count')

In [46]: cts
Out[46]: 
2000-01-01 00:05:00    5
2000-01-01 00:10:00    5
2000-01-01 00:15:00    2
dtype: int64

In [48]: full = cts[cts == 5].index

In [49]: full
Out[49]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:05:00, 2000-01-01 00:10:00]
Length: 2, Freq: None, Timezone: None

In [53]: ts.resample('5min', label='right', how='sum').ix[full]
Out[53]: 
2000-01-01 00:05:00    10
2000-01-01 00:10:00    35
dtype: int64

@nehalecky
Copy link
Contributor Author

Hey @TomAugspurger, thanks for the feedback.

I've been digging a little bit more into this and it seems that the logic applied with resample of a DatetimeIndex is based in how the frequency of resample is anchored to the time range. I'll try and explain:

Again, we can create a time series, however, now with a starting time at 00:09:00:

In [5]: rng = pd.date_range('1/1/2000 00:09:00', periods=12, freq='T')
In [6]: ts = pd.Series(np.arange(12), index=rng)
In [7]: ts
Out[7]: 
2000-01-01 00:09:00     0
2000-01-01 00:10:00     1
2000-01-01 00:11:00     2
2000-01-01 00:12:00     3
2000-01-01 00:13:00     4
2000-01-01 00:14:00     5
2000-01-01 00:15:00     6
2000-01-01 00:16:00     7
2000-01-01 00:17:00     8
2000-01-01 00:18:00     9
2000-01-01 00:19:00    10
2000-01-01 00:20:00    11
Freq: T, dtype: int64

Resampling in a frequency that, in reference to the initial timestamp, can equally divide the time period from the start of the hour (i.e., 00:00:00), works well:

In [9]: ts.resample('3T')
Out[9]: 
2000-01-01 00:09:00     1
2000-01-01 00:12:00     4
2000-01-01 00:15:00     7
2000-01-01 00:18:00    10
Freq: 3T, dtype: int64

And note that, this is equal to the resample with count=left:

In [10]: ts.resample('3T', closed='left')
Out[10]: 
2000-01-01 00:09:00     1
2000-01-01 00:12:00     4
2000-01-01 00:15:00     7
2000-01-01 00:18:00    10
Freq: 3T, dtype: int64

However, when one applies a frequency that doesn't equally divide the time range, the anchored reference to the start of the hour resolves to an initial and ending timestamp that did not exist in the initial DataFrame:

In [12]: ts.resample('4T')
Out[12]: 
2000-01-01 00:08:00     1.0
2000-01-01 00:12:00     4.5
2000-01-01 00:16:00     8.5
2000-01-01 00:20:00    11.0
Freq: 4T, dtype: float64

Even passing a closed='left' does not resolve this:

In [13]: ts.resample('4T', closed='left')
Out[13]: 
2000-01-01 00:08:00     1.0
2000-01-01 00:12:00     4.5
2000-01-01 00:16:00     8.5
2000-01-01 00:20:00    11.0
Freq: 4T, dtype: float64

Your suggestion of passing a label='right' results in shift that includes an additional timestamp that did not exist in the initial DataFrame (demonstrating that the binning logic is over-reaching with certain parameter combinations?):

In [15]: ts.resample('4T', label='right')
Out[15]: 
2000-01-01 00:12:00     1.0
2000-01-01 00:16:00     4.5
2000-01-01 00:20:00     8.5
2000-01-01 00:24:00    11.0
Freq: 4T, dtype: float64

My personal opinion is that, when downsampling, timestamps should NOT be inferred that lie outside of the initial time range resampled (thus ensuring that we aren't creating data that does not exist), regardless of what slice is passed to resample(). Perhaps I am completely missing how this is managed in pandas. If so, I would love to know what I do not know.

Again, thank you all for all of your great work! Really appreciated! :)

@TomAugspurger
Copy link
Contributor

If you're interested in looking at the code, most of the logic is in /pandas/tseries/resample.py. I've got to look a bit closer before I can say anything with confidence, but here are a couple thoughts.

You're bringing up a good point about creating timestamps new maybe being a bad idea when downsampling. This may be up to the user though. They can control where the timestamps start by using base:

In [140]: ts.resample('4T', base=1)
Out[140]: 
2000-01-01 00:09:00    1.5
2000-01-01 00:13:00    5.5
2000-01-01 00:17:00    9.5
Freq: 4T, dtype: float64

Does that look right to you?

We could make this more convenient by allowing base to optionally take "start" or "end". If it's "start" is passed then we can infer that the correct base should be ts.index[0].minute % 4 = 1.

@mroeschke
Copy link
Member

I agree that I think base is the proper way to handle this. Changing the resample behavior after all this time might be to much of a change. Additionally #31809 will provide an easier way to specify where the origin timestamp should start at. Closing.

@nehalecky
Copy link
Contributor Author

Thanks @mroeschke! Look forward to checking out enhancements to resample provided by #31809. Keep it up, pandas team!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype Indexing Related to indexing on series/frames, not to indexes themselves Resample resample method
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants