-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: resample closed='left' not binning correctly. #4197
Comments
Passing It does seem like having a parameter to only include full ranges when upsampling. Anyone se a reason not to? An easy, but very wasteful way to check is by resampling by count and checking if it's equal to 5. In [45]: cts = ts.resample('5min', label='right', how='count')
In [46]: cts
Out[46]:
2000-01-01 00:05:00 5
2000-01-01 00:10:00 5
2000-01-01 00:15:00 2
dtype: int64
In [48]: full = cts[cts == 5].index
In [49]: full
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:05:00, 2000-01-01 00:10:00]
Length: 2, Freq: None, Timezone: None
In [53]: ts.resample('5min', label='right', how='sum').ix[full]
Out[53]:
2000-01-01 00:05:00 10
2000-01-01 00:10:00 35
dtype: int64 |
Hey @TomAugspurger, thanks for the feedback. I've been digging a little bit more into this and it seems that the logic applied with resample of a Again, we can create a time series, however, now with a starting time at In [5]: rng = pd.date_range('1/1/2000 00:09:00', periods=12, freq='T')
In [6]: ts = pd.Series(np.arange(12), index=rng)
In [7]: ts
Out[7]:
2000-01-01 00:09:00 0
2000-01-01 00:10:00 1
2000-01-01 00:11:00 2
2000-01-01 00:12:00 3
2000-01-01 00:13:00 4
2000-01-01 00:14:00 5
2000-01-01 00:15:00 6
2000-01-01 00:16:00 7
2000-01-01 00:17:00 8
2000-01-01 00:18:00 9
2000-01-01 00:19:00 10
2000-01-01 00:20:00 11
Freq: T, dtype: int64 Resampling in a frequency that, in reference to the initial timestamp, can equally divide the time period from the start of the hour (i.e., In [9]: ts.resample('3T')
Out[9]:
2000-01-01 00:09:00 1
2000-01-01 00:12:00 4
2000-01-01 00:15:00 7
2000-01-01 00:18:00 10
Freq: 3T, dtype: int64 And note that, this is equal to the resample with In [10]: ts.resample('3T', closed='left')
Out[10]:
2000-01-01 00:09:00 1
2000-01-01 00:12:00 4
2000-01-01 00:15:00 7
2000-01-01 00:18:00 10
Freq: 3T, dtype: int64 However, when one applies a frequency that doesn't equally divide the time range, the anchored reference to the start of the hour resolves to an initial and ending timestamp that did not exist in the initial DataFrame: In [12]: ts.resample('4T')
Out[12]:
2000-01-01 00:08:00 1.0
2000-01-01 00:12:00 4.5
2000-01-01 00:16:00 8.5
2000-01-01 00:20:00 11.0
Freq: 4T, dtype: float64 Even passing a In [13]: ts.resample('4T', closed='left')
Out[13]:
2000-01-01 00:08:00 1.0
2000-01-01 00:12:00 4.5
2000-01-01 00:16:00 8.5
2000-01-01 00:20:00 11.0
Freq: 4T, dtype: float64 Your suggestion of passing a In [15]: ts.resample('4T', label='right')
Out[15]:
2000-01-01 00:12:00 1.0
2000-01-01 00:16:00 4.5
2000-01-01 00:20:00 8.5
2000-01-01 00:24:00 11.0
Freq: 4T, dtype: float64 My personal opinion is that, when downsampling, timestamps should NOT be inferred that lie outside of the initial time range resampled (thus ensuring that we aren't creating data that does not exist), regardless of what slice is passed to Again, thank you all for all of your great work! Really appreciated! :) |
If you're interested in looking at the code, most of the logic is in You're bringing up a good point about creating timestamps new maybe being a bad idea when downsampling. This may be up to the user though. They can control where the timestamps start by using In [140]: ts.resample('4T', base=1)
Out[140]:
2000-01-01 00:09:00 1.5
2000-01-01 00:13:00 5.5
2000-01-01 00:17:00 9.5
Freq: 4T, dtype: float64 Does that look right to you? We could make this more convenient by allowing |
I agree that I think |
Thanks @mroeschke! Look forward to checking out enhancements to |
related: http://stackoverflow.com/questions/21329425/resampling-a-pandas-dataframe-with-loffset-introduces-an-additional-offset-of-an
Hey pandas team. Sorry to have gone MIA the past week, super busy with work. I promise (and look forward to) contributing more soon. :)
Still, I wanted to note that I came across what I believe to a be a bug in
resample()
when trying to change the interval of the binning withclosed='left'
. I know that there have been a few changes to theresample()
API since Wes' book, however, I don't believe they changed this functionality, but I have been wrong before :)Bug can be reproduced using the example from Wes' book, generating 12 mins of data like:
we can do a simple resample to 5 mins like:
For my use, I need this resampling to be 'backwards looking' so that the summations at each resampled timestamp include the previous 4 minutes. Documentation (and Wes' book) suggest this is achieved by binning with
closed='left'
, however, this results in the same output as above:I was looking for the following result (note that the first timestamp is at
00:05:00
and with hanging data dropped):I am able to generate this by combining
loffset='5min'
and then slicing into the resultant Series to remove the:but this is hardly ideal as it's not known in advance if time series ends with a timestamp that resolves equally to the final timestamp of the resampling procedure!
Apologies if I am missing something—any thoughts, help or guidance is welcomed!
Thanks so much.
The text was updated successfully, but these errors were encountered: