Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' #31809

Merged
merged 26 commits into from
May 10, 2020

Conversation

hasB4K
Copy link
Member

@hasB4K hasB4K commented Feb 8, 2020

EDIT: this PR has changed, now instead of adding adjust_timestamp we are adding origin and offset arguments to resample and pd.Grouper (see #31809 (comment))


Hello,

This enhancement is an alternative to the base argument present in pd.Grouper or in the method resample. It adds the adjust_timestamp argument to change the current behavior of: https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728

  • adjust_timestamp is the timestamp on which to adjust the grouping. If None is passed, the first day of the time series at midnight is used.

Currently the bins of the grouping are adjusted based on the beginning of the day of the time series starting point. This works well with frequencies that are multiples of a day (like 30D) or that divides a day (like 90s or 1min). But it can create inconsistencies with some frequencies that do not meet this criteria.

Here is a simple snippet from a test that I added that proves that the current behavior can lead to some inconsistencies. Inconsistencies that can be fixed if we use adjust_timestamp:

import pandas as pd
import numpy as np
import pandas._testing as tm
import pytest


freq = "1399min"  # prime number that is smaller than 24h
start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
middle = "1/15/2000 00:00:00"

rng = pd.date_range(start, end, freq="1231min")  # prime number
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts2 = ts[middle:end]

# proves that grouper without a fixed adjust_timestamp does not work
# when dealing with unusual frequencies
simple_grouper = pd.Grouper(freq=freq)
count_ts = ts.groupby(simple_grouper).agg("count")
count_ts = count_ts[middle:end]
count_ts2 = ts2.groupby(simple_grouper).agg("count")
with pytest.raises(AssertionError):
    tm.assert_index_equal(count_ts.index, count_ts2.index)

# test adjusted_timestamp on 1970-01-01 00:00:00
adjust_timestamp = pd.Timestamp(0)
adjusted_grouper = pd.Grouper(freq=freq, adjust_timestamp=adjust_timestamp)
adjusted_count_ts = ts.groupby(adjusted_grouper).agg("count")
adjusted_count_ts = adjusted_count_ts[middle:end]
adjusted_count_ts2 = ts2.groupby(adjusted_grouper).agg("count")
tm.assert_series_equal(adjusted_count_ts, adjusted_count_ts2)

I think this PR is ready to be merged, but I am of course open to any suggestions or criticism. 😉
For instance, I am not sure if the naming of adjust_timestamp is correct. An alternative could be base_timestamp or ref_timestamp 🤔?

Cheers,


@pep8speaks
Copy link

pep8speaks commented Feb 8, 2020

Hello @hasB4K! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-09 21:51:23 UTC

@hasB4K hasB4K force-pushed the grouper-adjust-timestamp branch 6 times, most recently from 8c49bb6 to 0cc6149 Compare February 8, 2020 22:26
@mroeschke
Copy link
Member

Instead of adding a new keyword, might be nice if base could take a Timestamp instead since they are both relevant when a frequency is passed.

@hasB4K
Copy link
Member Author

hasB4K commented Feb 9, 2020

I would rather suggest the following:

I always thought that the base argument has kind of an ambiguous name. And the current behavior is quite confusing. It needs to be an integer (or a floating point) that matches the unit of the frequency:

  • So base=1 with a frequency of 5D is equal to 1D that we add as an offset.
  • So base=2 with a frequency of 5min is equal to 2min that we add as an offset.

This behavior is very confusing for the users (myself included), but it also creates bugs: see #25161, #25226

Instead of relying on base I would rather deprecate this argument. The argument loffset (currently broken for pd.Grouper as shown in #28302, but fixable in the current PR) is kind of equivalent to what base is doing (especially since it is a Timedelta).

Example of the current use of loffset with resample:

>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("1min")).count()
1999-12-31 23:59:00    1
2000-01-01 00:00:00    0
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
                      ..
2000-01-30 22:00:00    0
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    1
Freq: T, Length: 43086, dtype: int64

Example of the current broken loffset argument:

>>> ts.groupby(pd.Grouper(freq="1min", loffset=-pd.Timedelta("1min"))).count()
2000-01-01 00:00:00    1
2000-01-01 00:01:00    0
2000-01-01 00:02:00    0
2000-01-01 00:03:00    0
2000-01-01 00:04:00    0
                      ..
2000-01-30 22:01:00    0
2000-01-30 22:02:00    0
2000-01-30 22:03:00    0
2000-01-30 22:04:00    0
2000-01-30 22:05:00    1
Freq: T, Length: 43086, dtype: int64

That being said, I agree that the naming of adjust_timestamp is not ideal. I would rename it into: origin or base_timestamp.

The line https://github.com/pandas-dev/pandas/blob/master/pandas/core/resample.py#L1728 would be replaced by something roughly equivalent to:

origin = start_of_day if origin is None else start_of_day
origin = origin.value + loffset.value

TL;DR:

  1. I would fix in this PR loffset for pd.Grouper and deprecate the confusing base argument
  2. I would rename the added argument adjust_timestamp into origin
  3. We would have origin = origin.value + loffset.value
  4. I would add more tests to check the behavior of loffset and origin

What do you think?

@hasB4K
Copy link
Member Author

hasB4K commented Feb 9, 2020

I just realised that loffset and base are not equivalent at all since this works:

>>> start, end = "1/1/2000 00:00:00", "1/31/2000 00:00"
>>> rng = pd.date_range(start, end, freq="1231min")
>>> ts = pd.Series(np.arange(len(rng)), index=rng)
>>> ts.resample("1min", loffset=-pd.Timedelta("365D")).count()
1999-01-01 00:00:00    1
1999-01-01 00:01:00    0
1999-01-01 00:02:00    0
1999-01-01 00:03:00    0
1999-01-01 00:04:00    0
                      ..
1999-01-30 22:01:00    0
1999-01-30 22:02:00    0
1999-01-30 22:03:00    0
1999-01-30 22:04:00    0
1999-01-30 22:05:00    1
Freq: T, Length: 43086, dtype: int64

So I would suggest the following instead:

  1. Deprecate the confusing base argument
  2. Add an origin_offset argument that would be a Timestamp
  3. Rename the added argument adjust_timestamp into origin_timestamp
  4. We would have origin_nanos = origin_timestamp.value + origin_offset.value

I will not fix loffset in this PR since I am not sure of the behavior with pd.Grouper and how to fix it.

What do you think ?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think base and loffset actually are pretty useful. However for non-evenly divisible freq the issue is that you likely simply want to use the first (or maybe the last) timestamp as the base. So how about we just add that ability in base to accept the string first or last rather than adding another keyword?

@jreback jreback added API - Consistency Internal Consistency of API/Behavior Resample resample method labels Feb 9, 2020
@hasB4K
Copy link
Member Author

hasB4K commented Feb 9, 2020

@jreback this won't fix the issue that I'm trying to tackle. The idea is to be able to have a fixed timestamp as a "origin" that does not depend of the time series. So neither the base argument with first (which is the current behavior) or last string will fix the issue.

I could use the base argument and use it as the "origin" argument that I want to add if baseis not a number like suggested @mroeschke. But I think this could create some confusion in the API (I still believe that base is useful but can be quite confusing to use).

@jreback
Copy link
Contributor

jreback commented Feb 9, 2020

@jreback this won't fix the issue that I'm trying to tackle. The idea is to be able to have a fixed timestamp as a "origin" that does not depend of the time series. So neither the base argument with first (which is the current behavior) or last string will fix the issue.

I could use the base argument and use it as the "origin" argument that I want to add if baseis not a number like suggested @mroeschke. But I think this could create some confusion in the API (I still believe that base is useful but can be quite confusing to use).

@hasB4K not averse with changing things. But we currently have base, loffset, so I don' really like the idea of another another pretty opaque options.

I would be onboard with deprecating both of these and replacing with 2 options, e.g. origin and offset come to mind.

@hasB4K
Copy link
Member Author

hasB4K commented Feb 9, 2020

I would be onboard with deprecating both of these and replacing with 2 options, e.g. origin and offset come to mind.

So would this signature be ok with you @jreback?

pd.Grouper(
    # ... arguments untouched with this enhancement like `freq`, `sort`, ...

    # ADDED arguments
    origin: pd.Timestamp, default None
        Only when freq is passed.
        The timestamp used as reference for the grouping bins.
        If None is passed, the first day of the time series at midnight is used.
    offset: pd.Timedelta, default None
        Only when freq is passed.
        An offset timedelta added to the origin.

    # DEPRECATED arguments
    base: int, default 0 (DEPRECATED)
    loffset: str, DateOffset, timedelta object (DEPRECATED)
)

@jreback
Copy link
Contributor

jreback commented Feb 9, 2020

@hasB4K

sure that looks reasonable.

we would need to have a pretty nice deprecation message that shows one how to convert base and/or loffset to the new args (as well as a whatsnew and warning box in the docs); they can bascially be the same though. its how we want folks to migrate.

@hasB4K
Copy link
Member Author

hasB4K commented Feb 9, 2020

sure that looks reasonable.

Perfect, I will implement that in this PR then 🙂

we would need to have a pretty nice deprecation message that shows one how to convert base and/or loffset to the new args (as well as a whatsnew and warning box in the docs); they can bascially be the same though. its how we want folks to migrate.

Yep, it seems quite necessary! Is there an example of a nice deprecation message in the current (or in the old) code that I could look into?
For now, I was thinking of adding to the documentation of resample and pd.Grouper examples of "how to migrate". And in the code something like this argument is deprecated, please see: <url>.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2020

Yep, it seems quite necessary! Is there an example of a nice deprecation message in the current (or in the old) code that I could look into?
For now, I was thinking of adding to the documentation of resample and pd.Grouper examples of "how to migrate". And in the code something like this argument is deprecated, please see: <url>.

there are some (recently removed in 1.0.0) deprecation messages in resample on how to handle the freq arg. myabe not great but ok :->

@hasB4K hasB4K changed the title ENH: add 'adjust_timestamp' argument to 'resample' and 'pd.Grouper' ENH: add 'origin and 'offset' arguments to 'resample' and 'pd.Grouper' Feb 10, 2020
@hasB4K hasB4K force-pushed the grouper-adjust-timestamp branch 2 times, most recently from 2d2beba to 3097767 Compare February 13, 2020 23:10
@hasB4K hasB4K changed the title ENH: add 'origin and 'offset' arguments to 'resample' and 'pd.Grouper' ENH: add 'origin' and 'offset' arguments to 'resample' and 'pd.Grouper' Feb 13, 2020
@hasB4K hasB4K force-pushed the grouper-adjust-timestamp branch 6 times, most recently from 3cbeac4 to bbcbf7c Compare February 14, 2020 15:02
@hasB4K
Copy link
Member Author

hasB4K commented Feb 14, 2020

@jreback I still need to add more examples for 'origin' and 'offset' and update the "what's new" part of the doc, but otherwise, it's ready for review 🙂

@hasB4K hasB4K force-pushed the grouper-adjust-timestamp branch from 16a6831 to de6b477 Compare May 9, 2020 20:37
@hasB4K
Copy link
Member Author

hasB4K commented May 9, 2020

@jreback Thank you for the merge of #33498! I rebased the current PR with master, let me know if you need anything else 🙂

@jreback jreback merged commit 4a267c6 into pandas-dev:master May 10, 2020
@jreback
Copy link
Contributor

jreback commented May 10, 2020

very nice @hasB4K this was quite some PR!

please have a read thru the built docs (https://dev.pandas.io/), will take a little bfeore they are there.

and if needed issue a followup to clarify.

and keep em coming!

@hasB4K hasB4K deleted the grouper-adjust-timestamp branch May 10, 2020 16:39
@hasB4K
Copy link
Member Author

hasB4K commented May 10, 2020

Thank you @jreback! 😃

The inputs and guidance from @mroeschke, @WillAyd and you was really interesting and challenging in the good way! I am really glad of the current state of this new functionality. Thank you all! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Resample resample method
Projects
None yet
5 participants