Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/API: implement DayDST #44364

Closed
wants to merge 28 commits into from

Conversation

jbrockmendel
Copy link
Member

xref #41943 cc @mroeschke

The tests all passed locally up until the last commit, have a handful of broken resample tests now.

The approach here is to keep Day as-is and implement DayDST, fix the actively-wrong infer_freq behavior (see test_infer_freq_across_dst_not_daily), then decide what to do with freq="infer" and freq="D" based on the presence of a tz.

I guess a deprecation path would be to warn users passing "D" or Day() that it will have DST-aware semantics in the future, and to keep Timedelta-like behavior they should pass "24H". Would we want to warn in cases where they are equvialent, e.g. tz="UTC" or with a date_range that doesn't happen to cross a DST transition?

With such a deprecation cycle, we could avoid implementing DayDST and when the time comes just call it Day.

@mroeschke
Copy link
Member

The deprecation path would be nice such that we don't need to implement DayDST, but I remember chasing down all the warnings in the test suite was...involved. I think warning when if tz is not None and tz != UTC should be sufficient.

Here was my CalendarDay branch which I think is similar to what you are doing with DayDST: #22288

@jbrockmendel
Copy link
Member Author

So I guess the deprecation would be to take all the places where this PR uses DayDST and issue a warning telling users to use "24h" if they don't want CalendarDay behavior. Then when we enforce the deprecation, we change Day to behave like DayDST does in this branch (so DayDST never lands in master)

@jorisvandenbossche
Copy link
Member

The approach here is to keep Day as-is and implement DayDST, fix the actively-wrong infer_freq behavior (see test_infer_freq_across_dst_not_daily), then decide what to do with freq="infer" and freq="D" based on the presence of a tz.

Do you mean that the code that is in test_infer_freq_across_dst_not_daily is doing something wrong on master currently? (to me that seems to work fine on master, just giving "D" instead of "DayDST")

What do you mean exactly with "decide what to do with freq="infer" "?
Can you explain the trade-offs or potential changes in some more detail? (and maybe with some concrete examples)

I guess a deprecation path would be to warn users passing "D" or Day() that it will have DST-aware semantics in the future, and to keep Timedelta-like behavior they should pass "24H". Would we want to warn in cases where they are equvialent, e.g. tz="UTC" or with a date_range that doesn't happen to cross a DST transition?

I think ideally we only warn for a specific behaviour that will actually change in the future (to the extent this is achievable)
For example, for someone passing "D" in things like resample("D") or pd.date_range(..., freq="D"), AFAIK nothing will change in the future? If that's the case, I don't think those use cases should raise a warning.

@jbrockmendel
Copy link
Member Author

Do you mean that the code that is in test_infer_freq_across_dst_not_daily is doing something wrong on master currently? (to me that seems to work fine on master, just giving "D" instead of "DayDST")

Sort of. It is doing what it was originally intended to do (https://github.com/pandas-dev/pandas/pull/44364/files#diff-d4ebb574a7d2ddbedc3b4eed31907f98a8cf66721cd8080810890997ca2f8fe3L435), and in that sense is not doing anything wrong. BUT that originally intended behavior is the cause of bugs.

e.g. with non-None freq, we should always have dti[n+1] == dti[n] + dti.freq, but that fails ATM with the dti in test_infer_freq_across_dst_not_daily.

What do you mean exactly with "decide what to do with freq="infer" "?

Consider e.g. pd.DatetimeIndex(["2016-01-01", "2016-01-01", "2016-01-03"], tz="US/Pacific", freq="infer"). We could reasonably infer either "D" or "DayDST" here. ATM in this PR I infer DayDST.

For example, for someone passing "D" in things like resample("D") or pd.date_range(..., freq="D"), AFAIK nothing will change in the future?

The resample case I'm not sure of, but some date_range cases will see a change (I think the case of date_range with freq="D" is the hardest part of this actually). Consider pd.date_range("2016-01-01", periods=3, tz="US/Pacific", freq="D"). ATM in this PR we get back freq=DayDST. If instead we deprecate, we would issue a warning in this case that if they don't want DayDST-like behavior, they should pass '24h'.

@jbrockmendel
Copy link
Member Author

@mroeschke i stumbled on #24330; did we already do this deprecation and then not enforce it quite right?

@mroeschke
Copy link
Member

@mroeschke i stumbled on #24330; did we already do this deprecation and then not enforce it quite right?

Unfortunately no, I think #22867 was my attempt to enforce the deprecation but I lost steam handling all the warnings.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Nov 29, 2021

What do you mean exactly with "decide what to do with freq="infer" "?

Consider e.g. pd.DatetimeIndex(["2016-01-01", "2016-01-01", "2016-01-03"], tz="US/Pacific", freq="infer"). We could reasonably infer either "D" or "DayDST" here. ATM in this PR I infer DayDST.

But we want "D" and "DayDST" to mean the same (eventually)? (in which case there no decision to make)

In general, from your answers in #44364 (comment), I get the feeling that it's still not clearly defined what we actually consider the wrong behaviour / how we want to solve this.
It might be useful to make a summary explaining (with examples) the current situation, in which cases it does the correct / the wrong thing, and what we want it to do in the future (a section like this will be needed anyway for the whatsnew)

@jorisvandenbossche
Copy link
Member

The resample case I'm not sure of, but some date_range cases will see a change (I think the case of date_range with freq="D" is the hardest part of this actually).

My understanding is that nothing in date_range has to change (as this already has the behaviour that we want, i.e. using calendar days)

Consider pd.date_range("2016-01-01", periods=3, tz="US/Pacific", freq="D"). ATM in this PR we get back freq=DayDST. If instead we deprecate, we would issue a warning in this case that if they don't want DayDST-like behavior, they should pass '24h'.

If our goal is that freq="D" will continue to mean calendar day long term, I don't think we need to warn users that they should pass "24H" if they want a different behaviour as what they are currently getting?

@jbrockmendel
Copy link
Member Author

In general, from your answers in #44364 (comment), I get the feeling that it's still not clearly defined what we actually consider the wrong behaviour / how we want to solve this.
It might be useful to make a summary explaining (with examples) the current situation, in which cases it does the correct / the wrong thing, and what we want it to do in the future (a section like this will be needed anyway for the whatsnew)

Good idea. pd.date_range with freq="D" has constructs the dates as if "D" meant calendar-day, but the attached .freq means 24H. This leads to subtle bugs in cases that cross a DST crossing:

ts = pd.Timestamp("2019-11-02 00:00:00-0400", tz="Canada/Eastern")
dti = pd.date_range(start=ts, periods=5, freq="D")

>>> dti[1:] == dti[:-1] + dti.freq
array([ True, False,  True,  True])

Changing Day to behave like calendar-day fixes this particular example, but comes with its own problems:

  • freq="D" in TimedeltaIndex behaves like 24H; do we treat that like freq="24H"? what about freq=Day()?
  • resample code treats D and 24H as interchangeable (... i think. this is a weak point in my knowledge of the code)
  • A user passing freq="D" to date_range without a DST crossing might want to keep the 24H-like behavior.

@mroeschke
Copy link
Member

The resample case I'm not sure of, but some date_range cases will see a change (I think the case of date_range with freq="D" is the hardest part of this actually).

My understanding is that nothing in date_range has to change (as this already has the behaviour that we want, i.e. using calendar days)

Consider pd.date_range("2016-01-01", periods=3, tz="US/Pacific", freq="D"). ATM in this PR we get back freq=DayDST. If instead we deprecate, we would issue a warning in this case that if they don't want DayDST-like behavior, they should pass '24h'.

If our goal is that freq="D" will continue to mean calendar day long term, I don't think we need to warn users that they should pass "24H" if they want a different behaviour as what they are currently getting?

As a recap, #41943 (comment) describes:

  • The original inconsistency problem
  • The solutions

IMO the crux is to make "D" behave consistently (24 hours vs calendar day) in the presence of timezones. Based on past discussion people generally favored "D" meaning calendar day.

@jorisvandenbossche
Copy link
Member

resample code treats D and 24H as interchangeable (... i think. this is a weak point in my knowledge of the code)

According to the summary in the issue (#41943), resample already treats "D" as calendar day. And based on a quick check, it seems it does not treat "24H" the same, but actually treats is as a proper 24H fixed frequency:

In [18]: idx = pd.date_range("2016-10-29", freq='H', periods=4*24, tz='Europe/Brussels')
    ...: pd.Series(range(len(idx)), index=idx).resample('D').count()
Out[18]: 
2016-10-29 00:00:00+02:00    24
2016-10-30 00:00:00+02:00    25
2016-10-31 00:00:00+01:00    24
2016-11-01 00:00:00+01:00    23
Freq: D, dtype: int64

In [19]: idx = pd.date_range("2016-10-29", freq='H', periods=4*24, tz='Europe/Brussels')
    ...: pd.Series(range(len(idx)), index=idx).resample('24H').count()
Out[19]: 
2016-10-29 00:00:00+02:00    24
2016-10-30 00:00:00+02:00    24
2016-10-30 23:00:00+01:00    24
2016-10-31 23:00:00+01:00    24
Freq: 24H, dtype: int64

So based on what we discussed before, nothing has to change here.

freq="D" in TimedeltaIndex behaves like 24H; do we treat that like freq="24H"? what about freq=Day()?

Since timedeltas are by definition timezone naive, and since "D" is always equivalent to "24H" for timezone naive timestamps, I personally don't have a problem interpreting "D" as such for timedeltas as well.

But I think this was still a discussion point in earlier issues, see #22274 (comment) and last two paragraphs of #22274 (comment)

I think the question here is whether we want to 1) deprecate "D" in timedelta context, or 2) special case it as fixed 24H frequency.

A user passing freq="D" to date_range without a DST crossing might want to keep the 24H-like behavior.

But if there is no DST crossing, "D" or "24H" give the same output from date_range, so it already has 24H-like behaviour (so there is nothing to keep?)

pd.date_range with freq="D" has constructs the dates as if "D" meant calendar-day, but the attached .freq means 24H. This leads to subtle bugs in cases that cross a DST crossing:

Yes, but to be clear: that's not necessarily an issue in pd.date_range, but with offset arithmetic (regardless how the timeseries was created). Based on the previous discussion, I think we decided we want to keep the output of pd.date_range.

@jorisvandenbossche
Copy link
Member

A user passing freq="D" to date_range without a DST crossing might want to keep the 24H-like behavior.

But if there is no DST crossing, "D" or "24H" give the same output from date_range, so it already has 24H-like behaviour (so there is nothing to keep?)

Ah, I suppose you mean "keep the 24H-like behaviour" in subsequent offset arithmetic ?

Let's use an example without date_range, to just focus on the arithmetic part:

In [40]: dti = pd.DatetimeIndex(["2019-11-02", "2019-11-03", "2019-11-04", "2019-11-05"]).tz_localize("Canada/Eastern")

In [41]: dti
Out[41]: 
DatetimeIndex(['2019-11-02 00:00:00-04:00', '2019-11-03 00:00:00-04:00',
               '2019-11-04 00:00:00-05:00', '2019-11-05 00:00:00-05:00'],
              dtype='datetime64[ns, Canada/Eastern]', freq=None)

In [42]: dti + pd.offsets.Day()
Out[42]: 
DatetimeIndex(['2019-11-03 00:00:00-04:00', '2019-11-03 23:00:00-05:00',
               '2019-11-05 00:00:00-05:00', '2019-11-06 00:00:00-05:00'],
              dtype='datetime64[ns, Canada/Eastern]', freq=None)

So here "D" acts as "24H", and so if we make "D" always act as calendar day, the above example will show a change in behaviour.
If we want this change, I assume that the above example will raise a deprecation warning initially, saying that you should use a "24H" offset instead if you want to preserve the behaviour.

So also if you are using pd.date_range with "D", and then in a next step do offset arithmetic that will cross a DST, you will see this warning (but not from date_range, but from the arithmetic).
At this point it is true that the user will have to decide what to do exactly, since they can not necessarily change "D" to "24H" in date_range because that will give a change in output for the timeseries creation. If the user wants calendar day behaviour of "D" while creating the timeseries, but then fixed 24H behaviour in a subsequent arithmetic, they will have to replace tdi + tdi.freq with tdi + pd.offsets.Hour(24)?

@jbrockmendel
Copy link
Member Author

Ah, I suppose you mean "keep the 24H-like behaviour" in subsequent offset arithmetic ?

Yes, thank you for making that explicit. It is not only arithmetic with DateOffset objects though. e.g. add/sub with a Timedelta scalar will preserve a Tick freq, which will change if Day ceases to subset Tick.

This means there is a potential behavior change when you construct a DTI without a DST crossing, then add a timedelta scalar:

ts = pd.Timestamp("2019-11-01 00:00:00-0400", tz="Canada/Eastern")
dti = pd.date_range(start=ts, periods=4, freq="D")
dti2 = dti[:-1]   # ends before DST crossing

v1 = dti2 + dti2.freq
v2 = dti2 + pd.Timedelta(days=1)

assert v1.equals(v2)   # <- works in master, will fail once "D" is calendarday
assert v1.equals(dti[1:])  # <- fails in master, will work once "D" is calendarday

@jorisvandenbossche
Copy link
Member

assert v1.equals(v2) # <- works in master, will fail once "D" is calendarday

I find it "expected" that this will not be equal, if "D" is a calendar day while a Timedelta is always a fixed-time delta.

@jbrockmendel
Copy link
Member Author

assert v1.equals(v2) # <- works in master, will fail once "D" is calendarday

I find it "expected" that this will not be equal, if "D" is a calendar day while a Timedelta is always a fixed-time delta.

I agree, am saying that it constitutes an API change

@jbrockmendel
Copy link
Member Author

cc @mroeschke https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=69864&view=logs&j=054c7612-f473-5cf0-0a65-77b54574553a&t=7218990d-7c28-52cc-5797-2fee2b0645b4&l=121

Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=16>'), '/Users/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/jinja2/environment.py', 474)]

@mroeschke
Copy link
Member

cc @mroeschke https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=69864&view=logs&j=054c7612-f473-5cf0-0a65-77b54574553a&t=7218990d-7c28-52cc-5797-2fee2b0645b4&l=121

Caused unexpected warning(s): [('ResourceWarning', ResourceWarning('unclosed file <_io.BufferedRandom name=16>'), '/Users/runner/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/jinja2/environment.py', 474)]

:(

Guess the search continues. Probably still dependency related at this point.

@jbrockmendel
Copy link
Member Author

Another option would be to change the behavior of Day in 2.0 without a deprecation cycle.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2021

Another option would be to change the behavior of Day in 2.0 without a deprecation cycle.

this is probably the way to go

@github-actions
Copy link
Contributor

github-actions bot commented Jan 5, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jan 5, 2022
@jbrockmendel jbrockmendel added this to the 2.0 milestone Jan 6, 2022
@jbrockmendel
Copy link
Member Author

Mothballing to clear the queue since this is on the "do in 2.0" list

@jbrockmendel jbrockmendel added Mothballed Temporarily-closed PR the author plans to return to Frequency DateOffsets labels Jan 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frequency DateOffsets Mothballed Temporarily-closed PR the author plans to return to Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants