Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH/BUG: Add is_dst method to DatetimeIndex and Timestamp to solve AmbiguousTimeError #22560

Closed
wants to merge 12 commits into from

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Sep 1, 2018

The issues above required a new method to keep track if each timestamp in the DatetimeIndex was previously in daylight savings time in order to pass into tz_localize(..., ambiguous=...)

Therefore, added a is_dst method to DatetimeIndex and Timestamp.

@@ -353,7 +353,6 @@ class NaTType(_NaT):
strptime = _make_error_func('strptime', datetime)
strftime = _make_error_func('strftime', datetime)
isocalendar = _make_error_func('isocalendar', datetime)
dst = _make_error_func('dst', datetime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing?

Py_ssize_t n = len(values)
# Cython boolean memoryviews are not supported yet
# https://github.com/cython/cython/issues/2204
# bint[:] result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the workaround is to use a uint8

>>> dti.is_dst()
Index([True, True, False, False], dtype='object')
"""
return Index(timezones._is_dst(self.asi8, self.tz))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make is_dst public?

@jbrockmendel
Copy link
Member

Is this related to the “fold” arg added in py3?

@gfyoung gfyoung added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype labels Sep 1, 2018
@mroeschke
Copy link
Member Author

Is this related to the “fold” arg added in py3?

Yup fairly similar. The ambiguous=True/False argument is essentially similar to the fold=0/1 argument in py3.

These AmbiguousTimeErrors stem from internal routines that transform a timestamp to an ambiguous time. I think it's best we try to return a sensible time based on the original timeseries instead of leaving the user with an error they cannot control.

@codecov
Copy link

codecov bot commented Sep 4, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a5fe9cf). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22560   +/-   ##
=========================================
  Coverage          ?   92.04%           
=========================================
  Files             ?      169           
  Lines             ?    50789           
  Branches          ?        0           
=========================================
  Hits              ?    46747           
  Misses            ?     4042           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.45% <100%> (?)
#single 42.29% <33.33%> (?)
Impacted Files Coverage Δ
pandas/core/indexes/datetimelike.py 97.77% <100%> (ø)
pandas/core/indexes/datetimes.py 95.56% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5fe9cf...c03792e. Read the comment docs.

@mroeschke
Copy link
Member Author

Ready for another look @jbrockmendel when you get the chance.

@mroeschke mroeschke added this to the 0.24.0 milestone Sep 4, 2018
@@ -186,16 +187,28 @@ cdef object get_utc_trans_times_from_dateutil_tz(object tz):
return new_trans


cdef int64_t[:] unbox_utcoffsets(object transinfo):
cdef int64_t[:] unbox_utcoffsets(object transinfo, dst):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add bint to type dst


sz = len(transinfo)
arr = np.empty(sz, dtype='i8')

key = int(dst)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you may not need this if typing bint


sz = len(transinfo)
arr = np.empty(sz, dtype='i8')

key = int(dst)
for i in range(sz):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment about what you are extracting

----------
tz : object
timezone
dst : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

type with bitn

Parameters
----------
tz : object
timezone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we accept only timezone objects and not strings here?

return result
transitions, offsets, typ = get_dst_info(tz, True)
offsets = np.array(offsets)
# Fixed timezone offsets do not have DST transitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line here

if typ not in {'pytz', 'dateutil'}:
return result
positions = transitions.searchsorted(values, side='right') - 1
# DST has nonzero offset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -1012,6 +1012,22 @@ def test_iteration_preserves_nanoseconds(self, tz):
for i, ts in enumerate(index):
assert ts == index[i]

@pytest.mark.parametrize('arg, expected_arg', [
[[], []],
[date_range('2018-11-04', periods=4, freq='H', tz='US/Pacific'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try with a dateutil as well (I think we default to pytz)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pganssle Do dateutil timezones provide a way to access the possible daylight savings time shifts like pytz.timezone._transition_info does?

# I'm interested in the 2nd element of the tuples below
In [4]: tz._transition_info
Out[4]:
[(datetime.timedelta(-1, 58020), datetime.timedelta(0), 'LMT'),
 (datetime.timedelta(-1, 57600), datetime.timedelta(0), 'PST'),
 (datetime.timedelta(-1, 61200), datetime.timedelta(0, 3600), 'PDT'),
...

Context: I am trying to create a DatetimeIndex.is_dst method that indicates whether a particular timestamp is in daylight savings time. I know it's possible to do bool(datetime.dst()) on each timestamp of the DatetimeIndex, but I am curious if there's a more efficient way determine DST for a dateutil timezone than iterating.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke Yes and no. Yes in the sense that this is Python and obviously dateutil has to store that information somewhere. No in the sense that neither dateutil nor pytz exposes that information publicly. Note that _transition_info is a private member.

That said:

  1. I'm not entirely sure why you want to implement a "fast isdst" - I'm not sure I've ever heard of a case where it was useful to know whether or not a time had or did not have DST.
  2. I don't really see what this would buy you - the member you list looks like what you would get from utcoffset, dst and tzname, respectively. How do you expect to make use of this information?

assert ts_naive.is_dst() is False

ts_aware = ts_naive.tz_localize('US/Pacific')
assert ts_aware.is_dst() is True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you test dateutil

@pep8speaks
Copy link

Hello @mroeschke! Thanks for updating the PR.

@pganssle
Copy link
Contributor

pganssle commented Sep 5, 2018

I would recommend against adding an is_dst method. If this is for resolving ambiguous times, it would be preferable to use something more akin to a fold method, since there are several situations where an ambiguity cannot be resolved by querying DST or not (base offset shifts).

In general DST/not DST is only a proxy for the real information you want in all cases I can think of - the offset, the tzname or which side of an ambiguity you are on. All of those can be independently queried and/or specified without mentioning DST.

@jbrockmendel
Copy link
Member

These AmbiguousTimeErrors stem from internal routines that transform a timestamp to an ambiguous time.

I'm not clear on what problem this is solving. Can you give an example of the internal routines that produce ambiguous times?

@pganssle and I don't always see eye-to-eye on how to treat public vs private attributes, but in general if he's got a strong opinion, it is worth listening to.

@jorisvandenbossche
Copy link
Member

I also wanted to mention that we deliberately added an ambiguous keyword and deprecated is/infer_dst because not all ambiguities are DST issues, if I remember well (#7963, there was some discussion about it, but I am not very familiar with timezone handling details), so we should maybe consider that issue here as well. But I see that @pganssle in the meantime already noted something similar :)

@mroeschke
Copy link
Member Author

mroeschke commented Sep 7, 2018

@jbrockmendel

Can you give an example of the internal routines that produce ambiguous times?

When rounding a tz-aware date time to a particular frequency, essentially the steps are:

  • convert the date to a naive timestamp
  • round the i8 value to the frequency
  • relocalize the i8 value to the timezone

You can't necessarily convert to UTC first because it can affect the rounding (unless I am missing something).

So in #18946, a date that sits on the DST border 2017-10-29 02:00:00+02:00 that's rounded to an hour ('H'):

  • Convert to 2017-10-29 02:00:00
  • Round to the nearest hour (2017-10-29 02:00:00)
  • Relocalize to 2017-10-29 02:00:00+02:00 or 2017-10-29 02:00:00+01:00?

I guess #18946 can be solved by added an ambiguous kwarg to round also. However the routine above is the cause for other issues as well #18885, but that may be solved by converting to UTC first.

Using DST information may be a too hand-wavy solution to solve both #18946 and #18885. At minimum, I had also though having a is_dst would be a nice-to-have enhancement for DatetimeIndexes.

@pganssle
Copy link
Contributor

pganssle commented Sep 8, 2018

@mroeschke I'm guessing that rounding issues in UTC is mainly at the day-or-higher level, right? Since rounding to the nearest minute or hour will cause no problems.

I think the correct solution here is to introduce the fold concept into pandas, and start deprecating pytz-isms.

I see a few good ways forward:

  1. Write fold compatible wrappers for pytz and start deprecating pytz-isms like localize and normalize.
  2. Write wrappers for dateutil time zones that expose some pytz-isms (which you can deprecate), and stop using pytz entirely.
  3. Implement a third time zone library optimized for array types and drop both pytz and dateutil.

In all cases, I think fold semantics is the best way to go.

I am very strongly hoping that within a year I'll be able to get a fast, compiled Rust or C backend for many of dateutil's modules, including tz, which should make it a lot cheaper to make bulk function calls (particularly if I expose a C API or similar so you don't have any Python function call overhead), but number 3 might still be the best option since it's likely that supporting something that would work really well for array types in dateutil and work really well for normal Python implementations would be more complicated than just maintaining two separate libraries.

@jbrockmendel
Copy link
Member

You can't necessarily convert to UTC first because it can affect the rounding (unless I am missing something).

I think there may be an implicit assumption in here that should be made explicit. Does the user want the rounding to be based on wall time or actual time? (My intuition is actual time)

ts = pd.Timestamp('2017-10-29 02:59:00+0200', tz='Europe/Madrid')
m = pd.Timedelta(minutes=1)

>>> ts + m
Timestamp('2017-10-29 02:00:00+0100', tz='Europe/Madrid')
>>> ts - 59*m
Timestamp('2017-10-29 02:00:00+0200', tz='Europe/Madrid')

If I were rounding ts to the nearest hour, it seems clear I'd want the one 1 minute away rather than 59 minutes away, despite the fact they have the same difference in wall time.

Is this intuition not applicable to the case you're considering?

@pganssle
Copy link
Contributor

pganssle commented Sep 9, 2018

@jbrockmendel I think for time scales of hours or less it's probably actual time, but time scale of days they almost certainly want wall time.

Imagine rounding to midnight UTC, then going back to PST/PDT. No one wants to round to 17:00 sometimes and 16:00 others.

To be honest it's probably always wall time, but on sufficiently small time scales the two are equivalent.

@mroeschke
Copy link
Member Author

I agree with @pganssle and would imagine users would want to round based on wall time in general.

Thanks for the feedback guys. I'm convinced I need to surface a fold (or ambiguous in pandas terms) argument in round then. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
7 participants