Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Timestamp.replace() behaves naively at DST boundaries (plus bonus segfault!) #7825

Closed
ischwabacher opened this issue Jul 23, 2014 · 19 comments · Fixed by #21491
Closed
Labels
Bug Timezones Timezone data dtype
Milestone

Comments

@ischwabacher
Copy link
Contributor

In [1]: import pandas as pd

In [2]: t = pd.Timestamp('2013-11-3', tz='America/Chicago')

In [3]: t
Out[3]: Timestamp('2013-11-03 00:00:00-0500', tz='America/Chicago')

In [4]: t.replace(hour=3)
Out[4]: Timestamp('2013-11-03 03:00:00-0500', tz='America/Chicago')
# This time doesn't exist.

In [5]: pd.Timestamp('2013-11-3 03:00:00', tz='America/Chicago')
Out[5]: Timestamp('2013-11-03 03:00:00-0600', tz='America/Chicago')

In trying to replicate the definition of Timestamp.replace in tslib.pyx, I encountered this further issue:

In [6]: from datetime import datetime

In [7]: datetime.replace(t, hour=3)
Segmentation fault: 11

Trying another tack, I found this:

In [1]: from datetime import datetime

In [2]: import pytz

In [3]: t = pytz.timezone('America/Chicago').localize(datetime(2013, 11, 3))

In [4]: t.strftime('%Y-%m-%d %H:%M:%S %Z%z')
Out[4]: '2013-11-03 00:00:00 CDT-0500'

In [5]: t.replace(hour=3).strftime('%Y-%m-%d %H:%M:%S %Z%z')
Out[5]: '2013-11-03 03:00:00 CDT-0500'
# This time still never happened

In summary: datetime is horribly broken, and the current workaround doesn't actually get all the way around. I know breaking backward compatibility is painful, and throwing the stdlib out the window is even more so, but this situation reminds me of nothing so forcefully as Microsoft Excel's 1900/1904 issue.

Version information:

In [1]: from pandas.util.print_versions import show_versions

In [2]: show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Darwin
OS-release: 13.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.0
Cython: None
numpy: 1.8.1
scipy: 0.12.1
statsmodels: None
IPython: 1.1.0
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

cc @rockg
cc @sinhrks
cc @adamgreenhall

hmm, not sure anything can be done about datetime.replace() itself; pandas doesn't control that at all.

ATM, no checking is done for replace if it has a tz.

It think that is reasonable though.

want to submit a pull-request?

@jreback jreback added this to the 0.15.0 milestone Jul 23, 2014
@jreback jreback changed the title BUG: Timestamp.replace() behaves naively at DST boundaries (plus bonus segfault!) BUG: Timestamp.replace() behaves naively at DST boundaries (plus bonus segfault!) Jul 23, 2014
@ischwabacher
Copy link
Contributor Author

Yes, but I still have a bunch of other open stuff that I haven't gotten to yet, and my preferred solution is to rip datetime out and replace it, which is unlikely to be a popular viewpoint.

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

In [36]: t + pd.offsets.Hour(1)
Out[36]: Timestamp('2013-11-03 01:00:00-0500', tz='America/Chicago')

In [37]: t + pd.offsets.Hour(2)
Out[37]: Timestamp('2013-11-03 01:00:00-0600', tz='America/Chicago')

In [38]: t + pd.offsets.Hour(3)
Out[38]: Timestamp('2013-11-03 02:00:00-0600', tz='America/Chicago')

In [39]: t + pd.offsets.Hour(4)
Out[39]: Timestamp('2013-11-03 03:00:00-0600', tz='America/Chicago')

why would you want to touch datetime?

@ischwabacher
Copy link
Contributor Author

In this case, I need "the window from 1:00 to 2:00", which is zero, one or two hours long, depending on the day.

Also:

In [1]: import pandas as pd

In [2]: list(pd.date_range('2013-11-1', periods=10, freq='24H', tz='America/Chicago'))
Out[2]: 
[Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-02 00:00:00-0500', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-03 00:00:00-0500', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-04 00:00:00-0600', tz='America/Chicago', offset='24H'),
# This is (arguably) wrong but should certainly agree with the below
 Timestamp('2013-11-05 00:00:00-0600', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-06 00:00:00-0600', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-07 00:00:00-0600', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-08 00:00:00-0600', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-09 00:00:00-0600', tz='America/Chicago', offset='24H'),
 Timestamp('2013-11-10 00:00:00-0600', tz='America/Chicago', offset='24H')]

In [3]: out = [pd.Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')]

In [4]: for i in range(9):
   ...:     out.append(out[-1] + pd.offsets.Hour(24))
   ...: 

In [5]: out
Out[5]: 
[Timestamp('2013-10-31 23:09:00-0551', tz='America/Chicago'),
# this brokenness comes from datetime.datetime
 Timestamp('2013-11-02 00:00:00-0500', tz='America/Chicago'),
 Timestamp('2013-11-03 00:00:00-0500', tz='America/Chicago'),
 Timestamp('2013-11-03 23:00:00-0600', tz='America/Chicago'),
# this is (arguably) correct but should certainly agree with the above
 Timestamp('2013-11-04 23:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-05 23:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-06 23:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-07 23:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-08 23:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-09 23:00:00-0600', tz='America/Chicago')]

@jreback
Copy link
Contributor

jreback commented Jul 23, 2014

What you are doing is not comparable. date_range guarantees the same time each day (with a 24H,1D) offset. The 2nd method is simply adding 24H to EACH entry in tern, which of course is affected by DST.

I am still confused about the

In summary: datetime is horribly broken, and the current workaround doesn't actually get all the way around. I know breaking backward compatibility is painful, and throwing the stdlib out the window is even more so, but this situation reminds me of nothing so forcefully as Microsoft Excel's 1900/1904 issue

What exactly is horribly broken? yes datetime doesn't handle dst properly. so what. you are not using it directly anyhow (or are you somewhere that is not apparent)? pandas is very careful internally to deal with DST transitions correctly (of course their are bugs, which get squashed).

So what is the bug here? (aside from replace not correctly doing things? which is obviously just calls datetime replace naively, I agree, but I am not sure this is a cause to dump datetime).

@ischwabacher
Copy link
Contributor Author

What exactly is horribly broken?

Erm... my hyperbole engine? Which is overheating? It's true that I'm being a little over the top.

I think there are two issues here. One is that datetime assumes that a time zone has a fixed offset, which is of course not true when you view a time zone a mapping between UTC and local time, which you need to do if you're going to support arithmetic operations. Hence this:

In [1]: import pandas as pd

In [2]: pd.Timestamp('2013-11-01 00:00:00-0500', tz='America/Chicago')
Out[2]: Timestamp('2013-10-31 23:09:00-0551', tz='America/Chicago')

This is also why Timestamp is going to need to replace datetime.replace, and it's why pytz.timezone.normalize exists.

So there are three clear bugs here: first, that constructing a Timestamp from local time + offset + time zone (which is the ideal way to avoid local time ambiguity) yields a time that is UTC-correct but has the offset of local mean time (or whatever that time zone's first offset is in tzinfo). Second, Timestamp.replace gives nonexistent times when crossing a local time discontinuity. Third, datetime.replace segfaults when fed a Timestamp (regardless of whether there's a local time discontinuity).

The second issue is that in general, 1D != 24H. I know I'm going to have a hard time selling this one, especially since the BDFL has explicitly weighed in with the opposite opinion, but I really think that this is a case of "explicit is better than implicit". Some days just aren't 24 hours long, and users who assume that they are will get burned one way or another. Yes, we can intercept 24H and make it into 1D, but what if that 24 is the result of a computation that could have come out to 20, or 35? What if the user wants 1D1H, as in ['2013-11-01 00:00:00-0500', '2013-11-02 01:00:00-0500', '2013-11-03 02:00:00-0600']? What happens to an hour when we want to implement leap seconds?

There are really two notions of datetime arithmetic, one of them well represented by timedelta64 and the other by date_range. timedelta64 is about absolute time while date_range is about local time. I realize that I'm asking for a major (and controversial) design change: I think that DateOffset should be about local time, and date_range should simply inherit that behavior because it uses DateOffsets.

If there's enough support for this I can split it off into a separate issue; otherwise it can just die here.

@rockg
Copy link
Contributor

rockg commented Jul 24, 2014

I definitely agree that we need to support different notions of day when it concerns tz definitions and addition. This came up on a prior issue...see #5175 for what is being discussed there.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2014

  1. I don't understand where the construction bug is, looks ok to me
  2. yes this issue is about replace.
  3. the fact that datetime.replace segfaults when fed a Timestamp: not sure what if anything can be done here. Not sure that is 'valid' input and should simply be avoided. If someone wants to delve into into this, then ok

@ischwabacher
Copy link
Contributor Author

  1. I don't understand where the construction bug is, looks ok to me

Timestamp('2013-10-31 23:09:00-0551', tz='America/Chicago') has the wrong offset.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2014

ahh it's using the current offset rather than normalizing

ok pls create a new issue for that one

@ischwabacher
Copy link
Contributor Author

Just to note: One of the reasons I want to get this right is that absolutely everybody else gets it wrong, in some small way or another. Until I discovered pandas, my solution was to create the same time zone in pytz and dateutil.tz and swap them out depending on what operation I needed to perform on a given datetime. (dt + timedelta(minutes=N) needs a pytz timezone while dt + timedelta(days=N) and dt.replace() need a dateutil.tz timezone.) I never want to have to break out that particular stapler again.

@jreback
Copy link
Contributor

jreback commented Jul 24, 2014

@ischwabacher make that last a separate issue as well (they might be connected, but easier to deal with this way)

@ischwabacher
Copy link
Contributor Author

My current workaround is this:

In [1]: import pandas as pd

In [2]: import pytz

In [3]: tz = pytz.timezone('America/Chicago')

In [4]: t = pd.Timestamp('2013-11-3', tz=tz)

In [5]: [tz.normalize(t.replace(hour=n)).replace(hour=n) for n in range(24)]
Out[5]: 
[Timestamp('2013-11-03 00:00:00-0500', tz='America/Chicago'),
 Timestamp('2013-11-03 01:00:00-0500', tz='America/Chicago'),
 Timestamp('2013-11-03 02:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 03:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 04:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 05:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 06:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 07:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 08:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 09:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 10:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 11:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 12:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 13:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 14:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 15:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 16:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 17:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 18:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 19:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 20:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 21:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 22:00:00-0600', tz='America/Chicago'),
 Timestamp('2013-11-03 23:00:00-0600', tz='America/Chicago')]

@jreback
Copy link
Contributor

jreback commented Sep 9, 2014

@ischwabacher what needs to be fixed here again? (as I think some of this was moved to another issue)

@ischwabacher
Copy link
Contributor Author

Well, none of the misbehavior in the OP has changed (as of 0.14.1-376-g0b5fa07), so the title issue is still outstanding.

#7833 is fixed, so we're not getting crazy offsets when constructing Timestamps from local+offset+tz. I would say that #7835 is still only partially resolved, but the remaining part of that is more of an API design question than an actual bug. I still need to read through and understand #5175 / #5292, so I'll wait to comment on those until I've done that; it looks like that will be related to #7835. So all the bugs mentioned in this issue except the OP are already fixed; all of the design issues that have come up are addressed in their own issues.

I am finally steeling myself to take on tslib.pyx in order to fix this and #8225, but I'm having problems getting cygdb to run and debugging cython using naked gdb is a chore.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2014

@ischwabacher when you have a chance if you'd put up a ToDo list at the top of the PR and we'll make this into a master issue (for the sub-issues) you mentioned

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 29, 2014
@ischwabacher
Copy link
Contributor Author

Yeah, I've been pretty disorganized. I will try to put this together tonight, but I would bet it won't actually come together until some time tomorrow night at the earliest.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2014

no worries
I have pushed this till 0.15.1
but if u want to included something for 0.15.0 we can

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback
Copy link
Contributor

jreback commented Jan 5, 2018

if someone wants to evaluate this issue for closure, xref #18618 which we think closes. just want to be sure bases are covered.

@jreback jreback modified the milestones: 0.23.0, Next Major Release Apr 14, 2018
@jreback jreback modified the milestones: Next Major Release, 0.23.2 Jun 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants