
BUG: Fix to_datetime to properly deal with tz offsets #3944 #5958

Merged
1 commit merged into pandas-dev:master
Feb 4, 2014

Conversation

danbirken
Contributor

Currently for certain formats of datetime strings, the tz offset will
just be ignored.
#3944

@ghost

ghost commented Jan 18, 2014

I'm going to stop pretending I haven't seen this PR for just long enough to give you a glimpse
into the psyche of a maintainer.

Convince me that this can't possibly break anything for anyone and that the change is what
the code should have been doing all along.

Some of those variations seem like they stretch ISO 8601 to the breaking point; I may be wrong.

@danbirken
Contributor Author

Perhaps I should have written the test in a more readable way to highlight the bug. I wrote it to capture all the test cases concisely and to exercise the various ways the datetime strings are parsed.

The current situation is just wrong:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1357027200000000000

In [4]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[4]: 1356998400000000000

These timestamp strings, 01-01-2013 08:00:00+08:00 and 2013-01-01T08:00:00+0800, represent the same instant. However, depending on which internal parsing method is triggered, the timezone offset is either used or ignored. It should be consistent, and the most logical behavior is to use the timezone offset whenever one is given. In fact, every other method that parses strings into datetime64s already uses the offset, so this just fixes the one case where it didn't (which I assume was an oversight). This is the same code after the fix:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1356998400000000000

In [2]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[2]: 1356998400000000000

This is certainly what the code should be doing.
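
As a sanity check on the expected value (my arithmetic, using only the standard library, not part of the PR): 2013-01-01 08:00:00+08:00 is 2013-01-01 00:00:00 UTC, i.e. 1356998400 seconds since the Unix epoch, which matches the fixed .value in nanoseconds:

In [5]: import calendar, datetime

In [6]: calendar.timegm(datetime.datetime(2013, 1, 1).timetuple()) * 10**9
Out[6]: 1356998400000000000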

This could break somebody's code that depends on this bug, but on the plus side:

a) All of the existing tests pass
b) If it did break somebody's code it would be really easy to fix (either strip the timezones out of your datetime strings, as sketched below, or just accept the fact that the code is now properly converting them to UTC)
c) As this code gets more complicated (like with the potential changes to speed it up), keeping this inconsistency is just going to make everything more confusing, because the bug will be triggered in less predictable ways
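
A minimal sketch of the workaround in (b), assuming offsets of the +HH:MM or +HHMM form (the strip_tz_offset helper and its regex are illustrative, not part of pandas):

import re

def strip_tz_offset(s):
    # Drop a trailing +HH:MM / +HHMM style offset from a datetime string.
    return re.sub(r'\s*[+-]\d{2}:?\d{2}$', '', s)

strip_tz_offset('01-01-2013 08:00:00+08:00')  # -> '01-01-2013 08:00:00'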

@ghost

ghost commented Jan 18, 2014

The timestring is not a valid ISO 8601 string, so the behavior might be said to be undefined.
But I see that dateutil.parser.parse does produce the same result for both, and that's fairly
convincing evidence that this is a bug.
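
For reference, a quick check of that claim (assuming dateutil is available):

In [1]: from dateutil import parser

In [2]: parser.parse('01-01-2013 08:00:00+08:00') == parser.parse('2013-01-01T08:00:00+0800')
Out[2]: True  # both parse to 2013-01-01 08:00:00+08:00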

Can you show an example where another code path in pandas parses this differently from
to_datetime()?

I'm half convinced this can go into 0.14.0, @jreback?

@ghost

ghost commented Jan 18, 2014

Can you do a test_perf.sh run to check for possible perf impact?

@jreback
Contributor

jreback commented Jan 19, 2014

I agree that this is a bug (the very fact that we have different results for this one case). For the most part this path is not hit very often AFAICT (e.g. you need to have no specified format and the string is not parsed by the ISO 8601 C parser), so other than in our tests I don't see how this is even hit.

@danbirken can you contrive a case where it's actually hit, either in read_csv or in to_datetime?

(and not by directly calling array_to_datetime)

@jreback
Contributor

jreback commented Jan 19, 2014

I guess this proves the point (but this is way esoteric; who specifies dtypes when constructing a Series!)

In [11]: Series(['2013-01-01T08:00:00.000000000+0800','12-31-2012 23:00:00-01:00','01-01-2013 08:00:00+08:00','01-01-2013 08:00:00'],dtype='M8[ns]')
Out[11]: 
0   2013-01-01 00:00:00
1   2012-12-31 23:00:00
2   2013-01-01 08:00:00
3   2013-01-01 08:00:00
dtype: datetime64[ns]

Note that only row 0, the ISO 8601 form, has its +0800 offset applied (converting to midnight UTC); rows 1 and 2 silently drop their offsets.

@danbirken
Contributor Author

The original reason I noticed this bug is that the code path in Timestamp() uses parse_date properly:

In [13]: pd.Timestamp('01-01-2013 08:00:00+08:00').value
Out[13]: 1356998400000000000

In [14]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[14]: 1357027200000000000

(The two values differ by exactly 28800 × 10^9 ns, i.e. the 8-hour offset that to_datetime drops.)

Here is a read_csv case that gives the wrong result. Contents of /tmp/a.csv:

timestamp,data
01-01-2013 08:00:00+08:00,1
01-01-2013 09:00:00+08:00,2
01-01-2013 10:00:00+08:00,3

Old (timezone offset is ignored):

In [12]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[12]:
                     data
timestamp
2013-01-01 08:00:00     1
2013-01-01 09:00:00     2
2013-01-01 10:00:00     3

Fixed (timezone offset used):

In [2]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[2]:
                     data
timestamp
2013-01-01 00:00:00     1
2013-01-01 01:00:00     2
2013-01-01 02:00:00     3
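
The same reproduction works without a temp file (a sketch, assuming Python 3-style io.StringIO):

In [3]: from io import StringIO

In [4]: data = ('timestamp,data\n'
   ...:         '01-01-2013 08:00:00+08:00,1\n'
   ...:         '01-01-2013 09:00:00+08:00,2\n'
   ...:         '01-01-2013 10:00:00+08:00,3\n')

In [5]: pd.read_csv(StringIO(data), index_col=0, parse_dates=True)  # same frame as Out[2] above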

I'd actually imagine this gets triggered quite a bit, since people use pandas to import log data a lot (which will have timestamps in weird formats, probably including a tz offset).

Running the ./test_perf.sh stuff now...

@ghost

ghost commented Jan 19, 2014

I'm sold on the first example. The rest are just cases where a strange timestamp doesn't
get picked up correctly; that's not a guarantee we're bound by (although we'd like to match
the dateutil parser in general).

Start of 0.14, if no objections, @jreback?

@jreback
Contributor

jreback commented Jan 19, 2014

Technically this is a bug fix, but I agree that people may actually be relying on this.
0.14 it is, then.

@jreback
Contributor

jreback commented Jan 19, 2014

@danbirken this does not close #3844, correct?

@danbirken
Contributor Author

Well, I can't get ./test_perf.sh to cooperate, but I ran some ad-hoc tests and the result appears to be within 1% of the previous performance. This change swaps one fully Cythonized function for another, so it makes sense that it would still be quite fast.

I admit I don't fully understand what is going on in that particular issue (#3844), so I really have no idea if this is related.

FYI: I made a small update because I saw that convert_to_tsobject already calls check_dts_bounds, so I got rid of the duplicate call.

@danbirken
Contributor Author

Oh I see, you probably mean #3944.

I think that issue should be closed as "will not fix", and this is really a different bug (though it is related). I can make another GH issue for this change if you would like for tracking purposes.

@jreback
Contributor

jreback commented Jan 19, 2014

no that's fine

@ghost

ghost commented Jan 19, 2014

@danbirken, please open an issue on test_perf if there is one and cc me.

@ghost

ghost commented Feb 4, 2014

banzai

@ghost ghost merged commit 8a60158 into pandas-dev:master Feb 4, 2014