
BUG: Fix to_datetime to properly deal with tz offsets #3944 #5958

Merged
1 commit merged into pandas-dev:master
Feb 4, 2014

Conversation

danbirken
Contributor

Currently for certain formats of datetime strings, the tz offset will
just be ignored.
#3944

@ghost

ghost commented Jan 18, 2014

I'm going to stop pretending I haven't seen this PR for just long enough to give you a glimpse
into the psyche of a maintainer.

Convince me that this can't possibly break anything for anyone and that the change is what
the code should have been doing all along.

Some of those variations seem like they stretch ISO 8601 to the breaking point; I may be wrong.

@danbirken
Contributor Author

Perhaps I should have written the test in a more readable way to highlight the bug. I wrote it to capture all the test cases concisely and to exercise the various ways the datetime strings are parsed.

The current situation is just wrong:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1357027200000000000

In [4]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[4]: 1356998400000000000

These timestamp strings, 01-01-2013 08:00:00+08:00 and 2013-01-01T08:00:00+0800, represent the same instant. However, depending on which internal parsing method is triggered, the timezone offset is either used or ignored. It should be consistent, and the most logical behavior is to use the timezone offset whenever one is given. In fact, every other method that parses strings into datetime64s already uses the offset, so this just fixes the one case where it didn't (which I assume was an oversight). This is the same code after the fix:

In [3]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[3]: 1356998400000000000

In [2]: pd.to_datetime('2013-01-01T08:00:00+0800').value
Out[2]: 1356998400000000000

This is certainly what the code should be doing.
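
As a sanity check on the expected value (my arithmetic, using only the standard library, not part of the PR): 2013-01-01 08:00:00+08:00 is 2013-01-01 00:00:00 UTC, i.e. 1356998400 seconds since the Unix epoch, which matches the fixed .value in nanoseconds:

In [5]: import calendar, datetime

In [6]: calendar.timegm(datetime.datetime(2013, 1, 1).timetuple()) * 10**9
Out[6]: 1356998400000000000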

This could break somebody's code that depends on this bug, but on the plus side:

a) All of the existing tests pass
b) If it did break somebody's code it would be really easy to fix (either strip the timezones out of your datetime strings, as sketched below, or just accept the fact that the code is now properly converting them to UTC)
c) As this code gets more complicated (like with the potential changes to speed it up), keeping this inconsistency is just going to make everything more confusing, because the bug will be triggered in less predictable ways
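
A minimal sketch of the workaround in (b), assuming offsets of the +HH:MM or +HHMM form (the strip_tz_offset helper and its regex are illustrative, not part of pandas):

import re

def strip_tz_offset(s):
    # Drop a trailing +HH:MM / +HHMM style offset from a datetime string.
    return re.sub(r'\s*[+-]\d{2}:?\d{2}$', '', s)

strip_tz_offset('01-01-2013 08:00:00+08:00')  # -> '01-01-2013 08:00:00'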

@ghost

ghost commented Jan 18, 2014

The timestring is not a valid ISO 8601 string, so the behavior might be said to be undefined.
But I see that dateutil.parser.parse does produce the same result for both, and that's fairly
convincing evidence that this is a bug.
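
For reference, a quick check of that claim (assuming dateutil is available):

In [1]: from dateutil import parser

In [2]: parser.parse('01-01-2013 08:00:00+08:00') == parser.parse('2013-01-01T08:00:00+0800')
Out[2]: True  # both parse to 2013-01-01 08:00:00+08:00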

Can you show an example where another code path in pandas parses this differently from
to_datetime()?

I'm half convinced this can go into 0.14.0, @jreback?

@ghost

ghost commented Jan 18, 2014

Can you do a test_perf.sh run to check for possible perf impact?

@jreback
Contributor

jreback commented Jan 19, 2014

I agree that this is a bug (the very fact that we have different results for this one case). For the most part this path is not hit very often AFAICT (e.g. you need to have no specified format and the string is not parsed by the ISO 8601 C parser), so other than in our tests I don't see how this is even hit.

@danbirken can you contrive a case where it's actually hit, either in read_csv or in to_datetime?

(and not by directly calling array_to_datetime)

@jreback
Contributor

jreback commented Jan 19, 2014

I guess this proves the point (but this is way esoteric; who specifies dtypes when constructing a Series!)

In [11]: Series(['2013-01-01T08:00:00.000000000+0800','12-31-2012 23:00:00-01:00','01-01-2013 08:00:00+08:00','01-01-2013 08:00:00'],dtype='M8[ns]')
Out[11]: 
0   2013-01-01 00:00:00
1   2012-12-31 23:00:00
2   2013-01-01 08:00:00
3   2013-01-01 08:00:00
dtype: datetime64[ns]

Note that only row 0, the ISO 8601 form, has its +0800 offset applied (converting to midnight UTC); rows 1 and 2 silently drop their offsets.

@danbirken
Contributor Author

The original reason I noticed this bug is that the code path in Timestamp() uses parse_date properly:

In [13]: pd.Timestamp('01-01-2013 08:00:00+08:00').value
Out[13]: 1356998400000000000

In [14]: pd.to_datetime('01-01-2013 08:00:00+08:00').value
Out[14]: 1357027200000000000

(The two values differ by exactly 28800 × 10^9 ns, i.e. the 8-hour offset that to_datetime drops.)

Here is a read_csv case that gives the wrong result. Contents of /tmp/a.csv:

timestamp,data
01-01-2013 08:00:00+08:00,1
01-01-2013 09:00:00+08:00,2
01-01-2013 10:00:00+08:00,3

Old (timezone offset is ignored):

In [12]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[12]:
                     data
timestamp
2013-01-01 08:00:00     1
2013-01-01 09:00:00     2
2013-01-01 10:00:00     3

Fixed (timezone offset used):

In [2]: pd.read_csv('/tmp/a.csv', index_col=0, parse_dates=True)
Out[2]:
                     data
timestamp
2013-01-01 00:00:00     1
2013-01-01 01:00:00     2
2013-01-01 02:00:00     3
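
The same reproduction works without a temp file (a sketch, assuming Python 3-style io.StringIO):

In [3]: from io import StringIO

In [4]: data = ('timestamp,data\n'
   ...:         '01-01-2013 08:00:00+08:00,1\n'
   ...:         '01-01-2013 09:00:00+08:00,2\n'
   ...:         '01-01-2013 10:00:00+08:00,3\n')

In [5]: pd.read_csv(StringIO(data), index_col=0, parse_dates=True)  # same frame as Out[2] above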

I'd actually imagine this gets triggered quite a bit, since people use pandas to import log data a lot (which will have timestamps in weird formats, probably including a tz offset).

Running the ./test_perf.sh stuff now...

@ghost

ghost commented Jan 19, 2014

I'm sold on the first example. The rest are just cases where a strange timestamp doesn't
get picked up correctly; that's not a guarantee we're bound by (although we'd like to match
the dateutil parser in general).

Start of 0.14, if no objections, @jreback?

@jreback
Contributor

jreback commented Jan 19, 2014

Technically this is a bug fix, but I agree that people may actually be relying on this.
0.14 it is, then.

@jreback
Contributor

jreback commented Jan 19, 2014

@danbirken this does not close #3844, correct?

@danbirken
Contributor Author

Well, I can't get ./test_perf.sh to cooperate, but I ran some ad-hoc tests and the result appears to be within 1% of the previous performance. This change swaps one fully Cythonized function for another, so it makes sense that it would still be quite fast.

I admit I don't fully understand what is going on in that particular issue (#3844), so I really have no idea if this is related.

FYI: I made a small update because I saw that convert_to_tsobject already calls check_dts_bounds, so I got rid of the duplicate call.

@danbirken
Contributor Author

Oh I see, you probably mean #3944.

I think that issue should be closed as "will not fix", and this is really a different bug (though it is related). I can make another GH issue for this change if you would like for tracking purposes.

@jreback
Contributor

jreback commented Jan 19, 2014

no that's fine

@ghost

ghost commented Jan 19, 2014

@danbirken, please open an issue on test_perf if there is one and cc me.

@ghost

ghost commented Feb 4, 2014

banzai

@ghost ghost merged commit 8a60158 into pandas-dev:master Feb 4, 2014