Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.23.4 changed read_csv parsing for a mixed-timezone datetimes #24987

Closed
TomAugspurger opened this issue Jan 28, 2019 · 15 comments

Comments

Projects
None yet
6 participants
@TomAugspurger
Copy link
Contributor

commented Jan 28, 2019

Previously, a column in a CSV with mixed timezones would (I think) convert each value to UTC and discard the tzinfo.

import pandas as pd
import io

content = """\
Sat, 22 Apr 2017 15:11:58 -0500
Fri, 21 Apr 2017 14:20:57 -0500
Thu, 9 Mar 2017 11:15:30 -0600"""

df = pd.read_csv(io.StringIO(content), parse_dates=True, header=None, names=['day', 'datetime'], index_col='datetime')

On 0.23.4 that's

In [7]: df.index
Out[7]:
DatetimeIndex(['2017-04-22 20:11:58', '2017-04-21 19:20:57',
               '2017-03-09 17:15:30'],
              dtype='datetime64[ns]', name='datetime', freq=None)

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
       2017-03-09 11:15:30-06:00],
      dtype='object', name='datetime')

I'm not sure what the expected behavior is here, but I think the old behavior is as good as any.

I haven't verified, but #22380 seems like a likely candidate for introducing the change.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

cc @jbrockmendel, @mroeschke, @gfyoung, @swyoon if you have thoughts here.

@TomAugspurger TomAugspurger referenced this issue Jan 28, 2019

Merged

adapt to pandas changes #8588

2 of 2 tasks complete
@mroeschke

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

Appears this was an intentional change as described by the original issue: #22256 (comment)

Also AFAICT, the coercion to UTC behavior before was not documented in read_csv(?). In general, I am keen on preserving the distinct UTC offsets instead of coercing to UTC as it leads to less surprises.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

I don't see anything in the original issue #22256 (comment) about mixed timezones.

So to clarify, what's the expected output of

In [8]: pd.read_csv(io.StringIO('a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'), parse_dates=['a']).a

In 0.23.4, that's

Out[4]:
0   1999-12-31 18:00:00
1   1999-12-31 19:00:00
Name: a, dtype: datetime64[ns]

and in 0.24.0 that's

Out[8]:
0    2000-01-01 00:00:00+06:00
1    2000-01-01 00:00:00+05:00
Name: a, dtype: object

If we want the 0.24.0 behavior, do we have an alternative recommendation for "parse this mess into a column of datetimes"? Something like

pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True))

If that's the recommendation going forward, I can submit a PR updating the release note for 0.24 to note the breaking change, and an example with mixed timezones.

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

FWIW I see two upsides to not-converting to UTC. First, conversion is lossy; mixed timezones is a pretty weird special case and if it is intentional, we preserve it. Second, I'm pretty sure this is consistent with to_datetime's treatment of strings, which is a plus.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

Actually, does date_parser operate row-wise? We probably don't want to recommend that. The alternative would be something like

date_columns = ['a']
df = pd.read_csv(...)
df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

so read it in as strings, and then convert later?

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

mixed timezones is a pretty weird special case and if it is intentional, we preserve it.

I would distinguish mixed timezones in a storage format like CSV from our internal representation. IMO, it's important to support easily ingesting this kind of data.

It does appear to be consistent with to_datetime

In [6]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'])
Out[6]: Index([2000-01-01 00:00:00+06:00, 2000-01-01 00:00:00+05:00], dtype='object')

In [7]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'], utc=True)
Out[7]: DatetimeIndex(['1999-12-31 18:00:00+00:00', '1999-12-31 19:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)
@mroeschke

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

Yeah I'd recommend post processing via apply (or to_datetime directly if it's just one column).

The to_datetime example above was recently fixed in 0.24.0 as well. http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 28, 2019

@gfyoung

This comment has been minimized.

Copy link
Member

commented Jan 28, 2019

I'm a little late to the discussion, but I do agree with the general sentiment here. I think a documentation change will suffice.

Actually, does date_parser operate row-wise

@TomAugspurger : The behavior of date_parser varies depending on how parse_dates is passed in. However, IIUC it does not operate "row-wise" as I believe you are implying.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Jan 29, 2019

Thanks @gfyoung, that's the recommendation I went with in #24989.

@jreback jreback added this to the 0.24.1 milestone Jan 29, 2019

@hendrics

This comment has been minimized.

Copy link

commented Mar 18, 2019

Sorry, too late to the discussion but we've just update to 0.24.2 from 0.23.4... It breaks a lot of our code which was relying on reading back 'UTC' version for years.
This is the sainest approach and date_parser=lambda x: pd.to_datetime(x, utc=True) should be the default really and not give back a set of object strings?

Having the fixed offset which most likely is wrong due to DST transitions serves no purpose unless you are localising to a known timezone using IANA timezone name.

@jbrockmendel

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

This is the sainest approach and date_parser=lambda x: pd.to_datetime(x, utc=True) should be the default really and not give back a set of object strings?

My understanding is that ATM it gives back an array of Timestamp objects, not strings. Is it actually giving an array of strings?

This was changed because the all-utc version drops information. If a user has a mixed-timezone array, the default is to assume it is intentional.

@hendrics

This comment has been minimized.

Copy link

commented Mar 18, 2019

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
      2017-03-09 11:15:30-06:00],
     dtype='object', name='datetime')

from @TomAugspurger example above - the dtype is object - but yeah the individual members of the index are of type datetime.datetime with different offset... not the Timestamp objects per se... and not the strings as I've implied earlier... my bad...

But my point is that "mixed timezones" or mixed timezone offsets is very common for any location with DST transitions... half of the year they are going to be different from the other half... but it is normal... What it was doing before was ok except that UTC timezone was not set... My point is it is better default behaviour, and it already does it for +00:00 offsets.... if I have +00:00 for all the timestamps it slaps UTC timezone on the index... And that is probably the right default thing especially given that it was doing the same before without slapping a timezone...

Also timezone AINA string is what I think when we talk about timezones as it is unambiguous... yet timezone offset does not give me this information anyway, only how to convert to UTC.

I could have 2 AINA timezone producing exactly the same offset strings and hence timestamps for half a year and be wrong the other half of the year or having the DST transition on different days.

@gfyoung

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

But my point is that "mixed timezones" or mixed timezone offsets is very common for any location with DST transitions

That's true, but coercion to UTC is imposing an unnecessary transformation on datetime objects.
In addition, the behavior was not documented. From a design standpoint, we aim to give the end user their CSV data in as unadulterated form as possible because after all, the function is called read_csv. Nowhere in the name implies that we perform transformations on your data.

Now straddling the DST transition point has historically been a tricky spot for datetime indices on our end, so making a unilateral decision on how to address them would have been unusual, to say the least.

So from those standpoints, I stand by our decision to change the behavior as we did for 0.24.0.

@hendrics

This comment has been minimized.

Copy link

commented Mar 19, 2019

I guess my issue is that I do not get a DatetimeIndex any more if I have mixed timestamps....

on a side note there's an amazing difference in speed...

buf = io.StringIO()
pd.DataFrame(
    index=pd.date_range(
        start='2015-03-10T00:00', 
        end='2020-03-12T00:00', 
        tz='America/Havana', 
        freq='1H'
      )
).to_csv(buf)


%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], infer_datetime_format=True)
4.78 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], date_parser=lambda x: pd.to_datetime(x, utc=True))
750 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it seems using pd.read_csv(..., date_parser=pd.to_datetime(x, utc=True)) is beneficial for the speed performance...

This difference is only present for a mixed timezone offset. When timezone offset is '00:00' I didn't observer any difference between different modes and it is around 160ms for the same size data...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.