
0.23.4 changed read_csv parsing for a mixed-timezone datetimes #24987

Closed
TomAugspurger opened this issue Jan 28, 2019 · 15 comments
Labels: IO CSV (read_csv, to_csv), Timezones (Timezone data dtype)
Milestone: 0.24.1
Comments

@TomAugspurger (Contributor)

Previously, a column in a CSV with mixed timezones would (I think) convert each value to UTC and discard the tzinfo.

import pandas as pd
import io

content = """\
Sat, 22 Apr 2017 15:11:58 -0500
Fri, 21 Apr 2017 14:20:57 -0500
Thu, 9 Mar 2017 11:15:30 -0600"""

df = pd.read_csv(
    io.StringIO(content),
    parse_dates=True,
    header=None,
    names=['day', 'datetime'],
    index_col='datetime',
)

On 0.23.4 that's

In [7]: df.index
Out[7]:
DatetimeIndex(['2017-04-22 20:11:58', '2017-04-21 19:20:57',
               '2017-03-09 17:15:30'],
              dtype='datetime64[ns]', name='datetime', freq=None)

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
       2017-03-09 11:15:30-06:00],
      dtype='object', name='datetime')

I'm not sure what the expected behavior is here, but I think the old behavior is as good as any.

I haven't verified, but #22380 seems like a likely candidate for introducing the change.

@TomAugspurger added the IO CSV and Timezones labels on Jan 28, 2019
@TomAugspurger (Contributor, Author)

cc @jbrockmendel, @mroeschke, @gfyoung, @swyoon if you have thoughts here.

@mroeschke (Member) commented Jan 28, 2019

Appears this was an intentional change, as described in the original issue: #22256 (comment)

Also, AFAICT the coerce-to-UTC behavior was not previously documented in read_csv(?). In general, I am keen on preserving the distinct UTC offsets instead of coercing to UTC, as it leads to fewer surprises.

@TomAugspurger (Contributor, Author)

I don't see anything in the original issue #22256 (comment) about mixed timezones.

So to clarify, what's the expected output of

In [8]: pd.read_csv(io.StringIO('a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'), parse_dates=['a']).a

In 0.23.4, that's

Out[4]:
0   1999-12-31 18:00:00
1   1999-12-31 19:00:00
Name: a, dtype: datetime64[ns]

and in 0.24.0 that's

Out[8]:
0    2000-01-01 00:00:00+06:00
1    2000-01-01 00:00:00+05:00
Name: a, dtype: object

If we want the 0.24.0 behavior, do we have an alternative recommendation for "parse this mess into a column of datetimes"? Something like

pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True))

If that's the recommendation going forward, I can submit a PR updating the 0.24 release note to call out the breaking change, with an example using mixed timezones.
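(Editor's note: a minimal, hedged sketch of the workaround discussed above, reusing the mixed-offset strings from this comment. It shows that coercing to UTC with to_datetime and then dropping the tz reproduces the 0.23.4-style naive values.)

```python
import pandas as pd

# Mixed-offset strings from the example above
vals = ["2000-01-01T00:00:00+06:00", "2000-01-01T00:00:00+05:00"]

# Coerce everything to UTC up front, as the suggested date_parser lambda would
utc = pd.to_datetime(pd.Index(vals), utc=True)

# Dropping the tz gives back the 0.23.4-style naive-UTC values
naive = utc.tz_localize(None)
print(naive)
```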

@jbrockmendel (Member)

FWIW, I see two upsides to not converting to UTC. First, conversion is lossy; mixed timezones are a pretty weird special case, and if they are intentional, we preserve them. Second, I'm pretty sure this is consistent with to_datetime's treatment of strings, which is a plus.

@TomAugspurger (Contributor, Author)

Actually, does date_parser operate row-wise? We probably don't want to recommend that. The alternative would be something like

date_columns = ['a']
df = pd.read_csv(...)
df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

so read it in as strings, and then convert later?
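(Editor's note: a runnable sketch of this read-then-convert approach, using made-up sample data with two different UTC offsets; assumes pandas ≥ 0.24.)

```python
import io
import pandas as pd

# Hypothetical CSV with mixed UTC offsets in the datetime column
csv = (
    "day,datetime\n"
    "Sat,2017-04-22 15:11:58 -0500\n"
    "Thu,2017-03-09 11:15:30 -0600\n"
)

# Read the datetime column in as plain strings...
df = pd.read_csv(io.StringIO(csv))

# ...then convert the chosen columns to tz-aware UTC in one pass
date_columns = ["datetime"]
df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)
print(df["datetime"])
```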

@TomAugspurger (Contributor, Author)

mixed timezones is a pretty weird special case and if it is intentional, we preserve it.

I would distinguish mixed timezones in a storage format like CSV from our internal representation. IMO, it's important to support easily ingesting this kind of data.

It does appear to be consistent with to_datetime

In [6]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'])
Out[6]: Index([2000-01-01 00:00:00+06:00, 2000-01-01 00:00:00+05:00], dtype='object')

In [7]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'], utc=True)
Out[7]: DatetimeIndex(['1999-12-31 18:00:00+00:00', '1999-12-31 19:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

@mroeschke (Member)

Yeah, I'd recommend post-processing via apply (or to_datetime directly if it's just one column).

The to_datetime example above was recently fixed in 0.24.0 as well. http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

@TomAugspurger (Contributor, Author) commented Jan 28, 2019 via email

@gfyoung (Member) commented Jan 28, 2019

I'm a little late to the discussion, but I do agree with the general sentiment here. I think a documentation change will suffice.

Actually, does date_parser operate row-wise

@TomAugspurger : The behavior of date_parser varies depending on how parse_dates is passed in. However, IIUC it does not operate "row-wise" as I believe you are implying.

@TomAugspurger (Contributor, Author)

Thanks @gfyoung, that's the recommendation I went with in #24989.

@jreback jreback added this to the 0.24.1 milestone Jan 29, 2019
@hendrics

Sorry, too late to the discussion, but we've just updated to 0.24.2 from 0.23.4, and it breaks a lot of our code, which has relied on reading back the 'UTC' version for years.
Isn't date_parser=lambda x: pd.to_datetime(x, utc=True) the sanest approach? Shouldn't that be the default, rather than giving back a set of object strings?

Having a fixed offset, which is most likely wrong due to DST transitions, serves no purpose unless you are localising to a known timezone using an IANA timezone name.
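(Editor's note: a hedged sketch of that last point: parse to UTC first, then attach a named IANA zone with tz_convert. The sample values reuse the timestamps from the top of the thread, on the assumption that they are US Central times.)

```python
import pandas as pd

vals = ["2017-04-22 15:11:58 -0500", "2017-03-09 11:15:30 -0600"]

# Coerce the mixed offsets to UTC, then convert to a named IANA zone;
# for genuine US Central data this round-trips the original wall-clock times
idx = pd.to_datetime(pd.Index(vals), utc=True).tz_convert("America/Chicago")
print(idx)
```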

@jbrockmendel (Member)

Isn't date_parser=lambda x: pd.to_datetime(x, utc=True) the sanest approach? Shouldn't that be the default, rather than giving back a set of object strings?

My understanding is that ATM it gives back an array of Timestamp objects, not strings. Is it actually giving an array of strings?

This was changed because the all-utc version drops information. If a user has a mixed-timezone array, the default is to assume it is intentional.

@hendrics commented Mar 18, 2019

On 0.24 that's

In [7]: df.index
Out[7]:
Index([2017-04-22 15:11:58-05:00, 2017-04-21 14:20:57-05:00,
      2017-03-09 11:15:30-06:00],
     dtype='object', name='datetime')

From @TomAugspurger's example above, the dtype is object, but the individual members of the index are of type datetime.datetime with different offsets: not Timestamp objects per se, and not strings as I implied earlier. My bad.

But my point is that "mixed timezones", or mixed timezone offsets, are very common for any location with DST transitions: for half of the year the offsets differ from the other half, and that is normal. What read_csv did before was OK, except that the UTC timezone was not set. My point is that coercing is the better default behaviour, and read_csv already does it for +00:00 offsets: if all the timestamps have +00:00, it puts a UTC timezone on the index. That is probably the right default, especially given that it did the same before, just without attaching a timezone.

Also, an IANA timezone string is what I think of when we talk about timezones, as it is unambiguous; a timezone offset does not give me that information, only how to convert to UTC.

I could have two IANA timezones producing exactly the same offset strings, and hence the same timestamps, for half the year, and be wrong for the other half, or have their DST transitions fall on different days.
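(Editor's note: a small illustration of this ambiguity. In early 2017, America/Chicago and America/Mexico_City shared the -06:00 standard offset, but their DST transitions fell on different Sundays, so for a window in March the same date maps to different offsets. The exact dates depend on the tz database, so treat them as assumptions.)

```python
import pandas as pd

# Same offset in mid-winter; different offsets between the two zones'
# 2017 DST start dates (US: March 12, Mexico: April 2)
for day in ["2017-01-15", "2017-03-20"]:
    chi = pd.Timestamp(day, tz="America/Chicago").utcoffset()
    mex = pd.Timestamp(day, tz="America/Mexico_City").utcoffset()
    print(day, chi, mex)
```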

@gfyoung (Member) commented Mar 18, 2019

But my point is that "mixed timezones" or mixed timezone offsets is very common for any location with DST transitions

That's true, but coercion to UTC imposes an unnecessary transformation on datetime objects.
In addition, the behavior was not documented. From a design standpoint, we aim to give the end user their CSV data in as unadulterated a form as possible; after all, the function is called read_csv. Nothing in the name implies that we perform transformations on your data.

Now, straddling the DST transition point has historically been a tricky spot for datetime indices on our end, so making a unilateral decision on how to handle it would have been unusual, to say the least.

So from those standpoints, I stand by our decision to change the behavior as we did for 0.24.0.

@hendrics commented Mar 19, 2019

I guess my issue is that I no longer get a DatetimeIndex when I have mixed timestamps.

On a side note, there's a striking difference in speed:

buf = io.StringIO()
pd.DataFrame(
    index=pd.date_range(
        start='2015-03-10T00:00', 
        end='2020-03-12T00:00', 
        tz='America/Havana', 
        freq='1H'
      )
).to_csv(buf)


%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], infer_datetime_format=True)
4.78 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], date_parser=lambda x: pd.to_datetime(x, utc=True))
750 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it seems using pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True)) is beneficial for speed as well.

This difference only shows up with mixed timezone offsets. When the offset is '+00:00' everywhere, I didn't observe any difference between the two modes; both take around 160 ms for the same amount of data.
