pd.concat reset the tz-aware index to UTC #18523

Closed
antvig opened this Issue Nov 27, 2017 · 9 comments

4 participants
@antvig

antvig commented Nov 27, 2017

Hi,
I am reopening #18422 with a copy-pastable example.

Code Sample, a copy-pastable example if possible

import pandas as pd

idx1 = pd.date_range('2011-01-01', periods=3, freq='H', tz='Europe/Paris')
idx2 = pd.date_range(start=idx1[0], end=idx1[-1], freq='H')

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=idx1)
df2 = pd.DataFrame({'b': [1, 2, 3]}, index=idx2)

res = pd.concat([df1, df2], axis=1)

print(df1.index.tzinfo)
print(df2.index.tzinfo)
print(res.index.tzinfo)

Output

Europe/Paris
Europe/Paris
UTC

Problem description

pd.concat resets the tz-aware index to UTC.

This seems to come from two different representations of the time zone 'Europe/Paris', as this IPython output shows:

In [23]: df1.index.tzinfo
Out[23]: <DstTzInfo 'Europe/Paris' LMT+0:09:00 STD>

In [24]: df2.index.tzinfo
Out[24]: <DstTzInfo 'Europe/Paris' CET+1:00:00 STD>

In [26]: res.index.tzinfo
Out[26]: <UTC>

Expected Output

Europe/Paris
Europe/Paris
Europe/Paris

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung

Member

gfyoung commented Nov 27, 2017

How odd indeed! This example is indeed reproducible on my end this time around.

Have a look and see why we're getting this discrepancy in timezone representation. PR is most welcome!

@antvig

antvig commented Nov 28, 2017

The issue comes from DatetimeIndex: when a localized DatetimeIndex is created, the index itself and the Timestamps it contains have different tzinfo objects.

In [9]: idx = pd.DatetimeIndex(start='2011-01-01', periods=3, freq='H', tz='Europe/Paris')

In [10]: idx.tzinfo
Out[10]: <DstTzInfo 'Europe/Paris' LMT+0:09:00 STD>

In [11]: idx[0].tzinfo
Out[11]: <DstTzInfo 'Europe/Paris' CET+1:00:00 STD>
@jreback

Contributor

jreback commented Nov 29, 2017

@antvig this is a bug, probably a duplicate of something in : https://github.com/pandas-dev/pandas/issues?q=is%3Aopen+is%3Aissue+label%3AReshaping+label%3ATimezones

but not because of what you show: #18523 (comment)

Rather, the combined index is reset to UTC instead of preserving the incoming tz.

i'll mark this.

@antvig

antvig commented Nov 29, 2017

The combined index preserves the incoming tz if the two DataFrame indexes have exactly the same timezone:

idx1 = pd.date_range('2011-01-01', periods=3, freq='H', tz='Europe/Paris')
idx2 = pd.date_range('2011-01-01', periods=3, freq='H', tz='Europe/Paris')

df1 = pd.DataFrame({'a': [1, 2, 3]}, index=idx1)
df2 = pd.DataFrame({'b': [1, 2, 3]}, index=idx2)
res = pd.concat([df1, df2], axis=1)

print(df1.index.tzinfo)
print(df2.index.tzinfo)
print(res.index.tzinfo)

Output

Europe/Paris
Europe/Paris
Europe/Paris
@jorisvandenbossche

Member

jorisvandenbossche commented Nov 29, 2017

Seems related to #17572

@jreback

Contributor

jreback commented Nov 29, 2017

@jorisvandenbossche see my comment above, #17572 is irrelevant

@jorisvandenbossche

Member

jorisvandenbossche commented Nov 29, 2017

And how do you explain #18523 (comment)?

@jorisvandenbossche

Member

jorisvandenbossche commented Nov 29, 2017

This is due to _maybe_utc_convert which checks the timezones by equality:

    if self.tz != other.tz:
        this = self.tz_convert('UTC')
        other = other.tz_convert('UTC')
return this, other

As explained in #17572, such an equality check does not always return True for tzinfo objects that represent the same timezone; the result depends on how each tzinfo object was constructed (#17572 links to this SO question).
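
For illustration, here is a minimal sketch of that behaviour using pytz directly, without pandas. It assumes the indexes carry pytz tzinfo objects, as the DstTzInfo reprs earlier in this thread suggest; the variable names are made up for the example. The object returned by pytz.timezone() and the one attached to a localized datetime are distinct instances, so comparing them with == or != falls back to identity, while their string form is the same zone name.

```python
from datetime import datetime

import pytz

# The zone object as returned by the pytz registry.
zone_tz = pytz.timezone('Europe/Paris')

# The tzinfo attached to a concrete localized datetime in that zone.
stamp_tz = zone_tz.localize(datetime(2011, 1, 1)).tzinfo

# pytz tzinfo classes do not define __eq__, so == is an identity check.
objects_equal = zone_tz == stamp_tz

# Both objects report the same zone name when converted to str.
names_equal = str(zone_tz) == str(stamp_tz)

print(repr(zone_tz))
print(repr(stamp_tz))
print(objects_equal, names_equal)
```

Under these assumptions the two objects compare unequal even though both name 'Europe/Paris', which is the condition that sends the join down the UTC conversion branch.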

@pandas-dev pandas-dev deleted a comment from jreback Nov 29, 2017

@jreback

Contributor

jreback commented Dec 1, 2017

here is a patch to fix:

diff --git a/pandas/core/indexes/datetimes.py b/pandas/core/indexes/datetimes.py
index 196c881f9..01229e042 100644
--- a/pandas/core/indexes/datetimes.py
+++ b/pandas/core/indexes/datetimes.py
@@ -1138,7 +1138,7 @@ class DatetimeIndex(DatelikeOps, TimelikeOps, DatetimeIndexOpsMixin,
                 raise TypeError('Cannot join tz-naive with tz-aware '
                                 'DatetimeIndex')
 
-            if self.tz != other.tz:
+            if str(self.tz) != str(other.tz):
                 this = self.tz_convert('UTC')
                 other = other.tz_convert('UTC')
         return this, other

thanks to @pganssle for the advice!
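
To make the effect of the one-line change easier to see, here is a standalone sketch that uses only the standard library. ZoneVariant is a hypothetical stand-in for a pytz-style tzinfo (one instance per offset within a zone), and the two helper functions are illustrative names, not pandas API; they only mirror the condition before and after the patch.

```python
from datetime import timedelta, tzinfo


class ZoneVariant(tzinfo):
    """Hypothetical stand-in for a pytz-style tzinfo: one instance per offset."""

    def __init__(self, zone, offset_minutes):
        self.zone = zone
        self._offset = timedelta(minutes=offset_minutes)

    def utcoffset(self, dt):
        return self._offset

    def dst(self, dt):
        return timedelta(0)

    def tzname(self, dt):
        return self.zone

    def __str__(self):
        # Like pytz, the string form is just the zone name.
        return self.zone


def converts_to_utc_before_patch(tz_a, tz_b):
    # No __eq__ is defined, so != compares object identity.
    return tz_a != tz_b


def converts_to_utc_after_patch(tz_a, tz_b):
    # Compare zone names instead of tzinfo instances.
    return str(tz_a) != str(tz_b)


paris_lmt = ZoneVariant('Europe/Paris', 9)
paris_cet = ZoneVariant('Europe/Paris', 60)
london = ZoneVariant('Europe/London', 0)

print(converts_to_utc_before_patch(paris_lmt, paris_cet))  # True
print(converts_to_utc_after_patch(paris_lmt, paris_cet))   # False
print(converts_to_utc_after_patch(paris_lmt, london))      # True
```

With the name-based check, two variants of the same zone no longer trigger the conversion, while genuinely different zones still do.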
