Lost timezone after groupby transform #24198

Closed
pocin opened this issue Dec 10, 2018 · 9 comments

Comments

@pocin commented Dec 10, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
                  'id': [1]})

df['max_end_time'] = df.groupby('id').end_time.transform(max)

df.info() shows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
end_time        1 non-null datetime64[ns, Asia/Singapore]
id              1 non-null int64
max_end_time    1 non-null datetime64[ns]
dtypes: datetime64[ns, Asia/Singapore](1), datetime64[ns](1), int64(1)
memory usage: 104.0 bytes

df.to_dict() shows

{'end_time': {0: Timestamp('2018-12-10 17:08:52.630644+0800', tz='Asia/Singapore')},
 'id': {0: 1},
 'max_end_time': {0: Timestamp('2018-12-10 09:08:52.630644')}}

Problem description

The timezone is silently dropped and the timestamp converted to UTC after a groupby-transform operation on a tz-aware datetime column.

Expected Output

assert df['end_time'].equals(df['max_end_time'])
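A self-contained version of that check, using Series.equals for the comparison and a fixed timestamp instead of 'now' for reproducibility; it passes on pandas versions where this bug is fixed:

```python
import pandas as pd

# Single-row frame with a tz-aware column, as in the report.
df = pd.DataFrame(
    {"end_time": [pd.Timestamp("2018-12-10 17:08:52", tz="Asia/Singapore")],
     "id": [1]}
)
df["max_end_time"] = df.groupby("id").end_time.transform("max")

# Expected: the transformed column keeps its timezone and matches the input.
assert str(df["max_end_time"].dtype) == "datetime64[ns, Asia/Singapore]"
assert df["end_time"].equals(df["max_end_time"])
```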

Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.85-38.58.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

```
@TomAugspurger (Contributor) commented Dec 10, 2018

This is working correctly on my branch at #24024, but not on master.

In [1]: %paste
import pandas as pd

df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
                  'id': [1]})

df['max_end_time'] = df.groupby('id').end_time.transform(max)
## -- End pasted text --

In [2]: df.dtypes
Out[2]:
end_time        datetime64[ns, Asia/Singapore]
id                                       int64
max_end_time    datetime64[ns, Asia/Singapore]
dtype: object

In [3]: df
Out[3]:
                          end_time  id                     max_end_time
0 2018-12-10 23:26:55.386177+08:00   1 2018-12-10 15:26:55.386177+08:00

I'll add this as a test case.

@TomAugspurger (Contributor) commented Dec 10, 2018

Ahh... but it isn't correct over there either. The value is wrong.

@WillAyd (Member) commented Dec 10, 2018

Yeah, I'm not terribly surprised by this, as a lot of the cythonized groupby operations use the float value of timestamps for comparisons and may never cast back.

Can such ops be generalized when working with tz-aware data? Is it ever ambiguous if crossing time zones to do such a thing?

If it makes sense, then I'm wondering if this is something that an EA needs to handle appropriately.

@WillAyd (Member) commented Dec 10, 2018

Potential duplicate or at least related to #21603

@mroeschke (Member) commented Dec 10, 2018

@WillAyd Since it appears that the groupby routines convert to UTC first before running the aggregates, it should be straightforward to ensure that tz-aware columns undergo a tz_localize('UTC').tz_convert(tz) after the cython routine. We shouldn't need to worry about ambiguous/nonexistent times.
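The round trip described above can be sketched directly; the timezone here is just the one from the report:

```python
import pandas as pd

tz = "Asia/Singapore"
aware = pd.Series([pd.Timestamp("2018-12-10 17:08:52", tz=tz)])

# What the cython groupby routine effectively operates on:
# naive datetime64[ns] values expressed in UTC.
naive_utc = aware.dt.tz_convert("UTC").dt.tz_localize(None)

# The proposed post-processing step: re-attach UTC, then convert back.
restored = naive_utc.dt.tz_localize("UTC").dt.tz_convert(tz)

assert restored.equals(aware)
```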

@WillAyd (Member) commented Dec 10, 2018

Makes sense. Do we want the groupby ops to handle this special casing, though, or is this a better fit in the EA implementation? The former would probably be less work atm and provide better performance, though the latter would probably fit better into our programming paradigm going forward.

@mroeschke (Member) commented Dec 10, 2018

I don't have a clear picture of how generically the groupby ops will interface with the EA in this case. I don't think we'd want to do timezone conversion within EA internals (timezone localization would be okay, though), i.e. EA(utc --> tz_local) vs EA(utc).tz_convert(tz_local).

I think the most important piece is that it works generically across all groupby ops and methods.

@WillAyd (Member) commented Dec 10, 2018

The more I think about this I'm not sure it can actually be supported generically, at least not for operations that reduce. If we had two timestamps with the same UTC time but in different timezones which timezone would max return?

should be as straightforward to ensure that tz-aware columns undergo a tz_localize('UTC').tz_convert(tz)

This would assume that the input always has the same timezone, no?
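A quick illustration of that ambiguity (the values are hypothetical): two timestamps can be equal as instants while carrying different timezones, so a reducing op like max has no principled way to pick one timezone.

```python
import pandas as pd

# Same UTC instant, two different timezones.
a = pd.Timestamp("2018-12-10 09:00", tz="UTC")
b = a.tz_convert("Asia/Singapore")

assert a == b  # equal as instants
# Python's max returns the first of equal values, so the
# timezone of the result depends only on argument order.
assert max(a, b) is a
assert max(b, a) is b
```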

@mroeschke (Member) commented Dec 10, 2018

Timestamps with different timezones would exist in an object dtype column, but yeah I am not sure how the max dispatch with object dtype would handle that scenario since they would be equivalent. But for more common ops, the timestamps should have the same timezone.

I was making a naive assumption that .values would be called before passing to the groupby cython routines, and .values would be returning naive timestamps but in UTC. I was also assuming the tz_convert step needs to happen per column since the result can have multiple timezones potentially.
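That .values assumption is easy to check; on a tz-aware Series, .values returns naive datetime64[ns] values converted to UTC, with the timezone metadata dropped:

```python
import numpy as np
import pandas as pd

tz_series = pd.Series([pd.Timestamp("2018-12-10 17:08:52", tz="Asia/Singapore")])

# .values drops the timezone and converts to UTC before any
# cython routine would see the data.
raw = tz_series.values
assert raw.dtype == np.dtype("datetime64[ns]")
assert raw[0] == np.datetime64("2018-12-10T09:08:52")  # 17:08:52 +08:00 in UTC
```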
