Lost timezone after groupby transform #24198

Closed
pocin opened this issue Dec 10, 2018 · 9 comments

Comments

@pocin commented Dec 10, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
                  'id': [1]})

df['max_end_time'] = df.groupby('id').end_time.transform(max)

df.info() shows

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 3 columns):
end_time        1 non-null datetime64[ns, Asia/Singapore]
id              1 non-null int64
max_end_time    1 non-null datetime64[ns]
dtypes: datetime64[ns, Asia/Singapore](1), datetime64[ns](1), int64(1)
memory usage: 104.0 bytes

df.to_dict() shows

{'end_time': {0: Timestamp('2018-12-10 17:08:52.630644+0800', tz='Asia/Singapore')},
 'id': {0: 1},
 'max_end_time': {0: Timestamp('2018-12-10 09:08:52.630644')}}

Problem description

The timezone is silently dropped and the timestamp converted to UTC after a groupby-transform operation on a tz-aware datetime column.

Expected Output

assert df['end_time'].equals(df['max_end_time'])
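A self-contained version of that check, using Series.equals for the comparison and a fixed timestamp instead of 'now' for reproducibility; it passes on pandas versions where this bug is fixed:

```python
import pandas as pd

# Single-row frame with a tz-aware column, as in the report.
df = pd.DataFrame(
    {"end_time": [pd.Timestamp("2018-12-10 17:08:52", tz="Asia/Singapore")],
     "id": [1]}
)
df["max_end_time"] = df.groupby("id").end_time.transform("max")

# Expected: the transformed column keeps its timezone and matches the input.
assert str(df["max_end_time"].dtype) == "datetime64[ns, Asia/Singapore]"
assert df["end_time"].equals(df["max_end_time"])
```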

Output of pd.show_versions()

```
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.85-38.58.amzn1.x86_64
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.10.0
pip: 18.1
setuptools: 40.5.0
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.13
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

```
@TomAugspurger (Contributor) commented Dec 10, 2018

This is working correctly on my branch at #24024, but not on master.

In [1]: %paste
import pandas as pd

df = pd.DataFrame({'end_time': [pd.to_datetime('now', utc=True).tz_convert('Asia/Singapore')],
                  'id': [1]})

df['max_end_time'] = df.groupby('id').end_time.transform(max)
## -- End pasted text --

In [2]: df.dtypes
Out[2]:
end_time        datetime64[ns, Asia/Singapore]
id                                       int64
max_end_time    datetime64[ns, Asia/Singapore]
dtype: object

In [3]: df
Out[3]:
                          end_time  id                     max_end_time
0 2018-12-10 23:26:55.386177+08:00   1 2018-12-10 15:26:55.386177+08:00

I'll add this as a test case.

@TomAugspurger (Contributor) commented Dec 10, 2018

Ahh... but it isn't correct over there either. The value is wrong.

@WillAyd (Member) commented Dec 10, 2018

Yeah, I'm not terribly surprised by this, as a lot of the cythonized groupby operations use the float value of timestamps for comparisons and may never cast back.

Can such ops be generalized when working with tz-aware data? Is it ever ambiguous if crossing time zones to do such a thing?

If it makes sense, then I'm wondering if this is something that an EA needs to handle appropriately.

@WillAyd (Member) commented Dec 10, 2018

Potential duplicate or at least related to #21603

@mroeschke (Member) commented Dec 10, 2018

@WillAyd Since it appears that the groupby routines convert to UTC first before running the aggregates, it should be straightforward to ensure that tz-aware columns undergo a tz_localize('UTC').tz_convert(tz) after the cython routine. We shouldn't need to worry about ambiguous/nonexistent times.
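The round trip described above can be sketched directly; the timezone here is just the one from the report:

```python
import pandas as pd

tz = "Asia/Singapore"
aware = pd.Series([pd.Timestamp("2018-12-10 17:08:52", tz=tz)])

# What the cython groupby routine effectively operates on:
# naive datetime64[ns] values expressed in UTC.
naive_utc = aware.dt.tz_convert("UTC").dt.tz_localize(None)

# The proposed post-processing step: re-attach UTC, then convert back.
restored = naive_utc.dt.tz_localize("UTC").dt.tz_convert(tz)

assert restored.equals(aware)
```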

@WillAyd (Member) commented Dec 10, 2018

Makes sense. Do we want the groupby ops to handle this special casing, though, or is this a better fit in the EA implementation? The former would probably be less work atm and provide better performance, though the latter would probably fit better into our programming paradigm going forward.

@mroeschke (Member) commented Dec 10, 2018

I don't have a clear picture of how generically the groupby ops will interface with the EA in this case. I don't think we'd want to do timezone conversion within EA internals (timezone localization would be okay, though), i.e. EA(utc --> tz_local) vs EA(utc).tz_convert(tz_local).

I think the most important piece is that it works generically across all groupby ops and methods.

@WillAyd (Member) commented Dec 10, 2018

The more I think about this I'm not sure it can actually be supported generically, at least not for operations that reduce. If we had two timestamps with the same UTC time but in different timezones which timezone would max return?

should be as straightforward to ensure that tz-aware columns undergo a tz_localize('UTC').tz_convert(tz)

This would assume that the input always has the same timezone, no?
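A quick illustration of that ambiguity (the values are hypothetical): two timestamps can be equal as instants while carrying different timezones, so a reducing op like max has no principled way to pick one timezone.

```python
import pandas as pd

# Same UTC instant, two different timezones.
a = pd.Timestamp("2018-12-10 09:00", tz="UTC")
b = a.tz_convert("Asia/Singapore")

assert a == b  # equal as instants
# Python's max returns the first of equal values, so the
# timezone of the result depends only on argument order.
assert max(a, b) is a
assert max(b, a) is b
```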

@mroeschke (Member) commented Dec 10, 2018

Timestamps with different timezones would exist in an object dtype column, but yeah I am not sure how the max dispatch with object dtype would handle that scenario since they would be equivalent. But for more common ops, the timestamps should have the same timezone.

I was making a naive assumption that .values would be called before passing to the groupby cython routines, and .values would be returning naive timestamps but in UTC. I was also assuming the tz_convert step needs to happen per column since the result can have multiple timezones potentially.
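That .values assumption is easy to check; on a tz-aware Series, .values returns naive datetime64[ns] values converted to UTC, with the timezone metadata dropped:

```python
import numpy as np
import pandas as pd

tz_series = pd.Series([pd.Timestamp("2018-12-10 17:08:52", tz="Asia/Singapore")])

# .values drops the timezone and converts to UTC before any
# cython routine would see the data.
raw = tz_series.values
assert raw.dtype == np.dtype("datetime64[ns]")
assert raw[0] == np.datetime64("2018-12-10T09:08:52")  # 17:08:52 +08:00 in UTC
```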
