BUG: timezone lost in groupby-agg with cython functions #15426

Closed
munierSalem opened this Issue Feb 16, 2017 · 10 comments

munierSalem commented Feb 16, 2017 edited by jreback

xref #10668 (for more examples)

Hello!

I'm running into some odd behavior when grouping rows of a pandas DataFrame by ID and then selecting out max/min datetimes (with timezones). This is with Python 2.7, pandas 0.18.1, and numpy 1.11.1 (I saw in earlier posts that a similar problem was apparently fixed with pandas 0.15).

Specifically, if I try:
print orders.groupby('OrderID')['start_time'].agg(np.min).iloc[:5]

I get:

OrderID
O161101XVS100000044   2016-11-01 12:03:12.920000-04:00
O161101XVS100000047   2016-11-01 12:03:36.693000-04:00
O161101XVS100000098   2016-11-01 12:09:08.330000-04:00
O161101XVS100000122   2016-11-01 12:09:59.950000-04:00
O161101XVS100000152   2016-11-01 12:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

The raw data had times closer to 8 am (US/Eastern). In other words, the values reverted to the UTC wall times, even though the dtype says they are Eastern times and carries the UTC-4 offset.

But if I instead try:
print orders.groupby('OrderID')['start_time'].agg(lambda x: np.min(x)).iloc[:5]

I now get:

OrderID
O161101XVS100000044   2016-11-01 08:03:12.920000-04:00
O161101XVS100000047   2016-11-01 08:03:36.693000-04:00
O161101XVS100000098   2016-11-01 08:09:08.330000-04:00
O161101XVS100000122   2016-11-01 08:09:59.950000-04:00
O161101XVS100000152   2016-11-01 08:11:29.790000-04:00
Name: start_time, dtype: datetime64[ns, US/Eastern]

This is the behavior I intended. The second method is vastly slower, though, and I would have assumed the two approaches would yield identical results ...

TomAugspurger commented Feb 16, 2017

Could you try it out on a more recent version of pandas, or add a copy-pastable example so that someone else can check? Might have been fixed already.

jreback commented Feb 16, 2017

further, np.min almost always does the wrong thing; it doesn't respect timezones.

use .min()
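
A quick sketch of why (illustrative only): the numpy view of a tz-aware column is plain UTC datetime64[ns], so anything that drops down to the numpy layer has no timezone left to respect.

import numpy as np
import pandas as pd

ts = pd.Series(pd.date_range('2016-11-01 08:00', periods=3, freq='H', tz='US/Eastern'))

ts.dtype           # datetime64[ns, US/Eastern]
ts.values.dtype    # datetime64[ns] -- the underlying values are naive UTC
ts.min()           # tz-aware Timestamp in US/Eastern
np.min(ts.values)  # naive UTC value; the tz is gone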

TomAugspurger commented Feb 16, 2017 edited

Actually, here's a repro:

In [63]: ts = pd.Series(pd.date_range('2016', periods=12, freq='H').tz_localize("UTC").tz_convert("US/Eastern"))

In [64]: ts
Out[64]:
0    2015-12-31 19:00:00-05:00
1    2015-12-31 20:00:00-05:00
2    2015-12-31 21:00:00-05:00
3    2015-12-31 22:00:00-05:00
4    2015-12-31 23:00:00-05:00
                ...
7    2016-01-01 02:00:00-05:00
8    2016-01-01 03:00:00-05:00
9    2016-01-01 04:00:00-05:00
10   2016-01-01 05:00:00-05:00
11   2016-01-01 06:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [65]: ts.groupby(level=0).agg(np.min)
Out[65]:
0    2016-01-01 00:00:00-05:00
1    2016-01-01 01:00:00-05:00
2    2016-01-01 02:00:00-05:00
3    2016-01-01 03:00:00-05:00
4    2016-01-01 04:00:00-05:00
                ...
7    2016-01-01 07:00:00-05:00
8    2016-01-01 08:00:00-05:00
9    2016-01-01 09:00:00-05:00
10   2016-01-01 10:00:00-05:00
11   2016-01-01 11:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [66]: ts.groupby(level=0).min()
Out[66]:
0    2016-01-01 00:00:00-05:00
1    2016-01-01 01:00:00-05:00
2    2016-01-01 02:00:00-05:00
3    2016-01-01 03:00:00-05:00
4    2016-01-01 04:00:00-05:00
                ...
7    2016-01-01 07:00:00-05:00
8    2016-01-01 08:00:00-05:00
9    2016-01-01 09:00:00-05:00
10   2016-01-01 10:00:00-05:00
11   2016-01-01 11:00:00-05:00
dtype: datetime64[ns, US/Eastern]

> further, np.min almost always does the wrong thing; it doesn't respect timezones.

my thought too, but .min also shows the issue.

TomAugspurger commented Feb 16, 2017

I think the expected output there is identical to the input (since the index is already unique).
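
i.e. the check this implies, as a sketch against the repro above (both should hold once this is fixed):

ts.groupby(level=0).agg(np.min).equals(ts)   # expected True
ts.groupby(level=0).min().equals(ts)         # expected True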

jreback commented Feb 16, 2017

In [18]: from io import StringIO

In [19]: data = """O161101XVS100000044   2016-11-01 12:03:12.920000-04:00
    ...: O161101XVS100000047   2016-11-01 12:03:36.693000-04:00
    ...: O161101XVS100000098   2016-11-01 12:09:08.330000-04:00
    ...: O161101XVS100000122   2016-11-01 12:09:59.950000-04:00
    ...: O161101XVS100000152   2016-11-01 12:11:29.790000-04:00
    ...: """

In [20]: df = pd.read_csv(StringIO(data),header=None,sep='\s+')

In [21]: df.columns=['value','date','time']

In [22]: df['datetime'] = pd.to_datetime(df.date + ' ' + df.time).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

In [23]: df
Out[23]: 
                 value        date                   time                         datetime
0  O161101XVS100000044  2016-11-01  12:03:12.920000-04:00 2016-11-01 12:03:12.920000-04:00
1  O161101XVS100000047  2016-11-01  12:03:36.693000-04:00 2016-11-01 12:03:36.693000-04:00
2  O161101XVS100000098  2016-11-01  12:09:08.330000-04:00 2016-11-01 12:09:08.330000-04:00
3  O161101XVS100000122  2016-11-01  12:09:59.950000-04:00 2016-11-01 12:09:59.950000-04:00
4  O161101XVS100000152  2016-11-01  12:11:29.790000-04:00 2016-11-01 12:11:29.790000-04:00

In [24]: df.dtypes
Out[24]: 
value                           object
date                            object
time                            object
datetime    datetime64[ns, US/Eastern]
dtype: object

In [25]: df.groupby('value').datetime.min()
Out[25]: 
value
O161101XVS100000044   2016-11-01 16:03:12.920000-04:00
O161101XVS100000047   2016-11-01 16:03:36.693000-04:00
O161101XVS100000098   2016-11-01 16:09:08.330000-04:00
O161101XVS100000122   2016-11-01 16:09:59.950000-04:00
O161101XVS100000152   2016-11-01 16:11:29.790000-04:00
Name: datetime, dtype: datetime64[ns, US/Eastern]
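
For contrast, the python fall-back path from the original report (a sketch; per that report it keeps the local wall times):

df.groupby('value').datetime.agg(lambda x: x.min())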

jreback commented Feb 16, 2017

dupe of this: #10668

though I like this example.

jreback closed this Feb 16, 2017

jreback added the Duplicate label Feb 16, 2017

jreback added this to the No action milestone Feb 16, 2017

jreback commented Feb 16, 2017

actually, let's leave this one open instead.

jreback reopened this Feb 16, 2017

jreback removed the Duplicate label Feb 16, 2017

jreback modified the milestone: 0.20.0, No action Feb 16, 2017

jreback changed the title from Odd timezone behavior using groupby/agg in pandas to BUG: timezone lost in groupby-agg with cython functions Feb 16, 2017

jreback commented Feb 16, 2017 edited

@munierSalem if you'd like to debug, that would be great!

The groupby tz support is a bit buggy. Basically, since these are converted to i8 under the hood to actually do the operations, we need to (a rough sketch follows the link below):

  • if a datetime64tz (by definition this will be a single tz; if there were multiple tzs it would be an object column)
    • convert to UTC, keep track of tz
    • convert to i8 (already there)
    • perform operation, comes back as i8
    • localize to UTC, convert to original tz

roughly here:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby.py#L1896
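
A rough sketch of that round-trip (illustrative only; groupby_min_tz is a made-up helper, not the actual groupby internals):

import pandas as pd

def groupby_min_tz(values, by):
    # illustrative helper, not the real pandas code path
    idx = pd.DatetimeIndex(values)                         # the datetime64[ns, tz] column
    tz = idx.tz                                            # keep track of the original tz
    i8 = idx.tz_convert('UTC').values.view('i8')           # to UTC, then to int64 nanoseconds
    out = pd.Series(i8, index=by).groupby(level=0).min()   # the integer ("cython") reduction
    # back out: int64 ns -> UTC-localized datetimes -> the original tz
    return pd.Series(pd.to_datetime(out.values, utc=True).tz_convert(tz),
                     index=out.index)

e.g. groupby_min_tz(df['datetime'], df['value'].values) should match what df.groupby('value').datetime.min() produces once this is fixed.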

munierSalem commented Feb 16, 2017

@jreback I can fix this in my local repo, but I'll need to wait until I'm home to push it back ... working behind a draconian corporate firewall :(

jreback commented Feb 16, 2017

sure np

jreback closed this in 6c17f67 Feb 27, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

stephenrauch + AnkurDedania: BUG: GH15426 timezone lost in groupby-agg with cython functions
closes #15426

Author: Stephen Rauch <stephen.rauch+github@gmail.com>

Closes #15433 from stephenrauch/tz-lost-in-groupby-agg and squashes the following commits:

64a84ca [Stephen Rauch] BUG: GH15426 timezone lost in groupby-agg with cython functions
8d90d6c