Taking first row from each group in groupby sometimes strips tzinfo #10668

Closed
louispotok opened this Issue Jul 24, 2015 · 8 comments


xref #12898 (same fix)

(c.f. http://stackoverflow.com/questions/31617084/how-to-have-groupby-first-not-remove-timezone-info-from-datetime-columns)
Take a dataframe with a column of tz-aware datetime.datetime objects, group it by a different column, and return the first row from each group. Some ways of doing this leave the datetime as it is, but at least two convert it to a tz-naive pandas Timestamp object.

In [1]: import pandas as pd

In [2]: import datetime

In [3]: import pytz

In [4]: dates = [datetime.datetime(2015,1,i,tzinfo=pytz.timezone('US/Pacific')) for i in range(1,5)]

In [5]: df = pd.DataFrame({'A': ['a','b']*2,'B': dates})

In [6]: df
Out[6]: 
   A                          B
0  a  2015-01-01 00:00:00-08:00
1  b  2015-01-02 00:00:00-08:00
2  a  2015-01-03 00:00:00-08:00
3  b  2015-01-04 00:00:00-08:00

In [7]: grouped = df.groupby('A') 

In [8]: grouped.nth(0) #B stays a datetime.datetime with timezone info
Out[8]: 
                           B
A                           
a  2015-01-01 00:00:00-08:00
b  2015-01-02 00:00:00-08:00

In [9]: grouped.head(1) #B stays a datetime.datetime with timezone info
Out[9]: 
                           B
0  2015-01-01 00:00:00-08:00
1  2015-01-02 00:00:00-08:00

In [10]: grouped.first() #B is naive pd.Timestamp in UTC
Out[10]: 
                    B
A                    
a 2015-01-01 08:00:00
b 2015-01-02 08:00:00

And apparently grouped.apply(lambda x: x.iloc[0]) does the same as .first().

And according to this comment the same thing happens if you replace cell [4] with the more pandonic line:

dates = pd.date_range('2015-01-01',periods=4,tz='US/Pacific') 
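
One quick way to see which of these calls keep the timezone is to compare the dtype of column B after each one. This is only a sketch against the date_range variant above: the `results` dict and the printed labels are my own naming, not anything from pandas, and what it prints depends on the pandas version (a version affected by this bug would report a naive dtype for first()).

```python
import pandas as pd

# Same frame as in the report, built with the date_range variant.
dates = pd.date_range('2015-01-01', periods=4, tz='US/Pacific')
df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
grouped = df.groupby('A')

# Column B after each way of taking the first row per group.
# (`results` is an illustrative name, not part of any API.)
results = {
    'nth(0)': grouped.nth(0)['B'],
    'head(1)': grouped.head(1)['B'],
    'first()': grouped.first()['B'],
}

for name, col in results.items():
    # A tz-aware column reports a DatetimeTZDtype; a stripped one does not.
    is_aware = isinstance(col.dtype, pd.DatetimeTZDtype)
    print(f'{name:8s} dtype={col.dtype} tz-aware={is_aware}')
```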
Contributor

jreback commented Jul 24, 2015

It's a bug. I thought we had an issue for this already, but I can't seem to find it.

jreback added this to the Next Major Release milestone Jul 24, 2015

Contributor

cfperez commented Oct 26, 2015

I can also confirm the bug for grouped.last() and grouped.apply(lambda x: x.iloc[-1]).

But it does work correctly for grouped.agg(lambda x: x.iloc[-1]).
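
For anyone wanting to reproduce the last-row case, here is a rough sketch comparing last() with the agg route. It assumes the date_range frame from the original report, and it calls agg on the selected column (grouped['B']) rather than on the whole groupby, which is a small deviation from the call above. The names `expected`, `via_last` and `via_agg` are just for illustration.

```python
import pandas as pd

dates = pd.date_range('2015-01-01', periods=4, tz='US/Pacific')
df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
grouped = df.groupby('A')

# Last row per group in this frame: a -> 2015-01-03, b -> 2015-01-04.
expected = [pd.Timestamp('2015-01-03', tz='US/Pacific'),
            pd.Timestamp('2015-01-04', tz='US/Pacific')]

via_last = grouped.last()['B']
# Assumption: agg on the selected column stands in for grouped.agg(...).
via_agg = grouped['B'].agg(lambda x: x.iloc[-1])

for name, col in [('last()', via_last), ('agg(iloc[-1])', via_agg)]:
    # Report the dtype and whether the values match the tz-aware originals.
    print(name, col.dtype, list(col) == expected)
```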

Contributor

jreback commented Oct 26, 2015

This is all ok on master, so all this issue needs is probably a few confirming tests.

@cfperez, @louispotok interested in a pull request?

In [20]: grouped.nth(0)
Out[20]: 
                          B
A                          
a 2014-12-31 23:53:00-08:00
b 2015-01-01 23:53:00-08:00

In [21]: grouped.head(1)
Out[21]: 
                          B
0 2014-12-31 23:53:00-08:00
1 2015-01-01 23:53:00-08:00

In [22]: grouped.first()
Out[22]: 
                          B
A                          
a 2015-01-01 07:53:00-08:00
b 2015-01-02 07:53:00-08:00

In [23]: grouped.apply(lambda x: x.iloc[0]) 
Out[23]: 
A
a   2014-12-31 23:53:00-08:00
b   2015-01-01 23:53:00-08:00
dtype: datetime64[ns, US/Pacific]

In [24]: grouped.first().dtypes
Out[24]: 
B    datetime64[ns, US/Pacific]
dtype: object

jreback added Testing and removed Bug labels Oct 26, 2015

jreback modified the milestone: 0.17.1, Next Major Release Oct 26, 2015

Contributor

cfperez commented Oct 26, 2015

@jreback I'm working off the latest commit, and the problem now is that the timestamp is wrong (exactly 8 hours off, reflecting the timezone difference) even though the timezone is preserved. Note that nth(0) and first() return different times for the same date and timezone.

Also, why don't these two methods return the same indices? In your example, nth(0) and head(1) agree, but first() does not.

I can add tests, but I still think this is a bug (and I'm unsure how deep the rabbit hole goes).
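
A possible shape for such a test, sketched under the assumption that the date_range frame from the original report is used: convert both results to UTC and look at the difference, so a shifted wall time cannot hide behind a matching timezone label. The helper `utc_instants` is hypothetical, and on a version where first() returns naive values the tz_convert call would raise instead.

```python
import pandas as pd

dates = pd.date_range('2015-01-01', periods=4, tz='US/Pacific')
df = pd.DataFrame({'A': ['a', 'b'] * 2, 'B': dates})
grouped = df.groupby('A')

def utc_instants(col):
    # Hypothetical helper: normalise each tz-aware value to UTC.
    return [ts.tz_convert('UTC') for ts in col]

nth_utc = utc_instants(grouped.nth(0)['B'])
first_utc = utc_instants(grouped.first()['B'])

# Per-group difference between first() and nth(0); both are ordered a, b here.
offsets = [f - n for f, n in zip(first_utc, nth_utc)]
print(offsets)
```

Any non-zero Timedelta in `offsets` would point to the double localization described above.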

Contributor

jreback commented Oct 26, 2015

Ahh, I wasn't paying enough attention.
Yeah, this got localized twice, I think.

OK, I'll mark it as a bug again then.

jreback modified the milestone: Next Major Release, 0.17.1 Nov 13, 2015

Member

sinhrks commented Apr 6, 2016

Dupe of #12716.

jreback modified the milestone: 0.18.1, Next Major Release Apr 6, 2016

jreback modified the milestone: 0.18.1, 0.18.2 Apr 26, 2016

jreback added the Duplicate label Feb 16, 2017

Contributor

jreback commented Feb 16, 2017

There's a better example, I think, in #15426.

jreback closed this Feb 16, 2017
