Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby.first/last loses timezone information followup #21603

Closed
mroeschke opened this issue Jun 23, 2018 · 12 comments

Comments

@mroeschke
Copy link
Member

commented Jun 23, 2018

xref #21573 (comment)

In [2]: df = pd.DataFrame({'group': [1, 1, 2],
                           'category_string': pd.Series(list('abc')).astype('category'),
                           'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
In [3]: df.groupby('group').first()
Out[3]: 
      category_string          datetimetz                                                
group                                                                                                                                                     
1                   a 2013-01-01 05:00:00                                                                      
2                   c 2013-01-03 05:00:00 

The example above passes data through the first/last compat method which strips timezone information. PR #15885 (now closed) should fix this issue (and offer a performance boost to Categorial data as mentioned in #19026)

@aggarwalvinayak

This comment has been minimized.

Copy link

commented Jun 23, 2018

Hey i would like to take up this issue. Also i am new to open source..
Could you elaborate a bit more on what this issue is about ?

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jun 23, 2018

@mroeschke haven't looked in detail yet but do you know if there's a reason why we even have a compat function here? Under the impression the Cythonized version should work here, no?

@mroeschke

This comment has been minimized.

Copy link
Member Author

commented Jun 23, 2018

@WillAyd Just quickly looking at the code, like a general first method was created and is invoked via apply for DataFrame groups and just directly on Series groups. I'm not too familiar with the groupby routines, but wouldn't bringing this into cython strip the current extension dtypes (i.e Categorical -> Object and Datetimetz -> Datetime)?

def first_compat(x, axis=0):

@aggarwalvinayak the datetimetz column above has lost timezone information when it was in the original DataFrame. The solution would be to impliment the changes in #15885

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jun 23, 2018

It may in it’s current form, but at least in theory I don’t think the first and last operations should have to behave differently based on the type being referenced. There are indexing operations for things like shift that can operate on any type arbitrarily, so seems logical that could apply to accessing the first and last elements as well.

That’s a larger change I can look into and perhaps deserves a separate issue. @aggarealvinayak feel free to continue diagnosing this as prescribed above

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

this needs to be fixed in cython i think

@aggarwalvinayak

This comment has been minimized.

Copy link

commented Jun 23, 2018

@mroeschke

The solution would be to impliment the changes in #15885

Do u mean to revert back the changes that were introduced in #15885
And @WillAyd Which procedure of diagnosing are you mentioning exactly?

@mroeschke

This comment has been minimized.

Copy link
Member Author

commented Jun 23, 2018

Fair point @WillAyd, the indexing should be agnostic to the data types.

Alternatively, I was thinking; why isn't first/last just implemented in terms of the nth method? It looks to handle this issue correctly correctly and to be more performant:

In [4]: df = pd.DataFrame({'group': [1, 1, 2],
   ...:                            'category_string': pd.Series(list('abc')).astype('category'),
   ...:                            'datetimetz': pd.date_range('20130101', periods=3, tz='US/Eastern')})
   ...:

In [5]: df
Out[5]:
   group category_string                datetimetz
0      1               a 2013-01-01 00:00:00-05:00
1      1               b 2013-01-02 00:00:00-05:00
2      2               c 2013-01-03 00:00:00-05:00

In [6]: df.groupby('group').first() #wrong
Out[6]:
      category_string          datetimetz
group
1                   a 2013-01-01 05:00:00
2                   c 2013-01-03 05:00:00

In [7]: df.groupby('group').nth(0) # correct
Out[7]:
      category_string                datetimetz
group
1                   a 2013-01-01 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

In [8]: %timeit df.groupby('group').nth(0)
2.52 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [9]: %timeit df.groupby('group').first()
14.8 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

like how head/tail is just a wrapper around iloc. This may be a win/win unless first/last has a feature that nth doesn't have?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jun 23, 2018

first/last i think could be in terms of nth (which came later) - nan handling is the same i think (that’s the defining issue and how they r different from head/tail)

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jun 23, 2018

If you take the compat out of the equation first is really just nth passing 1 as n:

last is a separate implementation in Cython. It's a little different than first because with first n will always be 1, but with last n could vary across each group. Perhaps an intelligent consolidation could still occur back in Cython. Here's the implementation of that (nth is in the same module):

def group_last_{{name}}(ndarray[{{dest_type2}}, ndim=2] out,

@mroeschke

This comment has been minimized.

Copy link
Member Author

commented Jun 23, 2018

Not sure if this is the right test, but it looks like nth supports negative indexing so shouldn't last always be -1?

In [3]: df.groupby('group').nth(-1)
Out[3]:
      category_string                datetimetz
group
1                   b 2013-01-02 00:00:00-05:00
2                   c 2013-01-03 00:00:00-05:00

@aggarwalvinayak what you may want to do then is start from @WillAyd comments above and try to investigate if its possible to implement first/last in terms of the nth method.

@WillAyd

This comment has been minimized.

Copy link
Member

commented Jun 24, 2018

@aggarwalvinayak welcome to still work on this but just as a heads up I removed the "good first issue" tag as this is a little more complicated touching on Cython code

@aggarwalvinayak

This comment has been minimized.

Copy link

commented Jun 24, 2018

@WillAyd I am not at all experienced with cython.. Will try to explore about that first.. Because this is my second issue that the good first issue tag was removed because of cython thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.