groupby.apply datetime bug affecting 0.17 #11324

Closed
hadjmic opened this Issue Oct 14, 2015 · 7 comments

hadjmic commented Oct 14, 2015

An exception is raised when:
a) the original DataFrame has a datetime column, and
b) the function passed to groupby.apply returns a Series object with a new datetime column

Code to reproduce:

import pandas as pd
import datetime

df = pd.DataFrame([['1', datetime.datetime.today()], 
                   ['2', datetime.datetime.today()],
                   ['2', datetime.datetime(2010, 1, 1)]],
                   columns=['record', 'date'])
dd = df.groupby('record').apply(lambda x: pd.Series({'max_date': x['date'].max()}))

This is a new issue affecting 0.17
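
For reference, a minimal sketch of the same reproduction with fixed timestamps instead of today(), so the result is deterministic. It assumes a pandas version that already contains the fix for this issue; on 0.17.0 the apply call is the one reported to raise.

```python
import datetime

import pandas as pd

# Same shape as the report, but with fixed timestamps (an assumption made
# here only so the output does not depend on the current date).
df = pd.DataFrame(
    [['1', datetime.datetime(2015, 10, 14)],
     ['2', datetime.datetime(2015, 10, 14)],
     ['2', datetime.datetime(2010, 1, 1)]],
    columns=['record', 'date'],
)

# Each group returns a one-element Series holding a datetime value.
dd = df.groupby('record').apply(
    lambda x: pd.Series({'max_date': x['date'].max()})
)

# On a fixed version, dd should be a DataFrame indexed by 'record'
# with a single datetime column named 'max_date'.
print(dd)
```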

jreback commented Oct 14, 2015

The canonical way of selecting the max of a column (and way, way more efficient):

In [59]: df.groupby('record').date.max()
Out[59]: 
record
1   2015-10-14 07:59:04.094327
2   2015-10-14 07:59:04.094343
Name: date, dtype: datetime64[ns]

jreback commented Oct 14, 2015

I guess this is a bug. You are doing a really, really odd thing here though.

jreback added this to the Next Major Release milestone Oct 14, 2015

hadjmic commented Oct 14, 2015

Imagine it in the following context:
You have a massive Apache event log that you import into a pandas DataFrame. The DataFrame has the columns:
event_id, user_identifier, event_type, timestamp, other stuff

The objective is to create a new DataFrame to see what the users have done. Thus, you need to group by user_identifier and somehow aggregate the events of each user. One of the things you need to find is the first and last timestamp at which the user interacted with the server.

Hope this clarifies things a bit.

Pandas is awesome by the way, you guys rule.

TomAugspurger commented Oct 14, 2015

@hadjmic would

df.groupby(['user_identifier']).timestamp.agg(['min', 'max'])

work for you? You can also control the naming with .timestamp.agg({'max_date': 'max', 'min_date': 'min'}) (I might have the keys and values of that dictionary backward).
That will give the first (min) and last (max) timestamp per user. I guess you said that's just one of the things you need.
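
A small sketch of the renaming idea, assuming a later pandas release (0.25 or newer, as far as I know) where output names can be passed as keyword arguments to agg. The sample rows are made up; only the column names come from the thread.

```python
import pandas as pd

# Hypothetical event log; the data values are invented for illustration.
events = pd.DataFrame({
    'user_identifier': ['a', 'a', 'b'],
    'timestamp': pd.to_datetime(['2015-10-01', '2015-10-14', '2015-10-05']),
})

# Named aggregation: keyword = output column name, value = aggregation to run.
summary = events.groupby('user_identifier')['timestamp'].agg(
    min_date='min',
    max_date='max',
)
print(summary)
```

With keyword arguments there is no ambiguity about which side is the new name and which is the function.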

jreback commented Oct 14, 2015

Did my example in [59] not clarify? My point is that technically using apply like this is OK, but canonically it is quite confusing.

hadjmic commented Oct 14, 2015

Perhaps it would have been clearer if I had said I have a processUserEvents function. The function takes a DataFrame of user events as input (i.e. each group of the groupby operation) and returns a Series with specific user characteristics. Among those are the min and max of the timestamp, but there is a lot of other stuff involved, such as values extracted from URL paths, query strings, flow paths, etc.
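
To make the pattern concrete, here is a rough sketch of what such a function might look like. The name process_user_events, the output field names, and the sample data are all hypothetical stand-ins; the real processUserEvents is not shown in this thread.

```python
import pandas as pd


def process_user_events(group):
    """Hypothetical stand-in: one user's events in, one Series of traits out."""
    return pd.Series({
        'first_seen': group['timestamp'].min(),
        'last_seen': group['timestamp'].max(),
        'n_events': len(group),
        # Placeholder for the "other stuff" (URL paths, query strings, ...).
        'event_types': ','.join(sorted(group['event_type'].unique())),
    })


# Invented sample rows using the column names mentioned earlier in the thread.
events = pd.DataFrame({
    'user_identifier': ['a', 'a', 'b'],
    'event_type': ['view', 'click', 'view'],
    'timestamp': pd.to_datetime(['2015-10-01', '2015-10-14', '2015-10-05']),
})

# One row per user, one column per key of the returned Series.
users = events.groupby('user_identifier').apply(process_user_events)
print(users)
```

This is the same shape of call as the original report: the applied function returns a Series that contains datetime values alongside other fields.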

jreback modified the milestones: 0.17.1, Next Major Release Oct 28, 2015

robdmc added a commit to robdmc/pandas that referenced this issue Nov 4, 2015

Fixed groupby().apply(func) bug when working with time colums (GH pan…
…das-dev#11324)

Addressed PR comments

Added comments and updated whatsnew

jreback added a commit that referenced this issue Nov 13, 2015

Fixed groupby().apply(func) bug when working with time colums (GH #11324)

Addressed PR comments

Added comments and updated whatsnew

jreback commented Nov 13, 2015

closed by #11548
