
groupby.apply datetime bug affecting 0.17 #11324

Closed
hadjmic opened this issue Oct 14, 2015 · 7 comments

@hadjmic

hadjmic commented Oct 14, 2015

An exception is raised when:
a) the original dataframe has a datetime column
b) the function passed to groupby.apply returns a Series object with a new datetime column

Code to reproduce:

import pandas as pd
import datetime

df = pd.DataFrame([['1', datetime.datetime.today()], 
                   ['2', datetime.datetime.today()],
                   ['2', datetime.datetime(2010, 1, 1)]],
                   columns=['record', 'date'])
dd = df.groupby('record').apply(lambda x: pd.Series({'max_date': x['date'].max()}))

This is a new issue affecting 0.17.
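For comparison, the same result can be built without groupby.apply at all; this is only a workaround sketch (it sidesteps the bug rather than fixing it):

```python
import datetime
import pandas as pd

df = pd.DataFrame([['1', datetime.datetime.today()],
                   ['2', datetime.datetime.today()],
                   ['2', datetime.datetime(2010, 1, 1)]],
                  columns=['record', 'date'])

# Aggregate the datetime column directly instead of going through apply,
# then name the resulting column explicitly.
dd = df.groupby('record')['date'].max().to_frame('max_date')
```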

@jreback
Contributor

jreback commented Oct 14, 2015

The canonical way of selecting the max of a column (and way, way more efficient):

In [59]: df.groupby('record').date.max()
Out[59]: 
record
1   2015-10-14 07:59:04.094327
2   2015-10-14 07:59:04.094343
Name: date, dtype: datetime64[ns]

@jreback
Contributor

jreback commented Oct 14, 2015

I guess this is a bug. You are doing a really odd thing here, though.

@jreback jreback added this to the Next Major Release milestone Oct 14, 2015
@hadjmic
Author

hadjmic commented Oct 14, 2015

Imagine it in the following context:
you have a massive Apache event log that you import into a pandas dataframe. The dataframe has as columns:
event_id, user_identifier, event_type, timestamp, other stuff

The objective is to create a new dataframe showing what the users have done. Thus, you need to group by user_identifier and somehow aggregate the events of each user. One of the things you need to find is the first and last timestamp at which the user interacted with the server.

Hope this clarifies things a bit.

Pandas is awesome by the way, you guys rule.

@TomAugspurger
Contributor

@hadjmic would

df.groupby(['user_identifier']).timestamp.agg(['min', 'max'])

work for you? You can also control the naming with .timestamp.agg({'max_date': 'max', 'min_date': 'min'}) (I might have the keys and values of that dictionary backward).
That will give the first (min) and last (max) timestamp per user. I realize you said that's just one of the things you need, though.
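A runnable sketch of this suggestion, using a made-up log (the column names are taken from the description above):

```python
import pandas as pd

# Hypothetical event log matching the columns described earlier in the thread.
log = pd.DataFrame({
    'user_identifier': ['u1', 'u1', 'u2'],
    'timestamp': pd.to_datetime(['2015-10-01 09:00',
                                 '2015-10-01 17:30',
                                 '2015-10-02 12:00']),
})

# First (min) and last (max) timestamp per user, in one pass.
summary = log.groupby('user_identifier').timestamp.agg(['min', 'max'])
```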

@jreback
Contributor

jreback commented Oct 14, 2015

Did my example in [59] not clarify? My point is that using apply like this is technically OK, but it is not the canonical way and is quite confusing.

@hadjmic
Author

hadjmic commented Oct 14, 2015

Perhaps it would have been clearer if I said I have a processUserEvents function. The function takes a dataframe of user events as input (i.e. each group of the groupby operation) and returns a Series with specific user characteristics. Among those are the min and max of the timestamp, but there is a lot of other stuff involved, such as values extracted from URL paths, query strings, flow paths, etc.
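A minimal sketch of what such a function might look like; the body here is invented for illustration, only the processUserEvents name and the per-group-Series pattern come from the comment above:

```python
import pandas as pd

def processUserEvents(events):
    # events is one user's slice of the log (one group of the groupby);
    # return a Series of per-user characteristics, including datetimes.
    return pd.Series({
        'first_seen': events['timestamp'].min(),
        'last_seen': events['timestamp'].max(),
        'n_events': len(events),
    })

# Hypothetical log with the columns described earlier in the thread.
log = pd.DataFrame({
    'user_identifier': ['u1', 'u1', 'u2'],
    'timestamp': pd.to_datetime(['2015-10-01', '2015-10-03', '2015-10-02']),
})

# One row per user; the Series index becomes the result's columns.
profile = log.groupby('user_identifier').apply(processUserEvents)
```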

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 28, 2015
robdmc added a commit to robdmc/pandas that referenced this issue Nov 4, 2015 (…das-dev#11324): addressed PR comments; added comments and updated whatsnew
jreback pushed a commit that referenced this issue Nov 13, 2015: addressed PR comments; added comments and updated whatsnew
@jreback
Contributor

jreback commented Nov 13, 2015

closed by #11548
