BUG: multiple grouping with a TimeGrouper requires sort #6764

Closed
jreback opened this issue Apr 2, 2014 · 2 comments · Fixed by #6908
Labels: Bug, Datetime (Datetime data dtype), Groupby

@jreback
Contributor

jreback commented Apr 2, 2014

Resampling has been fixed, so this now only occurs when grouping with 2 or more groupers (#6516).

In [9]:
# Written against the pandas API current at the time (how=, sortlevel).
import pandas as pd
from pandas import DataFrame

df = DataFrame({
    'date': pd.to_datetime([
        '20121002', '20121007', '20130130', '20130202', '20130305', '20121002',
        '20121207', '20130130', '20130202', '20130305', '20130202', '20130305']),
    'user_id': [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5],
    'whole_cost': [1790, 364, 280, 259, 201, 623, 90, 312, 359, 301, 359, 801],
    'cost1': [12, 15, 10, 24, 39, 1, 0, 90, 45, 34, 1, 12]}).set_index('date')

# Per-user monthly sums via resample, reshaped to a (date, user_id) index.
expected = df.groupby('user_id')['whole_cost'].resample(
    'M', how='sum').dropna().reorder_levels(['date', 'user_id']).sortlevel().astype('int64')
expected.name = 'whole_cost'

In [10]: expected
Out[10]: 
date        user_id
2012-10-31  1          2154
            3           623
2012-12-31  3            90
2013-01-31  1           280
            3           312
2013-02-28  1           259
            5           718
2013-03-31  1           201
            5          1102
Name: whole_cost, dtype: int64

These should be equivalent

In [11]: df.sort_index().groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
Out[11]: 
date        user_id
2012-10-31  1          2154
            3           623
2012-12-31  3            90
2013-01-31  1           280
            3           312
2013-02-28  1           259
            5           718
2013-03-31  1           201
            5          1102
Name: whole_cost, dtype: int64

In [13]: df.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
ValueError: cannot reindex from a duplicate axis
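
For readers on a current pandas, here is a minimal sketch of the same comparison. It assumes `pd.Grouper(freq=...)` in place of `pd.TimeGrouper`, which has since been removed from pandas, and passes a `pd.offsets.MonthEnd()` object instead of the `'M'` alias to avoid depending on alias spelling. The helper name `monthly_cost` is illustrative and not part of the original report.

```python
import pandas as pd

# Same data as the report; the 'date' index is deliberately left unsorted.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "20121002", "20121007", "20130130", "20130202", "20130305", "20121002",
        "20121207", "20130130", "20130202", "20130305", "20130202", "20130305"]),
    "user_id": [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5],
    "whole_cost": [1790, 364, 280, 259, 201, 623, 90, 312, 359, 301, 359, 801],
}).set_index("date")


def monthly_cost(frame):
    # Hypothetical helper: group by month end and user_id, then sum the cost.
    # A fresh Grouper is built on every call.
    keys = [pd.Grouper(freq=pd.offsets.MonthEnd()), "user_id"]
    return frame.groupby(keys)["whole_cost"].sum()


unsorted_result = monthly_cost(df)
sorted_result = monthly_cost(df.sort_index())

# The two results should match whether or not the index was sorted first.
print(unsorted_result.equals(sorted_result))
```

If the grouping behaves as this issue expects, both calls return the same nine rows as `expected`.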
@sinhrks
Member

sinhrks commented Apr 18, 2014

The following are the results after #6908. Looks OK.

>>> df.sort_index().groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
date        user_id
2012-10-31  1          2154
            3           623
2012-12-31  3            90
2013-01-31  1           280
            3           312
2013-02-28  1           259
            5           718
2013-03-31  1           201
            5          1102
Name: whole_cost, dtype: int64

>>> df.groupby([pd.TimeGrouper(freq='M'), 'user_id'])['whole_cost'].sum()
date        user_id
2012-10-31  1          2154
            3           623
2012-12-31  3            90
2013-01-31  1           280
            3           312
2013-02-28  1           259
            5           718
2013-03-31  1           201
            5          1102
Name: whole_cost, dtype: int64

@jreback
Contributor Author

jreback commented Apr 18, 2014

Great. I didn't look through your tests, but please add these unless yours fully cover this case, and reference this issue.
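
A regression test along these lines might look like the sketch below. It is not the test added in #6908. It assumes a current pandas with `pd.Grouper`, and the names `make_frame`, `monthly_totals`, and `test_grouping_is_row_order_invariant` are hypothetical. The idea is that the grouped sums should not depend on row order, so the frame is shuffled several times and compared against the sorted baseline.

```python
import pandas as pd


def make_frame():
    # Data from the report in this issue, indexed by an unsorted 'date'.
    return pd.DataFrame({
        "date": pd.to_datetime([
            "20121002", "20121007", "20130130", "20130202", "20130305", "20121002",
            "20121207", "20130130", "20130202", "20130305", "20130202", "20130305"]),
        "user_id": [1, 1, 1, 1, 1, 3, 3, 3, 5, 5, 5, 5],
        "whole_cost": [1790, 364, 280, 259, 201, 623, 90, 312, 359, 301, 359, 801],
    }).set_index("date")


def monthly_totals(frame):
    # Group by month end and user_id, then flatten to a plain dict so the
    # comparison does not depend on index metadata.
    keys = [pd.Grouper(freq=pd.offsets.MonthEnd()), "user_id"]
    result = frame.groupby(keys)["whole_cost"].sum()
    return {(ts.strftime("%Y-%m-%d"), int(uid)): int(total)
            for (ts, uid), total in result.items()}


def test_grouping_is_row_order_invariant():
    # Regression sketch for issue #6764.
    df = make_frame()
    baseline = monthly_totals(df.sort_index())
    # Spot-check two values taken from the expected output in the report.
    assert baseline[("2012-10-31", 1)] == 2154
    assert baseline[("2013-03-31", 5)] == 1102
    # Any row order should yield the same totals as the sorted frame.
    for seed in range(5):
        shuffled = df.sample(frac=1, random_state=seed)
        assert monthly_totals(shuffled) == baseline
```

Comparing plain dicts keeps the check focused on the grouped values rather than on index dtype or level names.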

jreback modified the milestones: 0.14.0, 0.15.0 (Apr 18, 2014)