Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GroupBy using TimeGrouper does not work #3791

Closed
aschilling opened this issue Jun 7, 2013 · 20 comments
Closed

GroupBy using TimeGrouper does not work #3791

aschilling opened this issue Jun 7, 2013 · 20 comments
Milestone

Comments

@aschilling
Copy link

BUG: TimeGrouper not too friendly with other groups, e.g.

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()
should work


Hi everybody,

I found two issues with TimeGrouper:

  1. TimeGrouper does not work at all:

Let's take the following example:

df = pd.DataFrame({
'Branch' : 'A A A A A B'.split(),
'Buyer': 'Carl Mark Carl Joe Joe Carl'.split(),
'Quantity': [1,3,5,8,9,3],
'Date' : [
DT.datetime(2013,1,1,13,0),
DT.datetime(2013,1,1,13,5),
DT.datetime(2013,10,1,20,0),
DT.datetime(2013,10,3,10,0),
DT.datetime(2013,12,2,12,0),
DT.datetime(2013,12,2,14,0),
]})

gr = df.groupby(pd.TimeGrouper(freq='6M'))

def testgr(df):
print df

gr.apply(testgr)

This will raise the Exception: "Exception: All objects passed were None"

  1. With previous Panda's version it was not possible to combine TimeGrouper with another criteria such as "Branch" in my case.

Thank you very much

Andy

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

You need to set_index as TimeGrouper operates on the index

In [15]: df
Out[15]: 
  Branch Buyer                Date  Quantity
0      A  Carl 2013-01-01 13:00:00         1
1      A  Mark 2013-01-01 13:05:00         3
2      A  Carl 2013-10-01 20:00:00         5
3      A   Joe 2013-10-03 10:00:00         8
4      A   Joe 2013-12-02 12:00:00         9
5      B  Carl 2013-12-02 14:00:00         3

In [16]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).sum()
Out[16]: 
            Quantity
2013-01-31         4
2013-07-31       NaN
2014-01-31        25

@aschilling
Copy link
Author

Thanks and is it possible to combine TimeGrouper with another criteria?

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

In [28]: df.set_index('Date').groupby(pd.TimeGrouper('6M'))['Quantity'].apply(lambda x: x.count())
Out[28]: 
2013-01-31    2
2013-07-31    0
2014-01-31    4
dtype: int64

@aschilling
Copy link
Author

Hi jreback, thanks for your reply. Sorry but I do not understand your solution, how can I use it to groupby the TimeGrouper criteria and for example by 'Branch' ?

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

This is actuallly a bit tricky

In [38]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').sum())
Out[38]: 
                   Quantity
           Branch          
2013-01-31 A              4
2014-01-31 A             22
           B              3

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

@hayd is there a better way to do this you think?

@hayd
Copy link
Contributor

hayd commented Jun 7, 2013

I'm afraid I have no idea!

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

I woudl have thought:

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()

should work....

@aschilling
Copy link
Author

I tried that but already but it raises the exception: "TypeError: 'TimeGrouper' object is not callable"

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

did you see my prior comment?

@hayd
Copy link
Contributor

hayd commented Jun 7, 2013

I would have thought that would work too....

I had tried a few variations of your solution... None of which I could get working (hence the other issue I posted) :)

@aschilling
Copy link
Author

@jreback Yes, but that does not work for me either, because I need to apply a self defined function to the formed GroupBy Object. If I already use the simple function above with your solution:

df.groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testgr))

It raises: "AttributeError: 'DataFrame' object has no attribute 'name'"

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

This is basically a composition operation, you group by time, then apply a function

which happens to group by branch then operates, so u need to operate on the inner function

This is quite tricky in that your function should return a scalar value on the single passed series

In [44]: def testf(s):
   ....:     return s.sum()
   ....: 

In [46]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')['Quantity'].apply(testf))
Out[46]: 
            Branch
2013-01-31  A          4
2014-01-31  A         22
            B          3
dtype: int64

@aschilling
Copy link
Author

@jreback Thanks! Unfortunately, it does not solve my problem, as I need to operate on various columns in my function. Is there any possibility to pass the Buyer column to the function?

df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch')[['Buyer', 'Quantity']].apply(testgr)) # doubles the buyer names

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

If you return a custom function then you need to handle the string cases, but you can return pretty much anything you want (make it a Series) to get this kind of functionaility, you function is passed a slice of the original frame

In [55]: def testf(df):
   ....:     if (df['Buyer'] == 'Mark').sum() > 0:
   ....:         return Series(dict(quantity = df['Quantity'].sum(), buyer = 'mark'))
   ....:     return Series(dict(quantity = df['Quantity'].sum()*100, buyer = 'other'))
   ....: 

In [56]: df.set_index('Date').groupby(pd.TimeGrouper('6M')).apply(lambda x: x.groupby('Branch').apply(testf))
Out[56]: 
                   buyer quantity
           Branch                
2013-01-31 A        mark        4
2014-01-31 A       other     2200
           B       other      300

@aschilling
Copy link
Author

Great solution ! Thanks a lot

@jreback
Copy link
Contributor

jreback commented Jun 7, 2013

great...glad it worked out (and I am going to open an issue about a defect in that:

df.set_index('Date').groupby([pd.TimeGrouper('6M'),'Branch']).sum()

should work...

thanks for bringing it up!

@JamesPHoughton
Copy link

It appears other methods that work on normal groups fail when using the TimeGrouper. For instance

 data.groupby('Col1').groups

works fine, but:

data.groupby(pd.TimeGrouper(freq='H')).groups

gives error:

'DataFrameGroupBy' object has no attribute 'groups'

Am I misunderstanding something?

Edit: Looks like this has become an open issue: #3881

@jreback
Copy link
Contributor

jreback commented Jun 19, 2013

see #3881, already noted, thank you

@dz1259423735
Copy link

Maybe you shall try to use the parameter explicitly,like freq=‘5min’,this could be efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants