The apply method over a group seems to visit the first group twice #2656

Closed
srippa opened this issue Jan 8, 2013 · 8 comments

@srippa
commented Jan 8, 2013

It seems that the apply method of a group invokes the function passed as a parameter once for each group, as expected. However, the function is invoked twice for the first group. Enclosed is a short code sample reproducing this behavior. Is this normal behavior, or am I doing something wrong here?

I ran the code with pandas 0.10.0 on Windows 7, 32-bit version.

The code reproducing this behavior:

import pandas as pd

tdf = pd.DataFrame( { 'cat' : [1,1,1,2,2,1,1,2], 'B' : range(8)})
category = tdf['cat']
pGroups = tdf.groupby(by=category)

def getLen(df):
    print 'len ', len(df)
    return len(df)

plen = pGroups.apply(getLen)

@wesm

Member

commented Jan 8, 2013

Yes. The issue is that the groupby infrastructure can take a very fast code path if the function returns a new object, as opposed to a modified version of the chunk or a view on the chunk. This is new in 0.10. It may make sense to add an option to groupby that disables this behavior and takes the slower code path.
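
As a rough illustration of the idea, here is a toy, pure-Python model: probe the function once on a copy of the first chunk, then pick a path depending on whether it returned a new object or touched its input. This is not pandas' actual implementation; `apply_with_probe`, the `'fast'`/`'slow'` labels, and the list-based chunks are all hypothetical.

```python
def apply_with_probe(groups, func):
    """Toy model only: probe func on the first chunk, then apply it to all chunks.

    groups maps a group key to a list of values (a stand-in for a chunk).
    Returns (results, path), where path is 'fast' or 'slow'.
    """
    keys = list(groups)
    # Probe on a copy so the extra call cannot change the real data.
    probe_input = list(groups[keys[0]])
    probe_output = func(probe_input)
    touched_input = (probe_input != groups[keys[0]]) or (probe_output is probe_input)
    path = 'slow' if touched_input else 'fast'
    # Either way, func now runs once per group, so the first group is seen twice.
    results = {key: func(groups[key]) for key in keys}
    return results, path


calls = []

def get_len(chunk):
    calls.append(len(chunk))  # record every invocation
    return len(chunk)         # returns a new object (an int)

def drop_first(chunk):
    del chunk[0]              # modifies the chunk in place
    return chunk              # and hands the same object back

groups = {1: [0, 1, 2, 5, 6], 2: [3, 4, 7]}
lengths, lengths_path = apply_with_probe(groups, get_len)
trimmed, trimmed_path = apply_with_probe({1: [0, 1, 2], 2: [3, 4]}, drop_first)
print(calls, lengths, lengths_path, trimmed_path)
```

In this toy model `get_len` is recorded three times for two groups, while `lengths` still holds exactly one entry per group.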

@michaelaye

Contributor

commented Feb 10, 2013

Hm, Wes, maybe I'm missing a point here, but why did you not mark this as a bug? Your comment even sounds like you consider this merely a feature request. Do you not consider it wrong to provide three results in this case, even though there are only 2 groups? As I said, I'm afraid I'm missing something here, on my way to really getting a grip on groupby().

@dalejung

Contributor

commented Feb 10, 2013

It's not providing a duplicate value. The GroupBy internals will attempt a fast_apply; on failure, they fall back to the old pathway. The data returned from the GroupBy should be exactly the same, provided you aren't doing something funky with globals.
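
A small sketch of that point, written for Python 3 and a current pandas (an assumption; the thread itself uses pandas 0.10 on Python 2). The function below records each call in a module-level list, which is the kind of global side effect that can expose an extra call. `seen_lengths` and `get_len` are names made up for this example. How often the first group is visited depends on the pandas version, so the sketch makes no claim about the exact count.

```python
import pandas as pd

tdf = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 1, 1, 2], 'B': range(8)})

seen_lengths = []  # module-level state mutated by the applied function

def get_len(values):
    seen_lengths.append(len(values))  # side effect: visible outside apply
    return len(values)

# Select column 'B' so each group arrives as a Series of that group's values.
plen = tdf.groupby('cat')['B'].apply(get_len)

print(plen)          # one entry per group, however often get_len ran
print(seen_lengths)  # may hold more entries than there are groups
```

The returned `plen` is unaffected by any extra call; only code that inspects `seen_lengths` can tell the difference.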

@michaelaye

Contributor

commented Feb 10, 2013

data returned from the GroupBy should be exactly the same

The same as what?
This is what I get from the above example:

In [1]: import pandas as pd

In [2]: tdf = pd.DataFrame({'cat':[1,1,1,2,2,1,1,2],'B':range(8)})

In [3]: tdf
Out[3]: 
   B  cat
0  0    1
1  1    1
2  2    1
3  3    2
4  4    2
5  5    1
6  6    1
7  7    2

In [4]: category = tdf['cat']

In [5]: pGroups = tdf.groupby(by=category)

In [7]: def getLen(df):
   ...:     print 'len',len(df)
   ...:     return len(df)
   ...: 

In [8]: plen = pGroups.apply(getLen)
len 5
len 5
len 3

The groupby is by category, which has only 2 distinct values, 1 and 2. So shouldn't pGroups.apply(getLen) call getLen only twice? I'm sorry if I'm being dumb somewhere. (Using 0.10.1)

@wesm

Member

commented Feb 11, 2013

The function is called an additional time to determine whether the function has side effects. If it does not have side effects, then a faster code path (substantially faster) can be used. If it does, the old slow code path is used. There should be an option to force the slow path to always be taken in cases like yours.
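
Without such an option, one possible workaround when the function has side effects is to iterate over the groups yourself, which calls the function exactly once per group. This is a sketch for Python 3 and a current pandas, not a built-in switch for the slow path; `get_len`, `seen_lengths`, and `lengths_by_cat` are illustrative names.

```python
import pandas as pd

tdf = pd.DataFrame({'cat': [1, 1, 1, 2, 2, 1, 1, 2], 'B': range(8)})

seen_lengths = []

def get_len(df):
    seen_lengths.append(len(df))  # side effect we want to happen once per group
    return len(df)

# Iterating a GroupBy yields (key, sub-DataFrame) pairs, one per group.
lengths_by_cat = pd.Series(
    {key: get_len(chunk) for key, chunk in tdf.groupby('cat')}
)
lengths_by_cat.index.name = 'cat'

print(lengths_by_cat)
print(seen_lengths)
```

The trade-off is that the loop runs in plain Python, so it gives up whatever speed the fast path would have provided.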

@michaelaye

Contributor

commented Feb 11, 2013

Oh darn:

In [29]: len(plen)
Out[29]: 2

In [30]: plen
Out[30]: 
cat
1      5
2      3

I was wrongly assuming that the triple execution meant that the user receives 3 objects, and was puzzled why you don't consider that wrong. Now I see that, even though the function is executed 3 times, it correctly returns only 2 objects, so I understand why it's not a bug. Thanks for your patience.

@wesm

Member

commented Apr 7, 2013

Also marked this as Someday, pending #2936.

@jreback

Contributor

commented Sep 21, 2013

Closing as not a bug; #2936 might provide this functionality.

5 participants