
groupby - apply applies to the first group twice #7739

Closed
jsw-fnal opened this issue Jul 12, 2014 · 17 comments


@jsw-fnal

commented Jul 12, 2014

This bug is easy to reproduce:

import pandas

def printA(group):
    print(group.A.values[0])
    return 0

pandas.DataFrame({'A': [0, 0, 1, 1], 'B': [0, 1, 0, 1]}).groupby('A').apply(printA)

This should print

0
1

and return a Series containing two zeroes. Although it does return the expected Series, it prints

0
0
1

So, the function printA is being applied to the first group twice. This doesn't affect the eventual returned result of apply, but it duplicates any side effects, and if printA performed some expensive computation, it would increase the time to completion.

@jreback

Contributor

commented Jul 12, 2014

@jreback jreback closed this Jul 12, 2014

@cpcloud

Member

commented Jul 12, 2014

See #6753 as well

@jsw-fnal

Author

commented Jul 12, 2014

My apologies, I guess I was looking at the dev version of the docs, which lacks the warning? I'm not really sure how I wound up there, though.

Maybe it is worth documenting what influences the choice to take the "fast path" or the "slow path". That sounds important, from an optimization standpoint. If I have code that could work with the "fast path", I'd hate for it to decide to use the "slow path" just because I coded it one way when I could have coded it another.

@jreback

Contributor

commented Jul 12, 2014

you were looking at the 0.13.1 docs;
0.14.1 is the current version

would certainly take a PR for an update

in a nutshell:

  • fastest by far are the built-in cythonized functions (e.g. mean, sum, min, etc.)
  • fast path if you don't mutate the passed-in data inside the apply; this can be done in cython
  • slow path otherwise (this can also happen if an exception is raised in the apply)

so it's not 100% obvious what happens just by examining the code; that's why a trial happens
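The trial invocation can be made visible by counting calls through a side effect. A minimal sketch (editorial illustration; the duplicate first entry appears only on pandas versions that still run the fast-path trial):

```python
import pandas as pd

calls = []

def counter(group):
    # record every invocation; a duplicate entry for the first group
    # would reveal the fast-path trial call
    calls.append(group.name)
    return group["B"].sum()

df = pd.DataFrame({"A": [0, 0, 1, 1], "B": [0, 1, 0, 1]})
result = df.groupby("A").apply(counter)
print(calls)  # [0, 0, 1] with the trial call, [0, 1] without
```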

@jsw-fnal

Author

commented Jul 13, 2014

Can I provide hints, or even manually select the path I think applies (risking an exception or even crash if I choose wrong)?

I would really like to avoid the extra invocation of my function, because it is expensive. At present, the best I can do seems to be to deliberately raise an exception on the first call (e.g. by asking for the group's name attribute, which doesn't exist during the trial). But, it seems this will force it down the slow path, and I'm not sure this is the fastest in my use-case.

To clarify, I'm performing a relatively expensive computation involving several other columns, and creating a few new columns which I add to the DataFrame. I'm following the method of http://stackoverflow.com/a/12394122/2539647. I would guess this forces me down the slow path whether or not an exception is raised, even though I am not mutating the existing columns, just adding new ones. Is there a better way?

@jreback

Contributor

commented Jul 13, 2014

don't use apply; instead something like:

concat([func(col) for name, col in df.iteritems()], axis=1)

might work

w/o showing what you are doing it's just guessing

@jsw-fnal

Author

commented Jul 13, 2014

No, that won't work at all. I will try to be clearer. I'm actually computing the most-likely state sequence from a hidden Markov model, where each group is treated completely independently. Each row in the DataFrame corresponds to one step in the model, so each step in the state sequence can be stored as one row (well, as the value of a new column in one row) in the same DataFrame. So, I want to group, then hand each sub-frame to the Viterbi algorithm (implemented with numba), then store the resulting state sequence in a way that I can easily compare it (row-wise) with an already-existing column in the DataFrame.

In another case, I need to do things like compute the group-wise mean of one column, and compare that to the value in another column.

In both cases, groupby with apply seems the correct choice. I can iterate over the group members, but then it isn't clear what the best way is to stitch the results of the individual groups back together into a new column in the DataFrame. Also, I hate to iterate when I don't care about the order, since you might one day implement a magical concurrent dispatch for groupby().apply() (or whatever I wind up using if something else is more appropriate).

Unless I've misunderstood something, the DataFrame.iteritems() you recommend iterates over columns. That isn't what I need at all. I need to use several (sometimes all) columns in different ways in a single computation, but that computation is group-specific.
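The second case (a group-wise mean of one column compared row-wise against another column) needs no apply at all: transform broadcasts the per-group statistic back to row shape, so the comparison becomes a plain vectorized operation. A sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"grp": [0, 0, 1, 1],
                   "x":   [1.0, 3.0, 10.0, 20.0],
                   "y":   [2.0, 2.0, 14.0, 16.0]})

# per-group mean of x, broadcast to one value per row
group_mean = df.groupby("grp")["x"].transform("mean")

# plain row-wise comparison against the other column
exceeds = df["y"] > group_mean
print(exceeds.tolist())  # [False, False, False, True]
```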

@jreback

Contributor

commented Jul 13, 2014

code is worth a thousand words here

@jsw-fnal

Author

commented Jul 13, 2014

OK, this is the one that isn't so expensive, but it has the advantage that it doesn't require me to include my numba implementation of the Viterbi algorithm, which is not really relevant here.

import numpy as np
import pandas as pd

df = pd.DataFrame({'groupNum': [0, ..., 0, 1, ..., 1, 2, ..., 2, ...],
                   'EventTime': [1331001304, 1331001307, 1331001315, ..., 1580225652, 1580225653, 1580225658, ...],
                   'LiveTime': [1258, 2380, 737, ..., 139, 4452, 3, ...]})

def Cuts(group):
    EvTimes = group.EventTime - np.amin(group.EventTime)
    EventTimeHist, EventTimeBins = np.histogram(EvTimes, bins=np.arange(0, np.amax(EvTimes) + 201, 200),
                                                weights=np.ones(len(EvTimes)) / 200)
    CutBins1 = EventTimeHist > 3
    binnumbers = np.digitize(EvTimes, bins=EventTimeBins)
    group['Cut1'] = CutBins1[binnumbers - 1]

    passing1 = EventTimeHist[EventTimeHist < 3]
    CutBins2 = EventTimeHist > (np.median(passing1) + 8 * np.sqrt(np.median(passing1)))
    group['Cut2'] = CutBins2[binnumbers - 1]

    bursting = group.Cut1 | group.Cut2
    if np.sum(bursting) != 0:
        Nburst = np.sum(bursting)
        Nnoburst = np.sum(~bursting)
        BurstLive = np.sum(group.LiveTime[bursting])
        NoBurstLive = np.sum(group.LiveTime[~bursting])

        group['Cut3'] = (((Nburst / BurstLive) < 3) | ((Nburst / BurstLive) < 2 * (Nnoburst / NoBurstLive)))
    else:
        group['Cut3'] = ((len(group) / np.sum(group.LiveTime)) > 3)
    return group

df_with_cuts = df.groupby('groupNum').apply(Cuts)

Then, I have a DataFrame that has three new columns, Cut1, Cut2, and Cut3.

You can see that I need all the columns at once, and I need to add a column, and I need to handle things on a group-by-group basis. Is this clearer?

I suppose that, strictly speaking, I don't need to add a column. I just need a sequence of homogeneously-typed data elements that has an exact item-to-row correspondence with the existing DataFrame. But if I had that, then I could trivially add it as a column, so saying "I need to add a column" is just shorthand. I hope that's clear.

@jreback

Contributor

commented Jul 13, 2014

Your example is not reproducible, try something like this to generate your data. A copy-paste example is a must for a problem like this.

In [49]: N = 3000

In [50]: np.random.seed(1234)

In [51]: pd.set_option('max_rows', 10)

In [52]: df = DataFrame({'groupNum': np.repeat(np.arange(3), N/3),
   ....:                 'EventTime': np.sort(np.random.choice(np.arange(1331001304, 1580225658), N)),
   ....:                 'LiveTime': np.random.randint(0, 1000, size=N)})

In [53]: df
Out[53]: 
       EventTime  LiveTime  groupNum
0     1331069944       301         0
1     1331213814       479         0
2     1331389500       404         0
3     1331421395       458         0
4     1331656978       241         0
...          ...       ...       ...
2995  1579725319       822         2
2996  1579843342       685         2
2997  1579956576       357         2
2998  1580018358       335         2
2999  1580051273       217         2

[3000 rows x 3 columns]

Most of the operations can simply be refactored out of the groupby, e.g.

This is EvTimes = group.EventTime - amin(group.EventTime)

g = df.groupby('groupNum')
df['EventTime']-g['EventTime'].transform('min')

you should simply use pd.cut rather than np.digitize and np.histogram. I can't look into your example w/o a repro as above.

Try to do as much as possible outside the groupby, and then use the cythonized ops to make this faster
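The EvTimes refactoring can be checked end-to-end with a small self-contained version of the generated frame (N shrunk, and randint in place of the memory-hungry arange/choice combination):

```python
import numpy as np
import pandas as pd

np.random.seed(1234)
N = 30
df = pd.DataFrame({
    "groupNum": np.repeat(np.arange(3), N // 3),
    "EventTime": np.sort(np.random.randint(1331001304, 1580225658, size=N)),
    "LiveTime": np.random.randint(0, 1000, size=N),
})

# cythonized per-group minimum, broadcast back to row shape
ev_times = df["EventTime"] - df.groupby("groupNum")["EventTime"].transform("min")

# every group now starts at zero, matching the in-apply version
print(ev_times.groupby(df["groupNum"]).min().tolist())  # [0, 0, 0]
```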

@jsw-fnal

Author

commented Jul 13, 2014

I think we're talking at cross purposes. I'm asking a general question (and making a general suggestion and/or feature request), not asking for help optimizing my code. You seemed to fundamentally misunderstand what I was talking about (the iteritems() example was not remotely applicable) and you requested my specific code. I thought you wanted the code to help clarify the general question, but you dove in and tried to tell me how to optimize it. Not what I was looking for.

Here is some code which illustrates the general problem without providing specifics to get distracted by:

def func(group):
    # this must be applied group-wise for reasons beyond my control
    group['newcol'] = expensive_and_effectful_function(group['col1'], group['col2'], group['col3'])
    return group

# newdf = df.groupby(foo).apply(func)  # don't do this: it is expensive and the side effects clobber something

newstuff = [expensive_and_effectful_function(group['col1'], group['col2'], group['col3'])
            for name, group in df.groupby(foo)]
# But now what do I do with newstuff to get back a dataframe with the new computed column included?

def make_alternate_func():
    # a mutable cell so the nested function can flip the flag
    # (plain assignment to a closed-over name would raise UnboundLocalError)
    state = {'first_run': True}
    def alternate_func(group):
        if state['first_run']:
            state['first_run'] = False
            raise Exception
        return func(group)
    return alternate_func

newdf = df.groupby(foo).apply(make_alternate_func())  # works, but I gather it uses the "slow path" because it raises an exception

By "effectful", I mean something like disk writes, not modifications to the preexisting columns of group. All I want to do is to compute a new column, not modify the old columns. Does that count as "mutating the passed in data"? Is expensive_and_effectful_function doomed to follow the "slow path" no matter what? If that's the case, then I'll just raise an exception on the first run and avoid the duplication that way, even though it's an ugly kludge.

To put it another way, the duplication of the first run is to allow apply to figure something out. If I already know what it is trying to figure out, can I tell it and save it the trouble?
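One answer to the "now what do I do with newstuff" question above: if each per-group result is a Series that keeps its group's row index, pd.concat glues the pieces together and index alignment puts each value on the right row when it is assigned back. A sketch with a stand-in for expensive_and_effectful_function:

```python
import pandas as pd

df = pd.DataFrame({"grp": [0, 0, 1, 1], "col1": [1.0, 2.0, 3.0, 4.0]})

def expensive_stub(col1):
    # stand-in for expensive_and_effectful_function; the returned
    # Series carries the group's original row index
    return col1 - col1.mean()

newstuff = [expensive_stub(group["col1"]) for name, group in df.groupby("grp")]

# concat preserves each piece's index; assignment aligns on it
df["newcol"] = pd.concat(newstuff)
print(df["newcol"].tolist())  # [-0.5, 0.5, -0.5, 0.5]
```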

And, lest I come off as too annoyed, demanding, or ungrateful, let me take the opportunity to thank all the pandas developers/contributors. It is really a wonderful tool, and I use it almost daily.

@jreback

Contributor

commented Jul 13, 2014

generally I see comments about wanting to do something when in fact the user is not approaching the problem in a pandonic / efficient manner. that's why seeing runnable code is useful. you can take my suggestions and use them (or not). as I said, you should do things outside of the groupby as much as possible

this may help as well: http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups
it may make sense not to use apply in your case

as far as generally making an option available to choose which path pandas takes: that is self-defeating and too complicated. it's an implementation detail. you can simply look at the code if you care, or use a groupby iteration if you want to completely avoid side effects (which it seems you do)

no worries otherwise - just trying to make pandas generally useful with a consistent and as-simple-as-possible API (goals which are of course frequently at cross purposes)

@jsw-fnal

Author

commented Jul 13, 2014

It's a very slick, high-level API. But it tries to guess what the user wants, and sometimes this results in very unexpected behavior that is quite difficult to work around without the user including snippets of code that are completely unrelated to what the user wants to do. Personally, I am not a fan of APIs that try to be too smart. I know what I want, and I want to tell the computer to do what I want.

So, feature request: could you please provide a lower-level API that would allow more direct control over the flow of execution? Just a simple slow_apply method would do. It would even allow a bit of refactoring, because you could take the BaseGrouper.apply() method, and turn the lower half of it into a call to slow_apply.

Currently, apply does exactly what I want, except that it calls my function an extra time, which at minimum violates the principle of least surprise and at worst destroys data. If I use a groupby iteration, then I have to repeat the code (which already exists in the groupby module) for taking the results and gluing them back together. This violates the DRY principle.

I'm working up a minimal working example of my other use-case for groupby().apply(); perhaps it will be more useful in this discussion, because it really can't be broken up into in- and out-of groupby steps. I'll post a notebook when I'm done.

@jsw-fnal

Author

commented Jul 13, 2014

I see that there is an old enhancement request open to add the slow_apply method I'm requesting (or something similar). So let me lend my voice to those calling for such a thing.

Alternatively, in some cases it might be possible to capture the results of fast_apply from the first group, and not call the user function again in those cases.

Maybe at some point I'll even take a crack at implementing these things myself.

FWIW, I don't buy the argument that these things are an "implementation detail" -- The way in which my code is called is very important to me, and that brings it quite a bit beyond the level of "implementation detail".
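The "capture the results of the first call" idea above can also be approximated from user code: wrap the function in a cache keyed by group label, so that even if apply invokes it twice for the first group, the expensive body runs once per group. A sketch (apply_once is a hypothetical helper, not a pandas API):

```python
import pandas as pd

def apply_once(grouped, func):
    # hypothetical helper: cache results by group label so a duplicate
    # trial invocation reuses the first result instead of re-running func
    cache = {}

    def wrapper(group):
        if group.name not in cache:
            cache[group.name] = func(group)
        return cache[group.name]

    return grouped.apply(wrapper)

df = pd.DataFrame({"A": [0, 0, 1, 1], "B": [1, 2, 3, 4]})
calls = []

def expensive(group):
    calls.append(group.name)  # count how often the real work runs
    return group["B"].sum()

result = apply_once(df.groupby("A"), expensive)
print(len(calls))  # 2: once per group, even if apply ran a trial call
```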

@jreback

Contributor

commented Jul 13, 2014

well pull requests are welcome

and it IS an implementation detail:
you are providing a function for pandas to call.
the fact that it may be called twice (or in an arbitrary order) is up to the caller,
not up to the function - hence it's a detail of the implementation

if you abide by the API of the groupby and don't have side effects then there is no issue.
pandas is trying to accommodate an arbitrary function here.
it could simply raise if you have side effects, which I would call an API decision,
but the fact that it instead deals with them is implementation

if you want to have your function called in a particular way then go ahead and don't use apply

@jreback

Contributor

commented Jul 13, 2014

I think the way to do this is to add a parameter
maybe: try_fastpath=True to the groupby

then the user can change that if wanted

using the already computed group is done in transform iirc
I think it could be done generally but just wasn't implemented that way (in apply)

@jsw-fnal

Author

commented Jul 13, 2014

This seems like a good possibility to me. I would really appreciate it.

It might also help to emit a warning when the fast path is attempted and fails. IPython has a nice facility for displaying a warning only the first time it appears; I don't know exactly how it works, but it seems to be automatic.
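For plain Python, the standard warnings module already behaves this way: with the default filters, a given warning is shown only once per call site, so a hypothetical fast-path-failure warning would not repeat for every group:

```python
import warnings

def fast_path_failed():
    # hypothetical notification point; with the default warning
    # filters this message prints only once per call site
    warnings.warn("fast path failed; falling back to slow apply",
                  RuntimeWarning, stacklevel=2)

for _ in range(3):
    fast_path_failed()  # the warning appears once, not three times
```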

@jreback jreback added this to the 0.25.0 milestone Jan 26, 2019

@jreback jreback added the Groupby label Jan 26, 2019
