ENH: add to cython groupby functions #4095

Closed
jreback opened this Issue Jul 1, 2013 · 22 comments

Contributor

jreback commented Jul 1, 2013

Contributor

jtratner commented Jul 1, 2013

@jreback can I take this? If it's for 0.13 you have some time and I'd be really interested in a chance to dig into more of the Cython internals.

Contributor

jreback commented Jul 1, 2013

go for it! you will need to add a function template in src/generate_code.py and hook it up in the appropriate places in groupby.py

Contributor

jtratner commented Jul 1, 2013

@jreback cool - thanks :) I'm looking forward to figuring all that out

Contributor

jtratner commented Jul 9, 2013

@jreback were you thinking this would cover time series shifting too?

Contributor

jreback commented Jul 9, 2013

yes. the index type actually doesn't matter though; it's based on positional shifting (the time-series stuff happens at a higher level and is just translated to positions to move anyhow)
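To illustrate the point about positions, here is a minimal sketch (the data and variable names are made up for illustration, not taken from the pandas source) showing that a plain `shift(1)` moves values by position and gives the same values whether the index is integer-based or datetime-based:

```python
import pandas as pd

values = [10, 20, 30]

# Same values, two different index types (both hypothetical examples).
by_int = pd.Series(values, index=[0, 1, 2])
by_date = pd.Series(values, index=pd.date_range("2013-07-01", periods=3, freq="D"))

# A plain shift moves the data by one position; the index is left untouched.
shifted_int = by_int.shift(1)
shifted_date = by_date.shift(1)

print(shifted_int.tolist())
print(shifted_date.tolist())
```

In both cases the first slot becomes missing and the remaining values slide down by one row.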

Contributor

jtratner commented Jul 9, 2013

@jreback - if it's already translated to positions to move, then that makes it much simpler. Thanks!

Contributor

jreback commented Jul 9, 2013

yep...look at pandas/core/frame/shift....

jtratner was assigned Aug 23, 2013

Contributor

jtratner commented Nov 10, 2013

I've let this slide... I will try to circle back to this when I have a chance, but if someone else wants to take it go go go

jreback modified the milestone: 0.15.0, 0.14.0 Feb 18, 2014

jreback added the Groupby label Feb 18, 2014

jreback changed the title from ENH: add shift to cython groupby functions to ENH: add to cython groupby functions Feb 11, 2015

Contributor

iwschris commented Feb 15, 2015

I've started looking at this one.

Contributor

iwschris commented Feb 20, 2015

@jreback I have a good start here on shift, but I need a little guidance to make sure that I don't reinvent something that's already in place. There's a fair amount of logic in groupby.py that expects the output of a groupby op to be compressed. However shift and cumsum shouldn't compress their result. The output length will be the same as the input length.

Currently I've done this by creating a list of ops that won't compress (currently just shift and cumsum I think) and then everywhere that the shape is expected to get compressed, I've added logic to keep it from doing so. Is that the right approach or has this already been done with some other groupby op that I'm not aware of?
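For readers following along, the difference between "compressing" and "non-compressing" groupby ops can be sketched with the public API. This is only an illustration on made-up data (the column names `key` and `val` are assumptions); it says nothing about how the Cython internals are organized:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1, 2, 3, 4, 5],
})
g = df.groupby("key")["val"]

# A reduction compresses: one output row per group.
reduced = g.sum()

# shift and cumsum do not compress: one output row per input row,
# aligned to the original index.
shifted = g.shift(1)
running = g.cumsum()

print(len(df), len(reduced), len(shifted), len(running))
```

Here the reduction has one entry per group, while the shifted and cumulative results keep the length of the input.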

Contributor

jreback commented Feb 20, 2015

these are kind of like (and I think should be implemented like) transform ops

another example is fillna
these return an object the same size as the input

can you show the branch you have so far?

Contributor

iwschris commented Feb 20, 2015

Alright, that helps. Let me get this cleaned up a little bit and I'll start a PR so that we can look at real code.

jreback modified the milestone: 0.16.0, Next Major Release Mar 6, 2015

Contributor

chris-b1 commented Aug 21, 2015

jreback modified the milestone: 0.17.1, Next Major Release Oct 11, 2015

chris-b1 added a commit to chris-b1/pandas that referenced this issue Nov 15, 2015

PERF: Cythonize groupby transforms #4095 9d11734

Contributor

jreback commented Nov 16, 2015

closed by #10901

jreback closed this Nov 16, 2015

randomgambit commented Apr 22, 2016 edited

@jreback sorry to revive this, but in my dataframe a simple

df['duration'] = df.groupby('prop', sort=False).sale_dt.transform(lambda x: x.shift(-1) - x)

takes forever. Is this just because the data is big, or is this related to the old problems mentioned here? Happy to help if I can!

Contributor

jreback commented Apr 22, 2016

anything with a lambda function will by definition be slow; it's basically a Python loop.

but what you are doing is NOT a transform, which must return a scalar per group.

Contributor

jreback commented Apr 22, 2016

you probably want

df.sale_dt - df.groupby('prop').sale_dt.transform('shift', -1)
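A small sketch of the comparison, on made-up data (the dates below are hypothetical, and `grouped.shift(-1)` is used directly rather than going through `transform`). Note that the original lambda computes next sale minus current sale (`x.shift(-1) - x`), so to get the same sign the shifted term goes first:

```python
import pandas as pd

df = pd.DataFrame({
    "prop": ["a", "a", "a", "b", "b"],
    "sale_dt": pd.to_datetime([
        "2015-01-01", "2015-03-01", "2015-06-01",  # property a
        "2015-02-01", "2015-02-11",                # property b
    ]),
})
grouped = df.groupby("prop", sort=False)["sale_dt"]

# Slow path: the lambda is called once per group in Python.
slow = grouped.transform(lambda x: x.shift(-1) - x)

# Fast path: one grouped shift, then a single vectorized subtraction.
fast = grouped.shift(-1) - df["sale_dt"]

print(fast)
```

The last sale of each property has no following sale, so its duration is missing (NaT) in both versions.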

randomgambit commented Apr 22, 2016 edited

Thanks Jeff

but what you are doing is NOT a transform, which must return a scalar per group.

Wait, but my understanding is that transform is either

  • one scalar per group (I compute the mean of x for every group, and I want that mean merged back to the original data),
  • OR some variation of the same data (aka some transformation, such as standardizing the values in the dataframe by demeaning and dividing by the standard deviation). Have I missed something?
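
The two cases in the list above can be sketched as follows. This is a minimal illustration on made-up data (the column names `key` and `x` are assumptions); in both cases the result has one row per input row:

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 30.0],
})
g = df.groupby("key")["x"]

# Case 1: one scalar per group, broadcast back to the original rows.
group_mean = g.transform("mean")

# Case 2: a same-shaped transformation within each group
# (demean and divide by the group's standard deviation).
# The lambda runs once per group in Python, so it is the slow route.
standardized = g.transform(lambda x: (x - x.mean()) / x.std())

# The same standardization built only from named transforms.
standardized_fast = (df["x"] - group_mean) / g.transform("std")

print(group_mean.tolist())
print(standardized.round(4).tolist())
```

Building the result from named ops such as `"mean"` and `"std"` avoids the per-group Python call that the lambda version needs.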
Contributor

jreback commented Apr 22, 2016

that's what I said
you can use apply if you want
but using a lambda will simply be slow always

got it thanks.

by the way df.sale_dt - df.groupby('prop').sale_dt.transform('shift', -1) is perfect

randomgambit commented Apr 22, 2016 edited

I think I will write a book soon:

the 100 most common errors every Pandas user has to make

bestseller on amazon for sure
