ENH: add to cython groupby functions #4095

Closed
3 tasks done
jreback opened this issue Jul 1, 2013 · 22 comments
Labels
Enhancement, Groupby, Performance (memory or execution speed performance)
Milestone

Comments

@jreback
Contributor

jreback commented Jul 1, 2013

@jtratner
Contributor

jtratner commented Jul 1, 2013

@jreback can I take this? If it's for 0.13 you have some time and I'd be really interested in a chance to dig into more of the Cython internals.

@jreback
Contributor Author

jreback commented Jul 1, 2013

Go for it! You will need to add a function template in src/generate_code.py and hook it up in the appropriate places in groupby.py.

@jtratner
Contributor

jtratner commented Jul 1, 2013

@jreback cool - thanks :) I'm looking forward to figuring all that out

@jtratner
Contributor

jtratner commented Jul 9, 2013

@jreback were you thinking this would cover time series shifting too?

@jreback
Contributor Author

jreback commented Jul 9, 2013

Yes. The index type actually doesn't matter, though; it's based on position shifting (the time-series stuff happens at a high level and is just translated to positions to move anyhow).
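To make the point about positions concrete, here is a minimal pure-Python sketch of a positional shift within groups. It is only an illustration, not the pandas or Cython implementation; the function name `group_shift_positions` and the convention of returning -1 for "no source row" are assumptions made for this example.

```python
def group_shift_positions(labels, periods):
    """For each row, return the position of the row whose value it would
    receive after shifting by `periods` within its own group, or -1 if
    there is no such row (the slot would be filled with a missing value).

    Hypothetical helper for illustration only.
    """
    # Collect row positions per group, preserving the original row order.
    rows_by_group = {}
    for pos, label in enumerate(labels):
        rows_by_group.setdefault(label, []).append(pos)

    out = [-1] * len(labels)
    for rows in rows_by_group.values():
        for k, row in enumerate(rows):
            src = k - periods  # position inside the group to pull from
            if 0 <= src < len(rows):
                out[row] = rows[src]
    return out


labels = ["a", "b", "a", "b", "a"]
print(group_shift_positions(labels, 1))   # [-1, -1, 0, 1, 2]
print(group_shift_positions(labels, -1))  # [2, 3, 4, -1, -1]
```

The index never appears in the sketch: only the group labels and the row order matter, which is why the index type is irrelevant at this level.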

@jtratner
Contributor

jtratner commented Jul 9, 2013

@jreback - if it's already translated to positions to move, then that makes it much simpler. Thanks!


@jreback
Contributor Author

jreback commented Jul 9, 2013

yep...look at pandas/core/frame/shift....

@ghost ghost assigned jtratner Aug 23, 2013
@jtratner
Contributor

I've let this slide... I will try to circle back to this when I have a chance, but if someone else wants to take it go go go

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014
@jreback jreback changed the title from "ENH: add shift to cython groupby functions" to "ENH: add to cython groupby functions" Feb 11, 2015
@chrisbyboston

I've started looking at this one.

@chrisbyboston

@jreback I have a good start here on shift, but I need a little guidance to make sure that I don't reinvent something that's already in place. There's a fair amount of logic in groupby.py that expects the output of a groupby op to be compressed. However, shift and cumsum shouldn't compress their result: the output length will be the same as the input length.

So far I've done this by creating a list of ops that won't compress (just shift and cumsum for now, I think) and then, everywhere the shape is expected to be compressed, adding logic to keep it from doing so. Is that the right approach, or has this already been done with some other groupby op that I'm not aware of?

@jreback
Contributor Author

jreback commented Feb 20, 2015

These are kind of like transform ops (and I think should be implemented like them).

Another example is fillna.
These return an object of the same size as the input.

Can you show the branch that you have so far?
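The distinction between "compressed" and same-sized results discussed above can be seen with a small sketch using the current public pandas API. The frame and the column names `key` and `val` are made up for the example, and it says nothing about how the Cython kernels are organized internally.

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1, 2, 3, 4, 5],
})
g = df.groupby("key")["val"]

# A reduction compresses: one output row per group.
reduced = g.sum()

# Transform-like ops keep the input length and the original index.
shifted = g.shift(1)    # previous value within the same group
running = g.cumsum()    # running sum within the same group
product = g.cumprod()   # running product within the same group

print(len(df), len(reduced), len(shifted), len(running), len(product))
```

Here `reduced` has one entry per key, while `shifted`, `running`, and `product` are aligned row by row with `df`.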

@chrisbyboston

Alright, that helps. Let me get this cleaned up a little bit and I'll start a PR so that we can look at real code.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@chris-b1
Contributor

cumprod

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 11, 2015
chris-b1 added a commit to chris-b1/pandas that referenced this issue Nov 15, 2015
@jreback
Contributor Author

jreback commented Nov 16, 2015

closed by #10901

@jreback jreback closed this as completed Nov 16, 2015
@Oleg-Krivosheev

@randomgambit

randomgambit commented Apr 22, 2016

@jreback sorry to revive this but in my dataframe, a simple

df['duration']=df.groupby('prop', sort=False).sale_dt.transform(lambda x: x.shift(-1)-x)

takes just forever. Is this just because the data is big, or is this related to the old problems that are mentioned here? Happy to help if I can!

@jreback
Contributor Author

jreback commented Apr 22, 2016

Anything with a lambda function will by definition be slow; it's basically a Python loop.

but what you are doing is NOT a transform, which must return a scalar per group.

@jreback
Contributor Author

jreback commented Apr 22, 2016

you probably want

df.sale_dt - df.groupby('prop').sale_dt.transform('shift', -1)
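As a rough sketch of the two approaches on a toy frame (the data below is invented; only the column names `prop` and `sale_dt` come from the thread), the group-wise shift can be computed once and the subtraction done on whole columns. Note that the one-liner above subtracts in the opposite order from the original lambda (`x.shift(-1) - x`); the sketch keeps the lambda's order so the two results can be compared directly, and it calls `.shift(-1)` on the groupby object, which is assumed here to be equivalent to `transform('shift', -1)`.

```python
import pandas as pd

df = pd.DataFrame({
    "prop": ["a", "a", "b", "a", "b"],
    "sale_dt": pd.to_datetime([
        "2015-01-01", "2015-03-01", "2015-02-01", "2015-06-01", "2015-02-15",
    ]),
})
grouped = df.groupby("prop", sort=False)["sale_dt"]

# Original approach: the lambda is called once per group in Python.
slow = grouped.transform(lambda x: x.shift(-1) - x)

# Vectorized approach: one group-wise shift, then a plain column subtraction.
fast = grouped.shift(-1) - df["sale_dt"]

print(fast)
print(fast.equals(slow))
```

The last sale of each property has no following sale, so its duration is missing (NaT) in both versions.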

@randomgambit

randomgambit commented Apr 22, 2016

Thanks Jeff

but what you are doing is NOT a transform, which must return a scalar per group.

Wait, but my understanding is that transform is either

  • one scalar per group (I compute the mean of x for every group, and I want that mean merged back to the original data),
  • OR some variation of the same data (aka some transformation, such as standardizing the values in the dataframe by demeaning and dividing by the standard deviation). Have I missed something?
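Both cases in the list above can be sketched with string aggregations passed to `transform`, which avoids a per-group lambda. The frame and the column names `key` and `val` are hypothetical; `std` here is the pandas default sample standard deviation (ddof=1).

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1.0, 10.0, 3.0, 20.0, 5.0],
})
g = df.groupby("key")["val"]

# Case 1: one scalar per group, broadcast back to every row of that group.
group_mean = g.transform("mean")

# Case 2: a same-shaped transformation, here standardizing within each group.
group_std = g.transform("std")
zscore = (df["val"] - group_mean) / group_std

print(group_mean.tolist())
print(zscore.round(4).tolist())
```

In both cases the result has the same length and index as `df`, so it can be assigned straight back as a new column.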

@jreback
Contributor Author

jreback commented Apr 22, 2016

that's what I said
you can use apply if you want
but using a lambda will simply be slow always

@randomgambit

got it thanks.

by the way df.sale_dt - df.groupby('prop').sale_dt.transform('shift', -1) is perfect

@randomgambit

randomgambit commented Apr 22, 2016

I think I will write a book soon:

the 100 most common errors every Pandas user has to make

bestseller on amazon for sure

6 participants