Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: cythonize groupby.count #7016

Merged
merged 5 commits into from May 5, 2014
Merged

ENH: cythonize groupby.count #7016

merged 5 commits into from May 5, 2014

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Apr 30, 2014

closes #7003

  • vbench
  • axis parameter (this only works in non-cython land)
  • tests for object dtype if there are none
  • datetime64/timedelta64 count

vbench results:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count                          |   7.3980 | 6814.2579 |   0.0011 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

@cpcloud
Copy link
Member Author

cpcloud commented Apr 30, 2014

@jreback axis parameter has no effect here...i kept for back compat with test_multilevel... i think should prob be implemented (i'm happy to do it)

@cpcloud cpcloud self-assigned this Apr 30, 2014
@cpcloud cpcloud added this to the 0.14.0 milestone Apr 30, 2014
@jreback
Copy link
Contributor

jreback commented Apr 30, 2014

Yeh not sure what it does in a groupby

as an aside why do u still need the lambda expression in count? is it called?

@cpcloud
Copy link
Member Author

cpcloud commented Apr 30, 2014

oh actually u need to pass a npfunc argument to _groupby_function ... fallback

@jreback
Copy link
Contributor

jreback commented Apr 30, 2014

fallback if the dtype does work / match?

FYI I don't think group count will work for datetime64; need to convert to i8 view and then compare against tslib.iNaT for nulls

@cpcloud
Copy link
Member Author

cpcloud commented Apr 30, 2014

fallback is called if any exceptions occur in the cython call

@cpcloud
Copy link
Member Author

cpcloud commented Apr 30, 2014

usually that is a dtype mismatch

@cpcloud cpcloud changed the title ENH: cython groupby.count ENH: cythonize groupby.count May 1, 2014
@cpcloud
Copy link
Member Author

cpcloud commented May 2, 2014

@jreback supporting date-likes here i think requires more than i originally set out to do with this ... about to push object dtype for counts

@jreback
Copy link
Contributor

jreback commented May 5, 2014

I think datelikes DO need to be handled (you can't just take a i8 view when you pass in?) and then do your comparisons instead of

if val != val

if val != tslib.iNaT

for nan testing?

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

Yeah it's there. I was trying to implement ttimedelta arithmetic for groupby but the py26 oddities are in the way. I'll do it in another pr

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

Sorry the numpy 1.6 oddities are there

@jreback
Copy link
Contributor

jreback commented May 5, 2014

ok...no problem (or leave it in python land). is ok too for now; can't leave it hanging though because would break things, ahh yes, the 2.6 oddities, so maybe you can just make it use python and M8/m8 in 2.6 (e.g. have a branch in the code), ugly but what the hey

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

@jreback yep took out the cython attempt ... now just squashing ... making sure that the cython path is hit for dates/timedeltas (but only count); current behavior is to raise ... i think there's an issue somewhere about groupby with deltas

@@ -2156,11 +2238,14 @@ def generate_put_template(template, use_ints = True, use_floats = True):
('int32', 'int32_t', 'float64_t', 'np.float64'),
('int64', 'int64_t', 'float64_t', 'np.float64'),
]
object_list = [('object', 'object', 'float64_t', 'np.float64')]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should float really be here? (as that should be the use_floats)? no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes float should be here use_(float|int)s is really only for the first two arguments which refer to the name and the ctype, but the result of a groupby operation is always a float (even count, which is astyped to int64 as the last method call before returning to the user).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh..right..np

@jreback
Copy link
Contributor

jreback commented May 5, 2014

vbench?

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

vbench coming soon

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

@jreback

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count_dates                    |  20.4120 | 15484.0871 |   0.0013 |
groupby_multi_count                          |  20.8774 | 15609.8307 |   0.0013 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

758x speedup, think we're doing ok :)

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

ah object dtype is still slow , might be another one of my dumb typos

@jreback
Copy link
Contributor

jreback commented May 5, 2014

can you make the vbench a bit smaller (so the base case takes say 1/10 the time)? (shouldn't really affect the speedup much), which is nice!

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

sure thing

@jreback
Copy link
Contributor

jreback commented May 5, 2014

when you do v0.14.0, put in the perf section FYI (release notes goes in improvements)

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

will do thanks for the heads up

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

@jreback can we get this in soon as green? want to avoid doc merge conflicts.

@jreback
Copy link
Contributor

jreback commented May 5, 2014

sure go ahead on green

cpcloud added a commit that referenced this pull request May 5, 2014
@cpcloud cpcloud merged commit eb3b677 into pandas-dev:master May 5, 2014
@jreback
Copy link
Contributor

jreback commented May 5, 2014

thinking about this

int types can never be nan at all
and only int64 need comparisons against iNaT, so if less than int64 you don't even need a routine you can just call size (eg it's just the size of the group themselves -/ can't have Nan's)

object types need comparison against the actual NaT value (I think)

wonder if any of that even matters

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

Right, really only float, object, and int64 and those are really a view on dates or time deltas which are just ints really. Could have size as the fallback function instead of current version which computes sum of nonnull. Then if I take out the generated for ints it will raise and use the fallback.

@jreback
Copy link
Contributor

jreback commented May 5, 2014

yep
sorry should have mentioned before

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

I can do some microbenchmarks. If size is done with Len then we go from linear in the number of rows to constant so def worth an investigation

@cpcloud
Copy link
Member Author

cpcloud commented May 5, 2014

Pandas isn't on iPhone so this may have to wait until tomorrow :/

@jreback
Copy link
Contributor

jreback commented May 5, 2014

hah

no size is already computed (when the groupby happens) so should be trivial

more worried about the issue of the type coercion in the current cython code vs iNaT which is a bigger value than say int8
though it prob does work

@cpcloud cpcloud deleted the groupby-count-cython branch May 6, 2014 03:45
@cpcloud
Copy link
Member Author

cpcloud commented May 6, 2014

@jreback i have the pr with size do u want me to put it up?

@jreback
Copy link
Contributor

jreback commented May 6, 2014

sure

@jreback
Copy link
Contributor

jreback commented May 6, 2014

can you also post the vbench results at the top of the PR?

@cpcloud
Copy link
Member Author

cpcloud commented May 6, 2014

yep np

@jreback
Copy link
Contributor

jreback commented May 6, 2014

oh...I meant the cythonized count vs before you merged it in! (in this PR)

because when someone clicks on the release notes they go to this issue

@cpcloud
Copy link
Member Author

cpcloud commented May 6, 2014

yep ... that's coming running it now

@jreback
Copy link
Contributor

jreback commented May 8, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: cythonize groupby.count
2 participants