ENH: cythonize groupby.count #7016

cpcloud · 2014-04-30T22:12:47Z

vbench
~~axis parameter~~ (this only works in non-cython land)
tests for object dtype if there are none
datetime64/timedelta64 count

vbench results:

-------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms]  |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count                          |   7.3980 |  6814.2579 |   0.0011 |
-------------------------------------------------------------------------------

cpcloud · 2014-04-30T22:38:52Z

@jreback axis parameter has no effect here...i kept for back compat with test_multilevel... i think should prob be implemented (i'm happy to do it)

jreback · 2014-04-30T22:40:56Z

Yeh not sure what it does in a groupby

as an aside why do u still need the lambda expression in count? is it called?

cpcloud · 2014-04-30T22:43:44Z

oh actually u need to pass a npfunc argument to _groupby_function ... fallback

jreback · 2014-04-30T22:45:37Z

fallback if the dtype doesn't work / match?

FYI I don't think group count will work for datetime64; need to convert to i8 view and then compare against tslib.iNaT for nulls

cpcloud · 2014-04-30T22:47:52Z

fallback is called if any exceptions occur in the cython call

cpcloud · 2014-04-30T22:48:30Z

usually that is a dtype mismatch
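A minimal sketch of the fallback pattern described above, assuming a typed fast path that raises on unsupported input. The names `fast_count`, `slow_count`, and `count_with_fallback` are hypothetical stand-ins, not the actual pandas internals:

```python
def fast_count(values):
    """Hypothetical stand-in for a typed Cython kernel that only accepts floats."""
    if not all(isinstance(v, float) for v in values):
        raise TypeError("unsupported dtype")
    # NaN is the only float that is not equal to itself
    return sum(1 for v in values if v == v)


def slow_count(values):
    """Hypothetical generic fallback that works on arbitrary Python objects."""
    return sum(1 for v in values if v is not None and v == v)


def count_with_fallback(values):
    # EAFP: try the fast path first, fall back on any exception
    # (usually a dtype mismatch, as noted in the thread).
    try:
        return fast_count(values)
    except Exception:
        return slow_count(values)


float_result = count_with_fallback([1.0, float("nan"), 2.0])  # fast path
object_result = count_with_fallback(["a", None, "b"])          # fallback path
```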

cpcloud · 2014-05-02T23:21:42Z

@jreback supporting date-likes here i think requires more than i originally set out to do with this ... about to push object dtype for counts

jreback · 2014-05-05T00:03:39Z

I think datelikes DO need to be handled (you can't just take a i8 view when you pass in?) and then do your comparisons instead of

if val != val

if val != tslib.iNaT

for nan testing?
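A rough sketch of the two null tests being compared here, written in plain NumPy rather than Cython. It assumes `iNaT` is the smallest int64 value, which is how NumPy stores NaT; the helper names are hypothetical:

```python
import numpy as np

# Assumption: iNaT is the minimum int64, the integer NumPy uses for NaT.
iNaT = np.iinfo(np.int64).min


def count_floats(values):
    """Count non-null values with the self-equality test (val != val means NaN)."""
    n = 0
    for val in values:
        if val == val:
            n += 1
    return n


def count_datelike(values):
    """Count non-null datetime64/timedelta64 values through an int64 view."""
    n = 0
    for val in values.view("i8"):
        if val != iNaT:
            n += 1
    return n


floats = np.array([1.0, np.nan, 3.0])
dates = np.array(["2014-04-30", "NaT", "2014-05-05"], dtype="datetime64[ns]")

float_count = count_floats(floats)
date_count = count_datelike(dates)
# The self-equality test cannot see NaT in an i8 view: every integer equals itself.
naive_date_count = count_floats(dates.view("i8"))
```

The last line shows why the comparison has to change for date-likes: on the int64 view the NaN-style test counts the NaT row as valid.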

cpcloud · 2014-05-05T00:05:49Z

Yeah it's there. I was trying to implement timedelta arithmetic for groupby but the py26 oddities are in the way. I'll do it in another pr

cpcloud · 2014-05-05T00:06:12Z

Sorry the numpy 1.6 oddities are there

jreback · 2014-05-05T00:07:48Z

ok...no problem (or leave it in python land), that's ok too for now; can't leave it hanging though because it would break things. ahh yes, the 2.6 oddities, so maybe you can just make it use python and M8/m8 in 2.6 (e.g. have a branch in the code), ugly but what the hey

cpcloud · 2014-05-05T02:42:46Z

@jreback yep took out the cython attempt ... now just squashing ... making sure that the cython path is hit for dates/timedeltas (but only count); current behavior is to raise ... i think there's an issue somewhere about groupby with deltas

jreback · 2014-05-05T11:54:07Z

pandas/src/generate_code.py

@@ -2156,11 +2238,14 @@ def generate_put_template(template, use_ints = True, use_floats = True):
        ('int32', 'int32_t', 'float64_t', 'np.float64'),
        ('int64', 'int64_t', 'float64_t', 'np.float64'),
        ]
+    object_list = [('object', 'object', 'float64_t', 'np.float64')]


should float really be here? (as that should be the use_floats)? no?

yes, float should be here. use_(float|int)s really only applies to the first two arguments, which refer to the name and the ctype, but the result of a groupby operation is always a float (even count, which is astyped to int64 as the last method call before returning to the user).

oh..right..np

jreback · 2014-05-05T11:54:48Z

vbench?

cpcloud · 2014-05-05T13:20:31Z

vbench coming soon

cpcloud · 2014-05-05T14:22:21Z

@jreback

-------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms]  |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_count_dates                    |  20.4120 | 15484.0871 |   0.0013 |
groupby_multi_count                          |  20.8774 | 15609.8307 |   0.0013 |
-------------------------------------------------------------------------------

758x speedup, think we're doing ok :)

cpcloud · 2014-05-05T14:24:52Z

ah object dtype is still slow, might be another one of my dumb typos

jreback · 2014-05-05T14:28:23Z

can you make the vbench a bit smaller (so the base case takes say 1/10 the time)? (shouldn't really affect the speedup much), which is nice!

cpcloud · 2014-05-05T14:28:40Z

sure thing

jreback · 2014-05-05T14:32:35Z

when you do v0.14.0, put in the perf section FYI (release notes goes in improvements)

cpcloud · 2014-05-05T14:32:50Z

will do thanks for the heads up

cpcloud · 2014-05-05T20:16:27Z

@jreback can we get this in soon as green? want to avoid doc merge conflicts.

jreback · 2014-05-05T21:07:06Z

sure go ahead on green

ENH: cythonize groupby.count

jreback · 2014-05-05T22:25:15Z

thinking about this

int types can never be nan at all
and only int64 needs comparisons against iNaT, so if it's smaller than int64 you don't even need a routine, you can just call size (e.g. it's just the size of the groups themselves, since they can't have NaNs)

object types need comparison against the actual NaT value (I think)

wonder if any of that even matters

cpcloud · 2014-05-05T22:31:05Z

Right, really only float, object, and int64 and those are really a view on dates or time deltas which are just ints really. Could have size as the fallback function instead of current version which computes sum of nonnull. Then if I take out the generated for ints it will raise and use the fallback.
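To illustrate the point about ints, here is a small NumPy sketch (not the pandas implementation) in which the per-group non-null count of an integer column equals the group size, while a float column with NaNs differs. `count_nonnull` and the sample arrays are made up for the example:

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1, 2])  # group id of each row
ints = np.array([5, 7, 1, 2, 3, 9], dtype="int32")
floats = np.array([5.0, np.nan, 1.0, np.nan, 3.0, 9.0])

# Group sizes: number of rows per group, independent of the values.
size = np.bincount(labels)


def count_nonnull(values, labels):
    """Per-group count of values that pass the self-equality (non-NaN) test."""
    mask = (values == values).astype("float64")  # 0.0 only where the value is NaN
    # bincount with weights sums in float, so cast back to int64 at the end
    return np.bincount(labels, weights=mask).astype("int64")


int_count = count_nonnull(ints, labels)
float_count = count_nonnull(floats, labels)
```

Since an integer column can never hold NaN, `int_count` matches `size` exactly, so the linear pass over the values is avoidable for those dtypes.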

jreback · 2014-05-05T22:32:46Z

yep
sorry should have mentioned before

cpcloud · 2014-05-05T22:33:56Z

I can do some microbenchmarks. If size is done with len then we go from linear in the number of rows to constant so def worth an investigation

cpcloud · 2014-05-05T22:34:26Z

Pandas isn't on iPhone so this may have to wait until tomorrow :/

jreback · 2014-05-05T22:37:37Z

hah

no size is already computed (when the groupby happens) so should be trivial

more worried about the issue of the type coercion in the current cython code vs iNaT which is a bigger value than say int8
though it prob does work

cpcloud · 2014-05-06T03:45:52Z

@jreback i have the pr with size do u want me to put it up?

jreback · 2014-05-06T09:21:16Z

sure

jreback · 2014-05-06T12:42:16Z

can you also post the vbench results at the top of the PR?

cpcloud · 2014-05-06T13:39:35Z

yep np

jreback · 2014-05-06T14:06:57Z

oh...I meant the cythonized count vs before you merged it in! (in this PR)

because when someone clicks on the release notes they go to this issue

cpcloud · 2014-05-06T14:07:16Z

yep ... that's coming running it now

jreback · 2014-05-08T15:45:29Z

adding to SO rep! http://stackoverflow.com/questions/23545834/speed-up-pandas-aggregation

cpcloud self-assigned this Apr 30, 2014

cpcloud added this to the 0.14.0 milestone Apr 30, 2014

cpcloud added Enhancement labels Apr 30, 2014

cpcloud changed the title ~~ENH: cython groupby.count~~ ENH: cythonize groupby.count May 1, 2014

jreback reviewed May 5, 2014

cpcloud added 4 commits May 5, 2014 16:10

ENH: cythonize groupby count

6c8b56a

VB: add a vbench for groupby count

d39dcf6

BUG: numeric_only must be False for count

5563ea5

CLN: EAFP for count

a83c186

DOC: add release and v0.14.0.txt notes

e82a65a

cpcloud added a commit that referenced this pull request May 5, 2014

Merge pull request #7016 from cpcloud/groupby-count-cython

eb3b677

ENH: cythonize groupby.count

cpcloud merged commit eb3b677 into pandas-dev:master May 5, 2014

cpcloud deleted the groupby-count-cython branch May 6, 2014 03:45
