ENH: Add numba engine to groupby.aggregate #33388

mroeschke · 2020-04-08T05:48:14Z

tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

In the same spirit as #31845 and #32854, this adds engine and engine_kwargs arguments to groupby.aggregate.

This PR has some functionality that is waiting to be merged in #32854

jbrockmendel · 2020-04-08T16:28:54Z

is it worth making a _numba_agg to call instead of adding a clause to python_agg_general? this code is difficult to debug, so to the extent things can be isolated that'd be nice

mroeschke · 2020-04-08T17:27:49Z

@jbrockmendel sure, I can create a separate _numba_agg function. I just noticed that python_agg_general contains essentially what I want _numba_agg to do, but it's definitely nicer if I don't have to traverse all these calls to get to python_agg_general.

mroeschke · 2020-04-13T02:42:41Z

So by creating a new _numba_agg function, I would be repeating a lot of the iterating over groups + result boxing functions present in python_agg_general and elsewhere. Is that okay?
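For readers unfamiliar with the pattern in question, "iterating over groups + result boxing" can be sketched in plain Python as below. This is only an illustration under assumptions: `numba_agg_sketch` is a hypothetical name, it works on lists rather than pandas objects, and it is not the code in python_agg_general or the eventual _numba_agg.

```python
from typing import Any, Callable, Dict, Hashable, List, Sequence


def numba_agg_sketch(
    keys: Sequence[Hashable],
    values: Sequence[Any],
    func: Callable[[List[Any], List[int]], Any],
) -> Dict[Hashable, Any]:
    """Hypothetical helper: apply func(values, index) to each group."""
    # Step 1: collect the positions that belong to each group key.
    groups: Dict[Hashable, List[int]] = {}
    for pos, key in enumerate(keys):
        groups.setdefault(key, []).append(pos)

    # Step 2: call the UDF once per group, then "box" the scalar results
    # into a single mapping keyed by group.
    result: Dict[Hashable, Any] = {}
    for key, positions in groups.items():
        group_values = [values[p] for p in positions]
        result[key] = func(group_values, positions)
    return result


totals = numba_agg_sketch(["a", "b", "a"], [1, 2, 3], lambda v, i: sum(v))
```

Both steps would be repeated in a standalone function, which is the duplication the question above refers to.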

pep8speaks · 2020-04-13T03:59:22Z

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-24 05:09:27 UTC

jbrockmendel · 2020-04-13T14:30:08Z

> So by creating a new _numba_agg function, I would be repeating a lot of the iterating over groups + result boxing functions present in python_agg_general and elsewhere. Is that okay?

I trust your judgement.

jbrockmendel · 2020-04-16T19:22:14Z

E TypeError: _percentile_dispatcher() missing 1 required positional argument: 'q'

jbrockmendel · 2020-04-23T16:40:59Z

LGTM

jreback

looks good, just a couple of comments.

also for followups

update a section in the groupby docs with an example of how to use numba (for rolling as well, if we don't have it).
update the docstring? also, IIRC this raises on failures rather than trying to fall back like non-numba .agg; let's note that in the docstring and in the groupby docs (as a warning, I think).
cleanups as noted

jreback · 2020-04-23T17:01:26Z

pandas/core/groupby/ops.py

+    ):
+
+        if engine == "numba":
+            nopython, nogil, parallel = get_jit_arguments(engine_kwargs)


can you condense these to a single function call (with whatever return args you need later on)?
you can certainly leave these functions individually in core.util.numba_; it's just that a single call would make the API simpler here. (also, if you can, do this simplification in the other places we call numba.)

can do this in a followup.

Sure can follow up with this cleanup
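A rough sketch of what such a single call could look like. Everything here is an assumption for illustration: `prepare_numba_call` is a hypothetical name, and the defaults and validation rules are guesses modeled on the error messages quoted in this review, not the actual helpers in core.util.numba_.

```python
import inspect
from typing import Any, Callable, Dict, Optional, Tuple


def prepare_numba_call(
    func: Callable[..., Any],
    engine_kwargs: Optional[Dict[str, bool]] = None,
    kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[bool, bool, bool]:
    """Hypothetical single entry point: validate the UDF, return the jit flags."""
    engine_kwargs = engine_kwargs or {}
    kwargs = kwargs or {}

    # Assumed defaults for this sketch.
    nopython = engine_kwargs.get("nopython", True)
    nogil = engine_kwargs.get("nogil", False)
    parallel = engine_kwargs.get("parallel", False)

    # Assumed rule: the UDF must take (values, index) as its first arguments.
    expected = ["values", "index"]
    params = list(inspect.signature(func).parameters)
    if params[:2] != expected:
        raise ValueError(
            f"The first 2 arguments to {func.__name__} must be {expected}"
        )

    # Assumed rule: extra keyword arguments are rejected in nopython mode.
    if kwargs and nopython:
        raise ValueError("numba does not support kwargs with nopython=True")

    return nopython, nogil, parallel


def udf(values, index):
    return sum(values)


nopython, nogil, parallel = prepare_numba_call(udf, {"nogil": True})
```

The call site then needs one line instead of several, while the individual checks can still live as separate functions behind it.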

jreback · 2020-04-23T17:02:06Z

pandas/core/util/numba_.py

+    -------
+    bool
+    """
+    return "The first" in err_message or "numba does not" in err_message


can you actually just check the class type here?

I raise ValueError from the numba utilities, so I need to distinguish between ValueErrors from another op and those from the numba utilities.

Should I make a NumbaUtilError(ValueError) exception? Then we would also need to expose it to users.

yes, let's have a separate exception (but in a followon)
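A minimal sketch of the separate-exception approach, assuming the `NumbaUtilError(ValueError)` name proposed above. The helper names `validate_udf_signature`, `is_numba_util_error`, and `classify` are hypothetical and exist only to show how a class check replaces the message matching.

```python
from typing import Callable, Sequence


class NumbaUtilError(ValueError):
    """Raised only by the numba utility checks (name proposed in this thread)."""


def validate_udf_signature(arg_names: Sequence[str]) -> None:
    # Hypothetical stand-in for a check done in the numba utilities.
    if list(arg_names[:2]) != ["values", "index"]:
        raise NumbaUtilError("The first 2 arguments must be ['values', 'index']")


def is_numba_util_error(err: Exception) -> bool:
    # Checking the class avoids depending on the wording of the message.
    return isinstance(err, NumbaUtilError)


def classify(op: Callable[[], object]) -> str:
    """Run op and report which kind of ValueError, if any, it raised."""
    try:
        op()
    except ValueError as err:
        return "numba-util" if is_numba_util_error(err) else "other"
    return "ok"


bad_udf = classify(lambda: validate_udf_signature(["x"]))
other_op = classify(lambda: int("not a number"))
```

Because the new class subclasses ValueError, existing `except ValueError` handlers keep working, and only the code that needs to tell the two cases apart has to change.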

mroeschke · 2020-04-24T05:22:35Z

Here are some example timeit benchmarks:

   In [1]: N = 10 ** 3

   In [2]: data = {0: [str(i) for i in range(100)] * N, 1: list(range(100)) * N}

   In [3]: df = pd.DataFrame(data, columns=[0, 1])

   In [4]: def f_numba(values, index):
      ...:     total = 0
      ...:     for i, value in enumerate(values):
      ...:         if i % 2:
      ...:             total += value + 5
      ...:         else:
      ...:             total += value * 2
      ...:     return total
      ...:

   In [5]: def f_cython(values):
      ...:     total = 0
      ...:     for i, value in enumerate(values):
      ...:         if i % 2:
      ...:             total += value + 5
      ...:         else:
      ...:             total += value * 2
      ...:     return total
      ...:

   In [6]: groupby = df.groupby(0)
   # Run the first time, compilation time will affect performance
   In [7]: %timeit -r 1 -n 1 groupby.aggregate(f_numba, engine='numba')  # noqa: E225
   2.14 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
   # Function is cached and performance will improve
   In [8]: %timeit groupby.aggregate(f_numba, engine='numba')
   4.93 ms ± 32.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

   In [9]: %timeit groupby.aggregate(f_cython, engine='cython')
   18.6 ms ± 84.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
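In these timings the cached numba call is roughly 3.8x faster than the cython-engine call (18.6 ms vs 4.93 ms). The two UDFs differ only in signature: the numba version takes an extra `index` argument, which it ignores here. A quick plain-Python check on a small made-up input (no pandas or numba involved) shows they compute the same value:

```python
def f_numba(values, index):
    # Signature used with engine='numba' in the benchmark: (values, index).
    total = 0
    for i, value in enumerate(values):
        if i % 2:
            total += value + 5
        else:
            total += value * 2
    return total


def f_cython(values):
    # Same body, single-argument signature used with engine='cython'.
    total = 0
    for i, value in enumerate(values):
        if i % 2:
            total += value + 5
        else:
            total += value * 2
    return total


sample_values = [0, 1, 2, 3]           # made-up data for illustration
sample_index = list(range(len(sample_values)))

numba_style = f_numba(sample_values, sample_index)
cython_style = f_cython(sample_values)
```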

mroeschke · 2020-04-24T05:24:25Z

Looks like the aggregate and transform docstrings are not rendering correctly:

https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.aggregate.html#pandas.core.groupby.GroupBy.aggregate
https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.transform.html#pandas.core.groupby.GroupBy.transform

I can try to address those in a follow-up (in addition to adding to the agg docstring), but updating groupby.rst is addressed in this PR.

jreback · 2020-04-26T00:02:31Z

pandas/core/util/numba_.py

+    -------
+    bool
+    """
+    return "The first" in err_message or "numba does not" in err_message


yes, let's have a separate exception (but in a followon)

jreback · 2020-04-26T00:03:32Z

thanks @mroeschke, very nice. a follow-on request plus any refactorings as discussed would be great.

Matt Roeschke added 3 commits April 5, 2020 21:59

Add engine keywords to aggregate signature

194bf7f

ENH: Add numba engine to groupby.transform

7d42379

include numba jitted func in agg routine

2124b81

mroeschke added the Apply (Apply, Aggregate, Transform), Enhancement, and Groupby labels Apr 8, 2020

mroeschke added this to the 1.1 milestone Apr 8, 2020

mroeschke mentioned this pull request Apr 8, 2020

ENH: Add numba engine to groupby.transform #32854

Merged

4 tasks

Matt Roeschke added 2 commits April 12, 2020 18:54

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

2321693

Add util functions

0f8a692

Add cache and more routines

1d09ce1

Matt Roeschke added 5 commits April 16, 2020 21:21

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

38e4485

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

79ee638

minimize whitespace diff

b43f183

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

632fb0c

fix split by numba call

f30ba2b

mroeschke mentioned this pull request Apr 18, 2020

REF: Make numba function cache globally accessible #33621

Merged

3 tasks

Matt Roeschke added 6 commits April 20, 2020 15:05

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

a8b7fdd

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

dadba23

Use global cache correctly

6e4cdd1

Raise for numba specific errors, add tests

7ffe304

Add benchmarks for new engine

9fc1068

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

599a640

mroeschke changed the title ~~WIP: ENH: Add numba engine to groupby.aggregate~~ ENH: Add numba engine to groupby.aggregate Apr 21, 2020

Matt Roeschke added 4 commits April 21, 2020 09:31

Add whatsnew entry

4d1cbd5

Fix benchmarks and lint

7554190

Reorder function arguments

0729230

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

4092882

jreback requested changes Apr 23, 2020

Matt Roeschke added 4 commits April 23, 2020 18:28

Merge remote-tracking branch 'upstream/master' into groupby_agg_numba

42b5171

Add documentation about groupby functions with numba access

7a9055c

Add warning about no fall back behavior

3004046

Add noqa to timeit

123e53a

jreback approved these changes Apr 26, 2020

jreback merged commit 0db2286 into pandas-dev:master Apr 26, 2020

mroeschke deleted the groupby_agg_numba branch April 26, 2020 04:21

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020

ENH: Add numba engine to groupby.aggregate (pandas-dev#33388)

2d84f49
