
ENH: Add numba engine to groupby.transform #32854

Merged
jreback merged 42 commits into pandas-dev:master from mroeschke:groupby_transform on Apr 16, 2020

Conversation

mroeschke
Member

@mroeschke mroeschke commented Mar 20, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

In the same spirit as #31845, this adds engine and engine_kwargs arguments to groupby.transform (which was easier to tackle first than groupby.apply). The signature is the same as what was added to rolling.apply.

Constraints:

  • The user defined function's first two arguments must be named values and index, i.e. def f(values, index, ...), since we pass the group's values and the pandas index (as numpy arrays) into the UDF (see the sketch just below)
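A minimal usage sketch of the new arguments (the frame and column names are illustrative, and engine_kwargs here assumes the same nopython/nogil/parallel keys as rolling.apply; numba must be installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": np.arange(4.0)})

# The UDF's first two arguments must be named values and index.
def scale(values, index):
    return values * 5.0

result = df.groupby("key").transform(
    scale,
    engine="numba",
    engine_kwargs={"nopython": True, "nogil": False, "parallel": False},
)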

@pep8speaks

pep8speaks commented Mar 20, 2020

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-16 04:34:45 UTC

@mroeschke changed the title from "WIP: Add numba engine to groupby.transform" to "ENH: Add numba engine to groupby.transform" on Mar 27, 2020
@mroeschke added this to the 1.1 milestone on Mar 27, 2020
@mroeschke added the Apply (Apply, Aggregate, Transform) label on Apr 8, 2020
Member

@WillAyd WillAyd left a comment


Seems reasonable

@@ -56,3 +58,44 @@ def impl(data, *_args):
            return impl

    return numba_func


def split_for_numba(arg: FrameOrSeries):

Can you annotate return types here?

    return arg.to_numpy(), arg.index.to_numpy(), columns_as_array


def validate_udf(func: Callable, include_columns: bool = False):

Same comment

@jreback
Contributor

jreback commented Apr 8, 2020

@mroeschke this is orthogonal to the .agg one? IOW does ordering of merge matter?

@mroeschke
Member Author

@jreback this PR has numba utilities that the agg PR #33388 could use, so I'd prefer to merge this one first.

Contributor

@jreback jreback left a comment


looks pretty good.

I would introduce a Dispatcher concept here, with a Cython and a Numba Dispatcher.

this way we can move all of the messy logic to that class and just call generically.
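Roughly something like this (all class and method names here are hypothetical, just to illustrate the idea, not code from this PR or the pandas codebase):

from abc import ABC, abstractmethod


class GroupByDispatcher(ABC):
    # hypothetical base class: one subclass per execution engine
    @abstractmethod
    def evaluate(self, func, group, *args, **kwargs):
        ...


class CythonDispatcher(GroupByDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        # existing behavior: call the UDF on the pandas group object
        return func(group, *args, **kwargs)


class NumbaDispatcher(GroupByDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        import numba  # requires numba; kwargs are not forwarded in nopython mode

        jitted = numba.jit(func, nopython=True)
        # the numba path hands numpy arrays (values, index) to the UDF
        return jitted(group.to_numpy(), group.index.to_numpy(), *args)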

asv_bench/benchmarks/groupby.py
pandas/core/groupby/generic.py
        klass = type(self._selected_obj)

        results = []
        for name, group in self:
            object.__setattr__(group, "name", name)
            res = func(group, *args, **kwargs)
            if engine == "numba":

like to see this as

def _evaluate_udf
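Something along these lines, sketched from the loop shown above (the helper name is the suggestion; the exact signature is hypothetical):

def _evaluate_udf(func, group, name, engine, numba_func, *args, **kwargs):
    # hypothetical helper: evaluate the UDF for one group on the chosen engine,
    # so the calling loop stays engine-agnostic
    object.__setattr__(group, "name", name)
    if engine == "numba":
        # the numba path works on plain numpy arrays rather than the pandas object
        return numba_func(group.to_numpy(), group.index.to_numpy(), *args)
    # default path: call the original UDF on the pandas group object
    return func(group, *args, **kwargs)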

from pandas.compat._optional import import_optional_dependency


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:

can you add a doc-string
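For example, something like this (the wording and the exact error are a guess at the function's intent, not taken from the diff):

from typing import Dict, Optional


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:
    """
    Validate that the UDF keyword arguments are compatible with nopython mode.

    Raises
    ------
    ValueError
        If keyword arguments are passed to the UDF while nopython=True,
        which numba does not support.
    """
    if kwargs and nopython:
        raise ValueError("numba does not support kwargs with nopython=True")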


def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Split pandas object into its components as numpy arrays for numba functions.

can you add Parameters / Returns section
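For example (sketched from the signature and return statement shown above; the final wording may differ):

    """
    Split pandas object into its components as numpy arrays for numba functions.

    Parameters
    ----------
    arg : Series or DataFrame
        The selected pandas object to decompose.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray, np.ndarray]
        The values, the index, and the columns of ``arg`` as numpy arrays.
    """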



def validate_udf(func: Callable, include_columns: bool = False) -> None:
"""

same

pandas/core/groupby/generic.py
@jreback
Contributor

jreback commented Apr 8, 2020

for timings you can use this code

In [1]: import pandas as pd 
   ...: import numpy as np 
   ...: import time 
   ...: np.random.seed(0) 
   ...: ngroups = 1000 
   ...: ndays = 100000 
   ...: ncols = 100 
   ...: foo = pd.DataFrame( 
   ...:         index=pd.date_range(start=20000101,periods=ndays,freq="D"), 
   ...:         data = np.random.randn(ndays,ncols) 
   ...:         ) 
   ...: foo["group"] = np.random.choice(ngroups,ndays)                                                                                                               

@mroeschke
Member Author

Here are the timings with the above benchmark. I'll use a modified version of it for the ASV benchmark.

In [5]: import pandas as pd
   ...: import numpy as np
   ...: import time
   ...: np.random.seed(0)
   ...: ngroups = 1000
   ...: ndays = 100000
   ...: ncols = 100
   ...: foo = pd.DataFrame(
   ...:         index=pd.date_range(start=20000101,periods=ndays,freq="D"),
   ...:         data = np.random.randn(ndays,ncols)
   ...:         )
   ...: foo[-1] = np.random.choice(ngroups,ndays)

In [6]: def function(values, index, columns):
   ...:     return values * 5

In [7]: grouper = foo.groupby(-1)
# warm the cache
In [8]: grouper.transform(function, engine="numba")
Out[8]:
                                      0         1          2          3   ...        96        97         98        99
1970-01-01 00:00:00.020000101   8.820262  2.000786   4.893690  11.204466  ...  0.052500  8.929352   0.634560  2.009947
1970-01-02 00:00:00.020000101   9.415753 -6.738795  -6.352425   4.846984  ...  3.858953  4.117521  10.816180  6.682640
1970-01-03 00:00:00.020000101  -1.845909 -1.196896   5.498298   3.276319  ...  0.488625  2.914768  -1.997245  1.850279
1970-01-04 00:00:00.020000101  -6.532634  8.290653  -0.590820  -3.400891  ...  4.289620  5.705509   7.332894  4.262760
1970-01-05 00:00:00.020000101  -2.993270 -5.579485   3.833316   1.781464  ... -3.292765 -2.571170  -5.090209 -0.389274
...                                  ...       ...        ...        ...  ...       ...       ...        ...       ...
2243-10-12 00:00:00.020000101  -0.235677 -0.319252  10.614723  -1.871743  ...  3.524537  5.565481  -2.199342  3.493679
2243-10-13 00:00:00.020000101  -6.501683 -1.439189  -5.445545  -6.564634  ...  5.365536 -5.383367   1.147402  1.660815
2243-10-14 00:00:00.020000101  -1.894160  0.401290  -0.528430  -2.900666  ... -0.678287 -1.696137   0.421033  2.729988
2243-10-15 00:00:00.020000101   0.002150 -3.285543   1.835571   6.569671  ... -0.486593 -4.820628  -0.368741 -3.181568
2243-10-16 00:00:00.020000101  11.840371  1.316436  -5.017203   3.308539  ...  2.338079 -9.723574  -1.719926 -3.700948

[100000 rows x 100 columns]

In [9]: %timeit grouper.transform(function, engine="numba")
318 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: def function(values):
    ...:     return values * 5

In [11]: %timeit grouper.transform(function, engine="cython")
17.8 s ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mroeschke
Member Author

It's also not immediately obvious how to include a Dispatcher concept here, especially since the "cython" path has a lot of fallback behavior. I can look into introducing one in a follow-up.

@mroeschke
Member Author

mroeschke commented Apr 13, 2020

Also @jreback, is it really useful, when transforming a DataFrame, to pass the column names to the UDF? I see it as a potential pitfall, since column names are usually strings and numba functions do not support object dtypes.

EDIT: Decided not to pass the DataFrame columns as an array into the UDF when calling transform.

@jreback jreback merged commit b8b6471 into pandas-dev:master Apr 16, 2020
@mroeschke mroeschke deleted the groupby_transform branch April 16, 2020 19:57
CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Labels: Apply (Apply, Aggregate, Transform), Enhancement, Groupby
4 participants