
ENH: Add numba engine to groupby.transform #32854

Merged
jreback merged 42 commits into pandas-dev:master from mroeschke:groupby_transform on Apr 16, 2020

Conversation

mroeschke
Member

@mroeschke mroeschke commented Mar 20, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

In the same spirit as #31845, this adds engine and engine_kwargs arguments to groupby.transform (which was easier to tackle first than groupby.apply). The signature is the same as what was added to rolling.apply.

Constraints:

  • The user defined function's first two arguments must be named values and index, i.e. def f(values, index, ...), since we pass the group's values and the pandas index (as numpy arrays) into the UDF (see the sketch just below)
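A minimal usage sketch of the new arguments (the frame and column names are illustrative, and engine_kwargs here assumes the same nopython/nogil/parallel keys as rolling.apply; numba must be installed):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": np.arange(4.0)})

# The UDF's first two arguments must be named values and index.
def scale(values, index):
    return values * 5.0

result = df.groupby("key").transform(
    scale,
    engine="numba",
    engine_kwargs={"nopython": True, "nogil": False, "parallel": False},
)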

@pep8speaks

pep8speaks commented Mar 20, 2020

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-16 04:34:45 UTC

@mroeschke changed the title from "WIP: Add numba engine to groupby.transform" to "ENH: Add numba engine to groupby.transform" on Mar 27, 2020
@mroeschke added this to the 1.1 milestone on Mar 27, 2020
@mroeschke added the Apply (Apply, Aggregate, Transform) label on Apr 8, 2020
Member

@WillAyd WillAyd left a comment


Seems reasonable

@@ -56,3 +58,44 @@ def impl(data, *_args):
            return impl

    return numba_func


def split_for_numba(arg: FrameOrSeries):

Can you annotate return types here?

    return arg.to_numpy(), arg.index.to_numpy(), columns_as_array


def validate_udf(func: Callable, include_columns: bool = False):

Same comment

@jreback
Contributor

jreback commented Apr 8, 2020

@mroeschke this is orthogonal to the .agg one? IOW does ordering of merge matter?

@mroeschke
Member Author

@jreback this PR has numba utilities that the agg PR #33388 could use, so I'd prefer to merge this one first.

Contributor

@jreback jreback left a comment


looks pretty good.

I would introduce a Dispatcher concept here, with a Cython and a Numba Dispatcher.

this way we can move all of the messy logic to that class and just call generically.
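Roughly something like this (all class and method names here are hypothetical, just to illustrate the idea, not code from this PR or the pandas codebase):

from abc import ABC, abstractmethod


class GroupByDispatcher(ABC):
    # hypothetical base class: one subclass per execution engine
    @abstractmethod
    def evaluate(self, func, group, *args, **kwargs):
        ...


class CythonDispatcher(GroupByDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        # existing behavior: call the UDF on the pandas group object
        return func(group, *args, **kwargs)


class NumbaDispatcher(GroupByDispatcher):
    def evaluate(self, func, group, *args, **kwargs):
        import numba  # requires numba; kwargs are not forwarded in nopython mode

        jitted = numba.jit(func, nopython=True)
        # the numba path hands numpy arrays (values, index) to the UDF
        return jitted(group.to_numpy(), group.index.to_numpy(), *args)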

asv_bench/benchmarks/groupby.py
pandas/core/groupby/generic.py
        klass = type(self._selected_obj)

        results = []
        for name, group in self:
            object.__setattr__(group, "name", name)
            res = func(group, *args, **kwargs)
            if engine == "numba":

like to see this as

def _evaluate_udf
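Something along these lines, sketched from the loop shown above (the helper name is the suggestion; the exact signature is hypothetical):

def _evaluate_udf(func, group, name, engine, numba_func, *args, **kwargs):
    # hypothetical helper: evaluate the UDF for one group on the chosen engine,
    # so the calling loop stays engine-agnostic
    object.__setattr__(group, "name", name)
    if engine == "numba":
        # the numba path works on plain numpy arrays rather than the pandas object
        return numba_func(group.to_numpy(), group.index.to_numpy(), *args)
    # default path: call the original UDF on the pandas group object
    return func(group, *args, **kwargs)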

from pandas.compat._optional import import_optional_dependency


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:

can you add a doc-string
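For example, something like this (the wording and the exact error are a guess at the function's intent, not taken from the diff):

from typing import Dict, Optional


def check_kwargs_and_nopython(
    kwargs: Optional[Dict] = None, nopython: Optional[bool] = None
) -> None:
    """
    Validate that the UDF keyword arguments are compatible with nopython mode.

    Raises
    ------
    ValueError
        If keyword arguments are passed to the UDF while nopython=True,
        which numba does not support.
    """
    if kwargs and nopython:
        raise ValueError("numba does not support kwargs with nopython=True")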


def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
"""
Split pandas object into its components as numpy arrays for numba functions.

can you add Parameters / Returns section
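For example (sketched from the signature and return statement shown above; the final wording may differ):

    """
    Split pandas object into its components as numpy arrays for numba functions.

    Parameters
    ----------
    arg : Series or DataFrame
        The selected pandas object to decompose.

    Returns
    -------
    Tuple[np.ndarray, np.ndarray, np.ndarray]
        The values, the index, and the columns of ``arg`` as numpy arrays.
    """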



def validate_udf(func: Callable, include_columns: bool = False) -> None:
"""

same

pandas/core/groupby/generic.py
@jreback
Contributor

jreback commented Apr 8, 2020

for timings you can use this code

In [1]: import pandas as pd 
   ...: import numpy as np 
   ...: import time 
   ...: np.random.seed(0) 
   ...: ngroups = 1000 
   ...: ndays = 100000 
   ...: ncols = 100 
   ...: foo = pd.DataFrame( 
   ...:         index=pd.date_range(start=20000101,periods=ndays,freq="D"), 
   ...:         data = np.random.randn(ndays,ncols) 
   ...:         ) 
   ...: foo["group"] = np.random.choice(ngroups,ndays)                                                                                                               

@mroeschke
Member Author

Here are the timings with the above benchmark. I'll use a modified version of it for the ASV benchmark.

In [5]: import pandas as pd
   ...: import numpy as np
   ...: import time
   ...: np.random.seed(0)
   ...: ngroups = 1000
   ...: ndays = 100000
   ...: ncols = 100
   ...: foo = pd.DataFrame(
   ...:         index=pd.date_range(start=20000101,periods=ndays,freq="D"),
   ...:         data = np.random.randn(ndays,ncols)
   ...:         )
   ...: foo[-1] = np.random.choice(ngroups,ndays)

In [6]: def function(values, index, columns):
   ...:     return values * 5

In [7]: grouper = foo.groupby(-1)
# warm the cache
In [8]: grouper.transform(function, engine="numba")
Out[8]:
                                      0         1          2          3   ...        96        97         98        99
1970-01-01 00:00:00.020000101   8.820262  2.000786   4.893690  11.204466  ...  0.052500  8.929352   0.634560  2.009947
1970-01-02 00:00:00.020000101   9.415753 -6.738795  -6.352425   4.846984  ...  3.858953  4.117521  10.816180  6.682640
1970-01-03 00:00:00.020000101  -1.845909 -1.196896   5.498298   3.276319  ...  0.488625  2.914768  -1.997245  1.850279
1970-01-04 00:00:00.020000101  -6.532634  8.290653  -0.590820  -3.400891  ...  4.289620  5.705509   7.332894  4.262760
1970-01-05 00:00:00.020000101  -2.993270 -5.579485   3.833316   1.781464  ... -3.292765 -2.571170  -5.090209 -0.389274
...                                  ...       ...        ...        ...  ...       ...       ...        ...       ...
2243-10-12 00:00:00.020000101  -0.235677 -0.319252  10.614723  -1.871743  ...  3.524537  5.565481  -2.199342  3.493679
2243-10-13 00:00:00.020000101  -6.501683 -1.439189  -5.445545  -6.564634  ...  5.365536 -5.383367   1.147402  1.660815
2243-10-14 00:00:00.020000101  -1.894160  0.401290  -0.528430  -2.900666  ... -0.678287 -1.696137   0.421033  2.729988
2243-10-15 00:00:00.020000101   0.002150 -3.285543   1.835571   6.569671  ... -0.486593 -4.820628  -0.368741 -3.181568
2243-10-16 00:00:00.020000101  11.840371  1.316436  -5.017203   3.308539  ...  2.338079 -9.723574  -1.719926 -3.700948

[100000 rows x 100 columns]

In [9]: %timeit grouper.transform(function, engine="numba")
318 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: def function(values):
    ...:     return values * 5

In [11]: %timeit grouper.transform(function, engine="cython")
17.8 s ± 178 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

@mroeschke
Member Author

It's also not immediately obvious how to include a Dispatcher concept here, especially since the "cython" path has a lot of fallback behavior. I can look into introducing one in a follow-up.

@mroeschke
Member Author

mroeschke commented Apr 13, 2020

Also @jreback, is it really useful, when transforming a DataFrame, to pass the column names to the UDF? I see it as a potential pitfall, since column names are usually strings and numba functions do not support object dtypes.

EDIT: Decided not to pass the DataFrame columns as an array into the UDF when calling transform.

@jreback jreback merged commit b8b6471 into pandas-dev:master Apr 16, 2020
@mroeschke mroeschke deleted the groupby_transform branch April 16, 2020 19:57
CloseChoice pushed a commit to CloseChoice/pandas that referenced this pull request Apr 20, 2020
rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Labels: Apply (Apply, Aggregate, Transform), Enhancement, Groupby
4 participants