PERF: Allow jitting of groupby agg loop #35759

Merged
merged 17 commits from mroeschke:feature/parallelism_groupby_agg into pandas-dev:master on Aug 22, 2020

Conversation

@mroeschke (Member) commented Aug 17, 2020

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

New performance comparison for 10,000 groups

In [1]: df_g = pd.DataFrame({'a': range(10**4), 'b': range(10**4), 'c': range(10**4)})

In [2]: def f(x):
   ...:     return np.sum(x) + 1
   ...:

In [3]: df_g.groupby('a').agg(f)
Out[3]:
          b      c
a
0         1      1
1         2      2
2         3      3
3         4      4
4         5      5
...     ...    ...
9995   9996   9996
9996   9997   9997
9997   9998   9998
9998   9999   9999
9999  10000  10000

[10000 rows x 2 columns]

In [4]: %timeit df_g.groupby('a').agg(f)
1.2 s ± 70.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: def f(values, index):
   ...:     return np.sum(values) + 1
   ...:

In [6]: df_g.groupby('a').agg(f, engine='numba', engine_kwargs={'parallel': True})
Out[6]:
            b        c
a
0         1.0      1.0
1         2.0      2.0
2         3.0      3.0
3         4.0      4.0
4         5.0      5.0
...       ...      ...
9995   9996.0   9996.0
9996   9997.0   9997.0
9997   9998.0   9998.0
9998   9999.0   9999.0
9999  10000.0  10000.0

In [8]: %timeit df_g.groupby('a').agg(f, engine='numba', engine_kwargs={'parallel': True})
2.07 ms ± 64.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
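
Note: the timing above reflects a warm numba cache. The first call with engine='numba' pays a one-time JIT compilation cost; later calls reuse the jitted function cached in NUMBA_FUNC_CACHE. An illustration of the pattern (not measured output):

df_g.groupby('a').agg(f, engine='numba')  # first call: compiles f, noticeably slower
df_g.groupby('a').agg(f, engine='numba')  # later calls: reuse the cached jitted function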

@mroeschke added the Apply (Apply, Aggregate, Transform), Groupby, and Performance (Memory or execution speed) labels on Aug 17, 2020
@mroeschke mroeschke added this to the 1.2 milestone Aug 17, 2020
    sorted_data, sorted_index, starts, ends, len(group_keys), len(data.columns),
)
if cache_key not in NUMBA_FUNC_CACHE:
    NUMBA_FUNC_CACHE[cache_key] = numba_agg_func
Contributor:

Can't this be moved into the else?

Contributor:

Should you check that the cache is being used properly in a test?

Member Author:

I was thinking to evaluate the function with all arguments first, before putting it in the cache, so we're not caching a function that may fail.

I have existing tests that check for the presence of the function in the cache here:

assert (func_1, "groupby_agg") in NUMBA_FUNC_CACHE
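
For illustration, a minimal sketch of such a cache test (the import path reflects where NUMBA_FUNC_CACHE lived at the time; f_numba and df are illustrative names, not the real test's):

import numpy as np
import pandas as pd
from pandas.core.util.numba_ import NUMBA_FUNC_CACHE

def f_numba(values, index):
    return np.sum(values) + 1

df = pd.DataFrame({"a": [0, 0, 1], "b": [1.0, 2.0, 3.0]})
df.groupby("a").agg(f_numba, engine="numba")
# After the first call, the jitted user function is cached under
# the (function, operation-name) key used by the groupby agg path.
assert (f_numba, "groupby_agg") in NUMBA_FUNC_CACHE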

Contributor:

yeah would want to move this all to groupby/numba_.py rather than here (you can certainly cache after, but ideally all of the caching is not exposed here; I think we did this elsewhere IIRC)

@@ -230,6 +227,18 @@ def apply(self, func, *args, **kwargs):
)
def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs):

if maybe_use_numba(engine):
Contributor:

would not object to making a _aggregate_with_python_cython (where you put everything from L242 on down).

Member Author:

I could do this in a follow up refactor PR.

I guess I would need to make Series and DataFrame versions of this function, since the two paths look different.

)
return self.obj._constructor(result, index=index, columns=data.columns)

relabeling, func, columns, order = reconstruct_func(func, **kwargs)
Contributor:

same comment here

    self, func, *args, engine="cython", engine_kwargs=None, **kwargs
):
def _aggregate_with_numba(self, data, func, *args, engine_kwargs=None, **kwargs):
    group_keys = self.grouper._get_group_keys()
Contributor:

can you add a doc-string, and type as much as possible?

sorted_labels = algorithms.take_nd(labels, sorted_index, allow_fill=False)
sorted_data = data.take(sorted_index, axis=self.axis).to_numpy()
starts, ends = lib.generate_slices(sorted_labels, n_groups)
cache_key = (func, "groupby_agg")
Contributor:

is this consistent with other functions, e.g. transform and rolling and such (the cache keys)?

Member Author:

Yeah, these keys are all formatted similarly: (function, "string of the operation").
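
For illustration, both paths key the cache on the user function plus an operation name (the rolling key shown reflects my understanding of the analogous rolling path):

NUMBA_FUNC_CACHE[(func, "groupby_agg")]    # this PR's aggregation path
NUMBA_FUNC_CACHE[(func, "rolling_apply")]  # analogous rolling apply path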

    num_groups: int,
    num_columns: int,
) -> np.ndarray:
    result = np.empty((num_groups, num_columns))
Contributor:

do we need to type this?

Member Author:

I was thinking float (the default type) would be the safest here?

  • Mixed int & float frame = float
  • Float frame = float
  • Int frame = int

If there's a desire to infer a more appropriate type (int), I could include inference logic.
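
For context, a simplified sketch of the jitted aggregation loop under discussion (the PR's generated function differs in detail; numba_func here is an illustrative stand-in for the JIT-compiled user function). np.empty without an explicit dtype allocates float64, which is why integer frames come back as float:

import numba
import numpy as np

@numba.njit
def numba_func(values, index):
    # stands in for the JIT-compiled user function
    return np.sum(values) + 1

@numba.njit(nogil=True, parallel=True)
def group_agg(values, index, begin, end, num_groups, num_columns):
    # float64 output by default, per the dtype discussion above
    result = np.empty((num_groups, num_columns))
    # prange lets numba parallelize across groups when parallel=True
    for i in numba.prange(num_groups):
        group_index = index[begin[i]:end[i]]
        for j in range(num_columns):
            group = values[begin[i]:end[i], j]
            result[i, j] = numba_func(group, group_index)
    return result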


numba = import_optional_dependency("numba")

if parallel:
Contributor:

could make a helper function for this (as we likely need this elsewhere?)

Member Author:

This is only mirrored in one other place currently, for rolling. I can consolidate when the pattern grows.
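
A sketch of what a shared helper might look like once consolidated (pandas later grew a get_jit_arguments helper along these lines in pandas/core/util/numba_.py; the exact defaults shown are assumptions):

from typing import Dict, Optional, Tuple

def get_jit_arguments(
    engine_kwargs: Optional[Dict[str, bool]] = None,
) -> Tuple[bool, bool, bool]:
    """Return the (nopython, nogil, parallel) flags to pass to numba.jit."""
    if engine_kwargs is None:
        engine_kwargs = {}
    nopython = engine_kwargs.get("nopython", True)
    nogil = engine_kwargs.get("nogil", False)
    parallel = engine_kwargs.get("parallel", False)
    return nopython, nogil, parallel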

@@ -129,94 +127,3 @@ def impl(data, *_args):
        return impl

    return numba_func


def split_for_numba(arg: FrameOrSeries) -> Tuple[np.ndarray, np.ndarray]:
Contributor:

are these not used in the window functions?

Member Author:

Nope, only the groupby functions; that's why I moved them to the groupby/numba_.py file.
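
For reference, the moved helper is essentially about handing numba plain ndarrays, since jitted code cannot accept pandas objects. A sketch of its shape (the Union type stands in for pandas' internal FrameOrSeries alias):

import numpy as np
import pandas as pd
from typing import Tuple, Union

def split_for_numba(arg: Union[pd.Series, pd.DataFrame]) -> Tuple[np.ndarray, np.ndarray]:
    """Split a pandas object into (values, index) ndarrays for a numba function."""
    return arg.to_numpy(), arg.index.to_numpy()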

@pep8speaks commented Aug 20, 2020

Hello @mroeschke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-21 23:15:46 UTC

@mroeschke (Member Author):
@jreback all green

@jreback merged commit 068e654 into pandas-dev:master on Aug 22, 2020
@jreback (Contributor) commented Aug 22, 2020

thanks @mroeschke

as discussed, if you can follow up to consolidate and clean up

@mroeschke deleted the feature/parallelism_groupby_agg branch on August 22, 2020 03:31