FEAT-#6803: Enable range-partitioning impl for 'groupby.apply()' by default #6804

dchigarev · 2023-12-06T13:13:41Z

What do these changes do?

The range-partitioning implementation is believed to always perform better for groupby.apply(), so this PR makes it the default.
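For readers unfamiliar with the technique, here is a minimal sketch of a range-partitioning groupby in plain pandas. It is not Modin's code: the helper name `range_partition_groupby_apply`, its parameters, and the pivot selection from sorted unique keys are assumptions made for illustration only. The idea is to split rows by key range so that every group ends up in exactly one partition, after which each partition can run an ordinary `groupby.apply()` on its own.

```python
import numpy as np
import pandas as pd


def range_partition_groupby_apply(df, by, value, func, npartitions=4):
    """Toy range-partitioning groupby.apply (hypothetical helper, not Modin's code)."""
    # Choose pivots from the sorted unique keys (a simplification for this sketch).
    keys = np.sort(df[by].unique())
    positions = np.linspace(0, len(keys) - 1, npartitions + 1).astype(int)[1:-1]
    pivots = np.unique(keys[positions])
    # Assign each row to a key range. Equal keys always get the same bin,
    # so no group is ever split across partitions.
    bins = np.searchsorted(pivots, df[by].to_numpy(), side="left")
    parts = [df[bins == i] for i in range(len(pivots) + 1)]
    # Each non-empty partition runs a plain pandas groupby.apply independently.
    results = [p.groupby(by)[value].apply(func) for p in parts if len(p)]
    return pd.concat(results).sort_index()


df = pd.DataFrame(
    {"key": [3, 1, 2, 3, 1, 5, 4, 2], "val": [1, 2, 3, 4, 5, 6, 7, 8]}
)
spread = lambda s: s.max() - s.min()
result = range_partition_groupby_apply(df, "key", "val", spread, npartitions=3)
```

Because groups never straddle partitions, no cross-partition merge of partial results is needed, which is why this layout suits arbitrary functions passed to `.apply()`.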

- first commit message and PR title follow format outlined here
  NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- Resolves Enable range-partitioning groupby for groupby.apply() automatically #6803
- tests are passing
- module layout described at docs/development/architecture.rst is up-to-date

…apply()' by default Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2023-12-06T16:28:48Z

modin/core/dataframe/pandas/dataframe/dataframe.py

@@ -3735,14 +3735,6 @@ def groupby(
        skip_on_aligning_flag = "__skip_me_on_aligning__"

        def apply_func(df):  # pragma: no cover
-            if any(


Categoricals are now always caught at the query compiler level.

dchigarev · 2023-12-06T16:33:50Z

modin/core/storage_formats/pandas/query_compiler.py

+            by_dtypes = self._modin_frame._dtypes.lazy_get(by).get()
+        else:
+            by_dtypes = self.dtypes[by]
+        if any(isinstance(dtype, pandas.CategoricalDtype) for dtype in by_dtypes):


We're now materializing the 'by' dtypes so that unsupported cases are caught consistently and we fall back to the older implementation. Previously, if the dtypes weren't materialized, an exception was raised in the kernel, which caused the groupby to fail. Since we're now moving this implementation out of experimental mode, we want more stability here, in the sense of falling back to an implementation that has more coverage.
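The check described above can be sketched with plain pandas. This is only an illustration of the "inspect dtypes first, then decide" pattern: the helper `pick_groupby_impl` and its return labels are hypothetical and are not part of Modin.

```python
import pandas as pd


def pick_groupby_impl(df, by):
    """Report which implementation a list of 'by' columns would get (hypothetical)."""
    # Look up the dtypes of the grouping columns up front ...
    by_dtypes = df.dtypes[by]
    # ... so an unsupported key is detected before any kernel runs,
    # instead of surfacing later as an exception inside the kernel.
    if any(isinstance(dtype, pd.CategoricalDtype) for dtype in by_dtypes):
        return "fallback"
    return "range-partitioning"


frame = pd.DataFrame({"a": [1, 2], "b": pd.Categorical(["x", "y"])})
numeric_choice = pick_groupby_impl(frame, ["a"])
categorical_choice = pick_groupby_impl(frame, ["a", "b"])
```

Deciding at this level means the caller always gets a result from one implementation or the other, rather than a failure partway through execution.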

anmyachev

LGTM!

YarShev · 2023-12-06T17:48:05Z

modin/core/storage_formats/pandas/query_compiler.py

+        # 'group_wise' means 'groupby.apply()'. We're certain that range-partitioning groupby
+        # always works better for '.apply()', so we're using it regardless of the 'ExperimentalGroupbyImpl'
+        # value
+        if how == "group_wise" or ExperimentalGroupbyImpl.get():


Let's rename ExperimentalGroupbyImpl to RangePartitioningGroupby in a separate PR as we discussed offline.
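As a reading aid for the condition under review, the dispatch can be reduced to a small standalone function. The function name, the boolean parameter standing in for `ExperimentalGroupbyImpl.get()`, and the `"axis_wise"` value used as an example of a non-`"group_wise"` mode are assumptions for this sketch, not Modin's API.

```python
def uses_range_partitioning(how, experimental_groupby_enabled):
    """Mirror of the reviewed condition, with the config flag passed in explicitly."""
    # 'group_wise' corresponds to groupby.apply(), which takes the
    # range-partitioning path regardless of the flag.
    return how == "group_wise" or experimental_groupby_enabled


apply_flag_off = uses_range_partitioning("group_wise", False)
other_flag_off = uses_range_partitioning("axis_wise", False)
other_flag_on = uses_range_partitioning("axis_wise", True)
```

In other words, the flag now only matters for aggregations other than `.apply()`.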

FEAT-modin-project#6803: Enable range-partitioning impl for 'groupby.…

c10537e

…apply()' by default Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev marked this pull request as ready for review December 6, 2023 16:34

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners December 6, 2023 16:34

anmyachev approved these changes Dec 6, 2023

YarShev reviewed Dec 6, 2023

YarShev approved these changes Dec 6, 2023

YarShev merged commit a405217 into modin-project:master Dec 6, 2023
38 checks passed

dchigarev mentioned this pull request Feb 19, 2024

Enable range-partitioning implementation for groupby().rolling() by default #6942

Closed
