FEAT-#6883: Support grouping on a Series with range-partitioning impl #6888

dchigarev · 2024-01-26T15:22:31Z

What do these changes do?

This PR enables grouping by Modin Series using the range-partitioning implementation. It is implemented as follows:

Split groupers into two lists: internal groupers (columns of the same frame) and external groupers (groupers that were passed as Series). Internal groupers are passed to the implementation as column names, and external groupers are passed as low-level PandasDataframes.
The order in which groupers are passed to df.groupby matters, so during the split we also store the original order of the groupers as by_positions.
Before performing range-partitioning, external groupers are concatenated with the data itself and are then treated as columns of the same frame.
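
The three steps above can be sketched with plain pandas. This is only an illustration of the idea, not the Modin implementation: the helper names `split_groupers` and `attach_external`, the `("external", i)` / `("internal", i)` encoding of `by_positions`, and the `__external_by_{i}__` column names are all assumptions made for this example.

```python
import pandas as pd


def split_groupers(by):
    """Split groupers into internal column names and external Series.

    Also records the original order so it can be restored later
    (hypothetical encoding, chosen for this sketch only).
    """
    internal_by, external_by, by_positions = [], [], []
    for grouper in by:
        if isinstance(grouper, pd.Series):
            by_positions.append(("external", len(external_by)))
            external_by.append(grouper)
        else:
            by_positions.append(("internal", len(internal_by)))
            internal_by.append(grouper)
    return internal_by, external_by, by_positions


def attach_external(df, internal_by, external_by, by_positions):
    """Concatenate external groupers to the data and rebuild the grouper order."""
    # Rename external groupers so they cannot clash with existing columns.
    renamed = [s.rename(f"__external_by_{i}__") for i, s in enumerate(external_by)]
    combined = pd.concat([df, *renamed], axis=1)
    ordered = [
        renamed[pos].name if kind == "external" else internal_by[pos]
        for kind, pos in by_positions
    ]
    return combined, ordered


df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})
ext = pd.Series(["x", "y", "x", "y"], name="key")

internal_by, external_by, by_positions = split_groupers([ext, "a"])
combined, ordered = attach_external(df, internal_by, external_by, by_positions)

# External groupers are now ordinary columns, grouped in the original order.
result = combined.groupby(ordered).sum()
```

After the concatenation, the groupby only has to deal with column names, which is what lets the external groupers reuse the same code path as internal ones.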

first commit message and PR title follow format outlined here

NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.
passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit with git commit -s
Resolves Support grouping on a modin.pandas.Series with range-partitioning groupby #6883
tests added and passing
module layout described at docs/development/architecture.rst is up-to-date

…itioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev · 2024-01-29T12:26:14Z

modin/pandas/test/test_groupby.py

@@ -1921,26 +1922,79 @@ def test_to_pandas_convertion(kwargs):
    [
        [(False, "a"), (False, "b"), (False, "c")],
        [(False, "a"), (False, "b")],
-        [(True, "a"), (True, "b"), (True, "c")],
+        [(True, "b"), (True, "a"), (True, "c")],


changed the original order to verify whether it's preserved in the result
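
For context, a small pandas sketch of why the order of groupers is worth testing. It uses plain column names rather than the internal/external grouper pairs from the test parameters, and the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "x"], "c": [10, 20, 30]})

ab = df.groupby(["a", "b"]).sum()
ba = df.groupby(["b", "a"]).sum()

# The order of the groupers determines the order of the index levels
# and therefore the order of the rows in the result.
print(list(ab.index.names))  # ['a', 'b']
print(list(ba.index.names))  # ['b', 'a']
print(ab["c"].tolist())      # [10, 20, 30]
print(ba["c"].tolist())      # [10, 30, 20]
```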

YarShev · 2024-02-01T09:29:19Z

modin/pandas/test/test_series.py

@@ -227,19 +228,6 @@ def inter_df_math_helper_one_side(
            getattr(modin_df_multi_level, op)(modin_df_multi_level, level=1)


-def create_test_series(vals, sort=False, **kwargs):


We have a couple more create_test_series functions in tests. Should we get rid of duplication of those methods?

created a tracker for that #6903

YarShev · 2024-02-01T09:38:13Z

modin/core/storage_formats/pandas/query_compiler.py

@@ -3472,31 +3472,52 @@ def _groupby_internal_columns(self, by, drop):

        Returns
        -------
-        by : list of BaseQueryCompiler, column or index label, or Grouper
+        external_by : list of BaseQueryCompiler and arrays


Which arrays?

You can also pass arrays as the by argument of groupby. They are treated as external groupers and are returned as the first value:

>>> df.groupby(np.array([1, 1, 2, 2])).sum()
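
A self-contained version of the one-liner above, using plain pandas for illustration; the frame and key values are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
keys = np.array([1, 1, 2, 2])

# The array is not a column of `df`, so it acts as an external grouper:
# rows 0-1 fall into group 1 and rows 2-3 into group 2.
result = df.groupby(keys).sum()
print(result["x"].tolist())  # [3, 7]
```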

YarShev · 2024-02-01T11:17:52Z

modin/core/dataframe/pandas/dataframe/dataframe.py

+                    same_columns[col] += 1
+                    col = (
+                        (*col[:-1], f"{col[-1]}_{suffix}{duplicated_suffix}")
+                        if isinstance(col, tuple)


What might be the length of this tuple? Why do we only add a suffix to the last element?

What might be the length of this tuple?

The tuple is a value from a MultiIndex, so in theory it can be any length.

Why do we only add a suffix to the last element?

It's sufficient, and simpler, to add/remove the suffix only at the last level of the MultiIndex, so I did it that way.
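
A minimal sketch of the last-level suffixing described above. The helper name `add_suffix` is hypothetical, and the handling of duplicated names (`same_columns`, `duplicated_suffix` in the diff) is left out.

```python
def add_suffix(col, suffix):
    """Append a suffix to a column label.

    For a MultiIndex label (a tuple of any length) only the last
    element is changed; the other levels are kept as they are.
    """
    if isinstance(col, tuple):
        return (*col[:-1], f"{col[-1]}_{suffix}")
    return f"{col}_{suffix}"


print(add_suffix("a", "grouper"))              # a_grouper
print(add_suffix(("a", "b"), "grouper"))       # ('a', 'b_grouper')
print(add_suffix(("a", "b", "c"), "grouper"))  # ('a', 'b', 'c_grouper')
```

Because only the last level changes, the suffix can later be stripped from that same position regardless of how many levels the label has.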

anmyachev · 2024-02-01T11:23:39Z

modin/core/storage_formats/pandas/query_compiler.py

-        if isinstance(by, type(self)) and drop:
-            by = by.columns.tolist()
+        if groupby_kwargs.get("level") is not None:
+            raise NotImplementedError(


Just to be sure, these cases (NotImplementedError) will be executed on pure pandas?

No, they will fall back to either the MapReduce or the FullAxis implementation.
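
The fallback pattern can be sketched roughly as follows. All function names here are hypothetical stand-ins and the return values are placeholders; the actual dispatch in Modin's query compiler may be structured differently.

```python
def groupby_range_partitioning(data, level=None):
    """Stand-in for the range-partitioning path (hypothetical)."""
    if level is not None:
        # Unsupported case: signal the caller to try another implementation.
        raise NotImplementedError("grouping on a level is not supported here")
    return "range-partitioning"


def groupby_other_impl(data, level=None):
    """Stand-in for the MapReduce / FullAxis implementations (hypothetical)."""
    return "fallback"


def groupby(data, level=None):
    """Try range-partitioning first, then fall back to another implementation."""
    try:
        return groupby_range_partitioning(data, level=level)
    except NotImplementedError:
        return groupby_other_impl(data, level=level)


print(groupby([1, 2, 3]))           # range-partitioning
print(groupby([1, 2, 3], level=0))  # fallback
```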

anmyachev · 2024-02-01T11:25:53Z

modin/core/storage_formats/pandas/query_compiler.py

+                for obj in external_by:
+                    if not isinstance(obj, type(self)):
+                        # we're only interested in categorical dtypes here, which can only
+                        # appear in pandas objects


?

Suggested change

-                        # appear in pandas objects
+                        # appear in Modin objects

Could you also add a test for this case?

We won't be able to hit this line with tests, as all non-Modin objects are filtered out here. I added this safeguard so that we don't hit an error in the future (when support for non-Modin groupers is enabled) when trying to access the .dtypes field of non-Modin objects.
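
A rough illustration of the safeguard, using pandas DataFrames as a stand-in for Modin query compilers. The function name `categorical_columns` is hypothetical and this is not the code from the PR.

```python
import numpy as np
import pandas as pd


def categorical_columns(external_by):
    """Collect categorical column names from external groupers.

    Objects that are not DataFrames (e.g. NumPy arrays) are skipped,
    since they have no `.dtypes` attribute to inspect.
    """
    result = []
    for obj in external_by:
        if not isinstance(obj, pd.DataFrame):
            continue  # safeguard: never touch `.dtypes` on such objects
        result.extend(
            col
            for col, dtype in obj.dtypes.items()
            if isinstance(dtype, pd.CategoricalDtype)
        )
    return result


groupers = [
    np.array([1, 1, 2]),
    pd.DataFrame({"k": pd.Categorical(["a", "b", "a"]), "v": [1, 2, 3]}),
]
print(categorical_columns(groupers))  # ['k']
```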

anmyachev

LGTM!

dchigarev added 5 commits January 25, 2024 19:05

FEAT-modin-project#6883: Support grouping on a Series with range-part…

a13bfee

…itioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

all works

187f2cb

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

fix style

55335ff

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

fix formatting

73e2cf1

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

Merge remote-tracking branch 'origin/master' into issue_5926

16fcff1

dchigarev commented Jan 29, 2024

dchigarev added 9 commits January 29, 2024 13:31

fix typing

994ccb5

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

Merge remote-tracking branch 'origin/master' into issue_5926

95157a7

Merge remote-tracking branch 'origin/master' into issue_5926

5cb2187

make external groupers work with categoricals

7abfe61

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

s

6634bf6

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

Merge remote-tracking branch 'origin/master' into issue_5926

eebe639

Merge remote-tracking branch 'origin/master' into issue_5926

d2dbf55

Merge remote-tracking branch 'origin/master' into issue_5926

3926adb

fix handling of external categorical columns

4d3e418

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev marked this pull request as ready for review January 31, 2024 18:28

dchigarev requested review from devin-petersohn, mvashishtha, RehanSD, YarShev, vnlitvinov, anmyachev and a team as code owners January 31, 2024 18:28

YarShev reviewed Feb 1, 2024

anmyachev reviewed Feb 1, 2024

dchigarev added 2 commits February 1, 2024 12:57

Merge remote-tracking branch 'origin/master' into issue_5926

423f886

apply suggestions

b5b68e0

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev requested review from YarShev and anmyachev February 1, 2024 12:59

YarShev approved these changes Feb 1, 2024

anmyachev approved these changes Feb 1, 2024

anmyachev merged commit abbbd03 into modin-project:master Feb 1, 2024
36 checks passed
