FEAT-#6883: Support grouping on a Series with range-partitioning impl #6888

Merged
merged 16 commits into from Feb 1, 2024

Collaborator

@dchigarev dchigarev commented Jan 26, 2024

What do these changes do?

This PR adds support for grouping by Modin Series using the range-partitioning implementation. It works as follows:

  1. Split the groupers into two lists: internal groupers (columns of the same frame) and external groupers (groupers passed as Series). Internal groupers are passed to the implementation as column names, and external groupers are passed as low-level PandasDataframes.
  2. The order in which groupers are passed to df.groupby matters, so during the split we also store the original grouper order as by_positions.
  3. Before range-partitioning is performed, external groupers are concatenated with the data itself and are then treated as columns of the same frame.
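
The three steps above can be sketched in plain pandas. This is only an illustration, not the Modin implementation: the helper names (`split_groupers`, `groupby_sum_with_external`), the encoding of the stored positions, and the temporary column names are all assumptions made for the example, and pandas objects stand in for Modin's low-level frames.

```python
import pandas as pd


def split_groupers(by):
    """Split `by` into internal (column labels) and external (Series) groupers.

    `positions` remembers, for every slot of the original `by` list, which
    list the grouper went to and at what index (hypothetical encoding).
    """
    internal, external, positions = [], [], []
    for grouper in by:
        if isinstance(grouper, pd.Series):
            positions.append(("external", len(external)))
            external.append(grouper)
        else:
            positions.append(("internal", len(internal)))
            internal.append(grouper)
    return internal, external, positions


def groupby_sum_with_external(df, by):
    internal, external, positions = split_groupers(by)
    # Attach the external groupers as temporary columns of the same frame.
    temp_names = [f"__external_by_{i}__" for i in range(len(external))]
    attached = pd.concat(
        [df] + [ser.rename(name) for ser, name in zip(external, temp_names)],
        axis=1,
    )
    # Rebuild the grouper list in the order the caller passed it.
    ordered = [
        internal[idx] if kind == "internal" else temp_names[idx]
        for kind, idx in positions
    ]
    result = attached.groupby(ordered).sum()
    # Restore the original names of the external groupers in the result index.
    result.index.names = [
        internal[idx] if kind == "internal" else external[idx].name
        for kind, idx in positions
    ]
    return result


df = pd.DataFrame({"a": [1, 1, 2, 2], "b": [10, 20, 30, 40]})
ext = pd.Series(["x", "y", "x", "y"], name="ext")

result = groupby_sum_with_external(df, [ext, "a"])
expected = df.groupby([ext, "a"]).sum()
```

In this sketch `result` should match what pandas produces when the Series is passed directly, with the index levels in the order the groupers were given.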
  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Support grouping on a modin.pandas.Series with range-partitioning groupby #6883
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

…itioning impl

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@@ -1921,26 +1922,79 @@ def test_to_pandas_convertion(kwargs):
[
[(False, "a"), (False, "b"), (False, "c")],
[(False, "a"), (False, "b")],
[(True, "a"), (True, "b"), (True, "c")],
[(True, "b"), (True, "a"), (True, "c")],
Collaborator Author

changed the original order to verify whether it's preserved in the result
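
A small pandas illustration of why grouper order is worth testing: swapping the keys changes both the order of the index levels and the sort order of the result. The frame and column values here are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 3], "c": [10, 20, 30]})

# Same keys, different order: the result index follows the order of `by`.
ab = df.groupby(["a", "b"]).sum()
ba = df.groupby(["b", "a"]).sum()

print(list(ab.index.names))  # level order follows ["a", "b"]
print(list(ba.index.names))  # level order follows ["b", "a"]
```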

@dchigarev dchigarev marked this pull request as ready for review January 31, 2024 18:28
@@ -227,19 +228,6 @@ def inter_df_math_helper_one_side(
getattr(modin_df_multi_level, op)(modin_df_multi_level, level=1)


def create_test_series(vals, sort=False, **kwargs):
Collaborator

We have a couple more create_test_series functions in tests. Should we get rid of duplication of those methods?

Collaborator Author

created a tracker for that #6903

@@ -3472,31 +3472,52 @@ def _groupby_internal_columns(self, by, drop):

Returns
-------
by : list of BaseQueryCompiler, column or index label, or Grouper
external_by : list of BaseQueryCompiler and arrays
Collaborator

Which arrays?

Collaborator Author

You can also pass arrays as the by argument of groupby. They are treated as external groupers and are returned as the first value:

>>> df.groupby(np.array([1, 1, 2, 2])).sum()
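
For reference, a self-contained version of that call in plain pandas. The frame contents are made up for the example; the array just needs one key per row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# The array is not a column of `df`; each element is the group key
# for the row at the same position.
keys = np.array([1, 1, 2, 2])
result = df.groupby(keys).sum()
```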

same_columns[col] += 1
col = (
(*col[:-1], f"{col[-1]}_{suffix}{duplicated_suffix}")
if isinstance(col, tuple)
Collaborator

What might be the length of this tuple? Why do we only add a suffix to the last element?

Collaborator Author

What might be the length of this tuple?

Tuple is a value from a MultiIndex, so it can be any length in theory.

Why do we only add a suffix to the last element?

It's enough and simpler to add/remove the suffix only at the last level of MultiIndex, so I did it that way
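
A minimal sketch of the idea, assuming string labels. The helper names `add_suffix` and `strip_suffix` are hypothetical, the branch for non-tuple labels is an assumption (it is not visible in the quoted diff), and the `duplicated_suffix` part of the original expression is left out for brevity.

```python
def add_suffix(col, suffix):
    """Append `_<suffix>` to a column label.

    For a MultiIndex label (a tuple of any length) only the last level
    is changed; the leading levels are kept as they are.
    """
    if isinstance(col, tuple):
        return (*col[:-1], f"{col[-1]}_{suffix}")
    return f"{col}_{suffix}"


def strip_suffix(col, suffix):
    """Undo `add_suffix` for string labels."""
    tail = f"_{suffix}"
    if isinstance(col, tuple):
        return (*col[:-1], col[-1].removesuffix(tail))
    return col.removesuffix(tail)


flat = add_suffix("a", "grouper")
nested = add_suffix(("level0", "level1", "a"), "grouper")
```

Touching only the last level keeps the add and remove steps symmetric regardless of how many levels the MultiIndex has.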

if isinstance(by, type(self)) and drop:
by = by.columns.tolist()
if groupby_kwargs.get("level") is not None:
raise NotImplementedError(
Collaborator

Just to be sure, these cases (NotImplementedError) will be executed on pure pandas?

Collaborator Author

no, they will fall back to either MapReduce or FullAxis implementation
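
One way such a fallback can be structured is sketched below. Every function name here is hypothetical, and catching the exception in a dispatcher is an assumption about the mechanism; the thread only states that unsupported cases end up in the MapReduce or FullAxis implementation.

```python
def groupby_range_partitioning(by, level=None):
    """Stand-in for the range-partitioning path (hypothetical)."""
    if level is not None:
        # Unsupported case: signal it instead of computing a wrong result.
        raise NotImplementedError("grouping on a level is not supported here")
    return ("range-partitioning", by)


def groupby_other_impl(by, level=None):
    """Stand-in for the implementation used as a fallback (hypothetical)."""
    return ("fallback", by, level)


def groupby(by, level=None):
    """Try the range-partitioning path first, fall back if it declines."""
    try:
        return groupby_range_partitioning(by, level=level)
    except NotImplementedError:
        return groupby_other_impl(by, level=level)


supported = groupby(["a"])
unsupported = groupby(["a"], level=0)
```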

for obj in external_by:
if not isinstance(obj, type(self)):
# we're only interested in categorical dtypes here, which can only
# appear in pandas objects
Collaborator

?

Suggested change
- # appear in pandas objects
+ # appear in Modin objects

Collaborator

Could you also add a test for this case?

Collaborator Author

We won't be able to hit this line with tests, since all non-Modin objects are filtered out here. I added this safeguard so that, once support for non-Modin groupers is enabled, we don't hit an error when trying to access the .dtypes field of non-Modin objects.
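
The shape of that safeguard, sketched with pandas objects standing in for Modin ones. The function name `has_categorical_grouper` is hypothetical and this is not the code from the PR; it only shows why objects without dtype metadata are skipped before it is read.

```python
import numpy as np
import pandas as pd


def has_categorical_grouper(external_by):
    """Return True if any frame-like external grouper has a categorical dtype."""
    for obj in external_by:
        if not isinstance(obj, (pd.Series, pd.DataFrame)):
            # Plain arrays and lists are skipped up front, so the dtype
            # lookup below only ever runs on frame-like objects.
            continue
        dtypes = obj.dtypes if isinstance(obj, pd.DataFrame) else [obj.dtype]
        if any(isinstance(dtype, pd.CategoricalDtype) for dtype in dtypes):
            return True
    return False


array_by = np.array([1, 1, 2])
categorical_by = pd.Series(["x", "y", "x"], dtype="category")

only_array = has_categorical_grouper([array_by])
with_categorical = has_categorical_grouper([array_by, categorical_by])
```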

Collaborator

@anmyachev anmyachev left a comment

LGTM!

@anmyachev anmyachev merged commit abbbd03 into modin-project:master Feb 1, 2024
36 checks passed
Successfully merging this pull request may close these issues.

Support grouping on a modin.pandas.Series with range-partitioning groupby
3 participants