New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEAT-#6883: Support grouping on a Series with range-partitioning impl #6888
Conversation
…itioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@@ -1921,26 +1922,79 @@ def test_to_pandas_convertion(kwargs): | |||
[ | |||
[(False, "a"), (False, "b"), (False, "c")], | |||
[(False, "a"), (False, "b")], | |||
[(True, "a"), (True, "b"), (True, "c")], | |||
[(True, "b"), (True, "a"), (True, "c")], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the original order to verify whether it's preserved in the result
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
@@ -227,19 +228,6 @@ def inter_df_math_helper_one_side( | |||
getattr(modin_df_multi_level, op)(modin_df_multi_level, level=1) | |||
|
|||
|
|||
def create_test_series(vals, sort=False, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a couple more create_test_series
functions in tests. Should we get rid of duplication of those methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created a tracker for that #6903
@@ -3472,31 +3472,52 @@ def _groupby_internal_columns(self, by, drop): | |||
|
|||
Returns | |||
------- | |||
by : list of BaseQueryCompiler, column or index label, or Grouper | |||
external_by : list of BaseQueryCompiler and arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can also specify arrays as a by
argument in groupby, they are considered as external groupers and will be returned as a first value
>>> df.groupby(np.array([1, 1, 2, 2])).sum()
same_columns[col] += 1 | ||
col = ( | ||
(*col[:-1], f"{col[-1]}_{suffix}{duplicated_suffix}") | ||
if isinstance(col, tuple) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What might be the lenght of this tuple? Why do we only add a suffix to the last element?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What might be the lenght of this tuple?
Tuple is a value from a MultiIndex, so it can be any length in theory.
Why do we only add a suffix to the last element?
It's enough and simpler to add/remove the suffix only at the last level of MultiIndex, so I did it that way
if isinstance(by, type(self)) and drop: | ||
by = by.columns.tolist() | ||
if groupby_kwargs.get("level") is not None: | ||
raise NotImplementedError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be sure, these cases (NotImplementedError
) will be executed on pure pandas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for obj in external_by: | ||
if not isinstance(obj, type(self)): | ||
# we're only interested in categorical dtypes here, which can only | ||
# appear in pandas objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
# appear in pandas objects | |
# appear in Modin objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also add test for the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we won't be able to hit this line with tests, as all non-modin objects are filtered out here, I added this safeguard so we wouldn't hit the error in the future (when the support for non-modin groupers will be enabled) trying to access .dtypes
field of non-modin objects
Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
What do these changes do?
This PR enables support for grouping by modin Series'es using range-partitioning implementation. The way it's implemented is the following:
df.groupby
matters, so during the split we also store original groupers order asby_positions
.flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
git commit -s
modin.pandas.Series
with range-partitioning groupby #6883docs/development/architecture.rst
is up-to-date