Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

dchigarev · 2023-04-03T15:37:13Z

The reshuffling groupby implementation introduced in #5928 doesn't handle categoricals correctly:

import pandas

from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

df = pandas.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2, 2], "b": [3, 4, 5, 6, 7, 8, 9, 10]}).astype({"a": "category"})

md_df = construct_modin_df_by_scheme(df, partitioning_scheme={"row_lengths": [4, 4], "column_widths": [2]})
assert md_df._query_compiler._modin_frame._partitions.shape == (2, 1)

import modin.config as cfg
cfg.ExperimentalGroupbyImpl.put(True)

print("modin:")
print(md_df.groupby("a").sum())
#     b
# a
# 1  18
# 2   0
# 1   0
# 2  34

print("pandas:")
print(df.groupby("a").sum())
#     b
# a
# 1  18
# 2  34

There are currently safeguards (1, 2) that hides the problem and falls back to a non-reshuffling implementation. These safeguards have to be removed in order to reproduce the problem.

YarShev · 2023-12-06T17:50:42Z

@dchigarev, should we close this issue?

dchigarev · 2023-12-06T18:13:15Z

Not yet, it's still relevant. Planing to submit a PR fixing this soon

@dchigarev, should we close this issue?

dchigarev · 2024-01-15T14:31:18Z

The only challenge with enabling categoricals for range-partitioning implementation, is handling groupby(observed=False, ...) parameter correctly. The observed=False is a default value, so we can't simply fall back to slower implementations in this case, as it won't make the simple calls working:

df.groupby(categorical_by).apply(...) # <-- will engage the slow version
df.groupby(categorical_by, observed=True).apply(...) # <-- will engage the range-partitioning impl, however,
                                                     # I don't think anyone will go that far and change the default value. Users would
                                                     # probably give up trying this impl if it won't work with the initial call

What's observed=False?
This parameter includes missing categories into the result index, the missing values are then filled with the default values for this particular aggregation. Consider this example for more details:

# we have a categorical 'by_col', containing values {1, 2, 3}
>>> df
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
3      3  6  3
>>> df.dtypes
by_col    category
b            int64
c            int64
# then if we make the following row-slice, the 'by_col' is now containing values {1, 2}
>>> df.iloc[:3]
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
# however, the categorical dtype of the column, still contains {1, 2, 3}, meaning, that for this particular dataframe
# {3} is now considered a missing categorical value
>>> df.iloc[:3].dtypes["by_col"]
CategoricalDtype(categories=[1, 2, 3], ordered=False, categories_dtype=int64)
# if we then perform a groupby with `observed=False`, we'll see that the missing categorical value
# is actually appears in the result with a default value ('0')
>>> df.iloc[:3].groupby("by_col", observed=False).sum()
        b  c
by_col
1       3  6
2       9  9
3       0  0  <--- result for a missing categorical value
# in case `observed=True` was specified, the result contains only actual dataframe values,
# discarding missing categories
>>> df.iloc[:3].groupby("by_col", observed=True).sum()
        b  c
by_col
1       3  6
2       9  9
              <--- nothing here
# in case of a multi-column groupby, the resulted index will contain a cartesian
# product of (missing_categorical_values X values_of_another_by_column)
>>> df.iloc[:3].groupby(["by_col", "b"], observed=False).sum()
          c
by_col b
1      3  6
       4  0 <--- result for a missing categorical value
       5  0 <--- result for a missing categorical value
2      3  0 <--- result for a missing categorical value
       4  5
       5  4
3      3  0 <--- result for a missing categorical value
       4  0 <--- result for a missing categorical value
       5  0 <--- result for a missing categorical value

As for now, I've only been able to make this work on modin with as_index=True parameter (default one). I think I'll drop support for as_index=False for now, as it takes too much time for me to complete. I'm currently polishing the code and measuring performance of my approach, hope will post more updates soon.

…artitioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

…mpl (#6862) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

…odin-project#6875 bug Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

…6896) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev added bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests pandas.groupby labels Apr 3, 2023

dchigarev self-assigned this Jan 10, 2024

dchigarev added P1 Important tasks that we should complete soon and removed P2 Minor bugs or low-priority feature requests labels Jan 11, 2024

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 17, 2024

FEAT-modin-project#5925: Enable grouping on categoricals with range-p…

fb62d14

…artitioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Jan 17, 2024

FEAT-#5925: Enable grouping on categoricals with range-partitioning impl #6862

Merged

7 tasks

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 29, 2024

FEAT-modin-project#5925: Enable grouping on categoricals with range-p…

4b835fa

…artitioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

anmyachev closed this as completed in #6862 Jan 29, 2024

anmyachev pushed a commit that referenced this issue Jan 29, 2024

FEAT-#5925: Enable grouping on categoricals with range-partitioning i…

46dc0a5

…mpl (#6862) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 30, 2024

FIX-modin-project#5925: Put a sorting-hack into groupby tests to hide m…

57ffaa7

…odin-project#6875 bug Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

dchigarev mentioned this issue Jan 30, 2024

FIX-#5925: Put a sorting-hack into groupby tests to hide #6875 bug #6896

Merged

7 tasks

anmyachev pushed a commit that referenced this issue Jan 30, 2024

FIX-#5925: Put a sorting-hack into groupby tests to hide #6875 bug (#…

c130e13

…6896) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

dchigarev commented Apr 3, 2023 •

edited

YarShev commented Dec 6, 2023

dchigarev commented Dec 6, 2023

dchigarev commented Jan 15, 2024

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

Comments

dchigarev commented Apr 3, 2023 • edited

YarShev commented Dec 6, 2023

dchigarev commented Dec 6, 2023

dchigarev commented Jan 15, 2024

dchigarev commented Apr 3, 2023 •

edited