Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

Closed
dchigarev opened this issue Apr 3, 2023 · 3 comments · Fixed by #6862 or #6896
Closed

Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925

dchigarev opened this issue Apr 3, 2023 · 3 comments · Fixed by #6862 or #6896
Assignees
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon pandas.groupby

Comments

@dchigarev
Copy link
Collaborator

dchigarev commented Apr 3, 2023

The reshuffling groupby implementation introduced in #5928 doesn't handle categoricals correctly:

import pandas

from modin.test.storage_formats.pandas.test_internals import construct_modin_df_by_scheme

df = pandas.DataFrame({"a": [1, 1, 1, 1, 2, 2, 2, 2], "b": [3, 4, 5, 6, 7, 8, 9, 10]}).astype({"a": "category"})

md_df = construct_modin_df_by_scheme(df, partitioning_scheme={"row_lengths": [4, 4], "column_widths": [2]})
assert md_df._query_compiler._modin_frame._partitions.shape == (2, 1)

import modin.config as cfg
cfg.ExperimentalGroupbyImpl.put(True)

print("modin:")
print(md_df.groupby("a").sum())
#     b
# a
# 1  18
# 2   0
# 1   0
# 2  34

print("pandas:")
print(df.groupby("a").sum())
#     b
# a
# 1  18
# 2  34

There are currently safeguards (1, 2) that hides the problem and falls back to a non-reshuffling implementation. These safeguards have to be removed in order to reproduce the problem.

@dchigarev dchigarev added bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests pandas.groupby labels Apr 3, 2023
@YarShev
Copy link
Collaborator

YarShev commented Dec 6, 2023

@dchigarev, should we close this issue?

@dchigarev
Copy link
Collaborator Author

Not yet, it's still relevant. Planing to submit a PR fixing this soon

@dchigarev, should we close this issue?

@dchigarev dchigarev self-assigned this Jan 10, 2024
@dchigarev dchigarev added P1 Important tasks that we should complete soon and removed P2 Minor bugs or low-priority feature requests labels Jan 11, 2024
@dchigarev
Copy link
Collaborator Author

The only challenge with enabling categoricals for range-partitioning implementation, is handling groupby(observed=False, ...) parameter correctly. The observed=False is a default value, so we can't simply fall back to slower implementations in this case, as it won't make the simple calls working:

df.groupby(categorical_by).apply(...) # <-- will engage the slow version
df.groupby(categorical_by, observed=True).apply(...) # <-- will engage the range-partitioning impl, however,
                                                     # I don't think anyone will go that far and change the default value. Users would
                                                     # probably give up trying this impl if it won't work with the initial call

What's observed=False?
This parameter includes missing categories into the result index, the missing values are then filled with the default values for this particular aggregation. Consider this example for more details:

# we have a categorical 'by_col', containing values {1, 2, 3}
>>> df
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
3      3  6  3
>>> df.dtypes
by_col    category
b            int64
c            int64
# then if we make the following row-slice, the 'by_col' is now containing values {1, 2}
>>> df.iloc[:3]
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
# however, the categorical dtype of the column, still contains {1, 2, 3}, meaning, that for this particular dataframe
# {3} is now considered a missing categorical value
>>> df.iloc[:3].dtypes["by_col"]
CategoricalDtype(categories=[1, 2, 3], ordered=False, categories_dtype=int64)
# if we then perform a groupby with `observed=False`, we'll see that the missing categorical value
# is actually appears in the result with a default value ('0')
>>> df.iloc[:3].groupby("by_col", observed=False).sum()
        b  c
by_col
1       3  6
2       9  9
3       0  0  <--- result for a missing categorical value
# in case `observed=True` was specified, the result contains only actual dataframe values,
# discarding missing categories
>>> df.iloc[:3].groupby("by_col", observed=True).sum()
        b  c
by_col
1       3  6
2       9  9
              <--- nothing here
# in case of a multi-column groupby, the resulted index will contain a cartesian
# product of (missing_categorical_values X values_of_another_by_column)
>>> df.iloc[:3].groupby(["by_col", "b"], observed=False).sum()
          c
by_col b
1      3  6
       4  0 <--- result for a missing categorical value
       5  0 <--- result for a missing categorical value
2      3  0 <--- result for a missing categorical value
       4  5
       5  4
3      3  0 <--- result for a missing categorical value
       4  0 <--- result for a missing categorical value
       5  0 <--- result for a missing categorical value

As for now, I've only been able to make this work on modin with as_index=True parameter (default one). I think I'll drop support for as_index=False for now, as it takes too much time for me to complete. I'm currently polishing the code and measuring performance of my approach, hope will post more updates soon.

dchigarev added a commit to dchigarev/modin that referenced this issue Jan 17, 2024
…artitioning impl

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
dchigarev added a commit to dchigarev/modin that referenced this issue Jan 29, 2024
…artitioning impl

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev pushed a commit that referenced this issue Jan 29, 2024
…mpl (#6862)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
dchigarev added a commit to dchigarev/modin that referenced this issue Jan 30, 2024
…odin-project#6875 bug

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev pushed a commit that referenced this issue Jan 30, 2024
…6896)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working P1 Important tasks that we should complete soon pandas.groupby
Projects
None yet
2 participants