Reshuffling groupby doesn't handle grouping on a categorical column correctly #5925
Comments
@dchigarev, should we close this issue?
Not yet, it's still relevant. Planning to submit a PR fixing this soon.
The only challenge with enabling categoricals for the range-partitioning implementation is handling the `observed` parameter:
df.groupby(categorical_by).apply(...)  # <-- will engage the slow version
df.groupby(categorical_by, observed=True).apply(...)  # <-- will engage the range-partitioning impl, however,
# I don't think anyone will go that far and change the default value. Users would
# probably give up trying this impl if it won't work with the initial call
What `observed` does:
# we have a categorical 'by_col', containing values {1, 2, 3}
>>> df
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
3      3  6  3
>>> df.dtypes
by_col    category
b            int64
c            int64
# if we then take the following row slice, 'by_col' contains only the values {1, 2}
>>> df.iloc[:3]
  by_col  b  c
0      1  3  6
1      2  4  5
2      2  5  4
# however, the categorical dtype of the column still contains {1, 2, 3}, meaning that for this particular dataframe
# {3} is now considered a missing categorical value
>>> df.iloc[:3].dtypes["by_col"]
CategoricalDtype(categories=[1, 2, 3], ordered=False, categories_dtype=int64)
# if we then perform a groupby with `observed=False`, we'll see that the missing categorical value
# actually appears in the result with a default value ('0')
>>> df.iloc[:3].groupby("by_col", observed=False).sum()
        b  c
by_col
1       3  6
2       9  9
3       0  0  <--- result for a missing categorical value
# in case `observed=True` was specified, the result contains only actual dataframe values,
# discarding missing categories
>>> df.iloc[:3].groupby("by_col", observed=True).sum()
        b  c
by_col
1       3  6
2       9  9
              <--- nothing here
# in case of a multi-column groupby, the resulting index will contain the Cartesian
# product of (missing_categorical_values X values_of_another_by_column)
>>> df.iloc[:3].groupby(["by_col", "b"], observed=False).sum()
          c
by_col b
1      3  6
       4  0  <--- result for a missing categorical value
       5  0  <--- result for a missing categorical value
2      3  0  <--- result for a missing categorical value
       4  5
       5  4
3      3  0  <--- result for a missing categorical value
       4  0  <--- result for a missing categorical value
       5  0  <--- result for a missing categorical value
As for now, I've only been able to make this work on modin with
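For anyone who wants to reproduce the `observed` behaviour described in this comment, here is a minimal sketch in plain pandas. The data matches the example; the final reindexing step is only an assumption about how an `observed=True` result could be padded out to the `observed=False` shape for a `sum` aggregation, and is not taken from the modin code base.

```python
import pandas as pd

# Same data as in the example: 'by_col' is categorical with categories {1, 2, 3}.
df = pd.DataFrame(
    {
        "by_col": pd.Categorical([1, 2, 2, 3], categories=[1, 2, 3]),
        "b": [3, 4, 5, 6],
        "c": [6, 5, 4, 3],
    }
)
part = df.iloc[:3]  # category 3 is no longer observed in this slice

# observed=False keeps a row for the unobserved category, observed=True drops it.
with_missing = part.groupby("by_col", observed=False).sum()
only_observed = part.groupby("by_col", observed=True).sum()

# Multi-column case: categories of 'by_col' crossed with the values of 'b'.
multi = part.groupby(["by_col", "b"], observed=False).sum()

# Hypothetical post-processing (an assumption, not modin's implementation):
# reindex by the full category list and fill with 0, the default for 'sum'.
categories = part["by_col"].cat.categories
full_index = pd.CategoricalIndex(categories, categories=categories, name="by_col")
emulated = only_observed.reindex(full_index, fill_value=0)

print(with_missing)
print(emulated)
```

Note that the fill value depends on the aggregation (0 is right for `sum`, but not for `mean` or `min`), which is part of why a general `.apply(...)` is harder to support this way.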
…artitioning impl Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
…mpl (#6862) Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
…odin-project#6875 bug Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
The reshuffling groupby implementation introduced in #5928 doesn't handle categoricals correctly:
There are currently safeguards (1, 2) that hide the problem and fall back to a non-reshuffling implementation. These safeguards have to be removed in order to reproduce the problem.
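To illustrate what a safeguard of this kind might look like, here is a rough sketch using only public pandas APIs. The helper names `has_categorical_by` and `choose_groupby_impl` and the returned labels are hypothetical, made up for this example; the actual checks in modin are the ones linked above and may differ.

```python
import pandas as pd


def has_categorical_by(df: pd.DataFrame, by) -> bool:
    """Hypothetical helper: True if any of the `by` columns has a categorical dtype."""
    by = [by] if isinstance(by, str) else list(by)
    return any(isinstance(df[col].dtype, pd.CategoricalDtype) for col in by)


def choose_groupby_impl(df: pd.DataFrame, by) -> str:
    """Illustrative only: pick a label for the implementation that would be used."""
    # Categorical keys are routed away from the reshuffling path in this sketch.
    return "fallback" if has_categorical_by(df, by) else "reshuffling"


df = pd.DataFrame(
    {
        "by_col": pd.Categorical([1, 2, 2, 3]),
        "b": [3, 4, 5, 6],
    }
)
print(choose_groupby_impl(df, "by_col"))  # categorical key
print(choose_groupby_impl(df, "b"))       # plain int64 key
```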