BUG: groupby(sort=True) may produce unsorted results with range-partitioning implementation #6875

Open
3 tasks done
dchigarev opened this issue Jan 22, 2024 · 0 comments
Labels: bug 🦗 (Something isn't working), P1 (Important tasks that we should complete soon), pandas.groupby, partitions reshuffling 🔀 (Issues related to partitions reshuffling)

dchigarev commented Jan 22, 2024

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import numpy as np
import modin.pandas as pd
import pandas

from modin.pandas.test.utils import df_equals

import modin.config as cfg
cfg.IsDebug.put(True)
cfg.NPartitions.put(4)
cfg.RangePartitioningGroupby.put(True)

np.random.seed(214)

data = {
    "a": ["a", "b", "c", "d", "e", "b", "g", "a"] * 32,
    "b": [1, 2, 3, 4] * 64,
    "c": range(256),
    "d": range(256),
    "e": ["x", "y"] * 128,
}

filter = lambda row: (~row["a"].isin(["a", "e"]) & ~row["b"].isin([4]))

md_df, pd_df = pd.DataFrame(data), pandas.DataFrame(data)
md_df = md_df[filter]
pd_df = pd_df[filter]

md_res = md_df.groupby(["a", "e"]).sum()
pd_res = pd_df.groupby(["a", "e"]).sum()
df_equals(md_res, pd_res)
# MultiIndex level [0] values are different (100.0 %)
# [left]:  Index(['c', 'g', 'b'], dtype='object', name='a')
# [right]: Index(['b', 'c', 'g'], dtype='object', name='a')
# At positional index 0, first diff: c != b

Issue Description

Modin's result is not sorted by the group keys, even though groupby defaults to sort=True. This seems to affect only multi-column groupby.

Expected Behavior

The result should be sorted by the group keys, matching the pandas result.
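
The expectation above can be checked without df_equals. The following is a minimal sketch, not part of Modin: `is_lexsorted` is a hypothetical helper name, and the key tuples are derived by hand from the reproducer's data (after filtering, only the groups ("b", "y"), ("c", "x") and ("g", "x") remain). The unsorted ordering is illustrative only and is not claimed to be the exact row order Modin returns.

```python
from typing import Iterable, Tuple


def is_lexsorted(keys: Iterable[Tuple]) -> bool:
    """Return True if the group keys are in ascending lexicographic order.

    Hypothetical helper for illustration; each key is a tuple with one
    entry per groupby column, e.g. ("b", "y") for columns ["a", "e"].
    """
    keys = list(keys)
    # Compare each key with its successor; tuples compare element by element.
    return all(prev <= cur for prev, cur in zip(keys, keys[1:]))


# Keys that survive the filter in the reproducer, in the order sort=True
# is expected to produce.
expected_keys = [("b", "y"), ("c", "x"), ("g", "x")]

# An illustrative unsorted ordering of the same keys (assumption).
unsorted_keys = [("c", "x"), ("g", "x"), ("b", "y")]

print(is_lexsorted(expected_keys))
print(is_lexsorted(unsorted_keys))
```

With the reproducer's objects, the same check could be applied as `is_lexsorted(list(md_res.index))`, since iterating over a MultiIndex yields one tuple per row.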

Error Logs

Replace this line with the error backtrace (if applicable).

Installed Versions

Replace this line with the output of pd.show_versions()

@dchigarev added the bug 🦗, P1, pandas.groupby, and partitions reshuffling 🔀 labels Jan 22, 2024
@dchigarev dchigarev self-assigned this Jan 22, 2024
dchigarev added a commit to dchigarev/modin that referenced this issue Jan 30, 2024
…odin-project#6875 bug

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>
anmyachev pushed a commit that referenced this issue Jan 30, 2024
…6896)

Signed-off-by: Dmitry Chigarev <dmitry.chigarev@intel.com>