Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with complex groupby on copied columns #4604

Closed
Garra1980 opened this issue Jun 24, 2022 · 4 comments · Fixed by #4642
Closed

Bug with complex groupby on copied columns #4604

Garra1980 opened this issue Jun 24, 2022 · 4 comments · Fixed by #4642
Assignees
Labels
bug 🦗 Something isn't working

Comments

@Garra1980
Copy link
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Modin version (modin.__version__):
  • Python version:
  • Code we can use to reproduce:
import modin.pandas as mpd
import numpy as np

d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [10,9], 'col4': [11,12], 'col5': [13,15]}

mdf = mpd.DataFrame(data=d)
def col5(x):
    return np.sum(x)

def min5_agg(x):
    return np.sum(np.abs(x))

def col6(x):
    return np.min(x)

def col7(x):
    return np.min(x)

agg_func = {
    'col5': [col5, min5_agg], 
    'col6': col6, 
    'col7': col7
}

mdf['col6'] = mdf['col2']
mdf['col7'] = mdf['col3']

res_m=mdf.groupby(['col1']).agg(agg_func)

res_m.columns.droplevel(0)   <--- Value Error: Cannot remove 1 levels from an index with 1 levels: at least one level must be left.

Describe the problem

A bit weird bug found in one of use cases of TPC-AI benchmark
Modin dataframe after groupby is incorrect:

      (col5, col5)  (col5, min5_agg)  col6  col7
col1
1               13                13     3    10
2               15                15     4     9

Pandas gives

     col5          col6 col7
     col5 min5_agg col6 col7
col1
1      13       13    3   10
2      15       15    4    9

Weirdness is that bug is reproduced only in case

mdf['col6'] = mdf['col2']
mdf['col7'] = mdf['col3']

persists in the code. Without copied columns and aggregating on them no error occurs.

Source code / logs

@Garra1980
Copy link
Collaborator Author

Seems it is also important that aggregation functions are specified as it is done in the test

@pyrito
Copy link
Collaborator

pyrito commented Jun 27, 2022

@Garra1980 thanks for opening the issue! I tried running the snippet and was able to reproduce this behavior locally. Like you said, the strange coupling of the columns seems to happen if we set the columns the way you mentioned. If we do something like this things seem to be working:

import modin.pandas as mpd
import numpy as np

# Initialize col6 and col7 up here instead statically
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [10,9], 'col4': [11,12], 'col5': [13,15], 'col6': [17,38], 'col7': [71,32]}

mdf = mpd.DataFrame(data=d)
def col5(x):
    return np.sum(x)

def min5_agg(x):
    return np.sum(np.abs(x))

def col6(x):
    return np.min(x)

def col7(x):
    return np.min(x)

agg_func = {
    'col5': [col5, min5_agg], 
    'col6': col6, 
    'col7': col7
}

res_m=mdf.groupby(['col1']).agg(agg_func)

res_m.columns.droplevel(0)

The Modin team will take a look and find a solution to the issue! cc: @modin-project/modin-contributors @modin-project/modin-core

@Garra1980
Copy link
Collaborator Author

yeah, something like that

@anmyachev anmyachev added bug 🦗 Something isn't working labels Jul 1, 2022
@anmyachev anmyachev self-assigned this Jul 1, 2022
@anmyachev
Copy link
Collaborator

Problem: when used to aggregate a list of functions, the end result must be a multicolumn. When a Modin dataframe contains several partitions, a situation may arise when a multicolumn occurs in one partition and not in another. The current implementation does not handle this case.

Solution: we can find out if the end result should be a multicolumn before executing the aggregation functions, simply by looking at the type of the variable. If this is the case, then it is possible for columns to which only one function is applied, to wrap it in list type, which will lead to the creation of a multicolumn in all partitions.

Code example:

            func_dict = {col: try_get_str_func(fn) for col, fn in func_dict.items()}
            if any((isinstance(value, list) for value in func_dict.values())):
                # multicolumn case
                new_func_dict = {col: fn if isinstance(fn, list) else [fn] for col, fn in func_dict.items()}
                func_dict = new_func_dict

anmyachev added a commit to anmyachev/modin that referenced this issue Jul 4, 2022
…arise

Signed-off-by: Myachev <anatoly.myachev@intel.com>
anmyachev added a commit to anmyachev/modin that referenced this issue Jul 4, 2022
Signed-off-by: Myachev <anatoly.myachev@intel.com>
YarShev added a commit that referenced this issue Jul 5, 2022
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
YarShev added a commit that referenced this issue Sep 6, 2022
Co-authored-by: Yaroslav Igoshev <Poolliver868@mail.ru>
Signed-off-by: Myachev <anatoly.myachev@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants