Bug with complex groupby on copied columns #4604
Comments
It also seems to matter that the aggregation functions are specified the same way as in the test.
@Garra1980 thanks for opening the issue! I tried running the snippet and was able to reproduce this behavior locally. Like you said, the strange coupling of the columns seems to happen if we set the columns the way you mentioned. If we do something like this things seem to be working:

import modin.pandas as mpd
import numpy as np

# Initialize col6 and col7 up here instead statically
d = {'col1': [1, 2], 'col2': [3, 4], 'col3': [10, 9], 'col4': [11, 12], 'col5': [13, 15], 'col6': [17, 38], 'col7': [71, 32]}
mdf = mpd.DataFrame(data=d)

def col5(x):
    return np.sum(x)

def min5_agg(x):
    return np.sum(np.abs(x))

def col6(x):
    return np.min(x)

def col7(x):
    return np.min(x)

agg_func = {
    'col5': [col5, min5_agg],
    'col6': col6,
    'col7': col7
}
res_m = mdf.groupby(['col1']).agg(agg_func)
res_m.columns.droplevel(0)

The Modin team will take a look and find a solution to the issue! cc: @modin-project/modin-contributors @modin-project/modin-core
yeah, something like that
Problem: when a list of functions is passed for aggregation, the end result must be a multicolumn (multi-level columns). When a Modin dataframe contains several partitions, a multicolumn may occur in one partition and not in another. The current implementation does not handle this case.

Solution: we can find out whether the end result should be a multicolumn before executing the aggregation functions, simply by looking at the type of each value in the function dictionary. If it should be, then for columns to which only one function is applied, that function can be wrapped in a list.

Code example:

func_dict = {col: try_get_str_func(fn) for col, fn in func_dict.items()}
if any(isinstance(value, list) for value in func_dict.values()):
    # multicolumn case
    new_func_dict = {col: fn if isinstance(fn, list) else [fn] for col, fn in func_dict.items()}
    func_dict = new_func_dict
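To make the partition mismatch concrete, here is a minimal pure-Python sketch. It does not use Modin or pandas; `result_labels` is a hypothetical, simplified model of the labelling convention (tuples when any value is a list, plain names otherwise), and the split of columns into `part_a` and `part_b` is an assumption made only for illustration. `normalize_agg_funcs` mirrors the wrapping step from the snippet above.

```python
def func_name(fn):
    """Label for an aggregation given as a string or a callable."""
    return fn if isinstance(fn, str) else fn.__name__


def result_labels(func_dict):
    """Simplified model of the column labels a dict aggregation yields.

    If any value is a list, labels are (column, function_name) tuples
    (the multicolumn case); otherwise they are plain column names.
    """
    if any(isinstance(fn, list) for fn in func_dict.values()):
        return [
            (col, func_name(f))
            for col, fn in func_dict.items()
            for f in (fn if isinstance(fn, list) else [fn])
        ]
    return list(func_dict)


def normalize_agg_funcs(func_dict):
    """If any column maps to a list, wrap every single function in a list."""
    if any(isinstance(fn, list) for fn in func_dict.values()):
        return {
            col: fn if isinstance(fn, list) else [fn]
            for col, fn in func_dict.items()
        }
    return dict(func_dict)


agg_func = {"col5": ["sum", "max"], "col6": "min", "col7": "min"}

# Hypothetical split of the columns across two partitions.
part_a = {"col5": agg_func["col5"]}
part_b = {"col6": agg_func["col6"], "col7": agg_func["col7"]}

print(result_labels(part_a))  # [('col5', 'sum'), ('col5', 'max')]
print(result_labels(part_b))  # ['col6', 'col7']  (flat: shape mismatch)

# After normalizing the whole dict first, both partitions agree in shape.
normalized = normalize_agg_funcs(agg_func)
part_b_fixed = {col: normalized[col] for col in ("col6", "col7")}
print(result_labels(part_b_fixed))  # [('col6', 'min'), ('col7', 'min')]
```

The point of normalizing before the dictionary is split is that each partition then sees only lists, so no partition can produce flat labels on its own.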
…arise Signed-off-by: Myachev <anatoly.myachev@intel.com>
System information
Modin version (modin.__version__):
Describe the problem
A somewhat odd bug found in one of the use cases of the TPC-AI benchmark.
The Modin dataframe after groupby is incorrect:
Pandas gives
The odd part is that the bug reproduces only when
is present in the code. Without the copied columns and the aggregation on them, no error occurs.
Source code / logs