BUG: groupby().apply() raise numpy ValueError when Series has multi index #7344

Open
3 tasks done
Pekton opened this issue Jul 16, 2024 · 1 comment
Labels
bug 🦗 Something isn't working Triage 🩹 Issues that need triage

Pekton commented Jul 16, 2024

Modin version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest released version of Modin.

  • I have confirmed this bug exists on the main branch of Modin. (In order to do this you can follow this guide.)

Reproducible Example

import modin.pandas as pd

data1 = pd.read_excel('abc.xlsx', header=[0, 1])  # two header rows (multi-level columns)

def anyFuncB(x):
    # do something
    return x

def anyFuncA(x):
    # The error is raised here; apply() returns a pd.Series
    return x.loc[data1[('col0', 'col1')].apply(anyFuncB)]

data = pd.read_excel('def.xlsx')
data.groupby(by='col0').apply(anyFuncA)
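
Since the Excel files are not attached, the label types involved can be illustrated with a small in-memory frame. This is only a minimal sketch using plain pandas rather than Modin; the frame contents and the explicit int64 index are assumptions standing in for abc.xlsx.

```python
import numpy as np
import pandas as pd  # plain pandas, used only to inspect label types

# Hypothetical stand-in for abc.xlsx read with header=[0, 1]:
# two header rows give two-level (MultiIndex) column labels.
columns = pd.MultiIndex.from_tuples([("col0", "col1"), ("col0", "col2")])
data1 = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=columns,
    index=pd.Index([0, 1], dtype="int64"),  # assumed integer row labels
)

series = data1[("col0", "col1")]
name = series.name             # the column label, a 2-tuple
first_label = series.index[0]  # the first row label, a NumPy integer

print(type(name).__name__, name)
print(type(first_label).__name__, first_label)
```

A tuple name and a NumPy integer row label are the two kinds of values that end up being compared in Series.apply.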

Issue Description

Calling dataframe0.apply(anyFunc0) on its own works fine.

The error appears when dataframe0.groupby().apply(anyFunc0) is used and anyFunc0 runs dataframe1[('col0', 'col1')].apply(anyFunc1) on another DataFrame, dataframe1, that has a multi-index (multi-level column headers). The failing line is:

File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
    if result.name == self.index[0]:

It raises ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(). Here result.name is a tuple with 2 items and self.index[0] is a numpy.int64, so the comparison yields an array of two boolean values rather than a single bool. My temporary fix is to add the following code:

elif return_type == "Series":
    try:
        if result.name == self.index[0]:
            result.name = None
    except ValueError:
        # The comparison returned an array, so reduce it with all()
        if (result.name == self.index[0]).all():
            result.name = None

Another solution could be to check whether result.name and self.index[0] are each a single value before comparing them.
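
The "single value or not" idea could look roughly like the sketch below. labels_equal is a hypothetical helper, not Modin code, and integer tuples are used so the behaviour does not depend on how NumPy compares strings with numbers.

```python
import numpy as np

# Comparing a tuple with a NumPy scalar is elementwise, so the result is an
# array and cannot be used directly in an `if`.
elementwise = (1, 2) == np.int64(1)
try:
    bool(elementwise)
    ambiguous = False
except ValueError:
    ambiguous = True


def labels_equal(name, first_label):
    """Hypothetical helper: compare two labels and always return a bool."""
    # A multi-level label (tuple) and a single-value label never match.
    if isinstance(name, tuple) != isinstance(first_label, tuple):
        return False
    # Same kind on both sides: tuple == tuple or scalar == scalar.
    return bool(name == first_label)
```

With a check like this, a tuple name is never considered equal to a scalar row label, which differs from reducing the elementwise result with all().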

Expected Behavior

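Series.apply() should not raise; the comparison between result.name and self.index[0] should be handled correctly when result.name is a tuple.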

Error Logs

Traceback (most recent call last):
  File "/home/ecommerce_production_classification/database.py", line 46, in <module>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/groupby.py", line 653, in apply
    if not isinstance(apply_res, Series) and apply_res.columns.equals(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/base.py", line 4294, in __getattribute__
    attr = super().__getattribute__(item)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/dataframe.py", line 315, in _get_columns
    return self._query_compiler.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 104, in <lambda>
    return lambda self: self._modin_frame.columns
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 727, in _get_columns
    columns, column_widths = self._columns_cache.get(return_lengths=True)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 194, in get
    index, self._lengths_cache = self._value()
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/metadata/index.py", line 106, in <lambda>
    return lambda: dataframe_obj._compute_axis_labels_and_lengths(axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 835, in _compute_axis_labels_and_lengths
    new_index, internal_idx = self._partition_mgr_cls.get_indices(axis, partitions)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1193, in get_indices
    new_idx = cls.get_objects_from_partitions(new_idx)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 1134, in get_objects_from_partitions
    return cls._execution_wrapper.materialize(
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/common/engine_wrapper.py", line 139, in materialize
    return ray.get(obj_id)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 2630, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/python3.10/lib/python3.10/site-packages/ray/_private/worker.py", line 863, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::remote_exec_func() (pid=22666, ip=172.29.158.228)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_deploy_ray_func() (pid=22664, ip=172.29.158.228)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py", line 335, in _deploy_ray_func
    result = deployer(axis, f_to_deploy, f_args, f_kwargs, *deploy_args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 462, in deploy_axis_func
    raise err
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/partitioning/axis_partition.py", line 457, in deploy_axis_func
    result = func(dataframe, *f_args, **f_kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 2078, in _tree_reduce_func
    series_result = func(df, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/dataframe/pandas/dataframe/dataframe.py", line 4261, in apply_func
    result = operator(df.groupby(by, **kwargs))
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3976, in <lambda>
    operator=lambda grp: agg_func(grp, *agg_args, **agg_kwargs),
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/core/storage_formats/pandas/query_compiler.py", line 3957, in agg_func
    result = agg_method(grp, original_agg_func, *args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1824, in apply
    result = self._python_apply_general(f, self._selected_obj)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/groupby.py", line 1885, in _python_apply_general
    values, mutated = self._grouper.apply_groupwise(f, data, self.axis)
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/groupby/ops.py", line 919, in apply_groupwise
    res = f(group)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/utils.py", line 765, in wrapper
    result = func(*args, **kwargs)
  File "/home/ecommerce_production_classification/database.py", line 46, in <lambda>
    print(data.loc[:5].groupby(by='company_id').apply(lambda x: detect_data(x)))
  File "/home/ecommerce_production_classification/database.py", line 21, in detect_data
    return classification(data, _rulesDF)
  File "/home/ecommerce_production_classification/categorization.py", line 227, in classification
    data = categorization(data, rules)
  File "/home/ecommerce_production_classification/categorization.py", line 209, in categorization
    return process(data, rules,  '分类')
  File "/home/ecommerce_production_classification/categorization.py", line 205, in process
    data[rules['赋值'].columns]=pd.DataFrame(data.apply(getCategories, axis=1).to_dict()).T
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 10374, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 916, in apply
    return self.apply_standard()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1063, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/usr/local/python3.10/lib/python3.10/site-packages/pandas/core/apply.py", line 1081, in apply_series_generator
    results[i] = self.func(v, *self.args, **self.kwargs)
  File "/home/ecommerce_production_classification/categorization.py", line 168, in getCategories
    _res = rules.loc[rules[('运算式','运算式')].apply(operationToBool)]
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/logging/logger_decorator.py", line 144, in run_and_log
    return obj(*args, **kwargs)
  File "/usr/local/python3.10/lib/python3.10/site-packages/modin/pandas/series.py", line 713, in apply
    if result.name == self.index[0]:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Installed Versions

UserWarning: Setuptools is replacing distutils.

INSTALLED VERSIONS

commit : c8bbca8
python : 3.10.12.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-1160.108.1.el7.x86_64
Version : #1 SMP Thu Jan 25 16:17:31 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

Modin dependencies

modin : 0.31.0
ray : 2.30.0
dask : None
distributed : None

pandas dependencies

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.4.6
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.6.0
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.4
pandas_gbq : None
pyarrow : 16.1.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.31
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@Pekton Pekton added bug 🦗 Something isn't working Triage 🩹 Issues that need triage labels Jul 16, 2024

Pekton commented Jul 16, 2024

I have revised the fix to:

if isinstance(is_same := (result.name == self.index[0]), np.ndarray):
    if is_same.all():
        result.name = None
elif is_same:
    result.name = None
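
The revised logic can be exercised outside Modin. The sketch below wraps it in a hypothetical function, name_after_apply, that returns the new name instead of mutating result; the function name and signature are assumptions made for illustration.

```python
import numpy as np


def name_after_apply(name, first_label):
    """Hypothetical wrapper: return None where the patch would clear the name."""
    is_same = name == first_label
    if isinstance(is_same, np.ndarray):
        # Elementwise result (e.g. tuple vs NumPy scalar): require all True.
        return None if is_same.all() else name
    # Scalar result (bool or np.bool_): use it directly.
    return None if is_same else name
```

Note that with all(), a tuple name such as (0, 0) is treated as equal to the row label 0, so the name would be cleared in that case. Whether that is the desired behaviour is a question for the maintainers.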
