[BUG] Mars dataframe sort_values with multiple ascendings returns incorrect result on pandas<1.4 #3215

Closed
fyrestone opened this issue Aug 9, 2022 · 0 comments · Fixed by #3234
Assignees: fyrestone
Labels: type: bug (Something isn't working)

fyrestone commented Aug 9, 2022

Describe the bug
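On pandas < 1.4, sort_values on a Mars DataFrame with a per-column ascending list (for example ascending=[False, True]) returns an incorrectly ordered result on the Mars backend and raises "ValueError: assignment destination is read-only" on the Ray DAG backend.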

Example

import numpy as np
import pandas as pd
import mars
import mars.dataframe as md


mars.new_session()
ns = np.random.RandomState(0)
df = pd.DataFrame(ns.rand(100, 2), columns=["a" + str(i) for i in range(2)])
mdf = md.DataFrame(df, chunk_size=10)
result = (
    mdf.sort_values(["a0", "a1"], ascending=[False, True])
    .execute()
    .fetch()
)
expected = df.sort_values(
    ["a0", "a1"], ascending=[False, True]
)
pd.testing.assert_frame_equal(result, expected)

Mars backend

Traceback (most recent call last):
  File "/home/admin/Work/mars/t1.py", line 19, in <module>
    pd.testing.assert_frame_equal(result, expected)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/_testing/asserters.py", line 1257, in assert_frame_equal
    assert_index_equal(
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/_testing/asserters.py", line 412, in assert_index_equal
    _testing.assert_almost_equal(
  File "pandas/_libs/testing.pyx", line 53, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 168, in pandas._libs.testing.assert_almost_equal
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/_testing/asserters.py", line 665, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: DataFrame.index are different
DataFrame.index values are different (91.0 %)
[left]:  Int64Index([26, 10, 36, 35, 82,  4, 61, 92, 34, 33,  5,  9, 97, 37, 94, 60, 21,
            22, 64, 31, 28, 69, 65,  1, 91, 68, 25, 67,  6,  0, 93, 29,  3, 62,
             2, 95, 20, 24, 39, 38, 98, 23, 27, 32, 96, 90, 30, 66,  7, 99,  8,
            63, 19, 70, 59, 58, 49, 57, 78, 72, 87, 51, 84, 81, 74, 89, 56, 80,
            50, 18, 53, 48, 44, 42, 43, 14, 85, 11, 16, 55, 71, 79, 88, 45, 40,
            47, 15, 52, 54, 86, 76, 75, 13, 46, 77, 12, 73, 41, 17, 83],
           dtype='int64')
[right]: Int64Index([26, 10, 36, 35, 82,  4, 61, 19, 92, 70, 59, 58, 34, 49, 33, 57, 78,
            72, 87,  5,  9, 97, 37, 51, 94, 84, 60, 81, 74, 89, 56, 21, 80, 50,
            22, 64, 31, 28, 69, 65, 18,  1, 53, 48, 91, 44, 68, 25, 67,  6, 42,
             0, 93, 43, 14, 85, 29, 11, 16, 55,  3, 71, 62,  2, 79, 95, 20, 88,
            45, 40, 24, 39, 47, 38, 15, 52, 98, 54, 23, 27, 86, 32, 96, 90, 76,
            30, 75, 13, 66, 46, 77, 12, 73,  7, 41, 99,  8, 63, 17, 83],

Ray DAG backend

Traceback (most recent call last):
  File "/home/admin/Work/mars/t1.py", line 12, in <module>
    mdf.sort_values(["a0", "a1"], ascending=[False, True])
  File "/home/admin/Work/mars/mars/core/entity/tileables.py", line 462, in execute
    result = self.data.execute(session=session, **kw)
  File "/home/admin/Work/mars/mars/core/entity/executable.py", line 144, in execute
    return execute(self, session=session, **kw)
  File "/home/admin/Work/mars/mars/deploy/oscar/session.py", line 1890, in execute
    return session.execute(
  File "/home/admin/Work/mars/mars/deploy/oscar/session.py", line 1684, in execute
    execution_info: ExecutionInfo = fut.result(
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/concurrent/futures/_base.py", line 444, in result
    return self.__get_result()
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/admin/Work/mars/mars/deploy/oscar/session.py", line 1870, in _execute
    await execution_info
  File "/home/admin/Work/mars/mars/deploy/oscar/session.py", line 105, in wait
    return await self._aio_task
  File "/home/admin/Work/mars/mars/deploy/oscar/session.py", line 953, in _run_in_background
    raise task_result.error.with_traceback(task_result.traceback)
  File "/home/admin/Work/mars/mars/services/task/supervisor/processor.py", line 369, in run
    await self._process_stage_chunk_graph(*stage_args)
  File "/home/admin/Work/mars/mars/services/task/supervisor/processor.py", line 247, in _process_stage_chunk_graph
    chunk_to_result = await self._executor.execute_subtask_graph(
  File "/home/admin/Work/mars/mars/services/task/execution/ray/executor.py", line 551, in execute_subtask_graph
    meta_list = await asyncio.gather(*output_meta_object_refs)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/asyncio/tasks.py", line 695, in _wrap_awaitable
    return (yield from awaitable.__await__())
ray.exceptions.RayTaskError(ValueError): ray::execute_subtask() (pid=68092, ip=127.0.0.1)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::execute_subtask() (pid=68097, ip=127.0.0.1)
  File "/home/admin/Work/mars/mars/services/task/execution/ray/executor.py", line 185, in execute_subtask
    execute(context, chunk.op)
  File "/home/admin/Work/mars/mars/core/operand/core.py", line 491, in execute
    result = executor(results, op)
  File "/home/admin/Work/mars/mars/dataframe/sort/psrs.py", line 713, in execute
    cls._execute_map(ctx, op)
  File "/home/admin/Work/mars/mars/dataframe/sort/psrs.py", line 668, in _execute_map
    cls._execute_dataframe_map(ctx, op)
  File "/home/admin/Work/mars/mars/dataframe/sort/psrs.py", line 602, in _execute_dataframe_map
    poses = cls._calc_poses(a[by], pivots, op.ascending)
  File "/home/admin/Work/mars/mars/dataframe/sort/psrs.py", line 559, in _calc_poses
    pivots[col] = -pivots[col]
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/frame.py", line 3612, in __setitem__
    self._set_item(key, value)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/frame.py", line 3797, in _set_item
    self._set_item_mgr(key, value)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/frame.py", line 3756, in _set_item_mgr
    self._iset_item_mgr(loc, value)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/frame.py", line 3746, in _iset_item_mgr
    self._mgr.iset(loc, value)
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1078, in iset
    blk.set_inplace(blk_locs, value_getitem(val_locs))
  File "/home/admin/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 360, in set_inplace
    self.values[locs] = values
ValueError: assignment destination is read-only

The problem is the line pivots[col] = -pivots[col] in _calc_poses:

  • on the Mars backend: the assignment raises no exception, but pivots is not actually updated, so the subsequent p_records = pivots.to_records(index=False) produces incorrect p_records.
  • on the Ray backend: Ray marks NumPy arrays returned from the Ray object store as immutable, so this line raises an explicit exception.
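One way to sidestep both failure modes is to avoid assigning into pivots at all and build the negated columns in a new frame. The following is a minimal sketch under that assumption, not necessarily what the eventual fix does; negate_descending is a hypothetical helper name, and the read-only array only simulates data fetched from an object store.

```python
import numpy as np
import pandas as pd


def negate_descending(pivots, by, ascending):
    """Return a new frame in which descending sort columns are negated.

    Hypothetical helper: it builds fresh columns instead of assigning into
    `pivots`, so it never writes to the (possibly read-only) input buffer.
    """
    flipped = {
        col: (pivots[col] if asc else -pivots[col])
        for col, asc in zip(by, ascending)
    }
    return pd.DataFrame(flipped, index=pivots.index)


# Simulate a buffer handed back as immutable (an assumption standing in
# for an array fetched from an object store).
values = np.array([[0.9, 0.1], [0.5, 0.7], [0.2, 0.3]])
values.setflags(write=False)

# Writing into the read-only NumPy array fails loudly.
try:
    values[:, 0] = -values[:, 0]
    write_error = None
except ValueError as exc:
    write_error = str(exc)

# The non-mutating variant works regardless of the writeable flag.
pivots = pd.DataFrame(values, columns=["a0", "a1"])
flipped = negate_descending(pivots, ["a0", "a1"], [False, True])
p_records = flipped.to_records(index=False)
```

Because the input frame is never written to, the result does not depend on whether the underlying block is writeable or on how a given pandas version handles in-place column assignment.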

Related issues:
ray-project/ray#369
pandas-dev/pandas#43406

The underlying pandas bug has been fixed in pandas >= 1.4.
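To check whether a given environment falls in the affected range, a small guard along these lines may help, for instance to decide whether a regression test is meaningful. The function name pandas_is_affected is hypothetical, and the sketch compares only the major and minor version components.

```python
def pandas_is_affected(version: str) -> bool:
    """Return True if `version` is older than 1.4 (major.minor only)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) < (1, 4)


# The version reported in this issue is in the affected range.
reported = pandas_is_affected("1.3.0")
```

In practice the argument would be pd.__version__ after import pandas as pd.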

To Reproduce
To help us reproduce this bug, please provide the information below:

  1. Your Python version: 3.7.11
  2. The version of Mars you use
  3. Versions of crucial packages, such as numpy, scipy and pandas: pandas==1.3.0
  4. Full stack of the error: see the tracebacks above.
  5. Minimized code to reproduce the error: see the example above.

Expected behavior
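The Mars result should equal the pandas result of df.sort_values(["a0", "a1"], ascending=[False, True]), so the assert_frame_equal call in the example should pass on both backends.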


@fyrestone fyrestone added the type: bug Something isn't working label Aug 9, 2022
@fyrestone fyrestone changed the title [BUG] Mars dataframe psrs run failed on pandas==1.3.0 [BUG] Mars dataframe psrs bug on pandas<1.4 Aug 26, 2022
@fyrestone fyrestone self-assigned this Aug 26, 2022
@fyrestone fyrestone changed the title [BUG] Mars dataframe psrs bug on pandas<1.4 [BUG] Mars dataframe sort_values with multiple ascendings returns incorrect result on pandas<1.4 Aug 26, 2022