Fix race condition of set_subtask_result #2784

@chaokunyang chaokunyang commented Mar 4, 2022

What do these changes do?

  • Fix a race condition in set_subtask_result, which arises because the lock is released while _decref_input_subtasks runs.
  • Fix duplicate submission of subtasks.
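
A minimal sketch of the kind of race described above, written with plain asyncio rather than Mars actors. Everything here is an assumption for illustration: `SketchManager`, `_decref_inputs`, `done`, and `submitted` are hypothetical names, and the guarded variant shows only one possible way to close such a window, not necessarily what this PR does. The handler gives up its lock while awaiting cleanup, so a second notification for the same subtask can interleave and submit the successor twice.

```python
import asyncio


class SketchManager:
    """Hypothetical stand-in for a result handler; not the Mars implementation."""

    def __init__(self, guard: bool):
        self.guard = guard
        self.lock = asyncio.Lock()
        self.done = set()      # subtask ids whose result was already handled
        self.submitted = []    # successor submissions, in order

    async def _decref_inputs(self, subtask_id: str) -> None:
        # Stand-in for cleanup that awaits other actors; it yields to the event loop.
        await asyncio.sleep(0)

    async def set_subtask_result(self, subtask_id: str) -> None:
        async with self.lock:
            if subtask_id in self.done:
                return  # duplicate notification, ignore it
            if self.guard:
                # Record completion before the lock is released.
                self.done.add(subtask_id)
        # The lock is not held here, so another call can interleave.
        await self._decref_inputs(subtask_id)
        async with self.lock:
            self.done.add(subtask_id)
            self.submitted.append(f"successor-of-{subtask_id}")


async def run(guard: bool) -> int:
    manager = SketchManager(guard)
    # Two notifications for the same subtask arrive concurrently.
    await asyncio.gather(
        manager.set_subtask_result("s1"),
        manager.set_subtask_result("s1"),
    )
    return len(manager.submitted)


unguarded = asyncio.run(run(guard=False))
guarded = asyncio.run(run(guard=True))
print(unguarded, guarded)  # prints "2 1"
```

Without the guard, both calls pass the `done` check before either one records completion, so the successor is submitted twice. With the guard, the check and the record happen under a single lock acquisition and the second call returns early.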

Related issue number

Closes #2814

Check code requirements

  • tests added / passed (if needed)
  • Ensure all linting tests pass, see here for how to run them

@chaokunyang chaokunyang force-pushed the fix_set_subtask_result_race_condition branch 2 times, most recently from fde4670 to 8f322b4 Compare March 10, 2022 04:42
chaokunyang commented Mar 14, 2022

There seems to be a bug in worker_slot; see #2814.

Traceback (most recent call last):
  File "mars/oscar/core.pyx", line 478, in mars.oscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "mars/oscar/core.pyx", line 481, in mars.oscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "mars/oscar/core.pyx", line 482, in mars.oscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/workerslot.py", line 169, in release_free_slot
    assert acquired_slot_id == slot_id, f"acquired_slot_id {acquired_slot_id} != slot_id {slot_id}"
AssertionError: acquired_slot_id 1 != slot_id 0
2022-03-14 12:01:04,073 ERROR execution.py:120 -- Failed to run subtask PzeAnA7jYhspICgUe7eS81Vz on band numa-0
Traceback (most recent call last):
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/execution.py", line 331, in internal_run_subtask
    subtask_info.result = await self._retry_run_subtask(
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/execution.py", line 420, in _retry_run_subtask
    return await _retry_run(subtask, subtask_info, _run_subtask_once)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/execution.py", line 107, in _retry_run
    raise ex
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/execution.py", line 67, in _retry_run
    return await target_async_func(*args)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/execution.py", line 412, in _run_subtask_once
    await slot_manager_ref.release_free_slot(
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/oscar/backends/context.py", line 189, in send
    return self._process_result_message(result)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/oscar/backends/context.py", line 70, in _process_result_message
    raise message.as_instanceof_cause()
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/oscar/backends/pool.py", line 542, in send
    result = await self._run_coro(message.message_id, coro)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/oscar/backends/pool.py", line 333, in _run_coro
    return await coro
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/oscar/api.py", line 115, in __on_receive__
    return await super().__on_receive__(message)
  File "mars/oscar/core.pyx", line 506, in __on_receive__
    raise ex
  File "mars/oscar/core.pyx", line 478, in mars.oscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
  File "mars/oscar/core.pyx", line 481, in mars.oscar.core._BaseActor.__on_receive__
    async with self._lock:
  File "mars/oscar/core.pyx", line 482, in mars.oscar.core._BaseActor.__on_receive__
    result = func(*args, **kwargs)
  File "/Users/chaokunyang/ant/Development/DevProjects/python/mars/mars/services/scheduling/worker/workerslot.py", line 169, in release_free_slot
    assert acquired_slot_id == slot_id, f"acquired_slot_id {acquired_slot_id} != slot_id {slot_id}"
AssertionError: [address=127.0.0.1:56418, pid=42775] acquired_slot_id 1 != slot_id 0

@chaokunyang chaokunyang force-pushed the fix_set_subtask_result_race_condition branch from 8f322b4 to 08574dc Compare March 14, 2022 08:12
@wjsi wjsi changed the title from "fix set_subtask_result race condition" to "Fix race condition of set_subtask_result" Mar 14, 2022

@wjsi wjsi left a comment

LGTM

@wjsi wjsi added the "type: bug", "to be backported", and "mod: scheduling service" labels Mar 14, 2022
@wjsi wjsi added this to In progress in Distributed via automation Mar 14, 2022
@wjsi wjsi added this to PR-In progress in v0.9 Release via automation Mar 14, 2022
@wjsi wjsi added this to the v0.9.0rc1 milestone Mar 14, 2022

@hekaisheng hekaisheng left a comment

LGTM

@hekaisheng hekaisheng merged commit 78e5cf7 into mars-project:master Mar 14, 2022
Distributed automation moved this from In progress to Done Mar 14, 2022
v0.9 Release automation moved this from PR-In progress to PR-Done Mar 14, 2022
wjsi pushed a commit to wjsi/mars that referenced this pull request Mar 14, 2022
@hekaisheng hekaisheng added the "backported already" label and removed the "to be backported" label Mar 14, 2022
Linked issue: [BUG] release_free_slot got wrong slot_id