Skip to content

[MRG] add a new test: test_reusable_executor_thread_safety#116

Merged
ogrisel merged 9 commits into
joblib:masterfrom
ogrisel:test-reusable-executor-thread-safety
Mar 26, 2018
Merged

[MRG] add a new test: test_reusable_executor_thread_safety#116
ogrisel merged 9 commits into
joblib:masterfrom
ogrisel:test-reusable-executor-thread-safety

Conversation

@ogrisel
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel commented Mar 19, 2018

Added a new test with several threads concurrently using the same reusable executor. Found an issue, added a fix.

Let see if this test can find new bugs on the CI.

@ogrisel ogrisel requested a review from tomMoral March 19, 2018 15:17
@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 19, 2018

It seems to cause a random crash:

For instance: https://travis-ci.org/tomMoral/loky/jobs/355411973#L785

Traceback (most recent call last):
  File "/opt/python/3.4.6/lib/python3.4/threading.py", line 911, in _bootstrap_inner
    self.run()
  File "/opt/python/3.4.6/lib/python3.4/threading.py", line 859, in run
    self._target(*self._args, **self._kwargs)
  File "/home/travis/build/tomMoral/loky/tests/test_reusable_executor.py", line 587, in helper_func
    executor = get_reusable_executor(max_workers=max_workers)
  File "/home/travis/build/tomMoral/loky/loky/reusable_executor.py", line 104, in get_reusable_executor
    executor.shutdown(wait=True, kill_workers=kill_workers)
  File "/home/travis/build/tomMoral/loky/loky/process_executor.py", line 989, in shutdown
    self._queue_management_thread_wakeup.close()
  File "/home/travis/build/tomMoral/loky/loky/process_executor.py", line 122, in close
    self._reader.close()
  File "/opt/python/3.4.6/lib/python3.4/multiprocessing/connection.py", line 177, in close
    self._close()
  File "/opt/python/3.4.6/lib/python3.4/multiprocessing/connection.py", line 361, in _close
    _close(self._handle)
OSError: [Errno 9] Bad file descriptor

Similar problem in: https://travis-ci.org/tomMoral/loky/jobs/355411974#L802

and under windows: https://ci.appveyor.com/project/tomMoral/loky/build/1.0.851/job/7yx6x33y1fyaki8c#L567

Traceback (most recent call last):
  File "C:\Python33-x64\Lib\threading.py", line 901, in _bootstrap_inner
    self.run()
  File "C:\Python33-x64\Lib\threading.py", line 858, in run
    self._target(*self._args, **self._kwargs)
  File "C:\projects\loky\tests\test_reusable_executor.py", line 587, in helper_func
    executor = get_reusable_executor(max_workers=max_workers)
  File "C:\projects\loky\loky\reusable_executor.py", line 104, in get_reusable_executor
    executor.shutdown(wait=True, kill_workers=kill_workers)
  File "C:\projects\loky\loky\process_executor.py", line 984, in shutdown
    self._call_queue.join_thread()
AttributeError: 'NoneType' object has no attribute 'join_thread'

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 20, 2018

I just pushed more parametrizations for the test: the initial state can be either a broken executor or not and the threads can ask for varying numbers of workers or not, to shake it more. With this, I could reproduce some of the random failures on my local workstations by running this test in isolation.

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 20, 2018

Some of the previous commit failures where caused by a docker timeout. I hope that my last commit will avoid such false positives.

Copy link
Copy Markdown
Contributor

@tomMoral tomMoral left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the implementation.

My only concern is that we do not guarantee to the user the number of processes he asks for and all this can go silently. But I am not sure how we could detect that different threads are asking simultaneously different number of processes.

Comment thread loky/process_executor.py
if self._queue_management_thread_wakeup:
self._queue_management_thread_wakeup.close()

if qmtw:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To reduce as much as possible the chances of an OSError, we could call qmtw = self.queue_manager_thread_wakeup before this if statement.

Comment thread loky/process_executor.py Outdated
# Wake up queue management thread
self._queue_management_thread_wakeup.wakeup()
if qmtw is not None:
qmtw.wakeup()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can fail with OSError if it has been closed concurrently so adding a try..catch is probably safer.

Comment thread CHANGES.md Outdated
### 2.1.0 (in developement)

- Fixed a thread-safety issue when iterating over the dict of processes in
`ReusablePoolExecutor._resize`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe update this to:
Fixed thread-safety issues for ReusablePoolExecutor._resize and ReusablePoolExecutor.shutdown

Comment thread tox.ini
passenv = NUMBER_OF_PROCESSORS
deps =
pytest
pytest==3.4.2
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pytest3.4.2 does not support python3.3...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove the Python 3.3 entry in travis :)

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 26, 2018

The travis tests now pass but we still get a deadlock on windows:

https://ci.appveyor.com/project/tomMoral/loky/build/1.0.860/job/d2xqvkpc0s5wuogi

It's weird, it looks like we have 2 reusable executor instance running: 2 QM threads and 2 QF threads. The factory is probably not thread safe.

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 26, 2018

@tomMoral Ok I have put a more global reentrant lock that protects:

  • the reusable executor's factory
  • any call to _ReusablePoolExecutor.submit
  • any call to _ReusablePoolExecutor._resize

This way we can never have 2 instances of the _ReusablePoolExecutor in the same python program. Furthermore I have rename the class to make it private to make it explicit that the factory is the only supported way to get the singleton. So far the tests look good.

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 26, 2018

The tests pass. The code coverage has decreased a bit because we need to clean up the py 3.3 specific hacks. But I would rather do that in a different PR.

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 26, 2018

I have run the joblib tests on my machine several times against this loky branch and everything is fine. @tomMoral feel free to merge and make a release if you agree with this PR.

@ogrisel
Copy link
Copy Markdown
Contributor Author

ogrisel commented Mar 26, 2018

Let's merge to get the cron job shake this one a bit more.

@ogrisel ogrisel merged commit e4e927c into joblib:master Mar 26, 2018
@ogrisel ogrisel deleted the test-reusable-executor-thread-safety branch March 26, 2018 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants