Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallel hangs when using Dask as a back-end #875

Closed
jjerphan opened this issue May 17, 2019 · 15 comments
Closed

Parallel hangs when using Dask as a back-end #875

jjerphan opened this issue May 17, 2019 · 15 comments

Comments

@jjerphan
Copy link
Contributor

jjerphan commented May 17, 2019

Hello,

I have been using Dask through joblib and had somes problem of deadlocks for the execution of some tasks.

I am using Parallel with Dask as a backend. Parallel.__call__ does not returns.
After having modified some of the code with logs in scikit-learn 0.20.x distribution of joblib and distributed master; what I observed is that Tasks are not submitted at one point. Apart from that it seems that tasks on workers are executed properly.

I have inspected Parallel code, I think that I may have found a possible explanation of possible deadlocks.


The dispatch of tasks in made as follows in Parallel.__call__:

if self.dispatch_one_batch(iterator):
    self._iterating = self._original_iterator is not None

while self.dispatch_one_batch(iterator):
    logging.info("Parallel.__call__ dispatched one batch")
    pass

in Parallel.dispatch_one_batch Parallel._dispatch is called.
As indicated in the docstring Parallel._dispatch "this method is not thread-safe: it should be only called indirectly via dispatch_one_batch".

_dispatch does use BatchCompletionCallBack's objects that are called by the backendapply_async. In the case of SequentialBackend's, the execution is made in the same thread so there are no problems ; but in the case of Dask, DaskDistributedBackend.apply_async does register this callback on tornado.IOLoop:

class DaskDistributedBackend(ParallelBackendBase, AutoBatchingMixin):

    # ...

    def apply_async(self, func, callback=None):
        key = '%s-batch-%s' % (_funcname(func), uuid4().hex)
        func, args = self._to_func_args(func)

        future = self.client.submit(func, *args, key=key, **self.submit_kwargs)
        self.task_futures.add(future)

        @gen.coroutine
        def callback_wrapper():
            result = yield _wait([future])
            self.task_futures.remove(future)
            if callback is not None:
                callback(result)  # gets called in separate thread

        self.client.loop.add_callback(callback_wrapper)
        # ... 

And as indicated, it "gets called in another thread" ; hence possibly producing a dead-lock if the lock is already taken in the main thread.

I am not entirely sure this could be a valid explanation. Also, I do not see a way to do handle the callback differently in DaskDistributedBackend, nor do I understand all the moving parts of this connector.

I am inspecting it currently. I'll be glad to have your thought on it, @ogrisel and @mrocklin.
While I can hardly give a minimal complete reproducible example with my setup (sorry), I think that the problem goes hand in hand with specific workloads and can be present in a variety of setups and observed in a variety of errors that are hard to predict.

However, you will find logs that present the problem ; I hope they are useful.

Please let me know if you need some more clues.


Setup:

Python version: 2.7.5 (default, Apr  9 2019, 14:30:50)
Dask version: 1.2.2+10.gd53b560
Distributed version: 1.28.1+26.gb876796.dirty
Client-side logs
[2019-05-17 08:18:35,615] [22/MainThread] [INFO] [root] Setting up Dask
[2019-05-17 08:18:35,616] [22/MainThread] [INFO] [root] Python      version:
[2019-05-17 08:18:35,616] [22/MainThread] [INFO] [root] 2.7.5 (default, Apr  9 2019, 14:30:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
[2019-05-17 08:18:35,616] [22/MainThread] [INFO] [root] Dask        version: 1.2.2+10.gd53b560
[2019-05-17 08:18:35,616] [22/MainThread] [INFO] [root] Distributed version: 1.28.1+26.gb876796.dirty
[2019-05-17 08:18:46,970] [22/MainThread] [INFO] [root] Dask Scheduler started on port 8786
[2019-05-17 08:18:46,971] [22/MainThread] [INFO] [root] Done with Dask Scheduler Setup
[2019-05-17 08:18:47,302] [22/Thread-1] [INFO] [root] Scheduler.__init__ called
[2019-05-17 08:18:47,307] [22/Thread-1] [INFO] [root] Scheduler._setup_logging called
[2019-05-17 08:18:48,938] [22/Thread-1] [INFO] [root] Scheduler.start called
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://10.12.3.73:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
[2019-05-17 08:18:49,069] [22/Thread-1] [INFO] [root] Scheduler.finished called
[2019-05-17 08:18:49,074] [22/Thread-1] [INFO] [root] Scheduler.add_worker called
[2019-05-17 08:18:49,074] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
distributed.scheduler - INFO - Register tcp://10.12.3.74:33599
[2019-05-17 08:18:49,077] [22/Thread-1] [INFO] [root] Scheduler.handle_worker called
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.12.3.74:33599
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:49,087] [22/Thread-1] [INFO] [root] Scheduler.add_worker called
[2019-05-17 08:18:49,088] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
distributed.scheduler - INFO - Register tcp://10.12.3.75:42309
[2019-05-17 08:18:49,091] [22/Thread-1] [INFO] [root] Scheduler.handle_worker called
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.12.3.75:42309
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:49,095] [22/Thread-1] [INFO] [root] Scheduler.add_worker called
[2019-05-17 08:18:49,096] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
distributed.scheduler - INFO - Register tcp://10.12.3.76:36063
[2019-05-17 08:18:49,099] [22/Thread-1] [INFO] [root] Scheduler.handle_worker called
[2019-05-17 08:18:49,099] [22/MainThread] [INFO] [root] Checking series of type: float64 (isM8=False)
distributed.scheduler - INFO - Starting worker compute stream, tcp://10.12.3.76:36063
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:49,212] [22/MainThread] [INFO] [root] Creating Dask client
[2019-05-17 08:18:49,233] [22/Thread-1] [INFO] [root] Scheduler.add_client called
distributed.scheduler - INFO - Receive client connection: Client-65e26f35-787c-11e9-8016-0a580a0c0349
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] register_parallel_backend called
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] Creating joblib.Parallel(n_jobs=4, verbose=100, pre_dispatch=n_jobs, backend=dask)
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] Parallel.__init__ called
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] Worker.get_client called
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,242] [22/MainThread] [INFO] [root] Parallel.__call__ called
[2019-05-17 08:18:49,243] [22/MainThread] [INFO] [root] Parallel._initialize_backend called
[2019-05-17 08:18:49,243] [22/MainThread] [INFO] [root] DaskDistributedBackend.configure called
[2019-05-17 08:18:49,243] [22/MainThread] [INFO] [root] DaskDistributedBackend.effective_n_jobs called
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] Parallel._print called
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] DaskDistributedBackend.start_call called
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] Parallel.__init__: iterator created
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch called
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: took the lock
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] DaskDistributedBackend.get_nested_backend called
[2019-05-17 08:18:49,251] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:49,252] [22/MainThread] [INFO] [root] delayed called
[2019-05-17 08:18:49,254] [22/MainThread] [INFO] [root] Parallel.dispatch: calling self._dispatch
[2019-05-17 08:18:49,254] [22/MainThread] [INFO] [root] Parallel._dispatch called
[2019-05-17 08:18:49,254] [22/MainThread] [INFO] [root] Parallel._dispatch: submitting job of id 0 to back-end
[2019-05-17 08:18:49,254] [22/MainThread] [INFO] [root] DaskDistributedBackend.apply_async called
[2019-05-17 08:18:49,254] [22/MainThread] [INFO] [root] DaskDistributedBackend._to_func_args called
[2019-05-17 08:18:49,255] [22/MainThread] [INFO] [root] in '_dask._to_func_args'
[2019-05-17 08:18:49,255] [22/MainThread] [INFO] [root] Looping on func.items
[2019-05-17 08:18:49,255] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:49,262] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,270] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:49,280] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:49,280] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,280] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,280] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,294] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:49,303] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,308] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:49,312] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:49,312] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,312] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,313] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,322] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:49,328] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,331] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:49,334] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:49,334] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,334] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,334] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,344] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:49,351] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,354] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:49,356] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:49,357] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,357] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,357] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,369] [22/MainThread] [INFO] [root] Appending tasks done
[2019-05-17 08:18:49,369] [22/MainThread] [INFO] [root] Batch.__init__ called
[2019-05-17 08:18:49,369] [22/MainThread] [INFO] [root] Submitting to the client
[2019-05-17 08:18:49,369] [22/MainThread] [INFO] [root] Batch.__reduce__ called
[2019-05-17 08:18:49,370] [22/MainThread] [INFO] [root] Added future to tasks
[2019-05-17 08:18:49,371] [22/MainThread] [INFO] [root] Parallel._dispatch: job of id 0 inserted to back-end
[2019-05-17 08:18:49,371] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: leaving the lock after self._dispatch
[2019-05-17 08:18:49,372] [22/Thread-1] [INFO] [root] Scheduler.update_graph called
[2019-05-17 08:18:49,373] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch called
[2019-05-17 08:18:49,373] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:49,373] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: took the lock
[2019-05-17 08:18:49,374] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:49,374] [22/MainThread] [INFO] [root] DaskDistributedBackend.get_nested_backend called
[2019-05-17 08:18:49,374] [22/Thread-1] [INFO] [root] Scheduler.valid_workers called
[2019-05-17 08:18:49,374] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:49,374] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:49,375] [22/Thread-1] [INFO] [root] Scheduler.consumre_resources called
[2019-05-17 08:18:49,375] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,375] [22/MainThread] [INFO] [root] delayed called
[2019-05-17 08:18:49,375] [22/Thread-1] [INFO] [root] Scheduler.send_task_to_worker called
[2019-05-17 08:18:49,376] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:49,377] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,377] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,378] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,378] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,379] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,379] [22/MainThread] [INFO] [root] Parallel.dispatch: calling self._dispatch
[2019-05-17 08:18:49,379] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,380] [22/MainThread] [INFO] [root] Parallel._dispatch called
[2019-05-17 08:18:49,380] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:49,380] [22/MainThread] [INFO] [root] Parallel._dispatch: submitting job of id 1 to back-end
[2019-05-17 08:18:49,381] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:49,382] [22/MainThread] [INFO] [root] DaskDistributedBackend.apply_async called
[2019-05-17 08:18:49,383] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:49,383] [22/MainThread] [INFO] [root] DaskDistributedBackend._to_func_args called
[2019-05-17 08:18:49,383] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:49,383] [22/MainThread] [INFO] [root] in '_dask._to_func_args'
[2019-05-17 08:18:49,384] [22/MainThread] [INFO] [root] Looping on func.items
[2019-05-17 08:18:49,384] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:49,397] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:49,400] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:49,485] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,586] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,687] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,788] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,888] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:49,990] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,090] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,191] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,292] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,392] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,493] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,594] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,694] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,795] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,895] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:50,996] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,097] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,113] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,113] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,113] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,113] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,122] [22/Thread-1] [INFO] [root] Scheduler.add_client called
distributed.scheduler - INFO - Receive client connection: Client-worker-67033d1e-787c-11e9-8016-0a580a0c034a
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:51,165] [22/MainThread] [INFO] [root] maybe to future called
=[2019-05-17 08:18:51,171] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,173] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,176] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,176] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,176] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,176] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,194] [22/MainThread] [INFO] [root] Appending tasks done
[2019-05-17 08:18:51,194] [22/MainThread] [INFO] [root] Batch.__init__ called
[2019-05-17 08:18:51,194] [22/MainThread] [INFO] [root] Submitting to the client
[2019-05-17 08:18:51,194] [22/MainThread] [INFO] [root] Batch.__reduce__ called
[2019-05-17 08:18:51,195] [22/MainThread] [INFO] [root] Added future to tasks
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] Parallel._dispatch: job of id 1 inserted to back-end
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: leaving the lock after self._dispatch
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] Parallel.__call__ dispatched one batch
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch called
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: took the lock
[2019-05-17 08:18:51,196] [22/MainThread] [INFO] [root] DaskDistributedBackend.get_nested_backend called
[2019-05-17 08:18:51,197] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:51,198] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,198] [22/MainThread] [INFO] [root] delayed called
[2019-05-17 08:18:51,199] [22/MainThread] [INFO] [root] Parallel.dispatch: calling self._dispatch
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] Parallel._dispatch called
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] Parallel._dispatch: submitting job of id 2 to back-end
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] DaskDistributedBackend.apply_async called
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] DaskDistributedBackend._to_func_args called
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] in '_dask._to_func_args'
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] Looping on func.items
[2019-05-17 08:18:51,200] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:51,207] [22/Thread-1] [INFO] [root] Scheduler.update_graph called
[2019-05-17 08:18:51,207] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,208] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,208] [22/Thread-1] [INFO] [root] Scheduler.valid_workers called
[2019-05-17 08:18:51,208] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,208] [22/Thread-1] [INFO] [root] Scheduler.consumre_resources called
[2019-05-17 08:18:51,208] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,209] [22/Thread-1] [INFO] [root] Scheduler.send_task_to_worker called
[2019-05-17 08:18:51,209] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,209] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,208] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,209] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,210] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,211] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,211] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,211] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,211] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,211] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,213] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,216] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,217] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,219] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,219] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,219] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,219] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,224] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:51,230] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,232] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,237] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,238] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,238] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,238] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,247] [22/MainThread] [INFO] [root] Appending tasks done
[2019-05-17 08:18:51,247] [22/MainThread] [INFO] [root] Batch.__init__ called
[2019-05-17 08:18:51,247] [22/MainThread] [INFO] [root] Submitting to the client
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Batch.__reduce__ called
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Added future to tasks
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Parallel._dispatch: job of id 2 inserted to back-end
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: leaving the lock after self._dispatch
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Parallel.__call__ dispatched one batch
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch called
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: took the lock
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] DaskDistributedBackend.get_nested_backend called
[2019-05-17 08:18:51,248] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:51,249] [22/MainThread] [INFO] [root] delayed called
[2019-05-17 08:18:51,250] [22/MainThread] [INFO] [root] Parallel.dispatch: calling self._dispatch
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] Parallel._dispatch called
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] Parallel._dispatch: submitting job of id 3 to back-end
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] DaskDistributedBackend.apply_async called
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] DaskDistributedBackend._to_func_args called
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] in '_dask._to_func_args'
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] Looping on func.items
[2019-05-17 08:18:51,251] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:51,253] [22/Thread-1] [INFO] [root] Scheduler.update_graph called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.valid_workers called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.consumre_resources called
[2019-05-17 08:18:51,254] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.send_task_to_worker called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,255] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,264] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,265] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,277] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,278] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,278] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,278] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,282] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:51,290] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,292] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,295] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:51,295] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,295] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,296] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,301] [22/MainThread] [INFO] [root] Appending tasks done
[2019-05-17 08:18:51,301] [22/MainThread] [INFO] [root] Batch.__init__ called
[2019-05-17 08:18:51,301] [22/MainThread] [INFO] [root] Submitting to the client
[2019-05-17 08:18:51,301] [22/MainThread] [INFO] [root] Batch.__reduce__ called
[2019-05-17 08:18:51,302] [22/Thread-1] [INFO] [root] Scheduler.update_graph called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.valid_workers called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.decide_worker called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.worker_objective called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.worker_objective called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.consumre_resources called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.send_task_to_worker called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,303] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,304] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,304] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,304] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,304] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,304] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,305] [22/MainThread] [INFO] [root] Added future to tasks
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] Parallel._dispatch: job of id 3 inserted to back-end
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: leaving the lock after self._dispatch
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] Parallel.__call__ dispatched one batch
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch called
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] Parallel.dispatch_one_batch: took the lock
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] DaskDistributedBackend.get_nested_backend called
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] DaskDistributedBackend.__init__ called
[2019-05-17 08:18:51,306] [22/MainThread] [INFO] [root] delayed called
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] Parallel.dispatch: calling self._dispatch
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] Parallel._dispatch called
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] Parallel._dispatch: submitting job of id 4 to back-end
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] DaskDistributedBackend.apply_async called
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] DaskDistributedBackend._to_func_args called
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] in '_dask._to_func_args'
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] Looping on func.items
[2019-05-17 08:18:51,308] [22/MainThread] [INFO] [root] maybe to future called
[2019-05-17 08:18:51,313] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,317] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,319] [22/IO loop] [INFO] [root] Worker.get_worker called
[2019-05-17 08:18:51,321] [22/Thread-1] [INFO] [root] Scheduler.scatter called
[2019-05-17 08:18:51,399] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,399] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler.handle_task_finished called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler._remove_from_processing called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler.release_resources called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler._add_to_memory called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,455] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,457] [22/IO loop] [INFO] [root] Parallel.print_progress called
[2019-05-17 08:18:51,457] [22/IO loop] [INFO] [root] Parallel._print called
[2019-05-17 08:18:51,500] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,500] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,595] [22/Thread-1] [INFO] [root] Scheduler.handle_task_finished called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler._remove_from_processing called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.release_resources called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler._add_to_memory called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,596] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,597] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,597] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,597] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,597] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,597] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,600] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,601] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,695] [22/Thread-1] [INFO] [root] Scheduler.handle_task_finished called
[2019-05-17 08:18:51,695] [22/Thread-1] [INFO] [root] Scheduler._remove_from_processing called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler.release_resources called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler._add_to_memory called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:51,696] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:51,697] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:51,701] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,802] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:51,902] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,003] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,104] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,204] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,305] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,407] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,508] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,608] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,709] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,810] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:52,910] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:53,011] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:53,061] [22/Thread-1] [INFO] [root] Scheduler.update_data called
[2019-05-17 08:18:53,062] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,062] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:53,062] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,076] [22/Thread-1] [INFO] [root] Scheduler.add_keys called
[2019-05-17 08:18:53,076] [22/Thread-1] [INFO] [root] Scheduler.add_keys called
[2019-05-17 08:18:53,076] [22/Thread-1] [INFO] [root] Scheduler.add_keys called
[2019-05-17 08:18:53,089] [22/Thread-1] [INFO] [root] Scheduler.add_client called
distributed.scheduler - INFO - Receive client connection: Client-worker-6830848a-787c-11e9-8016-0a580a0c034b
distributed.core - INFO - Starting established connection
[2019-05-17 08:18:53,111] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:53,169] [22/Thread-1] [INFO] [root] Scheduler.handle_task_finished called
[2019-05-17 08:18:53,169] [22/Thread-1] [INFO] [root] Scheduler._remove_from_processing called
[2019-05-17 08:18:53,169] [22/Thread-1] [INFO] [root] Scheduler.check_idle_saturated called
[2019-05-17 08:18:53,170] [22/Thread-1] [INFO] [root] Scheduler.release_resources called
[2019-05-17 08:18:53,170] [22/Thread-1] [INFO] [root] Scheduler._add_to_memory called
[2019-05-17 08:18:53,170] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,170] [22/Thread-1] [INFO] [root] Scheduler.transitions called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.worker_send called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report_on_key called
[2019-05-17 08:18:53,171] [22/Thread-1] [INFO] [root] Scheduler.report called
# continuing indefinitely
@jjerphan
Copy link
Contributor Author

jjerphan commented May 17, 2019

I've conducted some more experiments with varying number of workers (ranging from 1 to 10) and the results are the same (the problem is always there). The number of tasks executed is the same, so the problem may be more to be related to joblib and not Dask, IMO.

In one of those experiments, several batches (generally 6 batches, see what follows) are dispatched, but only one BatchCompletionCallBack is executed.

This can be seen by counting the logs : everything seems to be executed accordingly to the batches processed (see the logs that are done 6 times vs the logs that are only done once).

Logs count ```
   1 BatchCompletionCallBack.__call__ called
   1 Client.__init__ called
   1 Client._start called
   1 Client.start called
   1 Creating Dask client
   1 Creating joblib.Parallel(n_jobs=4, verbose=100, pre_dispatch=n_jobs, backend=dask)
   1 DaskDistributedBackend.configure called
   1 DaskDistributedBackend.effective_n_jobs called
   1 DaskDistributedBackend.start_call called
   1 Early stopping enabled
   1 Entering Dask context
   1 Parallel.__call__ called
   1 Parallel.__call__ dispatched initial batch
   1 Parallel.__call__: iterator created
   1 Parallel.__init__ called
   1 Parallel._dispatch: job of id 0 inserted to back-end
   1 Parallel._dispatch: job of id 1 inserted to back-end
   1 Parallel._dispatch: job of id 2 inserted to back-end
   1 Parallel._dispatch: job of id 3 inserted to back-end
   1 Parallel._dispatch: job of id 4 inserted to back-end
   1 Parallel._dispatch: job of id 5 inserted to back-end
   1 Parallel._dispatch: submitting job of id 0 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 1 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 2 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 3 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 4 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 5 to back-end of size 1
   1 Parallel._dispatch: submitting job of id 6 to back-end of size 1
   1 Parallel._initialize_backend called
   1 Parallel.print_progress called
   1 Scheduler.__init__ called
   1 Scheduler._setup_logging called
   1 Scheduler.add_worker called
   1 Scheduler.finished called
   1 Scheduler.handle_worker called
   1 Worker.get_client called
   1 [Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:    2.3s
   1 [Parallel(n_jobs=4)]: Using backend DaskDistributedBackend with 8 concurrent workers.
   1 register_parallel_backend called
   2 Parallel._print called
   2 Scheduler.add_client called
   5 Parallel.__call__ dispatched one batch
   6 Appending tasks done
   6 Batch.__init__ called
   6 Batch.__reduce__ called
   6 Client._graph_to_futures called
   6 Client.submit called
   6 DaskDistributedBackend.apply_async: Added future to tasks
   6 DaskDistributedBackend.apply_async: Done submitting to the Client
   6 DaskDistributedBackend.apply_async: Executing callback_wrapper
   6 DaskDistributedBackend.apply_async: Returning future
   6 DaskDistributedBackend.apply_async: Submitting to Client
   6 Parallel.dispatch_one_batch: leaving the lock after self._dispatch
   6 Scheduler._add_to_memory called
   6 Scheduler._remove_from_processing called
   6 Scheduler.consumre_resources called
   6 Scheduler.handle_task_finished called
   6 Scheduler.release_resources called
   6 Scheduler.send_task_to_worker called
   6 Scheduler.update_graph called
   6 Scheduler.valid_workers called
   7 BatchCompletionCallBack.__init__ called
   7 BatchedCalls.__init__ : wraps 1 objects
   7 BatchedCalls.__init__ called
   7 DaskDistributedBackend._to_func_args called
   7 DaskDistributedBackend.apply_async called
   7 DaskDistributedBackend.get_nested_backend called
   7 Looping on func.items
   7 Parallel._dispatch called
   7 Parallel.dispatch: calling self._dispatch
   7 Parallel.dispatch_one_batch called
   7 Parallel.dispatch_one_batch: took the lock
   7 delayed called
   7 in '_dask._to_func_args'
   8 DaskDistributedBackend.__init__ called
  10 Scheduler.worker_send called
  12 Scheduler.decide_worker called
  16 Scheduler.scatter called
  16 Scheduler.update_data called
  16 [f] = self.client.scatter([arg])
  16 maybe to future called
  17 Worker.get_worker called
  24 Scheduler.transitions called
  33 Scheduler.check_idle_saturated called
  44 Scheduler.report_on_key called
  70 Scheduler.report called

Here, a 7th batch is to be submitted but is dispatch is not entirely done.
This can be seen with the number of times the lock is left by Parallel.dispatch_one_batch vs the number of times it is called:

# Sorted by execution order
7 Parallel.dispatch_one_batch called
7 Parallel.dispatch_one_batch: took the lock
7 Parallel.dispatch: calling self._dispatch
6 Parallel.dispatch_one_batch: leaving the lock after self._dispatch

My guess is that a race condition is present on Parallel.lock: both BatchCompletionCallBack.__call__ call to Parallel._dispatch (via Parallel.dispatch_next) and Parallel.dispatch_one_batch call to Parallel._dispatch are racing for it in different threads.

I'll run another experiment with even more logs on the lock.

@jjerphan
Copy link
Contributor Author

I've created a repository for a minimal reproducible example: jjerphan/joblib_dask_deadlock.

I am trying to tweak things to have the same behavior.

@mrocklin
Copy link
Contributor

FYI @TomAugspurger

@ogrisel
Copy link
Contributor

ogrisel commented May 22, 2019

Thanks very much @jjerphan for investigating. I had meant to have a look into this issue for quite some time but I have been busy with talks, reviews and the scikit-learn release. I hope I will find the time to review this soon but unfortunately probably not this week nor the next either...

@jjerphan
Copy link
Contributor Author

jjerphan commented May 22, 2019

No problem @ogrisel. :)


Here are some other logs that focus on the lock, especially on the race condition between the dispatch of jobs in DaskDistributedBackend and BatchCompletionCallBack.

Briefly (see the tail), the DaskDistributedBackEnd:

  • takes the Rlock (Parallel._lock)
  • scatters the data via the Client
  • dispatches tasks ie (f, args, kwargs)

The BatchCompletionCallBack comes in the middle and tries to take the lock. From there, the Scheduler seems to hang.

@jjerphan
Copy link
Contributor Author

jjerphan commented May 23, 2019

OK, so I am done with the reproducible setup.

If you have a kubernetes cluster, you can observe the behavior this with a helm chart and some scripts I provide.

I have new problems now that are a bit different from the first ones I had (see those logs).

I'll inspect this tomorrow.

@jjerphan
Copy link
Contributor Author

jjerphan commented May 24, 2019

I am trying to understand the mecanism and role of the RLock in Parallel.

This lock is taken twice in Parallel's thread: one time in dispatch_one_batch
and one time in dispatch.

I now nothing about using locks for concurrency. What is the advantages/goal of this mecanism? What is the typical design decision behind this?

@ogrisel: do you have some clues about this ?
Thank you! :)

@jjerphan
Copy link
Contributor Author

I now have a setup that hangs on Client._update_scheduler_info after the submission of tasks. I hardly can diagnose what the problem is. 🤔

I have put the logs here. I am listing problems and logs I have in my setup as they arise.

@ogrisel
Copy link
Contributor

ogrisel commented May 29, 2019

The dispatch functions can be called by the main thread in Parallel.__call__ but also by the async callback threads when tasks complete. Therefore we need a lock to avoid concurrent access to the parallel backend state by both threads when the first tasks complete before the main thread has finished dispatching.

@ogrisel
Copy link
Contributor

ogrisel commented May 29, 2019

Any datastructure that manages the queue of tasks, pending results and collected results. It depends on each backend implementation.

@jjerphan
Copy link
Contributor Author

I was thinking of something similar to this, but I hardly see how can an additional lock help.
What is the "parallel backend state" you are talking about exactly?

@jjerphan
Copy link
Contributor Author

@jjerphan
Copy link
Contributor Author

I am trying to understand the logic of _to_func_argsand maybe_to_futures and the side effects. For what I understand, args and kwargs are computed via future if needed before being returned in a Batch that is to be executed on Workers, right?

@jjerphan jjerphan changed the title Possible deadlock explanation when using Dask with Parallel Parallel hangs when using Dask as a back-end Jun 4, 2019
@jjerphan
Copy link
Contributor Author

jjerphan commented Jun 5, 2019

I've tried to use dask_ml.GridSearchCV instead of the sklearn.model_selection.GridSearchCV and I do not have this problem anymore.

It seems to be a subtle problem of interface between joblib and distributed. I have some ideas but I don't want to be speculative.

For now, I'll stop to work on this issue, but if anyone is interested in investigating it, please, feel free to contact me. :)

@jjerphan
Copy link
Contributor Author

Fixed by #914.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants