Collect Dask results as they complete #1025

Merged: 6 commits merged from dask-as-completed into joblib:master on May 11, 2020

Conversation

@mrocklin (Contributor)

Previously we would wait until joblib called lazy_result.get() before
collecting results. This would trigger a transfer in the main thread,
which would block things a bit.

Now we collect results as soon as they are available using
dask.distributed.as_completed. This helps to reduce overhead in joblib
a bit and improve overall bandwidth.

Somewhat related to (but doesn't entirely fix) #1020 and dask/dask#5993

There are still performance issues. We have a lot of downtime in this process, but it's not happening in Dask (all of our profilers show that we're not spending a ton of time in Dask code). I suspect that something on the joblib end is blocking things, but profiling so far hasn't shown anything.
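
For reference, here is a minimal sketch of the distributed pattern this PR builds on: consuming results with dask.distributed.as_completed as soon as they finish, instead of blocking on each future in submission order. This is illustrative only (the inc function and the local Client are assumptions for the example), not the joblib patch itself.

from dask.distributed import Client, as_completed

def inc(x):
    return x + 1

if __name__ == "__main__":
    client = Client()  # local cluster, for illustration only
    futures = client.map(inc, range(10))

    # with_results=True fetches each result in the background as soon as the
    # corresponding task finishes, instead of transferring it in the calling
    # thread when .result() is eventually called.
    for future, result in as_completed(futures, with_results=True):
        print(future.key, result)

    client.close()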

@mrocklin (Contributor, Author)

This currently fails on test_auto_scatter

@codecov (bot) commented Mar 21, 2020

Codecov Report

Merging #1025 into master will increase coverage by 0.01%.
The diff coverage is 97.91%.

@@            Coverage Diff             @@
##           master    #1025      +/-   ##
==========================================
+ Coverage   94.15%   94.17%   +0.01%     
==========================================
  Files          46       46              
  Lines        6522     6540      +18     
==========================================
+ Hits         6141     6159      +18     
  Misses        381      381              
Impacted Files                 Coverage Δ
joblib/_dask.py                94.84% <97.82%> (+0.52%) ⬆️
joblib/test/test_dask.py       98.03% <100.00%> (ø)
joblib/_parallel_backends.py   94.69% <0.00%> (-1.14%) ⬇️
joblib/parallel.py             96.33% <0.00%> (-0.57%) ⬇️
joblib/test/test_parallel.py   96.91% <0.00%> (+0.11%) ⬆️
joblib/disk.py                 88.33% <0.00%> (+6.66%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dd4b9a0...d2f2991.

@lesteve (Member) commented Mar 23, 2020

I looked a bit at the failure on Python 3 (Python 2 failures can be ignored since this PR contains a change that is not compatible with Python 2) and I can reproduce it locally. It manifests itself as a "hang" (sorry for the vague term), i.e. the Parallel call never completes:

Parallel()(delayed(noop)(data, data, i, opt=data)
           for i, data in enumerate(data_to_process))

What I observed is that when the input data is changed slightly, the "hang" goes away and the test passes. I just pushed a commit to see whether that is also the case on the CI. I don't have any clue why this happens at the moment.

@lesteve (Member) commented Mar 23, 2020

What I observed is that when the input data is changed slightly, the "hang" goes away and the test passes. I just pushed a commit to see whether that is also the case on the CI. I don't have any clue why this happens at the moment.

So this quirk is confirmed on the CI.

Side comment: the "hang" was fully deterministic locally, i.e. it does not seem to be due to a race condition.

@mrocklin (Contributor, Author)

Side comment: the "hang" was fully deterministic locally, i.e. it does not seem to be due to a race condition.

Right, I wonder if the imbalance of the data before was intentional in some way. Maybe the current implementation is not as robust to some bad user inputs.

@mrocklin (Contributor, Author)

Python 2 failures can be ignored since this PR contains a change that is not compatible with Python 2

Is Joblib still supporting Python 2? If so, I can try to switch to the older API.

@lesteve (Member) commented Mar 23, 2020

Is Joblib still supporting Python 2? If so, I can try to switch to the older API.

I have not been following joblib very closely, but there is an ongoing PR for dropping Python 2: #1018. I would not worry too much about Python 2 support.

@pierreglaser (Contributor) commented Apr 3, 2020

Thank you very much for this fix @mrocklin; I can also confirm that this PR fixes at least #957. There are a few big PRs to merge right now (#1018, #966), but this definitely has to be merged ASAP (discussing it with @ogrisel right now).

@pierreglaser (Contributor) left a comment

Thank you very much for this PR @mrocklin!

joblib/_dask.py (outdated)
cf_future.set_exception(exc)
else:
cf_future.set_result(result)
self._callbacks.pop(future)(result)
Contributor

Please correct me if I'm wrong, but the joblib callback contains a dask client.scatter call, which will automatically be made into a coroutine if it runs in the IOLoop thread (such as here, since dask would detect that the call is made in a thread that should not be blocked). If we want the callback to be run in the IOLoop, we would need to make the whole joblib callback execution chain async, right?

Would it be possible (and would it make sense) to run such a _collect routine in a regular thread instead? That way the callback can remain blocking without blocking the IOLoop itself.
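
A minimal sketch of the two options being weighed here, purely for illustration (handle_completed, joblib_callback and _callback_pool are hypothetical names, not joblib's actual internals):

import asyncio
from concurrent.futures import ThreadPoolExecutor

_callback_pool = ThreadPoolExecutor(max_workers=1)  # hypothetical helper thread

async def handle_completed(cf_future, result, joblib_callback):
    # Runs on the IOLoop: hand the result back to the waiting caller...
    cf_future.set_result(result)
    # ...then either (a) push the blocking joblib callback onto a regular
    # thread so that the IOLoop itself is never blocked,
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(_callback_pool, joblib_callback, result)
    # or (b) make the whole joblib callback chain async and simply
    # `await joblib_callback(result)` here instead.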

Contributor Author

Thanks for bringing this up. I hadn't thought of it before.

Currently I see scatter calls both in the constructor, and in the apply_async call. My guess is that both of these will be called from the user's main thread, and not from the IOLoop. If that's the case then I think that we should be safe. Dask generally runs the IOLoop in a separate thread.

You mention the callback function and I have to admit that I don't know what happens in that function. Does that call Joblib code which might then call apply_async? If so then yes, we might want to queue up the callbacks to run in the main thread (or some other thread).

Contributor

Thanks for the fast feedback.

You mention the callback function and I have to admit that I don't know what happens in that function.

This callback is actually in charge of dispatching a new task from the iterator given as input to the Parallel object. Thus, this callback calls apply_async which itself triggers a scatter call.

Contributor Author

Ok, we're now always scattering from the event loop, so this should be resolved.

Oddly, we now sometimes scatter too frequently (maybe because we're now allowing some concurrency), so this introduces some inefficiency (see the modified test). I tried avoiding this with an asyncio.Lock, but that stopped things; I'm not yet sure why.
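
For context, a hedged sketch of the kind of guard described above: an asyncio.Lock around a small scatter cache so that concurrent callbacks do not scatter the same data twice. ScatterCache is a hypothetical name, and as noted above the actual attempt stalled, so this is not what the PR ended up doing.

import asyncio

class ScatterCache:
    def __init__(self, client):
        self.client = client
        self._lock = asyncio.Lock()
        self._scattered = {}  # id(obj) -> dask Future

    async def scatter(self, obj):
        # Serialize scatters so two concurrent callbacks cannot both decide
        # that the same object has not been scattered yet.
        async with self._lock:
            key = id(obj)
            if key not in self._scattered:
                [fut] = await self.client.scatter([obj], asynchronous=True)
                self._scattered[key] = fut
            return self._scattered[key]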

Contributor

Indeed, now the whole callback runs in the event loop. This callback makes scatter calls, which are made async (great!) and thus non-blocking, but it also runs a lot of non-async joblib code, and I'm afraid we affect performance by running so much joblib code in the event loop (even though I'm not an async expert).

I coded an alternative here: pierreglaser@00158c8, where we should run much less joblib code inside the event loop. I'm not saying this is the way to go, though.

Contributor Author

Hrm, alternatively could we call just the callback in another thread? For debugging reasons it's useful to keep Dask code running in the event loop; things tend to be smoother generally.

How much joblib work happens in the callback? What are these calls doing?

@pierreglaser (Contributor) commented Apr 29, 2020

These calls are basically consuming new tasks from the original iterator passed as input to Parallel.__call__. So there's a fair amount of joblib code that this callback runs, including code inside with-lock context managers, client.scatter calls to scatter the input, and eventually client.submit calls. So there is a bunch of dask code and a bunch of joblib code run by the callbacks.

Hrm, alternatively could we call just the callback in another thread?

Doesn't this contradict the goal of keeping dask code in the event loop, since the callback contains dask code (client.submit, client.scatter)?

See here for the callback's "head of call stack".

Contributor

@ogrisel do you have some thoughts on this?

@ogrisel (Contributor) commented May 3, 2020

I think that for 90% of our users, the task-generating iterator will be fast to consume, so this should not be a problem for them.

But the iterator could also be a lazy data loader reading from a database or from a folder with large compressed files, in which case delegating to another thread is probably a good idea.

Maybe it would be worth experimenting with a synthetic, slow task iterator to see what this would mean for the current design of this PR?

This experiment could probably be turned into an integration test, both for the dask backend and for the other thread- or process-based backends.
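
A sketch of what such a synthetic slow-producer experiment might look like (the iterator, delay and task function are illustrative assumptions; it relies on the documented joblib.parallel_backend("dask") usage with a dask.distributed Client):

import time
from dask.distributed import Client
from joblib import Parallel, delayed, parallel_backend

def slow_task_iterator(n, produce_delay=0.1):
    # Simulate a lazy data loader that is slow to yield each task.
    for i in range(n):
        time.sleep(produce_delay)
        yield delayed(pow)(i, 2)

if __name__ == "__main__":
    client = Client()  # local cluster for the experiment
    with parallel_backend("dask"):
        results = Parallel()(slow_task_iterator(50))
    print(results[:5])
    client.close()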

@pierreglaser (Contributor)

@mrocklin we merged #1018 into master, so you might want to rebase.

mrocklin and others added 5 commits on April 25, 2020 at 09:13:

- Previously we would wait until joblib called lazy_result.get() before collecting results. This would trigger a transfer in the main thread, which would block things a bit. Now we collect results as soon as they are available using dask.distributed.as_completed. This helps to reduce overhead in joblib a bit and improve overall bandwidth.
- We recommend using dask.distributed imports. Also we remove some Python 2 compatibility bits.
- This removes some additional state and lets us clean up Dask futures more quickly.
@mrocklin force-pushed the dask-as-completed branch from 24da328 to e58f8f4 on April 25, 2020 16:18
@mrocklin (Contributor, Author)

Force pushed

@mrocklin (Contributor, Author)

I'm not sure that I understand the test failures. Are they unrelated perhaps?

@pierreglaser (Contributor)

Probably. Feel free to push an empty commit to trigger the CI again.

@pierreglaser (Contributor)

FYI I benchmarked this PR against joblib master on simple use-cases -- this PR yields great performance improvements! (y axis is shared)
[benchmark figure: dask-as-completed vs master]

@mrocklin (Contributor, Author)

Woo! That's really satisfying to see.

@mrocklin (Contributor, Author)

Do we want to go ahead with this, or wait until we resolve the issue about getting the next element from the iterator in a separate thread?

As a warning, I'm pretty saturated this week and am unlikely to work more on this in the next few days. Hopefully future weeks are better, but it's hard to predict.

@ogrisel (Contributor) left a comment

Since the slow input producer benchmark did not reveal any pathological issue, I would be in favor of postponing this change to the day we actually need it.

+1 for merge as is. Thank you very much @mrocklin and @pierreglaser for the benchmarks.

@ogrisel merged commit 36c06c2 into joblib:master on May 11, 2020
@ogrisel (Contributor) commented May 11, 2020

And thanks @lesteve as well!

@mrocklin deleted the dask-as-completed branch on May 11, 2020 16:54
@lesteve (Member) commented May 13, 2020

And thanks @lesteve as well!

I don't think I did much on this one, but it is great to see this merged! IMO this is the kind of issue at the intersection of two libraries that is fixable if you happen to have one expert on each library working hand-in-hand, but really tricky to fix otherwise.
