
Fix dask scatter deadlock #914

Merged: 7 commits merged into joblib:master on Sep 10, 2019
Conversation

@pierreglaser (Contributor) commented Jul 19, 2019

Builds on top of #910
Fixes #852

(I'm still not 100% sure of what's going on here.) This PR ensures the synchronous execution of distributed scatter operations by not running joblib's callbacks directly in the distributed Client's event loop. In this situation, the scatter operations can safely be made blocking.
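For readers skimming the discussion, here is a minimal illustrative sketch of that idea. It is not the actual joblib/_dask.py change; the scatter_blocking helper and the local cluster setup are made up for illustration. The point is only that a synchronous Client.scatter is safe when issued from an ordinary thread, whereas blocking from inside the client's own event loop starves the very loop that has to serve the request.

# Illustrative sketch only -- not the actual joblib/_dask.py diff.
import numpy as np
from dask.distributed import Client, LocalCluster

def scatter_blocking(client, data):
    # This runs in a regular thread (e.g. a joblib callback thread), never
    # on the client's IO loop, so the blocking call below cannot starve the
    # loop that has to process the scatter request.
    [future] = client.scatter([data])  # blocks until the workers hold `data`
    return future

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2)
    client = Client(cluster)
    fut = scatter_blocking(client, np.zeros(1_000_000))
    print(fut)  # a distributed Future pointing at the scattered array
    client.close()
    cluster.close()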

I tried running the sklearn example in #852 and it runs fine. Hopefully the results are correct as well.
@ogrisel

codecov bot commented Jul 19, 2019

Codecov Report

Merging #914 into master will decrease coverage by 0.74%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #914      +/-   ##
==========================================
- Coverage    95.4%   94.66%   -0.75%     
==========================================
  Files          45       45              
  Lines        6460     6462       +2     
==========================================
- Hits         6163     6117      -46     
- Misses        297      345      +48
Impacted Files Coverage Δ
joblib/test/test_dask.py 96.55% <100%> (-1.96%) ⬇️
joblib/_dask.py 95.48% <100%> (ø) ⬆️
joblib/backports.py 39.58% <0%> (-54.17%) ⬇️
joblib/test/test_store_backends.py 91.17% <0%> (-5.89%) ⬇️
joblib/pool.py 89.65% <0%> (-1.73%) ⬇️
joblib/test/common.py 86.44% <0%> (-1.7%) ⬇️
joblib/disk.py 80% <0%> (-1.67%) ⬇️
joblib/logger.py 85.52% <0%> (-1.32%) ⬇️
joblib/func_inspect.py 94.31% <0%> (-1.14%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd585ce...f0e6879. Read the comment docs.

codecov bot commented Jul 19, 2019

Codecov Report

Merging #914 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #914      +/-   ##
==========================================
- Coverage   95.42%   95.41%   -0.01%     
==========================================
  Files          45       45              
  Lines        6494     6462      -32     
==========================================
- Hits         6197     6166      -31     
+ Misses        297      296       -1
Impacted Files Coverage Δ
joblib/test/test_dask.py 98.52% <100%> (+0.01%) ⬆️
joblib/_dask.py 95.48% <100%> (ø) ⬆️
joblib/memory.py 95.69% <0%> (-0.54%) ⬇️
joblib/test/test_parallel.py 96.96% <0%> (-0.09%) ⬇️
joblib/parallel.py 97.28% <0%> (-0.06%) ⬇️
joblib/hashing.py 93.16% <0%> (ø) ⬆️
joblib/_parallel_backends.py 95.6% <0%> (+0.73%) ⬆️
joblib/test/test_module.py 100% <0%> (+5.55%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb7a949...bc6c2e1. Read the comment docs.

pierreglaser changed the title from "[WIP] Fix dask scatter deadlock" to "Fix dask scatter deadlock" on Jul 19, 2019
@pierreglaser (Contributor, Author) commented Jul 19, 2019

Also, @mrocklin feel free to chime in and share your thoughts :)

@julioasotodv commented Jul 22, 2019

@pierreglaser I just tried this patch with the following code:

import numpy as np

from dask.distributed import Client, LocalCluster
from joblib import Parallel, delayed, parallel_backend

def sum_values(array, scalar):
    return (array + scalar).sum()

cluster = LocalCluster()
client = Client(cluster)

with parallel_backend("dask"):
    results = Parallel()(delayed(sum_values)(np.zeros(i), 5) for i in range(50000, 50010))

print(results)

And now it does not hang and finishes as expected :) 🎉🎉 It used to hang forever until I hit Ctrl-C.

If anyone can double-check this, it would be really helpful

@pierreglaser (Contributor, Author) commented Jul 22, 2019

Thanks for the feedback -- I'm looking at it.

@julioasotodv commented Jul 22, 2019

@pierreglaser sorry, I edited my previous comment. It DOES work now :) (I was testing the wrong branch)

@pierreglaser (Contributor, Author) commented Jul 22, 2019

Great!

@jjerphan commented Jul 29, 2019

It also works for me with the snippet provided by @julioasotodv. 🎉🎉 (cc @samronsin)

@jjerphan commented Jul 29, 2019

I've also tried the snippet provided by @ogrisel on #852 (that originally comes from distributed#2532) and it also works now.

Python packages used
backcall==0.1.0
bokeh==0.13.0
Click==7.0
cloudpickle==1.2.1
dask==1.2.2
decorator==4.4.0
-e git+https://github.com/jjerphan/distributed.git@6ea010bcf21db7445bd26286966f59a7e75ab390#egg=distributed
HeapDict==1.0.0
ipython==7.7.0
ipython-genutils==0.2.0
jedi==0.14.1
Jinja2==2.10.1
-e git+https://github.com/pierreglaser/joblib.git@f0e687901adf309dc7e7f47af5a8d901383e67eb#egg=joblib
MarkupSafe==1.1.1
msgpack==0.6.1
numpy==1.15.4
packaging==19.0
pandas==0.23.4
parso==0.5.1
pexpect==4.7.0
pickleshare==0.7.5
prompt-toolkit==2.0.9
psutil==5.6.3
ptyprocess==0.6.0
Pygments==2.4.2
pyparsing==2.4.1.1
python-dateutil==2.8.0
pytz==2019.1
PyYAML==5.1.1
scikit-learn==0.20.3
scipy==1.1.0
six==1.12.0
sortedcontainers==2.1.0
tblib==1.4.0
toolz==0.10.0
tornado==5.1.1
traitlets==4.3.2
wcwidth==0.1.7
zict==1.0.0
@jlopezpena (Contributor) commented Aug 16, 2019

I have tested this PR and verified it also fixes the problem I reported in dask/dask#2665

@jakirkham commented Aug 20, 2019

@pierreglaser, is this PR still waiting on something, or is it ready to merge?

@pierreglaser (Contributor, Author) commented Aug 20, 2019

It's missing a review by a core dev. @ogrisel should be back from vacation in a week or so.

@ogrisel (Contributor) commented Sep 10, 2019

OK, I was trying to see if you could do a fully async / non-blocking version, but it sounds too cumbersome and would probably require a significant refactoring of joblib's backends. I will push a what's new entry to re-trigger CI and then merge this PR.

@ogrisel (Contributor) left a review

LGTM, merging.

@ogrisel merged commit 591a8e6 into joblib:master on Sep 10, 2019

3 checks passed:
ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@jakirkham commented Sep 10, 2019

Thanks @ogrisel and @pierreglaser! 😀

@jjerphan commented Sep 10, 2019

Thanks @ogrisel and @pierreglaser! 😀 — bis

@mrocklin (Contributor) commented Sep 10, 2019

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request on Oct 31, 2019
Release 0.14.0
Improved the load balancing between workers to avoid stragglers caused by an excessively large batch size when task durations vary significantly (for instance because of the combined use of joblib.Parallel and joblib.Memory with a partially warmed cache). joblib/joblib#899
Add official support for Python 3.8: fixed protocol number in Hasher and updated tests.
Fix a deadlock when using the dask backend (when scattering large numpy arrays). joblib/joblib#914
Warn users that they should never use joblib.load with files from untrusted sources. Fix security related API change introduced in numpy 1.16.3 that would prevent using joblib with recent numpy versions. joblib/joblib#879
Upgrade to cloudpickle 1.1.1, which adds support for the upcoming Python 3.8 release among other things. joblib/joblib#878
Fix semaphore availability checker to avoid spawning resource trackers on module import. joblib/joblib#893
Fix the oversubscription protection to only protect against nested Parallel calls. This allows joblib to be run in background threads. joblib/joblib#934
Fix ValueError (negative dimensions) when pickling large numpy arrays on Windows. joblib/joblib#920
Upgrade to loky 2.6.0, which adds support for setting environment variables in children before loading any module. joblib/joblib#940
Fix the oversubscription protection for native libraries using threadpools (OpenBLAS, MKL, Blis and OpenMP runtimes). The maximal number of threads can now be set in children using the inner_max_num_threads argument of parallel_backend; it defaults to cpu_count() // n_jobs (see the usage sketch after this list).
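As a quick usage sketch for that last item: only the inner_max_num_threads parameter and its default come from the entry above; the backend name, thread count, and workload below are illustrative.

import numpy as np
from joblib import Parallel, delayed, parallel_backend

def gram_trace(a):
    # a @ a.T calls into a native BLAS threadpool (OpenBLAS, MKL, ...);
    # without a cap, each of the n_jobs workers may try to use every core.
    return np.trace(a @ a.T)

with parallel_backend("loky", inner_max_num_threads=2):
    # Each worker process caps its native threadpools at 2 threads instead
    # of the cpu_count() // n_jobs default.
    results = Parallel(n_jobs=4)(
        delayed(gram_trace)(np.random.rand(200, 200)) for _ in range(8)
    )
print(results)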