
Fix dask scatter deadlock#914

Merged
ogrisel merged 7 commits into joblib:master from pierreglaser:fix-dask-scatter-deadlock
Sep 10, 2019

Conversation

@pierreglaser
Contributor

Builds on top of #910
Fixes #852

(I'm still not 100% sure of what's going on here.) This PR ensures the synchronous execution of distributed scatter operations by not running joblib's callbacks directly on the distributed client's event loop. In that situation, the operations can safely be made blocking.

I tried running the sklearn example in #852 and it runs fine. Hopefully the results are also correct.
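To illustrate the class of deadlock being fixed (this is a simplified toy model, not joblib's or distributed's actual code), the sketch below stands in for the client event loop with a single-threaded executor: a task running on that thread must not block waiting for other work the same thread has to run, whereas a caller on any other thread can block safely.

```python
# Toy model of the deadlock this PR avoids (assumption: simplified sketch,
# not the joblib/distributed implementation). A single-threaded "event loop"
# executor deadlocks if code running ON its thread blocks waiting for work
# that the same thread must execute.
from concurrent.futures import ThreadPoolExecutor
import threading

loop = ThreadPoolExecutor(max_workers=1)  # stands in for the client event loop
loop_thread = loop.submit(threading.current_thread).result()

def scatter():
    # Stands in for the actual scatter operation.
    return "scattered"

def safe_scatter():
    if threading.current_thread() is loop_thread:
        # On the loop thread: blocking here would deadlock, so return a
        # future and let the loop make progress instead.
        return loop.submit(scatter)
    # Off the loop thread: a blocking (synchronous) call is safe.
    return loop.submit(scatter).result()

# Called from a worker thread (here, the main thread): blocking is safe.
print(safe_scatter())  # scattered

# Called from the loop thread itself: we get a pending future back instead of
# a deadlock; it resolves once the loop thread is free again.
inner = loop.submit(safe_scatter).result()
print(inner.result())  # scattered
```

The PR takes the second branch's lesson: since joblib's callbacks are no longer run directly on the client's event loop, the scatter calls are always in the "off the loop thread" case and can be made blocking.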
@ogrisel

@codecov

codecov bot commented Jul 19, 2019

Codecov Report

Merging #914 into master will decrease coverage by 0.74%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #914      +/-   ##
==========================================
- Coverage    95.4%   94.66%   -0.75%     
==========================================
  Files          45       45              
  Lines        6460     6462       +2     
==========================================
- Hits         6163     6117      -46     
- Misses        297      345      +48
Impacted Files Coverage Δ
joblib/test/test_dask.py 96.55% <100%> (-1.96%) ⬇️
joblib/_dask.py 95.48% <100%> (ø) ⬆️
joblib/backports.py 39.58% <0%> (-54.17%) ⬇️
joblib/test/test_store_backends.py 91.17% <0%> (-5.89%) ⬇️
joblib/pool.py 89.65% <0%> (-1.73%) ⬇️
joblib/test/common.py 86.44% <0%> (-1.7%) ⬇️
joblib/disk.py 80% <0%> (-1.67%) ⬇️
joblib/logger.py 85.52% <0%> (-1.32%) ⬇️
joblib/func_inspect.py 94.31% <0%> (-1.14%) ⬇️
... and 4 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd585ce...f0e6879. Read the comment docs.

@codecov

codecov bot commented Jul 19, 2019

Codecov Report

Merging #914 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #914      +/-   ##
==========================================
- Coverage   95.42%   95.41%   -0.01%     
==========================================
  Files          45       45              
  Lines        6494     6462      -32     
==========================================
- Hits         6197     6166      -31     
+ Misses        297      296       -1
Impacted Files Coverage Δ
joblib/test/test_dask.py 98.52% <100%> (+0.01%) ⬆️
joblib/_dask.py 95.48% <100%> (ø) ⬆️
joblib/memory.py 95.69% <0%> (-0.54%) ⬇️
joblib/test/test_parallel.py 96.96% <0%> (-0.09%) ⬇️
joblib/parallel.py 97.28% <0%> (-0.06%) ⬇️
joblib/hashing.py 93.16% <0%> (ø) ⬆️
joblib/_parallel_backends.py 95.6% <0%> (+0.73%) ⬆️
joblib/test/test_module.py 100% <0%> (+5.55%) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb7a949...bc6c2e1. Read the comment docs.

@pierreglaser pierreglaser changed the title from [WIP] Fix dask scatter deadlock to Fix dask scatter deadlock on Jul 19, 2019
@pierreglaser
Contributor Author

Also, @mrocklin feel free to chime in and share your thoughts :)

@julioasotodv

julioasotodv commented Jul 22, 2019

@pierreglaser I just tried this patch with the following code:

import numpy as np

from dask.distributed import Client, LocalCluster
from joblib import Parallel, delayed, parallel_backend

def sum_values(array, scalar):
    return (array + scalar).sum()

cluster = LocalCluster()
client = Client(cluster)

with parallel_backend("dask"):
    results = Parallel()(delayed(sum_values)(np.zeros(i), 5) for i in range(50000, 50010))

print(results)

And now it does not hang and finishes as expected :) 🎉🎉 It used to hang forever until I pressed Ctrl-C.

If anyone can double-check this, it would be really helpful

@pierreglaser
Contributor Author

Thanks for the feedback -- I'm looking at it.

@julioasotodv

julioasotodv commented Jul 22, 2019

@pierreglaser sorry, edited my previous comment. It DOES work now :) (was testing the wrong branch)

@pierreglaser
Contributor Author

Great!

@jjerphan
Contributor

jjerphan commented Jul 29, 2019

It also works for me with the snippet provided by @julioasotodv. 🎉🎉 (cc @samronsin).

@jjerphan
Contributor

jjerphan commented Jul 29, 2019

I've also tried the snippet provided by @ogrisel on #852 (that originally comes from distributed#2532) and it also works now.

Python packages used
backcall==0.1.0
bokeh==0.13.0
Click==7.0
cloudpickle==1.2.1
dask==1.2.2
decorator==4.4.0
-e git+https://github.com/jjerphan/distributed.git@6ea010bcf21db7445bd26286966f59a7e75ab390#egg=distributed
HeapDict==1.0.0
ipython==7.7.0
ipython-genutils==0.2.0
jedi==0.14.1
Jinja2==2.10.1
-e git+https://github.com/pierreglaser/joblib.git@f0e687901adf309dc7e7f47af5a8d901383e67eb#egg=joblib
MarkupSafe==1.1.1
msgpack==0.6.1
numpy==1.15.4
packaging==19.0
pandas==0.23.4
parso==0.5.1
pexpect==4.7.0
pickleshare==0.7.5
prompt-toolkit==2.0.9
psutil==5.6.3
ptyprocess==0.6.0
Pygments==2.4.2
pyparsing==2.4.1.1
python-dateutil==2.8.0
pytz==2019.1
PyYAML==5.1.1
scikit-learn==0.20.3
scipy==1.1.0
six==1.12.0
sortedcontainers==2.1.0
tblib==1.4.0
toolz==0.10.0
tornado==5.1.1
traitlets==4.3.2
wcwidth==0.1.7
zict==1.0.0

@jlopezpena
Contributor

I have tested this PR and verified it also fixes the problem I reported in dask/dask#2665

@jakirkham

@pierreglaser , is this PR still waiting on something or is it ready to merge?

@pierreglaser
Contributor Author

It's missing a review by a core dev. @ogrisel should be back from vacation in a week or so.

@ogrisel
Contributor

ogrisel commented Sep 10, 2019

OK, I was trying to see if we could do a fully async / non-blocking version, but it sounds too cumbersome and would probably require a significant refactoring of joblib backends. I will push a what's new entry to re-trigger CI and then merge this PR.

Contributor

@ogrisel ogrisel left a comment
LGTM, merging.

@ogrisel ogrisel merged commit 591a8e6 into joblib:master Sep 10, 2019
@jakirkham

Thanks @ogrisel and @pierreglaser! 😀

@jjerphan
Contributor

Thanks @ogrisel and @pierreglaser! 😀 — bis

@mrocklin
Contributor

mrocklin commented Sep 10, 2019 via email

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 31, 2019
Release 0.14.0
Improved the load balancing between workers to avoid stragglers caused by an excessively large batch size when task durations vary significantly (because of the combined use of joblib.Parallel and joblib.Memory with a partially warmed cache, for instance). joblib/joblib#899
Add official support for Python 3.8: fixed protocol number in Hasher and updated tests.
Fix a deadlock when using the dask backend (when scattering large numpy arrays). joblib/joblib#914
Warn users that they should never use joblib.load with files from untrusted sources. Fix a security related API change introduced in numpy 1.16.3 that would prevent using joblib with recent numpy versions. joblib/joblib#879
Upgrade to cloudpickle 1.1.1, which adds support for the upcoming Python 3.8 release among other things. joblib/joblib#878
Fix semaphore availability checker to avoid spawning resource trackers on module import. joblib/joblib#893
Fix the oversubscription protection to only protect against nested Parallel calls. This allows joblib to be run in background threads. joblib/joblib#934
Fix ValueError (negative dimensions) when pickling large numpy arrays on Windows. joblib/joblib#920
Upgrade to loky 2.6.0, which adds support for setting environment variables in children before loading any module. joblib/joblib#940
Fix the oversubscription protection for native libraries using threadpools (OpenBLAS, MKL, Blis and OpenMP runtimes). The maximal number of threads can now be set in children using the inner_max_num_threads argument of parallel_backend. It defaults to cpu_count() // n_jobs.
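The inner_max_num_threads control mentioned in the last changelog entry can be used as sketched below (assumption: joblib >= 0.14 with the default loky backend installed; the work function is a placeholder).

```python
# Hedged example of capping threadpool size inside workers via
# parallel_backend (assumption: joblib >= 0.14, loky backend available).
from joblib import Parallel, delayed, parallel_backend

def work(i):
    # Placeholder for a task that would otherwise spawn BLAS/OpenMP threads.
    return i * i

# Limit native threadpools (OpenBLAS, MKL, ...) to 2 threads per worker,
# instead of the default cpu_count() // n_jobs, to avoid oversubscription.
with parallel_backend("loky", inner_max_num_threads=2):
    results = Parallel(n_jobs=2)(delayed(work)(i) for i in range(4))

print(results)  # [0, 1, 4, 9]
```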

Development

Successfully merging this pull request may close these issues:

KeyError with joblib and sklearn cross_validate