ENH Improve load-balancing between workers for large batch sizes. #899
Conversation
The benefit is particularly marked for a large number of jobs. Before merging this PR, the functionality should be mentioned in one or two places in the docs / README / website, so that people know it is there. It's important so that they realize the benefit, and so that they understand the behavior of the library (two purposes, hence probably two documentation entries).
For anyone interested, I did more thorough benchmarks + explained a lot of things in this notebook.
Also, when running
Codecov Report
@@ Coverage Diff @@
## master #899 +/- ##
==========================================
- Coverage 95.43% 95.18% -0.25%
==========================================
Files 45 45
Lines 6459 6484 +25
==========================================
+ Hits 6164 6172 +8
- Misses 295 312 +17
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #899 +/- ##
==========================================
- Coverage 95.42% 95.09% -0.34%
==========================================
Files 45 45
Lines 6497 6521 +24
==========================================
+ Hits 6200 6201 +1
- Misses 297 320 +23
Continue to review full report at Codecov.
Force-pushed from db14478 to fb7cc86
Rebased.
Some comments/questions:
Look ahead in the tasks iterator to make sure batch size over-estimation does not lead to unbalanced batches, which could create stragglers and harm speedups.
Force-pushed from fb7cc86 to 9cb5f40
Rebased + addressed the review comments. I can re-run the benchmarks with the default
That would be great. Thanks!
There is an unprotected import queue that fails under Python 2: Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-dtWIPy/setup.py", line 6, in <module>
import joblib
File "joblib/__init__.py", line 119, in <module>
from .parallel import Parallel
File "joblib/parallel.py", line 19, in <module>
import queue
ImportError: No module named queue
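For reference, a minimal sketch of a Python 2/3 compatible guard for this import (the actual fix applied in the PR may differ):

```python
# Minimal sketch, not necessarily the fix applied in the PR: Python 2 ships
# this module as `Queue`, Python 3 as `queue`, so guard the import.
try:
    import queue
except ImportError:  # Python 2
    import Queue as queue
```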
We can ignore the PEP8 failure; I think this is already the new way to deal with line breaks and binary operators. If you have the chance, could you please re-run the benchmarks to make sure that pre-dispatch did not cause any perf regression?
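For context, a small sketch of the line-break style alluded to here (PEP 8 now recommends breaking before binary operators), assuming that is indeed the warning being ignored:

```python
# Break *before* binary operators, as PEP 8 now recommends; older linter
# configurations may still flag this style.
revenue, cost_of_goods_sold, operating_expenses = 120.0, 45.0, 30.0
gross_margin = (revenue
                - cost_of_goods_sold
                - operating_expenses)
print(gross_margin)  # 45.0
```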
Here is the output of
There is only one increase (the last one), but I saw it consistently on different machines (a machine from the INRIA center, my personal MacBook Pro, ...).
thanks! I think this is fine, though. Merge?
Please add an entry to the changelog and let's merge.
I have to say I like this solution a lot! Thanks @pierreglaser for all the benchmarking work + implementations!
Some nitpicks:
my bad, gross mistake.
Was it not caught by a test? If so, maybe a new test?
Yes, I'm going to investigate why the test passed. It should have broken everything.
Merging as the
Thanks a lot @pierreglaser, I think this is a big improvement (at the very least for my workflow! 😉)
Release 0.14.0
- Improved the load balancing between workers to avoid stragglers caused by an excessively large batch size when the task duration varies significantly (because of the combined use of joblib.Parallel and joblib.Memory with a partially warmed cache, for instance). joblib/joblib#899
- Add official support for Python 3.8: fixed protocol number in Hasher and updated tests.
- Fix a deadlock when using the dask backend (when scattering large numpy arrays). joblib/joblib#914
- Warn users that they should never use joblib.load with files from untrusted sources. Fix a security-related API change introduced in numpy 1.16.3 that would prevent using joblib with recent numpy versions. joblib/joblib#879
- Upgrade to cloudpickle 1.1.1, which adds support for the upcoming Python 3.8 release, among other things. joblib/joblib#878
- Fix semaphore availability checker to avoid spawning resource trackers on module import. joblib/joblib#893
- Fix the oversubscription protection to only protect against nested Parallel calls. This allows joblib to be run in background threads. joblib/joblib#934
- Fix ValueError (negative dimensions) when pickling large numpy arrays on Windows. joblib/joblib#920
- Upgrade to loky 2.6.0, which adds support for setting environment variables in children before loading any module. joblib/joblib#940
- Fix the oversubscription protection for native libraries using threadpools (OpenBLAS, MKL, Blis and OpenMP runtimes). The maximal number of threads can now be set in children using the inner_max_num_threads argument of parallel_backend. It defaults to cpu_count() // n_jobs (see the usage sketch below).
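A usage sketch of the inner_max_num_threads option (assuming joblib >= 0.14; heavy_matmul is a placeholder function written for this example, not part of joblib):

```python
# Usage sketch (assumption: joblib >= 0.14): cap each worker at two inner
# BLAS/OpenMP threads so that 4 processes x 2 threads does not oversubscribe
# the machine. `heavy_matmul` is a placeholder for any threadpool-backed task.
import numpy as np
from joblib import Parallel, delayed, parallel_backend


def heavy_matmul(seed, size=500):
    rng = np.random.RandomState(seed)
    a = rng.rand(size, size)
    return (a @ a).sum()


with parallel_backend("loky", n_jobs=4, inner_max_num_threads=2):
    results = Parallel()(delayed(heavy_matmul)(i) for i in range(8))
print(results[:2])
```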
This PR tries to improve load-balancing between workers in joblib, mainly in two ways: by creating batches that are balanced both in terms of number of tasks and in terms of total running time.
Ensuring a balanced number of tasks per batch
Previously, the tasks iterator consumed by joblib was sliced batch_size tasks at a time. This can lead to unbalanced batches when we reach the end of the iterator. I propose to slice the tasks iterator batch_size * n_jobs tasks at a time instead. The resulting n_jobs batches are not dispatched immediately, but stored in a local queue that subsequent callback-triggered dispatch_one_batch calls will try to access before re-slicing the iterator. If the queue is empty, the batch size is re-computed, the iterator is re-sliced, and the queue is re-populated (see the sketch below).
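A minimal standalone sketch of this dispatch strategy (illustrative only, not joblib's actual implementation; the function names and signatures are invented for the example):

```python
# Slice batch_size * n_jobs tasks at once, split them into n_jobs equally
# sized batches, and drain a local queue before slicing the iterator again.
import itertools
import queue


def refill(task_iterator, batch_size, n_jobs, batch_queue):
    """Slice batch_size * n_jobs tasks and enqueue them as up to n_jobs batches."""
    tasks = list(itertools.islice(task_iterator, batch_size * n_jobs))
    for start in range(0, len(tasks), batch_size):
        batch_queue.put(tasks[start:start + batch_size])


def next_batch(task_iterator, batch_size, n_jobs, batch_queue):
    """Return the next batch, re-slicing the iterator only when the queue is empty."""
    if batch_queue.empty():
        # In the PR, the batch size would be re-estimated at this point.
        refill(task_iterator, batch_size, n_jobs, batch_queue)
    try:
        return batch_queue.get_nowait()
    except queue.Empty:
        return None  # iterator exhausted


tasks = iter(range(10))
batch_queue = queue.Queue()
while True:
    batch = next_batch(tasks, batch_size=2, n_jobs=3, batch_queue=batch_queue)
    if batch is None:
        break
    print(batch)  # [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```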
Reducing running-time variance between batches
The higher the running-time variance between batches, the more likely we are to create stragglers, which decrease joblib's speedup compared to the serial case. The variance can be reduced by reducing the batch size, so I propose to be more conservative when increasing the batch size (an illustrative update rule is sketched below).
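To make "more conservative" concrete, here is an illustrative batch-size update rule (invented for this example, not joblib's actual auto-batching formula):

```python
# Illustrative only: aim for a target batch duration, but cap the growth
# factor so that one unusually fast batch cannot inflate the batch size and
# create stragglers later on.
def next_batch_size(current_size, last_batch_duration,
                    target_duration=0.2, max_growth=2.0):
    if last_batch_duration <= 0:
        return current_size
    ideal = current_size * target_duration / last_batch_duration
    # Conservative update: never more than double the batch size at once,
    # and never go below one task per batch.
    return max(1, int(min(ideal, current_size * max_growth)))


print(next_batch_size(4, 0.01))   # fast tasks: grow, but capped at 8
print(next_batch_size(64, 1.5))   # slow tasks: shrink towards the target (8)
```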
This plot summarizes the speedups for a set of benchmarks defined in the joblib_benchmarks repository. Each point (x, y) is a benchmark result:
x = total running time using joblib master
y = total running time using this PR
Any point above the y=x line is a performance regression; any point below it is a performance improvement.