
ENH Improve load-balancing between workers for large batch sizes. #899

Merged: 9 commits into joblib:master on Sep 24, 2019

Conversation

pierreglaser
Contributor

This PR improves load-balancing between workers in joblib in two main ways: it creates batches that are balanced both in number of tasks and in total running time.

Ensuring a balanced number of tasks per batch

Previously, the tasks iterator consumed by joblib was sliced batch_size tasks at a time. This can lead to unbalanced batches when we reach the end of the iterator: for example, with 10 remaining tasks and batch_size = 4, the final batches contain 4, 4 and 2 tasks, so the worker that gets the last batch finishes well before the others.

I propose to slice the tasks iterator batch_size * n_jobs tasks at a time instead. The resulting n_jobs batches are not dispatched immediately, but stored in a local queue that subsequent callback-triggered dispatch_one_batch calls will try to access before re-slicing the iterator. If the queue is empty, the batch size is re-computed, the iterator is re-sliced, and the queue is re-populated.
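To make the mechanism concrete, here is a minimal sketch of the slicing-and-queueing idea; the helper name, queue handling and splitting arithmetic are illustrative assumptions, not the actual code in joblib/parallel.py:

    import itertools
    import queue

    def refill_ready_batches(task_iterator, batch_size, n_jobs, ready_batches):
        # Pull up to batch_size * n_jobs tasks from the iterator in one go.
        big_slice = list(itertools.islice(task_iterator, batch_size * n_jobs))
        if not big_slice:
            return  # iterator exhausted, nothing to queue
        # Split the slice into at most n_jobs batches of nearly equal size and
        # store them so that later dispatch_one_batch calls can pop them before
        # re-slicing the iterator.
        n_batches = min(n_jobs, len(big_slice))
        base, extra = divmod(len(big_slice), n_batches)
        start = 0
        for i in range(n_batches):
            size = base + (1 if i < extra else 0)
            ready_batches.put(big_slice[start:start + size])
            start += size

    # Example: 10 remaining tasks, batch_size=3, n_jobs=4 gives batches of
    # 3, 3, 2 and 2 tasks instead of 3, 3, 3 and a lone task.
    ready = queue.Queue()
    refill_ready_batches(iter(range(10)), batch_size=3, n_jobs=4, ready_batches=ready)
    while not ready.empty():
        print(ready.get())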

Reducing running-time variance between batches

The higher the running-time variance between batches, the greater the risk of creating stragglers, which decrease joblib's speedup compared to the serial case. Reducing the variance can be done by reducing the batch size. Thus, I propose to be more conservative when increasing the batch size.
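For illustration, here is one way such a conservative update could look; the constant and the doubling cap are assumptions of this sketch, not the exact heuristic in joblib's AutoBatchingMixin:

    MIN_IDEAL_BATCH_DURATION = 0.2  # target batch duration in seconds (sketch value)

    def update_batch_size(old_batch_size, batch_duration):
        # Naive target: scale the batch size so that one batch lasts roughly
        # MIN_IDEAL_BATCH_DURATION seconds.
        ideal = max(1, int(old_batch_size * MIN_IDEAL_BATCH_DURATION / batch_duration))
        # Conservative step: never more than double the batch size at once,
        # which keeps the running-time variance between batches in check.
        return min(ideal, 2 * old_batch_size)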

This plot summarizes the speedups for a set of benchmarks defined in the joblib_benchmarks repository: each point (x, y) is one benchmark result, where
x = total running time using joblib master
y = total running time using this PR

Any point above the y = x line is a performance regression; any point below it is a performance improvement.

[Figure: benchmark running times, joblib master (x-axis) vs. this PR (y-axis)]

@GaelVaroquaux
Member

The benefit is particularly marked for a large number of jobs.

Before merging this PR, the functionality should be mentioned in one or two places in the docs / README / website, so that people know it is there. It's important so that they realize the benefit, and so that they understand the behavior of the library (two purposes, hence probably two documentation entries).

@pierreglaser
Contributor Author

For anyone interested, I did more thorough benchmarks + explained a lot of things in this notebook.

@pierreglaser
Contributor Author

Also, when running Parallel on a set of long, but partially cached tasks, we cannot protect ourselves from stragglers unless we tell the Parallel object which tasks were cached and which were not. I recall @ogrisel not being willing to add this extra bit of logic into joblib, is that still the case? If so, I think we should add a warning to the memory documentation recommending to explicitly set batch_size to 1 when running a set of partially cached tasks.
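As a usage sketch of that recommendation (the cache directory, task function and sizes are placeholders), forcing batch_size=1 on a partially cached workload would look like this:

    import time
    from joblib import Memory, Parallel, delayed

    memory = Memory("/tmp/joblib_cache", verbose=0)

    @memory.cache
    def costly(x):
        time.sleep(1)  # long task; cached calls return almost instantly
        return x * x

    # Cached and uncached calls have wildly different durations, so letting
    # auto-batching grow the batch size can pack several slow calls into one
    # straggling batch; batch_size=1 trades dispatch overhead for balance.
    results = Parallel(n_jobs=4, batch_size=1)(delayed(costly)(x) for x in range(100))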

@codecov

codecov bot commented Jul 2, 2019

Codecov Report

Merging #899 into master will decrease coverage by 0.24%.
The diff coverage is 86.2%.

@@            Coverage Diff             @@
##           master     #899      +/-   ##
==========================================
- Coverage   95.43%   95.18%   -0.25%     
==========================================
  Files          45       45              
  Lines        6459     6484      +25     
==========================================
+ Hits         6164     6172       +8     
- Misses        295      312      +17
Impacted Files Coverage Δ
joblib/_parallel_backends.py 93.28% <60%> (-1.12%) ⬇️
joblib/parallel.py 96.6% <91.66%> (-0.69%) ⬇️
joblib/disk.py 81.66% <0%> (-6.67%) ⬇️
joblib/backports.py 87.5% <0%> (-6.25%) ⬇️
joblib/test/test_hashing.py 98.36% <0%> (-0.55%) ⬇️
joblib/test/test_parallel.py 96.55% <0%> (-0.42%) ⬇️
joblib/test/test_numpy_pickle.py 98.19% <0%> (-0.17%) ⬇️
joblib/logger.py 88.15% <0%> (+1.31%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@codecov

codecov bot commented Jul 2, 2019

Codecov Report

Merging #899 into master will decrease coverage by 0.33%.
The diff coverage is 92.85%.

@@            Coverage Diff             @@
##           master     #899      +/-   ##
==========================================
- Coverage   95.42%   95.09%   -0.34%     
==========================================
  Files          45       45              
  Lines        6497     6521      +24     
==========================================
+ Hits         6200     6201       +1     
- Misses        297      320      +23
Impacted Files Coverage Δ
joblib/parallel.py 97.21% <100%> (-0.13%) ⬇️
joblib/_parallel_backends.py 94.92% <60%> (-2.32%) ⬇️
joblib/testing.py 87.5% <0%> (-7.5%) ⬇️
joblib/func_inspect.py 92.04% <0%> (-3.41%) ⬇️
joblib/test/test_parallel.py 95.98% <0%> (-1.07%) ⬇️
joblib/logger.py 88.15% <0%> (+1.31%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pierreglaser
Contributor Author

Rebased.

@ogrisel
Contributor

ogrisel left a comment

Some comments/questions:

joblib/parallel.py (3 review threads, resolved)
Look-ahead in the tasks iterator to make sure batch size over-estimation
will not lead to unbalanced batches, eventually creating stragglers and
harming speedups.
@pierreglaser
Contributor Author

Rebased + addressed the review comments. I can re-run the benchmarks with the default pre_dispatch value to make sure the last commits did not affect performance.

@ogrisel
Contributor

ogrisel commented Sep 17, 2019

I can re-run the benchmarks with the default pre_dispatch value to make sure the last commits did not affect performance.

That would be great. Thanks!

@ogrisel
Contributor

ogrisel commented Sep 17, 2019

There is an unprotected import of the queue module that fails under Python 2:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-dtWIPy/setup.py", line 6, in <module>
        import joblib
      File "joblib/__init__.py", line 119, in <module>
        from .parallel import Parallel
      File "joblib/parallel.py", line 19, in <module>
        import queue
    ImportError: No module named queue
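A common way to guard that import for Python 2 compatibility (shown as the usual pattern, not necessarily the exact fix applied here) is:

    try:
        import queue            # Python 3
    except ImportError:
        import Queue as queue   # Python 2 fallback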

@ogrisel
Contributor

ogrisel commented Sep 18, 2019

We can ignore the PEP8 failure; I think this is already the new recommended way to handle line breaks around binary operators.

If you have the chance, could you please re-run the benchmarks to make sure that the pre-dispatch change did not cause any performance regression?

@pierreglaser
Contributor Author

Here is the output of asv compare between before and after the pre_dispatch change:

          1.84±0m          1.84±0m     1.00  bench_auto_batching.AutoBatchingSuite.time_cyclic_trend(10000, 1) [drago/conda-py3.5-numpy]
          56.1±0s          56.1±0s     1.00  bench_auto_batching.AutoBatchingSuite.time_cyclic_trend(10000, 2) [drago/conda-py3.5-numpy]
-         31.6±0s          28.3±0s     0.89  bench_auto_batching.AutoBatchingSuite.time_cyclic_trend(10000, 4) [drago/conda-py3.5-numpy]
          42.2±0s          42.2±0s     1.00  bench_auto_batching.AutoBatchingSuite.time_high_variance_no_trend(10000, 1) [drago/conda-py3.5-numpy]
          23.0±0s          22.3±0s     0.97  bench_auto_batching.AutoBatchingSuite.time_high_variance_no_trend(10000, 2) [drago/conda-py3.5-numpy]
          11.6±0s          11.3±0s     0.97  bench_auto_batching.AutoBatchingSuite.time_high_variance_no_trend(10000, 4) [drago/conda-py3.5-numpy]
          41.4±0s          41.4±0s     1.00  bench_auto_batching.AutoBatchingSuite.time_low_variance_no_trend(10000, 1) [drago/conda-py3.5-numpy]
          20.9±0s          20.5±0s     0.98  bench_auto_batching.AutoBatchingSuite.time_low_variance_no_trend(10000, 2) [drago/conda-py3.5-numpy]
          10.5±0s          10.3±0s     0.98  bench_auto_batching.AutoBatchingSuite.time_low_variance_no_trend(10000, 4) [drago/conda-py3.5-numpy]
          50.4±0s          50.4±0s     1.00  bench_auto_batching.AutoBatchingSuite.time_partially_cached(10000, 1) [drago/conda-py3.5-numpy]
          26.3±0s          26.4±0s     1.00  bench_auto_batching.AutoBatchingSuite.time_partially_cached(10000, 2) [drago/conda-py3.5-numpy]
+         14.2±0s          16.2±0s     1.14  bench_auto_batching.AutoBatchingSuite.time_partially_cached(10000, 4) [drago/conda-py3.5-numpy]

There is only one slowdown (the last one), but I saw it consistently on different machines (a machine from the INRIA center, my personal MacBook Pro, ...).

@ogrisel
Contributor

ogrisel commented Sep 18, 2019

Thanks! I think this is fine, though. Merge?

@ogrisel
Contributor

ogrisel commented Sep 18, 2019

Please add an entry to the changelog and let's merge.

CHANGES.rst (review thread, resolved)

@tomMoral
Contributor

tomMoral left a comment

I have to say I like this solution a lot! Thanks @pierreglaser for all the benchmarking work + implementation!

Some nitpicks:

joblib/_parallel_backends.py (1 review thread, resolved)
joblib/parallel.py (3 review threads, resolved)
@GaelVaroquaux
Member

GaelVaroquaux commented Sep 24, 2019 via email

@pierreglaser
Contributor Author

Yes, I'm going to investigate why the test passed. It should have broken everything.

@tomMoral tomMoral changed the title Improve load-balancing between workers for large batch sizes. ENH Improve load-balancing between workers for large batch sizes. Sep 24, 2019
@tomMoral tomMoral merged commit dec1595 into joblib:master Sep 24, 2019
@tomMoral
Contributor

Merging, as the PEP8 failure should be ignored (or fixed elsewhere after choosing a standard between W503 and W504). The previous mistake was silent, so it did not need extra tests.

Thanks a lot @pierreglaser, I think this is a big improvement (at the very least for my work-flow! 😉 )

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Oct 31, 2019
Release 0.14.0
Improved the load balancing between workers to avoid stragglers caused by an excessively large batch size when the task duration varies significantly (because of the combined use of joblib.Parallel and joblib.Memory with a partially warmed cache, for instance). joblib/joblib#899
Add official support for Python 3.8: fixed protocol number in Hasher and updated tests.
Fix a deadlock when using the dask backend (when scattering large numpy arrays). joblib/joblib#914
Warn users that they should never use joblib.load with files from untrusted sources. Fix security related API change introduced in numpy 1.16.3 that would prevent using joblib with recent numpy versions. joblib/joblib#879
Upgrade to cloudpickle 1.1.1, which adds support for the upcoming Python 3.8 release among other things. joblib/joblib#878
Fix semaphore availability checker to avoid spawning resource trackers on module import. joblib/joblib#893
Fix the oversubscription protection to only protect against nested Parallel calls. This allows joblib to be run in background threads. joblib/joblib#934
Fix ValueError (negative dimensions) when pickling large numpy arrays on Windows. joblib/joblib#920
Upgrade to loky 2.6.0, which adds support for setting environment variables in child processes before loading any module. joblib/joblib#940
Fix the oversubscription protection for native libraries using threadpools (OpenBLAS, MKL, Blis and OpenMP runtimes). The maximal number of threads can now be set in children using the inner_max_num_threads argument of parallel_backend. It defaults to cpu_count() // n_jobs.
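As a usage sketch of that last setting (backend, thread count and workload are placeholders), the limit is passed through parallel_backend:

    from joblib import Parallel, delayed, parallel_backend

    def blas_heavy(i):
        # stand-in for a NumPy/BLAS-heavy computation
        return i * i

    # Cap the OpenBLAS/MKL/OpenMP threadpools at 2 threads per worker instead
    # of the default cpu_count() // n_jobs.
    with parallel_backend("loky", n_jobs=4, inner_max_num_threads=2):
        results = Parallel()(delayed(blas_heavy)(i) for i in range(8))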