[MRG+1] Refactored Parallel backends #306
NielsZeilemaker wants to merge 6 commits into joblib:master
|
retest this please |
|
A few early comments:
Thanks a lot for this contribution, it will be useful! |
|
It also seems that the test results on appveyor are not stable, see https://ci.appveyor.com/project/joblib-ci/joblib/build/1.0.445/job/7li1lxc290hb640x Maybe @ogrisel has an idea. |
Sorry, I was not clear. I thought of adding another test function for the new TestParallelBackend class and, in the initial test_overwrite_default_backend, adding an assert to verify that the 'default' backend has been updated correctly. Something like this:
    def test_overwrite_default_backend():
        register_parallel_backend("default", VALID_BACKENDS["multiprocessing"])
        assert_equal(VALID_BACKENDS["default"], VALID_BACKENDS["multiprocessing"])

    def test_register_parallel_backend():
        register_parallel_backend("unit-testing", TestParallelBackend)
        assert_true("unit-testing" in VALID_BACKENDS)
        assert_equal(VALID_BACKENDS["unit-testing"], TestParallelBackend)
Maybe this is a bit overkill though. |
@GaelVaroquaux I'm not sure what you mean by having a context manager. Could you give an example? |
|
> @GaelVaroquaux I'm not sure what you mean by having a context manager. Could
> you give an example?
> Furthermore, I'll rebase and squash all commits after the pull request
> is approved by you guys.

Sounds good
|
|
@GaelVaroquaux I like the idea of a context manager. However, I feel it would be an extension to this pull request, and should therefore be a separate one. The context manager should then look something like: |
It would be good to have one-line docstrings on all these methods: it would help other developers understand the philosophy/purpose behind them.
I forgot to comment on this: I did not get the point of being able to set the default backend in a local context but maybe I am missing something. Can you not pass explicitly backend=whatever when you create your Parallel object? |
|
> I forgot to comment on this: I did not get the point of being able to set the
> default backend in a local context but maybe I am missing something. Can you
> not pass explicitly backend=whatever when you create your Parallel object?

When you are using a black-box algorithm / object over which you have no
control, e.g. a scikit-learn GridSearchCV.
|
Shouldn't this error be raised in all backends?
OK fair point. |
|
Still, those implementations would use the "default" backend. So overwriting the default backend in that case would suffice. |
The local default backend switch should be protected using try/finally in a context manager:
    from contextlib import contextmanager

    @contextmanager
    def parallel_backend(cls):
        # Remember the current default so it can be restored afterwards.
        old_backend = VALID_BACKENDS['default']
        try:
            register_parallel_backend('default', cls)
            yield
        finally:
            # Restore the previous default even if the block raised.
            register_parallel_backend('default', old_backend)
I'm not super fond of using the same function to override the default backend and register new backends. Maybe the default backend should be set using an explicit function |
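To make the black-box use case concrete, here is a minimal, self-contained sketch of the context manager above. The registry contents, the string placeholders standing in for backend classes, and the `black_box` and `demo` helpers are all hypothetical and only illustrate the behaviour; this is not the code from the PR.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the registry discussed in this thread.
# Strings are used instead of real backend classes to keep the sketch small.
VALID_BACKENDS = {
    "threading": "ThreadingBackend",
    "multiprocessing": "MultiprocessingBackend",
}
VALID_BACKENDS["default"] = VALID_BACKENDS["multiprocessing"]


def register_parallel_backend(name, backend):
    """Store a backend under the given name (toy version)."""
    VALID_BACKENDS[name] = backend


@contextmanager
def parallel_backend(backend):
    """Temporarily replace the default backend, restoring it on exit."""
    old_backend = VALID_BACKENDS["default"]
    try:
        register_parallel_backend("default", backend)
        yield
    finally:
        register_parallel_backend("default", old_backend)


def black_box():
    """Code we cannot modify: it always uses whatever the default is."""
    return VALID_BACKENDS["default"]


def demo():
    """Return the backend seen inside the block and after a failure."""
    seen = []
    try:
        with parallel_backend(VALID_BACKENDS["threading"]):
            seen.append(black_box())
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass
    # The finally clause has already restored the previous default.
    seen.append(black_box())
    return seen
```

The point of the try/finally is visible in `demo`: the default is restored even though the block raised.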
|
I actually implemented a public effective_n_jobs method at the request of @GaelVaroquaux. I think it should just be used as a best-effort sanity check, without the need to initialize the backend. I agree that exposing effective_n_jobs might not be needed, as the backend can auto-batch smaller jobs to reduce the overhead. |
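As a rough illustration of what such a best-effort check could compute, here is a sketch that assumes the usual n_jobs convention (-1 means all CPUs, -2 all but one). The `cpu_count` parameter is a hypothetical addition so the function can be exercised without touching the machine, and a real backend may well answer differently.

```python
import os


def effective_n_jobs(n_jobs=-1, cpu_count=None):
    """Best-effort estimate of how many workers would actually be used.

    No pool is started and no backend is initialized here.
    """
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if cpu_count is None:
        # os.cpu_count() can return None on some platforms.
        cpu_count = os.cpu_count() or 1
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, never fewer than one worker.
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs
```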
This feels too hackish / specific to me. |
|
FYI I am working on the default backend / context manager refactoring. |
|
I think that the programmer designing the parallel job can determine if a job must run on the threading backend or if it would be nice to run on the threading backend. So why not give him this possibility? |
|
This is what the |
|
That's what I meant, let him pass backend="prefer_threading" or backend="threading" to distinguish between should use threading and must use threading. |
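One possible reading of the "should" versus "must" distinction, as a sketch only: the `resolve_backend` helper, its parameters, and the rule that a context-level override beats a soft preference are assumptions made for illustration, not something this PR implements.

```python
PREFER_PREFIX = "prefer_"


def resolve_backend(requested, context_default=None,
                    global_default="multiprocessing"):
    """Pick a backend name from the caller's request.

    requested=None          -> the caller has no opinion
    requested="prefer_<x>"  -> soft hint: a context override wins over it
    requested="<x>"         -> hard requirement: always honoured
    """
    if requested is None:
        return context_default or global_default
    if requested.startswith(PREFER_PREFIX):
        return context_default or requested[len(PREFER_PREFIX):]
    return requested
```

With this rule, code that must share memory keeps "threading" no matter what the user sets, while code that merely runs well with threads can still be redirected to another backend.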
|
I implemented the context manager for switching the default backend in NielsZeilemaker#2. I will now give a deeper look at @mrocklin's prototype and in particular the |
Context manager to change the default backend
|
@ogrisel thanks for the pull request, I've merged it. My backend is available at https://github.com/NielsZeilemaker/yarnpool/blob/master/yarnbackend.py btw. |
|
@NielsZeilemaker I issued a new PR to further simplify the API and make it more flexible (by registering any callable as a factory). I will do more experiments to adapt @mrocklin PoC backend for distributed with this new API. Your own backend will need to be adapted to remove the parallel instance from the constructor and implement the configure function with the new arguments. |
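A small sketch of what "registering any callable as a factory" could look like. The `BACKENDS` dict, `ToyBackend`, `make_backend`, and the `configure(n_jobs=...)` signature are hypothetical placeholders; the actual arguments of configure are not spelled out in this thread.

```python
from functools import partial

# Hypothetical registry mapping a name to a zero-argument factory.
BACKENDS = {}


def register_parallel_backend(name, factory):
    """Register any callable that returns a backend instance."""
    if not callable(factory):
        raise TypeError("factory must be callable")
    BACKENDS[name] = factory


class ToyBackend:
    """Toy backend: the constructor no longer receives a Parallel instance."""

    def __init__(self, queue_name="default"):
        self.queue_name = queue_name
        self.n_jobs = None

    def configure(self, n_jobs=1):
        # Run-time settings arrive here instead of in the constructor.
        self.n_jobs = n_jobs
        return n_jobs


def make_backend(name, n_jobs):
    """Instantiate the backend lazily, then configure it."""
    backend = BACKENDS[name]()
    backend.configure(n_jobs=n_jobs)
    return backend


# A class is a valid factory, and so is a partial carrying extra settings.
register_parallel_backend("toy", ToyBackend)
register_parallel_backend("toy-batch", partial(ToyBackend, queue_name="batch"))
```

Accepting a plain callable means a plugin can bind cluster-specific settings (here the made-up `queue_name`) at registration time without subclassing.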
More simplification refactoring
|
Merged as #320. Thanks again @NielsZeilemaker! |
|
This is really great! I added a prototype for IPython parallel over at ipython/ipyparallel#122, and it seems really easy, so kudos! The one thing I'm not familiar enough with to really decide on is tuning the batching. The AutoBatchingMixin is slick, but I'm not sure how to get the most out of IPython Parallel with these APIs. To use IPython Parallel really efficiently, many more jobs than engines should be queued, so they can be making their way across the network while the engines are working on early tasks. Perhaps I should return a large number from Another question is that IPython has its own mechanisms for batching more efficiently than submitting one chunk at a time, especially with the DirectView API. It seems like to leverage that, I would need an |
|
> Merged as #320. Thanks again @NielsZeilemaker!

Yey. This is great. This is really, really excellent.
|
|
Great news indeed. I already went ahead and started to improve dask/knit to support dynamic allocation of workers on YARN, see dask/knit#51. @minrk I think the AutoBatchingMixin will mostly help to reduce the overhead of scheduling many small jobs. So if you have jobs which take more than 2 seconds to complete, it shouldn't do anything. In order to define your own batches, you can use the BatchedCalls object. And I guess we can modify the dispatch_one_batch method to not add another BatchedCalls object if the batch_size is equal to 1. |
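To illustrate why auto-batching leaves slow jobs alone, here is a simplified sketch of a duration-based batch-size rule. The 2-second upper bound comes from the comment above; the 0.2-second lower bound and the exact growth rule are assumptions for the sake of the example and may differ from what AutoBatchingMixin really does.

```python
# Target window for how long one batch should take, in seconds.
MIN_IDEAL_BATCH_DURATION = 0.2  # assumed lower bound
MAX_IDEAL_BATCH_DURATION = 2.0  # the 2-second figure mentioned above


def next_batch_size(batch_size, batch_duration):
    """Suggest the next batch size from the last batch's duration."""
    if 0 < batch_duration < MIN_IDEAL_BATCH_DURATION:
        # Batches finish too quickly: scheduling overhead dominates,
        # so grow the batch towards the lower bound (and a bit beyond).
        ideal = int(batch_size * MIN_IDEAL_BATCH_DURATION / batch_duration)
        return max(2 * ideal, 1)
    if batch_duration > MAX_IDEAL_BATCH_DURATION and batch_size >= 2:
        # Batches are too slow: halve them to keep workers balanced.
        return batch_size // 2
    # Inside the window, or a single slow task: leave the size unchanged.
    return batch_size
```

A single task that already takes longer than the upper bound stays at a batch size of 1, which matches the "it shouldn't do anything" remark.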
|
+1, I think the scheduling overhead of ipython parallel is on the same order of magnitude as the one for multiprocessing (on a local network), so I would advise to re-use the same magic constants. |
refactored into separate classes.
Some context: I am planning to create a YARN backend that lets you run a Python process in containers spawned by YARN. It's going to be a separate plugin, so that this project does not pick up all the HDFS/YARN dependencies. My end goal is to be able to run scikit-learn on YARN.
scikit-learn/scikit-learn#6223