[MRG+1] Refactored Parallel backends#306

Closed
NielsZeilemaker wants to merge 6 commits into joblib:master from NielsZeilemaker:custom_backend

Conversation

@NielsZeilemaker
Contributor

  • The sequential, threadpool, and multiprocessing backends are now
    refactored into separate classes.
  • Fixed tests accordingly

Some context: I am planning to create a YARN backend that lets you run a Python process in containers spawned by YARN. It's going to be a separate plugin, so that this project does not have to carry all the hdfs/yarn dependencies. My end goal is to be able to run scikit-learn on YARN.

scikit-learn/scikit-learn#6223

Comment thread joblib/parallel.py Outdated
Contributor

This line is too long

@NielsZeilemaker
Contributor Author

retest this please

@GaelVaroquaux
Member

A few early comments:

  • It would be good to have a context manager to set the backend in a
    local context
  • The file _backends.py should probably be called _parallel_backends.py

Thanks a lot for this contribution, it will be useful!

@aabadie
Contributor

aabadie commented Jan 26, 2016

It also seems that the test results on appveyor are not stable, see https://ci.appveyor.com/project/joblib-ci/joblib/build/1.0.445/job/7li1lxc290hb640x

Maybe @ogrisel has an idea.

Comment thread joblib/test/test_parallel.py Outdated
Contributor

Sorry, I was not clear. I was thinking of adding another test function for the new TestParallelBackend class, and of adding an assert to the initial test_overwrite_default_backend to verify that the 'default' backend has been correctly updated. Something like this:

def test_overwrite_default_backend():
    register_parallel_backend("default", VALID_BACKENDS["multiprocessing"])
    assert_equal(VALID_BACKENDS["default"], VALID_BACKENDS["multiprocessing"])

def test_register_parallel_backend():
    register_parallel_backend("unit-testing", TestParallelBackend)
    assert_true("unit-testing" in VALID_BACKENDS)
    assert_equal(VALID_BACKENDS["unit-testing"], TestParallelBackend)

Maybe this is a bit overkill though.

@NielsZeilemaker
Contributor Author

@GaelVaroquaux I'm not sure what you mean by having a context manager. Could you give an example?
Furthermore, I'll rebase and squash all commits after the pull request is approved by you guys.

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 26, 2016 via email

@NielsZeilemaker
Contributor Author

@GaelVaroquaux I like the idea of a context manager. However, I feel it would be an extension to this pull request, and should therefore be a separate one.

The contextmanager should then look something like:

@contextmanager
def parallel_backend(cls):
    old_backend = VALID_BACKENDS['default']
    register_parallel_backend('default', cls)
    yield
    register_parallel_backend('default', old_backend)

Comment thread joblib/_parallel_backends.py Outdated
Member

It would be good to have one-line docstrings on all these methods: it would help other developers understand the philosophy/purpose behind them.

Contributor Author

Done

@lesteve
Member

lesteve commented Jan 27, 2016

> It would be good to have a context manager to set the backend in a local context

I forgot to comment on this: I did not get the point of being able to set the default backend in a local context but maybe I am missing something. Can you not pass explicitly backend=whatever when you create your Parallel object?

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 27, 2016 via email

Comment thread joblib/_parallel_backends.py Outdated
Member

Shouldn't this error be raised in all backends?

Contributor Author

Done

@lesteve
Member

lesteve commented Jan 27, 2016

> > I forgot to comment on this: I did not get the point of being able to set the
> > default backend in a local context but maybe I am missing something. Can you
> > not pass explicitly backend=whatever when you create your Parallel object?
>
> When you are using a blackbox algorithm / object on which you have no
> control. Eg a scikit-learn GridSearchCV.

OK fair point.

@NielsZeilemaker
Contributor Author

Still, those implementations would use the "default" backend. So overwriting the default backend in that case would suffice.

@aabadie
Contributor

aabadie commented Jan 27, 2016

> The contextmanager should then look something like:
> ....

The local default backend switch should be protected with try/finally to ensure the old default backend is correctly restored if the block fails.
Something like this:

@contextmanager
def parallel_backend(cls):
    old_backend = VALID_BACKENDS['default']
    try:
        register_parallel_backend('default', cls)
        yield
    finally:
        register_parallel_backend('default', old_backend)

I'm not super fond of using the same function both to override the default backend and to register new backends. Maybe the default backend should be set using an explicit function, set_default_parallel_backend?
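To make the two suggestions above concrete, here is a minimal, self-contained sketch. It assumes a toy VALID_BACKENDS registry holding plain strings instead of backend classes; set_default_parallel_backend is the name proposed in this comment, and the behaviour shown here (registration refusing to touch 'default') is an assumption for illustration, not joblib's actual code.

```python
from contextlib import contextmanager

# Toy stand-in for the registry discussed in this thread (assumption:
# strings are used here in place of real backend classes).
VALID_BACKENDS = {
    "threading": "ThreadingBackend",
    "multiprocessing": "MultiprocessingBackend",
}
VALID_BACKENDS["default"] = VALID_BACKENDS["multiprocessing"]


def register_parallel_backend(name, cls):
    """Register a new named backend; refuses to override the default."""
    if name == "default":
        raise ValueError("use set_default_parallel_backend instead")
    VALID_BACKENDS[name] = cls


def set_default_parallel_backend(cls):
    """Explicitly replace the default backend."""
    VALID_BACKENDS["default"] = cls


@contextmanager
def parallel_backend(cls):
    """Temporarily switch the default backend, restoring it on exit."""
    old_backend = VALID_BACKENDS["default"]
    set_default_parallel_backend(cls)
    try:
        yield
    finally:
        # Runs even if the body of the with-block raises.
        set_default_parallel_backend(old_backend)


def default_after_failure():
    """Return the default backend after a with-block that raises."""
    try:
        with parallel_backend(VALID_BACKENDS["threading"]):
            raise RuntimeError("simulated failure inside the block")
    except RuntimeError:
        pass
    return VALID_BACKENDS["default"]
```

Without the try/finally, the simulated failure would leave "ThreadingBackend" installed as the default for the rest of the process.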

@NielsZeilemaker
Contributor Author

I actually implemented a public effective_n_jobs method at the request of @GaelVaroquaux. I think it should just be used as a best-effort sanity check, so it should not require initializing the backend.

I agree that exposing the effective_n_jobs might not be needed as the backend can autobatch smaller jobs to reduce the overhead.

@ogrisel
Contributor

ogrisel commented Mar 2, 2016

> Maybe we can add an alias for threading which allows it to be overwritten with another backend.

This feels too hackish / specific to me.

@ogrisel
Contributor

ogrisel commented Mar 2, 2016

FYI I am working on the default backend / context manager refactoring.

@NielsZeilemaker
Contributor Author

I think that the programmer designing the parallel job can determine if a job must run on the threading backend or if it would be nice to run on the threading backend. So why not give him this possibility?

@ogrisel
Contributor

ogrisel commented Mar 2, 2016

This is what the backend argument of Parallel class is for.

@NielsZeilemaker
Contributor Author

That's what I meant, let him pass backend="prefer_threading" or backend="threading" to distinguish between should use threading and must use threading.
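A rough sketch of the distinction proposed here, purely for illustration: the "prefer_" prefix comes from this comment, while resolve_backend, KNOWN_BACKENDS and the user_default parameter are hypothetical names and not part of joblib. A soft request yields to a default the user has overridden; a hard request never does.

```python
# Hypothetical sketch: resolve_backend, KNOWN_BACKENDS and user_default
# are illustrations only, not joblib API.
KNOWN_BACKENDS = {"threading", "multiprocessing"}
SOFT_PREFIX = "prefer_"


def resolve_backend(requested, user_default=None):
    """Return the name of the backend that should run the job.

    "prefer_X" is a soft hint: if the user installed another default,
    that default wins. A plain "X" is a hard requirement.
    """
    if requested.startswith(SOFT_PREFIX):
        preferred = requested[len(SOFT_PREFIX):]
        if preferred not in KNOWN_BACKENDS:
            raise ValueError("unknown backend: %r" % preferred)
        return user_default if user_default is not None else preferred
    if requested not in KNOWN_BACKENDS:
        raise ValueError("unknown backend: %r" % requested)
    return requested
```

For example, with a user-installed "yarn" default, resolve_backend("prefer_threading", user_default="yarn") selects "yarn", while resolve_backend("threading", user_default="yarn") still selects "threading".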

@ogrisel
Contributor

ogrisel commented Mar 3, 2016

I implemented the context manager for switching the default backend in NielsZeilemaker#2. I will now take a deeper look at @mrocklin's prototype, in particular the effective_n_jobs / initialize and backend parameter issues.

Context manager to change the default backend
@NielsZeilemaker
Contributor Author

@ogrisel thanks for the pull request, I've merged it. My backend is available at https://github.com/NielsZeilemaker/yarnpool/blob/master/yarnbackend.py btw.

@ogrisel
Contributor

ogrisel commented Mar 7, 2016

@NielsZeilemaker I issued a new PR to further simplify the API and make it more flexible (by registering any callable as a factory). I will do more experiments to adapt @mrocklin's PoC backend for distributed to this new API.

Your own backend will need to be adapted to remove the parallel instance from the constructor and implement the configure function with the new arguments.
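As a rough illustration of what "registering any callable as a factory" can look like, here is a self-contained toy sketch. The BACKENDS dict, ToyBackend, make_backend, and the configure(n_jobs=..., **backend_args) signature are all assumptions made for this example; the actual arguments are defined in the new PR, not here.

```python
from functools import partial

# Toy registry: maps a backend name to a zero-argument factory (assumption).
BACKENDS = {}


def register_parallel_backend(name, factory):
    """Register any callable that returns a backend instance."""
    if not callable(factory):
        raise TypeError("factory must be callable")
    BACKENDS[name] = factory


class ToyBackend:
    """Backend built without a Parallel instance in its constructor."""

    def __init__(self, host="localhost"):
        self.host = host
        self.n_jobs = None

    def configure(self, n_jobs=1, **backend_args):
        # Hypothetical signature: settings arrive here, after construction.
        self.n_jobs = n_jobs
        return n_jobs


# A class is a factory, and so is a partial carrying constructor arguments.
register_parallel_backend("toy", ToyBackend)
register_parallel_backend(
    "toy-remote", partial(ToyBackend, host="scheduler.example"))


def make_backend(name, n_jobs):
    """Instantiate the named backend, then configure it."""
    backend = BACKENDS[name]()
    backend.configure(n_jobs=n_jobs)
    return backend
```

Keeping construction separate from configuration means a factory can carry backend-specific settings (such as the hypothetical host above) without the registry knowing about them.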

@ogrisel
Contributor

ogrisel commented Mar 14, 2016

Merged as #320. Thanks again @NielsZeilemaker!

@minrk

minrk commented Mar 15, 2016

This is really great! I added a prototype for IPython parallel over at ipython/ipyparallel#122, and it seems really easy, so kudos!

The one thing I'm not familiar enough with to really decide on is tuning the batching. The AutoBatchingMixin is slick, but I'm not sure how to get the most out of IPython Parallel with these APIs. To use IPython Parallel really efficiently, many more jobs than engines should be queued, so they can be making their way across the network while the engines are working on early tasks. Perhaps I should return a large number from effective_n_jobs?

Another question is that IPython has its own mechanisms for batching more efficiently than submitting one chunk at a time, especially with the DirectView API. It seems like to leverage that, I would need an apply_batch_async, not just the single-call apply_async. Could there be a mechanism for the backend to fire N jobs in a single call, that defaults to mapping apply_async?
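The mechanism asked about here could be sketched roughly as follows. This is an illustration under assumptions, not joblib's API: apply_batch_async is the hypothetical method named in this comment, and BackendBase, ThreadBackend and run_batch are toy names built on the standard library's ThreadPoolExecutor.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


class BackendBase:
    """Toy base class: backends must implement apply_async."""

    def apply_async(self, func, callback=None):
        raise NotImplementedError

    def apply_batch_async(self, funcs, callback=None):
        # Default: fall back to one apply_async call per job. A backend
        # with a native bulk-submit API could override this method.
        return [self.apply_async(func, callback) for func in funcs]


class ThreadBackend(BackendBase):
    """Toy backend that only implements the single-call API."""

    def __init__(self, n_workers=2):
        self._pool = ThreadPoolExecutor(max_workers=n_workers)

    def apply_async(self, func, callback=None):
        future = self._pool.submit(func)
        if callback is not None:
            future.add_done_callback(lambda f: callback(f.result()))
        return future

    def terminate(self):
        self._pool.shutdown()


def square(x):
    return x * x


def run_batch(values):
    """Submit one job per value in a single call and collect the results."""
    backend = ThreadBackend()
    try:
        futures = backend.apply_batch_async(
            [partial(square, v) for v in values])
        return [f.result() for f in futures]
    finally:
        backend.terminate()
```

With this shape, existing backends keep working unchanged, and only backends that can submit many jobs at once need to override the batch method.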

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 15, 2016 via email

@NielsZeilemaker
Contributor Author

Great news indeed. I already went ahead and started improving dask/knit to support dynamic allocation of workers on YARN, see dask/knit#51.

@minrk I think the AutoBatchingMixin will mostly help to reduce the overhead of scheduling many small jobs. So if you have jobs which take more than 2 seconds to complete, it shouldn't do anything.
Increasing effective_n_jobs won't do much, as the Parallel class itself doesn't do much with that information.

In order to define your own batches, you can use the BatchedCalls object. And I guess we can modify the dispatch_one_batch method to not add another BatchedCalls object if the batch_size is equal to 1.
https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L573
@ogrisel any comment on this?
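The auto-batching behaviour described in this comment can be pictured with a small sketch. This is not the AutoBatchingMixin implementation: next_batch_size is a hypothetical helper, the 2-second upper bound is taken from the comment above, and the 0.2-second lower bound and the doubling/halving rule are assumptions for illustration.

```python
# Assumed thresholds, in seconds. Only the 2-second figure comes from
# this thread; the lower bound is a guess for illustration.
MIN_IDEAL_BATCH_DURATION = 0.2
MAX_IDEAL_BATCH_DURATION = 2.0


def next_batch_size(batch_size, batch_duration):
    """Pick the next batch size from how long the last batch took."""
    if batch_duration < MIN_IDEAL_BATCH_DURATION:
        # Jobs are tiny: group more of them to amortize scheduling overhead.
        return batch_size * 2
    if batch_duration > MAX_IDEAL_BATCH_DURATION and batch_size >= 2:
        # Batches are slow: shrink them to keep workers evenly loaded.
        return batch_size // 2
    return batch_size
```

Under these assumptions, a single job that already takes longer than 2 seconds stays in a batch of size 1, which matches the remark that auto-batching should not do anything for long jobs.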

@ogrisel
Contributor

ogrisel commented Mar 15, 2016

+1, I think the scheduling overhead of IPython parallel is of the same order of magnitude as that of multiprocessing (on a local network), so I would advise reusing the same magic constants.

8 participants