
FEA Add EngineAwareMixin to factor the common logic of estimators which are plugin-extendable #13

Merged · 13 commits · Jan 26, 2023

Conversation

@betatim (Collaborator) commented Dec 21, 2022

Adds a mixin for estimators to use when they support having an engine.

Also adds a decorator that can be used on `fit` to convert the estimator's attributes to either `engine_native` or `sklearn_native` types.
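A minimal sketch of what that decorator could look like (the `engine_attributes` config key and the `convert_to_sklearn` hook are illustrative assumptions, not the PR's actual API):

```python
import functools

from sklearn._config import get_config


def convert_attributes(method):
    """Convert fitted attributes after ``method`` (typically ``fit``) runs."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Hypothetical config key selecting the attribute representation.
        if get_config().get("engine_attributes") == "sklearn_types":
            for name, value in vars(self).items():
                # By scikit-learn convention, fitted attributes end with
                # a single trailing underscore.
                if name.endswith("_") and not name.startswith("_"):
                    # Hypothetical engine hook turning an engine-native
                    # value (e.g. a GPU array) into a NumPy-backed one.
                    setattr(self, name, self._engine.convert_to_sklearn(value))
        return result

    return wrapper
```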

WDYT?

Notes: I considered using something like `__init_subclass__` to automagically wrap `fit`, but decided that "explicit is better than implicit". The `__init_subclass__` approach seemed cool because all you have to do is add the mixin, but it felt too magical that doing so would silently wrap `fit`. So I switched to a decorator.
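For contrast, the rejected `__init_subclass__` variant would have looked roughly like this (a sketch, not code from this PR):

```python
class EngineAwareMixin:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Automagically wrap ``fit``: adding the mixin is all a subclass
        # has to do, but the wrapping is invisible at the definition site.
        if "fit" in cls.__dict__:
            cls.fit = convert_attributes(cls.fit)
```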

@betatim betatim marked this pull request as ready for review January 5, 2023 14:00
@betatim betatim changed the title WIP Add engine aware mixin to factor out engine stuff Add engine aware mixin to factor out engine stuff Jan 6, 2023
@fcharras (Collaborator) commented:

Sorry I wasn't aware of this PR until now. Will take a look!

@ogrisel (Owner) left a comment

This looks great! Just a few minor suggestions below:

(Resolved review threads on sklearn/_engine/base.py, sklearn/_config.py, and sklearn/_engine/tests/test_engines.py.)
@betatim (Collaborator, Author) commented Jan 13, 2023

Implemented the suggestions almost exactly as given. I changed the suggested docstring and renamed the conversion function (not sure I like the name, but it seems better than "convert to numpy" when really it converts to sklearn types).

@betatim (Collaborator, Author) commented Jan 17, 2023

@ogrisel what do you think about the changes in the last commit? They make it possible to perform input validation in the "acceptance" function. Any ideas for improving the return value? I dislike the `True, input_vals_dict` pattern. Maybe the engine should store the values on its instance instead, so we don't pass them in to `prepare_fit` and so on? That would make it clear that the engine can do what it wants with the values (modify them, convert them, etc.) and that they won't change between calling validate and the other steps of fitting.
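The alternative floated here would look roughly like this (the method and attribute names are hypothetical):

```python
import numpy as np


class SomeEngine:
    def accepts(self, X, y=None, sample_weight=None):
        # Validate and convert once, keeping the results on the engine
        # instance instead of returning (True, {"X": X, ...}).
        self.X_ = np.asarray(X)  # stand-in for real validation/conversion
        self.y_ = None if y is None else np.asarray(y)
        self.sample_weight_ = sample_weight
        return True

    def prepare_fit(self):
        # Later steps read the stored values, so they cannot change
        # between acceptance and the rest of the fit.
        return self.X_, self.y_, self.sample_weight_
```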

@ogrisel (Owner) left a comment

Thanks for the update @betatim. Here is a first pass of feedback.

I would also love to hear from @fcharras on the API design proposed in this PR and on my own suggestions.

(Resolved review threads on sklearn/_engine/base.py and sklearn/cluster/_kmeans.py.)
```python
    Should fail as quickly as possible.
    """
    # The default engine accepts everything and does not convert inputs
    return True, {"X": X, "y": y, "sample_weight": sample_weight}
```
@ogrisel (Owner) commented on this hunk:

I have the feeling that we should factor the calls to `self.estimator._validate_data` into the engine, remove the `_check_test_data` method, and simplify the `_prepare_fit` method.

Otherwise, the default engine would not behave consistently with non-default engines and therefore would not serve as an educational implementation template for third-party engine implementers.
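Concretely, that would mean the default engine validates through the estimator, the same way a third-party engine would have to (a sketch; the exact class and hook names in the PR may differ):

```python
import numpy as np


class DefaultKMeansEngine:
    def __init__(self, estimator):
        self.estimator = estimator

    def prepare_fit(self, X, y=None, sample_weight=None):
        # Single validation path via the estimator, replacing the
        # separate _check_test_data method.
        X = self.estimator._validate_data(
            X, accept_sparse="csr", dtype=[np.float64, np.float32]
        )
        return X, y, sample_weight
```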

(Resolved review thread on sklearn/base.py.)
@jjerphan (Collaborator) left a comment

Thanks, @betatim.

At this stage, this LGTM. I agree with @ogrisel's remarks and have a few comments.

Comment on lines +302 to +304:

```python
# XXX Maybe a bit useless as it should never get called, but it
# does demonstrate the API
return value
```
@jjerphan (Collaborator) commented:
LGTM for now. I think this is a no-op generally.

(Resolved review thread on sklearn/cluster/_kmeans.py.)
Comment on this hunk adding `convert_attributes`:

```diff
@@ -112,3 +112,37 @@ def get_engine_classes(engine_name, default, verbose=False):
         f"trying engine {engine_class.__module__}.{engine_class.__qualname__} ."
     )
     yield provider, engine_class


 def convert_attributes(method):
```
@jjerphan (Collaborator) commented:
Note for future iteration: could methods decorated by `convert_attributes` be called concurrently? If so, should we add some handling to make converting the attributes thread-safe?

@ogrisel (Owner) replied:

I think we can assume that calling `fit` concurrently on the same estimator instance is not thread-safe.

However, calling `fit` on different clones of the same estimator should always be thread-safe.
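In other words, concurrent callers should each fit their own clone, e.g. (an illustration, not code from this PR):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
chunks = [rng.normal(size=(100, 5)) for _ in range(4)]

base = KMeans(n_clusters=3)


def fit_clone(estimator, X):
    # Each worker fits its own clone, so no fitted state is shared.
    return clone(estimator).fit(X)


models = Parallel(n_jobs=4)(delayed(fit_clone)(base, X) for X in chunks)
```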

(Resolved review thread on sklearn/cluster/_kmeans.py.)
@fcharras (Collaborator) commented Jan 18, 2023

Thanks for the work @betatim

I also caught up with the PR. I like the overall approach and don't have anything to add besides what has already been suggested.

In particular, as @ogrisel suggested, I think `_get_engine` is supposed to return input validated by the engine even when `hasattr(self, "_engine_provider") and not reset`? Or is there another reason for not validating? (It seems like an error, since later on the prediction methods expect already-validated input.)

When it's merged I'll open a PR on our GPU plugin to see how it goes!

@ogrisel (Owner) commented Jan 18, 2023

Just to make sure that this is visible, I replied in the previous thread:

#13 (comment)

@betatim (Collaborator, Author) commented Jan 20, 2023

I've updated the code so that `accept` does not return the inputs. We discussed this two days ago and settled on having separate "acceptance" and "conversion/validation" phases.

To test it, I also added a first draft of being able to pass an engine class to `config_context(engine_provider=[SomeEngineAsAClass, "some_other_plugin_name"])`. This way you can define an engine at runtime and pass it in, which solves the problem that you previously couldn't define a new engine at runtime because you can't register a new entry point.
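Usage would look like this (with `SomeEngineAsAClass` standing in for any engine class defined at runtime, and `X` some input data):

```python
from sklearn import config_context
from sklearn.cluster import KMeans

with config_context(
    engine_provider=[SomeEngineAsAClass, "some_other_plugin_name"]
):
    # SomeEngineAsAClass is tried directly, no entry point needed;
    # "some_other_plugin_name" is still resolved via entry points.
    KMeans().fit(X)
```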

@ogrisel (Owner) left a comment

Thanks, this looks good to me. I would just like to better define the case where `reset=False` with a previously resolved engine class, as explained below.

(Resolved review threads on sklearn/_engine/tests/test_engines.py, sklearn/base.py, sklearn/cluster/_kmeans.py, sklearn/_engine/base.py, and sklearn/_config.py.)
@fcharras (Collaborator) commented Jan 23, 2023

So we've come full circle 😁

It would help if you could highlight, with a few examples, how this is intended to be used for inputs a user could reasonably provide, and for edge cases within the range of supported inputs (lists of lists, input dtypes that might require casting depending on the engine, ...). What would you recommend to plugin developers regarding the engine's behavior in each case (does `accept` return True or False, does `prepare_*` raise an error, and which kind of error)?

If we're back to the same `accept`, as a plugin developer I'm still tempted to come back to soda-inria/sklearn-numba-dpex#74 (where `accept` uses `asarray` and stores the converted array before returning True), but improving it so that it raises errors for cases where we can be sure all other engines would fail too (e.g. wrong `ndim`).

If you consider that bad practice, I'd suggest throwing away the engine instance after `accept` and re-instantiating it later. But then I think the acceptance criterion should be expected to be light and maybe non-exhaustive, because triggering a conversion just for acceptance seems a bit wasteful.

@ogrisel (Owner) commented Jan 23, 2023

> If we're back to the same `accept`, as a plugin developer I'm still tempted to come back to soda-inria/sklearn-numba-dpex#74 (where `accept` uses `asarray` and stores the converted array before returning True), but improving it so that it raises errors for cases where we can be sure all other engines would fail too (e.g. wrong `ndim`).

I would rather not store the attribute. I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

Later we might refine the `accept` logic of the sklearn_numba_dpex engines so that, for inputs which have a `__dlpack__` method, we inspect the device, and if it's not a device managed by the SYCL runtime, return False.
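So the whole acceptance check for that engine would stay small, roughly (the class and method names here are assumptions):

```python
import scipy.sparse


class KMeansEngine:
    def __init__(self, estimator):
        self.estimator = estimator

    def accepts(self, X, y=None, sample_weight=None):
        # Unsupported hyper-parameter: only Lloyd is implemented.
        if self.estimator.algorithm != "lloyd":
            return False
        # Unsupported input type: no sparse support on this engine.
        if scipy.sparse.issparse(X):
            return False
        # Possible later refinement: inspect the __dlpack__ device and
        # reject inputs not managed by the SYCL runtime.
        return True
```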

@fcharras (Collaborator) commented:

> I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

👍 I'm fine with that. 👍

@jjerphan (Collaborator) commented Jan 23, 2023

Small interjection: do you think the following would be a better title for this PR?

FEA Add `EngineAwareMixin` to factor the common logic of estimators which are plugin-extendable

@betatim (Collaborator, Author) commented Jan 25, 2023

> I would rather not store the attribute. I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

Quoting this because I like it as an answer to the question of how `accept` should work with `prepare_*`.

In my mind, `accept` should return False if either the hyper-parameters are not supported, or you know you can't convert the input data to a type the engine supports (e.g. sparse data), or the engine doesn't want to handle it (hypothetical example: because the input is on the wrong device).

To use an analogy: `accept` is like checking whether the plug on the end of a cable fits. A 220V power plug won't fit into an ethernet port -> return False. But of course, just because you have a Cat5 cable with the right plug doesn't mean that the other end of the cable is also plugged into an ethernet port and that the data that will arrive is ethernet (some crazy person might actually try to send 220V over a Cat5 cable...).

This means input validation, like shape and such, should happen in the `prepare_*` methods.

I think it is good that, say, a "cupy plugin" accepts input that is a cupy array, but then raises an exception during `prepare_*` if something is wrong with the contents of the array. The alternative of not accepting would mean that some other plugin (likely the default engine) gets the input and then raises an exception related to data conversion, when the real problem is the contents. Getting the more specific error from the "cupy plugin" is better.
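In code, that division of labour for the hypothetical cupy plugin would be something like (a sketch; the class and method names are assumptions):

```python
import cupy


class CupyEngine:
    def accepts(self, X, y=None, sample_weight=None):
        # "Does the plug fit?": only take inputs already on the GPU.
        return isinstance(X, cupy.ndarray)

    def prepare_fit(self, X, y=None, sample_weight=None):
        # "Is the data itself sane?": raise the specific error here rather
        # than letting another engine fail on a data conversion instead.
        if not cupy.isfinite(X).all():
            raise ValueError("X contains NaN or infinity")
        return X, y, sample_weight
```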

@betatim betatim changed the title Add engine aware mixin to factor out engine stuff FEA Add EngineAwareMixin to factor the common logic of estimators which are plugin-extendable Jan 25, 2023
@betatim (Collaborator, Author) commented Jan 25, 2023

> Small interjection: do you think the following would be a better title for this PR?
>
> FEA Add `EngineAwareMixin` to factor the common logic of estimators which are plugin-extendable

That is indeed a better title. Switched to it. However, I think this PR is becoming a collection of several ideas (e.g. it "accidentally" also implements the idea of having ad-hoc engines that aren't registered via entry points), so IMHO we should try to wrap this one up and start new PRs for new ideas instead of putting them all here.

(Resolved review thread on sklearn/_engine/base.py.)
@ogrisel (Owner) left a comment

LGTM with the previous and following suggestions for class filtering by `engine_name`.

(Review thread on sklearn/_engine/tests/test_engines.py.)