
FEA Add EngineAwareMixin to factor the common logic of estimators which are plugin-extendable #13

Merged · 13 commits · Jan 26, 2023

Conversation

@betatim (Collaborator) commented Dec 21, 2022

Adds a mixin for estimators to use when they support having an engine.

Also adds a decorator that can be used on `fit` to convert the estimator's attributes to either `engine_native` or `sklearn_native` types.
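A minimal sketch of what that decorator could look like (the `engine_attributes` config key and the `convert_to_sklearn` hook are illustrative assumptions, not the PR's actual API):

```python
import functools

from sklearn._config import get_config


def convert_attributes(method):
    """Convert fitted attributes after ``method`` (typically ``fit``) runs."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Hypothetical config key selecting the attribute representation.
        if get_config().get("engine_attributes") == "sklearn_types":
            for name, value in vars(self).items():
                # By scikit-learn convention, fitted attributes end with
                # a single trailing underscore.
                if name.endswith("_") and not name.startswith("_"):
                    # Hypothetical engine hook turning an engine-native
                    # value (e.g. a GPU array) into a NumPy-backed one.
                    setattr(self, name, self._engine.convert_to_sklearn(value))
        return result

    return wrapper
```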

WDYT?

Notes: I considered using something like `__init_subclass__` to automagically wrap `fit`, but decided that "explicit is better than implicit". The `__init_subclass__` approach seemed cool because all you have to do is add the mixin, but it felt too magical that doing so would silently wrap `fit`. So I switched to a decorator.
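For contrast, the rejected `__init_subclass__` variant would have looked roughly like this (a sketch, not code from this PR):

```python
class EngineAwareMixin:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Automagically wrap ``fit``: adding the mixin is all a subclass
        # has to do, but the wrapping is invisible at the definition site.
        if "fit" in cls.__dict__:
            cls.fit = convert_attributes(cls.fit)
```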

@betatim betatim marked this pull request as ready for review January 5, 2023 14:00
@betatim betatim changed the title WIP Add engine aware mixin to factor out engine stuff Add engine aware mixin to factor out engine stuff Jan 6, 2023
@fcharras (Collaborator) commented:

Sorry I wasn't aware of this PR until now. Will take a look!

@ogrisel (Owner) left a comment

This looks great! Just a few minor suggestions below:

(Resolved review threads on sklearn/_engine/base.py, sklearn/_config.py, and sklearn/_engine/tests/test_engines.py.)
@betatim (Collaborator, Author) commented Jan 13, 2023

Implemented the suggestions almost exactly as given. I changed the suggested docstring and renamed the conversion function (not sure I like the name, but it seems better than "convert to numpy" when really it converts to sklearn types).

@betatim (Collaborator, Author) commented Jan 17, 2023

@ogrisel what do you think about the changes in the last commit? They make it possible to perform input validation in the "acceptance" function. Any ideas for improving the return value? I dislike the `True, input_vals_dict` pattern. Maybe the engine should store the values on its instance instead, so we don't pass them in to `prepare_fit` and so on? That would make it clear that the engine can do what it wants with the values (modify them, convert them, etc.) and that they won't change between calling validate and the other steps of fitting.
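The alternative floated here would look roughly like this (the method and attribute names are hypothetical):

```python
import numpy as np


class SomeEngine:
    def accepts(self, X, y=None, sample_weight=None):
        # Validate and convert once, keeping the results on the engine
        # instance instead of returning (True, {"X": X, ...}).
        self.X_ = np.asarray(X)  # stand-in for real validation/conversion
        self.y_ = None if y is None else np.asarray(y)
        self.sample_weight_ = sample_weight
        return True

    def prepare_fit(self):
        # Later steps read the stored values, so they cannot change
        # between acceptance and the rest of the fit.
        return self.X_, self.y_, self.sample_weight_
```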

@ogrisel (Owner) left a comment

Thanks for the update @betatim. Here is a first pass of feedback.

I would also love to hear from @fcharras on the API design proposed in this PR and on my own suggestions.

(Resolved review threads on sklearn/_engine/base.py and sklearn/cluster/_kmeans.py.)
```python
    Should fail as quickly as possible.
    """
    # The default engine accepts everything and does not convert inputs
    return True, {"X": X, "y": y, "sample_weight": sample_weight}
```
@ogrisel (Owner) commented on this hunk:

I have the feeling that we should factor the calls to `self.estimator._validate_data` into the engine, remove the `_check_test_data` method, and simplify the `_prepare_fit` method.

Otherwise, the default engine would not behave consistently with non-default engines and therefore would not serve as an educational implementation template for third-party engine implementers.
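Concretely, that would mean the default engine validates through the estimator, the same way a third-party engine would have to (a sketch; the exact class and hook names in the PR may differ):

```python
import numpy as np


class DefaultKMeansEngine:
    def __init__(self, estimator):
        self.estimator = estimator

    def prepare_fit(self, X, y=None, sample_weight=None):
        # Single validation path via the estimator, replacing the
        # separate _check_test_data method.
        X = self.estimator._validate_data(
            X, accept_sparse="csr", dtype=[np.float64, np.float32]
        )
        return X, y, sample_weight
```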

(Resolved review thread on sklearn/base.py.)
@jjerphan (Collaborator) left a comment

Thanks, @betatim.

At this stage, this LGTM. I agree with @ogrisel's remarks and have a few comments.

Comment on lines +302 to +304:

```python
# XXX Maybe a bit useless as it should never get called, but it
# does demonstrate the API
return value
```
@jjerphan (Collaborator) commented:
LGTM for now. I think this is a no-op generally.

(Resolved review thread on sklearn/cluster/_kmeans.py.)
Comment on this hunk adding `convert_attributes`:

```diff
@@ -112,3 +112,37 @@ def get_engine_classes(engine_name, default, verbose=False):
         f"trying engine {engine_class.__module__}.{engine_class.__qualname__} ."
     )
     yield provider, engine_class


 def convert_attributes(method):
```
@jjerphan (Collaborator) commented:
Note for future iteration: could methods decorated by `convert_attributes` be called concurrently? If so, should we add some handling to make converting the attributes thread-safe?

@ogrisel (Owner) replied:

I think we can assume that calling `fit` concurrently on the same estimator instance is not thread-safe.

However, calling `fit` on different clones of the same estimator should always be thread-safe.
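In other words, concurrent callers should each fit their own clone, e.g. (an illustration, not code from this PR):

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
chunks = [rng.normal(size=(100, 5)) for _ in range(4)]

base = KMeans(n_clusters=3)


def fit_clone(estimator, X):
    # Each worker fits its own clone, so no fitted state is shared.
    return clone(estimator).fit(X)


models = Parallel(n_jobs=4)(delayed(fit_clone)(base, X) for X in chunks)
```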

(Resolved review thread on sklearn/cluster/_kmeans.py.)
@fcharras (Collaborator) commented Jan 18, 2023

Thanks for the work @betatim

I also caught up with the PR. I like the overall approach and don't have anything to add besides what has already been suggested.

In particular, as @ogrisel suggested, I think `_get_engine` is supposed to return input validated by the engine even when `hasattr(self, "_engine_provider") and not reset`? Or is there another reason for not validating? (It seems like an error, since later on the prediction methods expect already-validated input.)

When it's merged I'll open a PR on our GPU plugin to see how it goes!

@ogrisel (Owner) commented Jan 18, 2023

Just to make sure that this is visible, I replied in the previous thread:

#13 (comment)

@betatim (Collaborator, Author) commented Jan 20, 2023

I've updated the code so that `accept` does not return the inputs. We discussed this two days ago and settled on having separate "acceptance" and "conversion/validation" phases.

To test it, I also added a first draft of being able to pass an engine class to `config_context(engine_provider=[SomeEngineAsAClass, "some_other_plugin_name"])`. This way you can define an engine at runtime and pass it in, which solves the problem that you previously couldn't define a new engine at runtime because you can't register a new entry point.
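Usage would look like this (with `SomeEngineAsAClass` standing in for any engine class defined at runtime, and `X` some input data):

```python
from sklearn import config_context
from sklearn.cluster import KMeans

with config_context(
    engine_provider=[SomeEngineAsAClass, "some_other_plugin_name"]
):
    # SomeEngineAsAClass is tried directly, no entry point needed;
    # "some_other_plugin_name" is still resolved via entry points.
    KMeans().fit(X)
```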

@ogrisel (Owner) left a comment

Thanks, this looks good to me. I would just like to better define the case where `reset=False` with a previously resolved engine class, as explained below.

(Resolved review threads on sklearn/_engine/tests/test_engines.py, sklearn/base.py, sklearn/cluster/_kmeans.py, sklearn/_engine/base.py, and sklearn/_config.py.)
@fcharras (Collaborator) commented Jan 23, 2023

So we've come full circle 😁

It would help if you could highlight, with a few examples, how this is intended to be used for inputs a user could reasonably provide, and for edge cases within the range of supported inputs (lists of lists, input dtypes that might require casting depending on the engine, ...). What would you recommend to plugin developers regarding the engine's behavior in each case (does `accept` return True or False, does `prepare_*` raise an error, and which kind of error)?

If we're back to the same `accept`, as a plugin developer I'm still tempted to come back to soda-inria/sklearn-numba-dpex#74 (where `accept` uses `asarray` and stores the converted array before returning True), but improving it so that it raises errors for cases where we can be sure all other engines would fail too (e.g. wrong `ndim`).

If you consider that bad practice, I'd suggest throwing away the engine instance after `accept` and re-instantiating it later. But then I think the acceptance criterion should be expected to be light and maybe non-exhaustive, because triggering a conversion just for acceptance seems a bit wasteful.

@ogrisel (Owner) commented Jan 23, 2023

> If we're back to the same `accept`, as a plugin developer I'm still tempted to come back to soda-inria/sklearn-numba-dpex#74 (where `accept` uses `asarray` and stores the converted array before returning True), but improving it so that it raises errors for cases where we can be sure all other engines would fail too (e.g. wrong `ndim`).

I would rather not store the attribute. I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

Later we might refine the `accept` logic of the sklearn_numba_dpex engines so that, for inputs which have a `__dlpack__` method, we inspect the device, and if it's not a device managed by the SYCL runtime, return False.
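So the whole acceptance check for that engine would stay small, roughly (the class and method names here are assumptions):

```python
import scipy.sparse


class KMeansEngine:
    def __init__(self, estimator):
        self.estimator = estimator

    def accepts(self, X, y=None, sample_weight=None):
        # Unsupported hyper-parameter: only Lloyd is implemented.
        if self.estimator.algorithm != "lloyd":
            return False
        # Unsupported input type: no sparse support on this engine.
        if scipy.sparse.issparse(X):
            return False
        # Possible later refinement: inspect the __dlpack__ device and
        # reject inputs not managed by the SYCL runtime.
        return True
```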

@fcharras (Collaborator) commented:

> I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

👍 I'm fine with that. 👍

@jjerphan (Collaborator) commented Jan 23, 2023

Small interjection: do you think the following would be a better title for this PR?

FEA Add `EngineAwareMixin` to factor the common logic of estimators which are plugin-extendable

@betatim (Collaborator, Author) commented Jan 25, 2023

> I would rather not store the attribute. I think in the case of sklearn_numba_dpex, `accept` should only return False if the `algorithm` param is not Lloyd or if `scipy.sparse.issparse(X)`, and return True otherwise. Then the validation itself would happen in the `prepare_*` methods as usual and raise exceptions in case of problems.

Quoting this because I like it as an answer to the question of how `accept` should work with `prepare_*`.

In my mind, `accept` should return False if either the hyper-parameters are not supported, or you know you can't convert the input data to a type the engine supports (e.g. sparse data), or the engine doesn't want to handle it (hypothetical example: because the input is on the wrong device).

To use an analogy: `accept` is like checking whether the plug on the end of a cable fits. A 220V power plug won't fit into an ethernet port -> return False. But of course, just because you have a Cat5 cable with the right plug doesn't mean that the other end of the cable is also plugged into an ethernet port and that the data that will arrive is ethernet (some crazy person might actually try to send 220V over a Cat5 cable...).

This means input validation, like shape and such, should happen in the `prepare_*` methods.

I think it is good that, say, a "cupy plugin" accepts input that is a cupy array, but then raises an exception during `prepare_*` if something is wrong with the contents of the array. The alternative of not accepting would mean that some other plugin (likely the default engine) gets the input and then raises an exception related to data conversion, when the real problem is the contents. Getting the more specific error from the "cupy plugin" is better.
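In code, that division of labour for the hypothetical cupy plugin would be something like (a sketch; the class and method names are assumptions):

```python
import cupy


class CupyEngine:
    def accepts(self, X, y=None, sample_weight=None):
        # "Does the plug fit?": only take inputs already on the GPU.
        return isinstance(X, cupy.ndarray)

    def prepare_fit(self, X, y=None, sample_weight=None):
        # "Is the data itself sane?": raise the specific error here rather
        # than letting another engine fail on a data conversion instead.
        if not cupy.isfinite(X).all():
            raise ValueError("X contains NaN or infinity")
        return X, y, sample_weight
```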

@betatim betatim changed the title Add engine aware mixin to factor out engine stuff FEA Add EngineAwareMixin to factor the common logic of estimators which are plugin-extendable Jan 25, 2023
@betatim (Collaborator, Author) commented Jan 25, 2023

> Small interjection: do you think the following would be a better title for this PR?
>
> FEA Add `EngineAwareMixin` to factor the common logic of estimators which are plugin-extendable

That is indeed a better title. Switched to it. However, I think this PR is becoming a collection of several ideas (e.g. it "accidentally" also implements the idea of having ad-hoc engines that aren't registered via entry points), so IMHO we should try to wrap this one up and start new PRs for new ideas instead of putting them all here.

(Resolved review thread on sklearn/_engine/base.py.)
@ogrisel (Owner) left a comment

LGTM with the previous and following suggestions for class filtering by `engine_name`.

(Review thread on sklearn/_engine/tests/test_engines.py.)