Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python][sklearn] add __sklearn_is_fitted__() method to be better compatible with scikit-learn API #4636

Merged
merged 2 commits into from
Oct 5, 2021

Conversation

StrikerRUS
Copy link
Collaborator

Refer to scikit-learn 1.0 release notes and corresponding PR:

Enhancement utils.validation.check_is_fitted now uses __sklearn_is_fitted__ if available, instead of checking for attributes ending with an underscore.
https://scikit-learn.org/stable/whats_new/v1.0.html#sklearn-utils
scikit-learn/scikit-learn#20657

After doing some search over GitHub repos, it looks like that only XGBoost has already adopted that new API:
dmlc/xgboost#7230

@@ -529,6 +529,9 @@ def _more_tags(self):
}
}

def __sklearn_is_fitted__(self) -> bool:
return getattr(self, "fitted_", False)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set fitted_ attribute at the end of the fit() method

self.fitted_ = True

Copy link
Collaborator

@jameslamb jameslamb Oct 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed in the dmlc/xgboost#7230, they also replaced some other internal checks of estimator attributes with this method call.

I think that's a good idea, so only this method needs to know about the fitted_ property. What do you think?

That would mean changing the following to if not self.__sklearn_is_fitted__():

if not getattr(self, "fitted_", False):
raise LGBMNotFittedError('Cannot access property client_ before calling fit().')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I remember you were asking why do we use if self._n_features is None everywhere for checking that estimator is fitted when we have self.fitted_.

if self._n_features is None:

#3883 (comment)
My answer was that it's because self.fitted_ was introduced much later. Now we can replace everything with self.__sklearn_is_fitted__(). And I think it will be clearer.

But I'd prefer to do it in a follow-up PR, if you do not mind.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yep, I'm fine with it being a followup

@StrikerRUS StrikerRUS marked this pull request as ready for review September 30, 2021 15:22
@StrikerRUS
Copy link
Collaborator Author

Are we facing the following issue at our CI services?
https://news.yahoo.com/millions-old-phones-laptops-smart-210600565.html

@jameslamb
Copy link
Collaborator

Are we facing the following issue at our CI services? https://news.yahoo.com/millions-old-phones-laptops-smart-210600565.html

oh wow! maybe!

Err:10 https://apt.kitware.com/ubuntu bionic Release
Certificate verification failed: The certificate is NOT trusted. The certificate chain uses expired certificate. Could not handshake: Error in the certificate verification. [IP: 66.194.253.25 443]

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for catching this in the scikit-learn release notes! Totally agree.

Just left one comment for your consideration.

@jameslamb
Copy link
Collaborator

@StrikerRUS I've open #4646 to consolidate the discussion about the certificate issues for CUDA CI jobs (since they aren't specific to this PR).

@jk7ss
Copy link

jk7ss commented Nov 5, 2021

This change

    def __sklearn_is_fitted__(self) -> bool:
        return getattr(self, "fitted_", False)

that defaults to False, together with this test:

    def predict(...):
        if not self.__sklearn_is_fitted__():
            raise LGBMNotFittedError("Estimator not fitted, call fit before exploiting the model.")

renders all models that were trained and serialized prior to v3.3.0 failing prediction after deserialization, because they are missing the fitted_ attribute.
Shouldn't it either
a) cause AttributeError if it is missing -> meaning that the model object is not code compatible? or
b) silently expect it's fitted to be backward compatible?

It's a reasonable assumption that a deserialized model was already fitted?

@StrikerRUS
Copy link
Collaborator Author

StrikerRUS commented Nov 5, 2021

@jk7ss Thank you very much for your report! This is indeed important to be backward compatible with serialized models fitted in previous versions of LightGBM. But fitted_ attribute was added in #3329 and that PR was included in the v3.1.0 release (16 Nov 2020). So I guess everything should be OK for models fitted and serialized with versions started from v3.1.0. Do you think we should do something to be backward compatible even for models trained and serialized prior to v3.1.0?

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants