
[BUG] Stacking classifier cannot use Thresholder function - no .predict_proba #501

Closed
L-Marriott opened this issue Mar 28, 2022 · 16 comments
Labels
bug Something isn't working

Comments

@L-Marriott

L-Marriott commented Mar 28, 2022

Description:

I'm able to use the Thresholder on sklearn's VotingClassifier, but not on the StackingClassifier. It throws the error below, which I believe is itself in error: StackingClassifier does have predict_proba. Maybe I'm misunderstanding the use case, but this seems to fit.

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.

Code for reproduction (using the sklearn sample data for StackingClassifier):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklego.meta import Thresholder

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(X, y)

a = Thresholder(clf, threshold=0.2)
a.fit(X, y)
a.predict(X)

Full trace:

ValueError                                Traceback (most recent call last)
<ipython-input-26-1b89dbfa16b8> in <module>
     16 
     17 a = Thresholder(clf, threshold=0.2)
---> 18 a.fit(X_train_std, np.ceil(y_train[targets[2]]))
     19 a.predict(X_train_std)

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklego\meta\thresholder.py in fit(self, X, y, sample_weight)
     54         self.estimator_ = clone(self.model)
     55         if not isinstance(self.estimator_, ProbabilisticClassifier):
---> 56             raise ValueError(
     57                 "The Thresholder meta model only works on classifcation models with .predict_proba."
     58             )

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.
L-Marriott added the bug label Mar 28, 2022
@koaning
Owner

koaning commented Mar 29, 2022

Interesting.

We're using a custom class to check whether an estimator is probabilistic (i.e. whether it has predict_proba). It has an instance check on it, but somehow that check does not pass here.

Just in case, could you share your machine info? Operating system, Python version, and package versions? You can use watermark for this.
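
For reference, a minimal sketch of how such an instance check can be built (along the lines of sklego's ProbabilisticClassifier, though not necessarily its exact code):

class ProbabilisticMeta(type):
    def __instancecheck__(cls, obj):
        # Duck-typed check: anything exposing predict_proba counts.
        return hasattr(obj, "predict_proba")

class ProbabilisticClassifier(metaclass=ProbabilisticMeta):
    pass

With this, isinstance(clf, ProbabilisticClassifier) is True exactly when clf exposes a predict_proba attribute at the moment of the check.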

@MBrouns
Collaborator

MBrouns commented Mar 29, 2022

I think I have an idea of what's causing this. predict_proba on StackingClassifier is conditional on its final_estimator_ having a predict_proba (https://github.com/scikit-learn/scikit-learn/blob/37ac6788c9504ee409b75e5e24ff7d86c90c2ffb/sklearn/ensemble/_stacking.py#L515 — check the decorator).

Our Thresholder checks in fit whether it's a probabilistic classifier, but at that point the StackingClassifier isn't fitted yet, so it doesn't have a final_estimator_ and therefore doesn't have a predict_proba.
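
To illustrate (this is the behaviour of the sklearn 1.0.x used in this issue; the delegation mechanics have changed in later versions):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(hasattr(clf, "predict_proba"))  # False: no final_estimator_ before fit
clf.fit(X, y)
print(hasattr(clf, "predict_proba"))  # True once final_estimator_ exists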

@L-Marriott
Author

L-Marriott commented Mar 29, 2022

@MBrouns I'm not sure I fully understand; isn't the classifier already fitted in line 14? Prefitting the final estimator doesn't seem to help either.

Here's my basic machine info. Looks like I'm a version behind on both modules, so I'll try updating.

Python implementation: CPython
Python version       : 3.9.12
IPython version      : 7.25.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
CPU cores   : 6
Architecture: 64bit

scikit-learn: 1.0.1
scikit-lego: 0.6.9

Edit: no change with the 1.0.2 and 0.6.10 updates.

@koaning
Owner

koaning commented Mar 29, 2022

There is a refit parameter that retrains the model passed into the Thresholder, but it's turned off by default. I'm reminded that we should really add a note about this behavior to the main docs page.
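
On the snippet from the issue, that would look like:

a = Thresholder(clf, threshold=0.2, refit=True)
a.fit(X, y)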

@L-Marriott could you share the full stack trace as well?

@L-Marriott
Author

L-Marriott commented Mar 29, 2022

Sure, full trace below. I've also cleaned up the initial post as I'd confused some of my example code with my actual use case.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-3437ca1f5dae> in <module>
     16 
     17 a = Thresholder(clf, threshold=0.2, refit=True)
---> 18 a.fit(X, y)
     19 a.predict(X)

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklego\meta\thresholder.py in fit(self, X, y, sample_weight)
     54         self.estimator_ = clone(self.model)
     55         if not isinstance(self.estimator_, ProbabilisticClassifier):
---> 56             raise ValueError(
     57                 "The Thresholder meta model only works on classifcation models with .predict_proba."
     58             )

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.

Enabling the refit parameter for Thresholder doesn't seem to resolve it though. Same trace given.

@MBrouns
Collaborator

MBrouns commented Mar 29, 2022

Ah @L-Marriott, I see! There is indeed a bug in our refit behaviour, I think. We check whether self.estimator_ is already fitted, but we produce it by cloning self.model, and clone always returns an unfitted model, making that check fail and forcing a model refit.

Seems quite fixable though: we should only clone self.model into self.estimator_ if we actually want to refit. Otherwise we should simply set self.estimator_ = self.model.
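
A minimal sketch of that idea (simplified and hypothetical, not the exact merged code):

from sklearn.base import clone

class Thresholder:
    def __init__(self, model, threshold, refit=False):
        self.model = model
        self.threshold = threshold
        self.refit = refit

    def fit(self, X, y):
        if self.refit:
            # Retrain on a fresh clone so the user's original model is untouched.
            self.estimator_ = clone(self.model)
            self.estimator_.fit(X, y)
        else:
            # Reuse the model as-is: a prefitted StackingClassifier then keeps
            # its final_estimator_ and hence its predict_proba.
            self.estimator_ = self.model
        return self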

Just to check: @L-Marriott do you want to make a PR for this?

@koaning
Owner

koaning commented Mar 29, 2022

It'd be good to also add a unit test that confirms it works for the StackingClassifier; the reproduction in this issue would make a nice one.

@L-Marriott
Author

I'll be honest, I'm a very amateur programmer. I'm out of my depth writing the tests for it.

@koaning
Owner

koaning commented Mar 29, 2022

No worries. But you may appreciate calmcode.io.

@MarkusDegen
Contributor

The unit test shows the same error. I guess I need to slim it down a bit and add some asserts. I also need to learn what the StackingClassifier actually does.

@MarkusDegen
Contributor

MarkusDegen commented Mar 30, 2022

I do not really get the refit tests.

Does something like this make sense for a refit test?

import numpy as np
from sklearn.dummy import DummyClassifier
from sklego.meta import Thresholder

def test_no_refit_does_not_fit_underlying():
    X = np.array([1, 2, 3, 4]).reshape(-1, 1)
    y_ones = np.array([0, 1, 1, 1])
    y_zeros = np.array([0, 0, 0, 1])

    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X, y_ones)
    a = Thresholder(clf, threshold=0.2, refit=False)
    a.fit(X, y_zeros)

    # The underlying model stays fitted on y_ones, so 1 is still the majority class.
    assert a.predict(np.array([[1]])) == 1

and corresponding

def test_refit_fits_underlying():
    X = np.array([1, 2, 3, 4]).reshape(-1, 1)
    y_ones = np.array([0, 1, 1, 1])
    y_zeros = np.array([0, 0, 0, 1])

    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X, y_ones)
    a = Thresholder(clf, threshold=0.2, refit=True)
    a.fit(X, y_zeros)

    # refit=True retrains on y_zeros, so 0 becomes the majority class.
    assert a.predict(np.array([[1]])) == 0

Why do we have to clone in the refit case?

@MarkusDegen
Contributor

Do we also need to clone in _handle_refit in case we have refit = False and the underlying model was not fitted before (NotFittedError)?

@MarkusDegen
Contributor

MarkusDegen commented Mar 31, 2022

So there are three cases when calling fit, right?

Case | Action
refit is False and the underlying model has not been fitted | fit the same underlying, original model
refit is False and the underlying model is fitted | the underlying model stays the same and is not fitted again
refit is True and the underlying model is fitted | the underlying model is cloned and the clone is fitted
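
A sketch of how those three cases could be handled (a hypothetical helper, loosely modelled on the _handle_refit mentioned above):

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def handle_refit(model, refit, X, y):
    if refit:
        # Case 3: always retrain a clone, leaving the user's model intact.
        estimator = clone(model)
        estimator.fit(X, y)
        return estimator
    try:
        # Case 2: refit is False and the model is already fitted: reuse it.
        check_is_fitted(model)
    except NotFittedError:
        # Case 1: refit is False but the model was never fitted: fit it once.
        model.fit(X, y)
    return model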

@MarkusDegen
Contributor

I took a look at whether test_passes_sample_weight is the right test for it, but this parameterized test always has an underlying model that is unfitted, so at least the case of a fitted underlying model with refit=True should maybe be tested additionally.

@koaning
Owner

koaning commented Apr 2, 2022

Sorry for the delayed response; I've been dealing with a very gnarly flu this season.

I see you've made the relevant changes to the PR, which look good to me!

koaning closed this as completed in 0469bc8 Apr 2, 2022
@koaning
Owner

koaning commented Apr 2, 2022

The PR is now merged into the main branch. I'd like to wait another week to see whether other PRs come in; if so, we may be able to batch a few fixes together into a new release on PyPI.

@MarkusDegen, thanks for the PR!
