
[BUG] Stacking classifier cannot use Thresholder function - no .predict_proba #501

Closed
L-Marriott opened this issue Mar 28, 2022 · 16 comments
Labels
bug Something isn't working

Comments

@L-Marriott

L-Marriott commented Mar 28, 2022

Description:

I'm able to use the Thresholder on sklearn's VotingClassifier, but not on the StackingClassifier. It throws the error below, which I believe is itself in error: StackingClassifier does have predict_proba. Maybe I'm misunderstanding the use case, but this seems to fit.

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.

Code for reproduction (using the sklearn sample data for StackingClassifier):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklego.meta import Thresholder

X, y = load_iris(return_X_y=True)
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))]
clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
clf.fit(X, y)

a = Thresholder(clf, threshold=0.2)
a.fit(X, y)
a.predict(X)

Full trace:

ValueError                                Traceback (most recent call last)
<ipython-input-26-1b89dbfa16b8> in <module>
     16 
     17 a = Thresholder(clf, threshold=0.2)
---> 18 a.fit(X_train_std, np.ceil(y_train[targets[2]]))
     19 a.predict(X_train_std)

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklego\meta\thresholder.py in fit(self, X, y, sample_weight)
     54         self.estimator_ = clone(self.model)
     55         if not isinstance(self.estimator_, ProbabilisticClassifier):
---> 56             raise ValueError(
     57                 "The Thresholder meta model only works on classifcation models with .predict_proba."
     58             )

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.
L-Marriott added the bug label Mar 28, 2022
@koaning
Owner

koaning commented Mar 29, 2022

Interesting.

We're using a custom class to check whether an estimator is probabilistic (i.e. whether it has predict_proba). It has an instance check on it, but somehow that check does not pass here.

Just in case, could you share your machine info? Operating system, Python version, and package versions? You can use watermark for this.
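
For reference, a minimal sketch of how such an instance check can be built (along the lines of sklego's ProbabilisticClassifier, though not necessarily its exact code):

class ProbabilisticMeta(type):
    def __instancecheck__(cls, obj):
        # Duck-typed check: anything exposing predict_proba counts.
        return hasattr(obj, "predict_proba")

class ProbabilisticClassifier(metaclass=ProbabilisticMeta):
    pass

With this, isinstance(clf, ProbabilisticClassifier) is True exactly when clf exposes a predict_proba attribute at the moment of the check.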

@MBrouns
Collaborator

MBrouns commented Mar 29, 2022

I think I have an idea of what's causing this. predict_proba on StackingClassifier is conditional on its final_estimator_ having a predict_proba (https://github.com/scikit-learn/scikit-learn/blob/37ac6788c9504ee409b75e5e24ff7d86c90c2ffb/sklearn/ensemble/_stacking.py#L515 — check the decorator).

Our Thresholder checks in fit whether it's a probabilistic classifier, but at that point the StackingClassifier isn't fitted yet, so it doesn't have a final_estimator_ and therefore doesn't have a predict_proba.
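
To illustrate (this is the behaviour of the sklearn 1.0.x used in this issue; the delegation mechanics have changed in later versions):

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = StackingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(hasattr(clf, "predict_proba"))  # False: no final_estimator_ before fit
clf.fit(X, y)
print(hasattr(clf, "predict_proba"))  # True once final_estimator_ exists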

@L-Marriott
Author

L-Marriott commented Mar 29, 2022

@MBrouns I'm not sure I fully understand; isn't the classifier already fitted in line 14? Prefitting the final estimator doesn't seem to help either.

Here's my basic machine info. Looks like I'm a version behind on both modules, so I'll try updating.

Python implementation: CPython
Python version       : 3.9.12
IPython version      : 7.25.0

Compiler    : MSC v.1929 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
CPU cores   : 6
Architecture: 64bit

scikit-learn: 1.0.1
scikit-lego: 0.6.9

Edit: no change with the 1.0.2 and 0.6.10 updates.

@koaning
Owner

koaning commented Mar 29, 2022

There is a refit parameter that retrains the model passed into the Thresholder, but it's turned off by default. I'm reminded that we should really add a note about this behavior to the main docs page.
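
On the snippet from the issue, that would look like:

a = Thresholder(clf, threshold=0.2, refit=True)
a.fit(X, y)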

@L-Marriott could you share the full stack trace as well?

@L-Marriott
Author

L-Marriott commented Mar 29, 2022

Sure, full trace below. I've also cleaned up the initial post as I'd confused some of my example code with my actual use case.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-3437ca1f5dae> in <module>
     16 
     17 a = Thresholder(clf, threshold=0.2, refit=True)
---> 18 a.fit(X, y)
     19 a.predict(X)

~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklego\meta\thresholder.py in fit(self, X, y, sample_weight)
     54         self.estimator_ = clone(self.model)
     55         if not isinstance(self.estimator_, ProbabilisticClassifier):
---> 56             raise ValueError(
     57                 "The Thresholder meta model only works on classifcation models with .predict_proba."
     58             )

ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.

Enabling the refit parameter for Thresholder doesn't seem to resolve it though. Same trace given.

@MBrouns
Collaborator

MBrouns commented Mar 29, 2022

Ah @L-Marriott, I see! There is indeed a bug in our refit behaviour, I think. We check whether self.estimator_ is already fitted, but we produce it by cloning self.model, and clone always returns an unfitted model, making that check fail and forcing a model refit.

Seems quite fixable though: we should only clone self.model into self.estimator_ if we actually want to refit. Otherwise we should simply set self.estimator_ = self.model.
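
A minimal sketch of that idea (simplified and hypothetical, not the exact merged code):

from sklearn.base import clone

class Thresholder:
    def __init__(self, model, threshold, refit=False):
        self.model = model
        self.threshold = threshold
        self.refit = refit

    def fit(self, X, y):
        if self.refit:
            # Retrain on a fresh clone so the user's original model is untouched.
            self.estimator_ = clone(self.model)
            self.estimator_.fit(X, y)
        else:
            # Reuse the model as-is: a prefitted StackingClassifier then keeps
            # its final_estimator_ and hence its predict_proba.
            self.estimator_ = self.model
        return self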

Just to check: @L-Marriott do you want to make a PR for this?

@koaning
Owner

koaning commented Mar 29, 2022

It'd be good to also add a unit test that confirms it works for the StackingClassifier; the reproduction in this issue would make a nice one.

@L-Marriott
Author

I'll be honest, I'm a very amateur programmer. I'm out of my depth writing the tests for it.

@koaning
Owner

koaning commented Mar 29, 2022

No worries. But you may appreciate calmcode.io.

@MarkusDegen
Contributor

The unit test shows the same error. I guess I need to slim it down a bit and add some asserts. I also need to learn what the StackingClassifier actually does.

@MarkusDegen
Contributor

MarkusDegen commented Mar 30, 2022

I do not really get the refit tests.

Does something like this make sense for a refit test?

import numpy as np
from sklearn.dummy import DummyClassifier
from sklego.meta import Thresholder

def test_no_refit_does_not_fit_underlying():
    X = np.array([1, 2, 3, 4]).reshape(-1, 1)
    y_ones = np.array([0, 1, 1, 1])
    y_zeros = np.array([0, 0, 0, 1])

    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X, y_ones)
    a = Thresholder(clf, threshold=0.2, refit=False)
    a.fit(X, y_zeros)

    # The underlying model stays fitted on y_ones, so 1 is still the majority class.
    assert a.predict(np.array([[1]])) == 1

and corresponding

def test_refit_fits_underlying():
    X = np.array([1, 2, 3, 4]).reshape(-1, 1)
    y_ones = np.array([0, 1, 1, 1])
    y_zeros = np.array([0, 0, 0, 1])

    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X, y_ones)
    a = Thresholder(clf, threshold=0.2, refit=True)
    a.fit(X, y_zeros)

    # refit=True retrains on y_zeros, so 0 becomes the majority class.
    assert a.predict(np.array([[1]])) == 0

Why do we have to clone in the refit case?

@MarkusDegen
Contributor

Do we also need to clone in _handle_refit in case we have refit = False and the underlying model was not fitted before (NotFittedError)?

@MarkusDegen
Contributor

MarkusDegen commented Mar 31, 2022

So there are three cases when calling fit, right?

Case | Action
refit is False and the underlying model has not been fitted | fit the same underlying, original model
refit is False and the underlying model is fitted | the underlying model stays the same and is not fitted again
refit is True and the underlying model is fitted | the underlying model is cloned and the clone is fitted
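
A sketch of how those three cases could be handled (a hypothetical helper, loosely modelled on the _handle_refit mentioned above):

from sklearn.base import clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

def handle_refit(model, refit, X, y):
    if refit:
        # Case 3: always retrain a clone, leaving the user's model intact.
        estimator = clone(model)
        estimator.fit(X, y)
        return estimator
    try:
        # Case 2: refit is False and the model is already fitted: reuse it.
        check_is_fitted(model)
    except NotFittedError:
        # Case 1: refit is False but the model was never fitted: fit it once.
        model.fit(X, y)
    return model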

@MarkusDegen
Contributor

I took a look at whether test_passes_sample_weight is the right test for it, but this parameterized test always has an underlying model that is unfitted, so at least the case of a fitted underlying model with refit=True should maybe be tested additionally.

@koaning
Owner

koaning commented Apr 2, 2022

Sorry for the delayed response; I've been dealing with a very gnarly flu this season.

I see you've made the relevant changes to the PR, which look good to me!

koaning closed this as completed in 0469bc8 Apr 2, 2022
@koaning
Owner

koaning commented Apr 2, 2022

The PR is now merged into the main branch. I'd like to wait another week to see whether other PRs come in; if so, we may be able to batch a few fixes together into a new release on PyPI.

@MarkusDegen, thanks for the PR!
