
The Scorecard.fit(X, y) method should be more considerate with the sample_weight fit param #228

Closed
vruusmann opened this issue Dec 28, 2022 · 4 comments
vruusmann commented Dec 28, 2022

I'm trying to combine OptBinning with StatsModels estimators, using the sklearn2pmml.statsmodels.StatsModelsClassifier estimator as a StatsModels/SkLearn bridge.

StatsModels estimators generally do not support sample weights. Those that do expect them as a constructor argument, not as a fit method argument.

Now, if we don't care about sample weights, then it should be possible to train a simple binary classification scorecard like this:

from optbinning import BinningProcess
from optbinning.scorecard import Scorecard
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn2pmml.statsmodels import StatsModelsClassifier
from statsmodels.api import Logit

iris_X, iris_y = load_iris(return_X_y = True)

iris_X = DataFrame(iris_X, columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"])
iris_y = (iris_y == 1)

binning_process = BinningProcess(variable_names = iris_X.columns.values)
estimator = StatsModelsClassifier(Logit)

scorecard = Scorecard(binning_process = binning_process, estimator = estimator)
scorecard.fit(iris_X, iris_y)

When running this example with OptBinning 0.17.2, the following TypeError is raised:

Traceback (most recent call last):
  File "main.py", line 25, in <module>
    scorecard.fit(iris_X, iris_y)
  File "/home/user/.local/lib/python3.9/site-packages/optbinning/scorecard/scorecard.py", line 303, in fit
    return self._fit(X, y, sample_weight, metric_special, metric_missing,
  File "/home/user/.local/lib/python3.9/site-packages/optbinning/scorecard/scorecard.py", line 560, in _fit
    self.estimator_.fit(X_t, y, sample_weight)
TypeError: fit() takes 3 positional arguments but 4 were given

Essentially, the problem is that the Scorecard.fit(X, y) method always passes sample_weight as a third positional argument, even when no such fit param is present in the current context:
https://github.com/guillermo-navas-palencia/optbinning/blob/v0.17.2/optbinning/scorecard/scorecard.py#L559-L560
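The failure mode can be reproduced without OptBinning or StatsModels at all. A minimal sketch, using a hypothetical estimator whose fit() takes only X and y, mirroring the StatsModelsClassifier case:

```python
# Hypothetical minimal estimator that does not accept sample weights
class NoWeightEstimator:

    def fit(self, X, y):
        self.n_features_ = len(X[0])
        return self


est = NoWeightEstimator()
try:
    # Scorecard._fit effectively makes this three-argument call,
    # with sample_weight defaulting to None
    est.fit([[1.0], [2.0]], [0, 1], None)
except TypeError as e:
    # "fit() takes 3 positional arguments but 4 were given"
    print(e)
```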

A quick fix would be to check whether the sample_weight fit param is set, and only then pass it forward to the estimator_ object:

self.estimator_ = clone(self.estimator)

if sample_weight is not None:
  self.estimator_.fit(X, y, sample_weight = sample_weight)
else:
  self.estimator_.fit(X, y)
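A toy demonstration of that guarded call, using two hypothetical estimators (one whose fit() accepts sample_weight, one whose fit() does not):

```python
class WeightedEstimator:
    """Hypothetical estimator that accepts fit-time sample weights."""

    def fit(self, X, y, sample_weight=None):
        self.received_weights_ = sample_weight is not None
        return self


class UnweightedEstimator:
    """Hypothetical estimator that does not accept sample weights."""

    def fit(self, X, y):
        self.received_weights_ = False
        return self


def guarded_fit(estimator, X, y, sample_weight=None):
    # Forward sample_weight only when it was actually supplied
    if sample_weight is not None:
        return estimator.fit(X, y, sample_weight=sample_weight)
    return estimator.fit(X, y)


# No TypeError, because the unused sample_weight is never forwarded
guarded_fit(UnweightedEstimator(), [[1.0], [2.0]], [0, 1])
guarded_fit(WeightedEstimator(), [[1.0], [2.0]], [0, 1], sample_weight=[1.0, 2.0])
```

An estimator that cannot accept weights would still fail if the caller explicitly supplies them, which is the correct behaviour: the error then reflects a genuine mismatch rather than an implementation detail.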
@vruusmann (Author) commented:

While we are at it, perhaps it would make sense to support a workflow where sample weights are passed only to the binning_process_ object or only to the estimator_ object?

Something similar to Scikit-Learn's "double underscore prefix" mechanism?

For example:

scorecard.fit(X, y, binning_process__sample_weight = ..., estimator__sample_weight = None)
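For reference, this is how Scikit-Learn's own Pipeline class already routes fit params with the double-underscore convention (`<step_name>__<param>`); the step names and data here are arbitrary:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SGDClassifier(random_state=0)),
])

# sample_weight is routed only to the "clf" step's fit method;
# the "scaler" step never sees it
pipe.fit(X, y, clf__sample_weight=np.array([1.0, 1.0, 2.0, 2.0]))
```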

@guillermo-navas-palencia guillermo-navas-palencia added the enhancement New feature or request label Dec 29, 2022
@guillermo-navas-palencia guillermo-navas-palencia added this to the v0.17.3 milestone Dec 29, 2022
@guillermo-navas-palencia (Owner) commented:

Thanks, @vruusmann.

I agree with the quick change you proposed for sample weights, although optbinning is designed mainly with scikit-learn compatibility in mind.

Regarding the "double underscore prefix", I have some doubts. Implementing it would likely mean re-implementing the Scorecard class internals on top of the sklearn Pipeline class.

@guillermo-navas-palencia (Owner) commented:

I am closing this issue. I will open a new one with the double underscore prefix.

@vruusmann (Author) commented:

> I will open a new one with the double underscore prefix.

@guillermo-navas-palencia There is no need to migrate the Scorecard class to Scikit-Learn pipelines in order to implement this functionality.

For example, I've been using this simple utility method for partitioning incoming fit params:
https://github.com/jpmml/sklearn2pmml/blob/0.90.2/sklearn2pmml/ensemble/__init__.py#L47-L54

In the Scorecard case, the algorithm would be something like this:

def fit(self, X, y, **fit_params):
    # Params exclusive to a specific component - prefixed with
    # "binning_process__" or "estimator__"
    binning_process_fit_params = _extract_step_params("binning_process", fit_params)
    estimator_fit_params = _extract_step_params("estimator", fit_params)
    # Params common to both components - no prefix
    if len(fit_params) > 0:
        binning_process_fit_params.update(fit_params)
        estimator_fit_params.update(fit_params)
    # Actual fit
    self.binning_process_.fit(X, y, **binning_process_fit_params)
    self.estimator_.fit(X, y, **estimator_fit_params)
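A minimal sketch of what such an _extract_step_params helper could look like (the actual sklearn2pmml implementation is at the link above; this version is written from scratch for illustration). It pops every prefixed key out of fit_params and returns it with the prefix stripped, leaving the shared, unprefixed params behind:

```python
def _extract_step_params(prefix, fit_params):
    # Collect and remove all "<prefix>__<param>" entries, returning them
    # keyed by the bare "<param>" name; fit_params is mutated in place
    step_params = {}
    marker = prefix + "__"
    for name in list(fit_params.keys()):
        if name.startswith(marker):
            step_params[name[len(marker):]] = fit_params.pop(name)
    return step_params


fit_params = {
    "binning_process__sample_weight": [1.0, 2.0],
    "estimator__sample_weight": None,
    "check_input": True,
}
bp_params = _extract_step_params("binning_process", fit_params)
est_params = _extract_step_params("estimator", fit_params)
# bp_params  == {"sample_weight": [1.0, 2.0]}
# est_params == {"sample_weight": None}
# fit_params is left holding only the shared {"check_input": True}
```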

This is just an idea. You decide whether it's worth having separate fit param sets for individual Scorecard components or not.
