
The Scorecard.fit(X, y) method should be more considerate with the sample_weight fit param #228

Closed
vruusmann opened this issue Dec 28, 2022 · 4 comments
vruusmann commented Dec 28, 2022

I'm trying to combine OptBinning with StatsModels estimators, using the sklearn2pmml.statsmodels.StatsModelsClassifier estimator as a StatsModels/SkLearn bridge.

StatsModels estimators generally do not support sample weights. Those that do expect them as a constructor argument, not as a fit method argument.

Now, if we don't care about sample weights, then it should be possible to train a simple binary classification scorecard like this:

from optbinning import BinningProcess
from optbinning.scorecard import Scorecard
from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn2pmml.statsmodels import StatsModelsClassifier
from statsmodels.api import Logit

iris_X, iris_y = load_iris(return_X_y = True)

iris_X = DataFrame(iris_X, columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"])
iris_y = (iris_y == 1)

binning_process = BinningProcess(variable_names = iris_X.columns.values)
estimator = StatsModelsClassifier(Logit)

scorecard = Scorecard(binning_process = binning_process, estimator = estimator)
scorecard.fit(iris_X, iris_y)

When running this example with OptBinning 0.17.2, the following TypeError is raised:

Traceback (most recent call last):
  File "main.py", line 25, in <module>
    scorecard.fit(iris_X, iris_y)
  File "/home/user/.local/lib/python3.9/site-packages/optbinning/scorecard/scorecard.py", line 303, in fit
    return self._fit(X, y, sample_weight, metric_special, metric_missing,
  File "/home/user/.local/lib/python3.9/site-packages/optbinning/scorecard/scorecard.py", line 560, in _fit
    self.estimator_.fit(X_t, y, sample_weight)
TypeError: fit() takes 3 positional arguments but 4 were given

Essentially, the problem is that the Scorecard.fit(X, y) method always passes sample_weight as a third positional argument, even when no such fit param is present in the current context:
https://github.com/guillermo-navas-palencia/optbinning/blob/v0.17.2/optbinning/scorecard/scorecard.py#L559-L560
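The failure mode can be reproduced without OptBinning or StatsModels at all. A minimal sketch, using a hypothetical estimator whose fit() takes only X and y, mirroring the StatsModelsClassifier case:

```python
# Hypothetical minimal estimator that does not accept sample weights
class NoWeightEstimator:

    def fit(self, X, y):
        self.n_features_ = len(X[0])
        return self


est = NoWeightEstimator()
try:
    # Scorecard._fit effectively makes this three-argument call,
    # with sample_weight defaulting to None
    est.fit([[1.0], [2.0]], [0, 1], None)
except TypeError as e:
    # "fit() takes 3 positional arguments but 4 were given"
    print(e)
```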

A quick fix would be to check whether the sample_weight fit param is set, and only then pass it forward to the estimator_ object:

self.estimator_ = clone(self.estimator)

if sample_weight is not None:
  self.estimator_.fit(X, y, sample_weight = sample_weight)
else:
  self.estimator_.fit(X, y)
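A toy demonstration of that guarded call, using two hypothetical estimators (one whose fit() accepts sample_weight, one whose fit() does not):

```python
class WeightedEstimator:
    """Hypothetical estimator that accepts fit-time sample weights."""

    def fit(self, X, y, sample_weight=None):
        self.received_weights_ = sample_weight is not None
        return self


class UnweightedEstimator:
    """Hypothetical estimator that does not accept sample weights."""

    def fit(self, X, y):
        self.received_weights_ = False
        return self


def guarded_fit(estimator, X, y, sample_weight=None):
    # Forward sample_weight only when it was actually supplied
    if sample_weight is not None:
        return estimator.fit(X, y, sample_weight=sample_weight)
    return estimator.fit(X, y)


# No TypeError, because the unused sample_weight is never forwarded
guarded_fit(UnweightedEstimator(), [[1.0], [2.0]], [0, 1])
guarded_fit(WeightedEstimator(), [[1.0], [2.0]], [0, 1], sample_weight=[1.0, 2.0])
```

An estimator that cannot accept weights would still fail if the caller explicitly supplies them, which is the correct behaviour: the error then reflects a genuine mismatch rather than an implementation detail.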
@vruusmann (Author) commented:

While we are at it, perhaps it would make sense to support a workflow where sample weights are passed only to the binning_process_ object or only to the estimator_ object?

Something similar to Scikit-Learn's "double underscore prefix" mechanism?

For example:

scorecard.fit(X, y, binning_process__sample_weight = ..., estimator__sample_weight = None)
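For reference, this is how Scikit-Learn's own Pipeline class already routes fit params with the double-underscore convention (`<step_name>__<param>`); the step names and data here are arbitrary:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SGDClassifier(random_state=0)),
])

# sample_weight is routed only to the "clf" step's fit method;
# the "scaler" step never sees it
pipe.fit(X, y, clf__sample_weight=np.array([1.0, 1.0, 2.0, 2.0]))
```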

@guillermo-navas-palencia guillermo-navas-palencia added the enhancement New feature or request label Dec 29, 2022
@guillermo-navas-palencia guillermo-navas-palencia added this to the v0.17.3 milestone Dec 29, 2022
@guillermo-navas-palencia (Owner) commented:

Thanks, @vruusmann.

I agree with the quick change you proposed for sample weights, although optbinning is designed mainly with scikit-learn compatibility in mind.

Regarding the "double underscore prefix", I have some doubts. Implementing it would likely mean re-implementing the Scorecard class internals on top of the sklearn Pipeline class.

@guillermo-navas-palencia (Owner) commented:

I am closing this issue. I will open a new one with the double underscore prefix.

@vruusmann (Author) commented:

> I will open a new one with the double underscore prefix.

@guillermo-navas-palencia There is no need to migrate the Scorecard class to Scikit-Learn pipelines in order to implement this functionality.

For example, I've been using this simple utility method for partitioning incoming fit params:
https://github.com/jpmml/sklearn2pmml/blob/0.90.2/sklearn2pmml/ensemble/__init__.py#L47-L54

In the Scorecard case, the algorithm would be something like this:

def fit(self, X, y, **fit_params):
    # Params exclusive to a specific component - prefixed with
    # "binning_process__" or "estimator__"
    binning_process_fit_params = _extract_step_params("binning_process", fit_params)
    estimator_fit_params = _extract_step_params("estimator", fit_params)
    # Params common to both components - no prefix
    if len(fit_params) > 0:
        binning_process_fit_params.update(fit_params)
        estimator_fit_params.update(fit_params)
    # Actual fit
    self.binning_process_.fit(X, y, **binning_process_fit_params)
    self.estimator_.fit(X, y, **estimator_fit_params)
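A minimal sketch of what such an _extract_step_params helper could look like (the actual sklearn2pmml implementation is at the link above; this version is written from scratch for illustration). It pops every prefixed key out of fit_params and returns it with the prefix stripped, leaving the shared, unprefixed params behind:

```python
def _extract_step_params(prefix, fit_params):
    # Collect and remove all "<prefix>__<param>" entries, returning them
    # keyed by the bare "<param>" name; fit_params is mutated in place
    step_params = {}
    marker = prefix + "__"
    for name in list(fit_params.keys()):
        if name.startswith(marker):
            step_params[name[len(marker):]] = fit_params.pop(name)
    return step_params


fit_params = {
    "binning_process__sample_weight": [1.0, 2.0],
    "estimator__sample_weight": None,
    "check_input": True,
}
bp_params = _extract_step_params("binning_process", fit_params)
est_params = _extract_step_params("estimator", fit_params)
# bp_params  == {"sample_weight": [1.0, 2.0]}
# est_params == {"sample_weight": None}
# fit_params is left holding only the shared {"check_input": True}
```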

This is just an idea. You decide whether it's worth having separate fit param sets for individual Scorecard components or not.
