SelectFirstClassifier will cause the predicted value of pmml to all become 0? #370

Closed
liuhuanshuo opened this issue Feb 4, 2023 · 9 comments


@liuhuanshuo

When I use SelectFirstClassifier, I find that in some cases all predicted values from the exported PMML are 0.

Use SelectFirstClassifier

As shown in the following code, I built a pipeline based on SelectFirstClassifier:

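from sklearn2pmml.ensemble import SelectFirstClassifier
from sklearn2pmml.pipeline import PMMLPipeline

pipeline_t = PMMLPipeline([
    ("mapper_3", mapper_3),
    ("conditional_pipeline", SelectFirstClassifier([
        # Route each row by whether its second-to-last column is present
        ("model_1", model_1, "X[-2] is not None"),
        ("model_2", model_2, "X[-2] is None"),
    ])),
])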

If I use pipeline_t for prediction, I get the following results

I then converted pipeline_t to PMML and made predictions on the same data, and the results were all 0!

Do not use SelectFirstClassifier

Obviously, these results are not as expected. Suspecting SelectFirstClassifier, I modified the code as follows:

pipeline_t = PMMLPipeline([
    ("mapper_3", mapper_3),
    ("conditional_pipeline", model_1),
])

The only thing I changed was removing SelectFirstClassifier. The predictions from the pipeline are still normal.

This time, after converting it to a PMML file, the predictions are also in line with expectations.

So can this be taken to mean that SelectFirstClassifier affects the predictions of the PMML file?

@vruusmann
Member

You have two issues combined in this report.

First, about the predicted probability values being all 0. This is an OutputField mis-placement issue - recent versions of SkLearn2PMML package try to "optimize" the layout of PMML documents by moving "common" OutputField elements from the child model level (eg. /PMML/MiningModel/Segmentation/RegressionModel) to the parent model level (eg. /PMML/MiningModel).

The JPMML-Evaluator library gets confused by this: it thinks that it is dealing with a non-probabilistic model and therefore doesn't need to provide predicted probability values at all.

Second, the precision of predicted probability values changing (eg. from 12 decimal places to 7 decimal places) when switching between language environments (direct Python, Scikit-Learn over Python, PMML).

This looks like a Python runtime configuration issue, something related to "console display preferences". The PMML engine is definitely not performing any explicit rounding; it always returns float32 values for single-precision models (ie. Model@mathContext="float") and float64 values for double-precision models (ie. Model@mathContext="double").
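The effect of single versus double precision can be reproduced without any PMML tooling. The following is a minimal sketch using only the Python standard library; the probability value is invented for illustration and `to_float32` is a hypothetical helper. It rounds a float64 value to the nearest float32 and shows that the two agree only to about 7 decimal places.

```python
import struct


def to_float32(value):
    """Round a Python float (float64) to the nearest float32, then widen it back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


p64 = 0.123456789012   # hypothetical probability, not taken from the issue
p32 = to_float32(p64)

print(repr(p64))                  # full double-precision value
print(repr(p32))                  # same value after a float32 round trip
print(f"{p64:.7f}", f"{p32:.7f}")  # both formatted with 7 decimal places
```

The two `repr` lines differ in their trailing digits, while the 7-decimal formatting is identical, which is consistent with the difference being a matter of value type and display rather than rounding inside the engine.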

@vruusmann
Member

Recent versions of SkLearn2PMML package try to "optimize" the layout of PMML documents by moving "common" OutputField elements from the child model level to the parent model level.

As a workaround, I'd suggest that you export the model ensemble as sklearn2pmml.ensemble.EstimatorChain (instead of a sklearn2pmml.ensemble.SelectFirstClassifier).

Keep a parallel SelectFirstEstimator object when you need to perform predict_proba(X) in Python environment.

Something like this:

from sklearn2pmml import sklearn2pmml
from sklearn2pmml.ensemble import EstimatorChain, SelectFirstClassifier

# Shared steps - all SkLearn2PMML ensemble model types expect the same step layout
steps = [...]

# Fit and export
pmml_estimator = EstimatorChain(steps, multioutput = False)
pmml_estimator.fit(X, y)

sklearn2pmml(pmml_estimator)

# Predict probabilities
py_estimator = SelectFirstClassifier(steps)
# The steps have already been fitted, so the object is ready for predict_proba(X) as-is
py_estimator.predict_proba(X)

@vruusmann
Member

The ultimate fix would be to add EstimatorChain.predict_proba(X) method for non-multioutput cases, where all child estimators are subclasses of ClassifierMixin.

@liuhuanshuo
Author

Keep a parallel SelectFirstEstimator object when you need to perform predict_proba(X) in Python environment.

In fact, the reason I chose sklearn2pmml for the conversion is that I want to call the PMML from Java for prediction.

So keeping a separate prediction path in the Python environment is not a very practical approach for me.

Is there really no other way to export a SelectFirstClassifier that keeps the predicted probability values?

@vruusmann
Member

I'm thinking about refactoring SkLearn2PMML ensemble models in the following way:

  1. Add EstimatorChain.predict_proba(X) method, which will work only in multioutput = False mode, when all child models are classifiers.
  2. Make SelectFirstClassifier and SelectFirstRegressor subclasses of EstimatorChain(multioutput = False).

Right now you could fix the SelectFirstClassifier PMML file by relocating OutputField elements from the top-level back to member model-level. Takes less than a minute to do in text editor. But it's very difficult to give you the instructions for doing so via GitHub.
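The manual relocation can also be scripted. The following is a rough sketch using Python's xml.etree.ElementTree, under several assumptions that the thread does not confirm: the PMML 4.4 namespace, a single top-level Output element under /PMML/MiningModel, and member models that each contain a MiningSchema. The helper `relocate_output_fields` and the tiny demo document are hypothetical. The whole Output block is copied into each member model directly after its MiningSchema (ahead of LocalTransformations) and then removed from the top level; which OutputField elements actually need to move for a given file has to be checked by hand.

```python
import copy
import xml.etree.ElementTree as ET

NS = "http://www.dmg.org/PMML-4_4"  # assumed schema version
ET.register_namespace("", NS)


def q(tag):
    """Qualify a tag name with the PMML namespace."""
    return f"{{{NS}}}{tag}"


def relocate_output_fields(root):
    """Copy the top-level Output into each member model, then drop the original."""
    mining_model = root.find(q("MiningModel"))
    output = mining_model.find(q("Output"))
    segmentation = mining_model.find(q("Segmentation"))
    if output is None or segmentation is None:
        return 0
    moved = 0
    for segment in segmentation.findall(q("Segment")):
        for model in segment:
            schema = model.find(q("MiningSchema"))
            # Skip predicates (no MiningSchema) and models that already have an Output
            if schema is None or model.find(q("Output")) is not None:
                continue
            model.insert(list(model).index(schema) + 1, copy.deepcopy(output))
            moved += 1
    if moved:
        mining_model.remove(output)
    return moved


# Hypothetical, heavily trimmed document for demonstration only
DEMO = f"""<PMML xmlns="{NS}" version="4.4">
  <MiningModel functionName="classification">
    <MiningSchema/>
    <Output>
      <OutputField name="probability(1)" optype="continuous" dataType="double"
                   feature="probability" value="1"/>
    </Output>
    <Segmentation multipleModelMethod="selectFirst">
      <Segment id="1"><True/><RegressionModel functionName="classification">
        <MiningSchema/><LocalTransformations/></RegressionModel></Segment>
      <Segment id="2"><True/><RegressionModel functionName="classification">
        <MiningSchema/></RegressionModel></Segment>
    </Segmentation>
  </MiningModel>
</PMML>"""

root = ET.fromstring(DEMO)
moved = relocate_output_fields(root)
print(moved, "member models updated")
```

For a real file, `ET.parse(path)` and `tree.write(path)` would replace the in-memory demo string. Keeping a backup of the generated PMML and re-scoring a few known rows afterwards is advisable, since this sketch has not been validated against JPMML-Evaluator.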

@liuhuanshuo
Author

liuhuanshuo commented Feb 27, 2023

Hi, Villu

Nice to see you pushed a new version.

But it seems that this problem is not fixed in the new version? After I save the pipeline containing SelectFirstClassifier as a PMML file, the predicted values are still all 0.

Right now you could fix the SelectFirstClassifier PMML file by relocating OutputField elements from the top-level back to member model-level. Takes less than a minute to do in text editor.

So I need to modify the generated PMML now?

But it's very difficult to give you the instructions for doing so via GitHub.

I think I understand what you mean. Do I need to move the OutputField below to another location?

Could you give a simple explanation? Since the operation only takes a minute, I don't think it will be very complicated.

@vruusmann
Member

After I save the pipeline containing SelectFirstClassifier as a pmml file, the predicted values are still all 0

The fix targeted the sklearn2pmml.ensemble.EstimatorChain model type, which now provides the predict_proba(X) method. Please consider migrating from SelectFirstClassifier(steps) to EstimatorChain(steps, multioutput = False).

The latest SkLearn2PMML 0.91.0 release was about completely refactoring the Python-to-PMML translation functionality (affecting ExpressionTransformer and the step predicates of SelectFirstEstimator, EstimatorChain and RuleSetClassifier).

The new translator has some crazy new capabilities (which I will blog about shortly). Also, the Python-side evaluation should be 10x faster, because the expression/predicate is "precompiled" once and then reused across all rows.

@liuhuanshuo The pandas.isnull(X) nullability check is now also supported in the predicate context. There shouldn't be any module import issues anymore.
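One reason a dedicated nullability check can matter in Python: a missing value may show up as None or as a float NaN (pandas typically uses NaN in numeric columns), and an `is None` test only catches the former. The following is a minimal standard-library sketch; `is_missing` is a hypothetical stand-in for what pandas.isnull does on a scalar, and the rows are invented. Whether this is what affects the rows with null values in this report is not established by the thread.

```python
import math


def is_missing(value):
    """Hypothetical scalar stand-in for pandas.isnull: True for None and float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


row_with_none = [0.5, None, 3.0]
row_with_nan = [0.5, float("nan"), 3.0]

checks = {
    "None, 'is None'": row_with_none[-2] is None,
    "NaN, 'is None'": row_with_nan[-2] is None,
    "None, is_missing": is_missing(row_with_none[-2]),
    "NaN, is_missing": is_missing(row_with_nan[-2]),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

With a predicate pair such as "X[-2] is not None" / "X[-2] is None", a NaN in that column would be routed to the first branch, so it is worth checking how missing values are actually encoded in the input data.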

@liuhuanshuo
Author

I looked at many correctly working PMML files, noted where OutputField is placed, and tried some modifications.

I found that just moving the OutputField in front of the LocalTransformations seems to work.

But this only ensures that the predicted values are not all 0; there are still problems with predictions for rows with null values.

@liuhuanshuo
Author

Surprised we both replied at the same time. I will look into your suggestion first.
