JPMML-SkLearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML.

Features

Overview

  • Functionality:
    • Three times more supported Python packages, transformers and estimators than all the competitors combined!
    • Thorough collection, analysis and encoding of feature information:
      • Names.
      • Data and operational types.
      • Valid, invalid and missing value spaces.
      • Descriptive statistics.
    • Pipeline extensions:
      • Pruning.
      • Decision engineering (prediction post-processing).
      • Model verification.
    • Conversion options.
  • Extensibility:
    • Rich Java APIs for developing custom converters.
    • Automatic discovery and registration of custom converters based on META-INF/sklearn2pmml.properties resource files.
    • Direct interfacing with other JPMML conversion libraries such as JPMML-H2O, JPMML-LightGBM, JPMML-StatsModels and JPMML-XGBoost.
  • Production quality:
    • Complete test coverage.
    • Fully compliant with the JPMML-Evaluator library.

Supported packages

Scikit-Learn

Examples: main.py

Category Encoders

Examples: extensions/category_encoders.py and extensions/category_encoders-xgboost.py

H2O.ai

Examples: main-h2o.py

Imbalanced-Learn

Examples: extensions/imblearn.py

LightGBM

Examples: main-lightgbm.py

Mlxtend

Examples: N/A

OptBinning

Examples: extensions/optbinning.py

PyCaret

Examples: extensions/pycaret.py

  • pycaret.internal.pipeline.Pipeline
  • pycaret.internal.preprocess.transformers.CleanColumnNames
  • pycaret.internal.preprocess.transformers.FixImbalancer
  • pycaret.internal.preprocess.transformers.RareCategoryGrouping
  • pycaret.internal.preprocess.transformers.RemoveMulticollinearity
  • pycaret.internal.preprocess.transformers.TransformerWrapper
  • pycaret.internal.preprocess.transformers.TransformerWrapperWithInverse

Scikit-Lego

Examples: extensions/sklego.py

  • sklego.meta.EstimatorTransformer
    • Predict functions apply, decision_function, predict and predict_proba.
  • sklego.pipeline.DebugPipeline
  • sklego.preprocessing.IdentityTransformer

SkLearn2PMML

Examples: main.py and extensions/sklearn2pmml.py

  • Helpers:
    • sklearn2pmml.EstimatorProxy
    • sklearn2pmml.SelectorProxy
    • sklearn2pmml.h2o.H2OEstimatorProxy
  • Feature specification and decoration:
    • sklearn2pmml.decoration.Alias
    • sklearn2pmml.decoration.CategoricalDomain
    • sklearn2pmml.decoration.ContinuousDomain
    • sklearn2pmml.decoration.ContinuousDomainEraser
    • sklearn2pmml.decoration.DateDomain
    • sklearn2pmml.decoration.DateTimeDomain
    • sklearn2pmml.decoration.DiscreteDomainEraser
    • sklearn2pmml.decoration.MultiAlias
    • sklearn2pmml.decoration.MultiDomain
    • sklearn2pmml.decoration.OrdinalDomain
  • Ensemble methods:
    • sklearn2pmml.ensemble.EstimatorChain
    • sklearn2pmml.ensemble.GBDTLMRegressor
      • The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
      • The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
    • sklearn2pmml.ensemble.GBDTLRClassifier
      • The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
      • The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
    • sklearn2pmml.ensemble.SelectFirstClassifier
    • sklearn2pmml.ensemble.SelectFirstRegressor
  • Feature selection:
    • sklearn2pmml.feature_selection.SelectUnique
  • Linear models:
    • sklearn2pmml.statsmodels.StatsModelsClassifier
    • sklearn2pmml.statsmodels.StatsModelsRegressor
  • Neural networks:
    • sklearn2pmml.neural_network.MLPTransformer
  • Pipeline:
    • sklearn2pmml.pipeline.PMMLPipeline
  • Postprocessing:
    • sklearn2pmml.postprocessing.BusinessDecisionTransformer
  • Preprocessing:
    • sklearn2pmml.preprocessing.Aggregator
    • sklearn2pmml.preprocessing.BSplineTransformer
    • sklearn2pmml.preprocessing.CastTransformer
    • sklearn2pmml.preprocessing.ConcatTransformer
    • sklearn2pmml.preprocessing.CutTransformer
    • sklearn2pmml.preprocessing.DataFrameConstructor
    • sklearn2pmml.preprocessing.DateTimeFormatter
    • sklearn2pmml.preprocessing.DaysSinceYearTransformer
    • sklearn2pmml.preprocessing.ExpressionTransformer
      • Ternary conditional expression <expression_true> if <condition> else <expression_false>.
      • Array indexing expressions X[<column index>] and X[<column name>].
      • String concatenation expressions.
      • String slicing expressions <str>[<start>:<stop>].
      • Arithmetic operators +, -, *, / and %.
      • Identity comparison operators is None and is not None.
      • Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
      • Logical operators and, or and not.
      • Numpy constant numpy.NaN.
      • Numpy function numpy.where.
      • Numpy universal functions (too numerous to list).
      • Pandas constants pandas.NA and pandas.NaT.
      • Pandas functions pandas.isna, pandas.isnull, pandas.notna and pandas.notnull.
      • Scipy functions scipy.special.expit and scipy.special.logit.
      • String functions startswith(<prefix>), endswith(<suffix>), lower, upper and strip.
      • String length function len(<str>).
    • sklearn2pmml.preprocessing.FilterLookupTransformer
    • sklearn2pmml.preprocessing.LookupTransformer
    • sklearn2pmml.preprocessing.MatchesTransformer
    • sklearn2pmml.preprocessing.MultiLookupTransformer
    • sklearn2pmml.preprocessing.NumberFormatter
    • sklearn2pmml.preprocessing.PMMLLabelBinarizer
    • sklearn2pmml.preprocessing.PMMLLabelEncoder
    • sklearn2pmml.preprocessing.PowerFunctionTransformer
    • sklearn2pmml.preprocessing.ReplaceTransformer
    • sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
    • sklearn2pmml.preprocessing.SecondsSinceYearTransformer
    • sklearn2pmml.preprocessing.StringNormalizer
    • sklearn2pmml.preprocessing.SubstringTransformer
    • sklearn2pmml.preprocessing.WordCountTransformer
    • sklearn2pmml.preprocessing.h2o.H2OFrameConstructor
    • sklearn2pmml.util.Reshaper
    • sklearn2pmml.util.Slicer
  • Rule sets:
    • sklearn2pmml.ruleset.RuleSetClassifier
  • Decision trees:
    • sklearn2pmml.tree.chaid.CHAIDClassifier
    • sklearn2pmml.tree.chaid.CHAIDRegressor
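
The expression forms listed under sklearn2pmml.preprocessing.ExpressionTransformer are ordinary Python expressions in which the current row is bound to the name X. The sketch below does not use sklearn2pmml at all; it evaluates a few such expression strings with plain Python, only to illustrate what they compute per row. The helper evaluate_rowwise and the column names are hypothetical.

```python
def evaluate_rowwise(expr, rows):
    """Evaluate an expression string once per row, with the row bound to the name X."""
    code = compile(expr, "<expression>", "eval")
    # Only len() is exposed as a builtin; everything else comes from the row itself
    return [eval(code, {"__builtins__": {"len": len}}, {"X": row}) for row in rows]

# Made-up rows, addressed by column name as in X[<column name>]
rows = [
    {"amount": 120.0, "currency": " usd ", "country": "EE"},
    {"amount": None, "currency": "EUR", "country": "FI"},
]

# Ternary conditional combined with an identity comparison
doubled = evaluate_rowwise("X['amount'] * 2 if X['amount'] is not None else -1", rows)
# String functions strip and upper
currency = evaluate_rowwise("X['currency'].strip().upper()", rows)
# Comparison operator in <list>
baltic = evaluate_rowwise("X['country'] in ['EE', 'LV', 'LT']", rows)
# String slicing and string length
prefix_length = evaluate_rowwise("len(X['currency'].strip()[0:2])", rows)

print(doubled, currency, baltic, prefix_length)
```

With the real transformer, the same kind of expression string would be passed to ExpressionTransformer and translated to PMML during conversion.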

Sklearn-Pandas

Examples: main.py

  • sklearn_pandas.CategoricalImputer
  • sklearn_pandas.DataFrameMapper

StatsModels

Examples: main-statsmodels.py

TPOT

Examples: extensions/tpot.py

  • tpot.builtins.stacking_estimator.StackingEstimator

XGBoost

Examples: main-xgboost.py, extensions/category_encoders-xgboost.py and extensions/categorical.py

Prerequisites

The Python side of operations

Validating Python installation:

import joblib, sklearn, sklearn_pandas, sklearn2pmml

print(joblib.__version__)
print(sklearn.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
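
The import-based check above fails on the first missing package. As a possible alternative, the sketch below queries installed package metadata instead and reports every package, missing or not. The distribution names in REQUIRED are assumed to be the PyPI names, and the helper installed_versions is hypothetical.

```python
from importlib import metadata

# Assumed PyPI distribution names; they differ from the import names
# for scikit-learn (sklearn) and sklearn-pandas (sklearn_pandas)
REQUIRED = ["joblib", "scikit-learn", "sklearn-pandas", "sklearn2pmml"]

def installed_versions(distributions=REQUIRED):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(name, version if version is not None else "NOT INSTALLED")
```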

The JPMML-SkLearn side of operations

  • Java 1.8 or newer.
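
One way to check the Java requirement from Python is to parse the output of java -version. The sketch below is only an illustration; the helper names are hypothetical, and the parsing assumes the usual version "..." banner, where releases up to Java 8 report themselves as 1.x.

```python
import re
import shutil
import subprocess

def parse_java_major(version_output):
    """Extract the major Java version from a 'java -version' banner, or None if absent."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_292" means Java 8
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major

def java_major_version():
    """Return the major version of the java executable on PATH, or None if there is none."""
    java = shutil.which("java")
    if java is None:
        return None
    completed = subprocess.run([java, "-version"], capture_output=True, text=True)
    # The banner is normally written to stderr
    return parse_java_major(completed.stderr + completed.stdout)

print(parse_java_major('openjdk version "1.8.0_292"'))
```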

Installation

Enter the project root directory and build using Apache Maven:

mvn clean install

The build produces a library JAR file pmml-sklearn/target/pmml-sklearn-1.7-SNAPSHOT.jar, and an executable uber-JAR file pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar.

Usage

A typical workflow can be summarized as follows:

  1. Use Python to train a model.
  2. Serialize the model in pickle data format to a file in a local filesystem.
  3. Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.

The Python side of operations

Loading data to a pandas.DataFrame object:

import pandas

df = pandas.read_csv("Iris.csv")

iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
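
A detail worth knowing about the snippet above: Index.difference returns the remaining column labels in sorted order, not in file order. That is harmless when the columns are later selected by name, but it matters for any step that relies on column positions. The sketch below uses a tiny made-up stand-in for Iris.csv (the column names are assumed from the mapper definition in this section) and shows an order-preserving alternative.

```python
import pandas

# Tiny stand-in for Iris.csv; the values are made up
df = pandas.DataFrame({
    "Sepal.Length": [5.1, 7.0],
    "Sepal.Width": [3.5, 3.2],
    "Petal.Length": [1.4, 4.7],
    "Petal.Width": [0.2, 1.4],
    "Species": ["setosa", "versicolor"],
})

# Index.difference sorts the remaining labels
feature_columns = df.columns.difference(["Species"])
print(list(feature_columns))

# DataFrame.drop keeps the original column order
iris_X = df.drop(columns=["Species"])
print(list(iris_X.columns))
```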

First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:

from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain

column_preprocessor = DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])

Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy

table_preprocessor = Pipeline([
    ("pca", PCA(n_components = 3)),
    ("selector", SelectorProxy(SelectKBest(k = 2)))
])

Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.

Third, creating an Estimator object:

from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(min_samples_leaf = 5)

Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:

from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
    ("columns", column_preprocessor),
    ("table", table_preprocessor),
    ("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)

Recording feature importance information in a pickle data format-compatible manner:

classifier.pmml_feature_importances_ = classifier.feature_importances_
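
The likely reason for this extra assignment is that feature_importances_ is computed on access and is therefore not part of the pickled instance state, whereas a regular attribute is. The sketch below illustrates this with plain scikit-learn and the standard pickle module, using the bundled Iris dataset instead of Iris.csv; it is an illustration of the mechanism, not a description of how the converter reads the file.

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
classifier = DecisionTreeClassifier(min_samples_leaf=5, random_state=13).fit(X, y)

# The computed property does not live in the instance dictionary
print("feature_importances_" in vars(classifier))

# Copying the values to a regular attribute makes them part of the instance state
classifier.pmml_feature_importances_ = classifier.feature_importances_

# The regular attribute survives a pickle round trip
restored = pickle.loads(pickle.dumps(classifier))
print(restored.pmml_feature_importances_)
```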

Embedding model verification data:

pipeline.verify(iris_X.sample(n = 15))
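
Note that DataFrame.sample draws different rows on every run unless it is seeded. If a reproducible verification dataset is wanted, passing random_state is one option. The sketch below shows this on a made-up frame; the seed value is arbitrary.

```python
import pandas

# Made-up frame standing in for iris_X
frame = pandas.DataFrame({"x": range(100)})

# The same seed yields the same 15 rows
first = frame.sample(n=15, random_state=42)
second = frame.sample(n=15, random_state=42)

print(first.equals(second))
```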

Storing the fitted PMMLPipeline object in pickle data format:

import joblib

joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
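
Before handing the file to the converter, it can be useful to confirm that it loads back in Python. The sketch below does a round trip through joblib with a stand-in object and a temporary directory; in the workflow above the object would be the fitted PMMLPipeline.

```python
import os
import tempfile

import joblib

# Stand-in object; in the workflow above this would be the fitted pipeline
payload = {"steps": ["columns", "table", "classifier"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl.z")
    joblib.dump(payload, path, compress=9)
    # joblib.load detects the compression on its own
    restored = joblib.load(path)

print(restored == payload)
```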

Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.

The JPMML-SkLearn side of operations

Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
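
The same conversion can be launched from a Python script. The sketch below wraps the command line shown above with the standard subprocess module; the helper names build_convert_command and convert are hypothetical, and the default JAR path assumes the build output location named in the Installation section.

```python
import subprocess
from pathlib import Path

# Build output location from the Installation section
DEFAULT_JAR = "pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar"

def build_convert_command(pkl_path, pmml_path, jar_path=DEFAULT_JAR):
    """Assemble the converter command line as an argument list."""
    return [
        "java", "-jar", str(jar_path),
        "--pkl-input", str(pkl_path),
        "--pmml-output", str(pmml_path),
    ]

def convert(pkl_path, pmml_path, jar_path=DEFAULT_JAR):
    """Run the converter; raises if the JAR is missing or the process fails."""
    if not Path(jar_path).is_file():
        raise FileNotFoundError(f"Executable JAR not found: {jar_path}")
    subprocess.run(build_convert_command(pkl_path, pmml_path, jar_path), check=True)

print(build_convert_command("pipeline.pkl.z", "pipeline.pmml"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.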

Getting help:

java -jar pmml-sklearn-example/target/pmml-sklearn-example-executable-1.7-SNAPSHOT.jar --help

Documentation

Up-to-date:

Slightly outdated:

License

JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact info@openscoring.io
