Python package for converting Apache Spark ML pipelines to PMML.
This package is a thin PySpark wrapper for the JPMML-SparkML library.
See the NEWS.md file.
- PySpark 3.0.X through 3.5.X, 4.0.X or 4.1.X.
- Python 3.8 or newer.
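The prerequisites above can be checked before installing. The following is a minimal sketch, not part of PySpark2PMML: the helper names `release_line` and `check_prerequisites` are hypothetical, and the supported release lines are copied from the list above.

```python
import sys
from importlib import metadata

# Release lines taken from the prerequisites list above.
SUPPORTED_PYSPARK_LINES = {(3, minor) for minor in range(0, 6)} | {(4, 0), (4, 1)}
MIN_PYTHON = (3, 8)

def release_line(version):
    """Return the (major, minor) pair of a version string such as '3.5.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def check_prerequisites(python_version_info, pyspark_version):
    """Return a list of problems; the list is empty when both checks pass."""
    problems = []
    if tuple(python_version_info[:2]) < MIN_PYTHON:
        problems.append("Python 3.8 or newer is required")
    if release_line(pyspark_version) not in SUPPORTED_PYSPARK_LINES:
        problems.append("PySpark %s is outside the supported release lines" % pyspark_version)
    return problems

if __name__ == "__main__":
    try:
        # Reads the version of the locally installed pyspark distribution.
        installed = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        print("PySpark is not installed")
    else:
        print(check_prerequisites(sys.version_info, installed) or "OK")
```

Passing the versions in as arguments keeps the check usable for a target environment other than the one running the script.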
Install a release version from PyPI:
pip install pyspark2pmml
Alternatively, install the latest snapshot version from GitHub:
pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git
One and the same PySpark2PMML version works across all supported PySpark release lines. Version variance is confined to the underlying JPMML-SparkML library, where each Apache Spark release line maps to a dedicated JPMML-SparkML release line.
PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:
| Apache Spark version | JPMML-SparkML branch | Latest JPMML-SparkML version |
|---|---|---|
| 4.1.X | master | 3.3.3 |
| 4.0.X | 3.2.X | 3.2.10 |
| 3.5.X | 3.1.X | 3.1.11 |
| 3.4.X | 3.0.X | 3.0.11 |
Additionally, PySpark2PMML should be interoperable with now-legacy Apache Spark 3.0 through 3.3 release lines. Please see the JPMML-SparkML documentation for extended compatibility matrices.
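The compatibility matrix can also be expressed as a small lookup table. This is an illustrative sketch only: the function name `jpmml_sparkml_for` is hypothetical, and the table holds just the four rows shown above, so legacy release lines raise an error instead of being guessed.

```python
# Rows copied from the compatibility matrix above:
# Apache Spark release line -> (JPMML-SparkML branch, latest JPMML-SparkML version)
COMPATIBILITY = {
    "4.1": ("master", "3.3.3"),
    "4.0": ("3.2.X", "3.2.10"),
    "3.5": ("3.1.X", "3.1.11"),
    "3.4": ("3.0.X", "3.0.11"),
}

def jpmml_sparkml_for(spark_version):
    """Map a Spark version such as '3.5.1' to its (branch, latest version) pair."""
    line = ".".join(spark_version.split(".")[:2])
    try:
        return COMPATIBILITY[line]
    except KeyError:
        raise LookupError(
            "Apache Spark %s is not in the matrix; see the JPMML-SparkML documentation" % spark_version
        ) from None

if __name__ == "__main__":
    branch, version = jpmml_sparkml_for("3.5.1")
    print("branch:", branch, "latest version:", version)
```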
PySpark2PMML versions 0.11.0 and newer bundle JPMML-SparkML JAR files for quick programmatic setup.
Use the pyspark2pmml.spark_jars() utility function to obtain a classpath string that matches the installed PySpark version, and pass it as the spark.jars configuration entry when building a Spark session:
from pyspark.sql import SparkSession
import pyspark2pmml
spark = SparkSession.builder \
    .config("spark.jars", pyspark2pmml.spark_jars()) \
    .getOrCreate()
Use the pyspark2pmml.spark_jars_packages() utility function to obtain a PySpark-version dependent Apache Maven package coordinates string:
import pyspark2pmml
print(pyspark2pmml.spark_jars_packages())
Pass this value to pyspark or spark-submit using the --packages command-line option:
$SPARK_HOME/bin/pyspark --packages $(python -c "import pyspark2pmml; print(pyspark2pmml.spark_jars_packages())")
Fitting a Spark ML pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula
df = spark.read.csv("Iris.csv", header = True, inferSchema = True)
formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)
Exporting the fitted Spark ML pipeline to a PMML file:
from pyspark2pmml import PMMLBuilder
pmmlBuilder = PMMLBuilder(df.schema, pipelineModel)
pmmlBuilder.buildFile("DecisionTreeIris.pmml")
The representation of individual Spark ML pipeline stages can be customized via conversion options:
from pyspark2pmml import PMMLBuilder
classifierModel = pipelineModel.stages[1]
pmmlBuilder = PMMLBuilder(df.schema, pipelineModel) \
    .putOption(classifierModel, "compact", False) \
    .putOption(classifierModel, "estimate_featureImportances", True)
pmmlBuilder.buildFile("DecisionTreeIris.pmml")
PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use PySpark2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.
PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact info@openscoring.io