Python library for converting Apache Spark ML pipelines to PMML




This package provides Python wrapper classes and functions for the JPMML-SparkML library. For the full list of supported Apache Spark ML Estimator and Transformer types, refer to the JPMML-SparkML documentation.


Prerequisites

  • Apache Spark 2.0.X, 2.1.X, 2.2.X or 2.3.X.
  • Python 2.7, 3.4 or newer.


Installation

Install the latest version from GitHub:

pip install --user --upgrade git+

Configuration and usage

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

Apache Spark version | JPMML-SparkML development branch | JPMML-SparkML uber-JAR file
2.0.X                | 1.1.X                            | 1.1.20
2.1.X                | 1.2.X                            | 1.2.12
2.2.X                | 1.3.X                            | 1.3.8
2.3.X                | master                           | 1.4.5
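
The matrix above can also be expressed as a small lookup. This is a minimal sketch and not part of PySpark2PMML: the names `COMPATIBILITY`, `jpmml_sparkml_version` and `uber_jar_name` are hypothetical helpers introduced here for illustration, and the file name pattern is the one used in the `--jars` example of this README.

```python
# Compatibility matrix transcribed from the table above:
# Spark "major.minor" version -> (development branch, uber-JAR version).
COMPATIBILITY = {
    "2.0": ("1.1.X", "1.1.20"),
    "2.1": ("1.2.X", "1.2.12"),
    "2.2": ("1.3.X", "1.3.8"),
    "2.3": ("master", "1.4.5"),
}


def jpmml_sparkml_version(spark_version):
    """Return the JPMML-SparkML uber-JAR version for a Spark version string."""
    # Keep only "major.minor", so that "2.3.1" matches the "2.3.X" row.
    minor = ".".join(spark_version.split(".")[:2])
    try:
        return COMPATIBILITY[minor][1]
    except KeyError:
        raise ValueError("Unsupported Apache Spark version: " + spark_version)


def uber_jar_name(spark_version):
    """Build the uber-JAR file name (hypothetical helper, see the note above)."""
    return "jpmml-sparkml-executable-{}.jar".format(
        jpmml_sparkml_version(spark_version))
```

Inside a running PySpark session, the version string could be taken from `spark.version`, for example `uber_jar_name(spark.version)`.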

Launch PySpark; use the --jars command-line option to specify the location of the JPMML-SparkML uber-JAR file:

pyspark --jars /path/to/jpmml-sparkml-executable-${version}.jar
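
When PySpark is started from a plain Python process (a script or a notebook) rather than from the `pyspark` launcher, the same option can usually be passed through the `PYSPARK_SUBMIT_ARGS` environment variable. The sketch below assumes that approach; `submit_args_with_jar` is a hypothetical helper, the JAR path is a placeholder, and the version 1.4.5 is taken from the Apache Spark 2.3.X row of the matrix.

```python
import os


def submit_args_with_jar(jar_path):
    """Compose a PYSPARK_SUBMIT_ARGS value that adds one JAR file."""
    # The trailing "pyspark-shell" token is the resource PySpark's launcher expects.
    return "--jars {} pyspark-shell".format(jar_path)


# Placeholder location; substitute the real path of the uber-JAR file.
jar_path = "/path/to/jpmml-sparkml-executable-1.4.5.jar"

# Must be set before the first SparkSession or SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args_with_jar(jar_path)
```

Setting the variable after a Spark context already exists has no effect on that context, so this belongs at the very top of the script.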

Fitting an example pipeline model:

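from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

# "spark" is the SparkSession that the PySpark shell creates on startup
df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)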

Exporting the fitted example pipeline model to a PMML file:

from pyspark2pmml import PMMLBuilder

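# "sc" is the SparkContext that the PySpark shell creates on startup
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
	.putOption(classifier, "compact", True)

# Write the PMML document to a local file (the file name is illustrative)
pmmlBuilder.buildFile("DecisionTreeIris.pmml")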
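
A PMML file is plain XML, so an exported file can be sanity-checked with the Python standard library alone. The following is a minimal sketch, not a PySpark2PMML feature: `local_name` and `summarize_pmml` are hypothetical helpers, and `SAMPLE` is a tiny hand-written stand-in rather than real converter output (it is not schema-valid PMML).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(source):
    """Return the PMML version and the names of the root element's children.

    `source` is a file path or a binary file-like object.
    """
    root = ET.parse(source).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document")
    return root.get("version"), [local_name(child.tag) for child in root]


# Hand-written stand-in for an exported file (assumption, for illustration only).
SAMPLE = b"""<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header/>
  <DataDictionary/>
  <TreeModel functionName="classification"/>
</PMML>"""

version, children = summarize_pmml(io.BytesIO(SAMPLE))
print("PMML {} with elements {}".format(version, children))
```

For a real export, pass the path of the generated file to `summarize_pmml` instead of the in-memory sample.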



License

PySpark2PMML is dual-licensed under the GNU Affero General Public License (AGPL) version 3.0 and a commercial license.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using JPMML software in your application? Please contact