This repository has been archived by the owner. It is now read-only.
The simplest way to get started with a JPMML-SparkML powered software project (legacy codebase)
Java R
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
src
LICENSE.txt
README.md
pom.xml

README.md

JPMML-SparkML-Bootstrap

The simplest way to get started with a [JPMML-SparkML] (https://github.com/jpmml/jpmml-evaluator) powered software project.

IMPORTANT

This is a legacy codebase.

Starting from September 2016, this project has been superseded by the [JPMML-SparkML-Package] (https://github.com/jpmml/jpmml-sparkml-package) project.

Prerequisites

Installation

Check out the JPMML-SparkML-Bootstrap project and enter its directory:

git clone https://github.com/jpmml/jpmml-sparkml-bootstrap.git
cd jpmml-sparkml-bootstrap

Build the project:

mvn clean install

The build produces an uber-JAR file target/bootstrap-1.0-SNAPSHOT.jar.

Development

Initialize [Eclipse IDE] (https://eclipse.org/ide/) support files .project and .classpath:

mvn eclipse:eclipse

Launch the Eclipse IDE, and open the project import wizard via File > Import... > General / Existing Projects into Workspace. In the project wizard window, activate the radio button Select root directory and specify the location of the JPMML-SparkML-Bootstrap directory. Click Finish to close the project wizard window.

The Eclipse IDE will show the imported JPMML-SparkML-Bootstrap project in the package explorer view as jpmml-sparkml-bootstrap.

Usage

The uber-JAR file contains an executable class org.jpmml.sparkml.bootstrap.Main, which fits a simple two-stage Spark ML pipeline model where the first stage is a [RFormula] (https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/RFormula.html) feature selector and the second stage is either a [DecisionTreeRegressor] (https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html) or [DecisionTreeClassifier] (https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html) estimator.

This application is suitable for the quick exploration of datasets.

Launching this application using the [spark-submit] (http://spark.apache.org/docs/latest/submitting-applications.html) script:

spark-submit \
  --class org.jpmml.sparkml.bootstrap.Main \
  target/bootstrap-1.0-SNAPSHOT.jar \
  --csv-input <path to data CSV input file> \
  --formula <model formula in R formula notation> \
  --function <model function> \
  --pmml-output <path to model PMML output file>

Wine quality dataset

The [wine quality dataset] (https://archive.ics.uci.edu/ml/datasets/Wine+Quality) is suitable both for regression and classification analyses.

Predicting the quality score (integer in range 1 to 10) of wines:

spark-submit --master local --class org.jpmml.sparkml.bootstrap.Main target/bootstrap-1.0-SNAPSHOT.jar --csv-input src/test/resources/wine.csv --formula "quality ~ ." --function REGRESSION --pmml-output wine-quality.pmml

Predicting the color ("white" or "red") of wines:

spark-submit --master local --class org.jpmml.sparkml.bootstrap.Main target/bootstrap-1.0-SNAPSHOT.jar --csv-input src/test/resources/wine.csv --formula "color ~ . -quality" --function CLASSIFICATION --pmml-output wine-color.pmml

Adult (aka Census) dataset

The [adult dataset] (https://archive.ics.uci.edu/ml/datasets/Adult) is suitable for classification analyses.

Predicting the income level ("<=50K" or ">50K") of US residents:

spark-submit --master local --class org.jpmml.sparkml.bootstrap.Main target/bootstrap-1.0-SNAPSHOT.jar --csv-input src/test/resources/census.csv --formula "income ~ ." --function CLASSIFICATION --pmml-output census.pmml

License

JPMML-SparkML-Bootstrap is licensed under the [GNU Affero General Public License (AGPL) version 3.0] (http://www.gnu.org/licenses/agpl-3.0.html). Other licenses are available on request.

Additional information

Please contact [info@openscoring.io] (mailto:info@openscoring.io)