# Import a Spark MLlib model to IBM Watson Machine Learning

Importing a model into Watson Machine Learning means storing a trained model in your Watson Machine Learning repository and then deploying the stored model. This notebook demonstrates importing an in-memory Spark MLlib PipelineModel object.

See also: <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-import-spark-mllib.html" target="_blank" rel="noopener noreferrer">Importing a Spark MLlib model</a>

This notebook runs on Spark with Python 3.5.


### Notebook sections

[Setup](#setup)

1. [Load training data](#loaddata)
2. [Build model](#buildmodel)
3. [Train and evaluate model](#trainmodel)
4. [Store and deploy model](#storedeploymodel)

# <a id="setup"></a> Set up
- Install packages
- Import libraries
- Instantiate a Watson Machine Learning client

In [None]:
!pip install wget # needed to download sample files

In [None]:
!pip install watson_machine_learning_client

Paste your Watson Machine Learning credentials in the following cell.

See: <a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-get-wml-credentials.html" target="_blank" rel="noopener noreferrer">Looking up credentials</a>

In [None]:
# Create a Watson Machine Learning client instance
from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_credentials = {
    "instance_id" : "",
    "password"    : "",
    "url"         : "",
    "username"    : ""
}
client = WatsonMachineLearningAPIClient( wml_credentials )
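
Rather than pasting secrets directly into a notebook cell, one option is to read them from environment variables. The following is a minimal sketch under stated assumptions: the environment variable names (`WML_INSTANCE_ID` and so on) and the helper `load_wml_credentials` are hypothetical and are not defined by Watson Machine Learning. Only the four dictionary keys come from the cell above.

```python
import os

# Keys expected in wml_credentials (from the cell above), mapped to
# hypothetical environment variable names chosen for this illustration.
ENV_NAMES = {
    "instance_id": "WML_INSTANCE_ID",
    "password": "WML_PASSWORD",
    "url": "WML_URL",
    "username": "WML_USERNAME",
}

def load_wml_credentials(env=None):
    """Build the wml_credentials dictionary from environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {key: env[name] for key, name in ENV_NAMES.items()}

# Stand-in environment with placeholder values, used here so the example
# runs anywhere. In a real notebook, call load_wml_credentials() with no
# arguments to read from os.environ.
example_env = {
    "WML_INSTANCE_ID": "my-instance",
    "WML_PASSWORD": "my-password",
    "WML_URL": "https://example.invalid",
    "WML_USERNAME": "my-username",
}
wml_credentials_example = load_wml_credentials(example_env)
print(sorted(wml_credentials_example))
```

The resulting dictionary has the same shape as `wml_credentials`, so it could be passed to the client constructor in the same way.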

## <a id="loaddata"></a> 1. Load and prepare sample training data

**About the sample model**

The sample model built here is a logistic regression model that predicts whether a customer will purchase a tent from a fictional outdoor equipment store, based on customer characteristics.

The data used to train the model is the "GoSales.csv" training data in the IBM Watson Studio community: <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/aa07a773f71cf1172a349f33e2028e4e" target="_blank" rel="noopener noreferrer">GoSales sample data</a>.

In [2]:
# Download sample training data to notebook working directory
import wget
training_data_url = 'https://dataplatform.cloud.ibm.com/data/exchange-api/v1/entries/aa07a773f71cf1172a349f33e2028e4e/data?accessKey=e98b7315f84e5448aa94c633ca66ea83'
filename = wget.download( training_data_url )
print( filename )

GoSales.csv


In [3]:
# Read sample data into a Spark DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load( filename )
df.printSchema()

root
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)
 |-- IS_TENT: boolean (nullable = true)
 |-- PRODUCT_LINE: string (nullable = true)
 |-- PURCHASE_AMOUNT: double (nullable = true)



In [4]:
# Select columns of interest
from pyspark.sql.types import IntegerType
training_data = df.select( "GENDER", "AGE", "MARITAL_STATUS", "PROFESSION", df.IS_TENT.cast( IntegerType() ) )
training_data.printSchema()

root
 |-- GENDER: string (nullable = true)
 |-- AGE: integer (nullable = true)
 |-- MARITAL_STATUS: string (nullable = true)
 |-- PROFESSION: string (nullable = true)
 |-- IS_TENT: integer (nullable = true)



## <a id="buildmodel"></a> 2. Build a PipelineModel object

In [5]:
# Create indexers for string columns
from pyspark.ml.feature import StringIndexer
indexer_GENDER         = StringIndexer( inputCol="GENDER",         outputCol="GENDER_index"         )
indexer_MARITAL_STATUS = StringIndexer( inputCol="MARITAL_STATUS", outputCol="MARITAL_STATUS_index" )
indexer_PROFESSION     = StringIndexer( inputCol="PROFESSION",     outputCol="PROFESSION_index"     )
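
By default, `StringIndexer` orders labels by descending frequency, so the most common value in a column receives index 0.0. The following pure-Python sketch illustrates that idea only; it is not Spark's implementation, and the helper `index_by_frequency` and the sample values are made up for illustration.

```python
from collections import Counter

def index_by_frequency(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically so the
    # result of this sketch is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

genders = ["M", "F", "M", "M", "F", "M"]  # made-up sample values
mapping = index_by_frequency(genders)
print(mapping)                          # "M" is most frequent, so it maps to 0.0
print([mapping[g] for g in genders])    # the indexed column
```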

In [6]:
# Create an assembler that combines the indexed and numeric columns into a single feature vector column
from pyspark.ml.feature import VectorAssembler
feature_vector_assembler = VectorAssembler( inputCols=[ "GENDER_index", "AGE", "MARITAL_STATUS_index", "PROFESSION_index" ],  outputCol="features" )
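
Conceptually, the assembler concatenates the input columns, in the order listed, into one numeric vector per row. A minimal pure-Python sketch of that step follows; `assemble_features` is a hypothetical helper, not part of Spark, and the row values are the ones that appear in the scoring example later in this notebook.

```python
def assemble_features(row, input_cols):
    """Concatenate the named numeric columns of a row into one feature list."""
    return [float(row[col]) for col in input_cols]

input_cols = ["GENDER_index", "AGE", "MARITAL_STATUS_index", "PROFESSION_index"]
row = {"GENDER_index": 0.0, "AGE": 27, "MARITAL_STATUS_index": 1.0, "PROFESSION_index": 1.0}

features = assemble_features(row, input_cols)
print(features)
```

Because the order of `inputCols` fixes the position of each feature, the same order must be used at training time and at scoring time. Storing the whole pipeline, as this notebook does, takes care of that.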

In [7]:
# Create a logistic regression estimator (it is trained later, as the last stage of the pipeline)
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression( featuresCol='features', labelCol='IS_TENT' )

In [8]:
# Create a Pipeline; fitting it in the next section produces the PipelineModel
from pyspark.ml import Pipeline
pipeline = Pipeline( stages=[ indexer_GENDER, indexer_MARITAL_STATUS, indexer_PROFESSION, feature_vector_assembler, lr ] )
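
A `Pipeline` is an unfitted recipe, while a `PipelineModel` is the fitted result. The toy sketch below illustrates that distinction in plain Python. All class and function names are hypothetical, the toy indexer orders labels alphabetically rather than by frequency, and it handles only stages that need fitting.

```python
class ToyIndexer:
    """Unfitted stage: learns a label-to-index mapping for one column."""
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        labels = sorted({row[self.input_col] for row in rows})
        mapping = {label: float(i) for i, label in enumerate(labels)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)

class ToyIndexerModel:
    """Fitted stage: applies the learned mapping to new rows."""
    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [dict(row, **{self.output_col: self.mapping[row[self.input_col]]})
                for row in rows]

def fit_pipeline(stages, rows):
    """Fit each stage in order, passing the transformed rows on to the next stage."""
    fitted = []
    for stage in stages:
        model = stage.fit(rows)
        rows = model.transform(rows)
        fitted.append(model)
    return fitted

def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order."""
    for model in fitted:
        rows = model.transform(rows)
    return rows

train_rows = [{"GENDER": "M"}, {"GENDER": "F"}]  # made-up sample rows
fitted = fit_pipeline([ToyIndexer("GENDER", "GENDER_index")], train_rows)
result = transform_pipeline(fitted, [{"GENDER": "F"}])
print(result)
```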

## <a id="trainmodel"></a> 3. Train and evaluate the model

In [9]:
# Split the training data into a training set and a test set
train, test = training_data.randomSplit( [ 0.75, 0.25 ], seed = 2019 )
print( "Train count: " + str( train.count() ) )
print( "Test count: "  + str( test.count()  ) )

Train count: 45186
Test count: 15066
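
The split sizes only approximate the requested weights (45186 of 60252 rows is about 75%) because each row is assigned to a split at random rather than by exact count. The sketch below shows that idea in plain Python with a seeded generator. It is an illustration under that assumption, not Spark's sampler, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative boundaries in [0, 1], for example [0.75, 1.0] for weights [0.75, 0.25].
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any draw missed through float rounding.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))  # stand-in for DataFrame rows
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=2019)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is why the notebook passes `seed = 2019`.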


In [10]:
# Fit the Pipeline to the training set to produce a trained PipelineModel
pipeline_model = pipeline.fit( train )

In [11]:
# Evaluate the model performance
predictions = pipeline_model.transform( test )
correct_false = predictions.filter( "IS_TENT == 0 AND prediction == 0.0" )
correct_true = predictions.filter( "IS_TENT == 1 AND prediction != 0.0" )
print( "Success rate: " + str( round( 100 * ( ( correct_false.count() + correct_true.count() ) / predictions.count() ) ) ) + "%" )

Success rate: 78%
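
The success rate above is the share of test rows whose predicted class matches the `IS_TENT` label. The same arithmetic in plain Python, using made-up (label, prediction) pairs and a hypothetical helper `success_rate`:

```python
def success_rate(pairs):
    """Percentage of (label, prediction) pairs where the prediction matches the label."""
    correct = sum(1 for label, prediction in pairs if float(label) == float(prediction))
    return round(100 * correct / len(pairs))

sample = [(0, 0.0), (1, 1.0), (1, 0.0), (0, 0.0)]  # made-up (IS_TENT, prediction) pairs
print("Success rate: " + str(success_rate(sample)) + "%")
```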


## <a id="storedeploymodel"></a> 4. Store and deploy the model in Watson Machine Learning

In [26]:
# Store the PipelineModel in the Watson Machine Learning repository
model_details = client.repository.store_model( pipeline_model, 'Spark MLlib model', training_data=train, pipeline=pipeline )

In [None]:
# Deploy the stored model as an online web service deployment
model_id = model_details["metadata"]["guid"]
deployment_details = client.deployments.create( artifact_uid=model_id, name="Spark MLlib model deployment" )

In [29]:
# Test the deployment
model_endpoint_url = client.deployments.get_scoring_url( deployment_details )
payload = { "fields" : [ "GENDER", "AGE", "MARITAL_STATUS", "PROFESSION" ], "values" : [ [ "M", 27, "Single", "Professional" ] ] }
client.deployments.score( model_endpoint_url, payload )

{'fields': ['GENDER',
  'AGE',
  'MARITAL_STATUS',
  'PROFESSION',
  'GENDER_index',
  'MARITAL_STATUS_index',
  'PROFESSION_index',
  'features',
  'rawPrediction',
  'probability',
  'prediction'],
 'values': [['M',
   27,
   'Single',
   'Professional',
   0.0,
   1.0,
   1.0,
   [0.0, 27.0, 1.0, 1.0],
   [0.16773330636208073, -0.16773330636208073],
   [0.541835288108435, 0.458164711891565],
   0.0]]}
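
The response stores column names in `fields` and one list per scored row in `values`. Pairing them up gives one dictionary per row, which is easier to read. A minimal sketch follows, using an abbreviated copy of the response above; `response_to_records` is a hypothetical helper, not part of the Watson Machine Learning client.

```python
def response_to_records(response):
    """Turn a {'fields': [...], 'values': [[...], ...]} response into a list of dicts."""
    return [dict(zip(response["fields"], row)) for row in response["values"]]

# Abbreviated copy of the scoring response shown above.
response = {
    "fields": ["GENDER", "AGE", "MARITAL_STATUS", "PROFESSION", "probability", "prediction"],
    "values": [["M", 27, "Single", "Professional",
                [0.541835288108435, 0.458164711891565], 0.0]],
}

record = response_to_records(response)[0]
predicted_class = int(record["prediction"])
print(record["prediction"], record["probability"][predicted_class])
```

Here the predicted class is 0 (no tent purchase), with a probability of about 0.54.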

In [30]:
# Score the same row locally; the results match the deployed model
test_df = spark.createDataFrame( [ ( "M", 27, "Single", "Professional" ) ], [ "GENDER", "AGE", "MARITAL_STATUS", "PROFESSION" ] )
pipeline_model.transform( test_df ).show()

+------+---+--------------+------------+------------+--------------------+----------------+------------------+--------------------+--------------------+----------+
|GENDER|AGE|MARITAL_STATUS|  PROFESSION|GENDER_index|MARITAL_STATUS_index|PROFESSION_index|          features|       rawPrediction|         probability|prediction|
+------+---+--------------+------------+------------+--------------------+----------------+------------------+--------------------+--------------------+----------+
|     M| 27|        Single|Professional|         0.0|                 1.0|             1.0|[0.0,27.0,1.0,1.0]|[0.16773330636208...|[0.54183528810843...|       0.0|
+------+---+--------------+------------+------------+--------------------+----------------+------------------+--------------------+--------------------+----------+
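
For a binary logistic regression model, the `probability` column is the logistic (sigmoid) function applied to the class-1 score in `rawPrediction`. The predicted class is the more probable one, assuming the default decision threshold of 0.5. The sketch below checks this against the full-precision values from the scoring response earlier in this section.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Full-precision rawPrediction from the scoring response.
raw_prediction = [0.16773330636208073, -0.16773330636208073]

margin = raw_prediction[1]           # score for class 1 (IS_TENT == 1)
p1 = sigmoid(margin)                 # probability of class 1
probability = [1.0 - p1, p1]         # probabilities for class 0 and class 1
prediction = float(probability.index(max(probability)))

print(probability, prediction)
```

The computed probabilities agree with the reported `[0.541835288108435, 0.458164711891565]` to within floating-point rounding, and the prediction is class 0.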



## Summary
In this notebook, you imported an in-memory Spark MLlib PipelineModel into Watson Machine Learning by using the Watson Machine Learning Python client.

### <a id="authors"></a>Authors

**Sarah Packowski** is a member of the IBM Watson Studio Content Design team in Canada.


<hr>
Copyright &copy; IBM Corp. 2019. This notebook and its source code are released under the terms of the MIT License.
