# Save a scikit-learn model in PMML format

This notebook demonstrates saving a trained scikit-learn model in PMML format.

This notebook runs on Python 3.5.


## Notebook sections

1. [Load and prepare training data](#loaddata)
2. [Train and evaluate model](#trainmodel)
3. [Save model in PMML format](#savemodel)



**About the sample model**

The sample model built here is a logistic regression model that predicts whether a customer will purchase a tent from a fictional outdoor equipment store, based on the customer's characteristics.

The data used to train the model is the "GoSales.csv" training data in the IBM Watson Studio community: <a href="https://dataplatform.cloud.ibm.com/exchange/public/entry/view/aa07a773f71cf1172a349f33e2028e4e" target="_blank" rel="noopener noreferrer">GoSales sample data</a>.

### <a id="loaddata"></a> 1. Load and prepare sample training data

In [2]:
!pip install wget

In [3]:
# Download sample training data to notebook working directory
import wget
training_data_url = 'https://dataplatform.cloud.ibm.com/data/exchange-api/v1/entries/aa07a773f71cf1172a349f33e2028e4e/data?accessKey=e98b7315f84e5448aa94c633ca66ea83'
filename = wget.download( training_data_url )
print( filename )

GoSales.csv


In [4]:
# Read sample data into a pandas DataFrame
import pandas as pd
df = pd.read_csv( filename )
df[0:5]

Unnamed: 0,GENDER,AGE,MARITAL_STATUS,PROFESSION,IS_TENT,PRODUCT_LINE,PURCHASE_AMOUNT
0,M,27,Single,Professional,True,Camping Equipment,144.78
1,F,39,Married,Other,False,Outdoor Protection,144.83
2,F,39,Married,Other,False,Outdoor Protection,137.37
3,F,56,Unspecified,Hospitality,False,Personal Accessories,92.61
4,M,45,Married,Retired,False,Golf Equipment,119.04


In [5]:
# Select columns of interest
training_data = df[["GENDER","AGE","MARITAL_STATUS","PROFESSION","IS_TENT"]].copy()
print( training_data[0:5] )

  GENDER  AGE MARITAL_STATUS    PROFESSION  IS_TENT
0      M   27         Single  Professional     True
1      F   39        Married         Other    False
2      F   39        Married         Other    False
3      F   56    Unspecified   Hospitality    False
4      M   45        Married       Retired    False


In [6]:
# Create label encoders for string columns
from sklearn.preprocessing import LabelEncoder
import numpy as np
le_GENDER = LabelEncoder().fit( training_data["GENDER"] )
le_MARITAL_STATUS = LabelEncoder().fit( training_data["MARITAL_STATUS"] )
le_PROFESSION = LabelEncoder().fit( training_data["PROFESSION"] )

print( "le_GENDER:" )
print( np.sort( np.array( [ le_GENDER.transform(le_GENDER.classes_), le_GENDER.classes_ ] ).T, axis=0 ) )
print( "\nle_MARITAL_STATUS:" )
print( np.sort( np.array( [ le_MARITAL_STATUS.transform(le_MARITAL_STATUS.classes_), le_MARITAL_STATUS.classes_ ] ).T, axis=0 ) )
print( "\nle_PROFESSION:" )
print( np.sort( np.array( [ le_PROFESSION.transform(le_PROFESSION.classes_), le_PROFESSION.classes_ ] ).T, axis=0 ) )

le_GENDER:
[[0 'F']
 [1 'M']]

le_MARITAL_STATUS:
[[0 'Married']
 [1 'Single']
 [2 'Unspecified']]

le_PROFESSION:
[[0 'Executive']
 [1 'Hospitality']
 [2 'Other']
 [3 'Professional']
 [4 'Retail']
 [5 'Retired']
 [6 'Sales']
 [7 'Student']
 [8 'Trades']]
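
The integer codes printed above follow alphabetical order because `LabelEncoder` sorts the distinct values it sees and uses each value's position as its code. The following is a minimal plain-Python sketch of that idea, not the scikit-learn implementation. The helper name `fit_label_mapping` and the sample values are assumptions made for illustration.

```python
# Plain-Python sketch of how a label encoder assigns integer codes:
# the distinct values are sorted, and each value's position becomes its code.
def fit_label_mapping(values):
    """Return a dict mapping each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    return {label: index for index, label in enumerate(classes)}

# Hypothetical sample values, chosen to match the MARITAL_STATUS categories above
sample_values = ["Single", "Married", "Married", "Unspecified"]
marital_status_mapping = fit_label_mapping(sample_values)
print(marital_status_mapping)  # {'Married': 0, 'Single': 1, 'Unspecified': 2}
```

Because the codes depend only on the sorted set of values, the same fitted encoders must be reused when encoding new data for prediction.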


In [7]:
# Create encoded columns in the training data
training_data["GENDER_index"] = le_GENDER.transform( training_data["GENDER"] )
training_data["MARITAL_STATUS_index"] = le_MARITAL_STATUS.transform( training_data["MARITAL_STATUS"] )
training_data["PROFESSION_index"] = le_PROFESSION.transform( training_data["PROFESSION"] )
training_data[0:5]

Unnamed: 0,GENDER,AGE,MARITAL_STATUS,PROFESSION,IS_TENT,GENDER_index,MARITAL_STATUS_index,PROFESSION_index
0,M,27,Single,Professional,True,1,1,3
1,F,39,Married,Other,False,0,0,2
2,F,39,Married,Other,False,0,0,2
3,F,56,Unspecified,Hospitality,False,0,2,1
4,M,45,Married,Retired,False,1,0,5


### <a id="trainmodel"></a> 2. Create a logistic regression model and then train and evaluate the model

In [8]:
!pip install git+https://github.com/jpmml/sklearn2pmml.git

In [9]:
# Create a pipeline that can be saved in PMML format implementing a logistic regression model
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn.linear_model import LogisticRegression
pmml_pipeline = PMMLPipeline( [ ("classifier", LogisticRegression() ) ] )

In [10]:
# Split the training data into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( training_data[[ "AGE", "GENDER_index", "MARITAL_STATUS_index", "PROFESSION_index" ]], training_data["IS_TENT"].astype(int) )
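
Note that the call above does not set `random_state`, so each run shuffles the rows differently and the evaluation result can vary between runs. By default, `train_test_split` holds out 25% of the rows for testing. The sketch below illustrates the idea of a reproducible, seeded split in plain Python. It is not scikit-learn's implementation, and the helper name `seeded_split` and the stand-in rows are assumptions.

```python
import random

def seeded_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows with a fixed seed, then split off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled rows become the test set; the rest are for training
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))  # stand-in for 20 data rows
train_rows, test_rows = seeded_split(rows)
print(len(train_rows), len(test_rows))  # 15 5
```

With scikit-learn itself, passing a fixed integer as `random_state` to `train_test_split` gives the same kind of repeatability.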

In [11]:
# Train the model
pmml_pipeline.fit( X_train, y_train )

PMMLPipeline(steps=[('classifier', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])
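
To make the trained model less of a black box: a binary logistic regression computes a weighted sum of the input features plus an intercept, passes that sum through the logistic (sigmoid) function to get a probability, and predicts class 1 when the probability exceeds 0.5. The sketch below illustrates that calculation. The intercept and coefficient values are entirely hypothetical and are not the values learned by the pipeline above.

```python
import math

# HYPOTHETICAL parameters for illustration only; these are NOT the values
# learned by the notebook's pipeline.
INTERCEPT = 1.0
COEFFICIENTS = {
    "AGE": -0.05,
    "GENDER_index": 0.5,
    "MARITAL_STATUS_index": 0.8,
    "PROFESSION_index": 0.1,
}

def predict_probability(features):
    """Return the sigmoid of the intercept plus the weighted sum of the features."""
    linear = INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))

def predict_label(features, threshold=0.5):
    """Return 1 when the predicted probability reaches the threshold, else 0."""
    return 1 if predict_probability(features) >= threshold else 0

example = {"AGE": 20, "GENDER_index": 1, "MARITAL_STATUS_index": 1, "PROFESSION_index": 6}
print(predict_label(example))
```

After fitting, the actual learned values are available from the classifier step as `coef_` and `intercept_`.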

In [12]:
# Evaluate the model performance
predictions = pmml_pipeline.predict( X_test )
num_correct = ( predictions == y_test.values ).sum()
print( "Success rate: " + str( round( 100 * ( num_correct / len( predictions ) ) ) ) + "%" )

Success rate: 85.0%
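
The success rate computed above is the model's accuracy: the fraction of test rows where the prediction matches the true label. As a minimal sketch of the same calculation in plain Python, assuming made-up label lists rather than the notebook's data:

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted and actual labels agree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical labels for illustration; not taken from the GoSales data
predicted = [0, 1, 0, 0, 1, 0, 0, 0]
actual = [0, 1, 0, 1, 1, 0, 0, 0]
print("Success rate: {:.1f}%".format(100 * accuracy(predicted, actual)))  # Success rate: 87.5%
```

scikit-learn provides the same measure as `sklearn.metrics.accuracy_score`.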


In [13]:
# Grab some example data for quick test
df[13:15]

Unnamed: 0,GENDER,AGE,MARITAL_STATUS,PROFESSION,IS_TENT,PRODUCT_LINE,PURCHASE_AMOUNT
13,F,35,Married,Professional,False,Golf Equipment,152.95
14,M,20,Single,Sales,True,Mountaineering Equipment,124.66


In [15]:
negative_example_payload = [ 35, le_GENDER.transform( ["F"] )[0], le_MARITAL_STATUS.transform( ["Married"] )[0], le_PROFESSION.transform( ["Professional"] )[0] ]
print( "Negative_example_payload (did not buy a tent): " + str( negative_example_payload ) )

Negative_example_payload (did not buy a tent): [35, 0, 0, 3]


In [16]:
pmml_pipeline.predict( [ negative_example_payload ] )

array([0])

In [17]:
positive_example_payload = [ 20, le_GENDER.transform( ["M"] )[0], le_MARITAL_STATUS.transform( ["Single"] )[0], le_PROFESSION.transform( ["Sales"] )[0] ]
print( "Positive_example_payload (did buy a tent): " + str( positive_example_payload ) )

Positive_example_payload (did buy a tent): [20, 1, 1, 6]


In [19]:
pmml_pipeline.predict( [ positive_example_payload ] )

array([1])
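
Both payloads above are built the same way: the age is passed through unchanged and each string value is replaced by its integer code. That pattern can be wrapped in a small helper. The sketch below hard-codes the mappings printed earlier in this notebook so that it runs on its own. The helper name `encode_payload` is an assumption, and in the notebook you would call the fitted `LabelEncoder` objects instead.

```python
# Integer codes copied from the label encoder output shown earlier in this notebook
GENDER_CODES = {"F": 0, "M": 1}
MARITAL_STATUS_CODES = {"Married": 0, "Single": 1, "Unspecified": 2}
PROFESSION_CODES = {
    "Executive": 0, "Hospitality": 1, "Other": 2, "Professional": 3, "Retail": 4,
    "Retired": 5, "Sales": 6, "Student": 7, "Trades": 8,
}

def encode_payload(age, gender, marital_status, profession):
    """Return features in the order the model was trained on:
    AGE, GENDER_index, MARITAL_STATUS_index, PROFESSION_index."""
    return [
        age,
        GENDER_CODES[gender],
        MARITAL_STATUS_CODES[marital_status],
        PROFESSION_CODES[profession],
    ]

print(encode_payload(35, "F", "Married", "Professional"))  # [35, 0, 0, 3]
print(encode_payload(20, "M", "Single", "Sales"))          # [20, 1, 1, 6]
```

An unknown category raises a `KeyError` here, which makes unexpected input values visible instead of silently mis-encoding them.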

### <a id="savemodel"></a> 3. Save the model in PMML format

In [25]:
# Save the model to a file in PMML format
from sklearn2pmml import sklearn2pmml
pmml_filename = "scikit-learn-lr-model-pmml.xml"
sklearn2pmml( pmml_pipeline, pmml_filename )

In [26]:
!cat scikit-learn-lr-model-pmml.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PMML xmlns="http://www.dmg.org/PMML-4_3" xmlns:data="http://jpmml.org/jpmml-model/InlineTable" version="4.3">
	<Header>
		<Application name="JPMML-SkLearn" version="1.5.10"/>
		<Timestamp>2019-01-22T20:19:47Z</Timestamp>
	</Header>
	<DataDictionary>
		<DataField name="IS_TENT" optype="categorical" dataType="integer">
			<Value value="0"/>
			<Value value="1"/>
		</DataField>
		<DataField name="AGE" optype="continuous" dataType="double"/>
		<DataField name="GENDER_index" optype="continuous" dataType="double"/>
		<DataField name="MARITAL_STATUS_index" optype="continuous" dataType="double"/>
		<DataField name="PROFESSION_index" optype="continuous" dataType="double"/>
	</DataDictionary>
	<RegressionModel functionName="classification" normalizationMethod="logit">
		<MiningSchema>
			<MiningField name="IS_TENT" usageType="target"/>
			<MiningField name="AGE"/>
			<MiningField name="GENDER_index"/>
			<MiningField n

**Tip**

You can highlight and copy the PMML content printed by the previous cell, paste it into a text editor on your local computer, and then save it as "scikit-learn-lr-model-pmml.xml".
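
PMML is plain XML, so you can also inspect the saved model programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` module to list the fields declared in the data dictionary. To stay self-contained, it parses an inline fragment that reproduces only the `DataDictionary` portion of the output shown above. The fragment is a shortened stand-in, not the complete model file.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the saved file: only the DataDictionary shown above
PMML_FRAGMENT = """
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <DataDictionary>
    <DataField name="IS_TENT" optype="categorical" dataType="integer">
      <Value value="0"/>
      <Value value="1"/>
    </DataField>
    <DataField name="AGE" optype="continuous" dataType="double"/>
    <DataField name="GENDER_index" optype="continuous" dataType="double"/>
    <DataField name="MARITAL_STATUS_index" optype="continuous" dataType="double"/>
    <DataField name="PROFESSION_index" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""

# PMML elements live in a versioned namespace, so lookups must be qualified
NS = {"pmml": "http://www.dmg.org/PMML-4_3"}

root = ET.fromstring(PMML_FRAGMENT.strip())
fields = [
    (field.get("name"), field.get("optype"), field.get("dataType"))
    for field in root.findall("pmml:DataDictionary/pmml:DataField", NS)
]
for name, optype, data_type in fields:
    print(name, optype, data_type)
```

To run the same check against the real file, replace `ET.fromstring(...)` with `ET.parse(pmml_filename).getroot()`.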

## Summary and next steps
In this notebook, you created a logistic regression model using scikit-learn and then saved the model to a file in PMML format.

To learn how you can import this model into Watson Machine Learning, see:
<a href="https://dataplatform.cloud.ibm.com/docs/content/analyze-data/ml-import-pmml.html" target="_blank" rel="noopener noreferrer">Importing models into Watson Machine Learning from PMML</a>

### <a id="authors"></a>Authors

**Sarah Packowski** is a member of the IBM Watson Studio Content Design team in Canada.


<hr>
Copyright &copy; IBM Corp. 2019. This notebook and its source code are released under the terms of the MIT License.
