Simple demo of how to create, save, and reuse a classification model. The ability to save and reuse a trained model is an important part of putting machine learning models into production.

The model is created in **scikit-learn**.

The model is saved and re-opened (serialized and deserialized) using **pickle**.

The data set used for the demo is the **Iris Flower Dataset**. It consists of 150 observations of 3 different types of irises, namely Setosa, Versicolour, and Virginica. The rows are the observations. The columns are the features, namely: sepal length, sepal width, petal length, and petal width.

In [26]:
# Import packages
from sklearn import svm, datasets
import pickle
import numpy as np

# Load the iris data set so the cells below can inspect it.
iris = datasets.load_iris()


The Iris Flower Dataset is loaded as a dictionary-like object whose fields include:

**iris.data:** NumPy N-dimensional array of shape (150, 4): one row of 4 feature values for each of the 150 observations

In [59]:
print(type(iris.data))
print(iris.data[:10])
print(len(iris.data))

<class 'numpy.ndarray'>
[[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]]
150


**iris.target:** NumPy N-dimensional array containing 150 target values (integer class labels 0, 1, and 2)

In [57]:
print(type(iris.target))
print(iris.target[:10])
print(len(iris.target))

<class 'numpy.ndarray'>
[0 0 0 0 0 0 0 0 0 0]
150


**iris.feature_names:** List of feature names

In [75]:
print(type(iris.feature_names))
print(iris.feature_names)

<class 'list'>
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
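
As a quick check of the dictionary-like structure described above, the following sketch lists the fields of the loaded object and maps the integer targets to species names. It assumes the `target_names` field, which is not shown elsewhere in this demo but is returned by scikit-learn's `load_iris()`.

```python
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()

# Fields available on the loaded object (it supports dict-style access).
print(sorted(iris.keys()))

# Species names; the integer targets index into this array.
print(iris.target_names)

# Number of observations per class.
class_counts = np.bincount(iris.target)
print({str(name): int(count) for name, count in zip(iris.target_names, class_counts)})
```

The first 10 targets printed earlier are all 0 because the observations are ordered by class, so the first rows all belong to the first species.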


Here are the steps to build and serialize a simple SVM classification model using the scikit-learn library.

In [80]:
# Load iris observations.
iris = datasets.load_iris()

# Set iris features.
X = iris.data

# Set iris target values.
Y = iris.target

# Create the model and fit it on X & Y.
# C=1.0: Regularization (penalty) parameter of the error term; 1.0 is the default.
# kernel='poly': Use a polynomial kernel.
# degree=3: Degree of the polynomial kernel function.
model = svm.SVC(C=1.0, kernel='poly', degree=3).fit(X, Y)

# Specify location to save model.
# Raw string (r'...') so the backslashes are not treated as escape sequences.
path = r'C:\Temp\model.pckl'

# Save model.
# Open (or create) 'model.pckl' in 'wb' mode: Write Binary mode.
# The 'with' block closes the file automatically, even if an error occurs.
with open(path, 'wb') as f:
    # Serialize the model to the file.
    pickle.dump(model, f)

In [83]:
# Open file in 'rb' mode (Read Binary) and load saved model.
# The 'with' block closes the file automatically.
with open(path, 'rb') as f:
    model = pickle.load(f)

# Create matrix of observations.
observations = np.array(X).reshape(150,4)

# Use model to perform predictions from observations.
predictions = model.predict(observations)

# Compare the observed target variable (Y) to the predictions.
np.column_stack((Y, predictions))[:20]

array([[0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0],
       [0, 0]])

As demonstrated above, the target values match the predictions for the first 20 observations shown. This is largely expected, since the model is predicting on the same data it was trained on. Such an approach is fine for the purposes of this demo.
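
To check all 150 rows rather than only the first 20, one option is to compute the fraction of matching predictions. The sketch below is self-contained and, as an assumption made for portability, serializes the model to an in-memory byte string with `pickle.dumps` and `pickle.loads` instead of the file used above. The variable names `restored`, `same_predictions`, and `training_accuracy` are illustrative.

```python
import pickle
import numpy as np
from sklearn import svm, datasets

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Same model settings as in the demo above.
model = svm.SVC(C=1.0, kernel='poly', degree=3).fit(X, Y)

# Round-trip through pickle in memory (assumption: bytes instead of a file).
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original model's predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))

# Fraction of training observations predicted correctly.
training_accuracy = np.mean(restored.predict(X) == Y)
print(same_predictions, training_accuracy)
```

A training accuracy below 1.0 would simply mean that some rows beyond the first 20 are misclassified, even though the model saw them during training.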

In practice, you should **not** train your model on 100% of your data. Train on a subset (say 80%) of your observations; then evaluate your model by comparing the target values of the remaining 20% of your observations to the associated predictions.
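
A minimal sketch of such an 80/20 split, assuming scikit-learn's `train_test_split` helper is used. The `random_state=0` and `stratify=Y` arguments are illustrative choices (a reproducible split that keeps the class proportions) and are not part of the demo above.

```python
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Hold out 20% of the observations for evaluation.
# random_state and stratify are illustrative choices, not part of the demo above.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0, stratify=Y
)

# Fit on the training subset only.
model = svm.SVC(C=1.0, kernel='poly', degree=3).fit(X_train, Y_train)

# Fraction of held-out observations predicted correctly.
test_accuracy = model.score(X_test, Y_test)
print(X_train.shape, X_test.shape, test_accuracy)
```

The model fitted this way can be saved and reloaded with pickle exactly as shown earlier; only the data passed to `fit` changes.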