# Deploy a scikit-learn Machine Learning Model Using DSX

### Setup

To deploy models on DSX, you first need a Watson Machine Learning service instance.

### Load Data

This demo uses the Auto MPG dataset from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/auto+mpg

In [1]:
import pandas as pd
import io
import requests
import numpy as np
from sklearn import metrics


url="https://raw.githubusercontent.com/lcx813/data/master/auto-mpg.csv"
df=pd.read_csv(io.StringIO(requests.get(url).content.decode('utf-8')),na_values=['NA','?'])

df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
0,18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15,8,350,165,3693,11.5,70,1,buick skylark 320
2,18,8,318,150,3436,11.0,70,1,plymouth satellite
3,16,8,304,150,3433,12.0,70,1,amc rebel sst
4,17,8,302,140,3449,10.5,70,1,ford torino


The following helper functions for data preprocessing were written by Dr. Jeff Heaton (https://www.linkedin.com/in/jeffheaton/) for his deep learning class at WashU. You can find them on Jeff's GitHub: https://github.com/jeffheaton/t81_558_deep_learning/blob/master/jeffs_helpful.ipynb

In [2]:
import pandas as pd
from sklearn import preprocessing

# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)
    
# Encode text values to indexes(i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_

# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)
    
# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        # .values replaces DataFrame.as_matrix(), which was removed in pandas 1.0
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)
    
# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd
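
To make the z-score helper concrete, here is a minimal standard-library sketch of the same arithmetic. The function name `zscore` is a hypothetical stand-in, not part of the helpers above. It uses the sample standard deviation, which is also what pandas' `Series.std()` computes by default.

```python
import statistics

def zscore(values):
    """Return z-scores using the sample standard deviation (n - 1 in the denominator)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]

# mpg values of the first five rows printed by df.head()
mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
scaled = zscore(mpg)
print([round(z, 3) for z in scaled])
```

After scaling, the values have mean 0 and sample standard deviation 1, which puts columns with very different ranges (such as weight and acceleration) on a comparable scale.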

### Data preprocess

In [3]:
# Data preprocessing and create feature vector
missing_median(df, 'horsepower')

tmp = df['name']
df.drop('name', axis=1, inplace=True)

encode_numeric_zscore(df, 'mpg')
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')

encode_text_dummy(df, 'origin')

cylinders = encode_text_index(df, 'cylinders')
num_classes = len(cylinders)
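
Because `cylinders` is index-encoded, the labels used later are class indexes, not cylinder counts. `LabelEncoder` numbers the distinct values in sorted order. The sketch below mimics that behaviour with plain Python; `fit_label_index` is a hypothetical helper, and the sample values assume the column holds 3, 4, 5, 6, and 8 cylinders.

```python
def fit_label_index(values):
    """Map each distinct value to its position in sorted order."""
    classes = sorted(set(values))
    index_of = {c: i for i, c in enumerate(classes)}
    return classes, [index_of[v] for v in values]

# Assumed sample of cylinder counts, for illustration only
cylinders_raw = [8, 8, 4, 6, 3, 5, 4]
classes, encoded = fit_label_index(cylinders_raw)

# Decoding an index back to a cylinder count is a list lookup
decoded = [classes[i] for i in encoded]
print(classes)
print(encoded)
```

Under that assumption, an encoded label of 4 stands for 8 cylinders, which is useful to keep in mind when reading the predictions further down.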


Prepare the data and labels for all samples, then split the dataset into training, testing, and scoring sets.

In [4]:
y = df.cylinders.tolist()

In [5]:
df.drop('cylinders', axis=1, inplace=True)
x = df.values.tolist()
data = x
label = y

In [6]:
samples_count = len(x)  # 398 rows in this dataset
train_data = x[: int(0.7*samples_count)]
train_labels = y[: int(0.7*samples_count)]

test_data = x[int(0.7*samples_count): int(0.9*samples_count)]
test_labels = y[int(0.7*samples_count): int(0.9*samples_count)]

score_data = x[int(0.9*samples_count): ]

print("Number of training records: " + str(len(train_data)))
print("Number of testing records : " + str(len(test_data)))
print("Number of scoring records : " + str(len(score_data)))

Number of training records: 278
Number of testing records : 80
Number of scoring records : 40
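
The split above is sequential: rows keep their file order and are not shuffled. A small sketch of the same slicing as a reusable function follows; `split_sequential` and its parameter names are hypothetical, and the boundaries are cumulative fractions of the row count.

```python
def split_sequential(items, train_end_frac=0.7, test_end_frac=0.9):
    """Slice items into train/test/score parts at cumulative fractions."""
    n = len(items)
    train_end = int(train_end_frac * n)
    test_end = int(test_end_frac * n)
    return items[:train_end], items[train_end:test_end], items[test_end:]

rows = list(range(398))  # stand-in for the 398 feature rows
train_part, test_part, score_part = split_sequential(rows)
print(len(train_part), len(test_part), len(score_part))  # 278 80 40
```

If the rows are ordered in some meaningful way (for example by model year), a sequential split can give the test set a different distribution than the training set, so shuffling first may be worth considering.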


### Training a model using scikit-learn

In [7]:
from sklearn import preprocessing
from sklearn import svm, metrics
from sklearn.ensemble import RandomForestClassifier

Define the classifier for model training. This example uses a random forest classifier.

In [8]:
clf = RandomForestClassifier(n_estimators=100)

Train the model.

In [9]:
model = clf.fit(train_data, train_labels)

Make predictions on the test set and measure accuracy.

In [10]:
predicted = model.predict(test_data)
score = metrics.accuracy_score(test_labels, predicted)
print("Accuracy = {:.2f}".format(score))

Accuracy = 0.91
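
Accuracy here is simply the fraction of test labels that were predicted exactly. The sketch below shows that calculation in plain Python with made-up labels; `accuracy` is a hypothetical helper and not the scikit-learn implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration: four of five predictions match
y_true = [4, 4, 1, 3, 1]
y_pred = [4, 4, 1, 1, 1]
print("Accuracy = {:.2f}".format(accuracy(y_true, y_pred)))  # Accuracy = 0.80
```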


### Persist Model 

Store the trained model in the Watson Machine Learning repository so that it can be loaded and deployed later.

In [11]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact
from repository.mlrepository import MetaProps, MetaNames

In [12]:
wml_credentials={
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "***",
  "username": "***",
  "password": "***",
  "instance_id": "***"
}

In [13]:
ml_repository_client = MLRepositoryClient(wml_credentials['url'])
ml_repository_client.authorize(wml_credentials['username'], wml_credentials['password'])

In [14]:
props = MetaProps({MetaNames.AUTHOR_NAME:"IBM", MetaNames.AUTHOR_EMAIL:"ibm@ibm.com"})

In [15]:
model_artifact = MLRepositoryArtifact(model, name="test", meta_props=props)

In [16]:
saved_model = ml_repository_client.models.save(model_artifact)

In [17]:
saved_model.meta.available_props()

['modelVersionHref',
 'pipelineVersionHref',
 'trainingDataRef',
 'creationTime',
 'lastUpdated',
 'authorEmail',
 'authorName',
 'version',
 'modelType',
 'runtime']

In [18]:
print("modelType: " + saved_model.meta.prop("modelType"))
print("runtime: " + saved_model.meta.prop("runtime"))
print("creationTime: " + str(saved_model.meta.prop("creationTime")))
print("modelVersionHref: " + saved_model.meta.prop("modelVersionHref"))

modelType: scikit-model-0.17
runtime: python-2.7
creationTime: 2017-11-01 19:35:53.509000+00:00
modelVersionHref: https://ibm-watson-ml.mybluemix.net/v2/artifacts/models/d22fa78c-e105-42db-9a8d-2d14339ce420/versions/faffc0cf-6e28-4913-950c-6abbb62faf60


### Load model

In [19]:
loadedModelArtifact = ml_repository_client.models.get(saved_model.uid)

In [20]:
print(loadedModelArtifact.name)
print(saved_model.uid)

test
d22fa78c-e105-42db-9a8d-2d14339ce420


Make a local prediction with the loaded model.

In [21]:
score_data = x[int(0.9*samples_count): ]
predictions = loadedModelArtifact.model_instance().predict(score_data)

In [22]:
print(predictions)

[1 1 1 3 1 3 4 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 4 1 3 1 1 1 1 1
 1 1 1]


### Deploy and score in the cloud

To work with the Watson Machine Learning REST API, you must first generate an access token. The following sample code does that.

In [23]:
import urllib3, requests, json

headers = urllib3.util.make_headers(basic_auth='{username}:{password}'.format(username=wml_credentials['username'], password=wml_credentials['password']))
url = '{}/v3/identity/token'.format(wml_credentials['url'])
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
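
The token request relies on HTTP Basic authentication, and the token is read with `.get('token')`, which returns `None` if the request failed. The sketch below shows, with the standard library only, what a Basic `Authorization` header contains and one way to fail early when no token comes back. Both helper names are hypothetical and the sample credentials are placeholders.

```python
import base64
import json

def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header from a username and password."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}

def extract_token(response_text):
    """Return the 'token' field of a JSON body, or raise if it is missing."""
    token = json.loads(response_text).get("token")
    if not token:
        raise ValueError("no token in response; check the credentials and URL")
    return token

print(basic_auth_header("user", "pass"))
print(extract_token('{"token": "abc"}'))
```

Raising at this point gives a clearer error than the `TypeError` that would otherwise appear later, when `'Bearer ' + mltoken` is evaluated with `mltoken` set to `None`.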

Create an online scoring endpoint. First, get the published_models URL from the instance details.

In [24]:
endpoint_instance = wml_credentials['url'] + "/v3/wml_instances/" + wml_credentials['instance_id']
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken} 

response_get_instance = requests.get(endpoint_instance, headers=header)
print(response_get_instance)
print(response_get_instance.text)

<Response [200]>
{"metadata":{"guid":"4bb6fb38-c1c7-4a92-87cc-5bba334836d1","url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1","created_at":"2017-08-07T20:11:09.647Z","modified_at":"2017-11-01T19:35:53.566Z"},"entity":{"source":"Bluemix","published_models":{"url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models"},"usage":{"expiration_date":"2017-12-01T00:00:00.000Z","computation_time":{"limit":18000,"current":0},"model_count":{"limit":200,"current":4},"prediction_count":{"limit":5000,"current":10},"deployment_count":{"limit":5,"current":3}},"plan_id":"3f6acf43-ede8-413a-ac69-f8af3bb0cbfe","status":"Active","organization_guid":"a9547942-9cdd-4663-bafe-280ce52533d0","region":"us-south","account":{"id":"3b8c2aec8e03a8f1b09ec6ccbddc0abe","name":"IBM","type":"TRIAL"},"owner":{"user_id":"c26275bd-c476-4eb4-9201-25382a43ea18","email":"chengxi.li@ibm.com","country_code":"USA","beta_user":t

In [25]:
endpoint_published_models = json.loads(response_get_instance.text).get('entity').get('published_models').get('url')

print(endpoint_published_models)


https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models
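
Chained `.get(...)` calls raise an unhelpful `AttributeError` on `None` when any level of the response is missing. As a sketch, a small helper can walk the nested JSON and name the missing key instead. `get_nested` is hypothetical, and the sample body below only imitates the shape of the instance details with a placeholder URL.

```python
import json

def get_nested(doc, *keys):
    """Walk nested dicts; raise KeyError naming the path of the first missing key."""
    current = doc
    for depth, key in enumerate(keys):
        if not isinstance(current, dict) or key not in current:
            raise KeyError("missing key path: " + ".".join(keys[:depth + 1]))
        current = current[key]
    return current

# Placeholder body with the same nesting as the instance details response
sample_text = '{"entity": {"published_models": {"url": "https://example.invalid/published_models"}}}'
instance = json.loads(sample_text)
print(get_nested(instance, "entity", "published_models", "url"))
```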


Use the published_models endpoint to get the list of published models. Each entry in the list contains the deployments URL for that model.

In [26]:
header = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + mltoken}
response_get = requests.get(endpoint_published_models, headers=header)

print(response_get)
print(response_get.text)

<Response [200]>
{"count":4,"resources":[{"metadata":{"guid":"28f0fcf8-4d28-4944-aea6-91cd35c5132c","url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/28f0fcf8-4d28-4944-aea6-91cd35c5132c","created_at":"2017-11-01T18:57:53.852Z","modified_at":"2017-11-01T18:57:59.115Z"},"entity":{"runtime_environment":"python-2.7","learning_configuration_url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/28f0fcf8-4d28-4944-aea6-91cd35c5132c/learning_configuration","author":{"name":"IBM","email":"ibm@ibm.com"},"name":"test","learning_iterations_url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/28f0fcf8-4d28-4944-aea6-91cd35c5132c/learning_iterations","feedback_url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/28f0fcf8-4d28-4944-aea6-91cd35c5132c/feedback","

Get the deployments URL of the published model whose GUID matches the saved model.

In [27]:
[endpoint_deployments] = [x.get('entity').get('deployments').get('url') for x in json.loads(response_get.text).get('resources') if x.get('metadata').get('guid') == saved_model.uid]

print(endpoint_deployments)

https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/d22fa78c-e105-42db-9a8d-2d14339ce420/deployments
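
The single-element unpacking `[endpoint_deployments] = [...]` raises a `ValueError` when zero or several models match the GUID. The sketch below does the same lookup with an explicit error message. `find_deployments_url` is a hypothetical helper, and the resource entries are shortened placeholders with the same nesting as the response above.

```python
def find_deployments_url(resources, model_guid):
    """Return the deployments URL of the one resource whose GUID matches."""
    matches = [
        r["entity"]["deployments"]["url"]
        for r in resources
        if r.get("metadata", {}).get("guid") == model_guid
    ]
    if len(matches) != 1:
        raise LookupError(
            "expected exactly one model with guid {}, found {}".format(model_guid, len(matches))
        )
    return matches[0]

# Placeholder resources for illustration
resources = [
    {"metadata": {"guid": "model-a"},
     "entity": {"deployments": {"url": "https://example.invalid/model-a/deployments"}}},
    {"metadata": {"guid": "model-b"},
     "entity": {"deployments": {"url": "https://example.invalid/model-b/deployments"}}},
]
print(find_deployments_url(resources, "model-b"))
```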


Create an online deployment for the published model.

In [28]:
payload_online = {"name": "test", "description": "test", "type": "online"}
response_online = requests.post(endpoint_deployments, json=payload_online, headers=header)

print(response_online)
print(response_online.text)

<Response [201]>
{"metadata":{"guid":"20e45280-3736-4a34-a48a-fa94bc0fbfe9","url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/d22fa78c-e105-42db-9a8d-2d14339ce420/deployments/20e45280-3736-4a34-a48a-fa94bc0fbfe9","created_at":"2017-11-01T19:35:58.231Z","modified_at":"2017-11-01T19:35:59.751Z"},"entity":{"runtime_environment":"python-2.7","name":"test","scoring_url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/d22fa78c-e105-42db-9a8d-2d14339ce420/deployments/20e45280-3736-4a34-a48a-fa94bc0fbfe9/online","description":"test","published_model":{"author":{"name":"IBM","email":"ibm@ibm.com"},"name":"test","url":"https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/d22fa78c-e105-42db-9a8d-2d14339ce420","guid":"d22fa78c-e105-42db-9a8d-2d14339ce420","created_at":"2017-11-01T19:35:58.205Z"},"model_type":"scikit-model-

Extract the scoring URL from the deployment response.

In [29]:
scoring_url = json.loads(response_online.text).get('entity').get('scoring_url')

print(scoring_url)

https://ibm-watson-ml.mybluemix.net/v3/wml_instances/4bb6fb38-c1c7-4a92-87cc-5bba334836d1/published_models/d22fa78c-e105-42db-9a8d-2d14339ce420/deployments/20e45280-3736-4a34-a48a-fa94bc0fbfe9/online


Use two samples from the dataset for prediction.

In [30]:
test_1 = data[1]
test_2 = data[2]
label_1 = label[1]
label_2 = label[2]

In [31]:
payload_scoring = {"values": [test_1, test_2]}
print(payload_scoring)

{'values': [[-1.0893794720944747, 1.5016242793620063, 1.5879594901955474, 0.8532590135498572, -1.4751810504376373, 70.0, 1.0, 0.0, 0.0], [-0.7055506566787514, 1.1947282434492943, 1.19552176380289, 0.5497784722839334, -1.6564922906557151, 70.0, 1.0, 0.0, 0.0]]}
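
Each row in the payload must have the same nine feature values, in the same order, that the model was trained on. As a sketch, a small builder can check the row width before anything is sent; `build_scoring_payload` is a hypothetical helper, and the rows are the two samples printed above.

```python
import json

def build_scoring_payload(rows, n_features):
    """Return a {'values': [...]} payload after checking every row's width."""
    for i, row in enumerate(rows):
        if len(row) != n_features:
            raise ValueError(
                "row {} has {} values, expected {}".format(i, len(row), n_features)
            )
    return {"values": [[float(v) for v in row] for row in rows]}

sample_rows = [
    [-1.0893794720944747, 1.5016242793620063, 1.5879594901955474,
     0.8532590135498572, -1.4751810504376373, 70.0, 1.0, 0.0, 0.0],
    [-0.7055506566787514, 1.1947282434492943, 1.19552176380289,
     0.5497784722839334, -1.6564922906557151, 70.0, 1.0, 0.0, 0.0],
]
payload = build_scoring_payload(sample_rows, n_features=9)
print(json.dumps(payload)[:60])
```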


In [32]:
response_scoring = requests.post(scoring_url, json=payload_scoring, headers=header)
print(response_scoring.text)

{
  "values": [[4, [0.0, 0.0, 0.0, 0.0, 1.0]], [4, [0.0, 0.0, 0.0, 0.01, 0.99]]],
  "fields": ["prediction", "probability"]
}
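
The response pairs a `fields` list with one row of `values` per scored sample. The sketch below turns each row into a dictionary keyed by field name, using the response text shown above; `rows_as_dicts` is a hypothetical helper.

```python
import json

response_text = """{
  "values": [[4, [0.0, 0.0, 0.0, 0.0, 1.0]], [4, [0.0, 0.0, 0.0, 0.01, 0.99]]],
  "fields": ["prediction", "probability"]
}"""

def rows_as_dicts(text):
    """Zip the 'fields' names with each row in 'values'."""
    body = json.loads(text)
    return [dict(zip(body["fields"], row)) for row in body["values"]]

results = rows_as_dicts(response_text)
for r in results:
    # The largest class probability is the model's confidence in its prediction
    print(r["prediction"], max(r["probability"]))
```

The probability list has one entry per class, so here the predicted class index 4 is the last of five entries.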


Check the original labels of the two scored samples.

In [33]:
print(label_1)
print(label_2)

4
4


Resources:
1) https://dataplatform.ibm.com/docs/content/analyze-data/ml-deploy.html?context=analytics
2) https://dataplatform.ibm.com/analytics/notebooks/5215a61a-16d7-4fa2-b060-e3e243ceebe3/view?access_token=70f48c95c5571a614ce97484d3f168b1d9b6aeebce015187d3d77ce6038f025e#
