<table style="border: none" align="left">
   <tr style="border: none">
      <th style="border: none"><font face="verdana" size="5" color="black"><b>Build a Loan default PMML scoring model with scikit-learn in Watson ML</b></font></th>
      <th style="border: none"><img src="https://github.com/pmservice/customer-satisfaction-prediction/blob/master/app/static/images/ml_icon_gray.png?raw=true" alt="Watson Machine Learning icon" height="40" width="40"></th>
   </tr>
</table>

This notebook contains steps and code to get a loan dataset, create a predictive model, and start scoring new data. This notebook introduces commands for getting data and for basic data cleaning and exploration, model creation, model training, model persistence, model deployment, and scoring.

Some familiarity with Python is helpful. This notebook uses Python 3.


## Learning goals

You will learn how to:

-  Load a CSV file into a Pandas DataFrame.
-  Explore data.
-  Prepare data for training and evaluation.
-  Create a scikit-learn machine learning model.
-  Train and evaluate a model.
-  Save the model as a PMML file.



## Contents

This notebook contains the following parts:

1. [Set up](#setup)
2. [Load and explore data](#load)
3. [Create a scikit-learn machine learning model](#model)
4. [Store the model in Watson Machine Learning provider](#provider)
5. [Summary and next steps](#summary)

<a id="setup"></a>
## 1. Set up

Before you use the sample code in this notebook, you must create a <a href="https://cloud.ibm.com/catalog?category=ai#services" target="_blank" rel="noopener noreferrer">Watson Machine Learning (WML) Service</a> instance. A lite plan is offered, and information about how to create the instance is available <a href="https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html" target="_blank" rel="noopener noreferrer">here</a>.


<a id="load"></a>
## 2. Load and explore data

In this section you will load the data as a Pandas DataFrame and perform a basic exploration.

Use *wget* to download the data file to the notebook's file system, then use the pandas *read_csv* method to load it into a DataFrame.

In [None]:
# Install wget if you don't already have it.
!pip install wget

In [None]:
import wget
link_to_data = 'https://raw.githubusercontent.com/ODMDev/decisions-on-spark/master/data/miniloan/miniloan-payment-default-cases-v2.0.csv'
filename = wget.download(link_to_data)

print(filename)
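
If installing the third-party `wget` package is not an option, the standard library can perform the same download. The following is a minimal sketch under that assumption; `download_file` is a hypothetical helper name, not part of the original notebook.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_file(url, target_dir="."):
    """Download url into target_dir and return the local file path."""
    # Derive the local file name from the last segment of the URL path.
    name = os.path.basename(urlparse(url).path) or "download.bin"
    path = os.path.join(target_dir, name)
    urllib.request.urlretrieve(url, path)
    return path
```

In this notebook it could be used as `filename = download_file(link_to_data)`.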

Import the libraries required to create the Pandas DataFrame.

In [None]:
import numpy as np
import pandas as pd

Load the file into a Pandas DataFrame using the code below.

In [None]:
used_names = ['creditScore', 'income', 'loanAmount', 'monthDuration', 'rate', 'yearlyReimbursement', 'paymentDefault']

df = pd.read_csv(
    filename,
    header=0,
    delimiter=r'\s*,\s*',
    engine='python'
).replace(
    [np.inf, -np.inf], np.nan
).dropna().loc[:, used_names]
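
The `delimiter` passed to `read_csv` is a regular expression that absorbs whitespace around each comma, which is why `engine='python'` is specified: the default C parser does not support regex separators. To see what the pattern and the `replace(...).dropna()` chain do, here is a pure-Python sketch on a few made-up lines (the sample values are illustrative, not taken from the dataset).

```python
import math
import re

DELIMITER = re.compile(r"\s*,\s*")  # same pattern as the read_csv delimiter


def parse_rows(lines):
    """Split lines on the delimiter and drop rows with non-finite numbers."""
    header = DELIMITER.split(lines[0].strip())
    rows = []
    for line in lines[1:]:
        values = [float(v) for v in DELIMITER.split(line.strip())]
        # Mirrors replace([inf, -inf], nan).dropna(): skip rows with inf or nan.
        if all(math.isfinite(v) for v in values):
            rows.append(dict(zip(header, values)))
    return header, rows


sample = [
    "creditScore , income,rate",
    "600 ,  50000, 0.05",
    "700, inf , 0.04",
]
header, rows = parse_rows(sample)
print(header)
print(rows)
```

The second data line is dropped because it contains an infinite value.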

Explore the loaded data by using the following Pandas DataFrame methods:
-  print types
-  print the first five records
-  count all records

In [None]:
# Convert the feature columns to float to avoid scaler warnings; keep the label as an integer.
df = df.astype({"creditScore": np.float64, "income": np.float64, "loanAmount": np.float64, "monthDuration": np.float64, "rate": np.float64, "yearlyReimbursement": np.float64, "paymentDefault": np.int64})
df.dtypes

As you can see, the data contains seven fields: six input features and the `paymentDefault` field, which is the label you want to predict.

In [None]:
df.head()

In [None]:
print("Number of records: " + str(len(df)))

<a id="model"></a>
## 3. Create a scikit-learn machine learning model

In this section you will learn how to:

- [3.1 Prepare data](#prep)
- [3.2 Create a model](#pipe)
- [3.3 Train a model](#train)
- [3.4 Save as PMML file](#save)


### 3.1 Prepare data<a id="prep"></a>

In this subsection you will split your data into: 
- train data set
- test data set
- predict data set

In [None]:
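# Shuffle once, then cut at 80% and 98% to get the train, test, and predict sets.
shuffled_data = df.sample(frac=1, random_state=42)
train_end, test_end = int(.8*len(df)), int((.8+.18)*len(df))
train_data = shuffled_data.iloc[:train_end]
test_data = shuffled_data.iloc[train_end:test_end]
predict_data = shuffled_data.iloc[test_end:]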

print("Number of training records: " + str(len(train_data)))
print("Number of testing records : " + str(len(test_data)))
print("Number of prediction records : " + str(len(predict_data)))

As you can see, your data has been successfully split into three data sets:

-  The train data set, which is the largest group, is used for training.
-  The test data set will be used for model evaluation and is used to test the assumptions of the model.
-  The predict data set will be used for prediction.
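
The shuffle-then-cut logic behind this split can be sketched without pandas. The helper below is hypothetical (`three_way_split` is not part of the notebook) and uses integer percentages so that the cut points do not depend on floating-point rounding.

```python
import random


def three_way_split(rows, train_pct=80, test_pct=18, seed=42):
    """Shuffle rows and cut them into train, test, and predict lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point surprises at the cut points.
    train_end = len(shuffled) * train_pct // 100
    test_end = len(shuffled) * (train_pct + test_pct) // 100
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]


train, test, predict = three_way_split(range(100))
print(len(train), len(test), len(predict))  # 80 18 2
```

Shuffling before cutting matters: without it, any ordering in the source file would leak into the composition of the three sets.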

### 3.2 Create an ML model and pipeline<a id="pipe"></a>

In this section you will create a scikit-learn machine learning model and wrap it in a pipeline.

In the first step you need to import the Scikit-Learn machine learning packages that will be needed in the subsequent steps.

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

Now construct the model. The following example uses a linear classifier trained with stochastic gradient descent; with `loss="log_loss"` this is a logistic regression. A pipeline adds an input scaling step.

In [None]:
clf = SGDClassifier(loss="log_loss", penalty="l2", random_state=42, tol=1e-3)
scaler = StandardScaler()

You then create a simple pipeline to first scale the input parameter values and then apply the model.

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('standardize', scaler),
    ("classifier", clf)
])
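
The scaling step matters because stochastic gradient descent is sensitive to feature scale, and the loan features span very different ranges (incomes in the tens of thousands, rates in hundredths). `StandardScaler` maps each column to z-scores, `z = (x - mean) / std`, using the population standard deviation. A pure-Python sketch of that transformation, with illustrative values that are not taken from the dataset:

```python
import statistics


def standardize(column):
    """Return z-scores using the population standard deviation (ddof=0)."""
    # Assumes the column is not constant; a zero std would divide by zero.
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


incomes = [20000.0, 40000.0, 60000.0]  # illustrative values
rates = [0.03, 0.05, 0.07]             # illustrative values

print(standardize(incomes))
print(standardize(rates))
```

After standardization both columns land on the same scale, so neither dominates the gradient updates.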

### 3.3 Train the model<a id="train"></a>
Now you can train your SGD classifier by using the previously defined **pipeline** and **train data**.

In [None]:
train_data.dtypes

In [None]:
x_train_data = train_data.loc[:, used_names[:-1]]
y_train_data = train_data.loc[:, used_names[-1]]

In [None]:
pipeline.fit(x_train_data, y_train_data)

# Record when the model was trained. The "Z" suffix denotes UTC, so take the time in UTC.
import datetime
ts = datetime.datetime.now(datetime.timezone.utc)
trainedAt = ts.strftime("%Y-%m-%dT%H:%M:%S.000Z")

You can check your **model accuracy** now. Use **test data** to evaluate the model.

In [None]:
x_test_data = test_data.loc[:, used_names[:-1]]
y_test_data = test_data.loc[:, used_names[-1]]

predictions = pipeline.predict(x_test_data)

We define a **metrics** variable to keep track of the metric values.

In [None]:
from sklearn.metrics import mean_squared_error, classification_report, balanced_accuracy_score, accuracy_score, confusion_matrix

metrics = []

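# For a classifier, pipeline.score returns the mean accuracy, not R^2.
name = "Mean accuracy (pipeline.score)"
mean_acc = pipeline.score(x_test_data, y_test_data)
metrics.append({ "name": name, "value": mean_acc })

name = "Root Mean Squared Error (RMSE)"
# mean_squared_error returns the MSE, so take the square root to get the RMSE.
rmse = np.sqrt(mean_squared_error(y_test_data, predictions))
metrics.append({ "name": name, "value": rmse })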

name = "Accuracy"
acc = accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": acc })

name = "Balanced accuracy"
balanced_acc = balanced_accuracy_score(y_test_data, predictions)
metrics.append({ "name": name, "value": balanced_acc })

name = "Confusion Matrix"
confusion_mat = confusion_matrix(y_test_data, predictions, labels=[0, 1])
metrics.append({ "name": name, "value": str(confusion_mat.tolist()) })

for metric in metrics:
    print(metric["name"], "on test data =", metric["value"])

In [None]:
print(classification_report(y_test_data, predictions))
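
To make the reported numbers easier to interpret, the sketch below recomputes accuracy, balanced accuracy, and the confusion matrix by hand for binary labels. It follows the scikit-learn convention of rows for true labels and columns for predicted labels; the helper names and the sample labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return [[tn, fp], [fn, tp]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        matrix[truth][pred] += 1  # rows: true label, columns: predicted label
    return matrix


def accuracy(matrix):
    """Fraction of all records that were classified correctly."""
    (tn, fp), (fn, tp) = matrix
    return (tn + tp) / (tn + fp + fn + tp)


def balanced_accuracy(matrix):
    """Mean of the per-class recalls, which is robust to class imbalance."""
    (tn, fp), (fn, tp) = matrix
    recall_0 = tn / (tn + fp)
    recall_1 = tp / (tp + fn)
    return (recall_0 + recall_1) / 2


y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # illustrative, imbalanced labels
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
m = confusion_counts(y_true, y_pred)
print(m, accuracy(m), balanced_accuracy(m))  # [[8, 0], [1, 1]] 0.9 0.75
```

Here accuracy looks high because the majority class is predicted well, while balanced accuracy exposes that only half of the defaults were caught.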

### 3.4 Save as PMML file <a id="save"></a>

In [None]:
!pip install nyoka==4.3.0

In [None]:
model_name = type(clf).__name__
scaler_name = type(scaler).__name__

from nyoka import skl_to_pmml
features=x_train_data.columns
target="paymentDefault"
pmml_filename = "ML-Sample-" + model_name + '-' + scaler_name + "-pmml.xml"
skl_to_pmml(pipeline, features, target, pmml_filename)
print(pmml_filename)
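
Before uploading the file, it can be useful to confirm that it is well-formed XML and declares the expected fields. The sketch below assumes the exported file follows the standard PMML layout, with a `DataDictionary` of `DataField` elements under the root; `pmml_summary` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def pmml_summary(path):
    """Return the PMML version and the names of the declared data fields."""
    root = ET.parse(path).getroot()
    # PMML elements are namespaced; "{*}" matches any namespace (Python 3.8+).
    fields = [
        field.get("name")
        for field in root.findall("./{*}DataDictionary/{*}DataField")
    ]
    return root.get("version"), fields
```

Calling `pmml_summary(pmml_filename)` should then list the six feature names and the `paymentDefault` target if the export matches that layout.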

<a id="provider"></a>
## 4. Store the model in Watson Machine Learning Provider


In this section you will learn how to use the Python client library to store your model in the WML repository.

- [4.1 Import the libraries](#lib)
- [4.2 Save model](#save)
- [4.3 Invoke model](#local)

### 4.1 Import the libraries<a id="lib"></a>

Authenticate to the Watson Machine Learning service on IBM Cloud.

**Tip**: Authentication information (your credentials) can be found in the <a href="https://cloud.ibm.com/iam/apikeys" target="_blank" rel="noopener noreferrer">Service credentials</a> tab of the service instance that you created on IBM Cloud.

If you cannot see the **instance_id** field in **Service Credentials**, click **New credential (+)** to generate new authentication information. 

**Action**: Enter your Watson Machine Learning service instance credentials here.

In [None]:
from ibm_watson_machine_learning import APIClient

wml_credentials = {
                   "url": "TO BE SET",  # example: "https://eu-gb.ml.cloud.ibm.com"
                   "apikey":"TO BE SET"
                  }

client = APIClient(wml_credentials)

### 4.2 Save the pipeline and deploy model<a id="save"></a>

In this subsection you will learn how to save pipeline and model artifacts to your Watson Machine Learning instance.

First, you need to create a space that will be used for deploying models. If you do not already have a space, you can use the <a href="https://dataplatform.cloud.ibm.com/ml-runtime/spaces?context=cpdaas" target="_blank" rel="noopener noreferrer">Deployment Spaces Dashboard</a> to create one.

- Click New Deployment Space
- Create an empty space
- Select Cloud Object Storage
- Select Watson Machine Learning instance and press Create
- Copy space_id and paste it below

In [None]:
space_id = 'TO BE SET'
client.set.default_space(space_id)

Publish the model from the PMML file created in section 3.4. Start by describing the input fields.

In [None]:
input_data_schema={
    'id': '1', 
    'type': 'struct', 
    'fields': [
        {  
            'name': 'creditScore',
            'nullable': True,
            'type': 'float64'
        },
        {   
            'name': 'income',
            'nullable': True,
            'type': 'float64'
        },
        {   
            'name': 'loanAmount',
            'nullable': True,
            'type': 'float64'
        },
        {   
            'name': 'monthDuration',
            'nullable': True,
            'type': 'float64'
        },
        {  
            'name': 'rate',
            'nullable': True,
            'type': 'float64'
        },
        {   
            'name': 'yearlyReimbursement',
            'nullable': True,
            'type': 'float64'
        }
]}
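
Because every field entry has the same shape, the schema can also be generated from the list of feature names instead of being written out by hand. A minimal sketch, assuming all features are nullable `float64` values as in the schema above; `build_input_schema` is a hypothetical helper name.

```python
def build_input_schema(feature_names, dtype="float64", schema_id="1"):
    """Build a schema dict with one nullable field entry per feature."""
    return {
        "id": schema_id,
        "type": "struct",
        "fields": [
            {"name": name, "nullable": True, "type": dtype}
            for name in feature_names
        ],
    }


features = ["creditScore", "income", "loanAmount",
            "monthDuration", "rate", "yearlyReimbursement"]
schema = build_input_schema(features)
print([field["name"] for field in schema["fields"]])
```

In this notebook the feature list is already available as `used_names[:-1]`, which keeps the schema in sync with the training columns.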

In [None]:
software_spec_uid = client.software_specifications.get_id_by_name("pmml-3.0_4.3")

metadata = {
            client.repository.ModelMetaNames.NAME: 'Payment Default - PMML',
            client.repository.ModelMetaNames.TYPE: 'pmml_4.3',
            client.repository.ModelMetaNames.INPUT_DATA_SCHEMA: input_data_schema,
            client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
            client.repository.ModelMetaNames.LABEL_FIELD: 'paymentDefault',
}

published_model_details = client.repository.store_model(model=pmml_filename, meta_props=metadata)

In [None]:
model_uid = client.repository.get_model_id( published_model_details )

print( "model_uid: ", model_uid )

In [None]:
deployment_name  = "Payment Default deployment"
deployment_desc  = "Online deployment of Loan payment default predictive service in pmml"
deployment_metadata = {
                        client.deployments.ConfigurationMetaNames.NAME: deployment_name, 
                        client.deployments.ConfigurationMetaNames.DESCRIPTION: deployment_desc,
                        client.deployments.ConfigurationMetaNames.ONLINE: {}
}
deployment       = client.deployments.create(artifact_uid=model_uid, meta_props=deployment_metadata)
scoring_endpoint = client.deployments.get_scoring_href( deployment )
print( "scoring_endpoint: ", scoring_endpoint )

**Tip**: Use `client.repository.ModelMetaNames.show()` to get the list of available props.

In [None]:
client.repository.ModelMetaNames.show()

<a id="local"></a>
### 4.3 Invoke model


In this subsection you will score the *predict_data* data set.
You will learn how to invoke a saved model from a specified instance of Watson Machine Learning.

In [None]:
deployment_id = client.deployments.get_id(deployment)

x_predict_data = predict_data.loc[:, used_names[:-1]]
y_predict_data = predict_data.loc[:, used_names[-1]]

#scoring_payload = {
#    "fields": x_predict_data.columns.values.tolist(),
#    "values": x_predict_data.values.tolist()
#}

scoring_payload = {
    client.deployments.ScoringMetaNames.INPUT_DATA: [
        {
            'fields': x_predict_data.columns.values.tolist(),
            'values': x_predict_data.values.tolist()
        }]
}
predictions_predict_data = client.deployments.score(deployment_id, scoring_payload)

#print(json.dumps(predictions_predict_data, indent=4))
predictions_predict_data

Preview some result metrics.

In [None]:
label_predictions = []
for result in predictions_predict_data['predictions'][0].get('values'):
    # result[0] is treated here as the probability of class 0 (no payment default).
    if result[0] >= 0.5:
        label_predictions.append(0)
    else:
        label_predictions.append(1)
        
balanced_acc = balanced_accuracy_score(y_predict_data, label_predictions)

confusion_mat = confusion_matrix(y_predict_data, label_predictions, labels=[0, 1])

acc = accuracy_score(y_predict_data, label_predictions)

print('Accuracy', acc)
print('Balanced accuracy', balanced_acc)
print('Confusion Matrix', confusion_mat)
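
The loop above relies on the position of the probability within each result row. A slightly more defensive variant looks the column up by name. The sketch below runs on a mock response with the same `predictions[0]['values']` shape; it assumes the response also carries a `fields` list, and the field names used in the mock are hypothetical placeholders that may differ from what your deployment returns.

```python
def labels_from_response(response, proba_field, positive_label=1, threshold=0.5):
    """Map each scored row to a label using the named probability column."""
    prediction = response["predictions"][0]
    column = prediction["fields"].index(proba_field)
    negative_label = 1 - positive_label
    return [
        positive_label if row[column] >= threshold else negative_label
        for row in prediction["values"]
    ]


# Mock response: the field names below are hypothetical placeholders.
mock_response = {
    "predictions": [{
        "fields": ["probability_0", "probability_1"],
        "values": [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    }]
}
print(labels_from_response(mock_response, "probability_0", positive_label=0))  # [0, 1, 0]
```

Inspect the `fields` list of a real response first, then pass the matching field name.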

<a id="summary"></a>
## 5. Summary and next steps
You successfully completed this notebook!   
Check out the [Online Documentation](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ml-samples-overview.html) for more samples, tutorials, documentation, how-tos, and blog posts. 

### Authors

This notebook was inspired by the original notebook written by Pierre Feillet using Apache Spark and Watson Machine Learning.
It was adapted for scikit-learn by Marine Collery.