## Modeling / Evaluation / Deployment to WML using PySpark
<img src="https://github.com/CatherineCao2016/lendingclub/raw/master/modeling.png" width="800" height="500" align="middle"/>


We want to predict the likelihood of default given a borrower's data.  
Three ML algorithms are tested here using Spark and the Pipelines API in PySpark:

1. Logistic Regression
2. Decision Tree
3. Random Forest

## Import Libraries

In [1]:
import ibmdbpy
from ibmdbpy import IdaDataBase, IdaDataFrame
import pandas as pd
pd.options.display.max_columns = 999
pd.set_option('display.max_colwidth', 200)

import warnings
warnings.filterwarnings('ignore')
import time
from datetime import datetime
import math
import urllib3, requests, json

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.feature import StandardScaler


## Load Cleaned Data

In [3]:
loan_spark_read = spark.read.parquet("home/lending_club/loan_sub_kp").cache()
# loan_spark = spark.createDataFrame(loan_sub_kp.drop(['LOAN_STATUS', 'ISSUE_D', 'EMP_TITLE', 'DESC', 'MTHS_SINCE_LAST_DELINQ'], 1)).cache()
print loan_spark_read.printSchema
loan_spark_read.toPandas()
loan_spark_read.describe().toPandas()


<bound method DataFrame.printSchema of DataFrame[LOAN_STATUS: string, ISSUE_D: bigint, LOAN_AMNT: bigint, EMP_TITLE: string, EMP_LENGTH: string, VERIFICATION_STATUS: string, HOME_OWNERSHIP: string, ANNUAL_INC: double, PURPOSE: string, INQ_LAST_6MTHS: bigint, DESC: string, OPEN_ACC: bigint, PUB_REC: bigint, REVOL_UTIL: double, DTI: double, TOTAL_ACC: bigint, DELINQ_2YRS: bigint, EARLIEST_CR_LINE: bigint, MTHS_SINCE_LAST_DELINQ: double, ADDR_STATE: string, TERM: string, DEFAULT: bigint, EMP_LISTED: bigint, EMPTY_DESC: bigint, EMP_NA: bigint, DELING_EVER: bigint, TIME_HISTORY: bigint]>


Unnamed: 0,summary,ISSUE_D,LOAN_AMNT,ANNUAL_INC,INQ_LAST_6MTHS,OPEN_ACC,PUB_REC,REVOL_UTIL,DTI,TOTAL_ACC,DELINQ_2YRS,EARLIEST_CR_LINE,MTHS_SINCE_LAST_DELINQ,DEFAULT,EMP_LISTED,EMPTY_DESC,EMP_NA,DELING_EVER,TIME_HISTORY
0,count,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0,39999.0
1,mean,1.2888623348383352e+18,11220.381759543989,69005.63250381262,0.889347233680842,9.304557613940348,0.0554513862846571,48.87473636840911,13.328587714692953,22.113227830695767,0.1474036850921273,8.536274401260093e+17,35.90692017301288,0.1426285657141428,0.9379734493362334,0.3314582864571614,0.0271756793919848,0.3539588489712242,5037.440911022775
2,stddev,2.880994660035285e+16,7458.321880039553,63903.73691587774,1.1088136654975094,4.414574883524405,0.238176524941833,28.31348572815753,6.680935935424673,11.4190903560703,0.4959183335662212,2.156149005089904e+17,13.098574359254316,0.3496980343822131,0.241206783135738,0.4707432749549588,0.1625971180669052,0.478202571397938,2501.1879606625066
3,min,1.180656e+18,500.0,4000.0,0.0,2.0,0.0,0.0,0.0,2.0,0.0,-7.573824e+17,0.0,0.0,0.0,0.0,0.0,0.0,1095.0
4,max,1.3226976e+18,35000.0,6000000.0,8.0,44.0,4.0,99.9,29.99,90.0,11.0,1.2254976e+18,120.0,1.0,1.0,1.0,1.0,1.0,23892.0


**Preprocess the data**

In [4]:
# Create the label column: convert long to double to avoid an RF fit error
loan_spark_read = loan_spark_read.withColumn('label', loan_spark_read['DEFAULT'].cast(DoubleType()))

In [5]:
# One-hot encode all categorical variables
catCols = ['EMP_LENGTH', 'VERIFICATION_STATUS', 'HOME_OWNERSHIP', 'PURPOSE', 'ADDR_STATE', 'TERM']
for catCol in catCols:
    loan_spark_read = StringIndexer(inputCol=catCol, outputCol=catCol+"Index").fit(loan_spark_read).transform(loan_spark_read)
    loan_spark_read = OneHotEncoder(inputCol=catCol+"Index", outputCol=catCol+"classVec").transform(loan_spark_read)  
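To make the encoding step concrete, here is a minimal pure-Python sketch of what the `StringIndexer` and `OneHotEncoder` pair does to one categorical column, assuming Spark's default settings: labels are indexed by descending frequency, and the encoder drops the last category, so a column with k distinct values becomes a vector of length k - 1. The function names and sample values are made up for illustration, and the alphabetical tie-breaking is an assumption that may differ from Spark.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index, most frequent label first.
    Ties are broken alphabetically here (an assumption for this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a dense 0/1 list; the last category maps to all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec

home = ["RENT", "MORTGAGE", "RENT", "OWN", "RENT", "MORTGAGE"]
mapping = string_index(home)
encoded = [one_hot(mapping[v], len(mapping)) for v in home]
```

Here `RENT` is the most frequent value and gets index 0.0, while the least frequent value `OWN` is the dropped category and is encoded as all zeros.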

In [8]:
# Assemble feature vector
numCols = ['LOAN_AMNT', 'ANNUAL_INC', 'INQ_LAST_6MTHS', 'OPEN_ACC', 'PUB_REC', 'REVOL_UTIL', 'DTI', 'TOTAL_ACC', 'DELINQ_2YRS', 'EMP_LISTED', 'EMPTY_DESC', 'EMP_NA', 'DELING_EVER', 'TIME_HISTORY']

# Concatenate Numerical and Categorical Features, and then add to Vector Assembler 
assemblerInputs = [c + "classVec" for c in catCols] + numCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features_non_scaled")
# Debug
# assemblerInputs = numCols
#assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features_non_scaled")
loan_spark = assembler.transform(loan_spark_read)


scaler = StandardScaler(withMean=False, withStd=True, inputCol="features_non_scaled", outputCol="features")
scalerModel = scaler.fit(loan_spark)
loan_spark = scalerModel.transform(loan_spark)

# keep useful variables
selectedcols = ["label", "features"]
loan_model = loan_spark.select(selectedcols)
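With `withMean=False` and `withStd=True`, the scaler divides each feature by its standard deviation without centering it, which keeps the one-hot vectors sparse. A minimal pure-Python sketch of that arithmetic for a single feature column, assuming the sample (n - 1) standard deviation and a 0.0 result for constant columns; the helper name and toy values are illustrative only:

```python
import math

def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation (no centering)."""
    n = len(column)
    mean = sum(column) / float(n)
    variance = sum((x - mean) ** 2 for x in column) / float(n - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        # Constant column: return zeros instead of dividing by zero
        return [0.0 for _ in column]
    return [x / std for x in column]

loan_amounts = [5000.0, 10000.0, 15000.0]
scaled = scale_to_unit_std(loan_amounts)
```

The toy column has a standard deviation of 5000, so the scaled values keep their relative spacing but are no longer centered at zero.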


**Split the data into training and testing sets**

In [9]:
trainingData, testData = loan_model.randomSplit([0.7, 0.3], seed = 82)
print "Training set size: " + str(trainingData.count())
print "Testing set size: " + str(testData.count())
print "Distribution of Default and Non-Default in trainingData is: ", trainingData.groupBy("label").count().take(3)

Training set size: 27877
Testing set size: 12122
Distribution of Default and Non-Default in trainingData is:  [Row(label=0.0, count=23913), Row(label=1.0, count=3964)]
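
The counts above show a clear class imbalance. As a quick worked calculation, using only the numbers printed above:

```python
# Label counts copied from the training-set output above
counts = {0.0: 23913, 1.0: 3964}

total = sum(counts.values())
default_rate = counts[1.0] / float(total)
majority_baseline_accuracy = counts[0.0] / float(total)

print("Default rate in training data: %.3f" % default_rate)
print("Accuracy of always predicting non-default: %.3f" % majority_baseline_accuracy)
```

About 14% of the training loans defaulted, so a model that always predicts non-default already reaches roughly 86% accuracy. This is one reason to compare the models by area under the ROC curve rather than by accuracy.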


## Build Models

The models and model pipelines were built as follows:

1. Run Logistic Regression manually with a grid parameter search
2. Run Decision Tree manually with a grid parameter search
3. Run Random Forest manually with a grid parameter search
4. Build a pipeline using the best models found by the grid search
5. Proceed to WML deployment

### Logistic Regression
**Use CrossValidator and ParamGridBuilder to search for the best model**

In [17]:
# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="label", featuresCol="features", threshold=0.3)

# Create ParamGrid for Cross Validation
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.001, 0.1])
             .addGrid(lr.elasticNetParam, [0.0,1.0])
             .addGrid(lr.maxIter, [100])
             .build())


evaluator = BinaryClassificationEvaluator()

# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=2)

# Run cross validations
lrCvModel = cv.fit(trainingData)

**Use BinaryClassificationEvaluator to evaluate the model**

Note that the default metric for `BinaryClassificationEvaluator` is `areaUnderROC`.
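
As a reminder of what this metric measures: the area under the ROC curve equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter, with ties counted as one half. The pairwise sketch below illustrates that definition on toy data. It is not how Spark computes the value (Spark integrates the ROC curve), so results may differ slightly, and the function name is made up for illustration.

```python
def area_under_roc(labels, scores):
    """Pairwise (rank-based) AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = area_under_roc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking, so a model scoring below 0.5 ranks borrowers worse than chance.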

In [18]:
# Use the test set here so we can measure how well the model performs on new data
lr_uroc = evaluator.evaluate(lrCvModel.transform(testData))

print "areaUnderROC for LR: " + str(lr_uroc)

print "Cross tab for prediction vs actual table"
lrCvModel.transform(testData).stat.crosstab("label", "prediction").show()
# lrCvModel.bestModel.transform(testData).toPandas()
#print lrCvModel.bestModel.coefficients
#print lrCvModel.bestModel.intercept


areaUnderROC for LR: 0.690960338722
Cross tab for prediction vs actual table
+----------------+----+---+
|label_prediction| 0.0|1.0|
+----------------+----+---+
|             1.0|1512|229|
|             0.0|9958|423|
+----------------+----+---+
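
The cross tab can be turned into the usual classification metrics. A small sketch using the four counts printed above (rows are actual labels, columns are predictions); the helper name is made up for illustration:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / float(total),
        "precision": tp / float(tp + fp) if (tp + fp) else 0.0,
        "recall": tp / float(tp + fn) if (tp + fn) else 0.0,
    }

# label=1.0 row: 1512 predicted 0.0, 229 predicted 1.0
# label=0.0 row: 9958 predicted 0.0, 423 predicted 1.0
lr_metrics = confusion_metrics(tp=229, fn=1512, fp=423, tn=9958)
for name in sorted(lr_metrics):
    print("%s: %.3f" % (name, lr_metrics[name]))
```

With the 0.3 threshold, the logistic regression catches roughly 13% of the actual defaults, and about 35% of the loans it flags do default.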



### Decision Tree

In [22]:
from pyspark.ml.classification import DecisionTreeClassifier

# Create initial Decision Tree Model
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", maxDepth=5)


# Hyperparameter Tuning
paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [10])
             .addGrid(dt.maxBins, [40])
             .build())


# Create 5-fold CrossValidator
cv = CrossValidator(estimator=dt, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations
dtCvModel = cv.fit(trainingData)

print "numNodes = ", dtCvModel.bestModel.numNodes
print "depth = ", dtCvModel.bestModel.depth


# Evaluate the model

predictions = dtCvModel.transform(testData)

evaluator = BinaryClassificationEvaluator()
dt_uroc = evaluator.evaluate(predictions)

print "areaUnderROC for DT: " + str(dt_uroc)
print "Cross tab for prediction vs actual table"
dtCvModel.transform(testData).stat.crosstab("label", "prediction").show()

numNodes =  921
depth =  10
areaUnderROC for DT: 0.47766473024
Cross tab for prediction vs actual table
+----------------+-----+---+
|label_prediction|  0.0|1.0|
+----------------+-----+---+
|             1.0| 1651| 90|
|             0.0|10136|245|
+----------------+-----+---+



### Random Forest

In [19]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

paramGrid = (ParamGridBuilder()
             .addGrid(rf.maxDepth, [2, 6])
             .addGrid(rf.maxBins, [20, 40])
             .addGrid(rf.numTrees, [5, 20])
             .build())

cv = CrossValidator(estimator=rf, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)

# Run cross validations.  This can take about 6 minutes since it is training over 20 trees!
rfCvModel = cv.fit(trainingData)

predictions = rfCvModel.transform(testData)

rf_uroc = evaluator.evaluate(predictions)

print "areaUnderROC for RF: " + str(rf_uroc)
print "Cross tab for prediction vs actual table"
rfCvModel.transform(testData).stat.crosstab("label", "prediction").show()

areaUnderROC for RF: 0.677956226197
Cross tab for prediction vs actual table
+----------------+-----+
|label_prediction|  0.0|
+----------------+-----+
|             1.0| 1741|
|             0.0|10381|
+----------------+-----+



In [23]:
print "areaUnderROC for LR: " + str(lr_uroc)
print "areaUnderROC for DT: " + str(dt_uroc)
print "areaUnderROC for RF: " + str(rf_uroc)


areaUnderROC for LR: 0.690960338722
areaUnderROC for DT: 0.47766473024
areaUnderROC for RF: 0.677956226197


In [24]:
rfCvModel.bestModel.featureImportances

SparseVector(94, {0: 0.002, 1: 0.0024, 2: 0.0034, 3: 0.0023, 4: 0.0056, 5: 0.0018, 6: 0.002, 7: 0.0026, 8: 0.0, 9: 0.0019, 10: 0.0038, 11: 0.0103, 12: 0.0108, 13: 0.0059, 14: 0.0127, 15: 0.0021, 16: 0.002, 17: 0.0018, 18: 0.0221, 19: 0.0028, 20: 0.0043, 21: 0.0058, 22: 0.0324, 23: 0.0027, 24: 0.0006, 25: 0.0019, 26: 0.0028, 27: 0.0007, 28: 0.001, 29: 0.0038, 30: 0.0044, 31: 0.003, 32: 0.0058, 33: 0.0004, 34: 0.0027, 35: 0.001, 36: 0.0016, 37: 0.0022, 38: 0.0038, 39: 0.002, 40: 0.001, 41: 0.0025, 42: 0.0014, 43: 0.0005, 44: 0.0007, 45: 0.0012, 46: 0.0031, 47: 0.0011, 48: 0.0018, 49: 0.0016, 50: 0.0054, 51: 0.0006, 52: 0.0003, 53: 0.0002, 54: 0.0029, 55: 0.0017, 56: 0.0012, 58: 0.0007, 59: 0.0007, 60: 0.001, 61: 0.0018, 62: 0.0006, 63: 0.0014, 64: 0.0008, 65: 0.0007, 66: 0.0017, 67: 0.0009, 68: 0.0005, 69: 0.0012, 70: 0.0011, 71: 0.0013, 72: 0.001, 74: 0.0003, 78: 0.0011, 79: 0.2323, 80: 0.055, 81: 0.0835, 82: 0.0698, 83: 0.034, 84: 0.0266, 85: 0.1285, 86: 0.0304, 87: 0.0303, 88: 0.0105,

## Model Deployment via Watson Machine Learning Service (WML)

<img src="https://github.com/CatherineCao2016/lendingclub/raw/master/depolyment.png" width="800" height="500" align="middle"/>

**Create Pipeline for WML**

In [54]:
# inputdf must contain a DEFAULT column; it is cast to double below to create the label

def build_model(inputdf):
    
    inputdf = inputdf.withColumn('label', inputdf['DEFAULT'].cast(DoubleType()))
    
    catCols = ['EMP_LENGTH', 'VERIFICATION_STATUS', 'HOME_OWNERSHIP', 'PURPOSE', 'ADDR_STATE', 'TERM']
    
    # TODO: generate the stages below in a for loop so that the variable list can be user-defined
    SI1 = StringIndexer(inputCol='EMP_LENGTH', outputCol='EMP_LENGTH'+"Index")
    SI2 = StringIndexer(inputCol='VERIFICATION_STATUS', outputCol='VERIFICATION_STATUS'+'Index')
    SI3 = StringIndexer(inputCol='HOME_OWNERSHIP', outputCol='HOME_OWNERSHIP'+'Index')
    SI4 = StringIndexer(inputCol='PURPOSE', outputCol='PURPOSE'+'Index')
    SI5 = StringIndexer(inputCol='ADDR_STATE', outputCol='ADDR_STATE'+'Index')
    SI6 = StringIndexer(inputCol='TERM', outputCol='TERM'+'Index')

    OH1 = OneHotEncoder(inputCol='EMP_LENGTH' + 'Index', outputCol='EMP_LENGTH' + 'classVec')
    OH2 = OneHotEncoder(inputCol='VERIFICATION_STATUS' + 'Index', outputCol='VERIFICATION_STATUS' + 'classVec')
    OH3 = OneHotEncoder(inputCol='HOME_OWNERSHIP' + 'Index', outputCol='HOME_OWNERSHIP' + 'classVec')
    OH4 = OneHotEncoder(inputCol='PURPOSE' + 'Index', outputCol='PURPOSE' + 'classVec')
    OH5 = OneHotEncoder(inputCol='ADDR_STATE' + 'Index', outputCol='ADDR_STATE' + 'classVec')
    OH6 = OneHotEncoder(inputCol='TERM' + 'Index', outputCol='TERM' + 'classVec')
    
    numCols = ['LOAN_AMNT', 'ANNUAL_INC', 'INQ_LAST_6MTHS', 'OPEN_ACC', 'PUB_REC', 'REVOL_UTIL', 'DTI', 'TOTAL_ACC', 'DELINQ_2YRS', 'EMP_LISTED', 'EMPTY_DESC', 'EMP_NA', 'DELING_EVER', 'TIME_HISTORY']
    
    assemblerInputs = [c + "classVec" for c in catCols] + numCols
    
    assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features_non_scaled")
    
    scaler = StandardScaler(withMean=False, withStd=True, inputCol="features_non_scaled", outputCol="features")
    #scalerModel = scaler.fit(loan_spark)
    #loan_spark = scalerModel.transform(loan_spark)
    
    
    
    print "Training Model..."
    
    #lr_final = LogisticRegression(maxIter=10, regParam=0.1, elasticNetParam=0.0, threshold = 0.5, labelCol="label", featuresCol="features")
    
    rf_final_model = RandomForestClassifier(labelCol="label", featuresCol="features", maxDepth = 6, maxBins = 60, numTrees = 20)

    # Use the best model from your cross validation runs above ...
    pipeline_lr = Pipeline(stages=[SI1, SI2, SI3, SI4, SI5, SI6, OH1, OH2, OH3, OH4, OH5, OH6, assembler,scaler, lrCvModel.bestModel])
    pipeline_rf = Pipeline(stages=[SI1, SI2, SI3, SI4, SI5, SI6, OH1, OH2, OH3, OH4, OH5, OH6, assembler,scaler, rfCvModel.bestModel])
     
    model_lr = pipeline_lr.fit(inputdf)
    model_rf = pipeline_rf.fit(inputdf)

    
    print "Model built!!"
    
    return pipeline_lr, model_lr, pipeline_rf, model_rf
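
Regarding the to-do note in `build_model`: the twelve hand-written stages follow a single naming pattern, so they could be produced in a loop. Below is a minimal sketch of the column-name bookkeeping in plain Python, with the corresponding PySpark constructor calls shown only as comments. `stage_columns` is a hypothetical helper, not part of the notebook.

```python
def stage_columns(cat_cols, num_cols):
    """Return (indexer_cols, encoder_cols, assembler_inputs) for the given variable lists."""
    indexer_cols = [(c, c + "Index") for c in cat_cols]
    encoder_cols = [(c + "Index", c + "classVec") for c in cat_cols]
    assembler_inputs = [c + "classVec" for c in cat_cols] + list(num_cols)
    return indexer_cols, encoder_cols, assembler_inputs

cat_cols = ['EMP_LENGTH', 'TERM']
num_cols = ['LOAN_AMNT', 'DTI']
indexer_cols, encoder_cols, assembler_inputs = stage_columns(cat_cols, num_cols)

# Inside build_model these pairs would feed the constructors the notebook already uses:
# indexers = [StringIndexer(inputCol=i, outputCol=o) for i, o in indexer_cols]
# encoders = [OneHotEncoder(inputCol=i, outputCol=o) for i, o in encoder_cols]
# assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features_non_scaled")
# stages = indexers + encoders + [assembler, scaler, lrCvModel.bestModel]
```

Passing `catCols` and `numCols` as function arguments would then let the caller choose the variable list.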

In [55]:
# Use the full labeled dataset for training; cross-validation is not applicable here
loan_spark_read = spark.read.parquet("home/lending_club/loan_sub_kp").cache()
pipeline_lr, model_lr, pipeline_rf, model_rf = build_model(loan_spark_read)

Training Model...
Model built!!


**Set up Watson Machine Learning Credentials**

In [56]:
# 
cc_creds = {
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "8I7slbLraBwPGRVdAvhVBs4quUlHxQBfVh9AcsReS3CEYVe+pQs2Lmppeo/ZVIpYHxGxQ3pIogjgEOjN0TGDTcL0h32gVzPkwMbmHXNpi+FQYUqQmv73SQJrb1WXWeZv",
  "username": "0b45b40e-f2e5-43a4-bc0a-55cb076a4ee6",
  "password": "813db8af-b707-4e59-a676-357cfe1ac299"
}


dv_creds = {
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "kbXV3OOJ0i2mjGVhB461icjYpZlBFyiIjIpOn/ys0bSNe4rD50whFt1EcTocKgHvHxGxQ3pIogjgEOjN0TGDTcL0h32gVzPkwMbmHXNpi+FQYUqQmv73SQJrb1WXWeZv",
  "username": "7ddbfc51-2af5-4029-8e7f-f609a255fd5b",
  "password": "f5604e9e-7220-4f23-8a42-1ff814a72362",
  "instance_id": "d51854a2-84b2-41db-90f0-ac2419a944f2"
}
# Using Dustin's WML creds for now
creds = dv_creds
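
Hard-coding service credentials in a notebook exposes them to anyone who can read it. One common alternative, sketched here under the assumption that the values have been exported as environment variables (the `WML_*` variable names and the `load_wml_creds` helper are invented for this example), is to build the same dictionary at run time:

```python
import os

def load_wml_creds(env=None):
    """Build a creds dict from WML_URL, WML_ACCESS_KEY, WML_USERNAME, WML_PASSWORD, WML_INSTANCE_ID."""
    env = os.environ if env is None else env
    required = ["url", "access_key", "username", "password", "instance_id"]
    creds = {}
    missing = []
    for key in required:
        var_name = "WML_" + key.upper()
        value = env.get(var_name)
        if value:
            creds[key] = value
        else:
            missing.append(var_name)
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return creds
```

The notebook would then use `creds = load_wml_creds()` in place of the literal dictionaries.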

## UDFs

In [57]:
def download(url):
    filename = url.split('/')[-1]
    print 'Downloading', filename
    http = urllib3.PoolManager()
    response = http.request('GET', url)
    data = response.data
    # Write in binary mode because response.data is raw bytes
    with open(filename, 'wb') as myfile:
        myfile.write(data)

#download('https://raw.githubusercontent.com/CatherineCao2016/lendingclub/master/deployfuncs.py')

In [71]:
%%bash
touch __init__.py
rm -rf ./wml_deployfuncs.py
wget https://github.com/dustinvanstee/lendingclub/raw/master/lendingclub-flask-demo/wml_deployfuncs.py

--2017-10-03 14:40:49--  https://github.com/dustinvanstee/lendingclub/raw/master/lendingclub-flask-demo/wml_deployfuncs.py
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/dustinvanstee/lendingclub/master/lendingclub-flask-demo/wml_deployfuncs.py [following]
--2017-10-03 14:40:50--  https://raw.githubusercontent.com/dustinvanstee/lendingclub/master/lendingclub-flask-demo/wml_deployfuncs.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6962 (6.8K) [text/plain]
Saving to: ‘wml_deployfuncs.py’

     0K ......                                                100% 12.7M=0.001s

2017-10-03 14:40:50 (12.7 MB/s) - ‘wml_deplo

In [72]:
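# The helpers called below (save_model_by_name, get_published_models, deploy_model,
# score_example) are used unqualified; this assumes they are defined in this module
from wml_deployfuncs import *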

### Save the model to WML repository

In [69]:
loan_spark_read = loan_spark_read.withColumn('label', loan_spark_read['DEFAULT'].cast(DoubleType()))
print "Saving Modeling...Model ID:"
published_model_name_or_id = save_model_by_name(creds, "lc_rf_defaultprediction", model_rf, loan_spark_read)
published_model_name_or_id = save_model_by_name(creds, "lc_lr_defaultprediction", model_lr, loan_spark_read)


Saving Modeling...Model ID:
## Published Model Summary ##
# Published Model 0 0c7052b8-fe40-49b1-8ce9-19a8ea964e99 lc_lr_defaultprediction
# Published Model 1 33a64930-1e16-46d8-bb46-762426a1846f dv2
# Published Model 2 5bfab4df-4343-4b11-a3f0-347730135a69 lc_rf_defaultprediction
# Published Model 3 82eff54f-8fb9-41f1-87fa-6fec32e5dcd5 Probability Model - CV+Bin2
# Published Model 4 d6feb26f-5ba3-4446-a40b-1176948c2cf8 Driver Ranking - CV+Bin2
Deleting Model lc_rf_defaultprediction 5bfab4df-4343-4b11-a3f0-347730135a69
Successfully deleted model
status = 204
## Published Model Summary ##
# Published Model 0 0c7052b8-fe40-49b1-8ce9-19a8ea964e99 lc_lr_defaultprediction
# Published Model 1 32b184b6-2208-4598-a813-58bbbf9d7921 lc_rf_defaultprediction
# Published Model 2 33a64930-1e16-46d8-bb46-762426a1846f dv2
# Published Model 3 82eff54f-8fb9-41f1-87fa-6fec32e5dcd5 Probability Model - CV+Bin2
# Published Model 4 d6feb26f-5ba3-4446-a40b-1176948c2cf8 Driver Ranking - CV+Bin2
Deleting Model l

### Deploy the saved model

In [70]:
published_models_json = get_published_models(creds)
rf_scoring_url = deploy_model(creds, published_models_json, "lc_rf_defaultprediction")
lr_scoring_url = deploy_model(creds, published_models_json, "lc_lr_defaultprediction")



## Published Model Summary ##
# Published Model 0 32b184b6-2208-4598-a813-58bbbf9d7921 lc_rf_defaultprediction
# Published Model 1 33a64930-1e16-46d8-bb46-762426a1846f dv2
# Published Model 2 82eff54f-8fb9-41f1-87fa-6fec32e5dcd5 Probability Model - CV+Bin2
# Published Model 3 d6feb26f-5ba3-4446-a40b-1176948c2cf8 Driver Ranking - CV+Bin2
# Published Model 4 e8915a60-0f53-4d34-8a9f-eef64e2a5103 lc_lr_defaultprediction
https://ibm-watson-ml.mybluemix.net/v3/wml_instances/d51854a2-84b2-41db-90f0-ac2419a944f2/published_models/32b184b6-2208-4598-a813-58bbbf9d7921/deployments/d56a7a94-8c01-4d01-ba0d-5c481c8bd625/online
https://ibm-watson-ml.mybluemix.net/v3/wml_instances/d51854a2-84b2-41db-90f0-ac2419a944f2/published_models/e8915a60-0f53-4d34-8a9f-eef64e2a5103/deployments/249ecd9e-c21c-4c9b-bbfc-cc1e7d3fcd61/online


### Scoring: Call REST API

**Create a JSON sample record for scoring**

In [61]:
sample_data = {
  "fields": ['LOAN_AMNT',
 'EMP_LENGTH',
 'VERIFICATION_STATUS',
 'HOME_OWNERSHIP',
 'ANNUAL_INC',
 'PURPOSE',
 'INQ_LAST_6MTHS',
 'OPEN_ACC',
 'PUB_REC',
 'REVOL_UTIL',
 'DTI',
 'TOTAL_ACC',
 'DELINQ_2YRS',
 'EARLIEST_CR_LINE',
 'ADDR_STATE',
 'TERM',
 'DEFAULT',
 'EMP_LISTED',
 'EMPTY_DESC',
 'EMP_NA',
 'DELING_EVER',
 'TIME_HISTORY'],
  "values": [
    [4500, '< 1 year', 'Verified', 'RENT', 80000, 'major_purchase', 1, 9, 0, 18.3, 5.39, 16, 0, 780969600000000000, 'CA', '36 months', 0, 1, 0, 0, 0, 6148]
  ]
}

sample_json = json.dumps(sample_data)
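
Each row in `values` is matched to `fields` by position, so it helps to check that the lengths agree before sending the payload. A small sketch of such a check; `check_payload` is a hypothetical helper, and the toy payload only mimics the shape of `sample_data`:

```python
import json

def check_payload(payload):
    """Return the indexes of rows whose length differs from the number of fields."""
    n_fields = len(payload["fields"])
    return [i for i, row in enumerate(payload["values"]) if len(row) != n_fields]

toy = {
    "fields": ["LOAN_AMNT", "TERM"],
    "values": [[4500, "36 months"], [9000, "60 months", 0]],
}
bad_rows = check_payload(toy)   # the second row has one value too many
toy_json = json.dumps(toy)
```

Running `check_payload(sample_data)` before `json.dumps` would flag any row that does not line up with the 22 field names.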

**Make API call for scoring**

In [62]:
print lr_scoring_url

https://ibm-watson-ml.mybluemix.net/v3/wml_instances/d51854a2-84b2-41db-90f0-ac2419a944f2/published_models/0c7052b8-fe40-49b1-8ce9-19a8ea964e99/deployments/da02c490-7aa6-44bb-854e-de85becea066/online


In [63]:
# Score the sample record against the deployed LR model's scoring endpoint
scoring_response = score_example(creds, lr_scoring_url, sample_json)


{
  "fields": ["LOAN_AMNT", "EMP_LENGTH", "VERIFICATION_STATUS", "HOME_OWNERSHIP", "ANNUAL_INC", "PURPOSE", "INQ_LAST_6MTHS", "OPEN_ACC", "PUB_REC", "REVOL_UTIL", "DTI", "TOTAL_ACC", "DELINQ_2YRS", "EARLIEST_CR_LINE", "ADDR_STATE", "TERM", "DEFAULT", "EMP_LISTED", "EMPTY_DESC", "EMP_NA", "DELING_EVER", "TIME_HISTORY", "EMP_LENGTHIndex", "VERIFICATION_STATUSIndex", "HOME_OWNERSHIPIndex", "PURPOSEIndex", "ADDR_STATEIndex", "TERMIndex", "EMP_LENGTHclassVec", "VERIFICATION_STATUSclassVec", "HOME_OWNERSHIPclassVec", "PURPOSEclassVec", "ADDR_STATEclassVec", "TERMclassVec", "features_non_scaled", "features", "rawPrediction", "probability", "prediction"],
  "values": [[4500, "< 1 year", "Verified", "RENT", 80000.0, "major_purchase", 1, 9, 0, 18.3, 5.39, 16, 0, 780969600000000000, "CA", "36 months", 0, 1, 0, 0, 0, 6148, 1.0, 1.0, 0.0, 4.0, 0.0, 0.0, [11, [1], [1.0]], [2, [1], [1.0]], [4, [0], [1.0]], [13, [4], [1.0]], [49, [0], [1.0]], [1, [0], [1.0]], [94, [1, 12, 13, 21, 30, 79, 80, 81, 82, 8

**Grab Prediction Value**

In [64]:
wml = json.loads(scoring_response)

# Map each field name to its value in the first (and only) scored row
scored = dict(zip(wml['fields'], wml['values'][0]))

# Look up the prediction and the class probabilities by field name
print "Default Prediction for this borrower is: " + str(scored['prediction'])
print "Default Probability for this borrower is: " + str(scored['probability'])

Default Prediction for this borrower is: 0.0
Default Probability for this borrower is: [0.9303301742353534, 0.06966982576464667]
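
The `probability` list holds the class probabilities in label order, [P(no default), P(default)]. For a binary classifier the prediction follows from comparing P(default) with the decision threshold; the logistic regression above was created with `threshold=0.3`. A minimal sketch of that rule, assuming the deployed model kept that threshold (the function name is made up for illustration):

```python
def predict_default(probability, threshold=0.3):
    """Return 1.0 when the default probability exceeds the threshold, else 0.0."""
    p_default = probability[1]
    return 1.0 if p_default > threshold else 0.0

# Probabilities copied from the scoring output above
probability = [0.9303301742353534, 0.06966982576464667]
prediction = predict_default(probability)
print("prediction = " + str(prediction))
```

The default probability of about 7% is well below 0.3, which is consistent with the prediction of 0.0 returned by the service.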


## Default Prediction App Powered by Watson Machine Learning

Go to the web app: https://lendingclub-flask-demo.mybluemix.net/#

To view the source of this web app, go to https://github.com/dustinvanstee/lendingclub

## Model Retraining and Redeploying -> [WIP]

In [73]:
# retrain_and_deploy(creds, loan_spark_read, "Updated_LR_Model")