<h1 id="tocheading">Attrition Demo</h1>
<div id="toc"></div>

<img src="https://github.com/elenalowery/DSX_Local_Workshop/blob/master/img/CC_Intro.JPG?raw=true" width="800" height="500" align="middle"/>

The Attrition demo focuses on retaining merchants that use the company network for credit card processing. Here is the description of the case:

A client approved many low-value merchant accounts without much scrutiny. Many of those merchant accounts resulted in default. The client thinks that they should have put more emphasis on their applicant screening process. IBM suggests enabling fact-based decision making for the performance of its joint marketing programs.

This notebook will demonstrate how to:

1. Use the Brunel and Seaborn libraries for visualizations
2. Use the standard Python machine learning library scikit-learn and Spark's machine learning library (MLlib) for predictive modeling in an integrated environment on DSX
3. Deploy a SparkML model using the Machine Learning Service

## Set up environment

In [None]:
import sklearn

import pandas as pd
pd.options.display.max_columns = 999

import brunel

import warnings
warnings.filterwarnings('ignore')

from scipy.stats import chi2_contingency,ttest_ind
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# sklearn.model_selection replaces sklearn.cross_validation (deprecated since scikit-learn 0.18)
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, accuracy_score, roc_curve, roc_auc_score

import numpy as np

import urllib3, requests, json

## Load Customer History Data

In [None]:
cust_pd = pd.read_csv('../datasets/customer_history.csv')
cust_pd.head()

### Dataset Overview

Let's take a quick look at the dataset.

In [None]:
# print() with a single argument behaves the same on Python 2 and Python 3
print("There are " + str(len(cust_pd)) + " observations in the customer history dataset.")
print("There are " + str(len(cust_pd.columns)) + " variables in the dataset.")

print("\n******************Descriptive statistics*****************************\n")
print(cust_pd.describe())

print("\n******************Dataset Quick View*****************************\n")
cust_pd.head()

## Exploratory Data Analysis

In this section, we will explore the dataset further with some visualizations.

Two open source libraries are used:
* <a href="https://github.com/Brunel-Visualization/Brunel">Brunel</a> is a high-level language that describes visualizations in terms of composable actions. It drives a visualization engine (D3) that performs the actual rendering and interactivity. Brunel makes it much easier to build fun and inventive visualizations in Jupyter notebooks.

* <a href="https://seaborn.pydata.org/">Seaborn</a> is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.

### Outcome Variable: Account Default

In [None]:
%brunel data('cust_pd') x(IS_DEFAULT) y(#count) color(IS_DEFAULT) bar tooltip(#all)

As you can see from the bar chart, 300 out of 1000 accounts are in default.
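
The same split can be checked numerically. This is a minimal sketch on a small hypothetical frame, `demo`, which stands in for `cust_pd`. It assumes only that `IS_DEFAULT` holds `Yes`/`No` strings, as the filters later in this notebook suggest.

```python
import pandas as pd

# Hypothetical stand-in for cust_pd: 3 defaults out of 10 accounts
demo = pd.DataFrame({"IS_DEFAULT": ["Yes"] * 3 + ["No"] * 7})

# Absolute number of accounts in each class
counts = demo["IS_DEFAULT"].value_counts()

# Share of each class; the "Yes" share is the overall default rate
shares = demo["IS_DEFAULT"].value_counts(normalize=True)

print(counts)
print(shares)
```

Running the same two `value_counts()` calls on `cust_pd` should reproduce the counts shown in the bar chart tooltip.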

### Default by Credit Program

In [None]:
%brunel data('cust_pd') polar stack bar y(#count) color(CREDIT_PROGRAM) percent(#count) tooltip(#all) | stack bar x(CREDIT_PROGRAM) y(#count) color(IS_DEFAULT) bin(CREDIT_PROGRAM) percent(#count) label(#count) tooltip(#all) :: width=1200, height=350 

* The top 3 credit programs by number of merchants are Electronics (28%), New Car (23.4%), and Furniture (18.1%).
* The top 3 credit programs by default rate are Education (44%), New Car (38%), and Retraining (35.1%).

### HISTORY vs. IS_DEFAULT

In [None]:
%brunel data('cust_pd') bar x(HISTORY) y(#count) color(HISTORY) tooltip(#all) | stack bar x(HISTORY) y(#count) color(IS_DEFAULT: green-red) bin(HISTORY) sort(HISTORY) percent(#count) label(#count) tooltip(#all) :: width=1200, height=350 

### AMOUNT_K_USD vs. IS_DEFAULT

In [None]:
sub_yes = cust_pd[cust_pd["IS_DEFAULT"] == "Yes"]
sub_no = cust_pd[cust_pd["IS_DEFAULT"] == "No"]
    
p_value = ttest_ind(sub_yes['AMOUNT_K_USD'], sub_no["AMOUNT_K_USD"], equal_var = False)[1]

fig, axs = plt.subplots(nrows= 1, figsize=(13, 5))
sns.boxplot(x = "IS_DEFAULT", y = "AMOUNT_K_USD", data = cust_pd, showfliers=False, palette="Set2")
# Build the title once; the else branch previously referenced an undefined name (sub_safe)
col = "AMOUNT_K_USD"
title = col + "\n P value:" + str(p_value)
if p_value < .05:
    title += "\n The distributions for the two groups are significantly different!"
title += "\n Default: mean/std.: " + str(sub_yes[col].mean()) + "/" + str(sub_yes[col].std())
title += "\n Non-default: mean/std.: " + str(sub_no[col].mean()) + "/" + str(sub_no[col].std())
plt.title(title)

In this box plot, the visualization is annotated with t-test statistics. The result is significant, which indicates that the average credit amounts of the default and non-default groups differ. The default group has the larger average credit amount.
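
The test used above is Welch's t-test, which `ttest_ind` runs when `equal_var=False`. The sketch below shows it in isolation on synthetic data. The group sizes, means, and spreads are invented for illustration and are not taken from the customer history dataset.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(0)

# Hypothetical credit amounts for the two groups (not real data)
default_amounts = rng.normal(loc=4000, scale=1500, size=300)
non_default_amounts = rng.normal(loc=3000, scale=1000, size=700)

# equal_var=False selects Welch's t-test, which does not assume
# that the two groups share the same variance
t_stat, p_value = ttest_ind(default_amounts, non_default_amounts, equal_var=False)

# Same 5% threshold as the plotting cell above
significant = p_value < 0.05
print(t_stat, p_value, significant)
```

A positive `t_stat` means the first sample has the larger mean. Note that the test compares group means, so a significant result says nothing about other differences in the shape of the two distributions.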



### Default rate by state

In [None]:
default_rate = pd.crosstab(cust_pd.IS_DEFAULT, cust_pd.STATE).apply(lambda r: r/r.sum(), axis=0)

default_rate2 = default_rate.T

%brunel data('default_rate2') map color(Yes) key(STATE) label(STATE)


Brunel also provides a neat way to draw map visualizations. In this use case, all the merchants come from 4 states: NY, NJ, PA and CT.
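
The per-state default rate behind the map comes from a column-normalized cross tabulation. Here is a minimal sketch of that step on a hypothetical six-row frame. The `demo` data is invented, and only the column names `STATE` and `IS_DEFAULT` come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for cust_pd
demo = pd.DataFrame({
    "STATE": ["NY", "NY", "NY", "NY", "NJ", "NJ"],
    "IS_DEFAULT": ["Yes", "No", "No", "No", "Yes", "No"],
})

# Rows are IS_DEFAULT values, columns are states, cells are account counts
counts = pd.crosstab(demo.IS_DEFAULT, demo.STATE)

# axis=0 passes one column (one state) at a time, so each state sums to 1
rates = counts.apply(lambda col: col / col.sum(), axis=0)

# Transpose so that each state becomes a row with "No" and "Yes" shares
by_state = rates.T

# pd.crosstab can also normalize directly
rates_direct = pd.crosstab(demo.IS_DEFAULT, demo.STATE, normalize="columns")

print(by_state)
```

In this toy data NY has one default in four accounts and NJ has one in two, so the `Yes` column of `by_state` holds 0.25 and 0.5.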

### Correlation Matrix

A heatmap is used to visualize the correlations between all continuous variables.

In [None]:
plt.figure(figsize=(12, 8))

corr_df = cust_pd.iloc[:,1:].corr()

sns.heatmap(corr_df, 
            xticklabels = corr_df.columns.values,
            yticklabels = corr_df.columns.values,
            annot = True);


* There is no strong correlation between most variables.
* The correlation between AMOUNT_K_USD and CONTRACT_DURATION_MONTH is moderate.
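
Because the heatmap is meant for continuous variables, it can help to select the numeric columns explicitly before calling `corr()`. The following sketch shows this on a hypothetical frame. The values are invented, and `select_dtypes` is offered as one possible way to do the selection, not as what the cell above does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for cust_pd with one string and three numeric columns
demo = pd.DataFrame({
    "STATE": ["NY", "NJ", "PA", "CT", "NY"],
    "AMOUNT_K_USD": [1000, 2000, 3000, 4000, 5000],
    "CONTRACT_DURATION_MONTH": [6, 12, 18, 24, 30],
    "NUMBER_CREDITS": [2, 1, 2, 1, 2],
})

# Keep only numeric columns so that string columns never reach corr()
numeric = demo.select_dtypes(include=[np.number])

# Pearson correlation between every pair of numeric columns
corr = numeric.corr()
print(corr)
```

The resulting `corr` frame can be passed to `sns.heatmap` in the same way as `corr_df` above.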

## Modeling And Evaluation

### Sklearn Random Forest

In [None]:
# convert IS_DEFAULT to 1/0
le = LabelEncoder()

cust_pd.loc[:,'IS_DEFAULT']= le.fit_transform(cust_pd.loc[:,'IS_DEFAULT'])

y = np.float32(cust_pd.IS_DEFAULT)

# drop y and merchant
X = cust_pd.drop(['IS_DEFAULT', 'MERCHANT'], axis = 1)


In [None]:
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper(
    [('ACCT_STATUS_K_USD', LabelEncoder()),
     ('CONTRACT_DURATION_MONTH', None),
     ('HISTORY',LabelEncoder()),
     ('CREDIT_PROGRAM', LabelEncoder()),
     ('AMOUNT_K_USD',None),
     ('ACCOUNT_TYPE',LabelEncoder()),
     ('ACCT_AGE',LabelEncoder()),
     ('STATE',LabelEncoder()),
     ('PRESENT_RESIDENT',LabelEncoder()),
     ('ESTABLISHED_MONTH',None),
     ('NUMBER_CREDITS',None)]
)

In [None]:
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [None]:
import sklearn.pipeline
from sklearn.preprocessing import OneHotEncoder

random_forest = RandomForestClassifier()
steps = [('mapper', mapper),('RandomForestClassifier', random_forest)]
pipeline = sklearn.pipeline.Pipeline(steps)
model=pipeline.fit( X_train, y_train )
model

In [None]:
# Call pipeline.predict() on the X_test data to make a set of test predictions
y_prediction = pipeline.predict( X_test )
# Evaluate the predictions using sklearn.metrics.classification_report()
report = sklearn.metrics.classification_report( y_test, y_prediction )
# Print the report
print(report)
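
The classification report scores hard 0/1 predictions. The setup cell also imports `roc_auc_score`, which scores predicted probabilities instead. The sketch below is one possible way to use it, shown on synthetic data. The arrays `X_demo` and `y_demo` and the forest settings are assumptions for illustration, and the `sklearn.model_selection` import assumes scikit-learn 0.18 or later.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic features; the label depends on the first feature plus noise
X_demo = rng.normal(size=(500, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0.5).astype(int)

# Stratified 80/20 split, mirroring the split used in this notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
proba = clf.predict_proba(X_te)[:, 1]

# Area under the ROC curve: 0.5 is chance level, 1.0 is a perfect ranking
auc = roc_auc_score(y_te, proba)
print(auc)
```

The scikit-learn `Pipeline` class also exposes `predict_proba` when its final step does, so the same call should work on `pipeline` with `X_test`.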

### Save model to ML Repository

In [None]:
#convert the y_test array into a pandas dataframe
y_test_df = pd.DataFrame(y_test,columns=['IS_DEFAULT'])

In [None]:
from dsx_ml.ml import save

model_name = "CreditCard_Default_Model"
save(model = model, name = model_name, x_test=X_test, y_test=y_test_df, algorithm_type = 'Classification')

### Test Saved Model with Test UI
1. Save the notebook and switch to the **Models** tab of the project (**hint**: right click the project name link, DSX_Local_Workshop, at the top, and open with another tab in your browser). 
2. Under **Models**, find and click into your saved model. 
3. Click the **Test** link to test the model. 
4. When you enter values for String variables, **don't** include quotes. 

You can use the following data for testing (please note that the order of fields may be different in the UI):<br/>
`ACCT_STATUS_K_USD='0 to 200 USD', HISTORY='CRITICAL ACCOUNT', CREDIT_PROGRAM='NEW CAR', ACCOUNT_TYPE='UNKNOWN/NONE', ACCT_AGE='4 to 7 YRS', STATE='NY', PRESENT_RESIDENT='2 to 3 YRS', CONTRACT_DURATION_MONTH=3, AMOUNT_K_USD=10000, ESTABLISHED_MONTH=40, NUMBER_CREDITS=2`

The results of the test are displayed as follows:<br/>
<img style="float: left;" src="https://github.com/elenalowery/DSX_Local_Workshop/blob/master/img/CC_Test.JPG?raw=true" alt="Test API" width=900 />

### Test the model with REST API (Optional)
This step demonstrates an "internal REST API" call to test the model (for an unpublished model). Notice that we are using DSX variables for the model endpoint and token. See the documentation for external REST call syntax. An external REST call will have a different endpoint and will require authentication. 

In [3]:
json_payload =[{
    "MERCHANT":999,
    "ACCT_STATUS_K_USD":"0 USD",
    "CONTRACT_DURATION_MONTH":12,
    "HISTORY":"CRITICAL ACCOUNT",
    "CREDIT_PROGRAM":"NEW CAR",
    "AMOUNT_K_USD":2171,
    "ACCOUNT_TYPE":"up to 100 K USD",
    "ACCT_AGE":"1 to 4 YRS",
    "STATE":"NY",
    "IS_URBAN":"NO",
    "IS_XBORDER":"NO",
    "SELF_REPORTED_ASMT":"NO",
    "CO_APPLICANT":"YES",
    "GUARANTOR":"NO",
    "PRESENT_RESIDENT":"4",
    "OWN_REAL_ESTATE":"NO",
    "PROP_UNKN":"NO",
    "ESTABLISHED_MONTH":38,
    "OTHER_INSTALL_PLAN":"NO",
    "RENT":"NO",
    "OWN_RESIDENCE":"YES",
    "NUMBER_CREDITS":2,
    "RFM_SCORE":2,
    "BRANCHES":1,
    "TELEPHONE":"YES",
    "SHIP_INTERNATIONAL":"NO"}]

**Action Required**: Change the *scoring_endpoint* to the value that's shown as the *scoring_endpoint* after running the Save to ML repository function, for example *'scoring_endpoint': 'https://dsxl-api.ibm-private-cloud.svc.cluster.local/v3/project/score/Python27/scikit-learn-0.19/DSX_Local_Workshop_el/CreditCard_Default_Model/1'*. 

In [4]:
import requests, json, os
from pprint import pprint

online_path = 'https://ibm-nginx-svc.ibm-private-cloud.svc.cluster.local/v3/project/score/Python27/scikit-learn-0.19/DSX_Local_Workshop_el/CreditCard_Default_Model/1'

header_online = {'Content-Type': 'application/json', 'Authorization':os.environ['DSX_TOKEN']}

response_scoring = requests.post(online_path, json=json_payload, headers=header_online)

response_dict = json.loads(response_scoring.content)

n = 1
for response in response_dict['object']['output']['predictions']:
    print("{}. {}".format(n,response))
    n+=1

ConnectionError: HTTPSConnectionPool(host='ibm-nginx-svc.ibm-private-cloud.svc.cluster.local', port=443): Max retries exceeded with url: /v3/project/score/Python27/scikit-learn-0.19/DSX_Local_Workshop_el/CreditCard_Default_Model/1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6c0d825d10>: Failed to establish a new connection: [Errno -2] Name or service not known',))

**A prediction of 1 means that the customer is likely to default on the credit card, and 0 means that they are not.**
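
The response handling in the scoring cell can be sketched without a live endpoint. The example below parses a hypothetical response body that uses the same `object` / `output` / `predictions` nesting the cell reads, then attaches the 0/1 meaning described above. The sample body, the helper name `extract_predictions`, and the label wording are assumptions for illustration. The real service response may carry additional fields.

```python
import json

# Hypothetical response body; only the nesting mirrors the scoring cell above
raw = '{"object": {"output": {"predictions": [1, 0]}}}'

def extract_predictions(body):
    """Return the list of predictions, or an empty list if a key is missing."""
    data = json.loads(body)
    return data.get("object", {}).get("output", {}).get("predictions", [])

# Meaning of the encoded outcome, as stated in this notebook
labels = {0: "not likely to default", 1: "likely to default"}

predictions = extract_predictions(raw)
for n, pred in enumerate(predictions, start=1):
    # int() also accepts float predictions such as 1.0
    print("{}. {} ({})".format(n, pred, labels[int(pred)]))
```

Using `.get()` with defaults means that a body without predictions yields an empty list instead of raising a `KeyError`.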

Created by **Catherine Cao** and **Sidney Phoon**
<br/>
catherine.cao@ibm.com<br/>
yfphoon@us.ibm.com<br/>

Dec 29, 2017