## Predicting Customer Churn with Machine Learning 
The objective of this notebook is to follow the CRISP-DM methodology to build a model that predicts customer churn, and to operationalize the model by deploying it to Watson Machine Learning. CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This methodology provides a structured approach to planning a data mining project.

![CRISP-DM](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/crisp_dm.png)

### Step 1: Load in the data
In this section, we use customer data sourced from our S3 connection together with the churn data, which we received as a CSV file. Because these data assets have been added to our project, we can load them into DataFrames with the 'Insert to code' button. Note that data can be added to the project regardless of where it resides, and then merged for analysis.

DSX also provides connector code to load data from and save data to your connected data sources (S3, Apache Hive, IBM Cloudant, IBM DB2, Oracle, Teradata, and more): https://datascience.ibm.com/docs/content/analyze-data/python_load.html

Note: To learn more about the supported operations, see the Spark DataFrame API: https://spark.apache.org/docs/2.0.0-preview/api/python/pyspark.sql.html#pyspark.sql.DataFrame

In [None]:
from ingest.Connectors import Connectors
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

S3loadoptions = {
                  Connectors.AmazonS3.ACCESS_KEY          : '<Insert your AWS access key here>',
                  Connectors.AmazonS3.SECRET_KEY          : '<Insert your AWS secret key here>',
                  Connectors.AmazonS3.SOURCE_BUCKET       : 'demolmw',
                  Connectors.AmazonS3.SOURCE_FILE_NAME    : 'customer.csv',
                  Connectors.AmazonS3.SOURCE_INFER_SCHEMA : '1',
                  Connectors.AmazonS3.SOURCE_FILE_FORMAT  : 'csv'}

customer_DF = sqlContext.read.format('com.ibm.spark.discover').options(**S3loadoptions).load()
customer_DF.show(5)

In [None]:
S3loadoptions2 = {
                  Connectors.AmazonS3.ACCESS_KEY          : '<Insert your AWS access key here>',
                  Connectors.AmazonS3.SECRET_KEY          : '<Insert your AWS secret key here>',
                  Connectors.AmazonS3.SOURCE_BUCKET       : 'demolmw',
                  Connectors.AmazonS3.SOURCE_FILE_NAME    : 'churn.csv',
                  Connectors.AmazonS3.SOURCE_INFER_SCHEMA : '1',
                  Connectors.AmazonS3.SOURCE_FILE_FORMAT  : 'csv'}

churn_DF = sqlContext.read.format('com.ibm.spark.discover').options(**S3loadoptions2).load()
churn_DF.printSchema()
churn_DF.show(5)
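
Access keys are safer kept outside the notebook. As a minimal sketch, assuming the keys are exported under the conventional AWS variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` (an assumption about your environment, not something DSX sets up for you), a small helper can read them at run time. `load_s3_credentials` is a hypothetical helper name, not part of the connector library.

```python
import os


def load_s3_credentials(env=None):
    """Return (access_key, secret_key) read from environment variables.

    The variable names follow the AWS CLI/SDK convention; whether your
    notebook environment defines them is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return env["AWS_ACCESS_KEY_ID"], env["AWS_SECRET_ACCESS_KEY"]


# Example with a stand-in environment (placeholder values, not real keys).
access_key, secret_key = load_s3_credentials(
    {"AWS_ACCESS_KEY_ID": "<access key>", "AWS_SECRET_ACCESS_KEY": "<secret key>"})
```

The two returned values could then be supplied for `Connectors.AmazonS3.ACCESS_KEY` and `Connectors.AmazonS3.SECRET_KEY` in the load options.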

### Step 3: Merge Files

In [None]:
data=customer_DF.join(churn_DF,customer_DF['ID']==churn_DF['ID']).select(customer_DF['*'],churn_DF['CHURN'])

data.toPandas().head()
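
To make the join semantics concrete: `join` without a `how` argument performs an inner join, so only IDs present in both DataFrames survive. The sketch below mimics that in plain Python on a few made-up rows; it assumes each ID appears at most once in the churn data and is not how Spark executes the join.

```python
# Toy rows standing in for customer_DF and churn_DF (values are made up).
customers = [{"ID": 1, "Gender": "F"}, {"ID": 2, "Gender": "M"}, {"ID": 3, "Gender": "F"}]
churn = [{"ID": 1, "CHURN": "T"}, {"ID": 3, "CHURN": "F"}, {"ID": 4, "CHURN": "T"}]

# Index the churn rows by ID, then keep only customers that have a match.
churn_by_id = {row["ID"]: row["CHURN"] for row in churn}
joined = [dict(row, CHURN=churn_by_id[row["ID"]])
          for row in customers if row["ID"] in churn_by_id]

for row in joined:
    print(row)
```

Customer 2 (no churn record) and churn record 4 (no customer) are both dropped, which is worth checking for on the real data by comparing row counts before and after the merge.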

### Step 4: Rename some columns
This step is optional; it removes the spaces from two column names so that they are easier to type.

In [None]:
# withColumnRenamed renames an existing column in a Spark DataFrame and returns a new Spark DataFrame

data = data.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
data.toPandas().head()

### Step 5: Data understanding

### Dataset Overview

In [None]:
df_pandas = data.toPandas()
print("There are " + str(len(df_pandas)) + " observations in the customer history dataset.")
print("There are " + str(len(df_pandas.columns)) + " variables in the dataset.")

print("\n******************Descriptive statistics*****************************\n")
print(df_pandas.drop(['ID'], axis = 1).describe())

### Exploratory Data Analysis

The **Brunel** Visualization Language is a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and more aggressive business users. The system interprets the language and produces visualizations using the user's choice of existing lower-level visualization technologies typically used by application engineers such as RAVE or D3. 

More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki

Try Brunel visualization here:  http://brunel.mybluemix.net/gallery_app/renderer

In [None]:
import brunel
df_pandas = data.toPandas()
%brunel data('df_pandas') stack bar x(Paymethod) y(#count) color(CHURN) bin(Paymethod) percent(#count) label(#count) tooltip(#all) | x(LongDistance) y(Usage) point color(Paymethod) tooltip(LongDistance, Usage) :: width=1100, height=400 

In [None]:
# Heat map
%brunel data('df_pandas') x(LocalBilltype) y(Dropped) color(#count:red) style('symbol:rect; size:100%; stroke:none') tooltip(Dropped,#count)

**PixieDust** is a Python helper library for Spark IPython notebooks. One of its main features is visualization. You'll notice that, unlike other APIs that only produce output, PixieDust creates an interactive UI in which you can explore data.<br/>
More information about PixieDust: https://github.com/ibm-cds-labs/pixiedust

**If you haven't already installed it, uncomment and run the following cell to install the pixiedust Python library in your notebook environment. You only need to run it once**


In [None]:
# !pip install --user --upgrade pixiedust

In [None]:
from pixiedust.display import *
display(data)

### Step 6: Build the Spark pipeline and the Random Forest model
"Pipeline" is an API in SparkML that's used for building models. A pipeline defines a sequence of transformers and estimators that perform the analysis in stages.<br/>
Additional information on SparkML: https://spark.apache.org/docs/2.0.2/ml-guide.html

In [None]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# StringIndexer encodes a string column of labels to a column of label indices. 
SI1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
SI2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
SI3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
SI4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
SI5 = StringIndexer(inputCol='LocalBilltype',outputCol='LocalBilltypeEncoded')
SI6 = StringIndexer(inputCol='LongDistanceBilltype',outputCol='LongDistanceBilltypeEncoded')


# Pipelines API requires that input variables are passed in  a vector
assembler = VectorAssembler(inputCols=["GenderEncoded", "StatusEncoded", "CarOwnerEncoded", "PaymethodEncoded", "LocalBilltypeEncoded", \
                                       "LongDistanceBilltypeEncoded", "Children", "EstIncome", "Age", "LongDistance", "International", "Local",\
                                      "Dropped","Usage","RatePlan"], outputCol="features")

In [None]:
# encode the label column
labelIndexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(data)
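
`StringIndexer`, used above for both the categorical features and the `CHURN` label, assigns indices by descending frequency, so the most common value becomes 0.0. The following plain-Python sketch imitates that behavior on made-up payment methods. The alphabetical tie-breaking is an assumption of the sketch, and `fit_string_index` is a hypothetical helper, not a Spark function.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to an index, most frequent value first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically here,
    # which is an assumption of this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


paymethod = ["CC", "CH", "CC", "Auto", "CC", "CH"]  # made-up sample values
mapping = fit_string_index(paymethod)
encoded = [mapping[v] for v in paymethod]

print(mapping)
print(encoded)
```

Because the indices depend on the data the indexer was fitted on, the same fitted pipeline has to be reused at scoring time, which is what saving the `PipelineModel` achieves.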

In [None]:
# instantiate the algorithm, take the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

In [None]:
# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

In [None]:
# build the pipeline
pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,SI6, labelIndexer, assembler, rf, labelConverter])

In [None]:
# Split data into train and test datasets
(trainingData, testingData) = data.randomSplit([0.7, 0.3],seed=9)
trainingData.cache()
testingData.cache()

In [None]:
# Build model. The fitted model from a Pipeline is a PipelineModel, which consists of fitted models and transformers, corresponding to the pipeline stages.
model = pipeline.fit(trainingData)

### Step 7: Score the test data set

In [None]:
result=model.transform(testingData)
result_display=result.select(result["ID"],result["CHURN"],result["label"],result["predictedLabel"],result["prediction"],result["probability"])
result_display.toPandas().head(6)

### Step 8: Model Evaluation
Compute the accuracy of the model and the area under the ROC curve.

In [None]:
print('Model Accuracy = {:.2f}.'.format(result.filter(result.label == result.prediction).count() / float(result.count())))

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

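# Evaluate model. AUC is a ranking metric, so score it on the classifier's
# rawPrediction column rather than on the hard 0/1 prediction column.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC")
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(result)))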
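
Both metrics can be reproduced by hand on a toy example, which helps when interpreting the numbers. The sketch below is plain Python with made-up labels and probabilities, not Spark's implementation. It computes accuracy from hard predictions and AUC as the chance that a randomly chosen churner is scored above a randomly chosen non-churner.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted class equals the true class."""
    hits = sum(1 for y, p in zip(labels, predictions) if y == p)
    return hits / float(len(labels))


def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.6, 0.4, 0.5, 0.2, 0.1]  # made-up churn probabilities
hard_predictions = [1 if p >= 0.5 else 0 for p in probabilities]

acc = accuracy(labels, hard_predictions)
auc_from_scores = area_under_roc(labels, probabilities)
auc_from_hard = area_under_roc(labels, hard_predictions)
print(acc, auc_from_scores, auc_from_hard)
```

On this toy data the AUC computed from the continuous scores is higher than the AUC computed from the thresholded 0/1 predictions, because thresholding discards the ranking information within each class.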

### Step 9: Tune the model to find the best model

#### Build a Parameter Grid specifying the parameters to be evaluated to determine the best combination

In [None]:
# set different levels for the maxDepth
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
paramGrid = (ParamGridBuilder().addGrid(rf.maxDepth,[4,6,8]).build())
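
A parameter grid is the Cartesian product of the candidate values for each hyperparameter. As a rough illustration in plain Python, the sketch below expands a grid by hand. `maxDepth` and its values come from the cell above, while the `numTrees` axis is a hypothetical addition that only shows how a second parameter multiplies the number of combinations.

```python
from itertools import product

# Candidate values per hyperparameter. maxDepth comes from the cell above;
# numTrees is a hypothetical second axis added only for illustration.
grid_axes = {"maxDepth": [4, 6, 8], "numTrees": [20, 50]}

names = sorted(grid_axes)
param_maps = [dict(zip(names, combo))
              for combo in product(*(grid_axes[n] for n in names))]

print(len(param_maps))
for params in param_maps:
    print(params)
```

Every extra axis multiplies the number of models to train, and k-fold cross-validation multiplies it again by k, so grids are usually kept small.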

#### Create a cross validator to tune the pipeline with the generated parameter grid
Cross-validation attempts to fit the underlying estimator with user-specified combinations of parameters, cross-evaluate the fitted models, and output the best one.

In [None]:
# perform 3 fold cross validation
cv = CrossValidator().setEstimator(pipeline).setEvaluator(evaluator).setEstimatorParamMaps(paramGrid).setNumFolds(3)
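
The idea behind k-fold cross-validation can be sketched without Spark: split the rows into k folds, train on k-1 of them, validate on the remaining one, and average the scores per candidate. The code below is an illustration under stated assumptions, not Spark's `CrossValidator`. The fold assignment, the `cross_validate` helper, and the `toy_score` stand-in scorer are all hypothetical.

```python
import random


def k_fold_indices(n_rows, num_folds, seed=9):
    """Yield (train_idx, validation_idx) pairs, one per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(n_rows, candidates, score_fn, num_folds=3):
    """Return the candidate with the best mean validation score."""
    mean_scores = {}
    for candidate in candidates:
        scores = [score_fn(candidate, train, validation)
                  for train, validation in k_fold_indices(n_rows, num_folds)]
        mean_scores[candidate] = sum(scores) / len(scores)
    return max(mean_scores, key=mean_scores.get), mean_scores


# Stand-in scorer (hypothetical): pretends that depth 6 generalizes best.
def toy_score(max_depth, train_idx, validation_idx):
    return 1.0 - abs(max_depth - 6) / 10.0


best, scores = cross_validate(30, [4, 6, 8], toy_score)
print(best)
```

In the real pipeline the scorer is the evaluator applied to a model fitted on the training folds, and the winning parameter setting is refitted on the full training set.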

In [None]:
# train the model
cvModel = cv.fit(trainingData)

# pick the best model
best_rfModel = cvModel.bestModel

In [None]:
# score the test data set with the best model
cvresult=best_rfModel.transform(testingData)
cvresults_show=cvresult.select(cvresult["ID"],cvresult["CHURN"],cvresult["label"],cvresult["predictedLabel"],cvresult["prediction"],cvresult["probability"])
cvresults_show.toPandas().head()

In [None]:

print('Model Accuracy of the best fitted model = {:.2f}.'.format(cvresult.filter(cvresult.label == cvresult.prediction).count() / float(cvresult.count())))
print('Model Accuracy of the default model = {:.2f}.'.format(result.filter(result.label == result.prediction).count() / float(result.count())))
print('   ')
print('Area under the ROC curve of best fitted model = {:.2f}.'.format(evaluator.evaluate(cvresult)))
print('Area under the ROC curve of the default model = {:.2f}.'.format(evaluator.evaluate(result)))

### Step 10: Save Model in WML repository

In this section you will store your model in the Watson Machine Learning (WML) repository by using Python client libraries.
* <a href="https://console.ng.bluemix.net/docs/services/PredictiveModeling/index.html">WML Documentation</a>
* <a href="http://watson-ml-api.mybluemix.net/">WML REST API</a> 
* <a href="https://watson-ml-staging-libs.mybluemix.net/repository-python/">WML Repository API</a>
<br/>

First, you must import client libraries.

In [None]:
from repository.mlrepositoryclient import MLRepositoryClient
from repository.mlrepositoryartifact import MLRepositoryArtifact

### <span style="color:blue">Action Required</span>

If you do not already have an instance of the Machine Learning service in IBM Cloud, go to <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">IBM Cloud</a>, click **Catalog** on the top right of the menu, search for "Machine Learning", and create an instance.

If you have an existing instance of the Machine Learning service in <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">IBM Cloud</a>, click into the service.

* Click **Service credentials** on the left navigation bar
* Click **New credentials** and then the **Add** button to create new credentials
* Under **ACTIONS** click **View credentials**
* Click the **copy** icon to copy the credentials
* Paste the credentials into the code cell below

![WML Credentials](https://raw.githubusercontent.com/SidneyPhoon/IntroToWMLLab/master/images/WML_Credentials_Jan2018.jpg)


#### <span style="color:blue">Action Required</span>
Paste credentials in the code cell below

In [None]:
# @hidden_cell
wml_credentials={
  "url": "https://ibm-watson-ml.mybluemix.net",
  "access_key": "<Insert your WML access key here>",
  "username": "<Insert your WML username here>",
  "password": "<Insert your WML password here>",
  "instance_id": "<Insert your WML instance ID here>"
}

Authorize the repository client:

In [None]:
ml_repository_client = MLRepositoryClient(wml_credentials.get('url'))
ml_repository_client.authorize(wml_credentials.get('username'), wml_credentials.get('password'))

Create the model artifact.

<b>Tip:</b> The MLRepositoryArtifact method expects a trained model object, training data, and a model name. (It is this model name that is displayed by the Watson Machine Learning service).

In [None]:
pipeline_artifact = MLRepositoryArtifact(pipeline, name="pipelineATF")

In [None]:
model_artifact = MLRepositoryArtifact(model, training_data=trainingData, name="Predict Customer Churn", pipeline_artifact=pipeline_artifact)

Save model artifact to your Watson Machine Learning instance:

In [None]:
saved_model = ml_repository_client.models.save(model_artifact)

In [None]:
# Print the saved model properties
print("modelType: " + saved_model.meta.prop("modelType"))
print("creationTime: " + str(saved_model.meta.prop("creationTime")))
print("modelVersionHref: " + saved_model.meta.prop("modelVersionHref"))
print("label: " + saved_model.meta.prop("label"))

### Step 11: Generate the Authorization Token for Invoking the model

In [None]:
import urllib3, requests, json

headers = urllib3.util.make_headers(basic_auth='{}:{}'.format(wml_credentials.get('username'), wml_credentials.get('password')))
url = '{}/v2/identity/token'.format(wml_credentials.get('url'))
response = requests.get(url, headers=headers)
mltoken = json.loads(response.text).get('token')
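
For readers curious about what the header in the cell above contains: HTTP Basic authentication is the string `username:password`, Base64-encoded and prefixed with `Basic`. The sketch below builds roughly the same header with the standard library only. `basic_auth_header` is a hypothetical helper and the credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the HTTP Basic Authorization header for the given credentials."""
    raw = "{}:{}".format(username, password).encode("utf-8")
    return {"authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


# Placeholder credentials, not real ones.
header = basic_auth_header("user", "pass")
print(header)
```

Base64 is an encoding, not encryption, so the header should only ever be sent over HTTPS. Calling `response.raise_for_status()` before reading the token is one way to surface a rejected login early.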

### Step 12: Go to WML in IBM Cloud to create a Deployment Endpoint

### <span style="color:blue">Action Required</span>

* In your <a href="https://console.ng.bluemix.net/dashboard/apps/" target="_blank">IBM Cloud</a> dashboard, click into your WML service and click the **Launch Dashboard** button under Watson Machine Learning.
![WML Launch Dashboard](https://raw.githubusercontent.com/yfphoon/dsx_demo/master/WML_Launch_Dashboard.png)

<br/>
* You should see your saved model in the **Models** tab

* Under *Actions*, click the ellipsis (three dots) and click ***Create Deployment***. Give your deployment configuration a unique name, e.g. "Predict Customer Churn Deploy", select Type=Online, and click **Save**.
<br/>
<br/>
* In the *Deployments tab*, under *Actions*, click **View Details**
<br/>
<br/>
* Scroll down to **API Details** and copy the value of the **Scoring Endpoint** into your notepad. (e.g. https://ibm-watson-ml.mybluemix.net/v2/published_models/64fd0462-3f8a-4b42-820b-59a4da9b7dc6/deployments/7d9995ed-7daf-4cfd-b40f-37cb8ab3d88f/online)

### Step 13: Invoke the model through REST API call

#### Create a JSON Sample record for the model 

In [None]:
json_payload = {
    "fields": [
    "ID",
    "Gender",
    "Status",
    "Children",
    "EstIncome",
    "CarOwner",
    "Age",
    "LongDistance",
    "International",
    "Local",
    "Dropped",
    "Paymethod",
    "LocalBilltype",
    "LongDistanceBilltype",
    "Usage",
    "RatePlan"
    ],
    "values": [ [999,"F","M",2.0,77551.100000,"Y",33.600000,20.530000,0.000000,41.890000,1.000000,"CC","Budget","Standard",62.420000,2.000000] ]
} 
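
The payload pairs a list of field names with rows of values in the same order, so a value in the wrong position is silently read as the wrong field. One way to reduce that risk, sketched below, is to build the rows from dictionaries keyed by field name. `build_payload` is a hypothetical helper, and the record repeats the sample values from the cell above.

```python
FIELDS = ["ID", "Gender", "Status", "Children", "EstIncome", "CarOwner", "Age",
          "LongDistance", "International", "Local", "Dropped", "Paymethod",
          "LocalBilltype", "LongDistanceBilltype", "Usage", "RatePlan"]


def build_payload(records, fields=FIELDS):
    """Turn a list of dict records into the {"fields", "values"} layout."""
    values = []
    for record in records:
        missing = [f for f in fields if f not in record]
        if missing:
            raise ValueError("Record is missing fields: " + ", ".join(missing))
        # Emit the values in exactly the order given by `fields`.
        values.append([record[f] for f in fields])
    return {"fields": list(fields), "values": values}


customer = {"ID": 999, "Gender": "F", "Status": "M", "Children": 2.0,
            "EstIncome": 77551.1, "CarOwner": "Y", "Age": 33.6,
            "LongDistance": 20.53, "International": 0.0, "Local": 41.89,
            "Dropped": 1.0, "Paymethod": "CC", "LocalBilltype": "Budget",
            "LongDistanceBilltype": "Standard", "Usage": 62.42, "RatePlan": 2.0}

payload = build_payload([customer])
print(payload["values"][0])
```

Several customers can be scored in one request by passing more than one record, since `values` is a list of rows.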

#### Make a REST API call to test the deployed model

#### <span style="color:blue">Action Required</span>
Paste your **scoring_endpoint** in the code cell below

In [None]:
# Get the scoring endpoint from the WML service
# Replace the value for scoring_endpoint with your own scoring endpoint
scoring_endpoint = '<Insert Scoring Endpoint Here>'
header_online = {'Content-Type': 'application/json', 'Authorization': "Bearer " + mltoken}

# API call here
response_scoring = requests.post(scoring_endpoint, json=json_payload, headers=header_online)

print(response_scoring.text)

###### Grab Predicted Value 

In [None]:
wml = json.loads(response_scoring.text)

# First zip the fields and values together
zipped_wml = zip(wml['fields'], wml['values'].pop())

# Next iterate through items and grab the prediction value
print("Predicted Churn: " + [v for (k,v) in zipped_wml if k == 'predictedLabel'].pop())
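
When more than one row is scored, looking up the column position once and reading it from every row is a little more direct than zipping and popping. The sketch below assumes the response uses the same `fields`/`values` layout as the request. The sample response body is made up and much shorter than a real one, and `predictions_by_field` is a hypothetical helper.

```python
import json

# Hypothetical response body in the same fields/values layout as the request;
# a real response carries more columns.
sample_response = json.dumps({
    "fields": ["ID", "prediction", "predictedLabel"],
    "values": [[999, 1.0, "T"]],
})


def predictions_by_field(response_text, field="predictedLabel"):
    """Return the requested field for every scored row in the response."""
    body = json.loads(response_text)
    column = body["fields"].index(field)
    return [row[column] for row in body["values"]]


print(predictions_by_field(sample_response))
```

Unlike `pop()`, this leaves the parsed response untouched, so other fields such as `probability` can still be read afterwards.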

You have come to the end of this notebook.

**Sidney Phoon**<br/>
Jan 3, 2018