# Evaluate and predict customer churn
This notebook is an adaptation from the work done by [Sidney Phoon](https://github.com/IBMDataScience/DSX-DemoCenter/tree/master/DSX-Local-Telco-Churn-master) with the following modifications:
* Use datasets from our GitHub sources
* Deploy and run the notebook on DSX running on IBM Cloud Private (ICP)
* Document some actions for a beginner data scientist or developer who wants to understand what is going on
* Move the web application to a separate git project

The goal is still to demonstrate how to build a predictive model with the Spark machine learning API (SparkML) to predict customer churn, and how to deploy it for scoring in Machine Learning (ML) running on ICP.

## Scope
Many industries lose customers to competitors when products are only weakly differentiated. The telecom industry, with its mobile, internet, and IP TV offerings, illustrates this problem.


## Notebook explanations
The notebook aims to follow the classical data science modeling steps:
1. load the data
1. prepare the data
1. analyze the data
1. build a model
1. deploy the model
1. consume the model as a service

This Jupyter notebook uses Apache Spark to run machine learning jobs that build decision trees to assess when a customer is at risk of moving to a competitor. Apache Spark offers a Python module called pyspark to operate on data and use ML constructs.

### Start with all imports
A best practice for notebooks is to put all imports at the top of the notebook.
* [Spark SQLContext](https://spark.apache.org/docs/latest/sql-programming-guide.html) a spark module to process structured data
* spark conf to access the Spark cluster configuration and then be able to execute queries
* [jaydebeapi](https://pypi.python.org/pypi/JayDeBeApi) is used to connect to the DB2 warehouse where customer data are persisted. We assume the data are already loaded.
* [ibmdbpy](https://pypi.python.org/pypi/ibmdbpy) interface for data manipulation and access to in-database algorithms in IBM dashDB and IBM DB2.
* [pandas](https://pandas.pydata.org) Python super library for data analysis
* [brunel](https://github.com/Brunel-Visualization/Brunel/wiki) API and tool to visualize data quickly. 
* [pixiedust](www.ibm.com/PixieDust) Visualize data inside Jupyter notebooks

In [1]:
from pyspark.sql import SQLContext
from pyspark.conf import SparkConf
import pyspark
import pandas as pd
import brunel
import numpy as np
from pixiedust.display import *

Pixiedust database opened successfully


### Load data
We assume the churn data were built by a marketing department that took the customer IDs and flagged each customer as a potential churner or not. The data are delivered to the data scientist as CSV files.

In this notebook the data are loaded into the internal DSX storage using the following steps:
* Use the `+` icon on the right side of the DSX menu bar to access `Add Dataset`, then load the customer.csv and churn.csv files from the folder `refarch-analytics/jupyter-notebooks/TelcoChurn/data_assets`. The churn attribute is just a boolean.
* Add a `code` cell in the notebook, select the `1001` icon, then use `Insert to code > Insert spark DataFrame in python` to get a code snippet that loads the data.
* Rename the auto-generated data frame.

In the code below the `sc` variable is the Spark Context, and it should be initialized by the execution of the notebook and the DSX spark kernel.

In [2]:
# Create the SQL context once and reuse it for both files
sqlContext = SQLContext(sc)

# Add customer asset from file system
customers = sqlContext.read.csv('../datasets/customer.csv', header=True)
customers.show(5)

# Add churn asset from file system
churns = sqlContext.read.csv('../datasets/churn.csv', header=True)
churns.show(5)

+---+------+------+--------+------------+---------+---------+------------+-------------+----------+--------+---------+-------------+--------------------+----------+--------+
| ID|Gender|Status|Children|  Est Income|Car Owner|      Age|LongDistance|International|     Local| Dropped|Paymethod|LocalBilltype|LongDistanceBilltype|     Usage|RatePlan|
+---+------+------+--------+------------+---------+---------+------------+-------------+----------+--------+---------+-------------+--------------------+----------+--------+
|  1|     F|     S|1.000000|38000.000000|        N|24.393333|   23.560000|     0.000000|206.080000|0.000000|       CC|       Budget|      Intnl_discount|229.640000|3.000000|
|  6|     M|     M|2.000000|29616.000000|        N|49.426667|   29.780000|     0.000000| 45.500000|0.000000|       CH|    FreeLocal|            Standard| 75.290000|2.000000|
|  8|     M|     M|0.000000|19732.800000|        N|50.673333|   24.810000|     0.000000| 22.440000|0.000000|       CC|    FreeLoca

When the previous code runs successfully, you will see the top five rows of each dataset.

One of the major questions to address is which attributes of this dataset are relevant for classifying churn risk.

### Prepare data
To get a single dataset that can be split into training and test sets, we need to merge the churn and customer tables. The ID attribute represents the customer ID and is used as the join column.


In [3]:
data=customers.join(churns,customers['ID']==churns['ID']).select(customers['*'],churns['CHURN'])
data.show(5)
data.printSchema()

+---+------+------+--------+------------+---------+---------+------------+-------------+----------+--------+---------+-------------+--------------------+----------+--------+-----+
| ID|Gender|Status|Children|  Est Income|Car Owner|      Age|LongDistance|International|     Local| Dropped|Paymethod|LocalBilltype|LongDistanceBilltype|     Usage|RatePlan|CHURN|
+---+------+------+--------+------------+---------+---------+------------+-------------+----------+--------+---------+-------------+--------------------+----------+--------+-----+
|  1|     F|     S|1.000000|38000.000000|        N|24.393333|   23.560000|     0.000000|206.080000|0.000000|       CC|       Budget|      Intnl_discount|229.640000|3.000000|    T|
|  6|     M|     M|2.000000|29616.000000|        N|49.426667|   29.780000|     0.000000| 45.500000|0.000000|       CH|    FreeLocal|            Standard| 75.290000|2.000000|    F|
|  8|     M|     M|0.000000|19732.800000|        N|50.673333|   24.810000|     0.000000| 22.440000|0

Rename the columns to remove spaces and cast the numeric columns to their proper types. A pandas DataFrame is used for visualization.

In [4]:
from pyspark.sql.types import DoubleType,IntegerType
data = data.withColumnRenamed("Est Income", "EstIncome").withColumnRenamed("Car Owner","CarOwner")
data = data.withColumn("Age",data["Age"].cast(IntegerType())).withColumn("Children",data["Children"].cast(IntegerType())).withColumn("RatePlan",data["RatePlan"].cast(IntegerType()))
data = data.withColumn("EstIncome",data["EstIncome"].cast(DoubleType())).withColumn("LongDistance",data["LongDistance"].cast(DoubleType())).withColumn("International",data["International"].cast(DoubleType()))
data = data.withColumn("Local",data["Local"].cast(DoubleType())).withColumn("Usage",data["Usage"].cast(DoubleType()))
#data = data.withColumn("CHURN",data["CHURN"].cast(DoubleType()))
data.printSchema()


root
 |-- ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Status: string (nullable = true)
 |-- Children: integer (nullable = true)
 |-- EstIncome: double (nullable = true)
 |-- CarOwner: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- LongDistance: double (nullable = true)
 |-- International: double (nullable = true)
 |-- Local: double (nullable = true)
 |-- Dropped: string (nullable = true)
 |-- Paymethod: string (nullable = true)
 |-- LocalBilltype: string (nullable = true)
 |-- LongDistanceBilltype: string (nullable = true)
 |-- Usage: double (nullable = true)
 |-- RatePlan: integer (nullable = true)
 |-- CHURN: string (nullable = true)



In [5]:
pdf = data.toPandas()
pdf.head()
pdf.describe()

| | Children | EstIncome | Age | LongDistance | International | Local | Usage | RatePlan |
|---|---|---|---|---|---|---|---|---|
| count | 2066.0 | 2066.0 | 2066.0 | 2066.0 | 2066.0 | 2066.0 | 2066.0 | 2066.0 |
| mean | 1.146176 | 51514.070465 | 42.802033 | 16.122076 | 1.191104 | 59.158025 | 75.907696 | 2.510649 |
| std | 0.843105 | 30805.652721 | 14.919932 | 9.874795 | 2.60201 | 57.571428 | 59.787475 | 1.124731 |
| min | 0.0 | 96.33 | 12.0 | 0.0 | 0.0 | 0.68 | 0.68 | 1.0 |
| 25% | 0.0 | 21021.6 | 30.25 | 8.09 | 0.0 | 15.1775 | 34.19 | 2.0 |
| 50% | 1.0 | 55860.0 | 46.0 | 16.14 | 0.0 | 39.845 | 57.11 | 2.0 |
| 75% | 2.0 | 78000.0 | 54.0 | 22.99 | 0.0 | 87.46 | 107.02 | 4.0 |
| max | 2.0 | 120000.0 | 77.0 | 59.0 | 9.7 | 332.46 | 361.88 | 4.0 |


## Analyze Data
Data preparation and data understanding are the most time-consuming tasks in the data mining process. The data scientist needs to review and evaluate the quality of the data before modeling.
Visualization is one of the ways to review data.
The Brunel Visualization Language is a highly succinct and novel language that defines interactive data visualizations based on tabular data. The language is well suited for both data scientists and business users. More information about Brunel Visualization: https://github.com/Brunel-Visualization/Brunel/wiki
Try Brunel visualization here: http://brunel.mybluemix.net/gallery_app/renderer

Transform numerical attributes.

In [6]:
pdf['Usage']=pdf['Usage'].apply(pd.to_numeric)
print('The mean of the usage is %d ' % np.mean(pdf['Usage']))
      

The mean of the usage is 75 


In [7]:
import brunel

%brunel data('pdf') bar x(CHURN) y(EstIncome) mean(EstIncome) color(LocalBilltype) stack tooltip(EstIncome) | x(LongDistance) y(Usage) point color(Paymethod) tooltip(LongDistance, Usage) :: width=1100, height=400

<IPython.core.display.Javascript object>

From the previous diagrams, customers with a higher estimated income seem less likely to churn.

### PixieDust
PixieDust is a Python helper library for Spark IPython notebooks. One of its main features is visualization. Unlike other APIs that produce only static output, PixieDust creates an interactive UI in which you can explore data. PixieDust works with numerical values, and some columns of the pandas DataFrame are still strings, so we need to do some data transformation.
PixieDust is given the pandas DataFrame here.

In [8]:
b={'T':1,'F':0}
pdf['ChurnValue']=pdf['CHURN'].map(b)
pdf['Children']=pdf['Children'].apply(pd.to_numeric)
pdf['RatePlan']=pdf['RatePlan'].apply(pd.to_numeric)
display(pdf)

<IPython.core.display.Javascript object>

## Prepare the decision trees and Random Forest with Spark
"Pipeline" is an API in SparkML that is used for building models. See the Spark machine learning library [documentation](https://spark.apache.org/docs/2.0.2/ml-guide.html) and [pipeline guide](https://spark.apache.org/docs/2.0.2/ml-pipeline.html). Since Spark 2.0, the DataFrame-based API is the primary machine learning API.
The code below encodes all string attributes as indexed numerical values, which is required for ML processing.


**StringIndexer** encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0.

**OneHotEncoder** maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.

**VectorAssembler** is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
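
To make the three transformations above concrete, here is a minimal pure-Python sketch of the ideas. It is not Spark's implementation: the helper names (`fit_string_index`, `one_hot`, `assemble`) and the sample payment methods are made up for illustration, the alphabetical tie-breaking is an assumption of the sketch, and `drop_last=True` mirrors the `OneHotEncoder` default of dropping the last category, which is why a vector can contain no one-value at all.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Descending frequency; ties broken alphabetically (assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Encode a label index as a binary vector with at most one 1.0."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if int(index) < size:
        vec[int(index)] = 1.0
    return vec


def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature vector."""
    features = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            features.extend(part)
        else:
            features.append(float(part))
    return features


# Illustrative values only, not taken from the dataset.
paymethods = ["CC", "CH", "CC", "Auto", "CC", "CH"]
index = fit_string_index(paymethods)
encoded = [one_hot(index[p], len(index)) for p in paymethods]

# One feature row: encoded payment method, number of children, estimated income.
row = assemble(encoded[0], 1, 38000.0)
print(index, encoded[3], row)
```

The least frequent label is represented by the all-zero vector, so one vector position fewer than the number of labels is enough to tell all categories apart.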

In [9]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorIndexer, IndexToString
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# Prepare string variables so that they can be used by the decision tree algorithm
# StringIndexer encodes a string column of labels to a column of label indices
SI1 = StringIndexer(inputCol='Gender', outputCol='GenderEncoded')
SI2 = StringIndexer(inputCol='Status',outputCol='StatusEncoded')
SI3 = StringIndexer(inputCol='CarOwner',outputCol='CarOwnerEncoded')
SI4 = StringIndexer(inputCol='Paymethod',outputCol='PaymethodEncoded')
SI5 = StringIndexer(inputCol='LocalBilltype',outputCol='LocalBilltypeEncoded')
SI6 = StringIndexer(inputCol='LongDistanceBilltype',outputCol='LongDistanceBilltypeEncoded')
labelIndexer = StringIndexer(inputCol='CHURN', outputCol='label').fit(data)

#Apply OneHotEncoder so categorical features aren't given numeric importance
OH1 = OneHotEncoder(inputCol="GenderEncoded", outputCol="GenderEncoded"+"classVec")
OH2 = OneHotEncoder(inputCol="StatusEncoded", outputCol="StatusEncoded"+"classVec")
OH3 = OneHotEncoder(inputCol="CarOwnerEncoded", outputCol="CarOwnerEncoded"+"classVec")
OH4 = OneHotEncoder(inputCol="PaymethodEncoded", outputCol="PaymethodEncoded"+"classVec")
OH5 = OneHotEncoder(inputCol="LocalBilltypeEncoded", outputCol="LocalBilltypeEncoded"+"classVec")
OH6 = OneHotEncoder(inputCol="LongDistanceBilltypeEncoded", outputCol="LongDistanceBilltypeEncoded"+"classVec")

# Pipelines API requires that input variables are passed in  a vector
assembler = VectorAssembler(inputCols=["GenderEncodedclassVec", "StatusEncodedclassVec", "CarOwnerEncodedclassVec", \
                                       "PaymethodEncodedclassVec", "LocalBilltypeEncodedclassVec", \
                                       "LongDistanceBilltypeEncodedclassVec", "Children", "EstIncome", "Age", \
                                       "LongDistance", "International", "Local",\
                                       "Dropped","Usage"], outputCol="features")

A pipeline is like a workflow that combines a set of operations or algorithms to apply to the data. It chains transformers and estimators. Transformers convert one DataFrame into another, generally by appending one or more columns. Estimators are learning algorithms that train (or fit) on data.
In machine learning, it is common to run a sequence of algorithms to process and learn from data.
A Pipeline is a sequence of stages, and each stage is either a Transformer or an Estimator. These stages are run in order, and the input DataFrame is transformed as it passes through each stage. A Pipeline is itself an estimator: fitting it produces a transformer (the fitted pipeline model).
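
The estimator/transformer contract described above can be sketched in a few lines of plain Python. This is only a toy model of the idea, not the SparkML classes: `Doubler`, `MeanCenterer`, `ToyPipeline`, and the column names are hypothetical, and rows are plain dictionaries instead of DataFrames.

```python
class Doubler:
    """Toy transformer: appends a column holding twice the input column."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, rows):
        return [dict(r, **{self.output_col: 2 * r[self.input_col]}) for r in rows]


class MeanCentererModel:
    """Toy transformer produced by fitting MeanCenterer."""

    def __init__(self, input_col, output_col, mean):
        self.input_col, self.output_col, self.mean = input_col, output_col, mean

    def transform(self, rows):
        return [dict(r, **{self.output_col: r[self.input_col] - self.mean}) for r in rows]


class MeanCenterer:
    """Toy estimator: learns the column mean from the training rows."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / float(len(rows))
        return MeanCentererModel(self.input_col, self.output_col, mean)


class ToyPipelineModel:
    """Fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """Runs stages in order; estimators are fitted, then used to transform."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: train it on the current rows
                stage = stage.fit(rows)
            rows = stage.transform(rows)   # feed the transformed rows to the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


train_rows = [{"Usage": 10.0}, {"Usage": 30.0}]
toy_model = ToyPipeline([Doubler("Usage", "Usage2"),
                         MeanCenterer("Usage2", "UsageCentered")]).fit(train_rows)
scored = toy_model.transform([{"Usage": 25.0}])
print(scored)
```

The mean is learned once from the training rows and then reused unchanged when new rows are transformed, which is the same reason the notebook fits the pipeline on `train` and only calls `transform` on `test`.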

In [10]:
# Instantiate the random forest classifier algorithm with the default settings
rf=RandomForestClassifier(labelCol="label", featuresCol="features")

# Convert indexed labels back to original labels.
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)

pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,SI6,labelIndexer, OH1, OH2, OH3, OH4, OH5, OH6,assembler, rf, labelConverter])
# pipeline = Pipeline(stages=[SI1,SI2,SI3,SI4,SI5,SI6,labelIndexer, OH1, OH2, OH3, OH4, OH5, OH6,assembler,rf])

In [13]:
# Keep the ID column: it is not a VectorAssembler input, so it is not used as a feature,
# and the scoring cell below selects it to identify customers.
# Split data into train and test datasets
train, test = data.randomSplit([0.8,0.2], seed=6)
train.cache()
test.cache()

DataFrame[ID: string, Gender: string, Status: string, Children: int, EstIncome: double, CarOwner: string, Age: int, LongDistance: double, International: double, Local: double, Dropped: string, Paymethod: string, LocalBilltype: string, LongDistanceBilltype: string, Usage: double, RatePlan: int, CHURN: string]
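
A weighted random split assigns each row to a split independently, so an 80/20 split yields approximately, not exactly, 80% of the rows for training, and fixing the seed makes the assignment repeatable. The sketch below illustrates that idea in plain Python. It is not Spark's sampler: `random_split` is a hypothetical helper, and it will not reproduce the rows that `randomSplit` selects for the same seed.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split, with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split on the [0, 1) interval.
    bounds, acc = [], 0.0
    for weight in weights:
        acc += weight / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding leftovers.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


# 2066 is the row count reported by pdf.describe() earlier in the notebook.
train_ids, test_ids = random_split(range(2066), [0.8, 0.2], seed=6)
print(len(train_ids), len(test_ids))
```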

In [14]:
# Build models
model = pipeline.fit(train)

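IllegalArgumentException: u'Data type StringType is not supported.'

This error comes from the `VectorAssembler` stage, which accepts only numeric, boolean, and vector columns: the schema printed earlier shows that `Dropped` is still a string. Casting it, for example with `data.withColumn("Dropped", data["Dropped"].cast(DoubleType()))` before splitting the data, is the likely fix. The cells below assume the pipeline has been fitted successfully.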

## Score the model with the test set

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Score the test set with the fitted pipeline
results = model.transform(test)
results = results.select(results["ID"], results["CHURN"], results["label"], results["predictedLabel"], results["prediction"], results["probability"])
results.toPandas().head(6)

# Evaluate the model: the share of rows whose prediction matches the label is the accuracy
print('Accuracy model1 = {:.2f}.'.format(results.filter(results.label == results.prediction).count() / float(results.count())))

# Area under the ROC curve, computed here from the hard 0/1 predictions
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", labelCol="label", metricName="areaUnderROC")
print('Area under ROC curve = {:.2f}.'.format(evaluator.evaluate(results)))
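
The ratio of rows where the label equals the prediction, as computed in the scoring cell, is the accuracy of the model, which is a different quantity from precision. The following plain-Python sketch contrasts accuracy, precision, and the area under the ROC curve on a tiny hand-made example. The helper functions and the sample values are illustrative assumptions, with 1 standing for the positive class, and the pairwise AUC below is a simple reference definition rather than what `BinaryClassificationEvaluator` runs internally.

```python
def accuracy(labels, predictions):
    """Share of rows where the prediction equals the label."""
    hits = sum(1 for l, p in zip(labels, predictions) if l == p)
    return hits / float(len(labels))


def precision(labels, predictions, positive=1):
    """Share of predicted positives that are truly positive."""
    predicted_pos = [l for l, p in zip(labels, predictions) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(1 for l in predicted_pos if l == positive) / float(len(predicted_pos))


def auc(labels, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels, hard predictions, and positive-class probabilities.
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
probability = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.35]

acc = accuracy(labels, predictions)
prec = precision(labels, predictions)
auc_hard = auc(labels, predictions)   # AUC from 0/1 predictions
auc_soft = auc(labels, probability)   # AUC from probabilities
print(acc, prec, auc_hard, auc_soft)
```

In this example the AUC computed from hard 0/1 predictions is lower than the AUC computed from probabilities, because thresholding discards the ranking information inside each predicted class.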

We have finished building and testing a predictive model. The next step is to deploy it for real-time scoring.
## Save Model in ML repository