# Customer Churn Prediction with IBM Db2 Warehouse using PySpark

# Part 3: Deployment

__Introduction__

This notebook presents a churn prediction use case using anonymized customer data from a phone operator. It uses IBM Db2 Warehouse and runs on a PySpark kernel. It is the third part of a series on this use case and focuses on deployment: in the previous notebook, we saved our models. Let's reuse them on some fresh data!

__Use case__

Our goal is to accurately predict whether a customer is going to end their contract (labeled as positive, 1). We would rather send a commercial email to a customer who intends to keep their contract but is labeled as willing to end it (false positive) than miss the opportunity to prevent a customer from ending their contract (false negative). We also want to target engagement campaigns accurately: we should neither overwhelm customers with commercials nor lose money by proposing special offers to too many people (precision and accuracy). Our optimization objective therefore consisted in maximizing recall, that is, minimizing the false negative rate. We also looked at a couple of other indicators such as accuracy and area under the curve.
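
To make these indicators concrete, here is a minimal sketch in plain Python (no Spark required). The counts are made up for illustration and do not come from the churn dataset, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute basic indicators from confusion-matrix counts.

    Churners are the positive class (label 1).
    """
    total = tp + fp + tn + fn
    return {
        "recall": tp / (tp + fn),
        "false_negative_rate": fn / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }


# Hypothetical counts: 50 actual churners and 150 loyal customers
metrics = classification_metrics(tp=40, fp=20, tn=130, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Recall and the false negative rate always sum to 1, which is why maximizing the former minimizes the latter.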

__Previously__

In the first notebook, we used PySpark for data exploration and visualization. We created, scaled and selected features. In the second notebook, we built and tested several models. We selected the model with the highest recall on the test set. 

__Contents__
1. Get ready
2. Load fresh data
3. Make predictions

## 1. Get ready

__Imports__

In [1]:
# Basics
from pyspark.ml.feature import VectorAssembler

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
183,,pyspark,idle,,,✔


SparkSession available as 'spark'.


In [2]:
# Classification models
from pyspark.ml.classification import RandomForestClassificationModel
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.classification import GBTClassificationModel
from pyspark.ml.clustering import KMeansModel

__Open the models__

In [3]:
# Note: adjust these paths if you saved the models to a different location in the previous notebook
lrModel = LogisticRegressionModel.load("/tmp/myLogRegModel")
rfModel = RandomForestClassificationModel.load("/tmp/myRFModel")
gbtModel = GBTClassificationModel.load("/tmp/myGBTModel")
clusterModel = KMeansModel.load("/tmp/myClusterModel")

In [4]:
# open the CHURN_PROPORTION table you had previously saved
churn_proportion = spark.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="CHURN_PROPORTION") \
        .load()

churn_proportion.show()

+---------+--------------------+
|ClusterID|          avg(CHURN)|
+---------+--------------------+
|        0| 0.14545454545454545|
|       13|  0.5445544554455446|
|        4| 0.10365853658536585|
|        8| 0.11278195488721804|
|        6| 0.09815950920245399|
|       11| 0.10052910052910052|
|        1|0.049079754601226995|
|        3| 0.06882591093117409|
|       10| 0.07482993197278912|
|        7| 0.07741935483870968|
|        2| 0.13934426229508196|
|       12| 0.14150943396226415|
|        9|  0.5666666666666667|
|       14| 0.07913669064748201|
|        5|  0.0891089108910891|
+---------+--------------------+

## 2. Load fresh data 

__Open the data__

In [5]:
# A table has been prepopulated in Db2 with a sample of unlabeled customer data: SAMPLES.EVAL
# The `spark` session is already provided by the kernel, so we use it directly
df = spark.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="SAMPLES.EVAL") \
        .load()
df.show(1)

+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
|AREA|VMAIL|VMAIL_MSGS|DAY_MINS|DAY_CALLS|DAY_CHARGE|EVE_MINS|EVE_CALLS|EVE_CHARGE|NIGHT_MINS|NIGHT_CALLS|NIGHT_CHARGE|INTL_MINS|INTL_CALLS|INTL_CHARGE|SVC_CALLS|
+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
| 415|    0|        25|   265.1|      110|     45.07|  197.40|       99|     16.78|     244.7|         91|       11.01|     10.0|         3|       2.70|        1|
+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
only showing top 1 row

In [6]:
print("Number of rows "+str(df.count()))

Number of rows 4

We have 4 new examples. They are not labeled. It's up to us to make predictions!

But first, let's transform our data into a suitable format.

__Prepare the data__

* Add ClusterChurn feature


In [7]:
# Assembler for kmeans
assembler_12 = VectorAssembler(
    inputCols=["SVC_CALLS", "DAY_MINS", "DAY_CHARGE", "VMAIL_MSGS", "VMAIL", 
               "INTL_CALLS", "INTL_CHARGE", "INTL_MINS", "EVE_CHARGE", "EVE_MINS",
               "NIGHT_MINS", "NIGHT_CHARGE"],
    outputCol="features")

# Add the ClusterChurn column by joining with churn_proportion on ClusterID
def preparation(DF):

    # Assemble the 12 clustering features into a single vector column
    DF_12 = assembler_12.transform(DF)

    # Assign each new customer to its cluster
    DF_prediction = clusterModel.transform(DF_12)

    # Join DF with table churn_proportion on ClusterID
    # (a join does not guarantee that the original row order is kept)
    DF_joined = DF_prediction.join(
        churn_proportion,
        DF_prediction.prediction == churn_proportion.ClusterID,
        "inner")

    # Rename columns so they do not clash with the "features" and
    # "prediction" columns created later for the supervised models
    DF_prepared = DF_joined \
        .withColumnRenamed("features", "featuresClustering") \
        .withColumnRenamed("prediction", "predictionClustering") \
        .withColumnRenamed("avg(CHURN)", "ClusterChurn")

    return DF_prepared

prep_df = preparation(df)
# prep_df.printSchema()
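
Conceptually, the join in `preparation` is a lookup from a cluster ID to that cluster's average churn rate. The plain-Python sketch below illustrates the idea with a few rows copied from the CHURN_PROPORTION table above; it does not use Spark, and `cluster_churn` is a hypothetical helper, not part of the notebook.

```python
# A few (ClusterID -> avg(CHURN)) pairs copied from the CHURN_PROPORTION table
churn_by_cluster = {
    0: 0.14545454545454545,
    1: 0.049079754601226995,
    9: 0.5666666666666667,
    13: 0.5445544554455446,
}


def cluster_churn(cluster_id):
    """Return the average churn rate of a cluster, or None if it is unknown.

    An inner join would drop a row whose cluster ID has no match;
    returning None plays that role here.
    """
    return churn_by_cluster.get(cluster_id)


print(cluster_churn(9))   # a high-churn cluster
print(cluster_churn(1))   # a low-churn cluster
print(cluster_churn(42))  # a cluster ID that is not in the table
```

In the table above, clusters 9 and 13 stand out with churn rates above 0.5, so `ClusterChurn` gives the supervised models a strong signal for customers assigned to them.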

* Assemble features for supervised learning

In [8]:
assembler_C = VectorAssembler(
    inputCols=["SVC_CALLS", "DAY_MINS", "DAY_CHARGE", "VMAIL_MSGS", "VMAIL", 
               "INTL_CALLS", "INTL_CHARGE", "INTL_MINS", "EVE_CHARGE", "EVE_MINS",
               "NIGHT_MINS", "NIGHT_CHARGE", "ClusterChurn"],
    outputCol="features")

assembled_df = assembler_C.transform(prep_df)

# Keep only the features column (the data is unlabeled, despite the variable name)
labeled_df = assembled_df.select(assembled_df["features"])
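
`VectorAssembler` concatenates the listed columns into one vector, in the order given by `inputCols`. The plain-Python sketch below (not Spark code) mimics this for the row displayed by `df.show(1)`. Only the 12 raw columns are used, because the `ClusterChurn` value of that row is not shown in this notebook; `assemble` is a hypothetical stand-in for the assembler.

```python
# Row displayed by df.show(1) earlier in this notebook
row = {
    "AREA": 415, "VMAIL": 0, "VMAIL_MSGS": 25,
    "DAY_MINS": 265.1, "DAY_CALLS": 110, "DAY_CHARGE": 45.07,
    "EVE_MINS": 197.40, "EVE_CALLS": 99, "EVE_CHARGE": 16.78,
    "NIGHT_MINS": 244.7, "NIGHT_CALLS": 91, "NIGHT_CHARGE": 11.01,
    "INTL_MINS": 10.0, "INTL_CALLS": 3, "INTL_CHARGE": 2.70,
    "SVC_CALLS": 1,
}

# Same column order as the first 12 entries of assembler_C
input_cols = ["SVC_CALLS", "DAY_MINS", "DAY_CHARGE", "VMAIL_MSGS", "VMAIL",
              "INTL_CALLS", "INTL_CHARGE", "INTL_MINS", "EVE_CHARGE", "EVE_MINS",
              "NIGHT_MINS", "NIGHT_CHARGE"]


def assemble(row, input_cols):
    """Build a feature list by reading the columns in the order of input_cols."""
    return [float(row[col]) for col in input_cols]


features = assemble(row, input_cols)
print(features[:3])
```

The first three entries correspond to `SVC_CALLS`, `DAY_MINS` and `DAY_CHARGE`, which is how a vector starting with `[1.0,265.1,45.07,...` can be read in the prediction output of section 3.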

## 3. Make predictions

__Prediction__

Apply the three classification models to the fresh data and compare their predictions.


In [9]:
# Logistic regression
lr_pred = lrModel.transform(labeled_df)
# Random Forest
rf_pred = rfModel.transform(labeled_df)
# GBT
gbt_pred = gbtModel.transform(labeled_df)

print("Logistic Regression")
lr_pred.show()
print("Random Forest")
rf_pred.show()
print("Gradient Boosted Trees")
gbt_pred.show()

Logistic Regression
+--------------------+--------------------+--------------------+----------+
|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|[4.0,129.1,21.95,...|[0.80710433645923...|[0.69149211293988...|       0.0|
|[1.0,265.1,45.07,...|[0.36388357908790...|[0.58998021289068...|       0.0|
|[4.0,332.9,56.59,...|[-1.2394847237748...|[0.22452568975126...|       1.0|
|[1.0,161.6,27.47,...|[2.75909674548468...|[0.94042504861497...|       0.0|
+--------------------+--------------------+--------------------+----------+
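
For a binary logistic regression, the `probability` column can be recovered from `rawPrediction` with the logistic (sigmoid) function. The plain-Python sketch below checks this on the leading entries of the four rows printed above. The raw values are copied as displayed (truncated), so the match is only approximate, and the 0.5 decision threshold is an assumption that this notebook does not state.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# First entry of rawPrediction (margin for class 0), as displayed above
raw_class0 = [0.80710433645923, 0.36388357908790,
              -1.2394847237748, 2.75909674548468]

for raw in raw_class0:
    p0 = sigmoid(raw)                  # probability of class 0 (no churn)
    label = 0.0 if p0 >= 0.5 else 1.0  # assumed threshold of 0.5
    print(f"raw={raw:+.4f}  P(class 0)={p0:.4f}  prediction={label}")
```

Only the third row has a class 0 probability below 0.5, which is consistent with it being the only row that logistic regression labels as churn.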

Random Forest
+--------------------+--------------------+--------------------+----------+
|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|[4.0,129.1,21.95,...|[1.95945945945945...|[0.09797297297297...|       1.0|
|[1.0,265.1,45.07,...|          [19.0,1.0]|         [

__Comments__

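By comparing the probabilities and predictions of the three models, we conclude that the first item (the first row of the prediction tables above) is likely to have been misclassified by logistic regression. Our final prediction vector is [1,0,1,0], given in the row order of the prediction tables. Note that this is not the order in which `df.show` displayed the rows: the customer shown first in section 2 appears second here.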
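
One simple way to combine the three models is a majority vote per row. The sketch below shows that idea in plain Python; it is not necessarily how the prediction vector above was obtained. The logistic regression labels are read from the output above, while the random forest and gradient boosted trees labels are placeholders chosen for illustration, since their output is only partially shown here.

```python
def majority_vote(*prediction_lists):
    """Return, for each row, the label predicted by most models."""
    voted = []
    for row_predictions in zip(*prediction_lists):
        ones = sum(row_predictions)
        voted.append(1 if ones > len(row_predictions) / 2 else 0)
    return voted


lr_labels = [0, 0, 1, 0]   # from the logistic regression output above
rf_labels = [1, 0, 1, 0]   # placeholder values (hypothetical)
gbt_labels = [1, 0, 1, 0]  # placeholder values (hypothetical)

print(majority_vote(lr_labels, rf_labels, gbt_labels))
```

With an odd number of models and binary labels, a majority vote never ties. Because recall is the priority in this use case, a stricter alternative would be to flag a customer as soon as any one model predicts churn.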

## What you've learned

Congratulations!

In this notebook, you've seen how to:
* load models into a Jupyter notebook
* load data you had saved in Db2
* deploy models.

____
## Authors

Eva Feillet - ML intern, IBM Cloud and Cognitive Software, IBM Lab in Böblingen, Germany