# Customer Churn Prediction with IBM Db2 Warehouse using PySpark

# Part 3: Deployment

__Introduction__

This notebook presents a churn prediction use case using anonymized customer data from a phone operator. It uses IBM Db2 Warehouse and runs on a PySpark kernel. It is the third part of a series on this use case and focuses on deployment: in the previous notebook, we saved our models. Let's reuse them on some fresh data!

__Use case__

Our goal is to accurately predict whether a customer is going to end their contract (labeled as positive, 1). We would rather send a marketing email to a customer who intends to stay but is predicted to leave (false positive) than miss the opportunity to retain a customer who is about to leave (false negative). We also want engagement campaigns to be well targeted: we should not overwhelm customers with advertising, nor lose money by making special offers to too many people (precision and accuracy). Our optimization objective was therefore to maximize recall, i.e. to minimize the false negative rate. We also looked at a few other indicators, such as accuracy and area under the curve.
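
To make these indicators concrete, here is a minimal sketch that computes recall, precision and accuracy from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not results from this series.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (recall, precision, accuracy) from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of actual churners that were caught
    precision = tp / (tp + fp)     # share of flagged customers who really churn
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return recall, precision, accuracy

# Hypothetical counts, for illustration only
recall, precision, accuracy = classification_metrics(tp=40, fp=20, fn=10, tn=130)
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
# prints: recall=0.80 precision=0.67 accuracy=0.85
```

With these counts the false negative rate is 1 - recall = 0.20: one churner in five would be missed.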


__Previously__

In the first notebook, we used PySpark for data exploration and visualization. We created, scaled and selected features. In the second notebook, we built and tested several models. We selected the model with the highest recall on the test set. 

__Contents__
1. Get ready
2. Load fresh data
3. Make predictions

## 1. Get ready

__Imports__

In [1]:
# Basics
from pyspark.ml.feature import VectorAssembler

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
205,,pyspark,idle,,,✔


SparkSession available as 'spark'.


In [2]:
# Classification models
from pyspark.ml.classification import RandomForestClassificationModel
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.classification import GBTClassificationModel
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import MinMaxScalerModel

__Open the models__

In [3]:
# Note: modify the paths below if the models from the previous notebook were saved elsewhere
rfModel = RandomForestClassificationModel.load("/tmp/myRFModel")
clusterModel = KMeansModel.load("/tmp/myClusterModel")
scalerModel = MinMaxScalerModel.load("/tmp/myScalerModel")

# if you want to use other models for comparison
#gbtModel = GBTClassificationModel.load("/tmp/myGBTModel")
#lrModel = LogisticRegressionModel.load("/tmp/myLogRegModel")

In [4]:
# Open the AVG_CLUSTER_CHURN table you had previously saved (average churn rate per cluster)
churn_proportion = spark.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="AVG_CLUSTER_CHURN") \
        .load()

churn_proportion.show()

+---------+--------------------+
|ClusterID|           AVG_CHURN|
+---------+--------------------+
|        8|   0.563953488372093|
|        6| 0.09444444444444444|
|       11| 0.11805555555555555|
|        5| 0.06030150753768844|
|       10|0.061068702290076333|
|        2| 0.08121827411167512|
|        4| 0.08292682926829269|
|       13|0.025477707006369428|
|        1| 0.02564102564102564|
|       12| 0.07142857142857142|
|        3|  0.0728476821192053|
|        0|0.048484848484848485|
|       14| 0.05333333333333334|
|        7| 0.24861878453038674|
|        9|  0.4178082191780822|
+---------+--------------------+

## 2. Load fresh data 

__Open the data__

In [5]:
# a table has been prepopulated in Db2 with a sample of unlabeled customer data : SAMPLES.EVAL
sparkSession = spark \
        .builder \
        .getOrCreate()

df = spark.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="SAMPLES.EVAL") \
        .load()
df.show(1)

+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
|AREA|VMAIL|VMAIL_MSGS|DAY_MINS|DAY_CALLS|DAY_CHARGE|EVE_MINS|EVE_CALLS|EVE_CHARGE|NIGHT_MINS|NIGHT_CALLS|NIGHT_CHARGE|INTL_MINS|INTL_CALLS|INTL_CHARGE|SVC_CALLS|
+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
| 415|    0|        25|   265.1|      110|     45.07|  197.40|       99|     16.78|     244.7|         91|       11.01|     10.0|         3|       2.70|        1|
+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
only showing top 1 row

In [6]:
print("Number of rows "+str(df.count()))

Number of rows 4

We have 4 new examples. They are not labeled. It's up to us to make predictions!

But first, let's transform our data into a suitable format.

__Prepare the data__

1. Add new columns
2. Scale the numerical features
3. Add the ClusterChurn feature
4. Assemble features for prediction
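
Step 2 reuses the scaler fitted in the previous notebook, so fresh data is rescaled with the ranges seen during training. The sketch below illustrates min-max rescaling in plain Python. The function name and the training range are hypothetical; the real bounds are stored inside `scalerModel`.

```python
def min_max_scale(value, train_min, train_max):
    """Rescale value to [0, 1] using the range observed during training."""
    if train_max == train_min:
        # Constant column: 0.5 mirrors Spark's documented convention
        # for the default [0, 1] output range
        return 0.5
    return (value - train_min) / (train_max - train_min)

# Raw TOT_MINS for the sample row shown above:
# day + evening + international + night minutes
tot_mins = 265.1 + 197.4 + 10.0 + 244.7

# Hypothetical training range, for illustration only
TRAIN_MIN, TRAIN_MAX = 0.0, 1000.0

scaled = min_max_scale(tot_mins, TRAIN_MIN, TRAIN_MAX)
print(f"{scaled:.4f}")  # prints: 0.7172

# This formula does not clip: a fresh value outside the training range
# lands outside [0, 1]
print(f"{min_max_scale(1100.0, TRAIN_MIN, TRAIN_MAX):.2f}")  # prints: 1.10
```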

In [7]:
## 1. Add new columns

# 4 new features
TOT_MINS = df['DAY_MINS']+df['EVE_MINS']+df['INTL_MINS']+df['NIGHT_MINS']
DAY_MINS_perCALL = df['DAY_MINS']/df['DAY_CALLS']
NIGHT_MINS_perCALL = df['NIGHT_MINS']/df['NIGHT_CALLS']
EVE_MINS_perCALL = df['EVE_MINS']/df['EVE_CALLS']

# Add the columns to the existing ones in a new dataframe
tot_df = (df
          .withColumn("TOT_MINS", TOT_MINS)
          .withColumn("DAY_MINS_perCALL", DAY_MINS_perCALL)
          .withColumn("NIGHT_MINS_perCALL", NIGHT_MINS_perCALL)
          .withColumn("EVE_MINS_perCALL", EVE_MINS_perCALL))

In [16]:
## 2. Scale the numerical features

assembler = VectorAssembler(
    inputCols=["TOT_MINS", "DAY_MINS_perCALL", "EVE_MINS_perCALL", "NIGHT_MINS_perCALL", "VMAIL_MSGS", "INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS", "SVC_CALLS", "INTL_CHARGE", "DAY_CHARGE", "EVE_CHARGE", "NIGHT_CHARGE"],
    outputCol="rawFeatures")
assembled_df = assembler.transform(tot_df)

scaled_df = scalerModel.transform(assembled_df)

# Unroll the scaled features vector and reinsert the VMAIL column (there is no CHURN column: this data is unlabeled)
columns = ["VMAIL", "TOT_MINS", "DAY_MINS_perCALL", "EVE_MINS_perCALL", "NIGHT_MINS_perCALL", "VMAIL_MSGS", "INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS", "SVC_CALLS", "INTL_CHARGE", "DAY_CHARGE", "EVE_CHARGE", "NIGHT_CHARGE"]

full_scaled_df = scaled_df.rdd.map(lambda x:[x["VMAIL"]]+[float(y) for y in x['features']]).toDF(columns)

In [17]:
## 3. Add the ClusterChurn feature (i.e. AVG_CHURN)

# Assembler for kmeans
assembler_16 = VectorAssembler(
    inputCols=['VMAIL', 'TOT_MINS', 'DAY_MINS_perCALL', 'EVE_MINS_perCALL', 
               'NIGHT_MINS_perCALL', 'VMAIL_MSGS', 'INTL_CALLS', 'DAY_CALLS', 
               'EVE_CALLS', 'NIGHT_CALLS', 'SVC_CALLS', 'INTL_CHARGE', 
               'DAY_CHARGE', 'EVE_CHARGE', 'NIGHT_CHARGE'],
    outputCol="features")

# Join tables on ClusterID to add a new column
def preparation(DF):
    
    # assemble 
    DF_16 = assembler_16.transform(DF)
    
    # Assign each new customer to its nearest cluster
    DF_prediction = clusterModel.transform(DF_16)
    
    # Join DF with table churn_proportion on ClusterID
    DF_joined = DF_prediction.join(churn_proportion, DF_prediction.prediction == churn_proportion.ClusterID, "inner")
    #DF_joined.show(1)
    #DF_joined.printSchema()
    
    # Rename columns
    DF_prepared = DF_joined.withColumnRenamed("features", "featuresClustering").withColumnRenamed("prediction", "predictionClustering").withColumnRenamed("AVG_CHURN", "ClusterChurn")
    #DF_prepared.printSchema()
    
    return DF_prepared

prep_df = preparation(full_scaled_df)

In [18]:
prep_df.printSchema()

root
 |-- VMAIL: long (nullable = true)
 |-- TOT_MINS: double (nullable = true)
 |-- DAY_MINS_perCALL: double (nullable = true)
 |-- EVE_MINS_perCALL: double (nullable = true)
 |-- NIGHT_MINS_perCALL: double (nullable = true)
 |-- VMAIL_MSGS: double (nullable = true)
 |-- INTL_CALLS: double (nullable = true)
 |-- DAY_CALLS: double (nullable = true)
 |-- EVE_CALLS: double (nullable = true)
 |-- NIGHT_CALLS: double (nullable = true)
 |-- SVC_CALLS: double (nullable = true)
 |-- INTL_CHARGE: double (nullable = true)
 |-- DAY_CHARGE: double (nullable = true)
 |-- EVE_CHARGE: double (nullable = true)
 |-- NIGHT_CHARGE: double (nullable = true)
 |-- featuresClustering: vector (nullable = true)
 |-- predictionClustering: integer (nullable = false)
 |-- ClusterID: integer (nullable = false)
 |-- ClusterChurn: double (nullable = true)
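
The `ClusterID` and `ClusterChurn` columns in this schema come from the join inside `preparation`: each customer is assigned to a cluster, and that cluster's average churn rate becomes a new feature. Here is a minimal plain-Python sketch of the same idea. The churn rates are copied from the `AVG_CLUSTER_CHURN` table shown earlier, while the two-dimensional centroids, the customer point and the function name are hypothetical.

```python
import math

# AVG_CHURN values copied from the AVG_CLUSTER_CHURN table shown earlier
cluster_churn = {
    8: 0.563953488372093,
    9: 0.4178082191780822,
    13: 0.025477707006369428,
}

# Hypothetical 2-D centroids; the real KMeans model works on 15 scaled features
centroids = {8: (0.9, 0.8), 9: (0.7, 0.2), 13: (0.1, 0.1)}

def nearest_cluster(point, centroids):
    """Return the ID of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda cid: math.dist(point, centroids[cid]))

customer = (0.85, 0.75)  # hypothetical scaled feature values
cluster_id = nearest_cluster(customer, centroids)

# The lookup plays the role of the join on ClusterID
print(cluster_id, cluster_churn[cluster_id])  # prints: 8 0.563953488372093
```

Because `preparation` uses an inner join, a customer whose cluster ID had no match in the table would be dropped from the result rather than kept with a missing value.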

In [19]:
prep_df.show()

+-----+------------------+------------------+-------------------+-------------------+-------------------+----------+------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+--------------------+--------------------+---------+--------------------+
|VMAIL|          TOT_MINS|  DAY_MINS_perCALL|   EVE_MINS_perCALL| NIGHT_MINS_perCALL|         VMAIL_MSGS|INTL_CALLS|         DAY_CALLS|          EVE_CALLS|        NIGHT_CALLS|         SVC_CALLS|        INTL_CHARGE|         DAY_CHARGE|        EVE_CHARGE|      NIGHT_CHARGE|  featuresClustering|predictionClustering|ClusterID|        ClusterChurn|
+-----+------------------+------------------+-------------------+-------------------+-------------------+----------+------------------+-------------------+-------------------+------------------+-------------------+-------------------+------------------+------------------+--------------------+---------------

In [20]:
## 4. Assemble features for supervised learning

assembler_12 = VectorAssembler(
    inputCols=['VMAIL', 'TOT_MINS', 'DAY_MINS_perCALL', 'EVE_MINS_perCALL', 
               'NIGHT_MINS_perCALL', 'VMAIL_MSGS', 'INTL_CALLS', 'DAY_CALLS', 
               'EVE_CALLS', 'NIGHT_CALLS', 'SVC_CALLS', 'INTL_CHARGE', 
               'DAY_CHARGE', 'EVE_CHARGE', 'NIGHT_CHARGE', 'ClusterChurn'],
    outputCol="features")

assembled_df = assembler_12.transform(prep_df)

# Select only the features column (the data is unlabeled, despite the variable name)
labeled_df = assembled_df.select(assembled_df["features"])

## 3. Make predictions

__Prediction__

Let's see what predictions are made by our model on this unseen data!

In [21]:
# Random Forest
rf_pred = rfModel.transform(labeled_df)

print("Random Forest")
rf_pred.show()

Random Forest
+--------------------+--------------------+--------------------+----------+
|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+----------+
|[1.0,0.4907607790...|[9.40259740259740...|[0.37610389610389...|       1.0|
|[0.0,0.7206592308...|[23.8575492636634...|[0.95430197054653...|       0.0|
|[0.0,0.5675045779...|[24.0624543699248...|[0.96249817479699...|       0.0|
|[1.0,0.8862993174...|          [3.0,22.0]|         [0.12,0.88]|       1.0|
+--------------------+--------------------+--------------------+----------+

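Our final prediction vector is [1,0,0,1]: the model predicts that the first and fourth customers will end their contracts and that the second and third will stay.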
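
The three output columns are related. As far as we understand Spark ML's random forest, `probability` is `rawPrediction` normalized to sum to 1, and `prediction` is the index of the most probable class when default thresholds are used. The sketch below reproduces this for the fourth row of the output; the helper names are hypothetical.

```python
def normalize(raw_prediction):
    """Turn raw class scores into probabilities that sum to 1."""
    total = sum(raw_prediction)
    return [score / total for score in raw_prediction]

def predict(probability):
    """Return the index of the most probable class as a float label."""
    return float(probability.index(max(probability)))

raw = [3.0, 22.0]  # rawPrediction of the fourth row in the output above
prob = normalize(raw)
print(prob, predict(prob))  # prints: [0.12, 0.88] 1.0
```

The first row is a closer call: its class-0 probability is about 0.376, so the churn probability is about 0.62.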

## What you've learned

Congratulations!

In this notebook, you've seen how to:
* load saved models into a Jupyter notebook
* load data stored in Db2
* apply the models to fresh data to make predictions.

____
## Authors

Eva Feillet - ML intern, IBM Cloud and Cognitive Software, IBM Lab in Böblingen, Germany