# Running PySpark ML on Db2 Warehouse sample data

### Import

Import the necessary Spark classes.

In [1]:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
80,,pyspark,idle,,,✔



SparkSession available as 'spark'.



### Load data

Load the SAMPLES.TRAINING sample table from Db2 Warehouse into a Spark DataFrame and show the first five rows.

In [6]:
sparkSession = SparkSession \
        .builder \
        .getOrCreate()

df = sparkSession.read \
        .format("com.ibm.idax.spark.idaxsource") \
        .options(dbtable="SAMPLES.TRAINING") \
        .load()
df.show(5)



+-----+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
|CHURN|AREA|VMAIL|VMAIL_MSGS|DAY_MINS|DAY_CALLS|DAY_CHARGE|EVE_MINS|EVE_CALLS|EVE_CHARGE|NIGHT_MINS|NIGHT_CALLS|NIGHT_CHARGE|INTL_MINS|INTL_CALLS|INTL_CHARGE|SVC_CALLS|
+-----+----+-----+----------+--------+---------+----------+--------+---------+----------+----------+-----------+------------+---------+----------+-----------+---------+
|    0| 415|    1|         0|   246.5|      108|     41.91|   216.3|       89|     18.39|     179.6|         99|        8.08|     12.7|         3|       3.43|        2|
|    1| 408|    1|         0|   298.1|      112|     50.68|   201.3|      100|     17.11|     214.7|         88|        9.66|      9.7|         4|       2.62|        2|
|    0| 510|    1|         0|   119.3|       82|     20.28|   185.1|      111|     15.73|     157.0|         74|        7.07|     10.9|         4|       2.

The dataset describes customers' phone usage and can be used for churn prediction and for analysis of consumer habits. Columns include whether the customer cancelled their contract (CHURN), the minutes, number of calls, and charge for day, evening, and night usage, and the same figures for international calls. In the following section we build customer clusters based on the number of calls each customer makes.

### Create a model 

Assemble the four call-count columns into a single feature vector per customer, then configure a KMeans estimator with three clusters to group the customers.

In [3]:

assembler = VectorAssembler(
    inputCols=["INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS"],
    outputCol="features")
newDF = assembler.transform(df)

# Configure a k-means estimator (k=3, fixed seed); training happens when fit() is called.
kmeans = KMeans().setK(3).setSeed(1)


Fit the estimator to the assembled DataFrame. This runs the k-means algorithm and returns a model that holds the cluster centers.

In [4]:
model = kmeans.fit(newDF)
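
To make the fitting step more concrete, here is a minimal pure-Python sketch of the classic k-means (Lloyd) iteration: assign every point to its nearest center, then move each center to the mean of its points. It is an illustration only, not Spark's implementation. Spark's KMeans runs distributed and by default uses the k-means|| initialization, whereas this sketch simply samples starting centers at random. The names `lloyd_kmeans`, `nearest`, `squared_distance`, and `toy_points` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point (ties go to the lowest index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def lloyd_kmeans(points, k, seed=1, max_iter=20):
    """Toy k-means: random initial centers, then assign/update until stable."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Assignment step: group points by their nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Update step: move each center to the mean of its group.
        # An empty group keeps its previous center.
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break  # centers stopped moving
        centers = new_centers
    return centers


# Hypothetical toy data: two well-separated groups of 2-D points.
toy_points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
print(sorted(lloyd_kmeans(toy_points, k=2)))
```

In the notebook, the same idea is applied to the four-dimensional call-count vectors produced by the assembler, with k set to 3.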


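Assign each customer to a cluster and print the cluster centers. The values in each center follow the order of the assembler's input columns: INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS.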

In [5]:
predictions = model.transform(newDF)
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)


Cluster Centers: 
[  4.48522829 101.89794091 112.01253357 116.48343778]
[ 4.50045746 81.75388838 90.97163769 95.76669716]
[  4.45325022 117.16384684  97.17809439  88.0445236 ]
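
The printed centers are easier to read with column names attached. The sketch below copies the three centers from the output above, labels them in the order of the assembler's input columns, and assigns the first customer shown in the sample rows to the nearest center. It assumes the default Euclidean distance measure. The helper names `label_center` and `nearest_cluster` are hypothetical, and the code runs outside Spark.

```python
# Feature order matches the inputCols passed to VectorAssembler.
feature_names = ["INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS"]

# Cluster centers copied from the notebook output above.
centers = [
    [4.48522829, 101.89794091, 112.01253357, 116.48343778],
    [4.50045746, 81.75388838, 90.97163769, 95.76669716],
    [4.45325022, 117.16384684, 97.17809439, 88.0445236],
]


def label_center(center):
    """Pair each coordinate of a center with its column name."""
    return dict(zip(feature_names, center))


def nearest_cluster(row):
    """Index of the center closest to a row (squared Euclidean distance)."""
    def squared_distance(center):
        return sum((row[name] - value) ** 2
                   for name, value in zip(feature_names, center))
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i]))


# Call counts of the first customer in the df.show(5) output.
first_customer = {"INTL_CALLS": 3, "DAY_CALLS": 108,
                  "EVE_CALLS": 89, "NIGHT_CALLS": 99}

for index, center in enumerate(centers):
    print(index, label_center(center))
print("first customer -> cluster", nearest_cluster(first_customer))
```

With these centers, the first customer falls into cluster index 2, the group with the highest average number of day calls. The INTL_CALLS value is close to 4.5 in all three centers, so the clusters are separated almost entirely by the day, evening, and night call counts.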