# Running Spark ML on Db2 Warehouse sample data

### Import

Import the Spark ML classes used in this notebook.

In [1]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler

SparkSession available as 'spark'.



### Load

Load the data from the Db2 Warehouse TRAINING sample. 

In [2]:
val data = spark.read.format("com.ibm.idax.spark.idaxsource").
    option("url", "jdbc:db2:BLUDB").
    option("dbtable", "SAMPLES.TRAINING").
    option("mode", "JDBC").
    load()
println(data)

data: org.apache.spark.sql.DataFrame = [CHURN: smallint, AREA: int ... 15 more fields]
[CHURN: smallint, AREA: int ... 15 more fields]


The dataset describes customers' phone usage and can be used for churn prediction and for analyzing consumer habits. Its columns include whether the customer cancelled their contract, the number of minutes of day and evening calls, the corresponding charges, and a separate category for international calls, among others. In the following section we build customer clusters based on the number of calls each customer makes.

### Create a model

Build a Spark ML pipeline that assembles the four call-count columns into a single feature vector and clusters the customers with KMeans (k = 3, at most 3 iterations).

In [3]:
val assembler = new VectorAssembler().
    setInputCols(Array("INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS")).
    setOutputCol("features")

val clustering = new KMeans().
    setFeaturesCol("features").
    setK(3).
    setMaxIter(3)

val pipe = new Pipeline().
    setStages(Array(assembler, clustering))

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_31c8ff1f8679
clustering: org.apache.spark.ml.clustering.KMeans = kmeans_9e40851c3b54
pipe: org.apache.spark.ml.Pipeline = pipeline_c46f66e417f5
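
To make the roles of setK and setMaxIter more concrete, here is a minimal plain-Scala sketch of the classic Lloyd iteration that k-means is built on. It is an illustration only and not Spark's implementation: Spark runs distributed, uses k-means|| initialization by default, and also stops early once the centers move less than a tolerance. The helper names (squaredDistance, closest, mean, lloydStep, kMeans) and the toy points are made up for this example.

```scala
// Illustrative sketch of Lloyd's algorithm; not Spark's KMeans.
type Point = Array[Double]

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to p.
def closest(centers: Seq[Point], p: Point): Int =
  centers.indices.minBy(i => squaredDistance(centers(i), p))

// Coordinate-wise mean of a non-empty group of points.
def mean(points: Seq[Point]): Point = {
  val dim = points.head.length
  Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
}

// One iteration: assign every point to its closest center, then move each
// center to the mean of its points. A center with no points stays where it
// is (a simplification chosen for this sketch).
def lloydStep(centers: Seq[Point], points: Seq[Point]): Seq[Point] = {
  val groups = points.groupBy(p => closest(centers, p))
  centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
}

// Run a fixed number of iterations, the role played by setMaxIter above.
def kMeans(points: Seq[Point], initial: Seq[Point], maxIter: Int): Seq[Point] =
  (1 to maxIter).foldLeft(initial)((cs, _) => lloydStep(cs, points))

// Toy data (made up): two obvious groups in two dimensions, k = 2.
val toyPoints = Seq(Array(0.0, 0.0), Array(0.0, 2.0), Array(10.0, 10.0), Array(10.0, 12.0))
val toyCenters = kMeans(toyPoints, Seq(Array(1.0, 1.0), Array(9.0, 9.0)), 3)
println(toyCenters.map(_.mkString("[", ",", "]")).mkString(" "))
// prints: [0.0,1.0] [10.0,11.0]
```

Three iterations are plenty for this toy example. On real data a cap as low as setMaxIter(3) may stop before the centers have settled; Spark's default is 20.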


Fit the pipeline on the data to find the clusters.

In [4]:
val model = pipe.fit(data)

model: org.apache.spark.ml.PipelineModel = pipeline_c46f66e417f5
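
The fitted PipelineModel can also be applied back to the data to see which cluster each customer falls into. The cell below is a sketch that continues the notebook session, so it assumes the model and data values defined above are still in scope. The name clustered is introduced here for illustration; "prediction" is the default output column of KMeans.

```scala
// Continues the notebook session: relies on `model` and `data` from the cells above.
// transform() runs both stages: it adds the assembled "features" column and
// the cluster index in the "prediction" column (the KMeans default name).
val clustered = model.transform(data)

// Number of customers assigned to each cluster.
clustered.groupBy("prediction").count().orderBy("prediction").show()
```

The counts give a quick sanity check on whether the three clusters are reasonably balanced or one of them absorbs most customers.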


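Print the cluster centers. Each center has four coordinates, in the order of the input columns given to the VectorAssembler: INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS.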

In [5]:
model.stages(1).asInstanceOf[KMeansModel].clusterCenters.foreach { println }

[4.49812734082397,117.52059925093633,106.0314606741573,101.74756554307116]
[4.524608501118568,82.92170022371364,114.31991051454139,106.04474272930649]
[4.420289855072464,93.95833333333333,81.45561594202898,93.31702898550725]
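
To read these centers, it can help to see how a single customer would be matched to one of them. The sketch below is plain Scala, with no Spark needed. It copies the three centers printed above and picks the one with the smallest squared Euclidean distance, which corresponds to the default distance measure of Spark's KMeans. The helper names and the sample customer are hypothetical and not taken from SAMPLES.TRAINING.

```scala
// Cluster centers copied from the output above, with coordinates in the order
// (INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS).
val centers: Array[Array[Double]] = Array(
  Array(4.49812734082397, 117.52059925093633, 106.0314606741573, 101.74756554307116),
  Array(4.524608501118568, 82.92170022371364, 114.31991051454139, 106.04474272930649),
  Array(4.420289855072464, 93.95833333333333, 81.45561594202898, 93.31702898550725)
)

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to the given point.
def nearestCluster(point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(point, centers(i)))

// Hypothetical customer (made-up values): 4 international, 120 day,
// 100 evening and 100 night calls.
val customer = Array(4.0, 120.0, 100.0, 100.0)
println(nearestCluster(customer))
// prints: 0
```

The three centers are nearly identical in INTL_CALLS (about 4.4 to 4.5) and differ mainly in day and evening calls, so those two columns account for most of the separation between the clusters.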
