# Running Spark ML on dashDB sample data

Import the Spark ML classes used in this notebook.

In [2]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler

Load the data from the SAMPLES.TRAINING sample table. This table is pre-populated in dashDB Local.

In [3]:
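// Read SAMPLES.TRAINING from the BLUDB database into a DataFrame
val data = spark.read.format("com.ibm.idax.spark.idaxsource").
    option("url", "jdbc:db2:BLUDB").
    option("dbtable", "SAMPLES.TRAINING").
    option("mode", "JDBC").
    load()
// Prints the DataFrame's column names and types, not its rows
println(data)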

Build a Spark ML pipeline that assembles the four call-count columns of the customer data into a feature vector and clusters the customers into three groups using KMeans.

In [9]:
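// Combine the four call-count columns into a single vector column
val assembler = new VectorAssembler().
    setInputCols(Array("INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS")).
    setOutputCol("features")

// Cluster the feature vectors into 3 groups, with at most 3 iterations
val clustering = new KMeans().
    setFeaturesCol("features").
    setK(3).
    setMaxIter(3)

// Chain both steps so they run as one unit
val pipe = new Pipeline().
    setStages(Array(assembler, clustering))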

Fit the pipeline on the data to find the clusters.

In [5]:
val model = pipe.fit(data)
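
The fitted pipeline can also assign each row to a cluster. The cell below is a minimal sketch, assuming it runs in the same notebook session as the cell above, so that model and data are already defined. The name predictions is introduced here for illustration only; "prediction" is the default output column of Spark ML's KMeans, which this notebook does not override.

```scala
// Sketch: assumes `model` and `data` from the cells above.
// transform() runs both pipeline stages and appends a "features" column
// and a "prediction" column holding the cluster index of each row.
val predictions = model.transform(data)

// Count how many customers fall into each cluster.
predictions.groupBy("prediction").count().show()
```

Because the per-cluster counts depend on the data and on the random initialization of KMeans, no output is shown here.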

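Print the cluster centers. The KMeans model is the second pipeline stage, so it is at index 1. Each center lists its coordinates in the order of the assembler's input columns: INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS.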

In [6]:
model.stages(1).asInstanceOf[KMeansModel].clusterCenters.foreach { println }

[4.472298409215579,108.21832144816237,109.28304991771806,105.66812945693911]
[4.512367491166078,94.31566548881035,81.90106007067138,105.78563015312132]
[4.4568835098335855,86.8320726172466,98.22087745839637,77.47957639939486]
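
To illustrate how these centers are used, the following plain-Scala sketch assigns a point to its nearest center by squared Euclidean distance, which matches the default distance measure of Spark ML's KMeans. It is not part of the original notebook: the centers are copied from the output above and rounded to two decimals, and the customer values and the helper names squaredDistance and nearestCenter are hypothetical.

```scala
// Cluster centers from the output above, rounded to two decimals.
// Column order: INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS
val centers = Array(
  Array(4.47, 108.22, 109.28, 105.67),
  Array(4.51, 94.32, 81.90, 105.79),
  Array(4.46, 86.83, 98.22, 77.48)
)

// Squared Euclidean distance between two points of equal length.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to the given point.
def nearestCenter(point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(point, centers(i)))

// Hypothetical customer, not taken from SAMPLES.TRAINING.
val customer = Array(5.0, 90.0, 80.0, 110.0)
println(nearestCenter(customer)) // prints 1
```

The hypothetical customer lands in cluster 1 because its evening call count (80) is much closer to that center's 81.90 than to the other two centers.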


This cell is marked //NOT-FOR-APP, so it will not be included in the Spark app. It lists the stages of the fitted pipeline.

In [10]:
//NOT-FOR-APP
model.stages

Array(vecAssembler_acf9d1dd4a58, kmeans_279eb22adacb)