# Running Spark ML on dashDB sample data

Import the necessary Spark ML classes.

In [1]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler

Load the data from the SAMPLES.TRAINING table, which is pre-populated in dashDB Local.

In [2]:
// Read SAMPLES.TRAINING into a DataFrame
val data = sqlContext.read.format("com.ibm.idax.spark.idaxsource").
    option("url", "jdbc:db2:BLUDB").
    option("dbtable", "SAMPLES.TRAINING").
    option("mode", "JDBC").
    load()
// Evaluate the DataFrame to display its schema
data

[DAY_MINS: decimal(5,1), EVE_CALLS: int, INTL_MINS: decimal(4,1), NIGHT_CALLS: int, SVC_CALLS: int, VMAIL: smallint, DAY_CALLS: int, DAY_CHARGE: decimal(5,2), EVE_CHARGE: decimal(5,2), EVE_MINS: decimal(5,1), NIGHT_CHARGE: decimal(5,2), NIGHT_MINS: decimal(5,1), INTL_CALLS: int, INTL_CHARGE: decimal(4,2), AREA: int, CHURN: smallint, VMAIL_MSGS: int]

Build a Spark ML pipeline that assembles the four call-count columns of the customer data into a feature vector and clusters the customers using KMeans.

In [5]:
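// Combine the four call-count columns into a single vector column
val assembler = new VectorAssembler().
    setInputCols(Array("INTL_CALLS", "DAY_CALLS", "EVE_CALLS", "NIGHT_CALLS")).
    setOutputCol("features")

// Look for 3 clusters, running at most 3 iterations
val clustering = new KMeans().
    setFeaturesCol("features").
    setK(3).
    setMaxIter(3)

// Chain both stages so that they run as one unit
val pipe = new Pipeline().
    setStages(Array(assembler, clustering))

pipe.getStages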

Array(vecAssembler_9c9ee57c795b, kmeans_3dd26598f35e)

Fit the pipeline to the data to find the clusters.

In [5]:
val model = pipe.fit(data)
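
Once fitted, the model can also label each row with its cluster. The following is a minimal sketch, assuming the `model` and `data` values from the cells above and that KMeans writes its result to the default output column `prediction` (no other column name was configured in the pipeline):

```scala
// Apply both pipeline stages to the data: the assembler adds "features",
// the fitted KMeans stage adds "prediction" (the cluster index, 0 to 2).
val clustered = model.transform(data)

// Count how many customers were assigned to each cluster.
clustered.groupBy("prediction").count().show()
```

The counts depend on the contents of the table and on the random initialization of KMeans, so they can differ between runs.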

Print the cluster centers. Each row is one center, with coordinates in the order INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS.

In [None]:
model.stages(1).asInstanceOf[KMeansModel].clusterCenters.foreach { println }

[4.5121816168327795,108.42746400885936,105.26467331118494,108.61517165005537]
[4.378435517970402,84.23255813953489,119.11205073995772,90.81818181818183]
[4.468690702087287,94.01328273244782,82.76375711574953,89.69924098671727]
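
To illustrate what these centers mean, the sketch below assigns a point to its nearest center in plain Scala, without Spark. It assumes squared Euclidean distance, the usual KMeans criterion. The helpers `squaredDistance` and `nearestCenter` and the sample `customer` are hypothetical names and values introduced only for this illustration.

```scala
// Cluster centers copied from the output above, in the order
// INTL_CALLS, DAY_CALLS, EVE_CALLS, NIGHT_CALLS.
val centers: Array[Array[Double]] = Array(
  Array(4.5121816168327795, 108.42746400885936, 105.26467331118494, 108.61517165005537),
  Array(4.378435517970402, 84.23255813953489, 119.11205073995772, 90.81818181818183),
  Array(4.468690702087287, 94.01328273244782, 82.76375711574953, 89.69924098671727)
)

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to the given point.
def nearestCenter(point: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(point, centers(i)))

// Hypothetical customer, not a row from the TRAINING table.
val customer = Array(4.0, 110.0, 100.0, 110.0)
println(nearestCenter(customer))
```

With these centers the sample customer lands in the first cluster (index 0), the one with roughly balanced, high call counts across day, evening and night.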
