<img src=http://fd.perso.eisti.fr/Logos/TORUS2.png>

To illustrate a clustering algorithm, we use the classic example: the Iris dataset with K-Means.

"K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster."

(source: https://en.wikipedia.org/wiki/K-means_clustering)
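To make the definition concrete, here is a minimal sketch of a single k-means iteration in plain Scala, with no Spark dependency. The names `KMeansSketch`, `Point`, `nearest`, `mean` and `step` are illustrative only and are not part of any library; Spark's own implementation is distributed and more elaborate.

```scala
// Dependency-free sketch of one k-means (Lloyd) iteration.
// All names here are illustrative assumptions, not Spark APIs.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def nearest(p: Point, centers: Seq[Point]): Int =
    centers.indices.minBy(i => sqDist(p, centers(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    ps.transpose.map(col => col.sum / ps.size).toVector

  // One iteration: assign each point to its nearest center, then move each
  // center to the mean of its assigned points (an empty cluster keeps its center).
  def step(points: Seq[Point], centers: Seq[Point]): Seq[Point] = {
    val groups = points.groupBy(p => nearest(p, centers))
    centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
  }
}
```

Repeating `step` until the centers stop moving gives the basic algorithm; each center then plays the role of the "prototype" mentioned in the definition.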

### Read dataset (csv format) from HDFS

Here we use the dataset from https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

The target variable is class (there are 3 classes: Iris Setosa, Iris Versicolour, Iris Virginica) and the descriptive variables are:
- sepal length
- sepal width
- petal length
- petal width

In [ ]:
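import org.apache.spark.sql.SQLContext

// SQLContext still works in Spark 2.x, but SparkSession is the preferred entry point.
val sqlContext = new SQLContext(sc)

// Read the CSV with a header row and let Spark infer the column types.
val data = sqlContext.read.format("com.databricks.spark.csv")
              .option("header", "true").option("inferSchema", "true")
              .load("hdfs://hupi-factory-02-01-01-01/user/hupi/dataset_torusVN/formation4_ML/iris.csv")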

sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@e2d547f
data: org.apache.spark.sql.DataFrame = [sepalLength: double, sepalWidth: double ... 3 more fields]


In [ ]:
data.show()

+-----------+----------+-----------+----------+-----------+
|sepalLength|sepalWidth|petalLength|petalWidth|      class|
+-----------+----------+-----------+----------+-----------+
|        5.1|       3.5|        1.4|       0.2|Iris-setosa|
|        4.9|       3.0|        1.4|       0.2|Iris-setosa|
|        4.7|       3.2|        1.3|       0.2|Iris-setosa|
|        4.6|       3.1|        1.5|       0.2|Iris-setosa|
|        5.0|       3.6|        1.4|       0.2|Iris-setosa|
|        5.4|       3.9|        1.7|       0.4|Iris-setosa|
|        4.6|       3.4|        1.4|       0.3|Iris-setosa|
|        5.0|       3.4|        1.5|       0.2|Iris-setosa|
|        4.4|       2.9|        1.4|       0.2|Iris-setosa|
|        4.9|       3.1|        1.5|       0.1|Iris-setosa|
|        5.4|       3.7|        1.5|       0.2|Iris-setosa|
|        4.8|       3.4|        1.6|       0.2|Iris-setosa|
|        4.8|       3.0|        1.4|       0.1|Iris-setosa|
|        4.3|       3.0|        1.1|    

###  Vector Assembler

To build a K-Means model with the Spark ML library, the input data must expose a single vector column (named "features" by default). To get there, we combine all the descriptive variables into one vector column named "features".

In [ ]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


In [ ]:
val assembler = new VectorAssembler()
  .setInputCols(Array("sepalLength", "sepalWidth", "petalLength", "petalWidth"))
  .setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_4901a2c07027


In [ ]:
val data_withGoodColumns = assembler.transform(data).select("features")

data_withGoodColumns: org.apache.spark.sql.DataFrame = [features: vector]


In [ ]:
data_withGoodColumns.take(5)

res14: Array[org.apache.spark.sql.Row] = Array([[5.1,3.5,1.4,0.2]], [[4.9,3.0,1.4,0.2]], [[4.7,3.2,1.3,0.2]], [[4.6,3.1,1.5,0.2]], [[5.0,3.6,1.4,0.2]])


### Build a K-Means model 

In this example, we set the number of clusters to 3 because we know beforehand that there are 3 classes of Iris. When the number of classes is not known, we have to find a suitable K. One method that can help is the Elbow method (https://en.wikipedia.org/wiki/Elbow_method_(clustering)).
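As a rough sketch of the Elbow method, one can fit a model for several candidate values of K and compare the resulting costs. The helper name `wssseByK` and its parameters are assumptions made for illustration; it relies on `computeCost`, which is available in the Spark 2.x API used in this notebook, and expects a DataFrame with a vector column named "features".

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Hypothetical helper: fits one model per candidate k and returns (k, WSSSE) pairs.
// `df` must contain a vector column named "features" (the KMeans default input column).
def wssseByK(df: DataFrame, ks: Seq[Int], seed: Long = 1L): Seq[(Int, Double)] =
  ks.map { k =>
    val model = new KMeans().setK(k).setSeed(seed).fit(df)
    (k, model.computeCost(df)) // sum of squared distances to the nearest center
  }

// Example call on the assembled Iris data from this notebook:
// wssseByK(data_withGoodColumns, 2 to 8).foreach { case (k, c) => println(s"k=$k  WSSSE=$c") }
```

The cost generally shrinks as K grows, so the value to look for is the K after which the decrease flattens out (the "elbow"), not the smallest cost.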

In [ ]:
import org.apache.spark.ml.clustering.KMeans

// Trains a k-means model.
val kmeans = new KMeans().setK(3).setSeed(1L)
val model = kmeans.fit(data_withGoodColumns)

import org.apache.spark.ml.clustering.KMeans
kmeans: org.apache.spark.ml.clustering.KMeans = kmeans_b71a1268a783
model: org.apache.spark.ml.clustering.KMeansModel = kmeans_b71a1268a783


### Evaluation of the model

In [ ]:
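// Evaluate clustering by computing the Within Set Sum of Squared Errors (WSSSE).
// Note: computeCost is deprecated since Spark 2.4 and removed in 3.0;
// on those versions, model.summary.trainingCost gives the cost on the training data.
val WSSSE = model.computeCost(data_withGoodColumns)
println(s"Within Set Sum of Squared Errors = $WSSSE")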

Within Set Sum of Squared Errors = 78.94506582597637
WSSSE: Double = 78.94506582597637
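The figure printed above is the sum, over all rows, of the squared Euclidean distance between each row and its nearest cluster center. A minimal plain-Scala sketch of that definition follows; `WssseSketch`, `sqDist` and `wssse` are illustrative names, not Spark functions.

```scala
// Illustrative, dependency-free version of the WSSSE definition.
object WssseSketch {
  // Squared Euclidean distance between two equal-length vectors.
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Sum over all points of the squared distance to the closest center.
  def wssse(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
    points.map(p => centers.map(c => sqDist(p, c)).min).sum
}
```

A lower value means the points sit closer to their centers, but the number is only meaningful when comparing models fitted on the same data.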


In [ ]:
// Shows the result.
println("Cluster Centers: ")
model.clusterCenters.foreach(println)

Cluster Centers: 
[5.88360655737705,2.7409836065573776,4.388524590163936,1.4344262295081969]
[6.853846153846153,3.0769230769230766,5.715384615384615,2.053846153846153]
[5.005999999999999,3.4180000000000006,1.4640000000000002,0.2439999999999999]
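Because the Iris classes are known, one possible sanity check is to count how many rows of each class land in each cluster. The sketch below assumes the `data`, `assembler` and `model` values defined earlier in this notebook and the default output column name "prediction" of `KMeansModel`; the helper name `clusterVsClass` is hypothetical.

```scala
import org.apache.spark.ml.clustering.KMeansModel
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Hypothetical helper: counts rows per (class, cluster) pair.
// `raw` must contain the assembler's input columns and a "class" column.
def clusterVsClass(raw: DataFrame, assembler: VectorAssembler, model: KMeansModel): DataFrame =
  model.transform(assembler.transform(raw)) // adds "features", then "prediction"
    .groupBy("class", "prediction")
    .count()
    .orderBy("class", "prediction")

// Example call with the values from this notebook:
// clusterVsClass(data, assembler, model).show()
```

Cluster indices are arbitrary, so what matters in the resulting table is whether each class is concentrated in one cluster, not which index it received.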
