# Cluster analysis with k-means

### Importing MLlib libraries 

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
import org.apache.spark.ml.clustering.KMeans

### Read data and pre-processing

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres.

In [2]:
val rawData = spark.read.format("csv").option("header","true").option("inferSchema", "true").load("data/iris_h.csv")

In [3]:
rawData.printSchema()
rawData.show()

root
 |-- SepalLength: double (nullable = true)
 |-- SepalWidth: double (nullable = true)
 |-- PetalLength: double (nullable = true)
 |-- PetalWidth: double (nullable = true)
 |-- Species: string (nullable = true)

+-----------+----------+-----------+----------+-------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|
+-----------+----------+-----------+----------+-------+
|        5.1|       3.5|        1.4|       0.2| setosa|
|        4.9|       3.0|        1.4|       0.2| setosa|
|        4.7|       3.2|        1.3|       0.2| setosa|
|        4.6|       3.1|        1.5|       0.2| setosa|
|        5.0|       3.6|        1.4|       0.2| setosa|
|        5.4|       3.9|        1.7|       0.4| setosa|
|        4.6|       3.4|        1.4|       0.3| setosa|
|        5.0|       3.4|        1.5|       0.2| setosa|
|        4.4|       2.9|        1.4|       0.2| setosa|
|        4.9|       3.1|        1.5|       0.1| setosa|
|        5.4|       3.7|        1.5|       0.2| setosa|
|

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees. VectorAssembler accepts the following input column types: all numeric types, boolean type, and vector type. In each row, the values of the input columns will be concatenated into a vector in the specified order.
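The concatenation described above can be pictured with a few lines of plain Scala. This is only a sketch of the idea, not the Spark API: the helper `assembleRow` and the representation of a row as a `Map` are assumptions made for illustration.

```scala
// Illustrative only: plain Scala, not the Spark VectorAssembler API.
// A row is modelled as a map from column name to numeric value (an assumption).
def assembleRow(row: Map[String, Double], inputCols: Seq[String]): Seq[Double] =
  inputCols.map(row) // look up each column in the requested order

// First row of the Iris table shown above.
val firstRow = Map(
  "SepalLength" -> 5.1, "SepalWidth" -> 3.5,
  "PetalLength" -> 1.4, "PetalWidth" -> 0.2)

val features = assembleRow(
  firstRow, Seq("SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
```

The order of `inputCols` determines the order of the values in the resulting vector, which is why the same column order should be used for training and for prediction.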

In [4]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("SepalLength","SepalWidth",
"PetalLength","PetalWidth")).setOutputCol("features")
val Data = assembler.transform(rawData)
Data.show()

+-----------+----------+-----------+----------+-------+-----------------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|         features|
+-----------+----------+-----------+----------+-------+-----------------+
|        5.1|       3.5|        1.4|       0.2| setosa|[5.1,3.5,1.4,0.2]|
|        4.9|       3.0|        1.4|       0.2| setosa|[4.9,3.0,1.4,0.2]|
|        4.7|       3.2|        1.3|       0.2| setosa|[4.7,3.2,1.3,0.2]|
|        4.6|       3.1|        1.5|       0.2| setosa|[4.6,3.1,1.5,0.2]|
|        5.0|       3.6|        1.4|       0.2| setosa|[5.0,3.6,1.4,0.2]|
|        5.4|       3.9|        1.7|       0.4| setosa|[5.4,3.9,1.7,0.4]|
|        4.6|       3.4|        1.4|       0.3| setosa|[4.6,3.4,1.4,0.3]|
|        5.0|       3.4|        1.5|       0.2| setosa|[5.0,3.4,1.5,0.2]|
|        4.4|       2.9|        1.4|       0.2| setosa|[4.4,2.9,1.4,0.2]|
|        4.9|       3.1|        1.5|       0.1| setosa|[4.9,3.1,1.5,0.1]|
|        5.4|       3.7|        1.5|  

In [5]:
%AddJar -magic https://brunelvis.org/jar/spark-kernel-brunel-all-2.2.jar
import org.apache.spark.sql.DataFrame

Starting download from https://brunelvis.org/jar/spark-kernel-brunel-all-2.2.jar
Finished download of spark-kernel-brunel-all-2.2.jar


In [6]:
%%brunel data('Data') x(SepalLength) y(PetalLength) color (Species)

### Perform cluster analysis

k-means is one of the most commonly used clustering algorithms; it partitions the data points into a predefined number of clusters. The MLlib implementation includes a parallelized variant of the k-means++ initialization method called kmeans||.
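To make the algorithm concrete, here is a minimal plain-Scala sketch of the classic assignment and update iterations (Lloyd's algorithm). It is not the MLlib implementation: the initial centers are passed in by hand rather than chosen by a seeding method, and the names `lloyd`, `closest`, and `mean` are hypothetical helpers.

```scala
// Illustrative sketch of Lloyd's iterations in plain Scala; this is not MLlib's code.
type Point = Seq[Double]

// Squared Euclidean distance between two points of equal dimension.
def squaredDistance(a: Point, b: Point): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Index of the center closest to point p.
def closest(p: Point, centers: Seq[Point]): Int =
  centers.indices.minBy(i => squaredDistance(p, centers(i)))

// Component-wise mean of a non-empty group of points.
def mean(points: Seq[Point]): Point =
  points.transpose.map(col => col.sum / col.size)

// Repeat the assignment and update steps a fixed number of times.
def lloyd(points: Seq[Point], initial: Seq[Point], iterations: Int): Seq[Point] =
  (1 to iterations).foldLeft(initial) { (centers, _) =>
    val groups = points.groupBy(p => closest(p, centers))
    // A center that attracts no points is left where it is.
    centers.indices.map(i => groups.get(i).map(mean).getOrElse(centers(i)))
  }

// Four made-up 2-D points forming two obvious groups.
val points: Seq[Point] =
  Seq(Seq(1.0, 1.0), Seq(1.0, 2.0), Seq(8.0, 8.0), Seq(9.0, 8.0))
val centers = lloyd(points, initial = Seq(points(0), points(2)), iterations = 5)
```

A fixed iteration count keeps the sketch short; a real implementation would typically also stop once the centers move less than a tolerance.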

In [7]:
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(Data)

In [8]:
model.clusterCenters.foreach(println)

[6.30103092783505,2.8865979381443303,4.958762886597939,1.6958762886597945]
[5.005660377358491,3.369811320754718,1.560377358490566,0.29056603773584894]


### Apply cluster model

### Write the result to a Hadoop-style directory, coalesced into a single file

In [9]:
val transformed =  model.transform(Data)
val transformed1=transformed.drop(transformed.col("features")) 
// transformed1.write.option("header", "true").csv("results/res9")
transformed1.show()
transformed1.printSchema()

+-----------+----------+-----------+----------+-------+----------+
|SepalLength|SepalWidth|PetalLength|PetalWidth|Species|prediction|
+-----------+----------+-----------+----------+-------+----------+
|        5.1|       3.5|        1.4|       0.2| setosa|         1|
|        4.9|       3.0|        1.4|       0.2| setosa|         1|
|        4.7|       3.2|        1.3|       0.2| setosa|         1|
|        4.6|       3.1|        1.5|       0.2| setosa|         1|
|        5.0|       3.6|        1.4|       0.2| setosa|         1|
|        5.4|       3.9|        1.7|       0.4| setosa|         1|
|        4.6|       3.4|        1.4|       0.3| setosa|         1|
|        5.0|       3.4|        1.5|       0.2| setosa|         1|
|        4.4|       2.9|        1.4|       0.2| setosa|         1|
|        4.9|       3.1|        1.5|       0.1| setosa|         1|
|        5.4|       3.7|        1.5|       0.2| setosa|         1|
|        4.8|       3.4|        1.6|       0.2| setosa|       

The following code uses the model to assign each observation to a cluster, counts occurrences of cluster and label pairs, and prints them.
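The counting step can be sketched with plain Scala collections standing in for the DataFrame. The `(label, cluster)` pairs below are made-up sample values chosen only to show the mechanics; they are not rows from the Iris data.

```scala
// Illustrative only: plain Scala collections standing in for the DataFrame.
// The (label, cluster) pairs below are made-up sample values.
val assignments = Seq(
  ("setosa", 1), ("setosa", 1), ("versicolor", 0),
  ("versicolor", 1), ("virginica", 0))

// Count how often each (label, cluster) pair occurs.
val pairCounts: Map[(String, Int), Int] =
  assignments.groupBy(identity).map { case (pair, rows) => pair -> rows.size }

// Print one line per pair, ordered by label and then by cluster.
pairCounts.toSeq.sortBy(_._1).foreach { case ((label, cluster), n) =>
  println(s"$label cluster=$cluster count=$n")
}
```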

In [10]:
 transformed1.stat.crosstab("Species", "prediction").show()

+------------------+---+---+
|Species_prediction|  0|  1|
+------------------+---+---+
|         virginica| 50|  0|
|            setosa|  0| 50|
|        versicolor| 47|  3|
+------------------+---+---+



### Choice of K

A clustering can be considered good if each data point is near its closest centroid. One way to quantify this is the mean squared Euclidean distance from each point to its nearest centroid, computed for a model built with a given k.
This is an internal quality measure.
KMeansModel.computeCost(dataset) returns the k-means cost (the sum of squared distances of points to their nearest center) for this model on the given data. Dividing the cost by the number of points gives the mean squared distance used as the score below.
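As a rough illustration of what this cost measures, the following plain-Scala sketch computes the sum of squared distances to the nearest center and its per-point average. It is not the MLlib code; the function names and the sample points and centers are assumptions made up for the example.

```scala
// Illustrative plain-Scala version of the cost; not the MLlib implementation.
def squaredDistance(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Sum of squared distances from each point to its nearest center.
def cost(points: Seq[Seq[Double]], centers: Seq[Seq[Double]]): Double =
  points.map(p => centers.map(c => squaredDistance(p, c)).min).sum

// Mean squared distance: the cost divided by the number of points.
def meanCost(points: Seq[Seq[Double]], centers: Seq[Seq[Double]]): Double =
  cost(points, centers) / points.size

// Made-up sample data: three points on a line and two centers.
val samplePoints = Seq(Seq(0.0, 0.0), Seq(2.0, 0.0), Seq(10.0, 0.0))
val sampleCenters = Seq(Seq(1.0, 0.0), Seq(10.0, 0.0))
val total = cost(samplePoints, sampleCenters)       // 1.0 + 1.0 + 0.0
val average = meanCost(samplePoints, sampleCenters) // total / 3
```

Because the cost can only decrease or stay equal when a center is added at an optimum, it is usually read for the point where the improvement levels off rather than for its minimum.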

In [11]:
import org.apache.spark.ml.{PipelineModel, Pipeline}
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler, StringIndexer, StandardScaler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Random

In [13]:
def clusteringScore0(data: DataFrame, k: Int): Double = {
    val assembler = new VectorAssembler().
      setInputCols(data.columns.filter(_ != "label")).
      setOutputCol("featureVector")

    val kmeans = new KMeans().
      setSeed(Random.nextLong()).
      setK(k).
      setPredictionCol("cluster").
      setFeaturesCol("featureVector")

    val pipeline = new Pipeline().setStages(Array(assembler, kmeans))

    val kmeansModel = pipeline.fit(data).stages.last.asInstanceOf[KMeansModel]
   // println(k)
   // kmeansModel.clusterCenters.foreach(println)
    kmeansModel.computeCost(assembler.transform(data))  / data.count()
}

Evaluate the score for each k from 2 to 10.

In [14]:
val numericOnly = rawData.drop("Species").cache() 
val y = (2 to 10 by 1).map(k => (k, clusteringScore0(numericOnly, k))).toDF
y.printSchema()
y.show()
numericOnly.unpersist()

root
 |-- _1: integer (nullable = false)
 |-- _2: double (nullable = false)

+---+-------------------+
| _1|                 _2|
+---+-------------------+
|  2| 1.0156530117357272|
|  3| 0.5257044388398439|
|  4|0.38152315476190646|
|  5|0.34536695652173915|
|  6|0.30395081007115815|
|  7|0.26813242661250813|
|  8|0.22860835891496767|
|  9|0.19223666567636433|
| 10|0.22640852294764366|
+---+-------------------+



[SepalLength: double, SepalWidth: double ... 2 more fields]

In [15]:
%%brunel data('y') x(_1) y(_2) line

In [16]:
y.printSchema()

root
 |-- _1: integer (nullable = false)
 |-- _2: double (nullable = false)



### Data Normalization

Since Euclidean distance is used, the clusters will be influenced strongly by the magnitudes of the variables, especially by outliers. Normalizing removes this bias. We can normalize each feature by converting it to a standard score. This means subtracting the mean of the feature’s values from each value, and dividing by the standard deviation.

StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
* withStd: True by default. Scales the data to unit standard deviation.
* withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
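The effect of the two options can be sketched on a single feature column in plain Scala. The function `standardize` is hypothetical and not part of Spark; its flags simply mirror the parameter names above, and the sketch assumes the corrected sample standard deviation (division by n - 1).

```scala
// Illustrative plain-Scala standardization of one feature column.
// The flag names mirror the StandardScaler parameters described above;
// the function itself is hypothetical and is not part of Spark.
def standardize(values: Seq[Double], withMean: Boolean, withStd: Boolean): Seq[Double] = {
  val n = values.size
  val mean = values.sum / n
  // Corrected sample standard deviation (divides by n - 1); an assumption of this sketch.
  val std = math.sqrt(values.map(v => (v - mean) * (v - mean)).sum / (n - 1))
  values.map { v =>
    val centered = if (withMean) v - mean else v
    // Guard against constant columns, whose standard deviation is zero.
    if (withStd && std > 0) centered / std else centered
  }
}

val column = Seq(2.0, 4.0, 6.0)                                        // made-up values
val zScores = standardize(column, withMean = true, withStd = true)     // centered and scaled
val scaledOnly = standardize(column, withMean = false, withStd = true) // scaled only
```

With `withMean = false`, zeros in the input stay zero, which is why that setting preserves sparsity while centering does not.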



In [17]:
def clusteringScore2(data: DataFrame, k: Int): Double = {
    val assembler = new VectorAssembler().
      setInputCols(data.columns.filter(_ != "label")).
      setOutputCol("featureVector")

    val scaler = new StandardScaler()
      .setInputCol("featureVector")
      .setOutputCol("scaledFeatureVector")
      .setWithStd(true)
      .setWithMean(true)

    val kmeans = new KMeans().
      setSeed(Random.nextLong()).
      setK(k).
      setPredictionCol("cluster").
      setFeaturesCol("scaledFeatureVector").
      setMaxIter(40).
      setTol(1.0e-5)

    val pipeline = new Pipeline().setStages(Array(assembler, scaler, kmeans))
    val pipelineModel = pipeline.fit(data)
    
    val kmeansModel = pipelineModel.stages.last.asInstanceOf[KMeansModel]
   // println(k)
   // kmeansModel.clusterCenters.foreach(println)
    kmeansModel.computeCost(pipelineModel.transform(data)) / data.count()
    
}


### Choice of K with normalized data

In [18]:
val numericOnly = rawData.drop("Species").cache() 
val y= (2 to 10 by 1).map(k => (k, clusteringScore2(numericOnly, k))).toDF
y.printSchema()
y.show()
numericOnly.unpersist()


root
 |-- _1: integer (nullable = false)
 |-- _2: double (nullable = false)

+---+-------------------+
| _1|                 _2|
+---+-------------------+
|  2|  1.472528623990597|
|  3|  1.265008257383279|
|  4| 0.7582745924996086|
|  5| 0.6406954843659359|
|  6| 0.5434103175361199|
|  7| 0.4779086111437122|
|  8| 0.4203335865252961|
|  9|0.39630754446587546|
| 10|0.31247528774992267|
+---+-------------------+



[SepalLength: double, SepalWidth: double ... 2 more fields]

In [19]:
%%brunel data('y') x(_1) y(_2) line

In [20]:
y.printSchema()

root
 |-- _1: integer (nullable = false)
 |-- _2: double (nullable = false)



### Definition of Entropy measure

There are different metrics for homogeneity. Entropy is used here for illustration. A good clustering would have clusters whose collections of labels are homogeneous and so have low entropy. A weighted average of entropy can therefore be used as a cluster score.
This is an external quality measure.

In [21]:
def entropy(counts: Iterable[Int]): Double = {
    val values = counts.filter(_ > 0)
    val n = values.map(_.toDouble).sum
    values.map { v =>
      val p = v / n
      -p * math.log(p)
    }.sum
}
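As a small worked example, the weighted average of entropy can be computed by hand from the label counts in the k = 2 crosstab output earlier in this notebook. The sketch below repeats the entropy function so that it runs on its own; the name `labelCountsByCluster` is an assumption introduced for illustration.

```scala
// Self-contained copy of the entropy function so the sketch runs on its own.
def entropy(counts: Iterable[Int]): Double = {
  val values = counts.filter(_ > 0)
  val n = values.map(_.toDouble).sum
  values.map { v =>
    val p = v / n
    -p * math.log(p)
  }.sum
}

// Label counts per cluster, copied from the k = 2 crosstab shown earlier.
val labelCountsByCluster: Map[Int, Seq[Int]] = Map(
  0 -> Seq(50, 47), // virginica, versicolor
  1 -> Seq(50, 3))  // setosa, versicolor

// Total number of observations across all clusters.
val total = labelCountsByCluster.values.map(_.sum).sum

// Entropy of each cluster weighted by cluster size, then averaged.
val weightedEntropy = labelCountsByCluster.values
  .map(counts => counts.sum * entropy(counts))
  .sum / total
```

Cluster 0 mixes two species almost evenly, so its entropy is close to the two-class maximum of ln 2, while cluster 1 is nearly pure and contributes much less to the weighted score.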

### Entropy for choosing K

In [22]:
// to be reviewed from here on

def fitPipeline4(data: DataFrame, k: Int): PipelineModel = {
    // Assemble every column except the Species label into a single feature vector
   val assembler = new VectorAssembler().
      setInputCols(data.columns.filter(_ != "Species")).
      setOutputCol("featureVector")

   val scaler = new StandardScaler()
      .setInputCol("featureVector")
      .setOutputCol("scaledFeatureVector")
      .setWithStd(true)
      .setWithMean(false)

    val kmeans = new KMeans().
      setSeed(Random.nextLong()).
      setK(k).
      setPredictionCol("cluster").
      setFeaturesCol("scaledFeatureVector").
      setMaxIter(40).
      setTol(1.0e-5)

    val pipeline = new Pipeline().setStages(
      Array(assembler, scaler, kmeans))
    pipeline.fit(data)
}

In [23]:
  def clusteringScore4(data: DataFrame, k: Int): Double = {
    val pipelineModel = fitPipeline4(data, k)

    // Predict cluster for each datum
    val clusterLabel = pipelineModel.transform(data).
      select("cluster", "Species").as[(Int, String)]
    val weightedClusterEntropy = clusterLabel.
      // Extract collections of labels, per cluster
      groupByKey { case (cluster, _) => cluster }.
      mapGroups { case (_, clusterLabels) =>
        val labels = clusterLabels.map { case (_, label) => label }.toSeq
        // Count labels in collections
        val labelCounts = labels.groupBy(identity).values.map(_.size)
        labels.size * entropy(labelCounts)
      }.collect()

    // Average entropy weighted by cluster size
    weightedClusterEntropy.sum / data.count()
  }

 

In [24]:
// val numericOnly = rawData.drop("Species").cache()
val y = (2 to 10 by 1).map(k => (k, clusteringScore4(rawData, k))).toDF

val pipelineModel = fitPipeline4(rawData, 3)
val countByClusterLabel = pipelineModel.transform(rawData).
  select("cluster", "Species").
  groupBy("cluster", "Species").count().
  orderBy("cluster", "Species")
countByClusterLabel.show()

y.printSchema()

+-------+----------+-----+
|cluster|   Species|count|
+-------+----------+-----+
|      0|    setosa|   17|
|      0|versicolor|    7|
|      1|versicolor|   43|
|      1| virginica|   50|
|      2|    setosa|   33|
+-------+----------+-----+

root
 |-- _1: integer (nullable = false)
 |-- _2: double (nullable = false)



In [25]:
%%brunel data('y') x(_1) y(_2) line
