# K Means Clustering

This notebook clusters the UCI seeds dataset: https://archive.ics.uci.edu/ml/datasets/seeds.

Attribute Information:

To construct the data, seven geometric parameters of wheat kernels were measured:
1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient,
7. length of kernel groove.
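
As a quick sanity check on the compactness definition, the formula can be evaluated for a single kernel. The sketch below uses the area and perimeter of the first row of the dataset (15.26 and 14.84) and compares the result with the stored compactness of 0.871. The helper name `compactness` is made up for this illustration.

```python
import math

def compactness(area, perimeter):
    """Compactness C = 4*pi*A / P^2; equals 1.0 for a perfect circle."""
    return 4 * math.pi * area / perimeter ** 2

# Area and perimeter of the first row of the seeds dataset.
c = compactness(15.26, 14.84)
print(round(c, 3))  # agrees with the stored compactness of 0.871

# A circle of radius 1 (area pi, perimeter 2*pi) has compactness 1.
print(compactness(math.pi, 2 * math.pi))
```

Because compactness is derived from area and perimeter, it is not an independent measurement. It is a dimensionless ratio that stays at or below 1.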

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cluster').getOrCreate()

In [2]:
from pyspark.ml.clustering import KMeans

# Loads data.
dataset = spark.read.csv("seeds_dataset.csv",header=True,inferSchema=True)

In [3]:
dataset.head()

Row(area=15.26, perimeter=14.84, compactness=0.871, length_of_kernel=5.763, width_of_kernel=3.312, asymmetry_coefficient=2.221, length_of_groove=5.22)

In [4]:
dataset.describe().show()

+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|summary|              area|         perimeter|         compactness|   length_of_kernel|   width_of_kernel|asymmetry_coefficient|   length_of_groove|
+-------+------------------+------------------+--------------------+-------------------+------------------+---------------------+-------------------+
|  count|               210|               210|                 210|                210|               210|                  210|                210|
|   mean|14.847523809523816|14.559285714285718|  0.8709985714285714|  5.628533333333335| 3.258604761904762|   3.7001999999999997|  5.408071428571429|
| stddev|2.9096994306873647|1.3059587265640225|0.023629416583846364|0.44306347772644983|0.3777144449065867|   1.5035589702547392|0.49148049910240543|
|    min|             10.59|             12.41|              0.8081|              4.899|            

## Format the Data

In [6]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [7]:
dataset.columns

['area',
 'perimeter',
 'compactness',
 'length_of_kernel',
 'width_of_kernel',
 'asymmetry_coefficient',
 'length_of_groove']

In [8]:
vec_assembler = VectorAssembler(inputCols = dataset.columns, outputCol='features')

In [9]:
final_data = vec_assembler.transform(dataset)

In [10]:
final_data.show()

+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
| area|perimeter|compactness|  length_of_kernel|   width_of_kernel|asymmetry_coefficient|  length_of_groove|            features|
+-----+---------+-----------+------------------+------------------+---------------------+------------------+--------------------+
|15.26|    14.84|      0.871|             5.763|             3.312|                2.221|              5.22|[15.26,14.84,0.87...|
|14.88|    14.57|     0.8811| 5.553999999999999|             3.333|                1.018|             4.956|[14.88,14.57,0.88...|
|14.29|    14.09|      0.905|             5.291|3.3369999999999997|                2.699|             4.825|[14.29,14.09,0.90...|
|13.84|    13.94|     0.8955|             5.324|3.3789999999999996|                2.259|             4.805|[13.84,13.94,0.89...|
|16.14|    14.99|     0.9034|5.6579999999999995|             3.562|                1.355| 

## Scale the data

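Scale the features before clustering so that columns with large numeric ranges do not dominate the distance computation. Distance-based methods are also affected by [the curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality).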

In [11]:
from pyspark.ml.feature import StandardScaler

In [13]:
scaler = StandardScaler(inputCol = "features", outputCol = "scaledFeatures", withStd=True, withMean=False)

In [14]:
scalerModel = scaler.fit(final_data)

In [15]:
final_data = scalerModel.transform(final_data)
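
To make the scaler settings concrete: with `withStd=True` and `withMean=False`, each feature is divided by its standard deviation and is not centered. The plain-Python sketch below mimics that for one column, assuming the sample (n - 1) standard deviation is used. The function `scale_by_std` and the toy values are hypothetical and are not part of the Spark API or the seeds data.

```python
from statistics import stdev

def scale_by_std(column):
    """Divide each value by the column's sample standard deviation (no centering)."""
    s = stdev(column)  # sample (n - 1) standard deviation
    return [x / s for x in column]

# Hypothetical toy column, not taken from the seeds data.
toy_column = [10.0, 12.0, 14.0, 16.0, 18.0]
scaled = scale_by_std(toy_column)

print(scaled)
print(stdev(scaled))  # close to 1.0: the scaled column has unit standard deviation
```

Since the mean is not subtracted, the scaled values keep their sign and relative order. Only their spread changes.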

## Train Model

In [36]:
kmeans = KMeans(featuresCol='scaledFeatures',k=3)
model = kmeans.fit(final_data)

## Evaluate
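Evaluate the clustering with the Within Set Sum of Squared Errors (WSSSE): the sum of squared distances from each point to its nearest cluster center. For a fixed k, lower is better.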

In [37]:
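# Note: computeCost was deprecated in Spark 2.4 and removed in 3.0. On newer
# versions, model.summary.trainingCost reports this cost for the training data.
wssse = model.computeCost(final_data)
print("Within Set Sum of Squared Errors = " + str(wssse))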

Within Set Sum of Squared Errors = 428.60820118716356
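
For intuition about what this number measures, here is a minimal plain-Python sketch of the same quantity on toy data. The function name `wssse`, the points, and the centers are all hypothetical. It illustrates the definition only and is not how Spark computes the cost internally.

```python
def wssse(points, centers):
    """Sum over points of the squared Euclidean distance to the nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total

# Hypothetical 2-D toy data with two obvious groups.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at distance 1 from its nearest center, so the total is 4.0.
print(wssse(points, centers))
```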


In [38]:
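# Show the cluster centers. The model was fit on scaledFeatures,
# so these coordinates are in scaled units, not the original ones.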
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)

Cluster Centers: 
[ 4.96198582 10.97871333 37.30930808 12.44647267  8.62880781  1.80061978
 10.41913733]
[ 4.07497225 10.14410142 35.89816849 11.80812742  7.54416916  3.15410901
 10.38031464]
[ 6.35645488 12.40730852 37.41990178 13.93860446  9.7892399   2.41585013
 12.29286107]
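
These centers are hard to read because they are expressed in scaled units. Since the scaler only divided by the standard deviation (`withMean=False`), a center can be mapped back to the original units by multiplying each coordinate by that column's standard deviation. The sketch below does this for the first printed center, using the `stddev` row from the `describe()` output earlier. It assumes the scaler used those same sample standard deviations.

```python
# Column names, in the order used by the VectorAssembler.
columns = ['area', 'perimeter', 'compactness', 'length_of_kernel',
           'width_of_kernel', 'asymmetry_coefficient', 'length_of_groove']

# Sample standard deviations, copied from the describe() output.
stddevs = [2.9096994306873647, 1.3059587265640225, 0.023629416583846364,
           0.44306347772644983, 0.3777144449065867, 1.5035589702547392,
           0.49148049910240543]

# First cluster center as printed (scaled units).
center_scaled = [4.96198582, 10.97871333, 37.30930808, 12.44647267,
                 8.62880781, 1.80061978, 10.41913733]

# Undo the scaling: original value = scaled value * standard deviation.
center_original = [c * s for c, s in zip(center_scaled, stddevs)]

for name, value in zip(columns, center_original):
    print(name, round(value, 3))
```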


In [39]:
model.transform(final_data).select('prediction').show()

+----------+
|prediction|
+----------+
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         2|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         0|
|         1|
+----------+
only showing top 20 rows



## Optimal K

In [34]:
for k in range(2, 21):
    kmeans = KMeans(featuresCol='scaledFeatures', k=k)
    model = kmeans.fit(final_data)
    wssse = model.computeCost(final_data)
    print(str(k) + ":Within Set Sum of Squared Errors = " + str(wssse))

2:Within Set Sum of Squared Errors = 656.7932253385325
3:Within Set Sum of Squared Errors = 428.60820118716356
4:Within Set Sum of Squared Errors = 380.89132510833224
5:Within Set Sum of Squared Errors = 330.76275833275713
6:Within Set Sum of Squared Errors = 298.3094949234943
7:Within Set Sum of Squared Errors = 261.9450937266424
8:Within Set Sum of Squared Errors = 257.0716512204947
9:Within Set Sum of Squared Errors = 244.31641522926247
10:Within Set Sum of Squared Errors = 213.24847477392632
11:Within Set Sum of Squared Errors = 189.3210964863838
12:Within Set Sum of Squared Errors = 194.6960479891054
13:Within Set Sum of Squared Errors = 173.27820078129628
14:Within Set Sum of Squared Errors = 165.55880281710859
15:Within Set Sum of Squared Errors = 174.85103100863853
16:Within Set Sum of Squared Errors = 153.99977599563772
17:Within Set Sum of Squared Errors = 148.89703933579267
18:Within Set Sum of Squared Errors = 138.39814097760518
19:Within Set Sum of Squared Errors = 134.115

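WSSSE keeps falling as k grows, which is expected: more centers always fit the points more closely. The decrease is not strictly monotone (k=12 and k=15 are slightly worse than their predecessors) because k-means only finds a local optimum. We know there are only 3 clusters, so values of k well above 3 are definitely overfitting.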
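
One common heuristic for picking k from a table like the one above is the elbow method: look for the k after which adding another cluster stops paying off. The sketch below is a rough version of that idea, using the first few WSSSE values from the output, rounded to two decimals. It reports the k with the largest single improvement over k - 1. The variable names are made up for this illustration, and since no k=1 value is available, k=2 itself cannot be scored.

```python
# WSSSE values copied from the loop output (rounded to two decimals).
wssse_by_k = {2: 656.79, 3: 428.61, 4: 380.89, 5: 330.76,
              6: 298.31, 7: 261.95, 8: 257.07}

# Improvement gained by moving from k - 1 to k clusters.
drops = {k: wssse_by_k[k - 1] - wssse_by_k[k]
         for k in wssse_by_k if k - 1 in wssse_by_k}

for k, d in sorted(drops.items()):
    print(k, round(d, 2))

# The k with the largest single improvement is a candidate elbow.
best_k = max(drops, key=drops.get)
print("largest drop at k =", best_k)
```

On these numbers the largest drop occurs when going from 2 to 3 clusters, which matches the known number of clusters in this dataset.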