# Clustering Evaluation

In [10]:
import warnings
warnings.filterwarnings('ignore')

First, get the data and convert categorical variables to numerical variables.

In [1]:
import pandas

iris = pandas.read_csv('../Datasets/iris.csv')
iris = pandas.get_dummies(iris)
iris.sample(3)

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species_setosa | Species_versicolor | Species_virginica |
|---|---|---|---|---|---|---|---|
| 92 | 5.8 | 2.6 | 4.0 | 1.2 | 0 | 1 | 0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 0 | 0 | 1 |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 | 0 | 0 | 1 |


Second, standardize the data. This might not be necessary; in practice, we might want to compare three cases: unprocessed, normalized, and standardized data.

In [8]:
from sklearn.preprocessing import StandardScaler

data = StandardScaler().fit_transform(iris)
data[0]

array([-0.90068117,  1.01900435, -1.34022653, -1.3154443 ,  1.41421356,
       -0.70710678, -0.70710678])
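The last three values in the output above come from the one-hot species columns, and they can be reproduced by hand. The sketch below assumes the CSV is the standard iris dataset with 50 samples per species (so each dummy column holds 50 ones and 100 zeros), and it relies on `StandardScaler` dividing by the population standard deviation.

```python
import statistics

# Assumption: standard iris data, 50 setosa rows out of 150.
species_setosa = [1] * 50 + [0] * 100

mean = statistics.fmean(species_setosa)
std = statistics.pstdev(species_setosa)  # population std (ddof=0)

z_one = (1 - mean) / std    # standardized value for a row of this species
z_zero = (0 - mean) / std   # standardized value for any other row

print(round(z_one, 8), round(z_zero, 8))
```

Under that assumption the two values agree with the 1.41421356 and -0.70710678 entries of `data[0]`, which is a setosa row.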

Third, we might consider transforming the data using PCA.  For now, we'll just cluster the standardized data.

In [18]:
from sklearn.cluster import KMeans

kmean = KMeans(n_clusters=6)
kmean.fit(data)

### Evaluating Clustering Performance

Often, we don't have a "ground truth" (e.g. species) against which to evaluate the result of a clustering.

Two popular methods of evaluating clustering performance:
* inertia
* silhouette score

The silhouette score of a clustering is between -1 (bad) and 1 (good). It combines how tight the clusters are with how far apart they are from each other.

In contrast, inertia (the sum of squared distances from each point to its cluster center) measures only how tight the clusters are; lower is tighter.
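To make the point about inertia concrete, here is a minimal sketch that computes it by hand on hypothetical toy data. It uses the cluster means as centers, which is what k-means centers converge to; the `inertia` helper is illustrative and not part of scikit-learn.

```python
def inertia(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Hypothetical toy data: two tight clusters.
labels = [0, 0, 1, 1]
tight = inertia([(0, 0), (0, 2), (10, 0), (10, 2)], labels)

# Moving the second cluster much further away leaves the inertia unchanged.
shifted = inertia([(0, 0), (0, 2), (100, 0), (100, 2)], labels)
print(tight, shifted)
```

Both calls give the same number, because the separation between clusters never enters the sum.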

First, we need to cluster the data.

In [13]:
kmean.fit(data)


In [14]:
print(kmean.inertia_)

85.28257238186514


In [16]:
from sklearn.metrics import silhouette_score 

In [17]:
print(silhouette_score(data, kmean.labels_))

0.39853132283140574


#### Silhouette coefficient

$s(x) = \frac{b(x) - a(x)}{\max(a(x), b(x))}$, which simplifies to $s(x) = \frac{b(x) - a(x)}{b(x)}$ when a(x) < b(x).

* a(x) - the average distance between x and the other points in the same cluster as x.
* b(x) - the smallest average distance between x and the points of any other cluster, i.e. the average distance to the nearest neighboring cluster.
* Given a typical dataset and a decent algorithm, a(x) < b(x).


Observations:
* If the cluster containing x is tight, a(x) is small.
* If the cluster containing x is far from the other clusters, b(x) is large.
* If a clustering algorithm does a good job, s(x) is close to 1.

#### Silhouette score

The silhouette score is the average of the silhouette coefficients of all data points.
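The definitions of a(x), b(x), and the score can be checked by hand on a tiny dataset. The following is a minimal pure-Python sketch, not scikit-learn's implementation; the `silhouette_coefficients` helper and the toy points are hypothetical, and it assumes Euclidean distance, at least two clusters, and no cluster made up solely of identical points.

```python
import math
from statistics import fmean

def silhouette_coefficients(points, labels):
    """Return s(x) for every point, following the a(x) / b(x) definitions."""
    coefficients = []
    for i, (x, own) in enumerate(zip(points, labels)):
        # Distances from x to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (y, label) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(label, []).append(math.dist(x, y))

        if own not in by_cluster:
            # x is alone in its cluster; s(x) is set to 0 by convention.
            coefficients.append(0.0)
            continue

        a = fmean(by_cluster[own])
        b = min(fmean(d) for label, d in by_cluster.items() if label != own)
        coefficients.append((b - a) / max(a, b))
    return coefficients

# Hypothetical toy data: two tight clusters, far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
labels = [0, 0, 1, 1]

coefficients = silhouette_coefficients(points, labels)
score = fmean(coefficients)  # the silhouette score is the mean coefficient
print(coefficients, score)
```

For these points a(x) is 2 and b(x) is roughly 10.1 for every x, so each coefficient, and therefore the score, comes out at about 0.80.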

In [25]:
#### Exercise: compare the inertia and silhouette scores for various numbers of clusters

import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import seaborn

# 1. Get the data
iris = pandas.read_csv('../Datasets/iris.csv')
iris = pandas.get_dummies(iris)
data = StandardScaler().fit_transform(iris)

# 2. Show the inertia and silhouette score for n_clusters from 2 to 39.
    
# 3. Visualize the inertias and estimate the best value of n_clusters (i.e. identify the elbow)
inertias = []

# seaborn.relplot(x=range(2,40), y=inertias)

# 4. Visualize the silhouette scores and estimate the best value of n_clusters (i.e. identify the peak)
scores = []

# seaborn.relplot(x=range(2,40), y=scores)

447.361704965023 0.560966798611912
166.5385159539896 0.6574023399250883
136.90766779486216 0.5707728066997068
111.1794098707951 0.46770952009617267
85.52894536833564 0.390927305096967
76.14019868247635 0.3928383513184
65.81538053691301 0.3882397524062864
57.67104648196529 0.4073348278800671
50.2547000619101 0.38123470997716946
46.3518906880454 0.38866440513301637
43.756209399427576 0.36344244140458426
40.477494647181516 0.37028310773206036
37.89012717453981 0.3523906593031642
34.78878506395903 0.3696123919934227
33.25417433959703 0.3586522329066216
31.445488204667676 0.34322354117848464
29.64925414755887 0.333795922814855
27.186395725690925 0.3358492033981683
26.425581083513585 0.3461044039808576
25.243807976856274 0.3191030180126243
23.169971865671066 0.32793982688810686
22.778003201594395 0.2853266914798472
21.54074506598006 0.29457786891532406
20.306878657946136 0.3218623763721918
19.85957362543118 0.3290551500707977
19.009376061008936 0.3009859769443441
18.637746910100606 0.3005105
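One possible way to fill in steps 2 to 4 of the exercise is sketched below. It is not the code that produced the numbers above: as an assumption, it uses scikit-learn's bundled iris measurements in place of `'../Datasets/iris.csv'` (so the one-hot species columns are absent), a shorter range of `n_clusters`, and fixed `n_init` and `random_state` values chosen only for repeatability.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled iris data stands in for '../Datasets/iris.csv'.
data = StandardScaler().fit_transform(load_iris().data)

ks = range(2, 11)  # shorter than the exercise's range, to keep the sketch fast
inertias = []
scores = []
for k in ks:
    # n_init and random_state are fixed only to make the run repeatable.
    kmean = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(kmean.inertia_)
    scores.append(silhouette_score(data, kmean.labels_))
    print(k, inertias[-1], scores[-1])

# The silhouette score is read off at its maximum. Inertia has no such peak,
# which is why it is judged by the elbow of its curve instead.
best_k = max(zip(scores, ks))[1]
print('n_clusters with the highest silhouette score:', best_k)
```

With `inertias` and `scores` filled in, the commented-out `seaborn.relplot` calls from the exercise can be enabled, using `x=list(ks)` so that the x values match the shorter range.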