# Evaluating cluster performance

## 1. Evaluating clusters when labels are known

### 1.1 Adjusted Rand Index

### 1.2 Fowlkes-Mallows score

## 2. Evaluating clusters when labels are not known

### 2.1 The silhouette coefficient

If the true cluster labels are unknown, the model itself must be used for evaluation. The silhouette coefficient ranges from -1 to 1, and a higher score indicates a model with better-defined clusters. For each sample, two quantities are used to compute it:

- **a**: the average distance between the sample and all other points in the same cluster.
- **b**: the average distance between the sample and all points in the next nearest cluster.

The silhouette coefficient of a single sample is then given by:

$$ s = \dfrac{b-a}{\max(a, b)} $$
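As a rough check of the definition, the sketch below computes $a$, $b$ and $s = (b - a) / \max(a, b)$ for every sample by hand and compares the result with scikit-learn's `silhouette_samples` and `silhouette_score`. The data set `X_demo`, the labels `labels_demo` and the helper `silhouette_by_hand` are illustrative assumptions, not objects from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Synthetic stand-in data (an assumption; this is not the X used in the notebook).
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
labels_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_demo)


def silhouette_by_hand(X, labels):
    """Hypothetical helper: per-sample silhouette values s = (b - a) / max(a, b)."""
    D = pairwise_distances(X, metric="euclidean")
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a sample alone in its cluster keeps s = 0
        # a: mean distance to the other points of the same cluster
        # (D[i, i] is 0, so dividing by n_same - 1 excludes the sample itself)
        a = D[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


s_manual = silhouette_by_hand(X_demo, labels_demo)
# Per-sample values should agree with scikit-learn's implementation.
print(np.allclose(s_manual, silhouette_samples(X_demo, labels_demo), atol=1e-6))
# The model-level score is the mean of the per-sample values.
print(np.isclose(s_manual.mean(), silhouette_score(X_demo, labels_demo), atol=1e-6))
```

The explicit loop is written for readability rather than speed; for real data, `silhouette_score` is the better choice.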

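The score of the whole model is the mean of $s$ over all samples. Applied to our earlier model, it can be computed with `metrics.silhouette_score`: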

In [66]:
from sklearn import metrics

# Cluster assignments of the k-means model fitted earlier
labels = k_means.labels_

# Mean silhouette coefficient over all samples
metrics.silhouette_score(X, labels, metric='euclidean')

0.6805359856413705

This number is not very informative by itself; it only becomes meaningful when compared with the coefficient of another model. Let's look at the coefficient for a model fitted with only 6 clusters.

In [69]:
k_means_6 = KMeans(n_clusters=6)
k_means_6.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=6, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [70]:
labels_6 = k_means_6.labels_
metrics.silhouette_score(X, labels_6, metric='euclidean')

0.661640578627616

The model with 7 clusters scores higher (about 0.68 versus 0.66), so according to the silhouette coefficient it produces the better-defined clusters.
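The same comparison can be extended to a whole range of cluster counts. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_blobs` (named `X_demo` here, not the `X` of this notebook) and fixes `random_state` so that the runs are repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption, used for illustration only).
X_demo, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.7, random_state=42)

# Fit one k-means model per candidate k and record its silhouette score.
scores = {}
for k in range(2, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = silhouette_score(X_demo, model.labels_, metric="euclidean")

# The candidate with the highest score has the best-defined clusters
# according to this metric.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

The silhouette coefficient needs at least 2 clusters, which is why the loop starts at `k=2`.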

### 2.2 Calinski-Harabasz Index

Much like the silhouette coefficient, the Calinski-Harabasz index (`sklearn.metrics.calinski_harabasz_score`, spelled `calinski_harabaz_score` in older scikit-learn releases) can be used to evaluate the model when class labels are not known a priori. Again, a higher CH score means that the model has better-defined clusters.

For $k$ clusters, the score $s$ is the ratio of the between-cluster dispersion to the within-cluster dispersion:

$$ s(k) = \dfrac{Tr(B_k)}{Tr(W_k)}\times \dfrac{N-k}{k-1}$$

Here, $B_k$ is the between-cluster dispersion matrix and $W_k$ is the within-cluster dispersion matrix:

$$W_k = \sum^k_{q=1} \sum_{x\in C_q} (x - c_q)(x-c_q)^T$$
$$B_k = \sum_{q} n_q (c_q - c)(c_q-c)^T$$

where 

- $N$ is the number of samples in the data set $E$
- $C_q$ is the set of samples in cluster $q$
- $c_q$ is the center of cluster $q$
- $c$ is the center of $E$
- $n_q$ is the number of samples in cluster $q$.
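To make the formula concrete, the sketch below evaluates it term by term and compares the result with scikit-learn's own function. It relies on the fact that the trace of a sum of outer products $(x - c_q)(x - c_q)^T$ equals the sum of the squared distances $\lVert x - c_q \rVert^2$, so the matrices never need to be built. The names `X_demo`, `labels_demo` and `ch_by_hand` are illustrative assumptions, and the import uses the current spelling `calinski_harabasz_score`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic stand-in data (an assumption; this is not the X used in the notebook).
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=1)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)


def ch_by_hand(X, labels):
    """Hypothetical helper: Tr(B_k) / Tr(W_k) * (N - k) / (k - 1)."""
    N = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    c = X.mean(axis=0)  # center of the whole data set E
    tr_W = 0.0
    tr_B = 0.0
    for q in clusters:
        C_q = X[labels == q]        # samples in cluster q
        c_q = C_q.mean(axis=0)      # center of cluster q
        tr_W += ((C_q - c_q) ** 2).sum()           # within-cluster term
        tr_B += len(C_q) * ((c_q - c) ** 2).sum()  # between-cluster term
    return (tr_B / tr_W) * (N - k) / (k - 1)


manual = ch_by_hand(X_demo, labels_demo)
reference = calinski_harabasz_score(X_demo, labels_demo)
print(manual, reference)
```

Both printed values should agree up to floating-point rounding.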

In [72]:
# Named calinski_harabaz_score in older scikit-learn releases
metrics.calinski_harabasz_score(X, labels)

4057.9994309392187

In [73]:
metrics.calinski_harabasz_score(X, labels_6)

2696.6914224636153

Here too, the CH index is higher for the model with 7 clusters, in agreement with the silhouette coefficient.

Sources:

https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

http://scikit-learn.org/stable/modules/clustering.html