# Clustering quality metrics

Clustering quality metrics can be divided into two groups: internal and external indices. Internal indices are based only on the clustered data, without a reference clustering that we could benchmark against. In other words, we don't have labelled data to compare with. With external indices we know the expected output, so we can calculate metrics such as accuracy, F1-score and others known from supervised learning.

In this section we focus on internal indices only. They can be split into two groups:
- heterogeneity,
- homogeneity.

Four heterogeneity and two homogeneity methods are presented in this notebook.

## Libraries

Clustering quality metrics are implemented using only two libraries: ``numpy`` and ``math``. The second one is used to calculate the Euclidean distance.

In [1]:
import numpy as np
from math import sqrt

def calculate_distance(x,v):
    return sqrt((x[0]-v[0])**2+(x[1]-v[1])**2)
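
The function above reads only the first two coordinates of each point, so it assumes two-dimensional data. A minimal sketch of a variant that works for any dimension is shown below. It assumes both points are array-like and of equal length; the name `calculate_distance_nd` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np

def calculate_distance_nd(x, v):
    """Euclidean distance between two points of equal, arbitrary dimension."""
    x = np.asarray(x, dtype=float)
    v = np.asarray(v, dtype=float)
    # the Euclidean norm of the difference vector is the distance
    return float(np.linalg.norm(x - v))

print(calculate_distance_nd([0.0, 0.0], [3.0, 4.0]))  # 5.0 (3-4-5 triangle)
```

For two-dimensional points it should agree with `calculate_distance` up to floating-point rounding.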

## Data set

We compare two clusterings, both of which are results of the HCM method. The first one is the HCM run that divided the data set into two groups; we just restore the variables ``new_assignation_hcm`` and ``new_centers_hcm``. We also define a possible clustering into three groups and save it in ``new_assignation_hcm3`` and ``new_centers_hcm3``.

In [2]:
%store -r new_assignation_hcm
%store -r new_centers_hcm
%store -r data_set

new_assignation_hcm = np.array(new_assignation_hcm)
new_centers_hcm = np.array(new_centers_hcm)

new_assignation_hcm3 = np.array([[0., 1., 0.],[0., 1., 0.], [0., 1., 0.],[0., 1., 0.], [0., 1., 0.], [0., 1., 0.], [0., 0., 1.], [0., 0., 1.], [0., 0., 1.], [1., 0., 0.]])
new_centers_hcm3 = np.array([[0.42239686, 0.38503185],[0.07858546, 0.17832272],[0.82907662, 0.97059448]])
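
Both assignment matrices are one-hot encoded: row $i$ has a single 1 in the column of the cluster that object $i$ belongs to. The sketch below shows one way to check that property and to read off the cluster sizes. The helper name `cluster_sizes` is an assumption; the matrix is the three-cluster one defined above, repeated so that the block runs on its own.

```python
import numpy as np

def cluster_sizes(assignation):
    """Validate a one-hot assignment matrix and return the size of each cluster."""
    assignation = np.asarray(assignation)
    only_zeros_and_ones = np.isin(assignation, (0, 1)).all()
    one_cluster_per_object = np.all(assignation.sum(axis=1) == 1)
    if not (only_zeros_and_ones and one_cluster_per_object):
        raise ValueError("each row must contain exactly one 1 and zeros elsewhere")
    # the column sums count the objects assigned to each cluster
    return assignation.sum(axis=0).astype(int)

# same rows as new_assignation_hcm3: six, three and one object per cluster
assignation3 = np.array([[0., 1., 0.]] * 6 + [[0., 0., 1.]] * 3 + [[1., 0., 0.]])
print(cluster_sizes(assignation3))  # [1 6 3]
```

The first cluster of the three-group clustering therefore contains a single object, which matters for the within-cluster measures later on.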

## Homogeneity and heterogeneity

We have four separation measures $s_{1}(c_{i},c_{j})$, $s_{2}(c_{i},c_{j})$,
$s(s_{1})$ and $s(s_{2})$. The first two describe how far apart two clusters are. For $s_{1}$ we take the objects of two different clusters and measure the distance between every pair of objects, one from each cluster.

The metric $s_{1}$ can be calculated as follows:
\begin{equation}
 s_{1}(c_{i},c_{j})=\frac{1}{n_{i}n_{j}}\sqrt{\sum_{x_{1}\in c_{i},x_{2}\in c_{j}}d^{2}(x_{1},x_{2})}.
\end{equation}
This measure can be implemented as below.

In [3]:
def calculate_s_1(centers,assignation):
    s1 = []
    for center_1 in range(len(centers)):
        for center_2 in range(len(centers)):
            if center_1 == center_2:
                break
            ids_1 = np.where(assignation[:, center_1] == 1)[0]
            ids_2 = np.where(assignation[:, center_2] == 1)[0]
            elements_1 = data_set[ids_1]
            elements_2 = data_set[ids_2]
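            # start from the normalising factor 1 / (n_i * n_j)
            s_1 = 1.0 / (len(ids_1) * len(ids_2))
            for element_1 in elements_1:
                for element_2 in elements_2:
                    # note: the running value is multiplied by each pairwise distance
                    # (sqrt(d ** 2) equals d); the squared distances are not summed under one root
                    s_1 = s_1 * sqrt(calculate_distance(element_1, element_2) ** 2)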
            s1.append(s_1)
    return s1

In the equation, we take two objects, one from each cluster, and calculate the squared distance between them. Next, we sum these squared distances over all pairs of objects from the two clusters and take the square root of the sum. The value is then divided by the product of the numbers of objects in both clusters. A higher value is better.

In [4]:
s1_2 = calculate_s_1(new_centers_hcm,new_assignation_hcm)
s1_3 = calculate_s_1(new_centers_hcm3,new_assignation_hcm3)

print(s1_2)
print(s1_3)

[0.10062820443387574]
[0.0007587103186922478, 0.12913857784026206, 0.3030325350305444]


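In the three-group clustering the values are listed for the cluster pairs (second, first), (third, first) and (third, second). The highest value belongs to the last pair, so by this measure the second and third clusters are the best separated.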
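
The function `calculate_s_1` multiplies its running value by each pairwise distance, while the equation for $s_{1}$ sums the squared distances under a single square root, so the two can give different numbers. A minimal sketch that follows the equation term by term is given below. It assumes the data is passed in explicitly instead of being read from the global `data_set`, and that both clusters are non-empty; the name `s_1_from_formula` and the toy data are hypothetical.

```python
import numpy as np

def s_1_from_formula(data, assignation, i, j):
    """sqrt(sum of squared distances over all cross-cluster pairs) / (n_i * n_j)."""
    data = np.asarray(data, dtype=float)
    assignation = np.asarray(assignation)
    points_i = data[assignation[:, i] == 1]
    points_j = data[assignation[:, j] == 1]
    # all pairwise differences via broadcasting, shape (n_i, n_j, dim)
    diffs = points_i[:, None, :] - points_j[None, :, :]
    squared_distances = np.sum(diffs ** 2, axis=2)
    return float(np.sqrt(squared_distances.sum()) / (len(points_i) * len(points_j)))

# hypothetical toy data: two clusters of two points each
toy_data = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
toy_assignation = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(s_1_from_formula(toy_data, toy_assignation, 0, 1))  # sqrt(38) / 4, about 1.541
```

The four squared distances in the toy example are 9, 10, 10 and 9, which gives $\sqrt{38}/4$.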

The second separation measure is simply the distance between the centers of two clusters:
\begin{equation}
 s_{2}(c_{i},c_{j})=d(c_{i},c_{j}).
\end{equation}
It is the simplest measure and it can be implemented as below.

In [5]:
def calculate_s_2(centers):
    s2 = []
    for center_1 in range(len(centers)):
        for center_2 in range(len(centers)):
            if center_1 == center_2:
                break
            s2.append(calculate_distance(centers[center_1], centers[center_2]))
    return s2

As with the previous metric, a higher value means better separated clusters.

In [6]:
s2_2 = calculate_s_2(new_centers_hcm)
s2_3 = calculate_s_2(new_centers_hcm3)

print(s2_2)
print(s2_3)

[1.0361961303876361]
[0.40116697670087065, 0.7129319889345509, 1.0912980907761376]
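
The lists above do not show which pair of centers each distance belongs to. One possible way to keep that information is sketched below: the distances are stored in a dictionary keyed by the pair of center indices. The name `labelled_center_distances` is an assumption; the centers are the three-cluster ones defined earlier, repeated so that the block runs on its own.

```python
from itertools import combinations
import numpy as np

def labelled_center_distances(centers):
    """Map each pair of center indices (i, j), i < j, to the distance between them."""
    centers = np.asarray(centers, dtype=float)
    return {(i, j): float(np.linalg.norm(centers[i] - centers[j]))
            for i, j in combinations(range(len(centers)), 2)}

centers3 = np.array([[0.42239686, 0.38503185],
                     [0.07858546, 0.17832272],
                     [0.82907662, 0.97059448]])
for pair, distance in labelled_center_distances(centers3).items():
    # the values agree with s2_3 printed above, up to rounding
    print(pair, round(distance, 4))
```

The pairs come out in the order (0, 1), (0, 2), (1, 2), which corresponds to the order of the values in `s2_3`.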


A different metric is $\sigma_{1}$, which does not take the centers into consideration. It measures the distances between all pairs of objects within a cluster. The smaller the value, the better the cluster. The metric is defined as:

\begin{equation}
 \sigma_{1}(c_{i})=\frac{1}{m}\sum_{x_{1}, x_{2}\in c_{i}}d^{2}(x_{1},x_{2}),
\end{equation}

where $m$ is the number of object pairs in the cluster:

\begin{equation}
 m=\frac{(n_{i}-1)n_{i}}{2}.
\end{equation}

In [7]:
def calculate_sigma_1(assignation):
    sigma_1 = []
    unique_labels = len(assignation[0])
    for label_id in range(unique_labels):
        ids = np.where(assignation[:, label_id] == 1)[0]
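        if len(ids) == 1:
            # a single-object cluster has no pairs; 1.0 is used as a placeholder value
            sigma_1.append(1.0)
            continue
        else:
            m = (len(ids) - 1.0) * len(ids) / 2.0
        elements = data_set[ids]
        # note: the sum starts from 1 / m and the loops below visit every ordered pair,
        # so each unordered pair of objects is added twice
        sigma = (1.0 / m)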
        for element_x_1 in range(len(elements)):
            for element_x_2 in range(len(elements)):
                if element_x_1 == element_x_2:
                    continue
                distance = calculate_distance(elements[element_x_1], elements[element_x_2])
                if distance != 0:
                    sigma = sigma + (distance ** 2)
        sigma_1.append(sigma)
    return sigma_1

The results are given per cluster, which is why we have two values in the first list and three in the second. A cluster with a single object has no pairs, and the implementation returns 1.0 for it. A lower value means a better cluster.

In [8]:
sigma1_2 = calculate_sigma_1(new_assignation_hcm)
sigma1_3 = calculate_sigma_1(new_assignation_hcm3)

print(sigma1_2)
print(sigma1_3)

[2.775635895144805, 0.9686234538734891]
[1.0, 0.7496360155982489, 0.9686234538734891]
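
The function `calculate_sigma_1` starts its sum from $1/m$ and adds every pair of objects twice, so its output is not the plain average given by the equation. A minimal sketch that follows the equation literally is shown below. It assumes the data is passed in explicitly, and it returns 0.0 for a single-object cluster, which is a different convention from the 1.0 used above; the name `sigma_1_from_formula` and the toy data are hypothetical.

```python
from itertools import combinations
import numpy as np

def sigma_1_from_formula(data, assignation, i):
    """Average squared distance over all unordered pairs of objects in cluster i."""
    data = np.asarray(data, dtype=float)
    assignation = np.asarray(assignation)
    points = data[assignation[:, i] == 1]
    n = len(points)
    if n < 2:
        return 0.0  # assumption: a cluster without pairs has no spread
    m = n * (n - 1) / 2  # number of unordered pairs
    total = sum(np.sum((a - b) ** 2) for a, b in combinations(points, 2))
    return float(total / m)

# hypothetical toy data: three objects in cluster 0, one object in cluster 1
toy_data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
toy_assignation = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
print(sigma_1_from_formula(toy_data, toy_assignation, 0))  # (1 + 1 + 2) / 3
print(sigma_1_from_formula(toy_data, toy_assignation, 1))  # 0.0
```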


The metric $s(s_{1})$ combines two previously calculated values, $s_{1}$ and $\sigma_{1}$:
\begin{equation}
 s(s_{1})=\sum_{i,j=1;j\neq i}^{K}\frac{s_{1}(c_{i},c_{j})}{\sigma_{1}(c_{i})}.
\end{equation}
This measure is a bit more complex: it relates the distances between objects in two different clusters to the distances between objects within a cluster. A higher value is better, but the result reflects the relation between clusters rather than the quality of a single cluster.

In [9]:
def calculate_s_s_1(s1, sigma_1):
    s_1_sum = 0.0
    sigma_1_sum = 0.0
    for s_1_value in s1:
        s_1_sum = s_1_sum + s_1_value
    for sigma_1_value in sigma_1:
        sigma_1_sum = sigma_1_sum + sigma_1_value
    s_s1 = s_1_sum / sigma_1_sum
    return s_s1

The results below make it clear that the three-cluster solution is better than the two-cluster one. Increasing the number of clusters usually increases the value of this metric.

In [10]:
s11_2 = calculate_s_s_1(s1_2,sigma1_2)
s11_3 = calculate_s_s_1(s1_3,sigma1_3)

print(s11_2)
print(s11_3)

0.026875329685765333
0.1592672914604556
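
The function `calculate_s_s_1` divides the sum of all $s_{1}$ values by the sum of all $\sigma_{1}$ values, whereas the equation sums the individual ratios $s_{1}(c_{i},c_{j})/\sigma_{1}(c_{i})$ over all ordered pairs. A sketch of the latter is given below. It assumes that the $s_{1}$ values are stored in a dictionary keyed by the index pair $(i, j)$ with $i<j$ and that no $\sigma_{1}$ value is zero; the name `s_s_1_from_formula` and the numbers are hypothetical.

```python
def s_s_1_from_formula(s1_pairs, sigma_1):
    """Sum of s_1(c_i, c_j) / sigma_1(c_i) over all ordered pairs with i != j."""
    total = 0.0
    for (i, j), value in s1_pairs.items():
        # s_1 is symmetric, so one stored pair contributes for (i, j) and for (j, i)
        total += value / sigma_1[i] + value / sigma_1[j]
    return total

# hypothetical values for two clusters
toy_s1 = {(0, 1): 2.0}
toy_sigma_1 = [1.0, 4.0]
print(s_s_1_from_formula(toy_s1, toy_sigma_1))  # 2/1 + 2/4 = 2.5
```

For the same toy numbers the ratio of sums would be $2/(1+4)=0.4$, so the two variants are not interchangeable.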


The metric $\sigma_{2}$ measures the distances between objects within a cluster and the center of this cluster. It is defined as:
\begin{equation}
 \sigma_{2}(c_{i})=\frac{1}{n_{i}}\sum_{x\in c_{i}}d^{2}(x,c_{i}).
\end{equation}
It can be implemented as:

In [11]:
def calculate_sigma_2(centers,assignation):
    sigma_2 = []
    for center_id in range(len(centers)):
        ids = np.where(assignation[:, center_id] == 1)[0]
        elements = data_set[ids]
        # note: the sum starts from 1 / n_i; the squared distances are added to it, not averaged
        sigma = 1.0 / len(ids)
        for element_id in range(len(elements)):
            distance = calculate_distance(elements[element_id], centers[center_id])
            if distance != 0:
                sigma = sigma + (distance) ** 2
        sigma_2.append(sigma)
    return sigma_2

A lower value means a better cluster. We get one value per cluster. In our case the second cluster in the three-cluster approach is the best one.

In [12]:
sigma2_2 = calculate_sigma_2(new_centers_hcm,new_assignation_hcm)
sigma2_3 = calculate_sigma_2(new_centers_hcm3,new_assignation_hcm3)

print(sigma2_2)
print(sigma2_3)

[0.3377154891089826, 0.4392150200900259]
[1.0, 0.22358077907763185, 0.4392150200900259]
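
The function `calculate_sigma_2` starts from $1/n_{i}$ and then adds the squared distances, so it does not return the average from the equation. A minimal sketch that computes the mean squared distance to the center is shown below. It assumes the data is passed in explicitly and that the cluster is not empty; the name `sigma_2_from_formula` and the toy data are hypothetical.

```python
import numpy as np

def sigma_2_from_formula(data, centers, assignation, i):
    """Mean squared distance between the objects of cluster i and its center."""
    data = np.asarray(data, dtype=float)
    centers = np.asarray(centers, dtype=float)
    assignation = np.asarray(assignation)
    points = data[assignation[:, i] == 1]
    # squared Euclidean distance of every object to the center of its cluster
    squared_distances = np.sum((points - centers[i]) ** 2, axis=1)
    return float(squared_distances.mean())

# hypothetical toy data: one cluster with two objects and its center in the middle
toy_data = np.array([[0.0, 0.0], [2.0, 0.0]])
toy_centers = np.array([[1.0, 0.0]])
toy_assignation = np.array([[1], [1]])
print(sigma_2_from_formula(toy_data, toy_centers, toy_assignation, 0))  # 1.0
```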


## Other internal indices

The Dunn index relates the minimal distance between objects in different clusters to the maximum distance between objects within a cluster. It gives a better understanding of how far the objects are from each other. It can be calculated as a quotient of two distances:
\begin{equation}
 C=\frac{d_{min}}{d_{max}},
\end{equation}
where $d_{max}$ and $d_{min}$ are defined as follows:
\begin{equation}
 d_{max}=\max_{1\leq k\leq K}D_{k},
\end{equation}
and
\begin{equation}
 d_{min}=\min_{k\neq k'} d_{kk'}.
\end{equation}

Both are Euclidean distances between objects. The minimum distance $d_{kk'}$ is measured between two objects that are in different clusters:
\begin{equation}
 d_{kk'}=\min_{i\in I_{k}, j\in I_{k'}}d(x_{i}^{(k)},x_{j}^{(k')}).
\end{equation}
The clusters are marked with $k$ and $k'$. The maximum distance $D_{k}$ is taken between two objects within a cluster:
\begin{equation}
 D_{k}=\max_{i,j\in I_{k}; i\neq j}d(x_{i}^{(k)},x_{j}^{(k)}).
\end{equation}
It can be implemented as:

In [13]:
def dunn_index(assignation):
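    # note: the minimum starts at 1, so only distances below 1 can lower it
    minimum_distance = 1
    maximum_distance = 0
    # np.unique on the first one-hot row returns its distinct values (0 and 1),
    # so the loops below visit the first two columns only
    unique_labels = np.unique(assignation[0])
    for label_id_1 in range(len(unique_labels)):
        ids_1 = np.where(assignation[:, label_id_1] == 1)[0]
        for label_id_2 in range(len(unique_labels)):
            if label_id_1 == label_id_2:
                break
            ids_2 = np.where(assignation[:, label_id_2] == 1)[0]
            # both extremes are tracked over distances between objects of two different clusters
            for element_1 in data_set[ids_1]:
                for element_2 in data_set[ids_2]:
                    distance = calculate_distance(element_1, element_2)
                    if distance > maximum_distance:
                        maximum_distance = distance
                    if distance < minimum_distance:
                        minimum_distance = distance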
    dunn_index = minimum_distance / maximum_distance
    return dunn_index

The values of the Dunn index for our examples are given below. A higher value means a better clustering result.

In [14]:
dunn2 = dunn_index(new_assignation_hcm)
dunn3 = dunn_index(new_assignation_hcm3)

print(dunn2)
print(dunn3)

0.48534159949891914
0.7224828783672833
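
The function `dunn_index` takes both the minimum and the maximum over distances between objects of different clusters, and it derives the number of clusters from the distinct values in the first row of the assignment matrix. A sketch that follows the definitions, with the maximum taken within clusters and the minimum taken between clusters, is given below. It assumes the data is passed in explicitly, that there are at least two non-empty clusters, and that at least one cluster contains two distinct objects; the names `pairwise_distances` and `dunn_index_from_definition` and the toy data are hypothetical.

```python
import numpy as np

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diffs ** 2, axis=2))

def dunn_index_from_definition(data, assignation):
    data = np.asarray(data, dtype=float)
    assignation = np.asarray(assignation)
    # one array of objects per column of the one-hot assignment matrix
    clusters = [data[assignation[:, k] == 1] for k in range(assignation.shape[1])]
    # d_max: largest distance between two objects of the same cluster
    d_max = max(pairwise_distances(c, c).max() for c in clusters)
    # d_min: smallest distance between two objects of different clusters
    d_min = min(pairwise_distances(clusters[k], clusters[l]).min()
                for k in range(len(clusters)) for l in range(k))
    return float(d_min / d_max)

# hypothetical toy data: two compact clusters that are three units apart
toy_data = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]])
toy_assignation = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
print(dunn_index_from_definition(toy_data, toy_assignation))  # 3.0
```

In the toy example the closest objects of different clusters are 3 apart and the widest cluster has a diameter of 1, so the index is 3.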
