# Evaluation of clustering 

## Purity
Purity is a simple and transparent evaluation measure. Each cluster is assigned to the class that occurs most often in it, and purity is the fraction of points that belong to the majority class of their cluster.

$$ Purity(\Omega) = \frac {1}{m} \sum_j \max_i c(i,j) $$

where $\Omega = \{ \omega_1, \omega_2, \ldots, \omega_K \}$ is the set of clusters, $\mathbb{C} = \{ c_1,c_2,\ldots,c_J \}$ is the set of classes, $c(i,j)$ is the number of points of class $c_i$ in cluster $\omega_j$, and $m$ is the total number of points.

High purity is easy to achieve when the number of clusters is large - in particular, purity is 1 if each document gets its own cluster. Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters. 
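
The degenerate case can be checked with a minimal sketch. It assumes each cluster is given as a list of per-class counts, and the name `purity_sketch` and the two toy clusterings are illustrative only.

```python
def purity_sketch(clusters):
    """Purity for clusters given as lists of per-class counts."""
    total = sum(sum(cluster) for cluster in clusters)
    majority = sum(max(cluster) for cluster in clusters)
    return majority / total

# Two clusters over two classes, with some mixing.
mixed = [[3, 1], [1, 3]]
# The same 8 points with one cluster per point: every cluster is trivially pure.
singletons = [[1, 0]] * 4 + [[0, 1]] * 4

print(purity_sketch(mixed))       # 0.75
print(purity_sketch(singletons))  # 1.0
```

The second clustering says nothing useful about the data, yet it reaches the maximum purity.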

## Entropy

###  Cluster Entropy

This parameter checks the homogeneity of clusters. The entropy of cluster j is defined by

$$ H(\omega_j) = -\sum_i \frac {c(i,j)}{\sum_i c(i,j)} \cdot \log \frac {c(i,j)}{\sum_i c(i,j)} $$

where $c(i,j)$ is the number of times label $i$ occurs in cluster $j$. Terms with $c(i,j) = 0$ contribute zero, by the convention $0 \cdot \log 0 = 0$.

The entropy of a cluster is zero if all of its documents have the same label; otherwise it is positive.

**The total cluster entropy** is the weighted average of the individual cluster entropies:

$$ H(\Omega) = \frac {1}{m} \sum_j H(\omega_j) \cdot \sum_i c(i,j) $$

where ${\sum_i c(i,j)}$ is the number of documents in cluster $j$ and $m$ is the total number of documents.

Lower entropy means more homogeneous clusters, so the lower the entropy, the better the quality.
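
As a small illustration of the two extremes, the sketch below computes the entropy of a perfectly homogeneous cluster and of an evenly mixed one. The helper name `cluster_entropy_sketch` and the counts are made up for this example, and base-2 logarithms are an assumption.

```python
from math import log2

def cluster_entropy_sketch(counts):
    """Base-2 entropy of one cluster, given its per-class counts."""
    n = sum(counts)
    # p * log2(1/p) equals -p * log2(p); zero counts are skipped
    # (convention 0 * log 0 = 0).
    return sum((c / n) * log2(n / c) for c in counts if c > 0)

pure = [6, 0, 0]     # every document has the same label
uniform = [2, 2, 2]  # labels spread evenly over three classes

print(cluster_entropy_sketch(pure))     # 0.0
print(cluster_entropy_sketch(uniform))  # about 1.585, i.e. log2(3)
```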

### Class Entropy

Class entropy measures how fragmented each class is across the clusters. Its optimal value is zero, which is reached when every class is contained in a single cluster.
If a single class (topic) is spread over multiple clusters, the class entropy increases.

$$ e_i = -\sum_j \frac {c(i,j)}{\sum_j c(i,j)} \cdot \log \frac {c(i,j)}{\sum_j c(i,j)} $$

**The total class entropy** is the weighted average of the individual class entropies:

$$ e_{total} = \frac {1}{m} \sum_i e_i \cdot \sum_j c(i,j) $$

In [None]:
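def purity(Clusters):
    """Return the purity of a clustering.

    Clusters holds one row per cluster; each row lists the number of
    points of each class in that cluster.
    """
    m = 0  # total number of points
    p = 0  # points that belong to the majority class of their cluster
    for Cluster in Clusters:
        m += sum(Cluster)
        p += max(Cluster)
    return p / m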

In [1]:
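from math import log

def entropy(Cluster):
    """Return the base-2 entropy of one row of counts."""
    N_c = sum(Cluster)  # number of points in the row
    e = 0
    for Point in Cluster:
        # Skip zero counts: 0 * log(0) is taken as 0
        if Point > 0:
            e += (Point / N_c) * log(Point / N_c, 2)
    return -e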

In [2]:
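def entropy_total(Clusters):
    """Return the size-weighted average entropy over all rows of counts."""
    m = 0  # total number of points
    e = 0  # sum of row entropies, each weighted by its row size
    for Cluster in Clusters:
        N_c = sum(Cluster)
        m += N_c
        e += entropy(Cluster) * N_c
    return e / m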

# Example Cluster


!["Fig 1"](img/ClusterExample.jpg)

*Fig. 1: Cluster Example*

Classes:
- class 1: X
- class 2: O
- class 3: D



In [None]:
# Rows are clusters; columns are the counts of classes X, O, D in that cluster
C = [[5,1,0],
     [1,4,1],
     [2,0,3]]
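
A count matrix like the one above could be derived from per-point labels. The sketch below assumes the labels are available as one list per cluster. The label lists are hypothetical and were chosen only to reproduce the counts in `C`, and `count_matrix` is an illustrative helper name.

```python
from collections import Counter

classes = ["X", "O", "D"]  # column order of the count matrix

# Hypothetical class labels of the points in each cluster, chosen
# only so that the counts match the matrix C.
cluster_labels = [
    ["X"] * 5 + ["O"],
    ["X"] + ["O"] * 4 + ["D"],
    ["X"] * 2 + ["D"] * 3,
]

def count_matrix(cluster_labels, classes):
    """Return one row per cluster with the count of each class, in `classes` order."""
    rows = []
    for labels in cluster_labels:
        counts = Counter(labels)  # missing classes count as 0
        rows.append([counts[c] for c in classes])
    return rows

print(count_matrix(cluster_labels, classes))
# [[5, 1, 0], [1, 4, 1], [2, 0, 3]]
```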

## Calculation of the Purity

$Purity = (\frac{1}{17}) \times (5+4+3) = 0.70588 $

In [None]:
P = purity(C)
print(P)

## Calculation of the Cluster entropy 

Terms with a zero count are taken as $0$, by the convention $0 \times \log_2 0 = 0$.

$ H(\omega_1) = -\left( \frac{5}{6} \times \log_2 \frac{5}{6} + 
                \frac{1}{6} \times \log_2 \frac{1}{6} + 
                \frac{0}{6} \times \log_2 \frac{0}{6} \right) = 0.650 $

$ H(\omega_2) = -\left( \frac{1}{6} \times \log_2 \frac{1}{6} + 
                \frac{4}{6} \times \log_2 \frac{4}{6} + 
                \frac{1}{6} \times \log_2 \frac{1}{6} \right) = 1.252 $

$ H(\omega_3) = -\left( \frac{2}{5} \times \log_2 \frac{2}{5} +
                \frac{0}{5} \times \log_2 \frac{0}{5} +
                \frac{3}{5} \times \log_2 \frac{3}{5} \right) = 0.971 $

To find the total entropy of the clustering, sum the cluster entropies, each weighted by the relative size of its cluster.

$ H(\Omega) = 0.650 \times \frac {6}{17} + 1.252 \times \frac {6}{17} + 0.971 \times \frac {5}{17} $

$ H(\Omega) \approx 0.957 $


In [None]:
H = entropy_total(C)
print(H)

## Calculation of the Class entropy

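Each column of $C$ holds the counts of one class across the clusters, so the class entropies are computed on the rows of the transpose $C_T$:

$C = [[5,1,0],
     [1,4,1],
     [2,0,3]]$

$C_T = [[5,1,2],
       [1,4,0],
       [0,1,3]]$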

$ e(c_1) = -\left( \frac{5}{8} \times \log_2 \frac{5}{8} + 
           \frac{1}{8} \times \log_2 \frac{1}{8} + 
           \frac{2}{8} \times \log_2 \frac{2}{8} \right) = 1.299 $

$ e(c_2) = -\left( \frac{1}{5} \times \log_2 \frac{1}{5} + 
           \frac{4}{5} \times \log_2 \frac{4}{5} + 
           \frac{0}{5} \times \log_2 \frac{0}{5} \right) = 0.722 $

$ e(c_3) = -\left( \frac{0}{4} \times \log_2 \frac{0}{4} +
           \frac{1}{4} \times \log_2 \frac{1}{4} +
           \frac{3}{4} \times \log_2 \frac{3}{4} \right) = 0.811 $

To find the total entropy of the set of classes, sum the class entropies, each weighted by the relative size of its class.

$ e_{total}(\mathbb{C}) = 1.299 \times \frac {8}{17} + 0.722 \times \frac {5}{17} + 0.811 \times \frac {4}{17} $

$ e_{total}(\mathbb{C}) \approx 1.014 $


In [None]:
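import numpy as np

# Transposing turns the class columns into rows, so entropy_total
# now weights the entropy of each class by the size of that class.
C = np.array(C)
C_T = C.transpose()

H = entropy_total(C_T)
print(H)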
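
As a cross-check of both hand calculations, the same weighted entropy can be sketched in vectorized NumPy. The function name `weighted_entropy` is illustrative, and the sketch assumes a count matrix in which no row is empty.

```python
import numpy as np

def weighted_entropy(counts):
    """Size-weighted average of the base-2 row entropies of a count matrix."""
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1, keepdims=True)
    p = counts / row_sums
    # Use 1 inside the log where p == 0, so those terms contribute 0.
    log_p = np.log2(np.where(p > 0, p, 1.0))
    row_entropy = -(p * log_p).sum(axis=1)
    return float((row_entropy * row_sums.ravel()).sum() / counts.sum())

counts = np.array([[5, 1, 0],
                   [1, 4, 1],
                   [2, 0, 3]])

print(round(weighted_entropy(counts), 3))    # total cluster entropy (rows = clusters)
print(round(weighted_entropy(counts.T), 3))  # total class entropy (rows = classes)
```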
