# Clustering
In this series, we implement the methods and algorithms described by [Rokach and Maimon 2005, ch. 15](http://www.springer.com/us/book/9780387254654). We first write each measure by hand to make its definition explicit, and then show the SciPy counterparts at the end of this notebook.

## Distance measure (dissimilarity)

This notebook provides functions that calculate the distance (dissimilarity) and the similarity between $p$-dimensional instances.

---

First we declare some toy instances to test our methods:

In [1]:
import numpy as np
from numpy import linalg as LA
# numeric attributes
num1 = np.array([0.12,0.34,0.56,0.78,0.90])
num2 = np.array([0.09,0.78,0.65,0.43,0.21])
# binary attributes
bin1 = [True,False,False,True,True]
bin2 = [True,False,True,True,False]
# nominal attributes
nom1 = ["green","blue","green","red","blue"]
nom2 = ["red","blue","red","red","green"]
# mixed type attributes
mix1 = ["green",True,False,0.53,"red"]
mix2 = ["blue",True,True,0.64,"red"]

### Minkowski: Distance Measures for Numeric Attributes
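For the numeric attributes, we consider the Minkowski distance of order $g$
$$d(x_i,x_j)=\left(\sum_{n=1}^p \mid x_{i,n}-x_{j,n} \mid ^g \right)^{1/g}$$
which gives the Manhattan distance for $g=1$ and the Euclidean distance for $g=2$, and approaches the Chebyshev distance as $g \to \infty$.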

In [2]:
def mink(x1,x2,g):
    if len(x1)!=len(x2): return None
    res=np.sum(abs(x1-x2)**g)**(1/g)
    return res

Which results in:

In [3]:
print("manhattan distance = ",mink(num1,num2,1))
print("euclidean distance = ", mink(num1,num2,2))
print("Chebyshev distance ~= ", mink(num1,num2,1000))

manhattan distance =  1.6
euclidean distance =  0.895097760024
Chebyshev distance ~=  0.69
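
The value printed for $g=1000$ is only an approximation of the Chebyshev distance. As a minimal sketch (the names `minkowski` and `chebyshev` are illustrative, not part of the original notebook), the limit can be computed directly as the largest absolute coordinate difference and compared with increasing orders:

```python
import numpy as np

# Same toy instances as declared at the top of the notebook
num1 = np.array([0.12, 0.34, 0.56, 0.78, 0.90])
num2 = np.array([0.09, 0.78, 0.65, 0.43, 0.21])

def minkowski(x1, x2, g):
    # Minkowski distance of order g (same formula as mink above)
    return np.sum(np.abs(x1 - x2) ** g) ** (1 / g)

def chebyshev(x1, x2):
    # Limit of the Minkowski distance as g grows: the largest coordinate gap
    return np.max(np.abs(x1 - x2))

# The distance shrinks towards the Chebyshev value as g increases
for g in (1, 2, 10, 100, 1000):
    print(g, minkowski(num1, num2, g))
print("limit", chebyshev(num1, num2))
```

For much larger orders the powers underflow to zero in floating point, so taking the maximum directly is the safer way to obtain the limit.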


### Distance Metrics for Ordinal Attributes
For ordinal attributes, it suffices to normalize the values using:
$$z_{i,n}=\frac{r_{i,n}-1}{M_n-1}$$
where $z_{i,n}$ is the standardized value of attribute $a_n$ for instance $x_i$, $r_{i,n}$ is this value before standardization and $M_n$ is the upper limit of the domain of attribute $a_n$, assuming the lower limit is $1$.
Then one should proceed using the Minkowski distance.  
However, I propose one correction to this formula. Since ordinal values are usually presented in decreasing order of position (e.g. rankings, queues etc.), the lower the value, the higher its position. Therefore, the value should be normalized to the $[0,1]$ domain using:
$$z_{i,n}=1-\frac{r_{i,n}-1}{M_n-1}$$
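
The two normalizations can be sketched as follows. This is an illustrative example only: the helper `normalize_ordinal`, the rank vectors `ord1` and `ord2`, and the shared upper limit `M` are assumptions made for the demonstration and do not come from the original notebook.

```python
import numpy as np

def normalize_ordinal(r, M, reverse=False):
    # Map ranks in {1, ..., M} to [0, 1]; reverse=True applies the
    # "1 - ..." variant so that rank 1 maps to the highest value
    r = np.asarray(r, dtype=float)
    z = (r - 1) / (M - 1)
    return 1 - z if reverse else z

# Hypothetical rank vectors; every attribute is assumed to share M = 5
ord1 = [1, 3, 5]
ord2 = [2, 3, 1]
M = 5

z1 = normalize_ordinal(ord1, M)
z2 = normalize_ordinal(ord2, M)
dist = np.sum(np.abs(z1 - z2))  # Manhattan distance on normalized values

# Same distance computed with the reversed normalization
z1_rev = normalize_ordinal(ord1, M, reverse=True)
z2_rev = normalize_ordinal(ord2, M, reverse=True)
dist_rev = np.sum(np.abs(z1_rev - z2_rev))

print(dist, dist_rev)
```

Because $|(1-a)-(1-b)|=|a-b|$, reversing the scale changes the normalized values but leaves any Minkowski distance between two instances unchanged.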

### Distance Measures for Binary Attributes
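For the binary attributes we consider the matching coefficient.
First, we generate the contingency table, where $q$ counts the attributes that are true in both instances, $r$ those that are true only in $x_i$, $s$ those that are true only in $x_j$, and $t$ those that are false in both.  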

In [4]:
def contTable(x1,x2):
    if len(x1)!=len(x2): return None
    q=r=s=t=0
    for idx in range(len(x1)):
        if x1[idx] and x2[idx]:
            q+=1
        elif x1[idx]:
            r+=1
        elif x2[idx]:
            s+=1
        else:
            t+=1
    res = [q,r,s,t]
    return res

Then we can calculate the matching coefficient as
$$d(x_i,x_j)=\frac{r+s}{q+r+s+t}$$

In [5]:
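def matCoef(x1,x2):
    if len(x1)!=len(x2): return None
    # build the table from the arguments, not from the global toy instances
    [q,r,s,t]=contTable(x1,x2)
    # fraction of attributes on which the two instances disagree
    res=(r+s)/(q+r+s+t)
    return res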

Which results in:

In [6]:
print("Matching coefficient = ", matCoef(bin1,bin2))

Matching coefficient =  0.4


### Distance Measures for Nominal Attributes
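For the nominal attributes we use simple matching
$$d(x_i,x_j)=\frac{p-m}{p}$$
where $p$ is the total number of attributes and $m$ is the number of attributes on which both instances match.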

In [7]:
def simMatch(x1,x2):
    if len(x1)!=len(x2): return None
    m=0
    p=len(x1)
    for idx in range(p):
        if x1[idx]==x2[idx]:
            m+=1
    res=(p-m)/p
    return res

Which results in:

In [8]:
print("Simple Match = ",simMatch(nom1,nom2))

Simple Match =  0.6


### Distance Metrics for Mixed-Type Attributes
Finally, for mixed type attributes we use the mixed type distance
$$d(x_i,x_j)=\frac{\sum\limits_{n=1}^p \delta_{ij}^{(n)} d_{ij}^{(n)}}{\sum\limits_{n=1}^p \delta_{ij}^{(n)}}$$
where $\delta_{ij}^{(n)}=0$ if one attribute is missing and $1$ otherwise. The distance $d_{ij}^{(n)}$ between the $n^{th}$ attribute of the instances depends on the type of attribute.
* If categorical or binary, $d_{ij}^{(n)}=0$ if both values match and $1$ otherwise
* If numeric, the distance $d_{ij}^{(n)}$ is the absolute difference between both values


In [9]:
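def mixDis(x1,x2):
    if len(x1)!=len(x2): return None
    dij = np.zeros(len(x1))    # per-attribute distances
    delij = np.ones(len(x1))   # per-attribute weights (0 = attribute ignored)
    for idx in range(len(dij)):
        if type(x1[idx]) != type(x2[idx]): return None
        if (type(x1[idx]) is str) or (type(x1[idx]) is bool):
            # categorical or binary: 0 if the values match, 1 otherwise
            dij[idx] = int(x1[idx] != x2[idx])
        elif (type(x1[idx]) is int) or (type(x1[idx]) is float):
            # numeric: absolute difference
            dij[idx] = abs(x1[idx]-x2[idx])
        else:
            # unsupported type: exclude the attribute from the average
            delij[idx] = 0
    res = delij.dot(dij)/np.sum(delij)
    return res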

Which results in:

In [10]:
print("Mixed Distance = ", mixDis(mix1,mix2))

Mixed Distance =  0.422
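
The role of $\delta_{ij}^{(n)}$ becomes visible once a value is missing. The following is a minimal sketch under the assumption that a missing value is encoded as `None`; the function `mixed_distance` and the instance `mix3` are hypothetical and not part of the original notebook.

```python
def mixed_distance(x1, x2):
    # Mixed-type distance that skips attributes missing (None) in either instance
    if len(x1) != len(x2):
        raise ValueError("instances must have the same number of attributes")
    total = 0.0   # sum of delta * d over the attributes
    count = 0     # sum of delta, i.e. number of attributes actually compared
    for a, b in zip(x1, x2):
        if a is None or b is None:
            continue  # delta = 0: the attribute does not contribute
        a_cat = isinstance(a, (str, bool))
        b_cat = isinstance(b, (str, bool))
        if a_cat != b_cat:
            raise TypeError("attribute types do not match")
        # categorical/binary: 0 on a match, 1 otherwise; numeric: absolute difference
        total += float(a != b) if a_cat else abs(a - b)
        count += 1
    return total / count if count else None

mix1 = ["green", True, False, 0.53, "red"]
mix2 = ["blue", True, True, 0.64, "red"]
mix3 = ["blue", True, None, 0.64, "red"]   # third attribute is missing

print(mixed_distance(mix1, mix2))  # all five attributes are compared
print(mixed_distance(mix1, mix3))  # only four attributes are compared
```

With the third attribute missing, the sum of distances is divided by four instead of five, so the result is an average over the attributes that could actually be compared.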


## Similarity functions
As opposed to distance metrics, similarity functions should provide higher values if $x_1$ and $x_2$ are "similar".

### Cosine Measure
This metric evaluates the similarity as a function of the angle between the vectors rather than the distance between them.
$$s(x_i,x_j)=\frac{x_i^T \cdot x_j}{\|x_i\| \cdot \|x_j\|}$$

In [11]:
def cosMeas(x1,x2):
    if len(x1)!=len(x2): return None
    res = x1.dot(x2)/(LA.norm(x1)*LA.norm(x2))
    return res

Which results in:

In [12]:
print("Cosine measure = ",cosMeas(num1,num2))

Cosine measure =  0.757796738324
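
To illustrate that the cosine measure depends only on the angle, the sketch below compares a vector with a scaled copy of itself. The vectors `x` and `y` and the function name `cosine_similarity` are chosen for this example only.

```python
import numpy as np

def cosine_similarity(x1, x2):
    # Cosine of the angle between x1 and x2
    return x1.dot(x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

x = np.array([1.0, 2.0, 3.0])
y = 10 * x   # same direction, ten times the length

cos_xy = cosine_similarity(x, y)   # stays at (approximately) 1
euc_xy = np.linalg.norm(x - y)     # grows with the scale factor

print(cos_xy, euc_xy)
```

The two vectors are far apart in Euclidean terms, yet their cosine similarity is maximal because they point in the same direction.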


### Pearson correlation measure
This metric uses the normalized Pearson correlation
$$s(x_i,x_j)=\frac{(x_i-\bar x_i)^T \cdot (x_j-\bar x_j)}{\|x_i-\bar x_i\| \cdot \|x_j-\bar x_j\|}$$
where $\bar x_i$ denotes the mean of the attribute values of $x_i$.

In [13]:
def pearCor(x1,x2):
    if len(x1)!=len(x2): return None
    c1, c2 = x1-x1.mean(), x2-x2.mean()  # center each vector on its own mean
    res = np.dot(c1,c2)/(LA.norm(c1)*LA.norm(c2))
    return res

Which results in:

In [14]:
print("Pearson correlation = ", pearCor(num1,num2))

Pearson correlation =  -0.00543744287826
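
Put differently, the Pearson measure is the cosine measure applied to mean-centered vectors. A short sketch of this relation, cross-checked against NumPy's `np.corrcoef` (the helper names `cosine_similarity` and `pearson` are illustrative):

```python
import numpy as np

num1 = np.array([0.12, 0.34, 0.56, 0.78, 0.90])
num2 = np.array([0.09, 0.78, 0.65, 0.43, 0.21])

def cosine_similarity(x1, x2):
    return x1.dot(x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

def pearson(x1, x2):
    # Cosine measure of the vectors after subtracting their own means
    return cosine_similarity(x1 - x1.mean(), x2 - x2.mean())

ours = pearson(num1, num2)
reference = np.corrcoef(num1, num2)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(ours, reference)
```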


### Extended Jaccard Measure
The [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) is a comparison of the similarity in the content of two sets, expressed as the intersection divided by the union.
The extended Jaccard measure is given by:
$$s(x_i,x_j)=\frac{x_i^T \cdot x_j}{\|x_i\|^2 + \|x_j\|^2 - x_i^T \cdot x_j}$$

In [15]:
def extJac(x1,x2):
    res = x1.dot(x2)/(LA.norm(x1)**2 + LA.norm(x2)**2 - x1.dot(x2))
    return res

Which gives us:

In [16]:
print("Extended Jaccard = ", extJac(num1,num2))

Extended Jaccard =  0.592389092389
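
On 0/1 indicator vectors the extended Jaccard measure reduces to the set-based Jaccard index mentioned above. The sketch below checks this on two small sets; the sets `A` and `B`, the `universe` list and the helper names are made up for the illustration.

```python
import numpy as np

def ext_jaccard(x1, x2):
    # Extended Jaccard measure for real-valued vectors
    dot = x1.dot(x2)
    return dot / (x1.dot(x1) + x2.dot(x2) - dot)

universe = ["a", "b", "c", "d", "e"]
A = {"a", "d", "e"}
B = {"a", "c", "d"}

def indicator(s):
    # 1.0 where the element of the universe belongs to s, 0.0 elsewhere
    return np.array([1.0 if u in s else 0.0 for u in universe])

set_jaccard = len(A & B) / len(A | B)                 # intersection over union
vec_jaccard = ext_jaccard(indicator(A), indicator(B))

print(set_jaccard, vec_jaccard)
```

For indicator vectors, $x_i^T \cdot x_j$ counts the shared elements and $\|x_i\|^2$ counts the elements of each set, so the denominator is exactly the size of the union.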


### Dice Coefficient
The [Dice coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient) is similar to the Jaccard measure, except that its complement $1-s(x_i,x_j)$ is only a semimetric, because it does not satisfy the triangle inequality.
$$s(x_i,x_j)=\frac{2 x_i^T \cdot x_j}{\|x_i\|^2 + \|x_j\|^2}$$

In [17]:
def dicCoe(x1,x2):
    res = 2*x1.dot(x2)/(LA.norm(x1)**2 + LA.norm(x2)**2)
    return res

Which gives us:

In [18]:
print("Dice coefficient = ", dicCoe(num1,num2))

Dice coefficient =  0.744025559105
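
A small counterexample makes the triangle inequality violation concrete. The sketch below uses the dissimilarity $1-s$ and three hand-picked vectors; `dice_dissimilarity` and the vectors `x`, `y`, `z` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def dice_dissimilarity(x1, x2):
    # One minus the Dice coefficient defined above
    return 1 - 2 * x1.dot(x2) / (x1.dot(x1) + x2.dot(x2))

x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([1.0, 1.0])

d_xy = dice_dissimilarity(x, y)   # no overlap at all
d_xz = dice_dissimilarity(x, z)
d_zy = dice_dissimilarity(z, y)

# Going through z is "shorter" than going directly from x to y
print(d_xy, d_xz + d_zy)
```

Here the direct dissimilarity between `x` and `y` is $1$, while the detour through `z` sums to $2/3$, which a true metric would not allow.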


## Distance and Similarity using SciPy
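Most of the above distances are already implemented in [SciPy](https://docs.scipy.org/doc/scipy/reference/spatial.distance.html) and we provide a list of them below. Note that the functions in `scipy.spatial.distance` return dissimilarities, which is why the cosine and Pearson similarities are obtained as one minus the SciPy result.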

In [37]:
from scipy.spatial import distance as spd
print("manhattan distance = ",spd.cityblock(num1,num2))
print("euclidean distance = ", spd.euclidean(num1,num2))
print("Chebyshev distance = ", spd.chebyshev(num1,num2))
print("Minkowski distance = ", spd.minkowski(num1,num2,3))
print("Hamming distance (matCoef) = ", spd.hamming(bin1,bin2))
print("Simple Match = ",spd.hamming(nom1,nom2))
print("Mixed Distance = ","NA")
print("Cosine measure = ",1 - spd.cosine(num1,num2))
print("Pearson correlation = ", 1 - spd.correlation(num1,num2))
print("Jaccard dissimilarity (bool only) = ", spd.jaccard(bin1,bin2))
print("Dice dissimilarity (bool only)= ", spd.dice(bin1,bin2))

manhattan distance =  1.6
euclidean distance =  0.89509776002401
Chebyshev distance =  0.69
Minkowski distance =  0.770444450219
Hamming distance (matCoef) =  0.4
Simple Match =  0.6
Mixed Distance =  NA
Cosine measure =  0.757796738324
Pearson correlation =  -0.00543744287826
Jaccard dissimilarity (bool only) =  0.5
Dice dissimilarity (bool only)=  0.3333333333333333
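
The last two values come from `spd.jaccard` and `spd.dice`, which operate on boolean vectors and return dissimilarities. As a hedged sketch, the check below confirms that one minus these values agrees with the extended Jaccard and Dice formulas from the previous sections when the boolean instances are cast to 0/1 vectors (the variable names are illustrative):

```python
import numpy as np
from scipy.spatial import distance as spd

bin1 = np.array([True, False, False, True, True])
bin2 = np.array([True, False, True, True, False])

# Cast the boolean instances to 0/1 vectors for the vector formulas
b1 = bin1.astype(float)
b2 = bin2.astype(float)
dot = b1.dot(b2)

jaccard_sim = dot / (b1.dot(b1) + b2.dot(b2) - dot)   # extended Jaccard formula
dice_sim = 2 * dot / (b1.dot(b1) + b2.dot(b2))        # Dice coefficient formula

# SciPy returns the complements of these similarities
print(jaccard_sim, 1 - spd.jaccard(bin1, bin2))
print(dice_sim, 1 - spd.dice(bin1, bin2))
```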
