
# KMeans

*kmeans*, *clustering*

**Problem**

Given $m$ data points $\{x^i\}_{i=1}^m$, we want to cluster them into $k$ groups based on similarity.

**Method**

K-means is a clustering algorithm that groups data points into $k$ clusters by minimizing the distortion function over $\{r^{ij}, \mu^j\}$:
$$J = \sum_{i=1}^m \sum_{j=1}^k r^{ij} || x^i-\mu^j|| ^2,$$
where $\mu^j$ is the center of the $j$-th cluster, and $r^{ij} = 1$ if $x^i$ belongs to the $j$-th cluster and $r^{ij} = 0$ otherwise.
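As a minimal sketch of the definition above, the distortion can be evaluated directly from a one-hot assignment matrix. The function name `distortion`, the array layout (one point per row), and the toy data are assumptions made for illustration.

```python
import numpy as np

def distortion(X, labels, centers):
    """Evaluate J = sum_i sum_j r^{ij} ||x^i - mu^j||^2.

    X: (m, d) data points, labels: (m,) cluster index of each point,
    centers: (k, d) cluster centers. All names are illustrative.
    """
    k = centers.shape[0]
    r = np.eye(k)[labels]  # (m, k) one-hot matrix, r[i, j] = r^{ij}
    # Squared Euclidean distance from every point to every center: (m, k)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float((r * sq_dist).sum())

# Toy data: each point lies at distance 1 from its assigned center.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [4.0, 1.0]])
J = distortion(X, labels, centers)
print(J)  # 4.0
```

Multiplying by the one-hot matrix mirrors the double sum in the formula; in practice one would usually index the distances by `labels` instead.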


## K-means using Euclidean distance

With the assignments $r^{ij}$ held fixed, the cluster centers that minimize $J$ under squared Euclidean distance are the cluster means:

$$\mu^j = \frac{\sum_{i=1}^m r^{ij}x^i}{\sum_{i=1}^m r^{ij}}$$

*Proof*. With the assignments fixed, $J$ is convex in each center, so we set its derivative with respect to $\mu^j$ to zero. Only the terms of cluster $j$ depend on $\mu^j$:
$$
\begin{aligned}
\frac{\partial J}{\partial \mu^j} &= \frac{\partial}{\partial \mu^j} \sum_{i=1}^m \sum_{l=1}^k r^{il} || x^i-\mu^l||^2
                = \sum_{i=1}^m r^{ij} \frac{\partial}{\partial \mu^j}|| x^i-\mu^j||^2 \\
                &= -2 \sum_{i=1}^m r^{ij} (x^i - \mu^j)
                = 2 \mu^j \sum_{i=1}^m r^{ij} - 2 \sum_{i=1}^m r^{ij}x^i = 0 \\
                &\Rightarrow \mu^j \sum_{i=1}^m r^{ij} = \sum_{i=1}^m r^{ij}x^i \\
                &\Rightarrow \mu^j = \frac{\sum_{i=1}^m r^{ij}x^i}{\sum_{i=1}^m r^{ij}}
\end{aligned}
$$
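The result can be checked numerically. The sketch below computes each center as the mean of its assigned points and confirms that random perturbations of a center never lower that cluster's cost. The helper names `update_centers` and `cluster_cost`, the random data, and the fixed labels are assumptions for illustration.

```python
import numpy as np

def update_centers(X, labels, k):
    """Mean of the points assigned to each cluster (assumes no cluster is empty)."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def cluster_cost(X, labels, j, c):
    """Sum of squared Euclidean distances from the points of cluster j to c."""
    return float(((X[labels == j] - c) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
labels = np.arange(30) % 3          # arbitrary fixed assignment into 3 clusters
mu = update_centers(X, labels, 3)

# Moving a center away from the mean should never reduce its cluster's cost.
never_better = True
for j in range(3):
    base = cluster_cost(X, labels, j, mu[j])
    for _ in range(100):
        perturbed = mu[j] + 0.1 * rng.normal(size=2)
        if cluster_cost(X, labels, j, perturbed) < base - 1e-12:
            never_better = False
print(never_better)
```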

## K-means using Mahalanobis distance

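Mahalanobis distance is another popular distance measure. In the form used here, $d(x,y) = (x-y)^T \Sigma (x-y)$ for a symmetric positive definite matrix $\Sigma$; the standard definition takes this matrix to be the inverse of the data covariance and uses the square root of the quadratic form. Intuitively, the distance is computed after the vectors are "transformed" to account for the variances and covariances of the variables.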

We show that the centroids in this case are still the same as in the Euclidean case, namely the cluster means.

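*Proof*. Under Mahalanobis distance, the K-means distortion function becomes
$$J = \sum_{i=1}^m \sum_{j=1}^k r^{ij} (x^i-\mu^j)^T \Sigma (x^i-\mu^j),$$
where $\Sigma$ is the symmetric positive definite matrix that defines the distance (the inverse covariance matrix in the standard Mahalanobis definition).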

To minimize the distortion function, we set its derivative with respect to $\mu^j$ to zero, using the symmetry of $\Sigma$ and, in the third line, its invertibility:
$$
\begin{aligned}
\frac{\partial J}{\partial \mu^j} &= \frac{\partial}{\partial \mu^j} \sum_{i=1}^m \sum_{l=1}^k r^{il} (x^i-\mu^l)^T \Sigma (x^i-\mu^l)
                = \sum_{i=1}^m r^{ij} \frac{\partial}{\partial \mu^j} \left((x^i-\mu^j)^T \Sigma (x^i-\mu^j)\right) \\
                &= -2 \sum_{i=1}^m r^{ij} \Sigma (x^i-\mu^j)
                = 2 \Sigma \mu^j \sum_{i=1}^m r^{ij} - 2 \Sigma \sum_{i=1}^m r^{ij} x^i = 0 \\
                &\Rightarrow \mu^j \sum_{i=1}^m r^{ij} = \sum_{i=1}^m r^{ij}x^i \quad \text{(multiplying both sides by } \tfrac{1}{2}\Sigma^{-1}\text{)} \\
                &\Rightarrow \mu^j = \frac{\sum_{i=1}^m r^{ij}x^i}{\sum_{i=1}^m r^{ij}}
\end{aligned}
$$
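A quick numerical check of this derivation, under stated assumptions: the matrix `Sigma` below is a randomly generated symmetric positive definite matrix (not one estimated from data), and a single cluster is considered. The gradient expression from the proof vanishes at the mean, and perturbing the center does not lower the cost.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))            # points of one cluster (illustrative data)

A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3.0 * np.eye(3)       # symmetric positive definite by construction

def quad_cost(c):
    """Sum over points of (x - c)^T Sigma (x - c)."""
    d = X - c
    return float(np.einsum("ni,ij,nj->", d, Sigma, d))

mu = X.mean(axis=0)

# Gradient from the derivation: -2 * Sigma * sum_i (x^i - mu)
grad = -2.0 * Sigma @ (X - mu).sum(axis=0)
print(np.allclose(grad, 0.0))

# The mean should be at least as good as nearby candidate centers.
base = quad_cost(mu)
mean_is_best = all(
    quad_cost(mu + 0.1 * rng.normal(size=3)) >= base - 1e-9 for _ in range(100)
)
print(mean_is_best)
```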

Under squared Euclidean distance, the assignment function is:
$$\pi(i) = \arg \min_{j=1 \dots k} ||x^i - \mu^j||^2$$

Under Mahalanobis distance, the assignment function is:
$$\pi(i) = \arg \min_{j=1 \dots k} (x^i - \mu^j)^T \Sigma (x^i - \mu^j)$$

The centroid update is unchanged, but the assignment function differs because the distance used to find the closest center has changed.
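The two assignment rules can be compared on a small example. In this sketch the function name `assign`, the two points, the two centers, and the diagonal matrix `Sigma` are all illustrative assumptions, chosen so that the two rules disagree.

```python
import numpy as np

def assign(X, centers, Sigma=None):
    """Index of the closest center for each point.

    With Sigma=None the squared Euclidean distance is used; otherwise the
    quadratic form (x - mu)^T Sigma (x - mu).
    """
    diffs = X[:, None, :] - centers[None, :, :]          # (m, k, d)
    if Sigma is None:
        d2 = (diffs ** 2).sum(axis=2)                    # (m, k)
    else:
        d2 = np.einsum("mkd,de,mke->mk", diffs, Sigma, diffs)
    return d2.argmin(axis=1)

X = np.array([[0.0, 1.0], [2.0, 0.0]])
centers = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.diag([1.0, 10.0])    # weights differences in the second coordinate heavily

print(assign(X, centers))          # [0 1]
print(assign(X, centers, Sigma))   # [1 0]
```

Because `Sigma` penalizes differences in the second coordinate ten times more, each point switches to the center that matches it in that coordinate.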

## Convergence of K-means

We will prove that the K-means algorithm converges to a local optimum in a finite number of steps.

Let $J = \sum_{i=1}^m \sum_{j=1}^k r^{ij} || x^i-\mu^j|| ^2$, with $r^{i, \pi(i)} = 1$ and $r^{ij} = 0$ otherwise, be the cost function. We know that:

* Neither step of an iteration can increase $J$:
 * the cluster assignment step picks, for each point, the closest center: $\pi(i) = \arg \min_j ||x^i - \mu^j||^2$
 * the center adjustment step picks, for each cluster, the best center for its current members: $\mu^j = \arg \min_c \sum_{i: \pi(i) = j} ||x^i - c||^2$
* $J \geq 0$ since it is a sum of squared distances
* Given $m$ data points and $k$ clusters, there are at most $k^m$ ways to assign data points to clusters (each data point has $k$ choices)

Because $J$ never increases, and strictly decreases whenever the assignment changes (breaking ties in favor of the current cluster), no assignment can be visited twice. With only finitely many assignments, the algorithm must reach an assignment that no longer changes, at which point neither step can lower $J$ further. We conclude that the algorithm converges to a local optimum in a finite number of steps.
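The argument can be illustrated with a minimal sketch of the two-step iteration that records $J$ after every assignment step and every center adjustment step. The function name `kmeans`, the synthetic two-blob data, and the choice of the first two points as initial centers are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def kmeans(X, init_centers, max_iter=100):
    """Alternate assignment and center adjustment; return labels, centers, J history."""
    centers = np.array(init_centers, dtype=float)   # copy so the input is not modified
    labels = None
    history = []
    for _ in range(max_iter):
        # Assignment step: closest center under squared Euclidean distance.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignment unchanged: converged
        labels = new_labels
        history.append(float(d2[np.arange(len(X)), labels].sum()))
        # Center adjustment step: mean of each non-empty cluster.
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
        history.append(float(((X - centers[labels]) ** 2).sum()))
    return labels, centers, history

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])
labels, centers, history = kmeans(X, X[:2])

# J should never increase from one recorded step to the next.
non_increasing = all(b <= a + 1e-9 for a, b in zip(history, history[1:]))
print(non_increasing, len(history))
```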


## Example of K-means using Manhattan distance

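Given the following five-point configuration, we will run K-means by hand using Manhattan distance, $d(x, y) = \sum_l |x_l - y_l|$. With this distance, the center adjustment step minimizes a sum of absolute deviations in each coordinate separately, so each coordinate of the new center is a median of the cluster's points in that coordinate.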
![ex 2.13](data/points.png)

Initialization:

          k = 2, m = 5
          cB = (2, 1), cA = (-3, -1)

Iteration 1:

* Cluster assignment

        m1 = (2, 2)    d(m1, cB)= 1  d(m1, cA) = 8 -> pi(1) = B
        m2 = (-1, 1)   d(m2, cB)= 3  d(m2, cA) = 4 -> pi(2) = B
        m3 = (3, 1)    d(m3, cB)= 1  d(m3, cA) = 8 -> pi(3) = B
        m4 = (0, -1)   d(m4, cB)= 4  d(m4, cA) = 3 -> pi(4) = A
        m5 = (-2, -2)  d(m5, cB)= 7  d(m5, cA) = 2 -> pi(5) = A
        J = 1 + 3 + 1 + 3 + 2 = 10
        
* Center adjustment:

        A: 
        arg min Jx = |0-cAx| + |-2-cAx|     (any cAx in [-2, 0]; take -1)
        arg min Jy = |-1-cAy| + |-2-cAy|    (any cAy in [-2, -1]; take -1)
        => cA = (-1, -1)

        B: 
        arg min Jx = |2-cBx| + |-1-cBx| + |3-cBx|    (median: 2)
        arg min Jy = |2-cBy| + |1-cBy| + |1-cBy|     (median: 1)
        => cB = (2, 1)
        
Iteration 2:

* Cluster assignment

        m1 = (2, 2)    d(m1, cB)= 1  d(m1, cA) = 6 -> pi(1) = B
        m2 = (-1, 1)   d(m2, cB)= 3  d(m2, cA) = 2 -> pi(2) = A
        m3 = (3, 1)    d(m3, cB)= 1  d(m3, cA) = 6 -> pi(3) = B
        m4 = (0, -1)   d(m4, cB)= 4  d(m4, cA) = 1 -> pi(4) = A
        m5 = (-2, -2)  d(m5, cB)= 7  d(m5, cA) = 2 -> pi(5) = A
        J = 1 + 2 + 1 + 1 + 2 = 7
        
* Center adjustment:

        A: 
        arg min Jx = |-1-cAx| + |0-cAx| + |-2-cAx|    (median: -1)
        arg min Jy = |1-cAy| + |-1-cAy| + |-2-cAy|    (median: -1)
        => cA = (-1, -1)

        B: 
        arg min Jx = |2-cBx| + |3-cBx|    (any cBx in [2, 3]; take 2)
        arg min Jy = |2-cBy| + |1-cBy|    (any cBy in [1, 2]; take 1)
        => cB = (2, 1)
        
        
Iteration 3:

* Cluster assignment

        m1 = (2, 2)    d(m1, cB)= 1  d(m1, cA) = 6 -> pi(1) = B
        m2 = (-1, 1)   d(m2, cB)= 3  d(m2, cA) = 2 -> pi(2) = A
        m3 = (3, 1)    d(m3, cB)= 1  d(m3, cA) = 6 -> pi(3) = B
        m4 = (0, -1)   d(m4, cB)= 4  d(m4, cA) = 1 -> pi(4) = A
        m5 = (-2, -2)  d(m5, cB)= 7  d(m5, cA) = 2 -> pi(5) = A
        J = 1 + 2 + 1 + 1 + 2 = 7
        
Since neither the assignments nor $J$ changed between iterations 2 and 3, the algorithm has converged. The final cluster assignment and center locations are:
    
        pi(1) = B
        pi(2) = A
        pi(3) = B
        pi(4) = A
        pi(5) = A
        cA = (-1, -1)
        cB = (2, 1)
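The hand computation can be cross-checked with a short sketch of K-means under Manhattan distance, where each center is updated to the coordinate-wise median of its cluster. The function name `kmedians` is an assumption, and index 0 stands for cluster A and index 1 for cluster B. Note that `np.median` returns the midpoint when a cluster has an even number of points, so a center may differ from the one chosen by hand while giving the same cost.

```python
import numpy as np

# The five points m1..m5 and the initial centers from the example.
X = np.array([[2, 2], [-1, 1], [3, 1], [0, -1], [-2, -2]], dtype=float)
init_centers = np.array([[-3, -1], [2, 1]], dtype=float)   # row 0 = cA, row 1 = cB

def kmedians(X, init_centers, max_iter=20):
    """K-means with Manhattan distance and coordinate-wise median updates."""
    centers = init_centers.copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: closest center under Manhattan (L1) distance.
        d = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # assignment unchanged: converged
        labels = new_labels
        # Center adjustment step: coordinate-wise median of each non-empty cluster.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = np.median(X[labels == j], axis=0)
    J = float(np.abs(X - centers[labels]).sum())
    return labels, centers, J

labels, centers, J = kmedians(X, init_centers)
print(labels)    # [1 0 1 0 0], i.e. B, A, B, A, A
print(centers)   # cA = (-1, -1), cB = (2.5, 1.5)
print(J)         # 7.0
```

The assignments and the cost $J = 7$ agree with the hand computation. The center of B comes out as $(2.5, 1.5)$ rather than $(2, 1)$ because both lie in the set of minimizers for a two-point cluster.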

