****
# K-Means without Scikit-Learn
****
<p style="text-align:right"><i>Jesus Perez Colino<br>First version: November 2016</i></p>

## About this notebook: 
****
Notebook prepared by **Jesus Perez Colino** Version 0.2, First Released: 01/10/2016, Alpha (work-in-progress)

- This work is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). This work is offered for free, with the hope that it will be useful.


- **Summary**: This Jupyter notebook presents a minimal pure-Python implementation of the **K-Means algorithm**, using only the standard library.

In [1]:
import IPython
from sys import version 

print (' Reproducibility conditions for this notebook '.center(85,'-'))
print ('Python version:       ' + version)
print ('IPython version:      ' + IPython.__version__)
print ('-'*85)

-------------------- Reproducibility conditions for this notebook -------------------
Python version:       3.5.3 |Anaconda 4.4.0 (x86_64)| (default, Mar  6 2017, 12:15:08) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
IPython version:      5.3.0
-------------------------------------------------------------------------------------


## K-Means algorithm

Given a set of observations $(x_1,x_2,\ldots, x_n)$, where each observation is a $d$-dimensional real vector, **$k$-means clustering** looks for a partition of the $n$ observations into $k\leq n$ sets or clusters $S=\{S_1,S_2,\ldots,S_k\}$ that *minimizes* the within-cluster sum of squared distances between each observation and the mean, or **centroid**, of its cluster (equivalently, it minimizes the size-weighted within-cluster variance).

More formally, the objective is to find a particular partition of clusters $S$ that is the solution of:


$$ \underset{S} {\operatorname{arg\,min}}  \sum_{i=1}^{k} \sum_{ x \in S_i} \left\|  x - \mu_i \right\|^2 = \underset{S} {\operatorname{arg\,min}}  \sum_{i=1}^{k} |S_i| \operatorname{Var} S_i $$


where $\mu_i$ is the mean or centroid of points in the cluster $S_i$. 
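As an illustration of the objective, the following sketch evaluates both sides of the identity on a toy dataset. The helper names (`centroid`, `wcss`, `size_times_variance`) and the four points are assumptions made for this example only; $\operatorname{Var} S_i$ is taken here as the population variance summed over coordinates.

```python
from math import fsum
from statistics import pvariance

def centroid(cluster):
    'Component-wise mean of a non-empty list of equal-length tuples'
    return tuple(fsum(col) / len(cluster) for col in zip(*cluster))

def wcss(partition):
    'Within-cluster sum of squares: sum over clusters of ||x - mu_i||^2'
    total = 0.0
    for cluster in partition:
        mu = centroid(cluster)
        total += fsum((a - b) ** 2 for x in cluster for a, b in zip(x, mu))
    return total

def size_times_variance(partition):
    'Same objective written as the sum of |S_i| * Var(S_i)'
    return fsum(len(c) * fsum(pvariance(col) for col in zip(*c))
                for c in partition)

# Two candidate partitions of the same four points (toy data)
good = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
bad = [[(0, 0), (10, 0)], [(2, 0), (12, 0)]]

print(wcss(good), size_times_variance(good))
print(wcss(bad), size_times_variance(bad))
```

The partition that groups neighbouring points has an objective of 4, while the one that mixes distant points has an objective of 100, and both formulations agree on each partition.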

The most common **k-means algorithm** uses an iterative refinement technique. Given an initial set of $k$ means $\mu_1,\ldots,\mu_k$ (see below), the algorithm proceeds by alternating between two steps:

- **Assignment step**: Assign each observation to the cluster whose mean is at the least squared *Euclidean distance*; intuitively, this is the "nearest" mean.

\begin{equation}
S_i^{(t)} = \big \{ x_p : \big \| x_p - \mu^{(t)}_i \big \|^2 \le \big \| x_p - \mu^{(t)}_j \big \|^2 \ \forall j, 1 \le j \le k \big\}
\end{equation} 

where each $x_p$ is assigned to exactly one cluster $S_i^{(t)}$, even if it is equidistant from two or more means.

- **Update step**: Calculate the new means to be the **centroids** of the observations in the new clusters:

\begin{equation}
\mu^{(t+1)}_i = \frac{1}{|S^{(t)}_i|} \sum_{x_j \in S^{(t)}_i} x_j 
\end{equation}
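The two steps above can be sketched as a pair of small functions. This is only an illustration under stated assumptions: the names `sqdist`, `assignment_step` and `update_step` are hypothetical, ties are broken toward the lowest cluster index (the behaviour of Python's `min`), and an empty cluster simply keeps its previous mean.

```python
from math import fsum

def sqdist(p, q):
    'Squared Euclidean distance between two equal-length tuples'
    return fsum((a - b) ** 2 for a, b in zip(p, q))

def assignment_step(points, means):
    'Index of the nearest mean for each point; ties go to the lowest index'
    return [min(range(len(means)), key=lambda i: sqdist(p, means[i]))
            for p in points]

def update_step(points, labels, means):
    'New mean of each cluster; an empty cluster keeps its old mean (assumption)'
    new_means = []
    for i, old_mean in enumerate(means):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new_means.append(tuple(fsum(col) / len(members)
                                   for col in zip(*members)))
        else:
            new_means.append(old_mean)
    return new_means

points = [(1, 0), (2, 0), (3, 0)]
means = [(0, 0), (4, 0)]

labels = assignment_step(points, means)
new_means = update_step(points, labels, means)
print(labels)
print(new_means)
```

The point `(2, 0)` is equidistant from both means and still receives exactly one label, so the labels are `[0, 0, 1]` and the updated means are `(1.5, 0.0)` and `(3.0, 0.0)`.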

The algorithm has converged when the assignments no longer change. However, there is no guarantee that this algorithm finds the global optimum; the result depends on the initial means.
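A minimal sketch of this stopping rule follows, assuming a hypothetical function `lloyd` that takes the initial means explicitly so the run is deterministic. It is not the implementation used later in this notebook, which runs a fixed number of iterations instead. The second call shows a start that converges to a poor local optimum on four corners of a 4-by-1 rectangle.

```python
from math import fsum

def sqdist(p, q):
    'Squared Euclidean distance between two equal-length tuples'
    return fsum((a - b) ** 2 for a, b in zip(p, q))

def lloyd(points, means, max_iter=100):
    'Alternate assignment and update until the labels stop changing'
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: nearest mean for every point
        new_labels = [min(range(len(means)), key=lambda i: sqdist(p, means[i]))
                      for p in points]
        if new_labels == labels:
            return means, labels, iteration   # converged
        labels = new_labels
        # Update step: an empty cluster keeps its old mean (assumption)
        updated = []
        for i, old_mean in enumerate(means):
            members = [p for p, lab in zip(points, labels) if lab == i]
            updated.append(tuple(fsum(c) / len(members) for c in zip(*members))
                           if members else old_mean)
        means = updated
    return means, labels, max_iter

def objective(points, means, labels):
    'Within-cluster sum of squares for a given labelling'
    return fsum(sqdist(p, means[lab]) for p, lab in zip(points, labels))

corners = [(0, 0), (0, 1), (4, 0), (4, 1)]

good_means, good_labels, _ = lloyd(corners, [(0, 0), (4, 0)])
poor_means, poor_labels, _ = lloyd(corners, [(2, 0), (2, 1)])

print(objective(corners, good_means, good_labels))
print(objective(corners, poor_means, poor_labels))
```

Both runs stop because the assignments are stable, yet the first reaches an objective of 1 (left and right pairs) while the second stays at 16 (top and bottom pairs).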

The algorithm is often presented as assigning objects to the nearest cluster by distance. Using a distance function other than (squared) Euclidean distance may stop the algorithm from converging. Various modifications of k-means, such as spherical k-means and k-medoids, have been proposed to allow other distance measures.


In [2]:
from random import sample
from math import fsum, sqrt
from collections import defaultdict
from functools import partial

In [3]:
def mean(data):
    'Accurate arithmetic mean'
    if not isinstance(data, list):
        data = list(data)
    return fsum(data) / len(data)

def transpose(matrix):
    'Swap rows with columns for a 2-D array'
    return zip(*matrix)

def distance(p, q, sqrt=sqrt, fsum=fsum, zip=zip):
    'Multi-dimensional euclidean distance between points p and q'
    return sqrt(fsum((x1 - x2) ** 2.0 for x1, x2 in zip(p, q)))

def assign_data(centroids, data):
    'Assign each data point to its closest centroid'
    d = defaultdict(list)
    for point in data:
        centroid = min(centroids, key=partial(distance, point))
        d[centroid].append(point)
    return dict(d)

def compute_centroids(groups):
    'Compute the centroid of each group'
    return [tuple(map(mean, transpose(group))) for group in groups]

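def k_means(data, k=2, iterations=10):
    'Return k-centroids for the data'
    data = list(data)
    # Initialise with k observations drawn at random without replacement,
    # so results can differ from run to run
    centroids = sample(data, k)
    for i in range(iterations):
        # Assignment step: group every point under its nearest centroid
        labeled = assign_data(centroids, data)
        # Update step: a centroid that attracted no points is dropped here,
        # so fewer than k centroids may be returned
        centroids = compute_centroids(labeled.values())
    return centroids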

def quality(labeled):
    'Mean value of squared distances from data to its assigned centroid'
    return mean(distance(c, p) ** 2 for c, pts in labeled.items() for p in pts)

## Examples

### Simple example with six 3-D points clustered into two groups

In [4]:
points=[(10, 41, 23),
        (22, 30, 29),
        (11, 42, 5),
        (20, 32, 4),
        (12, 40, 12),
        (21, 36, 23)]

centroids = k_means(points, k=2)
print(assign_data(centroids, points))

{(17.666666666666668, 35.666666666666664, 25.0): [(10, 41, 23), (22, 30, 29), (21, 36, 23)], (14.333333333333334, 38.0, 7.0): [(11, 42, 5), (20, 32, 4), (12, 40, 12)]}



### Example with a richer dataset

In [5]:
data = [ (10, 30),
         (12, 50),
         (14, 70),
         (9, 150),
         (20, 175),
         (8, 200),
         (14, 240),
         (50, 35),
         (40, 50),
         (45, 60),
         (55, 45),
         (60, 130),
         (60, 220),
         (70, 150),
         (60, 190),
         (90, 160)]

print('k     quality')
print('-     -------')
for k in range(1, 8):
    centroids = k_means(data, k, iterations=20)
    d = assign_data(centroids, data)
    print('{0}    {1:8,.1f}'.format(k, quality(d)))

k     quality
-     -------
1     5,583.5
2     1,337.8
3       851.2
4       666.6
5       434.9
6       239.5
7       386.0
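In the run above, the quality for k=7 (386.0) is worse than for k=6 (239.5), which is consistent with a single random initialisation landing in a poor local optimum. One common remedy is to repeat the clustering from several random starts and keep the best result. The sketch below is self-contained and restates compact versions of the helpers. The names `run_once` and `best_of`, the `restarts` and `seed` parameters, and the use of a seeded `random.Random` are assumptions of this example, not part of the notebook's code.

```python
from collections import defaultdict
from math import fsum
from random import Random

def sqdist(p, q):
    'Squared Euclidean distance between two equal-length tuples'
    return fsum((a - b) ** 2 for a, b in zip(p, q))

def assign(centroids, data):
    'Group every point under its nearest centroid'
    groups = defaultdict(list)
    for point in data:
        groups[min(centroids, key=lambda c: sqdist(c, point))].append(point)
    return dict(groups)

def run_once(data, k, iterations, rng):
    'One k-means run from a random start; returns the labelled groups'
    centroids = rng.sample(data, k)
    for _ in range(iterations):
        groups = assign(centroids, data)
        centroids = [tuple(fsum(col) / len(g) for col in zip(*g))
                     for g in groups.values()]
    return assign(centroids, data)

def quality(labeled):
    'Mean squared distance from each point to its assigned centroid'
    n = sum(len(pts) for pts in labeled.values())
    return fsum(sqdist(c, p) for c, pts in labeled.items() for p in pts) / n

def best_of(data, k, restarts=25, iterations=20, seed=0):
    'Keep the labelling with the lowest quality score over several restarts'
    rng = Random(seed)   # seeded so the experiment is repeatable
    runs = (run_once(data, k, iterations, rng) for _ in range(restarts))
    return min(runs, key=quality)

data = [(10, 30), (12, 50), (14, 70), (9, 150), (20, 175), (8, 200),
        (14, 240), (50, 35), (40, 50), (45, 60), (55, 45), (60, 130),
        (60, 220), (70, 150), (60, 190), (90, 160)]

print('k     quality')
print('-     -------')
for k in range(1, 8):
    print('{0}    {1:8,.1f}'.format(k, quality(best_of(data, k))))
```

With restarts, the reported quality for a given k can never be worse than that of the first run from the same seed, although it is still not guaranteed to be the global optimum.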
