# KMeans from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import time
import pylab as pl
from IPython import display
from warnings import simplefilter 
simplefilter(action='ignore', category=DeprecationWarning)

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `k`: number of clusters
- `init`: `"random"` or an array of shape `(k,d)`  
    - if `"random"`, `k` points are chosen randomly from `X` as the initial cluster centers  
    - if an array, its rows are used as the initial cluster centers

**Output:**  
A tuple `(y_new, centers)`.  
- `y_new`: an array of shape `(N,)` that records the label of each sample, an integer in `0, ..., k-1`
- `centers`: an array of shape `(k,d)` that records the cluster centers

**Steps:**
1. Initialize a collection of centers $\mu_0,\ldots,\mu_{k-1}$:  
    - if `init` is an array, the centers are the rows of `init`.
    - if `init == "random"`, the centers are `k` rows of `X` chosen at random.  
2. Label each sample ${\bf x}_i$ by $j$ if $\mu_j$ is the closest center to ${\bf x}_i$.
3. Call the points with label $j$ group $j$.  Update $\mu_j$ to the mean of the points in group $j$.  
4. Repeat Steps 2 and 3 until `y_new` no longer changes.
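
As a small illustration of Steps 2 and 3, the sketch below runs one labeling pass and one center update on four points on a line. The data `X_toy` and the starting centers `mu` are made up for this example and are not part of the exercise.

```python
import numpy as np

# Toy data (hypothetical): four points on a line and two starting centers.
X_toy = np.array([[0.0], [1.0], [9.0], [10.0]])
mu = np.array([[0.0], [1.0]])

# Step 2: label each sample by the index of its closest center.
dist = np.linalg.norm(X_toy[:, np.newaxis, :] - mu[np.newaxis, :, :], axis=2)
labels = np.argmin(dist, axis=1)

# Step 3: move each center to the mean of its group.
new_mu = np.array([X_toy[labels == j].mean(axis=0) for j in range(len(mu))])

print(labels)  # [0 1 1 1]
print(new_mu)
```

After this single pass the second center has moved to the mean of the three points labeled 1. A second pass would relabel the point at 1, so the loop in Step 4 keeps going.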

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts you do not yet know how to implement.  

    1. 
    2. 
    3. ...

## Code

In [None]:
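# Return the matrix of pairwise Euclidean distances between
# the rows of X (shape (N,d)) and the rows of Y (shape (M,d)); the result has shape (N,M)
def distance_mtx(X, Y):
    X_col = X[:, np.newaxis, :]  # shape (N,1,d)
    Y_row = Y[np.newaxis, :, :]  # shape (1,M,d)
    diff = X_col - Y_row         # broadcasts to shape (N,M,d)
    return np.linalg.norm(diff, axis=2)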
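
A quick sanity check of the broadcasting, using points whose distances are easy to verify by hand. The function is repeated in compact form so the snippet runs on its own; the arrays `A` and `B` are made up for the check.

```python
import numpy as np

def distance_mtx(X, Y):
    # (N,1,d) - (1,M,d) broadcasts to (N,M,d); the norm over the last axis gives (N,M)
    return np.linalg.norm(X[:, np.newaxis, :] - Y[np.newaxis, :, :], axis=2)

A = np.array([[0.0, 0.0], [3.0, 4.0]])              # two samples
B = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])  # three centers
D = distance_mtx(A, B)

print(D.shape)  # (2, 3)
print(D)        # rows: distances from each sample to the three centers
```

Entry `D[i, j]` is the distance from sample `i` to center `j`, so `np.argmin(D, axis=1)` gives the index of the closest center for each sample.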

In [None]:
### your answer here
# X: an array of shape (N,d)
# k: number of clusters
# init: "random" or an array of shape (k,d) holding the initial centers
def MyKmeans(X, k, init="random"):
    n = X.shape[0]  # number of samples
    if isinstance(init, str) and init == "random":
        # choose k distinct rows of X as the initial centers
        center = X[np.random.choice(n, k, replace=False)].astype(float)
    else:
        # copy so that the caller's array is not modified in place
        center = np.array(init, dtype=float)

    # label every sample with the index of its closest center
    dist_mtx = distance_mtx(X, center)
    label = np.argmin(dist_mtx, axis=1)

    # repeat until the labels no longer change
    pre_label = np.full(n, -1)
    while not np.array_equal(label, pre_label):
        pre_label = np.copy(label)

        # move each center to the mean of its group
        for i in range(k):
            group = X[label == i]
            if len(group) > 0:  # an empty group keeps its previous center
                center[i] = np.mean(group, axis=0)
        # relabel every sample
        dist_mtx = distance_mtx(X, center)
        label = np.argmin(dist_mtx, axis=1)

    return label, center

## Test
Take some sample data from [KMeans-with-scikit-learn](KMeans-with-scikit-learn.ipynb) and check whether your code produces output similar to that of the existing packages.

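##### Two Gaussian clusters
200 points in the plane: 100 drawn from each of two bivariate normal distributions, centered at `(2.5, 0)` and `(-2.5, 0)`, with opposite correlations.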

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
plt.scatter(X[:,0], X[:,1])

In [None]:
### results with your code
my_label, my_center = MyKmeans(X, 2)
plt.scatter(X[:,0], X[:,1], c = my_label )

In [None]:
### results with existing packages
model = KMeans(n_clusters = 2)
model.fit(X)
plt.scatter(X[:,0], X[:,1], c = model.labels_ )

## Comparison

##### Exercise 1
Modify your code so that it prints the inertia at each iteration.  
Is it decreasing?
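
Here inertia means the sum of squared distances from each sample to the center of its own cluster. A minimal helper for computing it from a labeling is sketched below; the name `inertia_of` and the toy arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def inertia_of(X, labels, centers):
    """Sum of squared distances from each sample to the center of its own cluster."""
    diffs = X - centers[labels]  # centers[labels] has shape (N,d)
    return float(np.sum(diffs**2))

X_toy = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
labels = np.array([0, 0, 1])
centers = np.array([[1.0, 0.0], [10.0, 0.0]])

print(inertia_of(X_toy, labels, centers))  # 2.0
```

Indexing `centers` by the label array avoids a loop over clusters, at the cost of building an `(N,d)` temporary.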

In [None]:
### your answer here

# X: an array of shape (N,d)
# k: number of clusters
# init: "random" or an array of shape (k,d) holding the initial centers
def _MyKmeans(X, k, init="random"):
    n = X.shape[0]  # number of samples
    if isinstance(init, str) and init == "random":
        center = X[np.random.choice(n, k, replace=False)].astype(float)
    else:
        # copy so that the caller's array is not modified in place
        center = np.array(init, dtype=float)

    # label every sample with the index of its closest center
    dist_mtx = distance_mtx(X, center)
    label = np.argmin(dist_mtx, axis=1)
    # inertia: sum of squared distances from each sample to its closest center
    inertia = np.sum(np.min(dist_mtx, axis=1)**2)

    times = 0  # number of center updates
    pre_label = np.full(n, -1)
    # repeat until the labels no longer change
    while not np.array_equal(label, pre_label):
        times += 1
        print("inertia =", inertia)

        pre_label = np.copy(label)
        # move each center to the mean of its group
        for i in range(k):
            group = X[label == i]
            if len(group) > 0:  # an empty group keeps its previous center
                center[i] = np.mean(group, axis=0)
        # relabel every sample and recompute the inertia
        dist_mtx = distance_mtx(X, center)
        label = np.argmin(dist_mtx, axis=1)
        inertia = np.sum(np.min(dist_mtx, axis=1)**2)

    # the inertia after the last update is not printed inside the loop
    print("final inertia =", inertia)
    print("update times :", times)
    return label, center

In [None]:
my_label, my_center = _MyKmeans(X, 9)

In [None]:
### plot the labeling at every iteration of k-means
# X: an array of shape (N,d); only the first two columns are plotted
# k: number of clusters
# init: "random" or an array of shape (k,d) holding the initial centers
# frame_time: seconds to pause on each frame
def AnimatedMyKmeans(X, k, init="random", frame_time=1.1):
    n = X.shape[0]  # number of samples
    if isinstance(init, str) and init == "random":
        center = X[np.random.choice(n, k, replace=False)].astype(float)
    else:
        # copy so that the caller's array is not modified in place
        center = np.array(init, dtype=float)

    # label every sample with the index of its closest center
    dist_mtx = distance_mtx(X, center)
    label = np.argmin(dist_mtx, axis=1)

    times = 0  # number of center updates
    pre_label = np.full(n, -1)
    # repeat until the labels no longer change
    while not np.array_equal(label, pre_label):
        times += 1
        # draw one frame of the animation
        display.clear_output(wait=True)
        plt.scatter(X[:,0], X[:,1], c=label)
        plt.show()
        time.sleep(frame_time)

        pre_label = np.copy(label)
        # move each center to the mean of its group
        for i in range(k):
            group = X[label == i]
            if len(group) > 0:  # an empty group keeps its previous center
                center[i] = np.mean(group, axis=0)
        # relabel every sample
        dist_mtx = distance_mtx(X, center)
        label = np.argmin(dist_mtx, axis=1)

    print("update times :", times)
    return label, center

In [None]:
# rerun this cell to watch k-means proceed step by step
my_label, my_center = AnimatedMyKmeans(X, 4)

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target
```

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target

###### 2(a)
Use your function to apply the $k$-means algorithm with $k=3$.  
What is the inertia?  
Run it several times to see if the results are always the same.  

In [None]:
### your answer here
# run 5 times; with random initial centers the inertia can differ between runs
for i in range(5):
    my_label, my_center = MyKmeans(X, 3)

    inertia = 0
    for j in range(3):
        X_pick = X[my_label == j]
        inertia += np.sum((X_pick - my_center[j])**2)

    print("inertia =", inertia)

###### 2(b)
Use `sklearn.cluster.KMeans` to apply the $k$-means algorithm with $k=3$.  
What is the inertia?  
Run it several times to see if the results are always the same.  

In [None]:
### your answer here
# run 5 times; the inertia came out the same in every run
for i in range(5):
    model = KMeans(n_clusters = 3)
    model.fit(X)
    print("inertia =", model.inertia_)

###### 2(c)
Pick any labeling `labels` that you like.  
Compute the cluster centers of `X` corresponding to `labels` .  
Is this inertia bigger or smaller than the previous two answers?

In [None]:
### your answer here
# use labels y_rand chosen uniformly at random
y_rand = np.random.randint(3, size=X.shape[0])
rand_inertia = 0

for i in range(3):
    X_pick = X[y_rand == i]
    clus_center = np.mean(X_pick, axis = 0)
    rand_inertia += np.sum((X_pick - clus_center)**2)

print("inertia in random =", rand_inertia)
# This is much larger than the previous answers:
# random labels ignore where the points are,
# so points in the same group can be far apart.

###### 2(d)
The label `y_iris` is the "correct" real-world answer.  
Compute the cluster centers and the inertia.  
Is this inertia bigger or smaller than the answers in 2(a) and 2(b)?

In [None]:
### your answer here
real_inertia = 0

for i in range(3):
    X_pick = X[np.where(y_iris == i)]
    center = np.mean(X_pick, axis = 0)    
    real_inertia += np.sum((X_pick - center)**2)
    

print("real inertia =", real_inertia)
# It is bigger than the result in 2(b), and in some runs bigger than the result in 2(a)

##### Exercise 3
The $k$-means algorithm is a deterministic algorithm once the initial cluster centers have been determined.  
Therefore, your function and `sklearn.cluster.KMeans` should obtain the same result when `init` is given.  
Check if this is true.  
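
When comparing two label arrays, it can help to ignore how the clusters happen to be numbered. The sketch below checks whether two labelings describe the same grouping; `same_partition` is a hypothetical helper, not a function from `sklearn` or from this notebook.

```python
import numpy as np

def same_partition(a, b):
    """True if labelings a and b group the samples identically, up to renaming labels."""
    a, b = np.asarray(a), np.asarray(b)
    if a.shape != b.shape:
        return False
    pairs = set(zip(a.tolist(), b.tolist()))
    # the labels correspond one-to-one exactly when the number of distinct
    # (a, b) pairs equals the number of distinct labels on each side
    return len(pairs) == len(set(a.tolist())) == len(set(b.tolist()))

print(same_partition([0, 0, 1, 2], [2, 2, 0, 1]))  # True
print(same_partition([0, 0, 1, 2], [0, 1, 1, 2]))  # False
```

With a shared `init` the cluster indices should already agree, in which case `np.array_equal` on the two label arrays is the stricter test.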

Note:  There are still subtle differences. For example, `sklearn` tests convergence by how far the centers move, whereas our algorithm tests whether the labels have stopped changing.
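
To make the contrast concrete, here is a rough sketch of a center-based stopping test. The function name `centers_converged`, the squared-shift form, and the default `tol` are assumptions chosen for illustration; this is not claimed to be the exact rule `sklearn` applies.

```python
import numpy as np

def centers_converged(old_centers, new_centers, tol=1e-4):
    """True if the total squared shift of the centers is at most tol (assumed rule)."""
    shift = np.sum((np.asarray(new_centers) - np.asarray(old_centers))**2)
    return bool(shift <= tol)

old = np.array([[0.0, 0.0], [5.0, 5.0]])
print(centers_converged(old, old + 1e-3))  # True
print(centers_converged(old, old + 1.0))   # False
```

A test of this kind can stop while the centers are still moving slightly, whereas the label-based test in `MyKmeans` runs until no sample changes group.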

In [None]:
### your answer here
# run sklearn.cluster.KMeans 5 times
# with the initial centers fixed to X[2], X[4], X[6]
init = np.array([X[2], X[4], X[6]])

for i in range(5):
    # n_init=1: a single run from the given centers
    model = KMeans(n_clusters = 3, init = init, n_init = 1)
    model.fit(X)
    print("result label by model :\n", model.labels_)
    
# the results are the same in every run

In [None]:
# run MyKmeans 5 times
# with the initial centers fixed to X[2], X[4], X[6]
init = np.array([X[2], X[4], X[6]])

for i in range(5):
    # pass a copy so that every run starts from the same centers
    my_label, my_center = MyKmeans(X, 3, init = init.copy())
    print("result label by MyKmeans:\n", my_label)
    
# the results are the same in every run