# Lab5 - Machine Learning
## K-means Clustering

In this lab session we will implement the K-means algorithm. K-means is one of the most well-known algorithms for data clustering.

Given data $\mathbf{x} = \{x_1, x_2,\ldots,x_N\}$, where each $x_n \in \mathbb{R}^D$, we want to group these data points into $K$ clusters ($K \ll N$), where each cluster is represented by a mean vector $\mu_k \in \mathbb{R}^D$.

Our goal is to partition the data into clusters by finding appropriate values for the means $\mu_k, k=1,\ldots,K$.


### K-means Algorithm steps
<ol>
  <li>For $n = 1, \ldots, N$, assign each data point $x_n$ to the cluster with the closest center $\mu_k$</li>
    
  <br>
  <li>Calculate the new cluster centers $$\mu_k = \frac{\sum_{n=1}^N r_{nk} \, x_n }{\sum_{n=1}^N r_{nk}}$$ </li>
  <br>
  <li>Check for convergence; stop if converged, otherwise go back to step 1</li>
</ol>


Here the $r_{nk}$ are binary indicator variables that encode the cluster assignments:
$$ r_{nk} \in \{0,1\}, \, \sum_{k=1}^K r_{nk}=1$$
so when $x_n$ belongs to cluster $k$, then $r_{nk}=1$ and $r_{nj}=0$ for each $j \neq k$.
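
As a rough illustration of steps 1 and 2, the sketch below builds the one-hot matrix $r$ from squared Euclidean distances and then recomputes the means on a tiny hand-made dataset. The helper names `assign_clusters` and `update_means` and the toy arrays are assumptions made for this example and are not part of the lab skeleton. A cluster that receives no points keeps its previous center here, which is only one possible convention.

```python
import numpy as np

def assign_clusters(X, M):
    """Step 1: return the one-hot assignment matrix r of shape (N, K)."""
    # Squared Euclidean distance between every point and every center, shape (N, K)
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    r = np.zeros(d2.shape)
    # Put a single 1 in each row, at the column of the closest center
    r[np.arange(X.shape[0]), d2.argmin(axis=1)] = 1
    return r

def update_means(X, r, M_old):
    """Step 2: mean of the points assigned to each cluster."""
    counts = r.sum(axis=0)                # number of points per cluster, shape (K,)
    M_new = M_old.astype(float).copy()
    nonempty = counts > 0                 # assumption: empty clusters keep their old center
    sums = r.T.dot(X)                     # per-cluster sums of the points, shape (K, D)
    M_new[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return M_new

# Toy data: two points near the origin and two near (10, 10)
X_toy = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
M_toy = np.array([[1., 1.], [9., 9.]])

r_toy = assign_clusters(X_toy, M_toy)
M_new = update_means(X_toy, r_toy, M_toy)
print(r_toy)
print(M_new)
```

Each row of `r_toy` contains exactly one 1, matching the constraint $\sum_{k=1}^K r_{nk}=1$.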

The cost function is
$$ J(r_1,\ldots,r_N, \mu_1,\ldots,\mu_K) = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \lVert x_n - \mu_k \rVert ^2$$
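
To make the cost concrete, here is a small sketch that evaluates $J$ for a given assignment matrix and set of means. The function name `kmeans_cost` and the toy values are assumptions for illustration only.

```python
import numpy as np

def kmeans_cost(X, r, M):
    """J = sum_n sum_k r_nk * ||x_n - mu_k||^2."""
    # Squared distances between every point and every center, shape (N, K)
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    # r selects, for each point, the distance to its own cluster center
    return float((r * d2).sum())

X_toy = np.array([[0., 0.], [0., 2.], [10., 10.], [10., 12.]])
r_toy = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])

M_before = np.array([[1., 1.], [9., 9.]])     # arbitrary starting centers
M_after = np.array([[0., 1.], [10., 11.]])    # per-cluster means of X_toy under r_toy

J_before = kmeans_cost(X_toy, r_toy, M_before)
J_after = kmeans_cost(X_toy, r_toy, M_after)
print(J_before, J_after)
```

With the assignments held fixed, moving each center to the mean of its points cannot increase $J$, so the cost recorded at each iteration should be non-increasing.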

In [None]:
from __future__ import division
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.random.seed(0)

In [None]:
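N = 2000
X = np.zeros((N, 2))
# Cumulative mixing probabilities: the clusters are drawn with probability 0.3, 0.3 and 0.4
p = [0.3, 0.6, 1]
# True cluster centers used to generate the data
Mtrue = np.array([[0,0], [-2,3], [2,3]])
for i in range(N):
    u = np.random.rand()
    if u < p[0]:
        k = 0
    elif u < p[1]:
        k = 1
    else:
        k = 2
    # Isotropic Gaussian noise with standard deviation 0.5 around the chosen center
    X[i, :] = Mtrue[k] + 0.5*np.random.randn(2)
N, D = X.shape
K = 3
# Initialize every center at the data mean plus standard normal noise
Minit = np.array([np.mean(X, axis=0),]*K) + np.random.randn(K,2)
print(Minit)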
    

In [None]:
# Plot the data and the initial values of the centers
plt.plot(X[:, 0], X[:, 1], 'o', color='lightgray', markersize = 1)
plt.plot(Minit[:, 0], Minit[:, 1], 'b+', mew = 3, ms=25)
plt.show()

In [None]:
from matplotlib.colors import cnames as mcolors
def plot_clusters(X, r, k, M):
    # The number of clusters is read from r; k is kept so existing calls still work
    K = r.shape[1]
    if K > 3:
        colors = list(mcolors.keys())
    else:
        colors = ['r', 'g', 'b']
    for j in range(K):
        plt.plot(X[r[:,j]==1, 0], X[r[:,j]==1, 1], '.', color=colors[j], markersize = 3)
    plt.plot(M[:, 0], M[:, 1], 'b+', mew = 3, ms=25)
    plt.show()

def plot_costs(costs):
    x = range(1, len(costs)+1)
    y = costs
    plt.plot(x, y)
    plt.ylabel('cost')
    plt.xlabel('iterations')
    plt.title("Cost Function")
    plt.xticks(x)
    plt.show()

In [None]:
def ml_kmeans(X, M):
    N, D = X.shape
    K = M.shape[0]
    # Apply the two steps of K means until convergence
    tol = 1e-6
    Jold = np.inf
    maxIters = 100    
    M = np.copy(M)  # work on a copy of the initial centers passed in
    costs = []
    for it in range(maxIters):
        r = np.zeros((N, K))
        # Step 1 -- Assign data to clusters
        # ******************************************************************
        # **********************Your code here *****************************
        # ******************************************************************
                
               
        # Step 2 -- Update mean centers
        # ******************************************************************
        # **********************Your code here *****************************
        # ******************************************************************    
        
        # Step 3 -- Calculate cost function and check for convergence
        # (compute J, append it to costs, stop when abs(Jold - J) < tol, then set Jold = J)
        # ******************************************************************
        # **********************Your code here *****************************
        # ******************************************************************
        print("Iteration #{}, Cost function value: {}".format(it, J))
        
        plot_clusters(X, r, K, M)
    plot_costs(costs)
    return M, r
        

In [None]:
M, r = ml_kmeans(X, Minit)

In [None]:
# Plot the initial centers (left) and the final clustering with its centers (right)
f, ax = plt.subplots(1,2)
f.set_figwidth(12)
ax[0].plot(X[:, 0], X[:, 1], 'o', color='lightgray', markersize = 1)
ax[0].plot(Minit[:, 0], Minit[:, 1], 'b+', mew = 3, ms=25)
if K > 3:
    colors = list(mcolors.keys())
else:
    colors = ['r', 'g', 'b']
for j in range(K):
    ax[1].plot(X[r[:,j]==1, 0], X[r[:,j]==1, 1], '.', color=colors[j], markersize = 3)
ax[1].plot(M[:, 0], M[:, 1], 'b+', mew = 3, ms=25)
plt.show()

In [None]:
print(M)
print(Minit)