# Introduction to Clustering

The goal of clustering is to partition the input data into subsets (clusters) of similar points, without using the class labels.

For this exercise, we will use the same iris data we used in the classification module.

In [None]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets

# load the iris data set
iris = datasets.load_iris()
# extract input output pairs
X = iris.data
Y = iris.target
cmap_data = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

def euclideanDistance(x1, x2):
    d2 = np.sum((x1 - x2)**2, axis=1)
    return np.sqrt(d2)

## k-means clustering
The idea of k-means clustering is to find k points (centroids) such that the total squared distance from each training point to its nearest centroid is minimized. Each centroid defines one cluster: the set of training points that are closer to it than to any other centroid.
An iterative procedure that tries to minimize this objective is the following:
1. Start with a random set of k different points that cover the space where the data lies.
2. For each of the k points, find the set of training points that are closer to it than to any of the other points.
3. Update the position of each of the k points to the mean of the set of points that were assigned to it in the previous step.
4. Repeat steps 2 and 3 until the assignments no longer change or a maximum number of iterations is reached.
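
To make the assignment and update steps concrete, here is a minimal sketch of a single iteration on a tiny hand-made data set. The arrays `points` and `centroids` are made up for illustration and are not part of the iris data; the starting centroids are picked by hand rather than at random so that the result is reproducible.

```python
import numpy as np

# Toy data (hypothetical): two well-separated groups in 2-D.
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
# Two starting centroids, picked by hand instead of at random.
centroids = np.array([[0.0, 0.0], [4.0, 4.0]])

# Assignment step: squared distance from every point to every centroid,
# shape (n_points, k), then the index of the nearest centroid per point.
sq_dist = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
labels = sq_dist.argmin(axis=1)

# Update step: each centroid moves to the mean of its assigned points.
new_centroids = np.array(
    [points[labels == j].mean(axis=0) for j in range(len(centroids))]
)

print(labels)         # [0 0 1 1]
print(new_centroids)  # rows: [0, 0.5] and [4, 4.5]
```

Running the two steps again from `new_centroids` would give the same labels, which is the stopping condition for the procedure.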


## Simple implementation of k-means


In [None]:
np.random.permutation(X[:])[:3]

In [None]:
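def kMeansClustering(X, k, dist=euclideanDistance, max_iter=300):
    # select k different rows of the training set as seeds
    # (astype(float) returns a copy, so updating C never modifies X)
    C = np.random.permutation(X)[:k].astype(float)

    def partition(X, C):
        # assign each point to the index of its nearest centroid
        Y = np.zeros(X.shape[0], dtype=int)
        for i, Xi in enumerate(X):
            Y[i] = np.argmin(dist(Xi, C))
        return Y

    iTr = 0
    clusters = partition(X, C)
    while iTr < max_iter:
        clusters_old = clusters
        for iK in range(k):
            members = X[clusters_old == iK]
            # move the centroid to the per-feature mean of its points;
            # an empty cluster keeps its previous centroid
            if len(members) > 0:
                C[iK] = np.mean(members, axis=0)
        clusters = partition(X, C)
        # stop once the assignments no longer change
        if np.array_equal(clusters_old, clusters):
            break
        iTr += 1
    return clusters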
    


In [None]:
Y_pred = kMeansClustering(X,3)
print(Y_pred)

In [None]:
plt.scatter(X[:,0], X[:,1], c=Y_pred, cmap=cmap_data, s=20)
plt.show()

## Exercise 1
Compare the simple implementation with the scikit-learn class sklearn.cluster.KMeans.
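
One thing to keep in mind when comparing two clusterings: the cluster indices are arbitrary, so two runs can describe the same partition with different numbering, and an elementwise comparison of the label arrays would be misleading. A possible way to look at the agreement is a contingency table. The helper `contingency_table` and the label arrays below are hypothetical and only serve as a sketch.

```python
import numpy as np

def contingency_table(labels_a, labels_b, k):
    """Count the points that fall in cluster i of A and cluster j of B."""
    table = np.zeros((k, k), dtype=int)
    # add 1 at position (labels_a[n], labels_b[n]) for every point n
    np.add.at(table, (labels_a, labels_b), 1)
    return table

# Two made-up labelings that describe the same partition
# but number the clusters differently.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])

table = contingency_table(labels_a, labels_b, 3)
print(table)
```

If every row and every column of the table has a single non-zero entry, the two labelings define the same partition up to a renaming of the clusters.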

In [None]:
## Write some lines of code here

## Exercise 2
Improve our k-means implementation by keeping the best of n_runs runs.
For this part, we need to compute the value of the objective for a given clustering and keep the best result (the one with the lowest objective) from the n_runs runs.
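
As a starting point, the objective of a clustering can be computed from the data and the cluster labels alone. The sketch below assumes the objective is the sum of squared Euclidean distances from each point to the mean of its cluster. The function name `kmeans_objective` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_objective(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return total

# Made-up data: two nearby points and one far away.
points = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]])

good = kmeans_objective(points, np.array([0, 0, 1]))
bad = kmeans_objective(points, np.array([0, 1, 1]))
print(good, bad)  # 2.0 17.0
```

The labeling that groups the two nearby points has the lower objective, so it is the one a best-of-n_runs loop would keep.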

In [None]:
## Write some lines of code here