# 2 K-Means Clustering
In this notebook we will explore and cluster data with the K-Means algorithm.

In K-Means, a dataset is partitioned in _k_ clusters while trying to minimise the sum of squared distances of each point to its cluster centre. On of the characteristics of this algorithm is that the number of clusters _k_ is predefined, i.e. the choice is left to the machine learning practitioner.

You can find an overview of the algorithm on page 23 in the lecture.

Before we start however, we need a package that we did not use before, _ipywidgets_. We will use it later for a dynamic visualisation of our clusters.

In [1]:
! pip install ipywidgets



No we can import the usual packages.

In [2]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import ipywidgets as widgets
%config InlineBackend.figure_format = 'svg' # matplotlib magic
np.random.seed(1337) # seeds help with reproducible results

## 2.1 The dataset
We are going to use familiar data, the Iris flower dataset.
However, the dataset contains four features. We want to be able to look at our clustering results later. Four dimensional data is hard to visualise for human brains. Thankfully, we know a handy dimensionality reduction technique.

### Task 2.1.1 Transform the data
- Load the Iris dataset with the provided function.
- Use the PCA class from sklearn to project the dataset into a two-dimensional space.

In [22]:
iris_data = load_iris()
np.shape(iris_data['data'])
pca = PCA(n_components=2)
pca.fit(iris_data['data'])
iris_pca = pca.transform(iris_data['data'])
# your code here

## 2.2 Initialization
Before we start with the learning phase, we need to set up a few initial parameters.

### Task 2.2.1 The _k_-Question
As mentioned, the choice of the right _k_ is an important decision for the success of the algorithm. The number of clusters you will stare at in the end, depends on _k_. Too many or too few clusters might give you suboptimal results.

Fortunately, we know that the Iris dataset is labeled. Those labels already partition the data. Therefore, let's choose _k_ according to the number of labels.

In [None]:
k = 3# your code here

### Task 2.2.2 The first cluster centres
You have to start somewhere! Theoretically, you could choose arbitrary points in the input space. Unfortunately, if we choose those randomly, it might take a while to converge.

In order to speed things up a little, let's choose different random datapoints as the initial cluster centres and put them into a list/array.

_Hint:_ np.random has a few good functions for that purpose.

In [32]:
rnd = [ np.random.randint(0, len(iris_pca)) for i in range(3) ]

rnd_data_point = [ iris_pca[_rnd] for _rnd in rnd]
# print(rnd_data_point)


[array([ 0.51169856, -0.10398124]), array([ 0.35788842, -0.06892503]), array([3.49992004, 0.4606741 ])]


## 2.3 The Algorithm
Now the algorithm goes as follows:

    - obtain the distance of each point to each cluster centre
    - assign that point to the nearest cluster
    - move position of centre to mean of points in cluster
    
Thus, we need to calculate a few things.

In [56]:
np.argmin([80,2,3,4,5,6])

1

### Task 2.3.1 Compute the distances
Complete the function _distances()_ that takes a list/array of datapoints and a list/array of cluster centres and returns the distance of each datapoint to each cluster centre.

In [54]:
def distances(data, centroids):

    data = np.array(data)
    centroids = np.array(centroids)

    dist = []

    dist = [ np.sqrt((np.abs(data[item,0] - centroids[0:,0]))**2 + (np.abs(data[item,1] - centroids[0:, 1]))**2) for item in range(len(data)) ]

    return dist
    # your code here


print(distances(iris_pca, rnd_data_point))
    

[array([3.22374651, 3.06669914, 6.18565922]), array([3.22666658, 3.07393062, 6.24669456]), array([3.40093589, 3.24776891, 6.41755083]), array([3.264085  , 3.11323493, 6.29365615]), array([3.26891775, 3.11186318, 6.23007609]), array([2.91769311, 2.76034513, 5.78758858]), array([3.33226794, 3.17849251, 6.34435463]), array([3.14921366, 2.99306251, 6.13327428]), array([3.43102694, 3.28401738, 6.47026697]), array([3.18446941, 3.03097605, 6.19934818]), array([3.11019254, 2.95246843, 6.00969668]), array([3.12670814, 2.9718213 , 6.12892039]), array([3.30041387, 3.14838682, 6.32441956]), array([3.75765393, 3.60891924, 6.79362783]), array([3.40714058, 3.25154876, 6.18648759]), array([3.23672254, 3.08362632, 5.95099356]), array([3.26592239, 3.10846381, 6.1334426 ]), array([3.18723783, 3.03020428, 6.15001772]), array([2.8821021 , 2.72558133, 5.71462336]), array([3.1606019 , 3.00290986, 6.08813615]), array([2.86509646, 2.70755335, 5.81058986]), array([3.10223096, 2.94468512, 6.04368865]), array([3.

### Task 2.3.2 Assign to clusters
Now that we can compute the distances to the cluster centres, we need to assign the points to their respective clusters.

Complete the function _compute\_assignments()_ that takes a list/array of datapoints and a list/array of cluster centres and returns a list of assignments of each data point to the nearest cluster centre.

_Hint:_ Make ample use of the _distances()_ function you just wrote.

In [62]:
np.unique([1,3,1,3,4,5], return_index=True)

(array([1, 3, 4, 5]), array([0, 1, 4, 5]))

In [59]:
def compute_assignments(data, centroids):
     dist = distances(data, centroids)
     clusters = [ np.argmin(dist[item])  for item in range(len(dist)) ]
     return clusters
    # your code here

# compute_assignments(iris_pca, rnd_data_point)

### Task 2.3.3 New cluster centres
Now that we have our clusters, we can compute new centres that better represent the cluster.

Complete the function _compute\_new\_centres()_ that takes takes a list/array of datapoints and a list/array of assignments and returns the new cluster centres.

In [87]:
def compute_new_centres(data, assignments):
    # your code here
    cluster_index = []
    cluster = []
    new_cluster = []
    for j in range(len(np.unique(assignments))):
        cluster_index.append([i for i in range(len(assignments)) if j == assignments[i]])

    for item in range(len(cluster_index)):
        cluster.append([data[index] for index in cluster_index[item]])

    print(len(cluster[0][0]))

    for i in range(len(cluster)):
        new_cluster.append([np.sum(cluster[i][0:][feature]) for feature in range(len(cluster[0][0])) ])

        # new_clusters = [np.sum(cluster[i]) for i in range(len(cluster))]


    return new_cluster

print(compute_new_centres(iris_pca, compute_assignments(iris_pca, rnd_data_point)))



2
[[1.9699861593256598, -2.3647283793844367, 2.521343618304825], [1.2508221705749465, -2.891142912359107, 2.9605791673374235]]


The most important parts are done! Theoretically, we only need to run the algorithm repeatedly until the cluster centres do not change anymore.

## 2.4 Cluster quality
As we have seen in previous assignments, blindly running an algorithm without evaluating the quality of its results is not always the best idea.

Hence, we will use the Davies Bouldin Index to evaluate the quality of our clusters (see also page 18 in the lecture).

### Task 2.4.1 The Davies-Bouldin-Index

Write a function _db\_index()_ that takes a list/array of datapoints, a list/array of cluster centres and a list/array of assignments and returns the Davies-Bouldin Index.

You will need to:
    
    - calculate the radii of the clusters, R
    - calculate the inter class distance between the clusters
    - calculate the badness of separation between the clusters, D
    
Lastly, you need to average over the relevant D-values of each cluster.

All necessary formulas can be found in lecture 19, "Basic Clustering.

In [None]:
def db_index(data, centroids, assignments):
    # your code here


## 2.5 Learning Phase

### Task 2.5.1 Iterative Clustering
Finally, we have all the ingredients in order to cluster our data. Remember, we already initialised the first cluster centres.

Therefore, for 20 iterations, you will need to:

    - compute the cluster assignments
    - compute the new cluster centres according to the assignments
    - compute the DB index for the current assignments and cluster centres
    
Do not forget to log relevant data for each iteration:

    - the cluster centres
    - the cluster assignments
    - the DB-Index

In [None]:
iterations = 20

# your code here

## 2.6 Evaluation

### Task 2.6.1 Plotting the DB-Index
Plot the DB-Index over the iterations.

In [None]:
%matplotlib inline
# your code here

### Task 2.6.2 Plotting the Cluster Assignments
In order to see how the clusters evolve over the iterations, plotting the state of the iterations over and over again is a bit cumbersome. Therefore we are going to use some matplotlib magic to make an interactive plot within this notebook.

The _plot\_clusters()_ function takes as an argument the current iteration and updates the plot with the relevant data from the iteration. If everything works out, you can then use the interaction slider to go back and forth between the iterations and see how the clusters develop.

In [None]:
%matplotlib widget
_, ax = plt.subplots()

    
def plot_clusters(iteration):
    iter_assignments = # get current cluster assignments
    iter_centres = # get current cluster centres
    ax.clear()
    ax.set_xlabel('YOUR LABEL HERE')
    ax.set_ylabel('YOUR LABEL HERE')
    ax.set_title('YOUR TITLE HERE')
    
    # Plot each cluster
    for i in range(k):
        points = # get data points belonging to cluster i
        ax.scatter(points[:, 0], points[:, 1], color="C{}".format(i), s=20)
        ax.scatter(iter_centres[i, 0], iter_centres[i, 1], color="C{}".format(i), 
                   marker='s', s=50, edgecolor="black", linewidth=2)
    

iteraction_slider = widgets.IntSlider(min=0, max=iterations-1, description='Iteration:')
widgets.interact(plot_clusters, iteration=iteraction_slider);

### Task 2.6.3 Clustered Data vs Labeled Data
In the beginning, we told our K-Means algorithm to separate the data into three clusters, because we have labels that also separate the data into three parts.

Create two plots side-by-side (using subplots), where one side is showing the clustered data and the other side is showing the partitions of the labeled data. 

In [None]:
%matplotlib inline
# your code here

## 2.7 K-Means in the wild
It is quite fun to write the K-Means algorithm from the ground up. But, usually, a practitioner would rely on libraries, which have already implemented the algorithm, if possible.
K-Means is implemented in the sk-learn library, so we are going to use it cluster somewhat more complex data and visualise the cluster centres.

In [None]:
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

### Task 2.7.1 Digits data and the K-Question
Load the digits data set and decide the obvious question of how many cluster centres we want to have.

In [None]:
# your code here

### Task 2.7.2 Run the K-Means algorithm
Use the provided KMeans object on the digits data and extract the cluster centres

In [None]:
# your code here

### 2.7.3 Plot the Cluster Centres
Plot all the extracted cluster centres

In [None]:
%matplotlib inline
# your code here