In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## K-means clustering of the MNIST dataset

This project applies the k-means clustering algorithm to cluster handwritten digits.

The [MNIST](http://yann.lecun.com/exdb/mnist/) dataset is a large database of handwritten digits. Here we analyse a much smaller, MNIST-like dataset in which each digit image is stored as 8x8 grayscale pixels. Strictly speaking, it is Scikit-learn's copy of the UCI Optical Recognition of Handwritten Digits test set rather than a subset of MNIST itself.

It is a very well-known dataset in the machine learning community and can be loaded directly from Scikit-learn:

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()

print(digits.DESCR)

In `digits`, `data` contains the pixel feature vectors and `target` contains the labels:

In [None]:
print(digits.data.shape)
print(digits.data)
print(digits.target)
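
As a quick sanity check on the relation between the fields, the sketch below compares `digits.data` with the flattened `digits.images` array and looks at the pixel range and the number of examples per class. The variable names (`flattened`, `same_pixels`, `class_counts`) are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# `images` holds the 8x8 arrays; `data` holds the same pixels flattened row by row.
n_samples = digits.images.shape[0]
flattened = digits.images.reshape(n_samples, -1)
same_pixels = np.array_equal(flattened, digits.data)

# Observed range of the pixel intensities.
pixel_min, pixel_max = digits.data.min(), digits.data.max()

# Number of examples per digit class (index = digit).
class_counts = np.bincount(digits.target)

print(digits.images.shape, digits.data.shape, same_pixels)
print(pixel_min, pixel_max)
print(class_counts)
```

Because each row of `data` is just a flattened image, `reshape(8, 8)` recovers the picture for plotting.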

The following code shows a random datapoint:

In [None]:
import random

plt.grid(False)  # hide grid lines; the old `b` keyword is not accepted by recent Matplotlib versions
idx = random.randint(0,len(digits.data)-1)
plt.imshow(digits.data[idx].reshape(8,8),cmap=plt.cm.gray_r)
plt.title("label = %i"%digits.target[idx])
plt.show()

The `KMeans` class in Scikit-learn has the following parameters:

In [None]:
from sklearn.cluster import KMeans

help(KMeans)

Notice that the k-means++ initialization is the default in Scikit-learn KMeans.

Also notice the hyperparameter `n_init`, which sets the number of times the k-means algorithm is run with different centroid seeds. The final result is the best of these runs as measured by the inertia metric.
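
To make the role of `n_init` concrete, here is a minimal sketch that mimics it by hand on toy data: k-means is run once per seed with a single random initialization, and the run with the lowest inertia is kept. The toy data from `make_blobs`, the number of seeds, and the variable names are assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data (an assumption for illustration, not the digits data).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.5, random_state=0)

# One k-means run per seed, each with a single random initialization.
inertias = []
for seed in range(10):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(km.inertia_)

# Keeping the run with the lowest inertia mimics what n_init > 1 does.
best_seed = int(np.argmin(inertias))
print("inertia per seed:", np.round(inertias, 1))
print("best seed:", best_seed, "inertia:", round(inertias[best_seed], 1))
```

If the inertias differ between seeds, a single initialization can end in a poorer local optimum, which is the motivation for running several.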

*Cluster the data into 10 groups with just one random cluster center initialization. Set `random_state` equal to zero:* 

In [None]:
#Start code here
kmeans = 
clusters =
#End code here

print(kmeans.cluster_centers_.shape)
print(clusters)

*What is the inertia for the obtained clusters?*

In [None]:
#Start code here

#End code here

The following code plots the 10 cluster centers:

In [None]:
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
for axi, center in zip(ax.flat, kmeans.cluster_centers_):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center.reshape(8, 8), cmap=plt.cm.binary)

Now each datapoint (handwritten digit) is assigned to one of these cluster centers.

*Print the labels for all the datapoints in the first cluster:*

In [None]:
#Start code here

#End code here

The true labels are known (in contrast to real unsupervised learning where the labels are unknown) and can be used to evaluate the k-means clustering result. 

For this, we assign the mode of the datapoint labels in a cluster to all the datapoints in that cluster.
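
The idea can be illustrated on a hypothetical toy assignment of nine datapoints to three clusters. The sketch uses `np.bincount(...).argmax()` as a stand-in for a mode function; the arrays and names are made up for this example.

```python
import numpy as np

# Hypothetical toy assignment: 9 datapoints in 3 clusters, with known true labels.
clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
true_labels = np.array([7, 7, 3, 3, 3, 3, 7, 7, 1])

predicted = np.zeros_like(true_labels)
cluster_modes = {}
for c in np.unique(clusters):
    mask = clusters == c
    # Most frequent true label in this cluster (ties go to the smallest label).
    most_common = np.bincount(true_labels[mask]).argmax()
    cluster_modes[int(c)] = int(most_common)
    predicted[mask] = most_common

print(cluster_modes)
print(predicted)
```

In this toy case two clusters end up with the same mode, so one of the true labels is never predicted. The same can happen with real clusters.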

*Use the SciPy `mode()` function to compute, for each cluster, the mode of the labels of the datapoints in that cluster:*

In [None]:
from scipy.stats import mode

#Start code here

#End code here

*Does every cluster have a different label mode?*

We can set the label mode as the k-means class prediction for each datapoint in a cluster.

The following code creates a numpy array `cluster_labels` that contains these predictions for each datapoint: 

In [None]:
cluster_labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    cluster_labels[mask] = mode(digits.target[mask])[0]

Finally, we compare these predictions with the true labels of the datapoints.
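
Accuracy is simply the fraction of datapoints whose prediction equals the true label. A small sketch on made-up arrays, computing it by hand and with Scikit-learn's `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 4 of the 5 predictions are correct.
true_labels = np.array([0, 1, 2, 2, 1])
predicted = np.array([0, 1, 2, 1, 1])

# Fraction of matching entries, computed by hand and with Scikit-learn.
acc_manual = np.mean(predicted == true_labels)
acc_sklearn = accuracy_score(true_labels, predicted)

print(acc_manual, acc_sklearn)
```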

*Compute the accuracy of the k-means class predictions:*

In [None]:
#Start code here

#End code here

*Plot the prediction results as a confusion matrix:*

In [None]:
from sklearn.metrics import confusion_matrix

#Start code here

#End code here

*What does this tell you about the following pair of classes?*
- *8 and 1*
- *9 and 5*
- *9 and 8*
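
As a reading aid, the sketch below builds a confusion matrix for a few made-up labels. In Scikit-learn's `confusion_matrix`, rows correspond to true labels and columns to predicted labels, so an off-diagonal entry counts datapoints of one class that were predicted as another. The arrays are hypothetical and unrelated to the actual clustering result.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up example: one true 8 is predicted as 1, everything else is correct.
true_labels = np.array([8, 8, 1, 1, 1, 9])
predicted = np.array([1, 8, 1, 1, 1, 9])

labels = [1, 8, 9]  # fixes the row/column order of the matrix
cm = confusion_matrix(true_labels, predicted, labels=labels)
print(cm)

# Row = true class 8, column = predicted class 1.
row_of_8 = labels.index(8)
col_of_1 = labels.index(1)
print("true 8 predicted as 1:", cm[row_of_8, col_of_1])
```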

*Repeat the k-means clustering 100 times, each time with a single random cluster center initialization and a different random seed. Compute the accuracy of the k-means class predictions for each run and store the results in the list `kmeans_accuracies`:*

In [None]:
kmeans_accuracies = []

#Start code here

#End code here

*Use the Seaborn `histplot()` function (the replacement for the deprecated `distplot()`) to plot the 100 accuracies in `kmeans_accuracies`:*

In [None]:
#Start code here
sns.histplot(kmeans_accuracies, kde=True)
#End code here

*What do you see?*

*Make the same plot for k-means cluster centers initialized with k-means++:* 

In [None]:
kmeans_accuracies = []

#Start code here

#End code here

*What do you see?*

*Make the same plot for k-means cluster centers initialized with k-means++ and `n_init` equal to 10:*

In [None]:
kmeans_accuracies = []

#Start code here

#End code here

*What do you see?*

*Estimate the performance (accuracy) of an optimized logistic regression model on unseen external data:*

In [None]:
#Start code here
#End code here
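
One possible approach, given here as a sketch rather than the intended solution, is to hold out a test set, tune the regularization strength `C` with cross-validation on the training part only, and report the accuracy on the held-out part. The split size, the grid of `C` values, the scaling step, and `max_iter` are all assumptions made for this example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Hold out 20% of the data; it is never used for fitting or tuning.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, stratify=digits.target, random_state=0
)

# Scaling inside the pipeline keeps the test data out of the scaler fit.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Assumed grid of regularization strengths, tuned with 5-fold cross-validation.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the refitted best model on the held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Because the test set plays no part in choosing `C`, its accuracy is a less optimistic estimate of performance on unseen data than the cross-validation score of the selected model.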