# K-means clustering on the handwritten digits dataset

K-means clustering is an unsupervised machine learning algorithm that examines the features of a set of data points and divides those points into clusters with similar features.

This means the data that k-means is applied to is unlabeled: we do not know which category each data point belongs to.

Many clustering algorithms are available in Scikit-Learn and elsewhere, but perhaps the simplest to understand is an algorithm known as k-means clustering, which is implemented in sklearn.cluster.KMeans.

Below are the standard imports for running the k-means algorithm, manipulating the dataset, and generating plots.


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mode

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, confusion_matrix

## Concept behind K-means

The k-means algorithm searches for a predetermined number of clusters within an unlabeled multidimensional dataset. It accomplishes this using a simple conception of what the optimal clustering looks like:

+ The “cluster center” is the arithmetic mean of all the points belonging to the cluster.

+ Each point is closer to its own cluster center than to other cluster centers.
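
The two assumptions above suggest an iterative procedure: assign every point to its nearest center, move each center to the mean of its assigned points, and repeat until the centers stop moving. The following is a minimal NumPy sketch of that idea, not Scikit-Learn's actual implementation. The function names `assign_to_nearest` and `kmeans_sketch`, the random initialization from data points, and the toy two-blob data are all assumptions made for illustration.

```python
import numpy as np


def assign_to_nearest(X, centers):
    """Return, for each row of X, the index of the closest center."""
    # Pairwise Euclidean distances, shape (n_samples, n_clusters)
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans_sketch(X, n_clusters, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not the sklearn API)."""
    rng = np.random.default_rng(seed)
    # Assumption: start from randomly chosen, distinct data points
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        assignments = assign_to_nearest(X, centers)
        # Update step: each center becomes the mean of its points
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[assignments == k].mean(axis=0) if np.any(assignments == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return centers, assign_to_nearest(X, centers)


# Toy data: two well-separated 2-D blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])
centers, assignments = kmeans_sketch(X, n_clusters=2)
print(centers)
```

On data this cleanly separated, the two recovered centers should land near the blob centers. On real data the result depends on the initialization, which is one reason library implementations typically try several starting points.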

Let's explore the algorithm in detail using the handwritten digits dataset bundled with Scikit-Learn.

### Load the digits dataset

In [3]:
digits = load_digits()
digits.data.shape

(1797, 64)

The dataset contains 1,797 samples, each an 8×8 image flattened into 64 pixel features. Here we will use k-means to try to identify similar digits without using the original label information; this is similar to a first step in extracting meaning from a new dataset for which you have no a priori label information.

Next, we cluster the data points into 10 clusters, since we know there are ten digit classes (0 to 9).

In [None]:
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)
kmeans.cluster_centers_.shape

The result is 10 clusters in 64 dimensions. Notice that the cluster centers themselves are 64-dimensional points, and can themselves be interpreted as the “typical” digit within the cluster. Let’s see what these cluster centers look like:

In [None]:
fig, ax = plt.subplots(2, 5, figsize=(8, 3))
centers = kmeans.cluster_centers_.reshape(10, 8, 8)
for axi, center in zip(ax.flat, centers):
    axi.set(xticks=[], yticks=[])
    axi.imshow(center, interpolation='nearest', cmap=plt.cm.binary)

We see that even without the labels, KMeans is able to find clusters whose centers are recognizable digits, with perhaps the exception of 1 and 8.

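Because k-means knows nothing about the identity of each cluster, the cluster labels 0–9 may be permuted relative to the actual digits. We can fix this by assigning each learned cluster the most common true label found in it: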

In [None]:
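# Start with an array of zeros, one entry per sample
labels = np.zeros_like(clusters)
for i in range(10):
    # Select the samples assigned to cluster i
    mask = (clusters == i)
    # Relabel them with the most common true digit in that cluster
    labels[mask] = mode(digits.target[mask])[0]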
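
To make the majority-vote step concrete, here is a small worked example on made-up data. The arrays `toy_clusters` and `toy_targets` are hypothetical, and `np.bincount(...).argmax()` is used here as a stand-in for `scipy.stats.mode`; it works only for non-negative integer labels.

```python
import numpy as np

# Hypothetical toy data: cluster ids and true classes for the same eight points
toy_clusters = np.array([0, 0, 0, 1, 1, 2, 2, 2])
toy_targets = np.array([7, 7, 1, 3, 3, 1, 1, 7])

toy_labels = np.zeros_like(toy_clusters)
for cluster_id in np.unique(toy_clusters):
    mask = toy_clusters == cluster_id
    # bincount tallies each integer class; argmax picks the most frequent one
    toy_labels[mask] = np.bincount(toy_targets[mask]).argmax()

print(toy_labels)                         # [7 7 7 3 3 1 1 1]
print(np.mean(toy_labels == toy_targets))  # 0.75
```

Note that this mapping is not guaranteed to be one-to-one: two clusters can receive the same majority label, in which case some class is never predicted.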

Check the accuracy of the predicted clustering:

In [None]:
accuracy_score(digits.target, labels)

Generate the confusion matrix:

In [None]:
mat = confusion_matrix(digits.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=digits.target_names,
            yticklabels=digits.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
plt.show()

From this plot, we can see that the main point of confusion for the algorithm is between the digits 8 and 1, while the other digits are grouped far more consistently.
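
Reading the largest off-diagonal cells from a heatmap by eye can be error-prone, so it may help to extract them programmatically. The sketch below uses a hypothetical helper, `most_confused_pairs`, and a made-up 3-class matrix. It assumes the convention used by `confusion_matrix`, where rows are true labels and columns are predicted labels (the heatmap above plots the transpose).

```python
import numpy as np


def most_confused_pairs(mat, top=3):
    """Return (true, predicted, count) for the largest off-diagonal entries."""
    off_diag = mat.copy()
    np.fill_diagonal(off_diag, 0)  # ignore correct predictions
    # Flattened indices of the largest entries, in descending order
    order = np.argsort(off_diag, axis=None)[::-1][:top]
    rows, cols = np.unravel_index(order, off_diag.shape)
    return [
        (int(r), int(c), int(off_diag[r, c]))
        for r, c in zip(rows, cols)
        if off_diag[r, c] > 0
    ]


# Hypothetical confusion matrix (rows: true label, columns: predicted label)
toy_mat = np.array([
    [50,  0,  0],
    [ 2, 30, 18],
    [ 0,  5, 45],
])
pairs = most_confused_pairs(toy_mat)
print(pairs)  # [(1, 2, 18), (2, 1, 5), (1, 0, 2)]
```

Applied to the matrix computed above, the same helper could be called as `most_confused_pairs(mat)`.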

We can improve the accuracy by preprocessing the data before running k-means. One option is the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a nonlinear embedding algorithm that is particularly adept at preserving points within clusters. Let’s see how it does:

In [1]:
# Project the data: this step will take several seconds
tsne = TSNE(n_components=2, init='pca', random_state=0)
digits_proj = tsne.fit_transform(digits.data)

# Compute the clusters
kmeans = KMeans(n_clusters=10, random_state=0)
clusters = kmeans.fit_predict(digits_proj)

# Permute the labels
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]

# Compute the accuracy
accuracy_score(digits.target, labels)

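That’s nearly 94% classification accuracy, with the labels used only to name the clusters and score the result, not to fit the model; the exact figure may vary with library versions. This is the power of unsupervised learning when used carefully: it can extract information from a dataset that would be difficult to extract by hand or by eye.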