# Clustering Overview

Jake VanderPlas, author of the Python Data Science Handbook, has shared the book's content on GitHub and elsewhere. Here is the chapter on k-means clustering:
https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html

In [2]:
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose initial cluster centers from the data points
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    
    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        
        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    
    return centers, labels

#centers, labels = find_clusters(X, 4)
#plt.scatter(X[:, 0], X[:, 1], c=labels,
#            s=50, cmap='viridis');
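
The loop above stops only when the centers match exactly, and it does not guard against a cluster that ends up with no points. The following is a minimal sketch of a more defensive variant, not the handbook's code: the name `find_clusters_safe`, the `max_iter` cap, and the tolerance-based check are assumptions added for illustration, and the data comes from `make_blobs` with arbitrarily chosen settings.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin


def find_clusters_safe(X, n_clusters, rseed=2, max_iter=100):
    """Hypothetical variant of find_clusters with two safeguards."""
    rng = np.random.RandomState(rseed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]

    for _ in range(max_iter):
        # Assign each point to its closest center
        labels = pairwise_distances_argmin(X, centers)

        # Keep the old center when a cluster receives no points,
        # which avoids taking the mean of an empty slice (NaN).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])

        # Tolerance-based check instead of exact float equality
        if np.allclose(centers, new_centers):
            break
        centers = new_centers

    return centers, labels


# Synthetic 2D data (assumed settings, for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
centers, labels = find_clusters_safe(X, 4)
print(centers.shape, np.unique(labels))
```

The iteration cap is a design choice: it bounds the run time in the rare case where the assignments keep changing, at the cost of possibly returning before full convergence.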

As we might expect from the cluster centers we visualized before, the main point of confusion is between the eights and ones. But this still shows that using k-means, we can essentially build a digit classifier without reference to any known labels!
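
To inspect which digits get confused, one option is to map each cluster to the most common true digit among its members and then build a confusion matrix. The sketch below assumes scikit-learn's bundled digits dataset and uses `np.bincount` for the majority vote; the `n_init=10` setting is an assumption chosen here, not taken from the text. The true labels are used only to name the clusters and to score the result, never to fit the model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix

digits = load_digits()

# Cluster the 64-dimensional pixel vectors without using the targets
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(digits.data)

# Map each cluster to the most common true digit among its members
labels = np.zeros_like(clusters)
for k in range(10):
    mask = clusters == k
    labels[mask] = np.bincount(digits.target[mask]).argmax()

acc = accuracy_score(digits.target, labels)
cm = confusion_matrix(digits.target, labels)

print(f"accuracy: {acc:.3f}")
print(cm)  # rows are true digits, columns are predicted digits
```

Large off-diagonal entries in the printed matrix point to the digit pairs that the clustering tends to mix up.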

Just for fun, let's try to push this even farther. We can use the t-distributed stochastic neighbor embedding (t-SNE) algorithm (mentioned in In-Depth: Manifold Learning) to pre-process the data before performing k-means. t-SNE is a nonlinear embedding algorithm that is particularly adept at preserving points within clusters. Let's see how it does:

In [4]:
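#from scipy.stats import mode
#from sklearn.cluster import KMeans
#from sklearn.datasets import load_digits
#from sklearn.manifold import TSNE
#from sklearn.metrics import accuracy_score
#
#digits = load_digits()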
#
## Project the data: this step will take several seconds
#tsne = TSNE(n_components=2, init='random', random_state=0)
#digits_proj = tsne.fit_transform(digits.data)
#
## Compute the clusters
#kmeans = KMeans(n_clusters=10, random_state=0)
#clusters = kmeans.fit_predict(digits_proj)
#
## Permute the labels
#labels = np.zeros_like(clusters)
#for i in range(10):
#    mask = (clusters == i)
#    labels[mask] = mode(digits.target[mask])[0]
#
## Compute the accuracy
#accuracy_score(digits.target, labels)

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def plot_kmeans(kmeans, X, n_clusters=4, rseed=0, ax=None):
    labels = kmeans.fit_predict(X)

    # plot the input data
    ax = ax or plt.gca()
    ax.axis('equal')
    ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)

    # plot the representation of the KMeans model
    centers = kmeans.cluster_centers_
    radii = [cdist(X[labels == i], [center]).max()
             for i, center in enumerate(centers)]
    for c, r in zip(centers, radii):
        ax.add_patch(plt.Circle(c, r, fc='#CCCCCC', lw=3, alpha=0.5, zorder=1))
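
The circles drawn by `plot_kmeans` use, for each cluster, the largest distance from the center to any point assigned to it. The same radii can be computed without plotting, as in this minimal sketch; the `make_blobs` settings and `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data (assumed settings, for illustration only)
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Radius of each cluster: distance from its center to its farthest member
radii = np.array([
    cdist(X[labels == k], [center]).max()
    for k, center in enumerate(kmeans.cluster_centers_)
])

# Distance from every point to the center of its own cluster
dist_to_own_center = np.linalg.norm(
    X - kmeans.cluster_centers_[labels], axis=1
)

print(radii)
```

By construction, every point lies inside the circle of its own cluster, so the circles describe the extent of each cluster rather than a probabilistic boundary.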

https://bl.ocks.org/rpgove/0060ff3b656618e9136b
https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
