K-means produces more “compact” clusters than agglomerative clustering. Which do you think is better for active semi-supervised learning? Compare scikit-learn’s K-means and agglomerative clustering (50 clusters each) on the MNIST dataset to see how they behave. Which clustering approach performs better?

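We download the MNIST dataset and keep a random 10% of it (about 7,000 of the 70,000 images), which keeps the clustering fast. This subset is then split into training data, `X_train` and `y_train`, and test data, `X_test` and `y_test`.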

In [None]:
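from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays, which the positional indexing used later relies on.
data = fetch_openml(name='mnist_784', as_frame=False)

import numpy as np
from sklearn.model_selection import train_test_split
# Hold out a random 10% of MNIST; only this subset is used from here on.
_, X_sub, _, y_sub = train_test_split(data.data, data.target, test_size=0.1)
# Split the subset into training (67%) and test (33%) data.
X_train, X_test, y_train, y_test = train_test_split(X_sub, y_sub, test_size=0.33)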

Do the clustering.

In [None]:
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

def tryclusterer(k, clust):
  # Cluster the training data into k clusters.
  clr = clust(n_clusters=k)
  clr.fit(X_train)
  train_ids = clr.labels_.copy()

  # Request one label per cluster (that of its first instance) and make an
  # interim dataset out of X_train, y_guess.
  clust_labs = np.array(k*['-1'])
  for i in range(k):
    clust_labs[i] = y_train[train_ids == i][0]
  # Give every training instance the label requested for its cluster.
  y_guess = clust_labs[train_ids]

  # Assign test data labels based on the nearest instance in the interim dataset.
  clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_guess)
  y_pred = clf.predict(X_test)
  # Fraction of test instances labeled correctly.
  return np.mean(y_pred == y_test)
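The label-propagation step inside `tryclusterer` can be easier to follow on a toy example. The sketch below is only an illustration: the arrays `toy_ids` and `toy_true` are made-up values, not MNIST data. It shows how one requested label per cluster is spread to every member of that cluster by indexing `clust_labs` with the cluster ids.

```python
import numpy as np

# Hypothetical toy data: a cluster id and a (normally hidden) true label per instance.
toy_ids = np.array([0, 1, 0, 2, 1, 2])
toy_true = np.array(['3', '7', '3', '8', '1', '8'])
k = 3

# Request one label per cluster: the true label of the cluster's first instance.
clust_labs = np.array([toy_true[toy_ids == i][0] for i in range(k)])

# Indexing with the cluster ids copies each cluster's label to all of its members.
y_guess = clust_labs[toy_ids]

# Fraction of propagated labels that match the true labels.
guess_accuracy = np.mean(y_guess == toy_true)
print(y_guess, guess_accuracy)
```

Only three labels are requested here, yet five of the six instances end up labeled correctly. The one mistake comes from cluster 1, which mixes two digits, and this is why cluster purity matters for the approach.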

K-means with 50 clusters reached about 65% accuracy in this run (the splits are random, so the exact figure varies).

In [None]:
tryclusterer(50, KMeans)

0.6519480519480519

Agglomerative clustering with 50 clusters reached about 70% accuracy in this run, so it performed better than K-means here.

In [None]:
tryclusterer(50, AgglomerativeClustering)

0.703030303030303
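The compactness claim in the question can be checked numerically. One possible measure, assumed here for illustration, is the within-cluster sum of squared distances to each cluster mean (lower means more compact). The helper `within_cluster_ss` below is a hypothetical name, not part of scikit-learn, and the toy data are made up.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Toy data: two tight pairs of points on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
compact = within_cluster_ss(X_toy, [0, 0, 1, 1])  # each pair kept together
spread = within_cluster_ss(X_toy, [0, 1, 0, 1])   # each pair split apart
print(compact, spread)
```

Passing `X_train` together with the `labels_` of each fitted clusterer to such a helper would give one number per method to compare. The accuracy results above suggest that the more compact clustering is not necessarily the one that propagates labels best.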