# Unsupervised Machine Learning II - Clustering

**Brull Borràs, Pere Miquel**

With PCA we saw an unsupervised algorithm that extracts key structural information from large, complex datasets without needing labels. Clustering is another unsupervised technique. Its most popular algorithm is **k-means**, which partitions the data into $k$ groups. It starts by randomly initializing $k$ points in the data space, the *means* of the clusters, and then refines them until the groups are stable.

The algorithm is iterative and alternates between two steps:

- Assign each point to its nearest mean, i.e. the mean at the smallest squared Euclidean distance.
- Update each mean to the centroid of the points currently assigned to it.
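
In symbols, the two steps can be written as follows. The notation is introduced here for illustration only: $x_i$ is a data point, $\mu_j$ the mean of cluster $j$, and $C_j$ the set of points assigned to it.

$$
\text{assignment:}\quad C_j = \{\, x_i : \lVert x_i - \mu_j \rVert^2 \le \lVert x_i - \mu_l \rVert^2 \ \text{for all } l \,\},
\qquad
\text{update:}\quad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
$$

Neither step can increase the within-cluster sum of squares

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 ,
$$

which is why repeating the two steps eventually settles on a stable solution.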

Over successive iterations the centroids move toward a local minimum of the *within-cluster sum of squares*, and the algorithm stops once it has converged on a solution, that is, once the assignments no longer change. Because the result depends on the starting means, implementations try to find a good initialization (such as the `k-means++` scheme used below) and can keep the best of several runs.
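
To make the loop concrete, here is a minimal NumPy sketch of the two alternating steps. It is an illustration under stated assumptions, not the implementation sklearn uses: the function name `kmeans_sketch`, its parameters, and the toy two-blob data are hypothetical, and the initial means are simply drawn at random from the data rather than chosen with k-means++.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain alternating k-means iterations (illustrative, not sklearn's code)."""
    rng = np.random.default_rng(seed)
    # Initialise the k means as k distinct points drawn from the data.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from each point to each mean.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points
        # (an empty cluster keeps its previous mean).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # the means stopped moving: converged
        centers = new_centers
    # Final assignment and within-cluster sum of squares for the returned means.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    inertia = dists.min(axis=1).sum()
    return centers, labels, inertia

# Toy data (hypothetical): two tight, well-separated blobs in 2D.
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                   rng.normal(5.0, 0.1, size=(20, 2))])
toy_centers, toy_labels, toy_inertia = kmeans_sketch(X_toy, k=2)
print(toy_centers)
print(toy_labels)
```

On data this cleanly separated, a single random start is enough. On real data such as the digits, a single start can end in a poor local minimum, which is the reason for running several initializations.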

Again, we will use the digits dataset from sklearn as an example:

In [11]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import metrics
from mpl_toolkits.mplot3d import Axes3D

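from time import time

# Note: seed() without an argument reseeds from OS entropy, so runs are not reproducible.
np.random.seed()
# Note: this rebinds the name `scale`, shadowing the function imported above.
scale = StandardScaler()

digits = load_digits()
data = digits.data  # reuse the dataset loaded above instead of loading it twice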

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

In [33]:
sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d"
   % (n_digits, n_samples, n_features))


print(79 * '_')
print('init         time   inertia   homo   compl   v-meas   ARI    silhouette')

def bench_k_means(estimator, name, data):
    """Fit `estimator` on `data` and print one row of timing and clustering metrics."""
    t0 = time()
    estimator.fit(data)
    # homo, compl, v-meas and ARI compare the clusters with the true `labels`;
    # the silhouette uses only the data and is estimated on `sample_size` points.
    print('% 9s    %.2fs  %i   %.3f  %.3f   %.3f    %.3f  %.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10), name="k-means++", data=data)
print(79 * '_')

n_digits: 10, 	 n_samples 1797, 	 n_features 64
_______________________________________________________________________________
init         time   inertia   homo   compl   v-meas   ARI    silhouette
k-means++    0.23s  1165149   0.740  0.748   0.744    0.668  0.175
_______________________________________________________________________________
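
The `inertia` column is the within-cluster sum of squares that k-means minimizes. As a rough check, the sketch below recomputes it from the fitted `labels_` and `cluster_centers_` and compares it with `inertia_`. The variable names `X`, `km` and `wcss` and the fixed `random_state=0` are assumptions made for this example, so the value will generally differ from the run shown above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X = load_digits().data
# random_state is fixed here only to make this sketch repeatable (an assumption).
km = KMeans(n_clusters=10, init='k-means++', n_init=10, random_state=0).fit(X)

# Squared distance from every sample to the centre of its assigned cluster, summed.
wcss = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(wcss, km.inertia_)
```

The two numbers should agree up to floating-point rounding. Unlike the label-based scores in the table, inertia depends on the scale of the features, so it is best used to compare runs on the same data.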
