**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
!{sys.executable} -m pip install yellowbrick
!{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import numpy as np
import pandas as pd

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
from class_utils import ColGrid, sorted_order, show_tree
from sklearn.tree import DecisionTreeClassifier

from yellowbrick.contrib.classifier import DecisionViz
from yellowbrick.cluster import SilhouetteVisualizer, KElbowVisualizer
# revert yellowbrick's invasive changes to matplotlib's
# styling; also suppressing deprecation warnings
import warnings
import yellowbrick

with warnings.catch_warnings(record=True) as w:
    yellowbrick.style.rcmod.set_aesthetic('reset')
    yellowbrick.style.rcmod.reset_orig()

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }

# create directories for the data and for any outputs
import os
os.makedirs("data", exist_ok=True)
os.makedirs("output", exist_ok=True)

# create a synthetic dataset
_blobs, _labels = datasets.make_blobs(
    n_samples=600, random_state=3,
    cluster_std=0.75, centers=5
)

_df_blobs = pd.DataFrame(np.hstack([_blobs, _labels.reshape(-1, 1)]),
                         columns=['x', 'y', 'label'])
_df_blobs['y'] *= 100
_df_blobs.to_csv("data/blobs_2d.csv", index=False)

del _blobs
del _labels
del _df_blobs

In [None]:
#@title -- Auxiliary Functions -- { display-mode: "form" }
cluster_colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k', 'gray']

def scatter_legend(ax, sc, labels, num_colors, color_array,
                   s, edgecolor):
    handles = []
    
    for i in range(num_colors):
        h = mlines.Line2D([0], [0], ls="", color=color_array[i],
                          ms=s, marker=sc.get_paths()[0],
                          markeredgecolor=edgecolor)
        handles.append(h)

    ax.legend(handles=handles, labels=labels)

def plot_data(
    data, cluster_centres=None, color='b', ax=None,
    cluster_colors=cluster_colors,
    edgecolors='k', labels=None,
    center_color='orange', center_size=200,
    legend=True
):
    if ax is None:
        ax = plt.gca()
        
    if labels is None:
        ax.scatter(data[:, 0], data[:, 1], s=50,
                   color=color, edgecolors=edgecolors)
    else:
        c = np.asarray(cluster_colors)[labels]
        
        sc = ax.scatter(data[:, 0], data[:, 1], s=50,
                        c=c, edgecolors=edgecolors,
                        #cmap=plt.cm.get_cmap('category10', np.max(labels)+1)
                       )
        
        if legend:
            nclusts = np.max(labels)+1
            scatter_legend(ax, sc, ['$c_{}$'.format(i) for i in range(nclusts)],
                           nclusts, cluster_colors, s=6, edgecolor='k')
        
    if cluster_centres is not None:
        ax.scatter(cluster_centres[:, 0],
                   cluster_centres[:, 1],
            s=center_size, c=center_color,
            edgecolors=edgecolors)

    ax.grid(ls='--')
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_axisbelow(True)

## k-means in Scikit-Learn

Now that we have explored the principles behind $k$-means, we will have a look at a practical implementation from the well-known `scikit-learn` package. In addition to illustrating how to apply $k$-means to a dataset, we will have a look at some visualization techniques that will allow us to explore some of the clusters' properties and even help us to select a good number of clusters $k$.

### Preprocessing: Remember to Normalize the Data

To load and preprocess the data, we will be using our standard preprocessing pipeline, which **normalizes** (standardizes) numeric columns. Given that $k$-means is based on distances, proper scaling is crucial and we **should not forget to normalize**. If the range of a certain column is much larger than that of the other columns and we fail to normalize it, that column will dominate the distance computations and will therefore have a far greater influence on the results of the clustering than the other columns. This is usually not desirable.
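
To make the effect of scaling on distances concrete, here is a minimal sketch using three made-up points (the array `pts` and the helper `dist` are illustrative and not part of the notebook's pipeline). The second column has a range 100 times larger than the first, so a difference in that column dominates the raw Euclidean distance; after standardization, both columns contribute equally.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Three toy points: the second column has a much larger range than the first.
pts = np.array([[0.0, 0.0],
                [1.0, 0.0],     # differs from point 0 only in the first column
                [0.0, 100.0]])  # differs from point 0 only in the second column

def dist(a, b):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(a - b))

# Raw distances: the second column dominates.
raw_x, raw_y = dist(pts[0], pts[1]), dist(pts[0], pts[2])

# Distances after standardizing each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(pts)
std_x, std_y = dist(scaled[0], scaled[1]), dist(scaled[0], scaled[2])

print("raw distances:         ", raw_x, raw_y)
print("standardized distances:", std_x, std_y)
```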



In [None]:
# we load the data from the CSV
df = pd.read_csv("data/blobs_2d.csv")

# all inputs are numeric
categorical_inputs = []
numeric_inputs = list(df.columns[:-1])

# the preprocessing pipeline
input_preproc = make_column_transformer(
    (make_pipeline(
        SimpleImputer(strategy='constant', fill_value='MISSING'),
        OneHotEncoder()),
     categorical_inputs),
    
    (make_pipeline(
        SimpleImputer(),
        StandardScaler()),
     numeric_inputs)
)

# the preprocessed data and the classes
X = input_preproc.fit_transform(df[categorical_inputs+numeric_inputs])
labels = df["label"]

# let's also keep the unnormalized data
X_unnorm = df[categorical_inputs+numeric_inputs].values

# plot the data
plt.figure(figsize=(6, 5))
plot_data(X)

### The Clustering

Now that we have the data, it is the easiest thing in the world to do the clustering. All we need to do is create a `KMeans` object from `scikit-learn`, specifying the number of clusters as $k=5$. We then fit the `KMeans` object to the data using the standard `fit` interface. Note that we are doing unsupervised learning here so there are no desired outputs `y`. Method `predict` returns our cluster identifiers.

Once the clustering is done, we plot our data again: this time colouring points by the computed cluster identifiers. This will allow us to verify that clustering went correctly.



In [None]:
model = KMeans(n_clusters=5)
model.fit(X)
clusts = model.predict(X)

plt.figure(figsize=(6, 5))
plot_data(X, labels=clusts, legend=True)

#### Clustering with Unnormalized Data

To see why normalizing the data is so essential, we will now also compute a clustering for the unnormalized version.



In [None]:
model = KMeans(n_clusters=5)
model.fit(X_unnorm)
clusts = model.predict(X_unnorm)

plt.figure(figsize=(6, 5))
plot_data(X_unnorm, labels=clusts)

As you can see, the clusters are not quite right this time. This is because dimension $y$ has a much greater range than dimension $x$, so it is given far more weight in the distance computations.
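
Since this dataset is synthetic, the visual impression can also be backed by a number. The sketch below is an assumption-laden illustration rather than part of the original workflow: it regenerates blobs with the same parameters as the data-generation cell, and compares each clustering with the generating labels using scikit-learn's `adjusted_rand_score` (1.0 means perfect agreement). The helper `ari_for` and the fixed `random_state=0` are introduced here only for illustration. On data like this one would expect the standardized version to score higher, but the exact values depend on the run.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Recreate a dataset of the same kind as the one used in this notebook.
pts, true_labels = make_blobs(n_samples=600, random_state=3,
                              cluster_std=0.75, centers=5)
pts[:, 1] *= 100  # stretch y, as in the data-generation cell

def ari_for(data):
    """Cluster the data with k-means and compare with the generating labels."""
    km = KMeans(n_clusters=5, n_init=10, random_state=0)
    return adjusted_rand_score(true_labels, km.fit_predict(data))

ari_unnorm = ari_for(pts)
ari_norm = ari_for(StandardScaler().fit_transform(pts))

print(f"ARI, unnormalized: {ari_unnorm:.3f}")
print(f"ARI, standardized: {ari_norm:.3f}")
```

Note that such a comparison is only possible when ground-truth labels exist, which is rarely the case in real clustering tasks.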

### Determining the Number of Clusters $k$

In the example above, we assumed that we knew the correct number of clusters $k$. In practice this is seldom the case, unless we already know precisely what we are looking for. So how do we determine a good value of $k$? There are, in fact, several methods.

#### The Elbow Plot

One way to determine a good value of $k$ is using an elbow plot. The idea behind this kind of plot is to run $k$-means clustering for different values of $k$, compute the distortion score for each and plot the results. In the plot, we are then looking for an "elbow", i.e. the point with the maximum curvature – where the steepness of the plot changes the most. To create our elbow plots, we will be using the `yellowbrick` package, which also finds and visualizes the elbow point automatically using a knee point detection algorithm [[yellowbrick]](#yellowbrick).

The distortion score is computed as the sum of squared errors (SSE), i.e. the sum of squared Euclidean distances between points and their corresponding cluster centres [[yellowbrick](#yellowbrick), [k_research](#k_research)]:
$$
J(C) = \sum_{j=1}^{k} \sum_{x_i \in c_j} \| x_i - \mu_j \|^2.
$$

Note that this is the same criterion that $k$-means is trying to minimize. Consider also that we cannot simply pick the $k$ that results in the smallest distortion: that would mean creating as many clusters as there are points, which would reduce distortion to zero, but would not result in a good clustering. The intuition behind picking the elbow point is that once you arrive at the correct value of $k$, the distortion should decrease sharply because there is now a cluster centre reasonably close to each point. Adding further clusters should not have as much effect now: it is just going to split already well-defined clusters into smaller ones.
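
The distortion formula can be checked numerically. The following is a minimal sketch on a small made-up dataset (`X_demo` and the helper `distortion` are illustrative names, not part of the notebook): it sums the squared distances of all points to their assigned centres and compares the result with the `inertia_` attribute that scikit-learn's `KMeans` stores after fitting.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# A small illustrative dataset.
X_demo, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

def distortion(X, labels, centres):
    """Sum of squared Euclidean distances of points to their cluster centres."""
    diffs = X - centres[labels]  # each point minus the centre of its cluster
    return float(np.sum(diffs ** 2))

J = distortion(X_demo, km.labels_, km.cluster_centers_)
print("manual distortion:", J)
print("KMeans.inertia_:  ", km.inertia_)
```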

It is possible to create an elbow plot using other score functions, e.g. the Calinski-Harabasz index or the Silhouette score [[yellowbrick](#yellowbrick), [ch_index](#ch_index)]: feel free to experiment with that. Also, we will be using Silhouette scores later in the notebook to perform Silhouette analysis.
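
As a starting point for such experiments, here is a hedged sketch that computes the Calinski-Harabasz index directly with scikit-learn's `calinski_harabasz_score` for several values of $k$. The dataset `X_demo` is made up for illustration. Unlike distortion, a higher value of this index indicates a better-defined clustering, so one looks for a peak rather than an elbow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# A small illustrative dataset.
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Compute the index for each candidate number of clusters.
ch_scores = {}
for k in range(2, 8):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    ch_scores[k] = calinski_harabasz_score(X_demo, labels_k)

for k, score in ch_scores.items():
    print(f"k={k}: Calinski-Harabasz index = {score:.1f}")

# The candidate k with the highest index.
best_k = max(ch_scores, key=ch_scores.get)
print("k with the highest index:", best_k)
```

With `yellowbrick`, the same index should be selectable by passing `metric='calinski_harabasz'` to `KElbowVisualizer`.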

#### The Elbow Plot: An Example

Now let us use the elbow plot to determine the best $k$ for our dataset. We will use the `KElbowVisualizer` class from `yellowbrick` and try $k \in \{2, 3, ..., 9\}$. Since we already know that the correct number of clusters is 5 in our case, we should observe that the elbow is at $k=5$.



In [None]:
visualizer = KElbowVisualizer(model, k=(2, 10), timings=False)
visualizer.fit(X)
visualizer.ax.grid(ls='--')
visualizer.finalize()

plt.savefig("output/kmeans_elbow.svg", bbox_inches='tight', pad_inches=0)

### Silhouette Analysis

Silhouette analysis is another approach that can be used to select a good value of $k$. It also provides a way to visualize some key cluster properties for each $k$. The approach is based on the Silhouette coefficient, which is defined as follows [[k_research]](#k_research):
$$
s(x_i) = \frac{
    b_i - a_i
}{
    \max\{ a_i, b_i \}
},
$$

where $a_i$ is the **intra-cluster dissimilarity**, i.e. the average distance of sample $x_i$ from all the other samples in the same cluster; and $b_i$ is the **inter-cluster dissimilarity**, i.e. the smallest average distance from $x_i$ to the samples of any other cluster (in other words, the average distance to the nearest neighbouring cluster). The smaller the intra-cluster dissimilarity $a_i$, the more sample $x_i$ should belong in the cluster. The greater the inter-cluster dissimilarity $b_i$, the less the sample $x_i$ should belong to any other cluster [[k_research]](#k_research).

The **Silhouette score** is the mean of the Silhouette coefficients across all samples $x_i \in X$:
$$
S = \frac{\sum_{x_i \in X} s(x_i)}{|X|},
$$
where $X$ is the dataset and $|X|$ is the number of samples in it (its cardinality).
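
The definitions can be verified with a short sketch. It assumes that $b_i$ is taken as the smallest mean distance from $x_i$ to the samples of any other cluster, which is the convention followed by scikit-learn's `silhouette_samples`. The dataset `X_demo` and the helper `silhouette_manual` are illustrative names introduced here; the helper is a plain, unoptimized implementation meant for small datasets only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (pairwise_distances, silhouette_samples,
                             silhouette_score)

# A small illustrative dataset and its clustering.
X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10,
                     random_state=0).fit_predict(X_demo)

def silhouette_manual(X, labels):
    """Silhouette coefficient s(x_i) for every sample."""
    D = pairwise_distances(X)  # Euclidean distances between all pairs
    ids = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster gets a coefficient of 0
        # a_i: mean distance to the other samples of the same cluster
        # (D[i, i] is 0, so dividing by n_own - 1 excludes the sample itself)
        a = D[i, own].sum() / (n_own - 1)
        # b_i: smallest mean distance to the samples of any other cluster
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

s_manual = silhouette_manual(X_demo, labels_demo)
S_manual = s_manual.mean()  # the Silhouette score

print("manual score:      ", S_manual)
print("scikit-learn score:", silhouette_score(X_demo, labels_demo))
```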

The greater the Silhouette score, the greater on average is the degree to which samples should belong in their clusters and not in any other clusters. We can therefore plot the Silhouette scores for multiple values of $k$ and pick the $k$ with the maximum score. We will again use `KElbowVisualizer` to create the plot, but this time we specify `silhouette` as the metric. Clearly the maximum value is at $k=5$.



In [None]:
visualizer = KElbowVisualizer(model, k=(3, 8),
                              metric='silhouette',
                              timings=False)
visualizer.fit(X)
visualizer.ax.grid(ls='--')
visualizer.finalize()

The great thing about Silhouette analysis, though, is that it also provides an easy way to visualize key properties of all the individual clusters. What we do is compute the Silhouette coefficients for all the points, group them by cluster, order them by magnitude and display them.

The resulting plots show how large each cluster is and how certain it is that each of its individual points belongs to that cluster and not to another one: points with lower Silhouette coefficients are likely to be on the borders of clusters, and points with very low coefficients might be clustered incorrectly.

In the plots below we show Silhouette plots and scatter plots side by side to make comparisons easier, but note that in practice, when working with high-dimensional datasets, you would, of course, only have the Silhouette plots at your disposal.



In [None]:
k_range = range(3, 8)
fig, axes = plt.subplots(len(k_range), 2)

for k, ax in zip(k_range, axes):
    model = KMeans(n_clusters=k)
    
    visualizer = SilhouetteVisualizer(
        model, colors=cluster_colors,
        alpha=1.0, ax=ax[0])
    visualizer.fit(X)
    visualizer.ax.grid(ls='--')
    visualizer.finalize()
    
    clusts = model.predict(X)
    plot_data(X, labels=clusts, ax=ax[1])

fig.set_size_inches([10, len(k_range)*3])
plt.tight_layout()

For some values of $k$, a number of points have very low Silhouette coefficients, which indicates that they are probably not clustered correctly. With $k=5$, no points have very low coefficients, which indicates that this is a much better clustering. Note also how these plots let us compare cluster sizes visually.

### Interpreting Clusters

In data analysis, finding clusters is usually the easy part of the job: interpreting them is more challenging. However, the set of tools required to interpret clusters is fundamentally the same as that used for exploratory data analysis. We can add the cluster identifiers as a new column to our dataset and then proceed to examine interactions with other columns, plot the ones that show correlation in more detail, etc.
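
One simple way to start, sketched here on a tiny made-up data frame (`df_demo` and its values are purely illustrative), is to group by the cluster column and compare per-cluster summary statistics and cluster sizes using standard `pandas` operations.

```python
import pandas as pd

# A tiny illustrative data frame with a cluster column already added.
df_demo = pd.DataFrame({
    "x": [0.1, 0.2, 5.0, 5.2, 9.9],
    "y": [10.0, 12.0, 300.0, 310.0, -50.0],
    "cluster": [0, 0, 1, 1, 2],
})

# Per-cluster summary statistics of each input column.
summary = df_demo.groupby("cluster")[["x", "y"]].agg(["mean", "min", "max"])

# The number of samples in each cluster.
sizes = df_demo["cluster"].value_counts().sort_index()

print(summary)
print(sizes)
```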

#### Violin Plots

In our case, violin plots might be useful: they show roughly which ranges of $x$ and $y$ each cluster corresponds to. To make comparisons easier, we also include a corresponding scatter plot.



In [None]:
model = KMeans(n_clusters=5)
model.fit(X)
clusts = model.predict(X)

# add the cluster identifiers as a new column to the dataset
df['cluster'] = clusts

In [None]:
df

In [None]:
g = ColGrid(df, ['cluster'], ['x', 'y'])
fig, axes = g.map_dataframe(sorted_order(sns.violinplot),
                  palette=cluster_colors)
fig.set_size_inches((10, 4))

for ax in axes:
    ax.grid(ls='--')
    ax.set_axisbelow(True)

In [None]:
plot_data(X_unnorm, labels=clusts)

#### A Decision Tree

We could also fit a small decision tree to the data and see what rules it comes up with.
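
If a plain-text view of the rules is preferred, scikit-learn's `export_text` can print them. The sketch below uses a made-up dataset in which `y_demo` stands in for the cluster identifiers; the `max_depth=2` limit is an assumption added here to keep the tree small and is not a setting taken from this notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data: y_demo plays the role of the cluster identifiers.
X_demo, y_demo = make_blobs(n_samples=200, centers=3, random_state=0)

# Limit the depth so that the resulting rules stay readable.
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Print the learned rules as indented text.
rules = export_text(tree_demo, feature_names=["x", "y"])
print(rules)
```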



In [None]:
dtree = DecisionTreeClassifier()
dtree.fit(X_unnorm, clusts)
show_tree(dtree, numeric_inputs, filled=False)

To make it easier to check that the decision tree got the cluster boundaries right, we can, in our case, also plot its decision boundaries w.r.t. $x$ and $y$.



In [None]:
viz = DecisionViz(
    dtree, features=numeric_inputs,
    classes=['$c_{}$'.format(i) for i in range(np.max(clusts)+1)]
)

viz.fit(X_unnorm, clusts)
viz.draw(X_unnorm, clusts)
viz.show()

### References

<a id="k_research">[k_research]</a> Yuan, C. and Yang, H., 2019. Research on K-value selection method of K-means clustering algorithm. J—Multidisciplinary Scientific Journal, 2(2), pp.226-235.

<a id="ch_index">[ch_index]</a> Wang, X. and Xu, Y., 2019, July. An improved index for clustering validation based on Silhouette index and Calinski-Harabasz index. In IOP Conference Series: Materials Science and Engineering (Vol. 569, No. 5, p. 052024). IOP Publishing.

<a id="yellowbrick">[yellowbrick]</a> Yellowbrick. URL: <https://github.com/DistrictDataLabs/yellowbrick>.

