<a href="https://colab.research.google.com/github/martatolos/eae-dsaa-2025/blob/main/unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unsupervised Algorithms

> Goal of the session:
>
> - By the end of this activity, you will understand the basics of unsupervised algorithms, how to apply them, what their limitations are, and how some approaches try to overcome those limitations.
>
> Scope of the session:
>
> - Understand the motivation and use cases of unsupervised algorithms.
> - Prepare a dataset for unsupervised learning.
> - Use KMeans clustering to group data points.
> - Analyze how KMeans clustering works.
> - Go through the limitations of KMeans clustering.
> - See metrics and methodologies to try to evaluate whether clustering is effective.
> - Use methods such as Spectral Clustering to cluster data points.

## 1. Introduction

Unsupervised algorithms are those whose training data consists of a set of input variables $X$ without a target variable $Y$.

There are two categories:

* **Clustering**: discover groups with similar features within a dataset.

* **Dimensionality reduction**: reduce a dataset with a large number of dimensions to a much smaller number, often two or three. This makes the data possible to visualize and easier to understand.
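As a minimal sketch of dimensionality reduction, scikit-learn's ``PCA`` can project a dataset onto its two main directions of variance. The five-feature synthetic dataset and the variable names below are assumptions made for illustration; they are not part of the session's data.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Hypothetical dataset: 200 samples described by 5 features
X_high, _ = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)

# Keep only the two directions that capture the most variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_high)

print(X_high.shape, "->", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The explained variance ratio gives a rough idea of how much information survives the projection.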

### Applications

* **Customer segmentation**: the market is divided into smaller segments of buyers with different needs, characteristics and behaviours, so that a different strategy can be applied to each segment. **Note:** You could build a customer segmentation in the CRM exercise based on $age$, $GDP$, $gender$, $first$ $purchase$, $visits$, etc.

![](https://i.pinimg.com/originals/d7/2f/7b/d72f7bde33d814881a5d058212228514.png)

* **Fraud detection**: identify which transactions are likely to be fraudulent. It is about finding anomalous behaviours that deviate from the general behaviour of the rest of the population.

Example dataset (UCI Machine Learning Repository):
http://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)

* **Face detection using PCA**: principal component analysis is used to reduce the number of variables. The data is compressed in such a way that the main characteristics are preserved. In an image where a face appears, not all the pixels represent the main features of the face. Using PCA, we extract the components that best describe a face and discard the rest, which reduces the number of dimensions.

PCA example from scratch to detect faces:

https://medium.com/@reubenrochesingh/building-face-detector-using-principal-component-analysis-pca-from-scratch-in-python-1e57369b8fc5

## 2. Setup

### Dependencies

- ``numpy`` 2.0.2
- ``nbformat``
- ``pandas`` 2.2.2
- ``plotly`` 5.24.1
- ``scikit-learn`` 1.6.1

In [None]:
%pip install numpy==2.0.2 nbformat pandas==2.2.2 plotly==5.24.1 scikit-learn==1.6.1

### Imports

In [None]:
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs, make_circles
from sklearn.metrics import pairwise_distances_argmin, silhouette_score

### Data Generation

*K-means* is an unsupervised algorithm from the *Clustering* category.

Its purpose is to partition a set of $n$ observations into $k$ groups where each observation belongs to the group whose **mean value is the closest**.
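Written as an optimization problem, this is usually stated as minimizing the within-cluster sum of squared distances. The symbols $S_1, \dots, S_k$ for the groups and $\mu_j$ for the mean of the points in $S_j$ are notation introduced here for illustration:

$$
\underset{S_1, \dots, S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x \in S_j} \lVert x - \mu_j \rVert^2, \qquad \mu_j = \frac{1}{|S_j|} \sum_{x \in S_j} x
$$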

We create a two-dimensional example using ``make_blobs``, a function from the ``scikit-learn`` library that generates clusters of points.

In [None]:
n_clusters = 3
X, y = make_blobs(n_samples=300, centers=n_clusters, cluster_std=0.50, random_state=0)

In [None]:
fig = px.scatter(x=X[:, 0], y=X[:, 1], size_max=15)
fig.show()

In [None]:
help(make_blobs)

We can see that there are 3 groups. The K-means algorithm should automatically detect which group each data point belongs to.

## 3. K-means

### Model training

In [None]:
kmeans = KMeans(n_clusters=n_clusters)  # The number of groups is set in advance
kmeans.fit(X)  # Training

### Model inference

In [None]:
y_predicted = kmeans.predict(X)  # Prediction
y_predicted

#### What is happening

Each of the 300 points has been associated with one of the three previously set groups. We could apply the same to new data points to predict the group to which they belong.
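A minimal sketch of that last idea is shown below. It assumes a freshly fitted model and three made-up observations (``new_points`` is a hypothetical name and the coordinates are illustrative), and the fixed ``random_state`` is only there for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same synthetic data as in the session
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.50, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Hypothetical unseen observations
new_points = np.array([[0.0, 4.0], [2.0, 1.0], [-1.5, 3.0]])

# Each new point receives the label of its nearest centroid
new_labels = model.predict(new_points)
print(new_labels)
```

The label values themselves are arbitrary: what matters is which points share a label.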

### Visualize how k-means works

#### Figure out the center positions

In [None]:
centers = kmeans.cluster_centers_

print(centers)

We plot each point with the colour of its assigned group. We also plot the centroid of each group, defined as the mean of the points assigned to it.

In [None]:
# Create a figure
fig = go.Figure()

# Add data points colored by their cluster labels
fig.add_trace(
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker={"color": y_predicted, "colorscale": "viridis", "size": 10},
        name="Data Points",
    )
)

# Add cluster centers
fig.add_trace(
    go.Scatter(
        x=centers[:, 0],
        y=centers[:, 1],
        mode="markers",
        marker={"color": "red", "size": 12, "symbol": "x"},
        name="Cluster Centers",
    )
)

# Update layout
fig.update_layout(title="K-means Clustering Results", xaxis_title="Feature 1", yaxis_title="Feature 2")

fig.show()

The K-means algorithm assigns the points to the clusters just as you would have done by eye.

What matters is understanding how it works. The good news is that the method is very simple and we can implement it ourselves. It follows the **Expectation-Maximization algorithm**, and the approach consists of:

1. Initial estimation of the centroids.
2. *Expectation step*: Assign the points to the closest cluster.
3. *Maximization step*: Set the centroids based on the new computed mean.
4. Go back to step 2 if the centroids have changed. Otherwise, stop.

When the centroids are not changing, the algorithm has converged.
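The steps above can be traced on a tiny one-dimensional dataset. The four points and the two starting centroids are made-up values chosen for illustration, and the sketch assumes that no cluster ever becomes empty.

```python
import numpy as np

points = np.array([1.0, 2.0, 9.0, 10.0])  # Toy one-dimensional dataset
centers = np.array([1.0, 2.0])  # Deliberately poor initial centroids

while True:
    # Expectation step: index of the closest centroid for each point
    labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
    # Maximization step: each centroid moves to the mean of its points
    new_centers = np.array([points[labels == j].mean() for j in range(len(centers))])
    print("labels:", labels, "centers:", new_centers)
    if np.allclose(new_centers, centers):
        break  # Centroids stopped moving: converged
    centers = new_centers
```

The centroids move from (1, 2) to (1, 7) and then to (1.5, 9.5), where they stay, so the loop stops with the points split into {1, 2} and {9, 10}.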

See the following code: the first function implements the same process we ran with ``scikit-learn``, and the second lets us take a deeper look at how the algorithm works through visualizations.

In [None]:
def find_clusters(
    X: np.ndarray[np.float64], centers: np.ndarray[np.float64]
) -> tuple[np.ndarray[np.float64], np.ndarray[np.float64], int]:
    """Find the clusters within a dataset.

    :param iterable X: Dataset with samples to be clustered.
    :param iterable centers: Initial centers of the clusters.
    :return: Tuple with the centers and labels for each iteration and the number of iterations.
    """
    # Initial parameters
    iters = 0
    n_clusters = len(centers)  # Number of clusters
    centers_iters = []  # Save centers for each iteration
    labels_iters = []  # Save assignments for each iteration

    while True:
        # Assign the points to the closest group
        labels = pairwise_distances_argmin(X, centers)

        # Save results
        centers_iters.append(centers)
        labels_iters.append(labels)

        # Reallocate the centroids
        new_centers = np.array([X[labels == i].mean(0) for i in range(n_clusters)])

        # Check convergence
        # Stop only when the centroids are exactly the same as in the previous iteration
        if np.all(centers == new_centers):
            break

        centers = new_centers
        iters += 1

    # The output lists are converted to numpy arrays.
    return np.array(centers_iters, dtype=np.float64), np.array(labels_iters, dtype=np.float64), iters

In [None]:
def visualize_kmeans_process(centers_iters: np.ndarray, labels_iters: np.ndarray, n_clusters: int, iters: int) -> None:
    """
    Visualize the k-means process for each iteration using Plotly.

    :param centers_iters: Centers for each iteration
    :param labels_iters: Assignments for each iteration
    :param n_clusters: Number of clusters
    :param iters: Number of iterations until convergence.
    """
    n_plots = iters + 1
    n_cols = min(3, n_plots)
    n_rows = int(np.ceil(n_plots / n_cols))

    # Determine figure width and height based on number of columns and rows
    width = n_cols * 600
    height = n_rows * 400

    # Create subplots with computed number of rows and columns
    fig = make_subplots(rows=n_rows, cols=n_cols, subplot_titles=[f"Iteration {i}" for i in range(n_plots)])

    for i in range(n_plots):
        row = i // n_cols + 1
        col = i % n_cols + 1

        fig.add_trace(
            go.Scatter(
                x=X[:, 0],
                y=X[:, 1],
                mode="markers",
                marker={"color": labels_iters[i], "colorscale": "viridis", "size": 8},
                showlegend=False,
            ),
            row=row,
            col=col,
        )

        for cluster in range(n_clusters):
            fig.add_trace(
                go.Scatter(
                    x=[centers_iters[i][cluster][0]],
                    y=[centers_iters[i][cluster][1]],
                    mode="markers",
                    marker={"color": "red", "size": 10, "symbol": "x"},
                    showlegend=False,
                ),
                row=row,
                col=col,
            )

    fig.update_layout(title="K-means Clustering Process", width=width, height=height)
    fig.show()

In [None]:
centers = np.array([[1, 1], [2, 3], [2, 1]])
# centers = np.array([[1, 1], [1, 3], [2, 1]])

centers_iters, labels_iters, iters = find_clusters(X, centers)

n_clusters = len(centers)
visualize_kmeans_process(centers_iters, labels_iters, n_clusters, iters)

In [None]:
help(kmeans)

The first graph shows an initial cluster assignment that is not the desired one, because the initial centroids were chosen arbitrarily.

However, the centroids move closer to their corresponding groups until the solution converges. **This happens when recomputing the distance from each point to its closest centroid no longer produces new assignments.**
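Convergence only guarantees a local optimum, so two starting points may end at different solutions. As a rough sketch, the final ``inertia_`` (the sum of squared distances from each point to its closest centroid) can be compared for two hand-picked initializations. Passing an array as ``init`` with ``n_init=1`` makes ``KMeans`` run once from exactly those centroids; the dictionary names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.50, random_state=0)

# Two hand-picked starting positions (illustrative values)
inits = {
    "init A": np.array([[1.0, 1.0], [2.0, 3.0], [2.0, 1.0]]),
    "init B": np.array([[2.0, 0.0], [3.0, 1.0], [2.0, 1.0]]),
}

inertias = {}
for name, init in inits.items():
    # n_init=1 runs a single K-means from the given centroids
    model = KMeans(n_clusters=3, init=init, n_init=1).fit(X_demo)
    inertias[name] = model.inertia_
    print(f"{name}: inertia = {model.inertia_:.2f}")
```

If the two values differ, the runs ended in different local optima, and the lower inertia is the tighter clustering.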

### Does the result depend on the initial centroids?

In [None]:
centers = np.array([[2, 0], [3, 1], [2, 1]])

centers_iters, labels_iters, iters = find_clusters(X, centers)

n_clusters = len(centers)
visualize_kmeans_process(centers_iters, labels_iters, n_clusters, iters)

#### Using only two groups

In [None]:
centers = np.array([[1, 1], [2, 3]])

centers_iters, labels_iters, iters = find_clusters(X, centers)

n_clusters = len(centers)
visualize_kmeans_process(centers_iters, labels_iters, n_clusters, iters)

### Limitations of K-means

#### Number of clusters

One of the most important limitations is that $K-means$ needs the number of groups as an argument. How are we going to know a priori the number of groups if we want to use this method to figure it out?

What happens if we choose a different number of clusters?

In [None]:
kmeans = KMeans(n_clusters=5)  # Set number of clusters
kmeans.fit(X)  # Training
y_kmeans = kmeans.predict(X)  # Prediction
centers = kmeans.cluster_centers_

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker={"color": y_kmeans, "colorscale": "Viridis", "size": 10},
        name="Data Points",
    )
)
fig.add_trace(
    go.Scatter(
        x=centers[:, 0],
        y=centers[:, 1],
        mode="markers",
        marker={"color": "red", "symbol": "x", "size": 12},
        name="Cluster Centers",
    )
)
fig.show()

To solve this problem, we can run K-means several times with different numbers of groups and choose the one that best meets a given criterion. Several criteria measure how well the clusters have been formed. The two best-known ones are the Elbow and the Silhouette methods. For more information, see the following link:

https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb
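For comparison, the Elbow method can be sketched as follows: fit K-means for a range of values of $k$, record the inertia of each fit, and look for the value after which the inertia stops dropping sharply. The range of $k$ and the variable names are assumptions made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.50, random_state=0)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the sum of squared distances to the closest centroid
    inertias.append(model.inertia_)

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Unlike the Silhouette score, the elbow has to be located by eye on the resulting curve.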

We use the Silhouette method for two reasons:

* It is already implemented in ``scikit-learn``.
* The optimum number of groups can be read off directly as the one with the highest score.

**This criterion is based on the idea that a point belongs to a group if it is very close to the other points of that group and very far from the points of the other groups.**
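More precisely, for each point $i$ the silhouette value is $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$, where $a(i)$ is the mean distance to the other points of its own cluster and $b(i)$ is the smallest mean distance to the points of any other cluster. The score is the average of $s(i)$ over all points. The sketch below computes it by hand and compares it with ``silhouette_score``; it assumes every cluster contains at least two points, and the small dataset is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

X_demo, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.50, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

D = pairwise_distances(X_demo)  # Euclidean distance between every pair of points
values = []
for i in range(len(X_demo)):
    same = labels == labels[i]
    same[i] = False  # a(i) excludes the point itself
    a = D[i, same].mean()
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
    values.append((b - a) / max(a, b))

manual_score = float(np.mean(values))
print(manual_score, silhouette_score(X_demo, labels))
```

The two printed numbers should agree up to floating-point error.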

See the following code:

In [None]:
scores = []
groups = np.arange(2, 11)  # 2, 3, 4, ..., 8, 9, 10

for k in groups:
    kmeans = KMeans(n_clusters=k, n_init=10).fit(X)
    labels = kmeans.labels_
    scores.append(silhouette_score(X, labels, metric="euclidean"))

# Create a figure using Plotly
fig = go.Figure()

# Add a line trace for silhouette scores
fig.add_trace(go.Scatter(x=groups, y=scores, mode="lines+markers"))
fig.update_layout(
    title="Silhouette Scores for Different Numbers of Clusters",
    xaxis_title="Number of Clusters",
    yaxis_title="Silhouette Score",
    xaxis={"tickmode": "linear"},
)

fig.show()

[Check Silhouette analysis](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html) from scikit-learn
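Picking the best candidate from the scores can also be done in code instead of reading it off the plot. A minimal sketch, assuming the same synthetic blobs as above and a fixed ``random_state`` for reproducibility:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.50, random_state=0)

groups = np.arange(2, 11)
scores = [
    silhouette_score(X_demo, KMeans(n_clusters=int(k), n_init=10, random_state=0).fit_predict(X_demo))
    for k in groups
]

# The candidate with the highest average silhouette is taken as the best k
best_k = int(groups[np.argmax(scores)])
print("Best number of clusters:", best_k)
```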

#### Linear Separation

The fundamental model assumption of k-means (points will be closer to their own cluster center than to others) means that the algorithm will often be ineffective if the clusters have complicated geometries.
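To see why this assumption produces straight boundaries, take two centroids $\mu_1$ and $\mu_2$ (symbols introduced here for illustration). A point $x$ lies on the boundary between the two clusters when it is equally distant from both:

$$
\lVert x - \mu_1 \rVert^2 = \lVert x - \mu_2 \rVert^2
\;\Longleftrightarrow\;
2\,(\mu_2 - \mu_1)^\top x = \lVert \mu_2 \rVert^2 - \lVert \mu_1 \rVert^2
$$

The quadratic term $\lVert x \rVert^2$ cancels, so the condition is linear in $x$: the boundary is the perpendicular bisector of the segment joining the two centroids (a line in two dimensions, a hyperplane in general).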

In particular, the boundaries between k-means clusters will always be linear, which means that it will fail for more complicated boundaries. Consider the following data, along with the cluster labels found by the typical k-means approach:

In [None]:
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05)
fig = px.scatter(
    x=X[:, 0],
    y=X[:, 1],
    color=y.astype(str),
    title="Circles Dataset - Instance Classes",
    labels={"x": "Feature 1", "y": "Feature 2", "color": "Class"},
)
fig.update_traces(marker={"size": 10})
fig.show()

### Do the groups above have a linear separation?

In [None]:
kmeans = KMeans(n_clusters=2).fit(X)
y_kmeans = kmeans.predict(X)

In [None]:
centers = kmeans.cluster_centers_

fig = go.Figure()

# Add data points colored by their cluster labels
fig.add_trace(
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker={"color": y_kmeans, "colorscale": "viridis", "size": 10},
        name="Data Points",
    )
)

# Add cluster centers
fig.add_trace(
    go.Scatter(
        x=centers[:, 0],
        y=centers[:, 1],
        mode="markers",
        marker={"color": "red", "size": 12, "symbol": "x"},
        name="Cluster Centers",
    )
)

# Update layout
fig.update_layout(title="K-means Clustering Results", xaxis_title="Feature 1", yaxis_title="Feature 2")

fig.show()

## 4. Spectral Clustering

To address this, we can borrow the idea of a kernel transformation, which projects the data into a higher dimension where a linear separation becomes possible, and use the same trick to let k-means discover non-linear boundaries.

One version of this kernelized k-means is implemented in scikit-learn in the ``SpectralClustering`` estimator. It uses the graph of nearest neighbors to compute a higher-dimensional representation of the data, and then assigns labels using a k-means algorithm:

In [None]:
model = SpectralClustering(n_clusters=2, affinity="nearest_neighbors", assign_labels="kmeans")
labels = model.fit_predict(X)

# Create a figure using Plotly
fig = go.Figure()

# Add data points colored by their cluster labels
fig.add_trace(
    go.Scatter(
        x=X[:, 0],
        y=X[:, 1],
        mode="markers",
        marker={"color": labels, "colorscale": "viridis", "size": 10},
        name="Data Points",
    )
)

# Update layout
fig.update_layout(title="Spectral Clustering Results", xaxis_title="Feature 1", yaxis_title="Feature 2")

fig.show()

**In real cases**, it is hard to check whether your clusters are linearly separable because of the number of dimensions. The practical approach is to try different models and judge how well each one works against your requirements.
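A rough sketch of such a comparison on the circles dataset is shown below. It uses the adjusted Rand index, which is only available here because the synthetic data comes with its true classes; on real unlabeled data a different criterion would be needed. The ``candidates`` dictionary and the fixed ``random_state`` values are assumptions made for this example.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X_demo, y_true = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

candidates = {
    "KMeans": KMeans(n_clusters=2, n_init=10, random_state=0),
    "SpectralClustering": SpectralClustering(
        n_clusters=2, affinity="nearest_neighbors", assign_labels="kmeans", random_state=0
    ),
}

results = {}
for name, model in candidates.items():
    predicted = model.fit_predict(X_demo)
    # Adjusted Rand index: 1.0 means perfect agreement with the true classes
    results[name] = adjusted_rand_score(y_true, predicted)
    print(f"{name}: ARI = {results[name]:.3f}")
```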

## 5. Categorical Variables

The standard k-means algorithm isn't directly applicable to categorical data, for various reasons. The sample space for categorical data is discrete, and doesn't have a natural origin. A Euclidean distance function on such a space isn't really meaningful.

There's a variation of k-means known as k-modes, introduced in this paper by Zhexue Huang, which is suitable for categorical data:

http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf
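In k-modes, the Euclidean distance is replaced by a simple matching dissimilarity (the number of attributes on which two records differ), and the cluster mean is replaced by the mode of each attribute. The sketch below illustrates only these two building blocks, not the full algorithm; the records and function names are hypothetical.

```python
from collections import Counter

# Hypothetical categorical records: (colour, size, material)
records = [
    ("red", "small", "wood"),
    ("red", "large", "wood"),
    ("blue", "small", "metal"),
]


def matching_dissimilarity(a: tuple, b: tuple) -> int:
    """Count the attributes on which two records disagree."""
    return sum(x != y for x, y in zip(a, b))


def mode_of(cluster: list) -> tuple:
    """Most frequent category per attribute; plays the role of the mean."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


print(matching_dissimilarity(records[0], records[1]))  # 1 (only size differs)
print(matching_dissimilarity(records[0], records[2]))  # 2 (colour and material differ)
print(mode_of(records))  # ('red', 'small', 'wood')
```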

An example in Python:

https://towardsdatascience.com/the-k-prototype-as-clustering-algorithm-for-mixed-data-type-categorical-and-numerical-fe7c50538ebb