# Clustering - K-Means and Hierarchical

---

## Block 1: Use Case Scenarios for Clustering

### What is the practical use of clustering?

Clustering is used to automatically group data into distinct clusters without knowing how many clusters there are or what they indicate beforehand. Common real-world applications include:

**1. Customer Segmentation**
- A marketing organization might separate customers into distinct segments
- Then investigate how those segments exhibit different purchasing behaviors
- Actionable insight: tailor products/services to each segment

**2. Clustering as a Preprocessing Step for Classification**
- Start by identifying distinct groups of data points using unsupervised learning
- Assign class labels to those clusters based on domain knowledge
- Use the labeled data to train a supervised classification model
- Benefit: reduces human labeling effort and ensures meaningful labels

**3. Species/Category Discovery**
- In the seeds dataset example: three seed species (0=*Kama*, 1=*Rosa*, 2=*Canadian*) are already known
- Clustering can separate seeds by physical characteristics (area, perimeter, compactness, etc.)
- Compare the unsupervised cluster assignments to the true species labels
- Validate whether clustering algorithms can naturally discover species boundaries

---

## Block 2: Practical Implementation Strategy - Seeds Dataset Example

### Objective
Create a K-Means clustering model using the seeds dataset to:
1. Automatically group seeds into clusters based on physical features
2. Evaluate cluster quality using silhouette score
3. Compare clusters to known seed species


In [None]:
import pandas as pd
from plot_clusters import plot_clusters


In [None]:
# load the training dataset
data = pd.read_csv('../../generated/data/raw/seeds.csv')

# Display a random sample of 10 observations (just the features)
features = data[data.columns[0:6]]
features.sample(10)

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Normalize the numeric features so they're on the same scale
scaled_features = MinMaxScaler().fit_transform(features)

# Get two principal components
pca = PCA(n_components=2).fit(scaled_features)
features_2d = pca.transform(scaled_features)
print(features_2d[0:10])

### K-Means Clustering Algorithm Overview

**How K-Means Works:**

1. K initial centroids are chosen at random.
2. Clusters are formed by assigning each data point to its closest centroid.
3. The mean of each cluster is computed, and the centroid is moved to that mean.
4. Steps 2 and 3 are repeated until a stopping criterion is met. Typically, the algorithm terminates when a new iteration moves the centroids only negligibly and the cluster assignments stop changing.
5. When the clusters stop changing, the algorithm has *converged*, defining the locations of the clusters. Because the centroids start at random positions, re-running the algorithm can produce slightly different clusters. Training therefore usually involves several runs, re-initializing the centroids each time, and the model with the lowest WCSS (*within-cluster sum of squares*) is selected.
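
The loop above can be sketched in plain Python. This is a minimal illustration, not how scikit-learn implements `KMeans`: the function names (`kmeans_once`, `kmeans`), the toy points, and the choice to seed the centroids from randomly chosen data points are all assumptions made for the example.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, rng, max_iter=100):
    """One run of the loop described above; returns (centroids, labels, wcss)."""
    # Step 1: pick k data points at random as the starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Steps 4-5: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # WCSS: total squared distance from each point to its own centroid.
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, wcss


def kmeans(points, k, n_init=10, seed=0):
    """Repeat kmeans_once with fresh starting centroids; keep the lowest WCSS."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


# Toy data (made up for illustration): two well-separated groups in 2-D.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centroids, labels, wcss = kmeans(points, k=2)
print(labels, round(wcss, 3))
```

The cluster numbers themselves are arbitrary: which group ends up as cluster 0 depends on the random starting centroids, so only the grouping is meaningful.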


### Step 1: Fit the K-Means Model

In [None]:
from sklearn.cluster import KMeans

# Create a model based on 3 centroids
# For the seeds dataset, we'll use K=3 clusters because we know there are 3 seed species
# The algorithm will discover whether the physical features naturally separate into 3 distinct groups

model = KMeans(n_clusters=3, init='k-means++', n_init=100, max_iter=1000)

# Fit to the raw (unscaled) feature values and predict the cluster assignment for each data point
km_clusters = model.fit_predict(features.values)

# View the cluster assignments
print(km_clusters)

### Step 2: Visualize Cluster Assignments

We'll plot the clusters using two principal components to see how well the seeds are separated.


In [None]:
plot_clusters(features_2d, km_clusters)


The data should be separated into three distinct clusters. If not, rerun the previous steps.


### Step 3: Evaluate Cluster Quality with Silhouette Score

To quantify how well the clusters are separated, we calculate a **silhouette score** - a metric with a value between -1 and +1. The closer this value is to +1, the better separated the clusters are.

**Silhouette Score Interpretation:**
- **+1**: Points are very close to their own cluster, far from others (excellent separation)
- **0**: Points sit on cluster boundaries (ambiguous)
- **-1**: Points are in the wrong cluster (poor separation)

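**How it works**: For each point, compare the mean distance to the other points in its own cluster (a) with the mean distance to the points in the nearest other cluster (b). Score = (b - a) / max(a, b). The overall silhouette score is the mean of these per-point scores.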
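
To make the formula concrete, here is a small hand-rolled version. It is a sketch under stated assumptions: Euclidean distance, at least two clusters, and a score of 0 for a point that is alone in its cluster (a common convention). The function name `silhouette_values` and the four toy points are made up for the example.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette s = (b - a) / max(a, b) using Euclidean distance."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            # p is alone in its cluster; score it 0 by convention.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of p's own cluster.
        a = sum(by_cluster[own]) / len(by_cluster[own])
        # b: mean distance to the members of the nearest other cluster.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != own)
        scores.append((b - a) / max(a, b))
    return scores


# Toy data (made up): two tight pairs, four units apart.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette_values(points, labels)
mean_score = sum(scores) / len(scores)
print(f"Mean silhouette: {mean_score:.3f}")
```

Averaging the per-point values gives a single number comparable to what `silhouette_score` reports for the whole dataset.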


In [None]:
from sklearn.metrics import silhouette_score

# Compute the silhouette score
silhouette_avg = silhouette_score(features, km_clusters)
print(f"Silhouette Score: {silhouette_avg:.3f}")

# Interpretation
if silhouette_avg > 0.5:
    print("✓ Strong cluster separation - clusters are well-defined")
elif silhouette_avg > 0.3:
    print("→ Reasonable cluster structure - acceptable separation")
else:
    print("✗ Weak or overlapping clusters - consider adjusting k or reviewing data")


So what's the practical use of clustering? In some cases, you'll have data that you need to group into distinct clusters without knowing how many clusters there are or what they indicate. For example, a marketing organization might want to separate customers into distinct segments, and then investigate how those segments exhibit different purchasing behaviors.

Sometimes, clustering is used as an initial step towards creating a classification model. You start by identifying distinct groups of data points, and then assign class labels to those clusters. You can then use this labelled data to train a classification model.



In the case of the seeds data, the different species of seed are already known and encoded as 0 (*Kama*), 1 (*Rosa*), or 2 (*Canadian*), so we can use these identifiers to compare the species classifications to the clusters identified by our unsupervised algorithm.

In [None]:
seed_species = data[data.columns[7]]
seed_species

In [None]:
plot_clusters(features_2d, seed_species.values)

There may be some differences between the cluster assignments and class labels, but the K-Means model should have done a reasonable job of clustering the observations so that seeds of the same species are generally in the same cluster.
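
Because cluster numbers are arbitrary (cluster 0 need not correspond to species 0), a direct equality check between the two arrays is misleading. One way to quantify the agreement is a contingency table plus a purity score. The sketch below uses made-up assignments rather than results from the seeds data, and the helper names `contingency` and `purity` are hypothetical.

```python
from collections import Counter


def contingency(cluster_ids, true_labels):
    """Count how many points fall in each (cluster, true label) pair."""
    return Counter(zip(cluster_ids, true_labels))


def purity(cluster_ids, true_labels):
    """Fraction of points whose true label is the majority label of their cluster."""
    table = contingency(cluster_ids, true_labels)
    majority = {}
    for (cluster, _label), count in table.items():
        majority[cluster] = max(majority.get(cluster, 0), count)
    return sum(majority.values()) / len(cluster_ids)


# Made-up assignments for illustration; not results from the seeds data.
example_clusters = [2, 2, 2, 0, 0, 1, 1, 1, 0]
example_species = [0, 0, 0, 1, 1, 2, 2, 2, 2]
table = contingency(example_clusters, example_species)
score = purity(example_clusters, example_species)
print(table)
print(f"Purity: {score:.3f}")
```

In the notebook, the same functions could be called as `purity(km_clusters, seed_species.values)`. `pd.crosstab(seed_species, km_clusters)` would give a similar table directly in pandas.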

## Hierarchical Clustering

Hierarchical clustering methods make fewer distributional assumptions when compared to K-Means methods. However, K-Means methods are generally more scalable, sometimes very much so.

Hierarchical clustering creates clusters by using either a *divisive* method or an *agglomerative* method. The divisive method is a "top down" approach that starts with the entire dataset and then finds partitions in a stepwise manner. Agglomerative clustering is a "bottom up" approach. In this lab you will work with agglomerative clustering, which works as follows:

1. The linkage distances between each pair of data points are computed.
2. Points are clustered pairwise with their nearest neighbor.
3. Linkage distances between the clusters are computed.
4. Clusters are combined pairwise into larger clusters.
5. Steps 3 and 4 are repeated until all data points are in a single cluster.
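
The bottom-up procedure can be sketched in a few lines. This is a deliberately naive illustration, not scikit-learn's algorithm. It assumes average linkage with Euclidean distance and merges only the single closest pair of clusters per step (a common simplification of the pairwise merging described above). It also stops at a requested number of clusters instead of continuing to a single cluster. The name `agglomerate` and the toy points are made up.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between members of two clusters."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters, linkage=average_linkage):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (i < j, so index i stays valid).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data (made up): two close pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerate(points, n_clusters=3)
print(clusters)
```

Calling `agglomerate(points, n_clusters=1)` runs the merging all the way to a single cluster, which corresponds to step 5.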

The linkage function can be computed in a number of ways:
- *Ward* linkage measures the increase in variance for the clusters being linked.
- *Average* linkage uses the mean pairwise distance between the members of the two clusters.
- *Complete* or *maximal* linkage uses the maximum distance between the members of the two clusters.
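
As a rough illustration of how the three linkage functions differ, the sketch below evaluates each on the same two small clusters. The clusters are made up and the function names are hypothetical. The Ward value is computed here as the increase in the within-cluster sum of squares caused by the merge, which is one standard way to express the "increase in variance".

```python
import math


def average_linkage(a, b):
    """Mean of all pairwise distances between members of a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def complete_linkage(a, b):
    """Largest pairwise distance between members of a and b."""
    return max(math.dist(p, q) for p in a for q in b)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]


def ward_linkage(a, b):
    """Increase in within-cluster sum of squares if a and b were merged."""
    gap = math.dist(centroid(a), centroid(b))
    return len(a) * len(b) / (len(a) + len(b)) * gap ** 2


# Two made-up clusters with two points each.
left = [(0.0, 0.0), (0.0, 2.0)]
right = [(3.0, 0.0), (3.0, 2.0)]
results = {
    "average": average_linkage(left, right),
    "complete": complete_linkage(left, right),
    "ward": ward_linkage(left, right),
}
for name, value in results.items():
    print(f"{name:>8}: {value:.3f}")
```

Note that the Ward value is in squared-distance units, so it is not directly comparable to the other two.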

Several different distance metrics are used to compute linkage functions:
- *Euclidean* or *l2* distance is the most widely used. This metric is the only choice for the Ward linkage method.
- *Manhattan* or *l1* distance is less sensitive to outliers than Euclidean distance, because coordinate differences are not squared.
- *Cosine similarity* is the dot product between the location vectors, divided by the magnitudes of the vectors. Note that this metric is a measure of similarity, whereas the other two metrics are measures of difference. Similarity can be quite useful when working with data such as images or text documents.
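
The three metrics are short enough to write out directly. The following is a minimal sketch with made-up vectors; the function names are chosen for the example.

```python
import math


def euclidean(u, v):
    """l2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """l1 distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration.
u = (3.0, 4.0)
v = (4.0, 3.0)
d_l2 = euclidean(u, v)
d_l1 = manhattan(u, v)
sim = cosine_similarity(u, v)
print(f"euclidean={d_l2:.4f}  manhattan={d_l1:.1f}  cosine similarity={sim:.2f}")
```

Because cosine similarity grows as vectors become more alike, it is usually converted to a distance as `1 - cosine_similarity(u, v)` before being used for clustering.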


### Agglomerative Clustering

Let's see an example of clustering the seeds data using an agglomerative clustering algorithm.

In [None]:
from sklearn.cluster import AgglomerativeClustering

agg_model = AgglomerativeClustering(n_clusters=3)
agg_clusters = agg_model.fit_predict(features.values)
print(agg_clusters)

In [None]:
plot_clusters(features_2d, agg_clusters, "Agglomerative Clustering Assignments")


To calculate the silhouette score for the agglomerative clustering model:

In [None]:
score = silhouette_score(features.values, agg_clusters)
print(f"Silhouette Score: {score:.3f}")

# Interpretation
if score > 0.5:
    print("✓ Strong cluster separation - clusters are well-defined")
elif score > 0.3:
    print("→ Reasonable cluster structure - acceptable separation")
else:
    print("✗ Weak or overlapping clusters - consider adjusting k or reviewing data")

## Summary

Here we practiced using K-Means and hierarchical clustering. These unsupervised learning techniques take unlabelled data and identify which data points are similar to one another.