In [None]:
# Import required libraries
# Your code here

## 8.1 Introduction & Motivation

The k-Means algorithm works well when we know in advance how many clusters to expect. However, as we've seen, it becomes more challenging when the number of clusters is unknown. We could fit several k-Means models with different values of k, calculate an evaluation metric for each (as we did with the elbow method), and compare them. Alternatively, we can use **Hierarchical Clustering**, which offers a more flexible approach.

**Key Advantage:** Hierarchical clustering doesn't require us to specify the number of clusters upfront. Instead, it creates a hierarchy of clusters that we can cut at any level to get our desired number of groups.

## 8.2 Problem Setting

We're continuing to work with the unsupervised version of the digits dataset from the previous chapter. As a reminder:

* **Dataset:** Handwritten digits (0-9)
* **Features:** 64 variables representing pixel intensities in 8x8 images
* **Observations:** 1,797 digit images
* **Task:** Group similar digits together without using the true labels

This allows us to compare hierarchical clustering with k-Means on the same problem.

In [None]:
# Load the digits dataset
# Your code here

## 8.3 Model

### 8.3.1 Model

The idea behind hierarchical clustering is remarkably simple and intuitive. Consider the following dataset:

![](https://s3.amazonaws.com/stackabuse/media/hierarchical-clustering-python-scikit-learn-1.png)

We can clearly see two clusters in this visualization, but we need an algorithm to identify them systematically. Here's how **agglomerative (bottom-up) hierarchical clustering** works:

**Step-by-Step Process:**
1. **Start:** Each observation begins as its own cluster (n clusters for n points)
2. **Measure:** Calculate the distance between all pairs of clusters
3. **Merge:** Combine the two closest clusters into one
4. **Repeat:** Continue steps 2-3 until all points are in a single cluster
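
The four steps above can be sketched in a few lines of NumPy. This is a deliberately naive illustration, not how optimized libraries implement it: the function name `naive_agglomerative`, the toy points, and the choice of single linkage (the distance between the closest members of two clusters) are all assumptions made for this example.

```python
import numpy as np

def naive_agglomerative(points):
    """Merge clusters bottom-up (single linkage) and return the merge history."""
    # Step 1: every observation starts as its own cluster
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        # Step 2: measure the distance between every pair of clusters
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: merge the two closest clusters and record the merge
        d, a, b = best
        history.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        # Step 4: the loop repeats until a single cluster remains
    return history

# Hypothetical toy data: two tight pairs and one isolated point
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
merge_history = naive_agglomerative(toy_points)

for left, right, distance in merge_history:
    print(f"merge {left} + {right} at distance {distance:.2f}")
```

With n points the loop always records n - 1 merges, which is why the resulting hierarchy contains a solution for every number of clusters from n down to 1.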

**Result:** We create a hierarchy that contains solutions for every possible number of clusters (from n down to 1). This process can be visualized as follows:

![](https://s3.amazonaws.com/stackabuse/media/hierarchical-clustering-python-scikit-learn-2.png)

The power of hierarchical clustering lies in its flexibility. We can "cut" the dendrogram (tree diagram) at any height to obtain our desired number of clusters. The horizontal line we draw across the tree determines how many clusters we end up with:

* **Cut high:** Fewer, larger clusters
* **Cut low:** More, smaller clusters
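
As a rough sketch of how the cut height changes the result, the snippet below builds a tree with SciPy's `linkage` and cuts it at two heights with `fcluster`. The toy data, the random seed, and the two cut heights are assumptions chosen for illustration; they are not part of the digits example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data (an assumption for illustration): two well-separated blobs
rng = np.random.default_rng(0)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Each row of the linkage matrix is one merge; column 2 holds its height
merge_tree = linkage(toy_data, method="ward")
heights = merge_tree[:, 2]

# A high cut between the last two merges, and a lower cut further down the tree
high_cut = (heights[-2] + heights[-1]) / 2
low_cut = (heights[-5] + heights[-4]) / 2

labels_high = fcluster(merge_tree, t=high_cut, criterion="distance")
labels_low = fcluster(merge_tree, t=low_cut, criterion="distance")

n_high = len(np.unique(labels_high))
n_low = len(np.unique(labels_low))
print("clusters after high cut:", n_high)
print("clusters after low cut:", n_low)
```

Here the high cut leaves two clusters and the low cut leaves five, because every merge above the cut line is undone.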

This visual approach helps us choose a sensible number of clusters:

![](https://s3.amazonaws.com/stackabuse/media/hierarchical-clustering-python-scikit-learn-4.png)

### 8.3.2 Model Estimation

Let's create a dendrogram to visualize the hierarchical structure of our digits dataset:

**Note:** A dendrogram is a tree diagram that shows the arrangement of clusters. The height at which branches merge indicates the distance between clusters.

In [None]:
# Create a dendrogram
# Your code here

**Interpreting the Dendrogram:**

Ideally, we draw our cutoff line where there's the largest vertical distance without any merges occurring. This represents a natural separation between clusters.
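
The "largest vertical gap" rule can also be applied numerically. The sketch below is one possible way to do it, assuming Ward linkage (so merge heights are in ascending order) and a hypothetical two-blob toy dataset; the variable names are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data (an assumption for illustration): two well-separated blobs
rng = np.random.default_rng(1)
toy_data = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
])

# Column 2 of the linkage matrix holds the height of each merge
merge_heights = linkage(toy_data, method="ward")[:, 2]

# gaps[i] is the vertical distance between merge i and merge i + 1
gaps = np.diff(merge_heights)
widest = int(np.argmax(gaps))

# Cutting inside gap i keeps merges 0..i, leaving len(merge_heights) - i clusters
suggested_k = len(merge_heights) - widest
print("suggested number of clusters:", suggested_k)
```

On real data such as the digits, several gaps may be of similar size, so this number is best treated as a starting point rather than a definitive answer.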

One of the key advantages of hierarchical clustering is its flexibility: if certain digits are too similar and get confused, we can easily adjust by choosing a different number of clusters to improve separation.

**For this example:** Since we know there are 10 distinct digits (0-9), we'll fit our model with 10 clusters:
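
A minimal sketch of such a fit, assuming scikit-learn's `AgglomerativeClustering` with Ward linkage and the digits data loaded via `load_digits` (the variable names are illustrative, and your own solution may differ):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_digits

# 1,797 images, each flattened to 64 pixel-intensity features
digits_X = load_digits().data

# Build the hierarchy and cut it so that exactly 10 clusters remain
model = AgglomerativeClustering(n_clusters=10, linkage="ward")
cluster_labels = model.fit_predict(digits_X)

print(cluster_labels[:20])
```

The cluster ids are arbitrary: cluster 3 does not necessarily contain the digit 3, so the labels have to be matched to digits before any accuracy can be computed.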

In [None]:
# Fit hierarchical clustering with 10 clusters
# Your code here

## 8.4 Exercises

##### Question 1: Try to create an elbow plot for the hierarchical clustering model on the digits dataset.

**Understanding the Metrics:**

For k-Means, we used the distortion (inertia) metric to create elbow plots. However, hierarchical clustering doesn't use centroids, so distortion isn't applicable here.

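Instead, we use the **Linkage Criterion**, which defines the distance between two clusters and therefore decides which pair is merged next. The distance at which each merge happens can be plotted in place of distortion. The specific calculation depends on the linkage method:

* **Ward linkage** (the default in scikit-learn): Merges the pair of clusters that gives the smallest increase in total within-cluster variance
* **Complete linkage**: Uses the maximum distance between any two members of the two clusters
* **Average linkage**: Uses the average distance over all pairs of members, one taken from each cluster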

**Hint:** The Ward method is a good starting point because it tends to produce compact, roughly spherical clusters, similar to k-Means.
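
To make the three criteria concrete, the sketch below computes each of them by hand for two tiny, hypothetical clusters. The points and names are assumptions for illustration. For Ward, the snippet reports the increase in the within-cluster sum of squares caused by the merge; libraries may report a rescaled version of this quantity as the merge height.

```python
import numpy as np

# Two hypothetical clusters with two members each
cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[4.0, 0.0], [4.0, 2.0]])

# Euclidean distance between every member of A and every member of B
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

complete_distance = pairwise.max()   # farthest pair of members
average_distance = pairwise.mean()   # mean over all cross-cluster pairs

def within_ss(points):
    """Sum of squared distances from each point to the cluster mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

# Ward: how much the within-cluster sum of squares grows if A and B merge
merged = np.vstack([cluster_a, cluster_b])
ward_increase = within_ss(merged) - within_ss(cluster_a) - within_ss(cluster_b)

print(f"complete: {complete_distance:.3f}")
print(f"average:  {average_distance:.3f}")
print(f"ward increase: {ward_increase:.3f}")
```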

In [None]:
# Create elbow plot using linkage criterion
# Your code here

**Analysis Questions:**

* Where does the elbow occur in your plot?
* Does this match our expectation of 10 clusters?
* Would 9 clusters also be a reasonable choice? Why or why not?
* Why does the graph "wobble" more than k-Means elbow plots?

**Your analysis here:**

##### Question 2: Try to split the dataset into training and test sets and check how accurate the best possible hierarchical clustering model is.

In [None]:
# Split data and evaluate hierarchical clustering
# Your code here

**Analysis Questions:**

* What accuracy did you achieve?
* How does this compare to k-Means (which achieved ~75%)?
* Which digits are most commonly confused?
* Why do you think hierarchical clustering performs the way it does on this dataset?

**Your analysis here:**

##### Question 3: Compare the hierarchical clustering and k-Means algorithms on this dataset. Report which one provides the better fit. Is this what you would expect?

**Your comparative analysis here:**

Consider:
* Performance differences (accuracy)
* Advantages of each method
* When would you prefer hierarchical clustering?
* When would you prefer k-Means?
* Why might hierarchical clustering perform better or worse on this specific dataset?