# Foundations of Data Science (GDW) 2023



# Exercise VII: Hierarchical Clustering

This week's exercise will be all about clustering (again).

## Part 1: Agglomerative Hierarchical Clustering

Our plan now is to test different linkage strategies for agglomerative hierarchical clustering.

We start by generating artificial data the way we did last week:

In [None]:
import numpy as np
from sklearn import cluster, datasets
import matplotlib.pyplot as plt

n_samples = 1000

# blobs with varied variances
X, _ = datasets.make_blobs(
    n_samples=n_samples, cluster_std=[1.0, 2.5, 0.5], random_state=170
)
plt.scatter(X[:,0], X[:,1])

### Task 1.1
Try different linkage strategies on the data above. You can run the algorithm with the scikit-learn class `cluster.AgglomerativeClustering`.

When creating the model, the parameter `linkage=` sets the linkage criterion, i.e. how the distance between two clusters is measured. The options are `'single'`, `'complete'`, `'average'` and `'ward'`.

Additionally, you can set either `distance_threshold` (the linkage distance at or above which clusters are no longer merged) or `n_clusters` (the number of clusters the algorithm has to find). Exactly one of the two must be given and the other has to be `None`. Because `n_clusters` defaults to 2, pass `n_clusters=None` explicitly when you use `distance_threshold`.
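
The interplay of `n_clusters` and `distance_threshold` can be sketched on a tiny data set. The six 1-D points and the threshold of `1.0` below are made up for illustration and are not part of the exercise data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: two tight groups that lie far apart.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Variant A: fix the number of clusters.
by_count = AgglomerativeClustering(n_clusters=2, linkage="single").fit(points)

# Variant B: fix the merge threshold instead; n_clusters must then be None.
by_threshold = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0, linkage="single"
).fit(points)

print(by_count.labels_)          # one label per point
print(by_threshold.n_clusters_)  # number of clusters left after thresholding
```

Both variants should separate the two groups here, because the gap between them is much larger than the threshold.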

Then, note your observations.

In [None]:
# write your code here

*Write your answer here*

## Part 2: Dendrograms

Hierarchical clusterings can be visualized with dendrograms. A dendrogram is a visual aid that helps in choosing where to cut the hierarchy, and thereby the number of clusters.
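
To make "cutting" concrete, here is a minimal sketch that uses SciPy's `linkage` and `fcluster` directly instead of scikit-learn. The toy points and the cut height of `1.0` are assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy data: two tight groups that lie far apart.
points = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])

# Each row of Z records one merge: (cluster a, cluster b, merge distance, size).
Z = linkage(points, method="single")

# The last merge joins the two groups at a much larger distance than all
# earlier merges. Cutting below that height (here at 1.0) keeps them apart.
cut_labels = fcluster(Z, t=1.0, criterion="distance")

print(Z[:, 2])     # merge distances, i.e. the heights drawn in a dendrogram
print(cut_labels)  # flat cluster label per point after the cut
```

A large jump in the merge distances is the usual hint for where a cut is sensible.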

We begin by creating a custom plotting function.

In [None]:
def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

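With this, we can plot the dendrogram of the cluster model. To increase the readability of the plot, we can set `truncate_mode="level"` and control the number of levels with `p=`. Note that the plotting function reads `model.distances_`, which scikit-learn only computes if the model was created with `distance_threshold` set or with `compute_distances=True`.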

If you have not installed `scipy` yet, you need to do so with the package manager of your choice.
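
The effect of truncation can be checked without drawing anything. The sketch below assumes random toy data and uses SciPy's `no_plot=True` option, which returns the dendrogram layout as a dictionary instead of plotting it.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))  # hypothetical toy data
Z = linkage(points, method="ward")

# no_plot=True skips the drawing, so we can simply count the leaves.
full = dendrogram(Z, no_plot=True)
truncated = dendrogram(Z, truncate_mode="level", p=3, no_plot=True)

print(len(full["leaves"]), len(truncated["leaves"]))
```

In the truncated version, whole subtrees are collapsed into single leaves, which is why far fewer leaves remain.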

In [None]:
from scipy.cluster.hierarchy import dendrogram

# cluster_model is your fitted AgglomerativeClustering model from Task 1.1
plot_dendrogram(cluster_model, truncate_mode="level", p=3)

### Task 2.1
How many clusters did the algorithm find?

*write your answer here*

## Part 3: HDBSCAN
There is a hierarchical version of *DBSCAN* that we will take a look at now.

But first, let us consider the following scenario on artificially generated data.

In [None]:
plot_kwds = {'alpha' : 0.5, 's' : 80, 'linewidths':0}

moons, _ = datasets.make_moons(n_samples=50, noise=0.05)
blobs, _ = datasets.make_blobs(n_samples=50, centers=[(-0.75,2.25), (1.0, 2.0)], cluster_std=0.3)
test_data = np.vstack([moons, blobs])
plt.scatter(test_data.T[0], test_data.T[1], color='b', **plot_kwds)

### Task 3.1
Run the DBSCAN algorithm (`cluster.DBSCAN`) on the dataset `test_data`. Again, tune the hyperparameters `eps` and `min_samples` yourself.
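
As a reminder of how the two hyperparameters act, here is a small sketch on made-up points (not `test_data`). The values `eps=0.3` and `min_samples=3` are picked only to fit this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two dense groups and one isolated point.
points = np.array(
    [
        [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],     # dense group 1
        [3.0, 3.0], [3.1, 3.0], [3.0, 3.1],     # dense group 2
        [10.0, 10.0],                           # isolated point
    ]
)

# A point is a core point if at least min_samples points (itself included)
# lie within distance eps of it. Points that no cluster reaches get label -1.
model = DBSCAN(eps=0.3, min_samples=3).fit(points)
print(model.labels_)  # [ 0  0  0  1  1  1 -1]
```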

In [None]:
# write your code here

Now, run the HDBSCAN algorithm on the same data. It is available as the class `cluster.HDBSCAN` (scikit-learn 1.3 or newer). Can you see the advantage?

In [None]:
# write your code here

*Note your finding here*