# Clustering

Data arising in nature often forms *clusters*. In a low-dimensional data space, the clusters can often be seen by eye; see below. In a high-dimensional data space, however, visualization is generally hard, and we need *clustering algorithms* to find the clusters.

The essential idea in a clustering algorithm is to take a set of observations $\{x_1,\dots,x_n\}$ and assign to each one an integer in $\{1, \dots, k\}$, i.e. a cluster label, such that observations that are close to one another (usually in the Euclidean metric) receive the same cluster label. The number $k$ of clusters is sometimes given as input to the algorithm.

We will compare different clustering algorithms on the same input set.
Concretely, we want to compare

- K-Means
- Mean shift
- DBSCAN
- Birch

We will again use the data from the $Z \to e^+ e^-$ decay of Exercise 1.2. The library sklearn implements all these algorithms (and many more). The documentation (as well as a nice comparison and an explanation of the various algorithms) can be found at https://scikit-learn.org/stable/modules/clustering.html.

## The Data

First we plot the data points in the $(\Delta \eta, \Delta \phi)$-plane. Remember that $\eta_{1,2}$ are the pseudorapidities and $\phi_{1,2}$ are the azimuthal angles. What do you observe? What is the physical reason for the distribution? Would it look different at a different collider?
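A minimal sketch of how the two coordinates could be built with NumPy. The arrays `eta1`, `eta2`, `phi1`, `phi2` are hypothetical stand-ins filled with random numbers, not the Exercise 1.2 data, and wrapping $\Delta\phi$ into $[-\pi, \pi)$ is an assumption; the exercise may use the plain difference instead.

```python
import numpy as np

def delta_phi(phi1, phi2):
    """Difference of two azimuthal angles, wrapped into [-pi, pi)."""
    d = np.asarray(phi1) - np.asarray(phi2)
    return (d + np.pi) % (2 * np.pi) - np.pi

# Hypothetical stand-in for the real data: random angles and pseudorapidities.
rng = np.random.default_rng(0)
n = 1000
eta1 = rng.normal(0.0, 1.0, n)
eta2 = rng.normal(0.0, 1.0, n)
phi1 = rng.uniform(-np.pi, np.pi, n)
phi2 = rng.uniform(-np.pi, np.pi, n)

d_eta = eta1 - eta2
d_phi = delta_phi(phi1, phi2)

# One row per event, columns (delta eta, delta phi); this is the input
# format that the sklearn clustering classes expect.
X = np.column_stack([d_eta, d_phi])
print(X.shape)
```

The two columns of `X` can then be passed to any scatter-plot routine, and the same array can be reused as input for every clustering algorithm that follows.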

## K-means Clustering

Now we will perform K-means clustering. This is one of the most basic types of clustering algorithms. The idea is to partition the observations into subsets

$$S = \{S_1,\dots,S_k\}$$

such that the within-cluster sum of squares

$$W(S) = \sum_{i=1}^k \sum_{x \in S_i} ||x - \mu_i||^2$$

is minimized, where the mean (a.k.a. centroid or cluster center) of $S_i$ is

$$\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x$$

The K-means clustering algorithm simply tries to find a partition $S$ that minimizes $W(S)$,

$$\operatorname*{arg\,min}_S W(S)$$
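The definitions above can be checked numerically. The following is a sketch on synthetic two-blob data (an assumption, not the CERN data): it fits sklearn's `KMeans` and recomputes $W(S)$ by hand from the returned labels and centroids, which should agree with the fitted model's `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: two Gaussian blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-2.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([2.0, 0.0], 0.5, size=(200, 2)),
])

k = 2
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_            # cluster label of each observation
centers = kmeans.cluster_centers_  # the centroids mu_i

# Within-cluster sum of squares W(S), computed directly from the definition.
W = sum(np.sum((X[labels == i] - centers[i]) ** 2) for i in range(k))

print("W(S) by hand:", W)
print("inertia_    :", kmeans.inertia_)
```

Note that the iterative procedure used by `KMeans` is only guaranteed to reach a local minimum of $W(S)$, which is why `n_init=10` restarts it from several initializations and keeps the best result.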


We wish to perform K-means clustering on the $Z \to e^+ e^-$ data from CERN.

What do you observe? What is the physical meaning of the cluster center?

## Mean Shift Clustering

Now we will perform mean shift clustering. Mean shift is a density-based algorithm. The idea is to find the modes of the data distribution, which will serve as cluster centers. Unlike K-means, the number of clusters is not given as input and is determined by the algorithm. You can read more about it [here](https://scikit-learn.org/stable/modules/clustering.html#mean-shift).
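As a rough illustration, sklearn's `MeanShift` can be run as follows. The data are again synthetic blobs (an assumption), and the bandwidth is chosen with `estimate_bandwidth`; the `quantile` value is an arbitrary choice for this sketch and would need tuning on the real data.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(200, 2)),
])

# The bandwidth sets the size of the window used to estimate the density.
bandwidth = estimate_bandwidth(X, quantile=0.3, random_state=0)

ms = MeanShift(bandwidth=bandwidth).fit(X)
centers = ms.cluster_centers_          # estimated modes of the distribution
n_clusters = len(np.unique(ms.labels_))

print("bandwidth:", bandwidth)
print("number of clusters found:", n_clusters)
print("cluster centers:\n", centers)
```

No number of clusters is passed in; it follows from how many distinct modes survive for the chosen bandwidth, so a smaller bandwidth will generally produce more clusters.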

Where are the cluster centers? What does that mean physically?

## DBSCAN Clustering

Now let's try DBSCAN clustering. Like mean shift, it is a density-based clustering algorithm, and the number of clusters is not given as input. You can read more about it [here](https://scikit-learn.org/stable/modules/clustering.html#dbscan).

DBSCAN sorts points into three categories: core, border, and noise. Core points are points that have at least `min_samples` points within a distance of $\epsilon$. Border points are points that have fewer than `min_samples` points within a distance of $\epsilon$, but are in the neighborhood of a core point. Noise points are all other points.

Note: as $\epsilon$ increases, we see that there are fewer noise points, as we would expect.
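The three categories and the effect of $\epsilon$ can be sketched with sklearn's `DBSCAN`. The data below are synthetic (two blobs plus a uniform background, an assumption), and the `eps` values are arbitrary. In sklearn, noise points receive the label `-1`, core points are listed in `core_sample_indices_`, and `min_samples` counts the point itself.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data: two dense blobs plus sparse uniform background.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-2.0, 0.0], 0.3, size=(150, 2)),
    rng.normal([2.0, 0.0], 0.3, size=(150, 2)),
    rng.uniform(-4.0, 4.0, size=(50, 2)),
])

eps_values = (0.1, 0.3, 0.6)
noise_counts = {}
for eps in eps_values:
    db = DBSCAN(eps=eps, min_samples=5).fit(X)

    is_core = np.zeros(len(X), dtype=bool)
    is_core[db.core_sample_indices_] = True
    is_noise = db.labels_ == -1
    # Border points belong to a cluster but are not core points themselves.
    is_border = ~is_core & ~is_noise

    noise_counts[eps] = int(is_noise.sum())
    print(f"eps={eps}: core={is_core.sum()}, "
          f"border={is_border.sum()}, noise={is_noise.sum()}")
```

For fixed `min_samples`, enlarging $\epsilon$ can only enlarge neighborhoods, so a point that was core or border stays inside some cluster and the noise count cannot go up.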

## Birch Clustering

BIRCH, which stands for "balanced iterative reducing and clustering using hierarchies", is another clustering algorithm. It works by first clustering the data into subclusters, and then clustering the subclusters into larger clusters.

You can read more about it [here](https://scikit-learn.org/stable/modules/clustering.html#birch). The Wikipedia entry on [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) is also a good resource, especially for visualizing the hierarchy. For instance, given the data points

![image.png](attachment:image.png)

the hierarchy is encoded in the so-called dendrogram

![image-3.png](attachment:image-3.png)

With a little bit of staring, it is easy to see how the algorithm starts with subclusters but then clumps them into larger clusters.
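The two stages of BIRCH described earlier (subclusters first, then larger clusters) can be made visible with sklearn's `Birch`. This is a sketch on synthetic blobs (an assumption); the `threshold` and `n_clusters` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

# Synthetic stand-in data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([-3.0, 0.0], 0.5, size=(200, 2)),
    rng.normal([3.0, 0.0], 0.5, size=(200, 2)),
])

# threshold limits the radius of each subcluster; n_clusters is the number
# of final clusters obtained by grouping the subclusters in a second step.
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2).fit(X)

n_subclusters = len(birch.subcluster_centers_)  # result of the first stage
labels = birch.labels_                          # final label of each point

print("subclusters after the first stage:", n_subclusters)
print("final clusters:", len(np.unique(labels)))
```

Lowering `threshold` produces more and smaller subclusters in the first stage, while `n_clusters` controls how far they are merged afterwards.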

