# Cluster Analysis

## Content

1. **[An introduction to cluster analysis](#Cluster-Analysis)**  
    1.1 Types of clusters  
    1.2 Distance measures  
    1.3 
2. **[K-means](#K-means)**  
3. **[Hierarchical clustering](#Hierarchical-clustering)**  

Unlike supervised learning, unsupervised learning consists in looking for previously undetected patterns in a dataset that has no pre-existing labels attached to its entries. Its goal is to infer properties of the probability density governing the population from which the available observations come, without the help of a supervisor/teacher providing a degree of error for each observation.

**Cluster analysis**, or _data segmentation_, is an unsupervised learning technique that aims at grouping a collection of objects into **clusters**, such that objects within each cluster are more closely related (or similar, according to a suitable, application-dependent notion of similarity) to one another than to objects in different clusters. In other words, it aims at minimizing intra-cluster distances while maximizing inter-cluster ones.
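To make the intra-cluster versus inter-cluster idea concrete, here is a minimal sketch on toy data. The points, the cluster names `A` and `B`, the choice of Euclidean distance, and the helper functions are all assumptions made for illustration, not part of any specific algorithm.

```python
from itertools import combinations
from math import dist

# Toy 2-D points grouped into two hypothetical clusters (illustrative data).
clusters = {
    "A": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "B": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}

def mean_intra_distance(points):
    """Average Euclidean distance over all pairs inside one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(points_g, points_h):
    """Average Euclidean distance over all pairs taken across two clusters."""
    total = sum(dist(p, q) for p in points_g for q in points_h)
    return total / (len(points_g) * len(points_h))

intra_a = mean_intra_distance(clusters["A"])
intra_b = mean_intra_distance(clusters["B"])
inter_ab = mean_inter_distance(clusters["A"], clusters["B"])
print(f"intra A = {intra_a:.3f}, intra B = {intra_b:.3f}, inter A-B = {inter_ab:.3f}")
```

For a grouping that matches the structure of the data, the intra-cluster averages should come out much smaller than the inter-cluster one, as they do here.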

A **clustering** is a set of clusters. We can distinguish between:
- **Partitional clustering**: each object belongs to exactly one cluster. A famous algorithm that belongs to this family is ***k-means***.
- **Hierarchical clustering**: consists of a set of **nested clusters** organized in a tree.

Partitional clustering           |  Hierarchical clustering
:-------------------------:|:-------------------------:
<img src="images/cluster_analysis/partitional.jpg" alt="Partitional clustering"/>  |  <img src="images/cluster_analysis/hierarchical.jpg" alt="Hierarchical clustering"/>

Moreover, clustering can be further distinguished into:
- **Exclusive vs Non-exclusive**: in non-exclusive clustering, points can belong to multiple clusters simultaneously; in exclusive clustering, each point belongs to a single cluster.
- **Fuzzy vs Non-fuzzy**: in fuzzy clustering, each point belongs to every cluster with a weight between 0 and 1. For each point, the weights must sum to 1.
- **Partial vs Complete**: in partial clustering, only a subset of the data is clustered.
- **Heterogeneous vs Homogeneous**: in heterogeneous clustering, we allow clusters of different sizes, shapes and densities.

## Types of clusters

- **Well-separated**: any point in a cluster is closer (in terms of the similarity measure) to *every other* point in the same cluster than to any point in other clusters.
- **Center-based**: each element of a cluster is closer to the center of its own cluster (often a *centroid* or a *medoid*) than to the center of any other cluster.
- **Contiguity-based**: each cluster is a set of points such that every point in that cluster is closer to *one or more* other points in the same cluster than to any point outside the cluster.
- **Density-based**: a cluster is a dense region of points, separated by low-density regions from other high-density ones.
- **Conceptual clusters**: clusters whose elements share some common property or represent a particular concept.
- **Clusters defined by an objective function**: clusters are chosen so as to optimize a given objective function. In principle, this means enumerating all possible ways of grouping the points into clusters and evaluating the "goodness" of each one, which is an NP-hard problem. Typically, hierarchical clustering algorithms have local objectives, while partitional clustering algorithms have global ones. To make the problem computationally tractable, we can try to fit the data to a parameterized model.
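The brute-force view of clustering by objective function can be sketched on a tiny example. The data, the value of `k`, and the choice of the within-cluster sum of squared errors as the objective are assumptions made for illustration; other objectives would fit the same pattern.

```python
from itertools import product

# Toy 1-D observations (illustrative data) and a hypothetical choice of k.
points = [1.0, 2.0, 9.0, 10.0, 11.0]
k = 2

def sse(assignment):
    """Sum of squared distances of each point to the mean of its cluster."""
    total = 0.0
    for label in set(assignment):
        members = [x for x, a in zip(points, assignment) if a == label]
        center = sum(members) / len(members)
        total += sum((x - center) ** 2 for x in members)
    return total

# Enumerate every labelling of the points that uses all k clusters.
# The number of labellings is k**n, so this only works for tiny inputs.
candidates = [a for a in product(range(k), repeat=len(points)) if len(set(a)) == k]
best = min(candidates, key=sse)
print(best, sse(best))
```

Even with five points and two clusters there are already 30 labellings to score, which hints at why exhaustive enumeration is not a practical strategy.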

## Distance measures



# K-means

# Hierarchical clustering

Unlike K-means, which requires choosing the number of clusters and a starting configuration a priori, hierarchical clustering algorithms do not. Instead, we need to specify a **dissimilarity measure** between **disjoint groups of observations**. That measure is based on the pairwise dissimilarities between the observations of the two groups. These algorithms produce a hierarchy of clusters, where the clusters at a given level are created by merging clusters at the next lower level. The root of this structure is a cluster containing all the data, while the leaves are clusters of a single observation.

There are two basic paradigms:
- **Agglomerative**: it consists in **merging the pair** of clusters at a certain level **having the smallest intergroup dissimilarity**, in order to produce a single, bigger, cluster at the upper level. In agglomerative methods, the dissimilarity between merged clusters is *monotone increasing*.
- **Divisive**: it consists in **splitting** an existing cluster in order to produce two new groups having the **largest between-group dissimilarity**.

Each level of the hierarchy represents a grouping of the data into disjoint sets. It is up to us to choose which level represents a satisfactory clustering.

### The dendrogram

We can graphically represent the sequence of groupings (the hierarchy) using a **dendrogram**: a tree in which the observations are laid out along the horizontal axis, while the height of each node is proportional to the value of the intergroup dissimilarity between its two daughters. In other words, the higher the link between two clusters is, the more different their features are. The lower in the tree groups of observations fuse together, the more similar their observations are. On the other hand, observations that fuse later, near the root of the tree, can be quite different.

If we cut the dendrogram horizontally at a particular height, we partition the data into disjoint clusters, one for each vertical line that intersects the cut. Therefore, a single dendrogram can be used to obtain any number of clusters, depending on the height at which we cut it.
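Cutting at a given height amounts to applying only the merges that happen at or below that height. The sketch below assumes a hypothetical merge history and an id convention (the cluster created by the m-th merge gets id `n + m`) chosen purely for illustration; it also assumes the merges are listed in order of non-decreasing height.

```python
# Hypothetical merge history for 5 observations (ids 0..4), listed bottom-up.
# Each entry is (cluster_a, cluster_b, height); the cluster created by the
# m-th merge receives the id n + m, a convention assumed for this sketch.
n = 5
merges = [
    (0, 1, 0.5),   # creates cluster 5
    (3, 4, 0.8),   # creates cluster 6
    (2, 5, 1.7),   # creates cluster 7
    (6, 7, 4.0),   # creates cluster 8 (the root)
]

def cut(merges, n, height):
    """Return the clusters obtained by cutting the tree at the given height."""
    members = {i: [i] for i in range(n)}       # live clusters -> observations
    for m, (a, b, h) in enumerate(merges):
        if h > height:                         # merges above the cut are ignored
            break
        members[n + m] = members.pop(a) + members.pop(b)
    return sorted(sorted(c) for c in members.values())

for h in (0.1, 1.0, 2.0, 5.0):
    print(h, cut(merges, n, h))
```

Raising the cut height can only merge clusters, never split them, so the number of clusters goes from `n` at the bottom down to 1 at the root.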

**Note:** the "right" number of clusters is typically not determined just by looking at the dendrogram.

<img src="images/cluster_analysis/dendrogram.jpg" alt="Dendrogram corresponding to a dataset of 9 elements with 2 features" width="500em"/>

Note in the image that, even though observations $2$ and $9$ are relatively close to each other along the horizontal axis, they are quite different: $9$ is no more similar to $2$ than it is to $5$, $8$ and $7$. Therefore, we cannot draw conclusions about the dissimilarity of two observations from their proximity along the horizontal axis; instead, we look at the height on the vertical axis at which the groups containing them are first fused.

## Agglomerative clustering

At the beginning, every observation constitutes a singleton cluster. At each of the $N-1$ steps, we **merge the two least dissimilar clusters**, producing one less cluster at the upper level. Three common dissimilarity measures $d(G,H)$ between two groups $G$ and $H$ are:

#### Single linkage (nearest neighbor)

Takes the least dissimilar pair distance as the intergroup dissimilarity: 

$$d_{SL}(G,H) = \min_{i\in G,\; i' \in H} d_{ii'}$$

This dissimilarity tends to produce clusters with a very large diameter $D_G = \max_{i\in G,\; i'\in G}d_{ii'}$, which risks violating the *compactness* property, according to which all observations within a cluster should be relatively similar to one another.

#### Complete linkage (furthest neighbor)

Takes the most dissimilar pair distance as the intergroup dissimilarity:

$$d_{CL}(G,H) = \max_{i\in G,\; i' \in H} d_{ii'}$$

This dissimilarity tends to produce clusters with a small diameter, since it considers two clusters close only if all of their observations are relatively similar. Consequently, it can violate the *closeness* property, by assigning observations to a cluster even if they are closer to members of other clusters.
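The two definitions above, together with the diameter $D_G$, can be written directly as minima and maxima over pairs. The groups `G` and `H` below are toy data, and Euclidean distance is assumed as the pairwise dissimilarity $d_{ii'}$.

```python
from math import dist

# Two toy groups of 2-D observations (illustrative data).
G = [(0.0, 0.0), (1.0, 0.0)]
H = [(3.0, 0.0), (7.0, 0.0)]

def single_linkage(g, h):
    """Smallest pairwise distance between a member of g and a member of h."""
    return min(dist(p, q) for p in g for q in h)

def complete_linkage(g, h):
    """Largest pairwise distance between a member of g and a member of h."""
    return max(dist(p, q) for p in g for q in h)

def diameter(g):
    """Largest pairwise distance between two members of the same group."""
    return max(dist(p, q) for p in g for q in g)

print(single_linkage(G, H))    # closest pair: (1, 0) and (3, 0)
print(complete_linkage(G, H))  # farthest pair: (0, 0) and (7, 0)
print(diameter(G + H))         # diameter of the merged group
```

In this example single linkage rates the two groups as close because of one nearby pair, even though merging them yields a group whose diameter is much larger than that value.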


#### Group average

Takes the average dissimilarity between the two groups $G$ and $H$, containing $N_G$ and $N_H$ observations respectively:

$$d_{GA}(G,H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$$

It tends to produce relatively compact clusters that are relatively far apart.
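Putting the pieces together, a naive version of the agglomerative procedure can be sketched as follows. This is an illustrative brute-force implementation under stated assumptions (toy data, Euclidean pairwise distance, group average as the linkage, hypothetical function names), not an efficient or reference algorithm.

```python
from math import dist

# Toy 2-D observations (illustrative data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0), (10.0, 0.0)]

def group_average(g, h):
    """Mean pairwise distance between members of g and members of h."""
    return sum(dist(points[i], points[j]) for i in g for j in h) / (len(g) * len(h))

def agglomerate(n, linkage):
    """Merge the two least dissimilar clusters until one remains.

    Returns the list of merges as (cluster_g, cluster_h, dissimilarity).
    """
    clusters = [[i] for i in range(n)]   # start from singleton clusters
    history = []
    while len(clusters) > 1:
        # Exhaustive search over all pairs of current clusters.
        d, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return history

history = agglomerate(len(points), group_average)
for g, h, d in history:
    print(g, h, round(d, 3))
```

The `linkage` argument accepts any function of two lists of observation indices, so single or complete linkage can be swapped in by replacing the average with a `min` or a `max` over the same pairs. The recorded dissimilarities are exactly the node heights a dendrogram would display.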


