# Cluster Analysis

## Content

1. **[An introduction to cluster analysis](#Cluster-Analysis)**  
    1.1 Types of clusters  
    1.2 Distance measures  
    1.3 

Unlike supervised learning, unsupervised learning looks for previously undetected patterns in a dataset whose entries carry no pre-existing labels. Its goal is to infer properties of the probability density governing the population from which the available observations are drawn, without the help of a supervisor (or teacher) providing a degree of error for each observation.

**Cluster analysis**, or _data segmentation_, is an unsupervised learning technique that aims at grouping a collection of objects into **clusters**, such that objects within each cluster are more closely related (or similar, according to a suitable, application-dependent notion of similarity) to one another than to objects in different clusters. In other words, it aims at minimizing intra-cluster distances while maximizing inter-cluster ones.
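As a rough illustration of the intra-cluster versus inter-cluster idea, the sketch below compares the two average distances on a toy dataset. The points, the labels `"a"` and `"b"`, the helper names and the choice of Euclidean distance are all assumptions made for this example, not part of any specific algorithm.

```python
from itertools import combinations
from math import dist

# Toy 2-D data with hypothetical cluster labels (illustrative only).
clusters = {
    "a": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "b": [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)],
}

def mean_intra_distance(clusters):
    """Average Euclidean distance over pairs of points in the same cluster."""
    pairs = [
        dist(p, q)
        for pts in clusters.values()
        for p, q in combinations(pts, 2)
    ]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average Euclidean distance over pairs of points in different clusters."""
    pairs = [
        dist(p, q)
        for pts_a, pts_b in combinations(clusters.values(), 2)
        for p in pts_a
        for q in pts_b
    ]
    return sum(pairs) / len(pairs)

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"mean intra-cluster distance: {intra:.3f}")
print(f"mean inter-cluster distance: {inter:.3f}")
```

For a grouping that matches the structure of the data, the first number should be clearly smaller than the second.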

A **clustering** is a set of clusters. We can distinguish between:
- **Partitional clustering**: each object belongs to exactly one cluster. A famous algorithm in this family is ***k-means***.
- **Hierarchical clustering**: consists of a set of *nested clusters* organized in a tree.

Partitional clustering           |  Hierarchical clustering
:-------------------------:|:-------------------------:
<img src="images/cluster_analysis/partitional.jpg" alt="Partitional clustering"/>  |  <img src="images/cluster_analysis/hierarchical.jpg" alt="Hierarchical clustering"/>
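One possible way to picture the difference in code is shown below: a partitional clustering as a list of disjoint sets, and a hierarchical clustering as a tree of nested tuples. The point names, the tree shape and the helpers `is_partition` and `nested_clusters` are hypothetical and serve only as an illustration.

```python
points = {"p1", "p2", "p3", "p4", "p5"}

# Partitional clustering: disjoint clusters that together cover every object.
partition = [{"p1", "p2"}, {"p3"}, {"p4", "p5"}]

def is_partition(clusters, universe):
    """True if every object in `universe` belongs to exactly one cluster."""
    seen = [obj for cluster in clusters for obj in cluster]
    return len(seen) == len(set(seen)) and set(seen) == set(universe)

# Hierarchical clustering: a tree, written as nested tuples with points as leaves.
hierarchy = ((("p1", "p2"), "p3"), ("p4", "p5"))

def nested_clusters(tree):
    """Return (members of this subtree, clusters found at its internal nodes)."""
    if isinstance(tree, str):  # a leaf is a single point, not a cluster
        return frozenset([tree]), []
    members = frozenset()
    found = []
    for child in tree:
        child_members, child_found = nested_clusters(child)
        members |= child_members
        found.extend(child_found)
    found.append(members)  # the cluster formed at this internal node
    return members, found

_, all_clusters = nested_clusters(hierarchy)
print(is_partition(partition, points))
for cluster in all_clusters:
    print(sorted(cluster))
```

In the tree, any two clusters are either disjoint or one contains the other, which is what "nested" means here.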

Moreover, clustering can be further distinguished into:
- **Exclusive vs Non-exclusive**: in non-exclusive clustering, points can belong to multiple clusters simultaneously.
- **Fuzzy vs Non-fuzzy**: in fuzzy clustering, each point belongs to every cluster with a weight between 0 and 1. The weights of each point must sum to 1.
- **Partial vs Complete**: in partial clustering, only a subset of the data is clustered.
- **Heterogeneous vs Homogeneous**: in heterogeneous clustering, we allow clusters of different sizes, shapes and densities.
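To make the fuzzy case concrete, here is a minimal sketch that assigns membership weights to a point. It assumes a simple inverse-distance weighting, which is only one of many possible schemes, and the cluster centers and function names are hypothetical. An exclusive assignment to the single nearest center is shown for contrast.

```python
from math import dist

# Hypothetical cluster centers (illustrative only).
centers = {"a": (0.0, 0.0), "b": (4.0, 0.0)}

def fuzzy_weights(point, centers, eps=1e-12):
    """Membership weights in [0, 1] that sum to 1, from inverse distances.

    `eps` avoids division by zero when the point coincides with a center.
    """
    inverse = {name: 1.0 / (dist(point, c) + eps) for name, c in centers.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def exclusive_label(point, centers):
    """Exclusive assignment: the single nearest center."""
    return min(centers, key=lambda name: dist(point, centers[name]))

point = (1.0, 0.0)
weights = fuzzy_weights(point, centers)
label = exclusive_label(point, centers)
print(weights)  # the closer center "a" receives the larger weight
print(label)
```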

## Types of clusters

- **Well-separated**: any point in a cluster is closer (in terms of the similarity measure) to *every other* point in the same cluster than to any point in another cluster.
- **Center-based**: each point in a cluster is closer to the center of its own cluster (often a *centroid* or a *medoid*) than to the center of any other cluster.
- **Contiguity-based**: each point in a cluster is closer to *one or more* other points in the same cluster than to any point outside it.
- **Density-based**: a cluster is a dense region of points, separated from other high-density regions by low-density regions.
- **Conceptual clusters**: clusters whose points share some common property or represent a particular concept.
- **Clusters defined by an objective function**: clusters are found such that a certain objective function is optimized. In principle, all possible ways of grouping the points into clusters are enumerated and the "goodness" of each is evaluated (this is an NP-hard problem). Typically, hierarchical clustering algorithms have local objectives, while partitional clustering algorithms have global ones. To make the problem computationally tractable, we can try to fit the data to a parameterized model.
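The objective-function view can be sketched by brute force on a tiny dataset. The example below assumes the objective is the sum of squared Euclidean distances to each cluster's centroid (SSE), which is one common choice and not the only one. The data and function names are hypothetical. It enumerates all `k**n` label assignments, which is feasible only because `n` is very small.

```python
from itertools import product

# Toy 2-D data (illustrative only).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]

def sse(points, labels, k):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in range(k):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            return float("inf")  # reject assignments with an empty cluster
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        total += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return total

def best_clustering(points, k):
    """Exhaustive search over all k**n label assignments (tiny inputs only)."""
    candidates = product(range(k), repeat=len(points))
    return min(candidates, key=lambda labels: sse(points, labels, k))

best = best_clustering(points, 2)
print(best, round(sse(points, best, 2), 3))
```

With 6 points and 2 clusters there are only 64 assignments to check, but the count grows exponentially with the number of points, which is why practical algorithms settle for heuristics or model-based approximations.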

## Distance measures

