# Agglomerative Clustering

- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- Take a look at https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering

* 💡 Agglomerative clustering is a **bottom-up approach** to clustering.
    * We first place **each instance in its own cluster** and then **merge the closest clusters into pairs** based on a similarity metric.
    * We then recompute the **similarity** and **merge the pairs into bigger groups** (clusters).
    * These groups are merged into ever bigger ones **until a single group containing all the instances remains at the top**.
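
The merging loop described above can be sketched in a few lines. This is a deliberately naive illustration, not how scikit-learn implements it: the function name `naive_agglomerative`, the toy points, and the choice of single linkage with Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def naive_agglomerative(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    # Start with every instance in its own cluster (clusters hold row indices).
    clusters = [[i] for i in range(len(points))]
    merges = []  # history of (cluster_a, cluster_b, distance)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage (assumed here): distance of the closest pair.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges

# Hypothetical toy data: two tight pairs and one isolated point.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])
final_clusters, merge_history = naive_agglomerative(toy_points, n_clusters=3)
print(final_clusters)
```

Stopping at `n_clusters=1` instead would reproduce the full hierarchy, with `merge_history` holding one entry per merge.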

### 🚀 **We don't have to know the number of clusters beforehand.**  
* 💡 We can run the whole clustering process and **select the appropriate number of clusters afterward** based on the results.
* We usually use a **dendrogram** to estimate the distance threshold.
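
As a rough sketch of this workflow, the example below builds the merge tree with SciPy, reads the merge distances a dendrogram would show, and then cuts the tree at a distance threshold with both SciPy and scikit-learn. The toy data, the complete linkage, and the threshold value `3.0` are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three small, well separated groups.
toy_X = np.array([
    [0, 0], [0, 1], [1, 0],
    [5, 5], [5, 6], [6, 5],
    [10, 0], [10, 1], [11, 0],
], dtype=float)

# Z has one row per merge; column 2 holds the distance of that merge.
Z = linkage(toy_X, method="complete")
merge_distances = Z[:, 2]

# no_plot=True only returns the tree layout; drop it to draw the dendrogram.
tree = dendrogram(Z, no_plot=True)

# A large jump in merge_distances suggests a threshold; 3.0 is assumed here.
threshold = 3.0
scipy_labels = fcluster(Z, t=threshold, criterion="distance")

# The same cut with scikit-learn: n_clusters must be None when a threshold is given.
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=threshold, linkage="complete"
).fit(toy_X)
print(model.n_clusters_, merge_distances)
```

In this toy case the merges inside each group happen at distances below 1.5, while the merges between groups happen above 7, so any threshold in that gap yields the same three clusters.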

## 🔎 You may ask how the similarity between groups is computed.
* 💡 **Similarity is pretty hard to define.**
* There are various ways to compute it; the rule we choose is called the **linkage**.

## 💡 Linkage variants
- **Maximum or Complete linkage**:
    - The distance between two clusters is defined as the maximum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2.
    - It tends to produce more compact clusters.
    - It avoids the chaining effect of Single linkage, but because it uses the farthest pair of points it is itself sensitive to outliers.
    - Complete linkage methods tend to break large clusters.


- **Minimum or Single linkage**:
    - The distance between two clusters is defined as the minimum value of all pairwise distances between the elements in cluster 1 and the elements in cluster 2.
    - It tends to produce long, “loose” clusters.
    - Single linkage is prone to "chaining" and forms clusters of irregular, often thread-like curved shapes.
        - The reason is that, at any step, **two clusters are merged if their closest edges are close enough**.
        - The proximity between other parts of the two clusters is not taken into consideration.


- **Mean or Average linkage**:
    - The distance between two clusters is defined as the average of all pairwise distances between the elements in cluster 1 and the elements in cluster 2.

![img1](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/ml_03_linkages.png?raw=true)
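
To make the three definitions concrete, here is a small sketch that computes them for two tiny clusters. The cluster coordinates are made up for the example, and Euclidean distance is assumed as the underlying metric.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical clusters on a line, so the distances are easy to check by hand.
cluster_1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_2 = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise distances: rows index cluster_1, columns index cluster_2.
pairwise_d = cdist(cluster_1, cluster_2)  # [[4, 6], [3, 5]]

complete_link = pairwise_d.max()   # farthest pair
single_link = pairwise_d.min()     # closest pair
average_link = pairwise_d.mean()   # mean over all pairs

print(complete_link, single_link, average_link)  # 6.0 3.0 4.5
```

The same pair of clusters is therefore "6 apart", "3 apart", or "4.5 apart" depending on the linkage, which is why the choice changes which clusters get merged first.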


## Metrics
![img2](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/ml_03_euclid.png?raw=true)

![img3](https://github.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/blob/main/images/ml_03_manhattan.png?raw=true)
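
Judging by the image file names, the two figures above presumably illustrate the Euclidean and Manhattan distances. Under that assumption, the sketch below computes both by hand and checks them against `scipy.spatial.distance`; the points `p` and `q` are made up for the example.

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical points; their difference is (3, 4).
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean: square root of the sum of squared coordinate differences.
euclid_manual = np.sqrt(np.sum((p - q) ** 2))
# Manhattan: sum of absolute coordinate differences.
manhattan_manual = np.sum(np.abs(p - q))

# SciPy equivalents (Manhattan is called "cityblock" there).
euclid_scipy = distance.euclidean(p, q)
manhattan_scipy = distance.cityblock(p, q)

print(euclid_manual, manhattan_manual)  # 5.0 7.0
```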



## Imports

In [None]:
import numpy as np
import pandas as pd
from scipy.spatial import distance
import csv
from matplotlib import pyplot as plt
import math
from sklearn.metrics import pairwise

## Load files

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/lowoncuties/VSB-FEI-Machine-Learning-Exercises/master/datasets/ml_03/clusters3.csv', sep=';', names=["x","y"])
df