# Session 04 - Clustering

Clustering (or cluster analysis) is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

Clustering is an unsupervised learning technique.

There are dozens of clustering algorithms, but they can be broadly classified into categories based on how they create clusters:

1. Connectivity-based clustering (hierarchical clustering)
1. Centroid-based clustering
1. Distribution-based clustering
1. Density-based clustering

In this notebook, we implement the following algorithms:

1. [K-Means Clustering](#k-means)
1. DBSCAN (**D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise)
1. Agglomerative Hierarchical Clustering
1. Mean-Shift Clustering

In [1]:
import numpy as np
import pandas as pd

# tqdm_notebook is deprecated at the top level; the notebook-aware bar lives in tqdm.notebook
from tqdm.notebook import tqdm

In [2]:
ENABLE_ASSERTIONS = True

## <a name="k-means"></a>K-Means Clustering



K-Means is arguably one of the most popular (and simple) clustering algorithms. It creates `k` clusters in the following manner:

1. Randomly initialize k data points as _centroids_
1. Classify each data point as belonging to the nearest centroid
1. Recalculate _centroid_ for each cluster as the mean of all vectors belonging to that cluster
1. Repeat steps 2 and 3 for `n` iterations, or until the centroids stop changing
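
The four steps above can be sketched in plain NumPy as follows. This is only an illustrative sketch, not the notebook's implementation: the name `k_means_sketch`, the `seed` parameter, the toy `points` array, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for the example.

```python
import numpy as np

def k_means_sketch(points, k, n_iter=10, seed=0):
    """Illustrative NumPy-only k-means; `points` is an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (squared distances).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop early once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated pairs of points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = k_means_sketch(points, k=2)
```

On this toy input the two nearby pairs end up in separate clusters, whichever two points are drawn as the initial centroids.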

In [3]:
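def k_means(data, k, similarity_func, n_iter=10, show_progress=False):
    # NOTE: similarity_func is accepted but not used yet; classify() always
    # uses the squared Euclidean distance.
    centroids = initialize_centroids(data, k)
    cluster_ids = classify(data, centroids)

    # Only show a progress bar when the caller asks for one
    iterations = tqdm(range(n_iter)) if show_progress else range(n_iter)
    for _ in iterations:
        centroids = calculate_centroids(data, cluster_ids)
        cluster_ids = classify(data, centroids)

    return cluster_ids, centroids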

To initialize the centroids, we sample `k` values from each column independently, so an initial centroid need not coincide with an actual data point.

In [4]:
def initialize_centroids(data, k, random_state=np.random):
    return pd.DataFrame.from_dict({
        col: random_state.choice(data[col].unique(), k).tolist() 
        for col in data.columns
    })
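
Because each column is sampled on its own, an initial centroid can combine values that never occur together in the data. If centroids that are real data points are preferred, one possible alternative is to sample whole rows. The helper name `sample_rows_as_centroids`, its `seed` parameter, and the toy `data` frame below are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy data on a diagonal: only (0, 0), (1, 10) and (2, 20) are real points,
# so a column-wise sample such as (0, 20) would not be one of them.
data = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [0.0, 10.0, 20.0]})

def sample_rows_as_centroids(data, k, seed=0):
    """Hypothetical alternative initializer: pick k whole rows without replacement."""
    return data.sample(n=k, random_state=seed).reset_index(drop=True)

centroids = sample_rows_as_centroids(data, k=2)
```

Every row of `centroids` is then guaranteed to be one of the rows of `data`.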

We calculate the squared Euclidean distance from every data point to each centroid, and then assign each point the id of its nearest centroid. Skipping the square root does not change which centroid is nearest.

In [5]:
def classify(data, centroids):
    centroid_distances = np.array([
        np.sum(np.power(data - centroid, 2), axis=1)
        for _, centroid in centroids.iterrows()
    ])
    
    k, n = len(centroids), data.shape[0]

    # centroid_distances[i, j] is the squared distance to the ith centroid for the jth data point
    if ENABLE_ASSERTIONS: assert centroid_distances.shape == (k, n)

    cluster_ids = centroid_distances.argmin(axis=0)
    
    # Each data point is assigned to a cluster
    if ENABLE_ASSERTIONS: assert cluster_ids.shape == (n, )

    return cluster_ids
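
As a small worked example of the assignment step, the sketch below builds the same `(k, n)` distance matrix with NumPy broadcasting and checks that squared and true Euclidean distances pick the same nearest centroid. The `points` and `centroids` arrays are made-up toy values.

```python
import numpy as np

points = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])  # n = 3 data points
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])           # k = 2 centroids

# diff[i, j] is the vector from centroid i to point j; shape (k, n, d).
diff = points[None, :, :] - centroids[:, None, :]
squared = (diff ** 2).sum(axis=2)   # squared distances, shape (k, n)
euclidean = np.sqrt(squared)        # true distances, same shape

# The square root is monotonic, so both matrices share the same argmin.
assert (squared.argmin(axis=0) == euclidean.argmin(axis=0)).all()

# One cluster id per data point.
cluster_ids = squared.argmin(axis=0)
```

Here the second point is at squared distance 25 from the first centroid and 85 from the second, so it joins the first cluster.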

We determine the new centroids by calculating the mean of each cluster.

In [6]:
def calculate_centroids(data, cluster_ids):
    return data.groupby(cluster_ids).mean()
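
One thing to watch with a `groupby`-based update: a cluster that receives no points produces no group, so the result can have fewer than `k` rows. The sketch below shows this on toy data and one possible workaround, which is to fall back to the previous centroid for the empty cluster. The `previous` frame and the fallback policy are assumptions for illustration, not something the notebook does.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [0.0, 2.0, 10.0], "y": [0.0, 2.0, 10.0]})
cluster_ids = np.array([0, 0, 2])  # cluster 1 received no points

# Only clusters 0 and 2 appear in the result.
means = data.groupby(cluster_ids).mean()

# Hypothetical centroids from the previous iteration, indexed 0..k-1.
previous = pd.DataFrame({"x": [0.0, 5.0, 9.0], "y": [0.0, 5.0, 9.0]})

# Keep the new means where they exist, otherwise reuse the previous centroid.
centroids = means.combine_first(previous)
```

After the fallback, `centroids` again has one row per cluster, with the second row carried over from `previous`.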

TODO: Showcase sample image compression using K-Means

# Summary

sklearn has a fantastic summary image for different clustering algorithms:

![Clustering Algorithms Comparison](https://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_001.png)