### Clustering

https://www.analyticsvidhya.com/blog/2015/12/10-machine-learning-algorithms-explained-army-soldier/

* Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).

* It is a main task of exploratory data analysis (EDA) and a common technique for statistical data analysis, used in many fields:
    - pattern recognition
    - image analysis
    - information retrieval
    - bioinformatics
    - data compression
    - computer graphics
    - machine learning

$$D = \{x_i\} \ - \ \text{no} \ y_i$$

<img src="https://editor.analyticsvidhya.com/uploads/56854k%20means%20clustering.png">

**Credits** - Image from Internet

* The main task in clustering is to group the data values on the basis of **similarity**.
    - Similarity can be defined in many ways, not just one; the right definition is problem specific.

* Clustering is unsupervised learning because there are no $y_i$.

* Supervised learning is when the dataset has both $x_i$ and $y_i$.

* Semi-supervised learning is when a small part of the dataset has both $x_i$ and $y_i$ and a large part has only $x_i$. It happens when the cost of acquiring $y_i$ is large.

### Applications (Clustering)

* E-Commerce
    - Customer grouping based on their purchasing behaviour
        - Money
        - Credit card
        - Debit card
        - product category
        - zip code

* Image Segmentation (grouping or clustering similar pixels)
    - Computer Vision
    - Image Processing
    - Satellite Imagery Analysis

* Amazon Food Reviews
    - Sentiment Analysis
    - NLP

* ...

### Metrics

* Intra-cluster → within a cluster
    - the distance between any two points of the same cluster should be small
* Inter-cluster → across or between clusters
    - the distance between any two points of different clusters should be large

![cluster-metrics](https://user-images.githubusercontent.com/63333753/133963519-31bef7ae-933f-4f80-b141-bd2cb2d95525.png)

> These two quantities are the basis for measuring how effective a clustering is.

**Dunn-Index**

$$D = \frac{\text{min}_{(i, j), \ i \neq j} \ d(i, j)}{\text{max}_{(k)} \ d^1(k)}$$

* $i, j, k$ index the clusters $C_1, C_2, C_3, \dots$
* $d(i, j)$ → distance between $C_i$ and $C_j$ (inter-cluster distance)
* numerator → minimal inter-cluster distance
* $d^1(k)$ → maximum distance between two points in cluster $C_k$ (intra-cluster distance)
* denominator → maximal intra-cluster distance

> If $D$ is high, it implies a good clustering (well-separated, compact clusters); if $D$ is low, the clustering is poor.
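
The following is a minimal sketch of the Dunn index, assuming the standard definition (smallest inter-cluster distance divided by largest intra-cluster distance), Euclidean distance, and the closest pair of points as the distance between two clusters. The function name `dunn_index` and the toy data are illustrative, not from the original notes.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of clusters, each a list of points (tuples of floats).

    Assumes at least two clusters and at least one cluster with two points.
    """
    # Inter-cluster distance: here, the closest pair of points
    # taken from two different clusters (one common choice).
    min_inter = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Intra-cluster distance: the largest distance between
    # two points of the same cluster (the cluster "diameter").
    max_intra = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_inter / max_intra

tight = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
loose = [[(0.0, 0.0), (0.0, 4.0)], [(5.0, 0.0), (5.0, 4.0)]]
print(dunn_index(tight))  # 10.0
print(dunn_index(loose))  # 1.25
```

The compact, well-separated clustering scores higher than the spread-out one, which matches the "higher is better" reading.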

### `k`-Means Clustering (Geometric Intuition)

* Popular and simple centroid clustering algorithm.

* `k` represents the total number of clusters.
    - It is a hyperparameter (tuned, e.g., with cross validation)

![k-means-gi](https://user-images.githubusercontent.com/63333753/133965526-257aac1d-bf88-45c1-83ef-92ea724f7761.png)

* $C_1, C_2, C_3$ → are the centroids of the respective clusters (mean of all the points in a cluster)

    - $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$

* $S_1, S_2, S_3$ → are the clusters

    - $S_1 \cap S_2 = \phi$
    - $S_2 \cap S_3 = \phi$
    - $S_3 \cap S_1 = \phi$
    - $S_1 \cup S_2 \cup S_3 = D$

* Every data point is assigned to the cluster whose centroid is closest to it.

> The core idea of `k`-Means is to find `k` central points and assign each point to the cluster of its nearest central point.

### Mathematical Formulation of `k`-Means

https://www.saedsayad.com/clustering_kmeans.htm

* Once the centroids are found, it is easy to form the sets: each point goes to the set of its nearest centroid.
    * k-centroids : $C_1, C_2, \dots, C_k$
    * k-sets : $S_1, S_2, \dots, S_k$, where every $x_i \in D$ belongs to exactly one $S_j$ and $S_i \cap S_j = \phi \ \forall \ i \neq j$
    
    $$\text{argmin}_{\{C_1, \dots, C_k\}} \sum_{i=1}^k \sum_{x \in S_i} ||x - C_i||^2$$

> Solving this optimization exactly is hard (NP-hard in general), so we prefer Lloyd's algorithm, an iterative approximation.
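
To make the objective concrete, here is a small sketch that evaluates the sum of squared distances for a fixed partition. The helper names `kmeans_objective` and `mean` and the toy sets are assumptions made for illustration.

```python
import math

def kmeans_objective(clusters, centroids):
    """Sum of squared distances from each point to its own cluster's centroid."""
    return sum(
        math.dist(x, c) ** 2
        for points, c in zip(clusters, centroids)
        for x in points
    )

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

S = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
at_means = [mean(s) for s in S]      # [(1.0, 0.0), (11.0, 0.0)]
shifted = [(0.0, 0.0), (11.0, 0.0)]  # first centroid moved off the mean

print(kmeans_objective(S, at_means))  # 4.0
print(kmeans_objective(S, shifted))   # 6.0
```

For a fixed partition, placing each centroid at the mean of its set gives the smaller value, which is why the centroid update uses the mean.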

### `k`-Means Algorithm (Lloyd's)

https://youtu.be/5I3Ei69I40s

1. Initialization
    - randomly pick `k` points from $D$ and consider them as $C_1, \dots, C_k$

2. Assignment
    - for each point $x_i$ in $D$, compute its distance to every centroid, select the nearest centroid $C_j$, and add $x_i$ to set $S_j$

3. Re-calculate centroid
    - update each centroid to the mean of its set → $C_j = \frac{1}{|S_j|} \sum_{\{x_i \in S_j\}} x_i$

4. Repeat $2$ and $3$ until convergence
    - convergence → centroids don't change much
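
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the function name `lloyd_kmeans`, the parameters `max_iter`, `tol` and `seed`, the choice to leave an empty set's centroid unchanged, and the toy data are all assumptions.

```python
import math
import random

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def lloyd_kmeans(data, k, max_iter=300, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: pick k distinct points of the data as centroids.
    centroids = rng.sample(data, k)
    for _ in range(max_iter):
        # 2. Assignment: add each point to the set of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in data:
            j = min(range(k), key=lambda j: math.dist(x, centroids[j]))
            clusters[j].append(x)
        # 3. Update: move each centroid to the mean of its set.
        #    An empty set keeps its old centroid (a choice made in this sketch).
        new_centroids = [mean(s) if s else c for s, c in zip(clusters, centroids)]
        # 4. Convergence: stop once no centroid moved by more than tol.
        shift = max(math.dist(c, n) for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, clusters

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = lloyd_kmeans(data, k=2)
print(sorted(centroids))  # one centroid per blob, near (0.33, 0.33) and (10.33, 10.33)
```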

### Initialization Sensitivity

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

* `k`-Means is sensitive to initialization, which is a drawback of picking the initial `k` points at random.
    - the final result depends on which `k` points are picked.

**How to tackle this?**

* Repeat `k`-Means multiple times with different initializations.
    - pick the best clustering result based on smaller intra-cluster and larger inter-cluster distances.
    - takes more computation effort

* `k`-Means++
    - it replaces the random initialization with a smarter initialization scheme
    - **procedure**
        - pick the first centroid $C_1$ uniformly at random from $D$
        - for each $x_i \in D$, compute the squared distance to its nearest already-chosen centroid: $d_i = \text{dist}^2(x_i, \text{nearest-centroid})$
        - pick the next centroid from the remaining points with probability proportional to $d_i$ (probabilistic approach), so far-away points are more likely to be chosen
        - repeat the previous two steps until you have `k` centroids

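> `k`-Means++ is affected by outliers: an outlier is far from every centroid, so it has a large $d_i$ and a high probability of being picked as a centroid.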
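
A minimal sketch of the `k`-Means++ seeding procedure is given below, assuming Euclidean distance and points stored as tuples. The function name `kmeans_pp_init` and the `seed` parameter are illustrative assumptions.

```python
import math
import random

def kmeans_pp_init(data, k, seed=0):
    """Return k initial centroids chosen from data (assumes k distinct points exist)."""
    rng = random.Random(seed)
    # First centroid: chosen uniformly at random from the data.
    centroids = [rng.choice(data)]
    while len(centroids) < k:
        # d_i: squared distance from each point to its nearest chosen centroid.
        d = [min(math.dist(x, c) ** 2 for c in centroids) for x in data]
        # Sample the next centroid with probability proportional to d_i.
        # Already-chosen points have d_i = 0, so their selection probability is zero.
        centroids.append(rng.choices(data, weights=d, k=1)[0])
    return centroids

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
init = kmeans_pp_init(data, k=3)
print(init)
```

Because the weights are squared distances, a single far-away outlier receives a large weight, which illustrates the sensitivity noted above.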

### Limitations

* `k`-Means has problems when clusters have
    - different sizes
    - different densities
    - non-globular shapes

* `k`-Means has problems when the data contains outliers.

**Solutions**

* One workaround is to use a larger number of clusters and then merge them into the final groups, although the merging step is not easy.

### `k`-Medoids

* `k`-Means centroids are hard to interpret because a centroid is a mean and is generally not an actual point of the dataset.

* If we want the centroids to be interpretable, they should be points of the dataset.

* For such cases, we use `k`-Medoids, a popular algorithm whose cluster centers (medoids) are actual data points.
    - Partitioning around medoids (PAM)
    - **Initialization** : `k`-Means++ → probabilistic approach
    - **Assignment** : closest medoid → $x_i \in S_j$ if $m_j$ is the closest medoid to $x_i$
    - **Update / Recomputation** :
        - try swapping each medoid with each non-medoid point
        - if the loss decreases, keep the swap; else undo the swap
        
        $$\text{loss} = \sum_{j=1}^k \sum_{x \in S_j} ||x - m_j||^2$$
        
        - if a swap is successful (loss ↓), redo the **Assignment** step
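
The swap-based update can be sketched as below. This is a simplified illustration rather than a full PAM implementation: the initial medoids are passed in directly instead of coming from a `k`-Means++ style seeding, the assignment is recomputed implicitly inside `loss`, and the names `assign`, `loss` and `k_medoids` are assumptions.

```python
import math

def assign(data, medoids):
    """Assignment step: group points by their closest medoid."""
    clusters = {m: [] for m in medoids}
    for x in data:
        clusters[min(medoids, key=lambda m: math.dist(x, m))].append(x)
    return clusters

def loss(data, medoids):
    """Sum of squared distances from each point to its closest medoid."""
    return sum(min(math.dist(x, m) for m in medoids) ** 2 for x in data)

def k_medoids(data, medoids):
    """Greedy swaps; assumes distinct data points and distinct initial medoids."""
    medoids = list(medoids)
    best = loss(data, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(len(medoids)):
            for x in data:
                if x in medoids:
                    continue  # only swap a medoid with a non-medoid point
                candidate = medoids[:i] + [x] + medoids[i + 1:]
                cand_loss = loss(data, candidate)
                if cand_loss < best:  # loss decreased: keep the swap
                    medoids, best = candidate, cand_loss
                    improved = True
                # otherwise the candidate is discarded (the swap is "undone")
    return medoids, assign(data, medoids)

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, clusters = k_medoids(data, [(0.0, 0.0), (0.0, 1.0)])
print(sorted(medoids))  # [(0.0, 0.0), (10.0, 10.0)]
```

Both final medoids are actual data points, which is what makes them directly interpretable.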

### Time & Space Complexity

* `k`-Means
    - **Time** → $O(nkdi)$
        - $n$ → number of points
        - $k$ → number of clusters
        - $d$ → dimensionality of the data
        - $i$ → number of iterations
        - typically $k \leq 10$ and $i \leq 300$
    - **Space** → $O(nd + kd)$
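
As a rough worked example of these bounds, assume (purely for illustration) $n = 10^5$ points, $d = 50$ dimensions, $k = 10$ clusters and $i = 300$ iterations:

$$nkdi = 10^5 \times 10 \times 50 \times 300 = 1.5 \times 10^{10} \ \text{elementary operations}$$

$$nd + kd = 10^5 \times 50 + 10 \times 50 = 5{,}000{,}500 \ \text{stored values}$$

The data term $nd$ dominates the space cost, since the $kd$ centroid values are negligible when $k \ll n$.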