# K-Means Clustering - 1

## Content

- Setting up the Context: Why do we need Clustering?
- Intuition: What is Clustering?
- Metrics for Clustering
- K-means Clustering and its Variations
- K-Means: Mathematical Formulation: Objective Function
- Lloyd’s Algorithm
- Some Problems with Lloyd's Algorithm


***

## Setting up the Context: Why do we need Clustering?

- So far we have seen classification and regression problems.

<img src='https://drive.google.com/uc?id=1V8D6nffRjwCGGWzT62ecQkKC-2Pdyxm-'>


- For a given dataset $D = \{(x_i, y_i)\}$
  - If $y_i \in \{0,1\}$, then we call it a **2-class classification**
  - If $y_i \in \mathbb{R}$, then we call it a **regression**.

- **In the case of both classification and regression**, we are trying to find a function that is used to predict ‘$y_i$’, when ‘$x_i$’ is given as the input. Both of these are **supervised learning problems**.
- **But in the case of clustering**, we are given a set of points $D = \{x_i\}$ and there are no $y_i$’s. The task in clustering is to group similar points. This makes clustering an **unsupervised learning problem**.


### Let's understand it further with some examples

**1. We are given data of 100 million Amazon customers**
- For these customers, we are only given $x_i$'s

<img src='https://drive.google.com/uc?id=16dMAQDZLgTSBA25wWHswVceCAe7Uo-8h'>

#### **Can we group similar customers?**
- In e-commerce, companies group similar customers based on their purchasing behaviour. Here the similarity may be based upon the type of products they purchase, the type of debit/credit cards they use, their geolocation, etc.
- Once these customers are grouped into different clusters, each cluster is offered different deals and discounts depending on its purchasing habits.

<img src='https://drive.google.com/uc?id=1PWsO4MU2riD_Q354JbB2uyeWPLg1YcsZ'>

**2. Given 100,000 words, can you cluster similar words?**

**3. Given 1 million newspaper articles, can you group together similar articles?**

<img src='https://drive.google.com/uc?id=1spBOwCTO5QWry9pbtOKzc42f73i8loHb'>

- Clustering is essentially an **unsupervised learning problem**, i.e., we do not have $y_i$’s.

- The dataset for unsupervised learning is represented as $D = \{x_i\}_{i=1}^n$

**There are other types of learning problems as well**

<img src='https://drive.google.com/uc?id=1eBAh1yPPc5cMZai-XEtT_BFSR9UJC-10'>


- In **semi-supervised learning**, most of the points have no labels and only a few points have labels. This happens when labelling is expensive.
- Here we have the values of ‘$y_i$’ for a few points only, and not for most of the points. This sits in between supervised and unsupervised learning, hence the name semi-supervised.

- There is another type of problem called **self-supervised learning**. It has a special application when it comes to images, where we use the image data itself to learn more properties or features about the images. We'll get to it later in the course.

***

## Intuition: What is Clustering?

- Intuitively, it is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups.
- Each group here is called a **cluster**. The points in the same cluster are close together. The points in different clusters are far away from each other.

<img src='https://drive.google.com/uc?id=1B7owbWs8CJYh2wgosnnB6XfGtLzYuTlX'>

- So, the task in clustering is **grouping the similar points**.

- We perform clustering when we want to group similar items in our dataset based on **our definition of similarity**.

- Since **clustering doesn’t have the class labels or the ground truth**, it is hard to measure if a clustering is good or bad in a very rigorous and critical way.

- It all depends on the problem and the context we are working in, i.e., the **Business Case**

<img src='https://drive.google.com/uc?id=1vu6eMV4qdF0k5UdB9Jp1JaRlG88j5bJC'>

**Clustering, just like classification, is very dependent on the features used.**

- The **clustering output can sometimes help us come up with new features**.

- If we get a **new datapoint** in future, **for example, a new customer of Amazon**, we can identify which cluster or group this new datapoint or new customer belongs to.

<img src='https://drive.google.com/uc?id=1ZbIuq61woMWP-uVvzCoa2lrzc9nI7lFj' width="500">


***

## Metrics for Clustering

**What metric is then used to evaluate whether the clustering model is giving accurate results?**

- The resulting clusters should make **business sense**

- To measure how similar two points are, we can use the same distance and similarity measures we used in supervised learning problems, like Euclidean Distance, Manhattan Distance, Cosine Similarity, etc.

<img src='https://drive.google.com/uc?id=1YJCq1u0yXn3BVBd2v7T1OlZtXmlwqJMy'>

- There is another metric called the **Dunn Index**, which is explained later in this section.

- For clustering, the dataset is denoted as $D = \{x_i\}$.

- So far in the classification and the regression tasks, we had $y_i$’s, but here in clustering, we do not have $y_i$’s. **All the performance metrics in Classification and Regression need $y_i$’s**, so the performance metrics known to us so far do not work in clustering.
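
The distance and similarity measures named above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the sample points `x1`, `x2`, `x3` are made up for this example and do not come from the lecture.

```python
import math


def euclidean_distance(p, q):
    # Straight-line distance: square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan_distance(p, q):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))


def cosine_similarity(p, q):
    # Cosine of the angle between p and q (undefined for zero vectors).
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


# Hypothetical sample points, chosen so the results are easy to check by hand.
x1 = (3.0, 4.0)
x2 = (0.0, 0.0)
x3 = (6.0, 8.0)

print(euclidean_distance(x1, x2))  # 5.0
print(manhattan_distance(x1, x2))  # 7.0
print(cosine_similarity(x1, x3))   # 1.0 (same direction)
```

Note that none of these functions needs a label ‘$y_i$’; they only compare the $x_i$'s with each other.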

**Intra and Inter-Cluster Points**
- If two points lie in the same cluster, the distance between them is an **intra-cluster** distance; if the two points lie in different clusters, it is an **inter-cluster** distance.

- All the points from the given dataset are grouped in such a way that the intra-cluster distances are small and the inter-cluster distances are large.

- The core idea behind any measure of clustering quality is that the **intra-cluster distance has to be low and the inter-cluster distance has to be high**.

<img src='https://drive.google.com/uc?id=1_Ow-GdD96AyhqJn_qzEBYrnk5iej2B6Q'>

#### **Dunn Index**
- It is denoted by **‘D’** and is given as:

$D = \frac{\min_{i \neq j} \text{distance}(i,j)}{\max_k \text{distance}'(k)}$

Where
- $\text{distance}(i,j)$ → distance between the farthest points of the clusters ‘$C_i$’ and ‘$C_j$’ → **Inter-Cluster distance**

<img src='https://drive.google.com/uc?id=1YZndvns6oIZqKxnOEU4uxquvWozCs87s'>

- $\text{distance}'(k)$ → **Intra-Cluster distance** of the cluster ‘$C_k$’

<img src='https://drive.google.com/uc?id=1-apq41uFx5Yl5YpLq-_-Atz5moZlJ7qk'>

- If ‘$D$’ is high, it implies good clustering. To obtain $\text{distance}(i,j)$, we have to compute the distance for every pair of points with one point from ‘$C_i$’ and the other from ‘$C_j$’.

<img src='https://drive.google.com/uc?id=1K4FX--I0QDIROe0BmUDdzImMbkjOeKLK'>

- **For ideal clusters, the value of the Dunn Index should be high**

- For this, the distances between the points in the same cluster should be as small as possible, and the distances between different clusters should be as large as possible.

<img src='https://drive.google.com/uc?id=1_b1r30YLuOgF2DItLJnNKBHto0aEhqXf' width='500'>

- If we have two clustering algorithms and we need to decide which one is performing better, we can compare the Dunn Index of the two algorithms and pick the one with the higher Dunn Index.

- Dunn Index takes into consideration the **minimum Inter-Cluster distance** and the **maximum Intra-Cluster distance** - so in both cases it considers the worst-case scenario.

<img src='https://drive.google.com/uc?id=1NK1JK_tTiBBoCRRwBI4vX1Tuktqwam_J'>
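
As a rough sketch, the Dunn Index can be computed directly from the definition above. The inter-cluster distance follows the farthest-pair definition given in these notes (other formulations use the closest pair instead). The intra-cluster distance is only shown in a figure, so it is assumed here to be the cluster diameter, i.e., the largest pairwise distance inside a cluster. The function names and toy clusterings are hypothetical.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def inter_cluster_distance(ci, cj):
    # As in the notes: distance between the farthest points of the two clusters.
    return max(euclidean(p, q) for p in ci for q in cj)


def intra_cluster_distance(c):
    # Assumption: cluster diameter (largest pairwise distance within the cluster).
    return max((euclidean(p, q) for p, q in combinations(c, 2)), default=0.0)


def dunn_index(clusters):
    # Worst case in both places: smallest separation over largest spread.
    # Raises ZeroDivisionError if every cluster holds a single point.
    min_inter = min(inter_cluster_distance(a, b) for a, b in combinations(clusters, 2))
    max_intra = max(intra_cluster_distance(c) for c in clusters)
    return min_inter / max_intra


# Two ways of clustering the same four points.
compact = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]    # nearby points grouped together
stretched = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]  # far-apart points grouped together

print(dunn_index(compact) > dunn_index(stretched))  # True
```

The compact clustering gets the higher score because its largest intra-cluster distance is 1, while the stretched clustering has clusters of diameter 10.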


***

## K-Means Clustering

- K-Means clustering is one of the simplest and most popular clustering algorithms. The value **‘K’** in the K-Means algorithm denotes the **number of clusters**.

<img src='https://drive.google.com/uc?id=1rKN23CFKliCrrpqDyc2PUP3zb74Tjktv'>

<img src='https://drive.google.com/uc?id=1wNEivXxtrHXqbm9GyqCiyFPEwBWkmO2V'>



- Let us assume we are given a 2-Dimensional dataset **‘D’** with number
of clusters (K) = 3. So the number of centroids is also equal to 3.
- Let ‘S1’, ‘S2’, ‘S3’ be different sets of elements, and ‘C1’, ‘C2’ and ‘C3’ be their respective centroids.

  $S_1 \cup S_2 \cup S_3 = D$

  $S_1 \cap S_2 \cap S_3 = \emptyset$

  $S_1 \cap S_2 = \emptyset, \; S_1 \cap S_3 = \emptyset, \; S_2 \cap S_3 = \emptyset$

- Here all the data points belong to one or the other set and no data point exists in more than one set.

<img src='https://drive.google.com/uc?id=1WOKeZ64-KNsVKDWA71mJ4Gvk8WzW_QpO'>

Number of Sets = K (i.e., $S_1, S_2, S_3, \dots, S_k$)

Number of Centroids = K (i.e., $C_1, C_2, C_3, \dots, C_k$)

For any set ‘$S_i$’, its centroid is the mean of its points: $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$
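
A minimal sketch of the centroid computation, assuming each point is a tuple of coordinates. The helper name `centroid` and the sample set `S1` are illustrative only.

```python
def centroid(points):
    # Coordinate-wise mean of all points in the set.
    n = len(points)
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dim))


S1 = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(S1))  # (3.0, 4.0)
```

The centroid does not have to be one of the data points; it is simply their average position.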

**K-means clustering is a centroid based clustering scheme**.

- Every point is assigned to the cluster closest to it.
- The core idea of K-means clustering is to find ‘K’ centroids and assign each point to the cluster whose centroid is nearest to it. The biggest challenge is to find these 'K' centroids.
- There are algorithms to find these 'K' centroids, and one of the most commonly used is **Lloyd's algorithm**.

<img src='https://drive.google.com/uc?id=1wN2P20hwzhHlk02lmq-a_HTIdKV0Cz5X'>
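
Once the centroids are known, assigning a point to a cluster is a nearest-centroid lookup. The sketch below assumes Euclidean distance; the centroid values and the `new_customer` point are invented for illustration.

```python
def nearest_centroid(x, centroids):
    """Return the index of the centroid closest to x (squared Euclidean distance)."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    return min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))


# Hypothetical centroids of K = 3 clusters and one new datapoint.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
new_customer = (4.0, 6.0)

print(nearest_centroid(new_customer, centroids))  # 1
```

Squared distances are compared instead of distances because the square root does not change which centroid is nearest.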


***

### K-Means: Mathematical Formulation: Objective Function

We are given a dataset ‘**D**’ of ‘**n**’ data points and our job is to find the ‘**K**’ centroids.

$D = \{x_i\}_{i=1}^n$

The ‘**K**’ sets are $S_1, S_2, S_3, \dots, S_k$

The ‘**K**’ centroids are $C_1, C_2, C_3, \dots, C_k$

The **objective function for K-means algorithm** is given as:

$\arg\min_{C_1, C_2, C_3, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - C_i\|^2$,
such that every point $x$ belongs to exactly one set $S_i$, i.e., $S_i \cap S_j = \emptyset$ for $i \neq j$

<img src='https://drive.google.com/uc?id=1ni3v4x0PaViRZ5BSN32g05fgRI__uXhr'>

- We have to find the cluster centroids such that the points belonging to each cluster are as near as possible to its centroid, so that the **intra-cluster distance is minimized**.

<img src='https://drive.google.com/uc?id=1UZk5Ff3pa-Bsm4eQ_jWXVk-PnbdJIBAG'>
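
To make the objective concrete, here is a small sketch that evaluates the sum of squared distances for a fixed grouping of points. The function name `kmeans_objective` and the toy data are assumptions made for this example.

```python
def kmeans_objective(clusters, centroids):
    # Sum over all clusters of the squared distance from each point to its centroid.
    total = 0.0
    for members, c in zip(clusters, centroids):
        for x in members:
            total += sum((a - b) ** 2 for a, b in zip(x, c))
    return total


clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]

# Centroids placed at the mean of each cluster.
print(kmeans_objective(clusters, [(1.0, 0.0), (11.0, 0.0)]))  # 4.0

# Centroids placed on one of the points of each cluster instead.
print(kmeans_objective(clusters, [(0.0, 0.0), (10.0, 0.0)]))  # 8.0
```

For a fixed grouping, placing each centroid at the mean of its cluster gives the smaller value, which is why the centroids are defined as cluster means.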



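#### **Can we use Gradient Descent to optimize the K-Means objective?**

- Let $S_{ij}$ indicate cluster membership: it has to be either 1 or 0, based on whether $x_j \in S_i$ or not.

- It cannot be a fraction, so the objective is not differentiable with respect to the assignments and GD does not work directly.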

<img src='https://drive.google.com/uc?id=1twpYopy7UrQeHj43JewLfXR0H8UjxYg5'>

<img src='https://drive.google.com/uc?id=1-tTlJn_6e7mfx7tsi4ZeVW33ZL34WmzQ'>
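
One way to write the objective with these 0/1 membership variables is shown below. This is a reconstruction from the text, with $i$ indexing clusters and $j$ indexing points; the exact notation in the figures may differ.

$$
\min_{C_1, \dots, C_k,\; S} \; \sum_{i=1}^{k} \sum_{j=1}^{n} S_{ij} \, \|x_j - C_i\|^2
\quad \text{subject to} \quad S_{ij} \in \{0, 1\}, \qquad \sum_{i=1}^{k} S_{ij} = 1 \;\; \text{for every } j
$$

The constraint $\sum_{i} S_{ij} = 1$ says that each point belongs to exactly one cluster. Because every $S_{ij}$ is restricted to 0 or 1, the search over assignments is discrete rather than continuous.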

- This optimization problem is computationally very hard to solve exactly. In such cases, we use approximation algorithms, which find an approximate solution rather than the exact one. One such approximation algorithm is **Lloyd’s algorithm**. It is very simple and gives a good approximation.

<img src='https://drive.google.com/uc?id=1eYt48azC1cTA5Ls4SEXfWk1WDP4hWKrt'>




***

## K-Means Algorithm (Lloyd’s Algorithm)
**1. Initialization**

From the given dataset ‘**D**’, we have to pick ‘**K**’ points randomly, and assume them to be the centroids. Let us denote them as $C_1, C_2, C_3, …, C_k$.

<img src='https://drive.google.com/uc?id=1hv9_bd-31sy9XO3lA0ri21h4Gdnf75ji'>

**2. Assignment**

For each point ‘$x_i$’ in the dataset ‘**D**’, we have to compute the distance of each of the above ‘K’ centroids from this point, and pick the nearest centroid. Let us denote this nearest centroid as ‘$C_j$’.

Add the point ‘$x_i$’ to the set ‘$S_j$’ (which is associated with the centroid ‘$C_j$’).

<img src='https://drive.google.com/uc?id=1Mp6n0AjrHSZHIAMAgaBv7FAoMKpX_lML'>

**3. Recompute Centroid (Update Stage)**

Recalculate/update ‘$C_j$’ as follows:

$C_j = (1/|S_j|) * Σ_{x_i∈S_j} x_i$

<img src='https://drive.google.com/uc?id=1bix-ArmjCvesifVBHRs2HKLBZM6Pl-4s'>

**4. Repeat steps 2 and 3 until convergence**. Here convergence is the stage where the centroids no longer change much from one iteration to the next.

<img src='https://drive.google.com/uc?id=1ddspk-Zef4qZNyopCY2Aei2EAr60q_fY'>

For example, at the end of stage 2, if the centroids are $\{C_1, C_2, C_3, \dots, C_k\}$ and at the end of stage 3, if the centroids are $\{C_1', C_2', C_3', \dots, C_k'\}$, then at convergence the distance between the old and the new centroids is very small.

(i.e., $\|C_1 - C_1'\|, \|C_2 - C_2'\|, \|C_3 - C_3'\|, \dots, \|C_k - C_k'\|$ all have to be very small)

Finally, the centroids we get are $C_1, C_2, C_3, \dots, C_k$ and the final sets/clusters of points are $S_1, S_2, S_3, \dots, S_k$.
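
The four steps can be put together in a short pure-Python sketch. It is an illustration under a few assumptions not stated in the notes: Euclidean distance, a tolerance `tol` on the largest centroid shift as the convergence test, a cap `max_iters` on the number of iterations, and keeping the old centroid if a cluster ends up empty. All names are hypothetical.

```python
import math
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_kmeans(points, k, max_iters=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    # Step 1 (Initialization): pick K data points at random as the centroids.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # Step 2 (Assignment): add each point to the set of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: squared_distance(x, centroids[c]))
            clusters[j].append(x)
        # Step 3 (Update): recompute each centroid as the mean of its set.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # assumption: keep old centroid
        # Step 4 (Convergence): stop when no centroid moved more than tol.
        shift = max(
            math.sqrt(squared_distance(old, new))
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters


# Hypothetical toy data: two groups of points that are far apart.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
final_centroids, final_clusters = lloyd_kmeans(data, k=2)
print(final_centroids)
print(final_clusters)
```

The returned clusters are the assignments from the last iteration, so at convergence they match the returned centroids. Changing `seed` changes which points are picked in Step 1.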



***

## Some Problems with Lloyd's Algorithm

- It is **Initialization Sensitive**

  Initialization sensitivity means the final clusters and centroids depend on which points we happen to pick as centroids during the random initialization. Different initial centroids can result in different clusterings.

<img src='https://drive.google.com/uc?id=1Le4Y1qo31VhDx7BbkNv3IhfGtZpIjREI'>
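
The effect can be reproduced on a small made-up dataset. To keep the demonstration deterministic, the sketch below passes the starting centroids in explicitly instead of drawing them at random. The helper `lloyd_from`, the data, and the two starting configurations are all assumptions for illustration.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd_from(points, init_centroids, max_iters=100):
    """Run Lloyd's iterations from given starting centroids; return (centroids, objective)."""
    centroids = [tuple(map(float, c)) for c in init_centroids]
    k = len(centroids)
    for _ in range(max_iters):
        # Assignment step.
        clusters = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda c: sq_dist(x, centroids[c]))
            clusters[j].append(x)
        # Update step (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    objective = sum(min(sq_dist(x, c) for c in centroids) for x in points)
    return centroids, objective


# Three well-separated 2x2 squares of points, around x = 0, x = 10 and x = 20.
points = [(x + dx, dy) for x in (0, 10, 20) for dx in (0, 1) for dy in (0, 1)]

# Start A: one initial centroid inside each group.
_, objective_a = lloyd_from(points, [(0, 0), (10, 0), (20, 0)])

# Start B: two initial centroids inside the first group, one in the second.
_, objective_b = lloyd_from(points, [(0, 0), (0, 1), (10, 0)])

print(objective_a)  # 6.0
print(objective_b)  # 205.0
```

Both runs converge, but start B splits the first group in two and merges the other two groups, ending with a much larger objective value. The algorithm only improves on where it starts, so a poor start can leave it stuck in a poor clustering.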



***

## Closing Notes

- In the next lecture, we'll study this problem in detail, along with the techniques available to work around these limitations of Lloyd's Algorithm.

<img src='https://drive.google.com/uc?id=1Pn6-nU0VGJW7zrzrUaH7kGovhBXWSQvJ'>