# K-Means Clustering Algorithm

K-Means is an unsupervised machine learning algorithm that partitions data points into $K$ clusters, where each point belongs to the cluster with the nearest mean (centroid). Starting from an initial set of centroids, it alternates between reassigning points and recomputing centroids until the assignments stop changing.

## Mathematical Formulation

Let's define the key components of the K-Means algorithm:

- $K$: The number of clusters to create.
- $X$: The dataset of $N$ data points, $X = \{x_1, x_2, \ldots, x_N\}$, where each data point $x_i \in \mathbb{R}^d$.
- $C$: The set of $K$ cluster centroids, denoted as $C = \{c_1, c_2, \ldots, c_K\}$, where $c_k$ is the centroid of cluster $k$.
- $R$: The cluster assignment vector, where $r_i$ represents the cluster assignment for data point $x_i$, such that $r_i \in \{1, 2, \ldots, K\}$.
- $J$: The objective function, which is the sum of squared distances from each point to its assigned centroid:

  $$J = \sum_{i=1}^{N} \sum_{k=1}^{K} \left\| x_i - c_k \right\|^2 \cdot \mathbb{I}(r_i = k)$$

  Where $\mathbb{I}(r_i = k)$ is an indicator function that equals 1 if $r_i = k$ and 0 otherwise.
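
Because the indicator keeps only the term for each point's own cluster, $J$ reduces to one squared distance per point. A minimal sketch of that computation follows; the function names `squared_distance` and `objective` and the sample data are illustrative assumptions, and cluster labels are 0-based here rather than $1, \ldots, K$ as in the notation above.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def objective(points, centroids, assignments):
    """J: sum of squared distances from each point to its assigned centroid."""
    return sum(
        squared_distance(x, centroids[r]) for x, r in zip(points, assignments)
    )


# Illustrative data: four 2-D points, two centroids, 0-based labels.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
assignments = [0, 0, 1, 1]

# Every point lies exactly 1 unit from its centroid, so J = 4 * 1^2.
print(objective(points, centroids, assignments))  # 4.0
```

Swapping the labels to `[1, 1, 0, 0]` assigns every point to the far centroid and raises $J$ to 404, which is the kind of assignment the algorithm is designed to avoid.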

## K-Means Algorithm Steps

The K-Means algorithm follows these iterative steps:

1. **Initialization**: Initialize the cluster centroids $C$ either randomly or using the K-Means++ method.

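2. **Assignment Step**: Assign each data point $x_i$ to the cluster whose centroid $c_k$ is nearest. Minimizing the squared Euclidean distance selects the same centroid as minimizing the distance itself, so the square root can be skipped: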

   $$r_i = \arg\min_k \left\| x_i - c_k \right\|^2$$

3. **Update Step**: Recalculate the cluster centroids $c_k$ by taking the mean of all data points assigned to cluster $k$:

   $$c_k = \frac{1}{|S_k|} \sum_{i \in S_k} x_i$$

   Where $S_k = \{i : r_i = k\}$ is the set of indices of the data points assigned to cluster $k$.

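4. **Convergence Check**: Repeat the assignment and update steps until convergence. Convergence can be determined by checking whether the cluster assignments $r_i$ no longer change, or whether the decrease in the objective function $J$ between iterations falls below a chosen tolerance. Neither step can increase $J$, so the procedure always terminates, but it may stop at a local minimum that depends on the initial centroids.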
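
The assignment, update, and convergence steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: the function names, the 0-based labels, the `max_iter` cap, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example. Initialization is left to the caller, who passes the starting centroids in.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign(points, centroids):
    """Assignment step: 0-based index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(x, centroids[k]))
        for x in points
    ]


def update(points, assignments, centroids):
    """Update step: each centroid becomes the mean of its assigned points."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [x for x, r in zip(points, assignments) if r == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        dim = len(members[0])
        new_centroids.append(
            tuple(sum(m[j] for m in members) / len(members) for j in range(dim))
        )
    return new_centroids


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iter):
        new_assignments = assign(points, centroids)
        if new_assignments == assignments:  # convergence check
            break
        assignments = new_assignments
        centroids = update(points, assignments, centroids)
    return centroids, assignments


# Illustrative data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 2.0), (1.0, 1.0), (10.0, 0.0), (10.0, 2.0), (9.0, 1.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
print(assignments)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

The result depends on the starting centroids. On the same data, starting from `(0.0, 0.0)` and `(0.0, 2.0)` (both in the left group) makes this sketch stop at the labels `[0, 1, 0, 0, 1, 0]`, a partition that splits the data by height rather than into the two obvious groups.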

## K-Means++ Initialization

K-Means++ is an improved initialization method that aims to distribute the initial cluster centroids more effectively. The steps for K-Means++ initialization are as follows:

1. Select the first centroid $c_1$ uniformly at random from the dataset $X$.

2. For each subsequent centroid $c_k$, let $D(x_i)$ denote the Euclidean distance from data point $x_i$ to the nearest centroid already chosen. Select the next centroid from the data points at random, choosing $x_i$ with probability proportional to its squared distance, $\frac{D(x_i)^2}{\sum_{j=1}^{N} D(x_j)^2}$. Points far from every existing centroid are therefore more likely to be picked, and points already chosen have probability zero.

3. Repeat step 2 until $K$ centroids are chosen.
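
The seeding procedure above can be sketched with Python's standard `random` module. The function name `kmeans_pp_init`, the `seed` parameter, and the uniform fallback for the degenerate case where every point coincides with a chosen centroid are assumptions made for this illustration. Each point is weighted by its squared distance to the nearest centroid chosen so far.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_init(points, k, seed=None):
    """Pick k initial centroids from `points` using K-Means++ style seeding."""
    rng = random.Random(seed)
    # Step 1: the first centroid is drawn uniformly at random.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        # Squared distance from each point to its nearest chosen centroid.
        weights = [min(squared_distance(x, c) for c in centroids) for x in points]
        if sum(weights) == 0:
            # Assumption: if all points coincide with chosen centroids,
            # fall back to a uniform draw instead of failing.
            centroids.append(rng.choice(points))
            continue
        # Step 2: draw the next centroid with probability proportional
        # to the weights computed above.
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


# Illustrative data: three distinct points, three centroids requested.
points = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
print(kmeans_pp_init(points, k=3, seed=0))
```

Because an already chosen point has weight zero, requesting three centroids from three distinct points returns each point exactly once, in an order that depends on the seed.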

Because the initial centroids tend to be spread across the data, K-Means++ initialization often converges in fewer iterations and reaches a lower final value of $J$ than purely random initialization.

K-Means is widely used for clustering tasks and has applications in various fields, including image compression, customer segmentation, and data analysis.
