# K-means Clustering
K-means clustering is an unsupervised learning algorithm that partitions a dataset into a predefined number of groups, or "clusters," so that each data point belongs to the cluster with the nearest mean. It's popular for exploring the structure of data, pattern recognition, and image compression.
## Key Concepts
- __Clusters and Centroids__: Clusters are the groups that data points are divided into, and each cluster has a central point called a centroid, which is the mean position of all points in the cluster.
- __K (Number of Clusters)__: The user defines the number of clusters, denoted by K. Choosing K can be tricky and is often done through methods like the Elbow Method or Silhouette Score.
- __Distance Measure__: K-means typically uses Euclidean distance to measure how far each point is from each centroid; a smaller distance means the point is more similar to that cluster.
- __Iterations__: K-means iteratively refines the clusters by reassigning points and recalculating the centroids until the algorithm converges (the cluster assignments no longer change) or a maximum number of iterations is reached.
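
To make the distance measure concrete, here is a minimal sketch of Euclidean distance written out by hand. The function name `euclidean` and the two sample points are illustrative assumptions, not part of any library.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points given as coordinate tuples."""
    # Square the difference along each axis, add them up, take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two illustrative 2-D points (assumed values).
d = euclidean((2, 3), (6, 7))
print(round(d, 2))  # 5.66
```

The standard library's `math.dist` (Python 3.8+) computes the same quantity and could be used instead.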

## Implementation

```python
import pandas as pd
data = {
    'X': [2, 3, 6, 8, 5, 9],
    'Y': [3, 3, 7, 8, 4, 7]
}
df = pd.DataFrame(data)
print(df)
```

```
   X  Y
0  2  3
1  3  3
2  6  7
3  8  8
4  5  4
5  9  7
```


### 1. Initialize Centroids
Select K points randomly from the dataset as the initial centroids.

__Example:__ Assume we set K=2, meaning we want to divide the data into two clusters. Randomly pick two points as the initial centroids. Let's say we start with points (2, 3) and (8, 8).
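
One possible way to draw the initial centroids is sketched below using only the standard library. The variable names and the seed value `0` are assumptions made for reproducibility; the points it happens to pick may differ from the (2, 3) and (8, 8) used in the rest of this walk-through.

```python
import random

# The six (X, Y) points from the dataset.
points = [(2, 3), (3, 3), (6, 7), (8, 8), (5, 4), (9, 7)]
K = 2

# A fixed seed (an arbitrary choice) makes reruns pick the same points.
rng = random.Random(0)

# sample() draws K distinct data points without replacement.
initial_centroids = rng.sample(points, K)
print(initial_centroids)
```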

### 2. Assign Points to Nearest Centroid
For each data point, compute the distance to each centroid and assign the point to the closest one.

__Distances from Centroid 1: (2, 3)__
- Distance to (2, 3): 0 (same point)
- Distance to (3, 3): 1
- Distance to (6, 7): 5.66
- Distance to (8, 8): 7.81
- Distance to (5, 4): 3.16
- Distance to (9, 7): 8.06

__Distances from Centroid 2: (8, 8)__
- Distance to (2, 3): 7.81
- Distance to (3, 3): 7.07
- Distance to (6, 7): 2.24
- Distance to (8, 8): 0 (same point)
- Distance to (5, 4): 5.0
- Distance to (9, 7): 1.41

__Assign Points to Nearest Centroid__

Based on these distances, each point is assigned to the centroid it is closest to:
- (2, 3): Closer to Centroid 1 (distance 0)
- (3, 3): Closer to Centroid 1 (distance 1)
- (5, 4): Closer to Centroid 1 (distance 3.16)
- (6, 7): Closer to Centroid 2 (distance 2.24)
- (8, 8): Closer to Centroid 2 (distance 0)
- (9, 7): Closer to Centroid 2 (distance 1.41)

Thus, the clusters based on this iteration would be:
- Cluster 1: (2,3),(3,3),(5,4)
- Cluster 2: (6,7),(8,8),(9,7)
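
The assignment step can be sketched in a few lines of Python. The helper name `assign` is hypothetical, and breaking ties in favor of the lower-numbered centroid is an assumption of this sketch (no ties occur in this example).

```python
import math

points = [(2, 3), (3, 3), (6, 7), (8, 8), (5, 4), (9, 7)]
centroids = [(2, 3), (8, 8)]  # initial centroids from step 1

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        # index(min(...)) picks the first centroid on a tie.
        labels.append(distances.index(min(distances)))
    return labels

labels = assign(points, centroids)
for p, label in zip(points, labels):
    print(p, "-> Cluster", label + 1)
```

Labels are zero-based inside the code, so label `0` corresponds to Cluster 1 and label `1` to Cluster 2.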

### 3. Update Centroids
For each cluster, recalculate the centroid as the mean of all points assigned to it.

Calculate the mean position of points in each cluster:
- Cluster 1 new centroid: $\left(\frac{2+3+5}{3}, \frac{3+3+4}{3}\right) \approx (3.33, 3.33)$
- Cluster 2 new centroid: $\left(\frac{6+8+9}{3}, \frac{7+8+7}{3}\right) \approx (7.67, 7.33)$
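
The same means can be checked with a short sketch. The helper name `update_centroids` is hypothetical, and the code assumes every cluster has at least one member, which holds in this example.

```python
points = [(2, 3), (3, 3), (6, 7), (8, 8), (5, 4), (9, 7)]
labels = [0, 0, 1, 1, 0, 1]  # cluster index per point after step 2

def update_centroids(points, labels, k):
    """Return the mean position of the points in each of the k clusters."""
    new_centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        # zip(*members) groups the X values together and the Y values together.
        mean = tuple(sum(coord) / len(members) for coord in zip(*members))
        new_centroids.append(mean)
    return new_centroids

new_centroids = update_centroids(points, labels, 2)
print([tuple(round(c, 2) for c in centroid) for centroid in new_centroids])
# [(3.33, 3.33), (7.67, 7.33)]
```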

### 4. Repeat
Repeat steps 2 and 3 until the centroids no longer change (convergence) or a pre-set maximum number of iterations is reached.
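
Putting the steps together, the whole loop might look like the sketch below. The function name `kmeans`, the default `max_iters=100`, and the choice to leave an empty cluster's centroid where it is are all assumptions of this sketch rather than a reference implementation.

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to the index of its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels

        # Step 3: move each centroid to the mean of its assigned points.
        updated = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: keep an empty cluster's centroid
        centroids = updated
    return labels, centroids

points = [(2, 3), (3, 3), (6, 7), (8, 8), (5, 4), (9, 7)]
final_labels, final_centroids = kmeans(points, [(2, 3), (8, 8)])
print(final_labels)
print(final_centroids)
```

For this dataset the second pass produces the same assignments as the first, so the loop stops with the centroids computed in step 3.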