# K-Means Clustering – Step-by-Step (Unsupervised)
This notebook introduces **K-Means**, an **unsupervised** algorithm that groups data into **K clusters**.

You will:
1) Create unlabeled data
2) Scale features
3) Choose K and fit K-Means
4) Visualize clusters + centroids
5) Choose K with the **Elbow method**
6) Refit with the chosen K
7) Predict the cluster for new points
8) Review notes and limitations


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

## 1) Create unlabeled data (no y labels)

In [None]:
X = np.array([
    [1, 0],
    [2, 0],
    [2, 1],
    [3, 1],
    [3, 2],
    [6, 3],
    [7, 3],
    [8, 4],
    [8, 3],
    [7, 4],
])

print('X shape:', X.shape)
X

## 2) Scale features (recommended for distance-based clustering)

In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print('First row before:', X[0])
print('First row after :', X_scaled[0])
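
To make the scaling step less of a black box, here is a minimal NumPy-only sketch of what `StandardScaler` computes: subtract each feature's mean and divide by its population standard deviation (`ddof=0`). The names `mean`, `std`, and `X_manual` are introduced here for illustration only.

```python
import numpy as np

# Same toy data as in section 1.
X = np.array([
    [1, 0], [2, 0], [2, 1], [3, 1], [3, 2],
    [6, 3], [7, 3], [8, 4], [8, 3], [7, 4],
], dtype=float)

# Per-feature mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
mean = X.mean(axis=0)
std = X.std(axis=0)

# Standardize: every feature ends up with mean 0 and standard deviation 1.
X_manual = (X - mean) / std

print('Feature means:', mean)
print('Feature stds :', std)
print('Scaled means :', X_manual.mean(axis=0).round(6))
print('Scaled stds  :', X_manual.std(axis=0).round(6))
```

If you compare `X_manual` with `X_scaled` from the cell above, the two should agree up to floating-point rounding.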

## 3) Choose K and fit K-Means

In [None]:
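k = 2
# random_state fixes the random initialization so results are reproducible;
# n_init=10 runs K-Means from 10 different starts and keeps the lowest-inertia run.
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_scaled)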

labels = kmeans.labels_
centers = kmeans.cluster_centers_

print('Cluster labels:', labels)
print('Centers (scaled):\n', centers)
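
For intuition about what `fit` does, the sketch below implements the basic two-step loop usually called Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points. It is a simplified illustration, not scikit-learn's implementation (which by default uses k-means++ initialization and, with `n_init=10`, keeps the best of several runs). `lloyd_kmeans` and its parameters are hypothetical names, and the raw toy data is used to keep the block self-contained.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Stop once the centroids no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Same toy data as in section 1 (unscaled).
X = np.array([
    [1, 0], [2, 0], [2, 1], [3, 1], [3, 2],
    [6, 3], [7, 3], [8, 4], [8, 3], [7, 4],
], dtype=float)

labels, centers = lloyd_kmeans(X, k=2)
print('Labels :', labels)
print('Centers:\n', centers)
```

Because the starting centroids are random, a single run can settle in a poor local optimum on harder data, which is the reason the notebook's `KMeans` call uses several initializations.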

## 4) Visualize clusters + centroids (scaled space)

In [None]:
plt.figure()
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='X', s=250)
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title(f'K-Means Clustering (K={k})')
plt.show()
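
The plot above lives in scaled space, so the centroid coordinates are in standard-deviation units. A sketch of mapping them back to the original units follows. It uses plain NumPy and an assumed two-cluster assignment (first five points vs. last five) rather than the fitted model. With the fitted objects from this notebook, `scaler.inverse_transform(centers)` performs the same conversion.

```python
import numpy as np

# Same toy data as in section 1.
X = np.array([
    [1, 0], [2, 0], [2, 1], [3, 1], [3, 2],
    [6, 3], [7, 3], [8, 4], [8, 3], [7, 4],
], dtype=float)

# Standardize by hand (mean 0, population std 1 per feature).
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

# Assumed cluster assignment for illustration: first five points vs. last five.
labels = np.array([0] * 5 + [1] * 5)
centers_scaled = np.array([X_scaled[labels == j].mean(axis=0) for j in range(2)])

# Undo the scaling: multiply by the std and add the mean back.
centers_original = centers_scaled * std + mean

print('Centers (scaled):\n', centers_scaled)
print('Centers (original units):\n', centers_original)
```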

## 5) Choose the best K (Elbow method)
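K-Means minimizes an objective called **inertia**: the sum of squared distances from each point to its assigned centroid. Lower inertia means tighter clusters, but inertia always decreases as K grows (it reaches 0 when every point is its own cluster), so the lowest value is not the best K.
Instead, look for the **elbow**: the K after which adding more clusters yields only small reductions in inertia. That point is often a good choice.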
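
To make the definition concrete, this sketch computes inertia by hand for three groupings of the toy data: one cluster, the two visually obvious groups, and every point in its own cluster. It works on the raw (unscaled) points for readability, and `inertia` is a hypothetical helper that assumes each centroid is the mean of its cluster.

```python
import numpy as np

# Same toy data as in section 1 (unscaled).
X = np.array([
    [1, 0], [2, 0], [2, 1], [3, 1], [3, 2],
    [6, 3], [7, 3], [8, 4], [8, 3], [7, 4],
], dtype=float)

def inertia(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

one_cluster = np.zeros(len(X), dtype=int)      # K = 1
two_clusters = np.array([0] * 5 + [1] * 5)     # K = 2, assumed grouping
all_singletons = np.arange(len(X))             # K = number of points

for name, lab in [('K=1', one_cluster), ('K=2', two_clusters), ('K=10', all_singletons)]:
    print(f'{name:>5}: inertia = {inertia(X, lab):.2f}')
```

Inertia drops sharply from one cluster to two and reaches zero when K equals the number of points, which is why the elbow, not the minimum, is the useful signal.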

In [None]:
inertias = []
Ks = range(1, 9)

# Use a separate loop variable so the `k` chosen in step 3 is not overwritten.
for k_try in Ks:
    km = KMeans(n_clusters=k_try, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)

plt.figure()
plt.plot(list(Ks), inertias, marker='o')
plt.xlabel('K (number of clusters)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

## 6) Fit again with your chosen K (example: K=3)

In [None]:
k = 3
kmeans3 = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans3.fit(X_scaled)

labels3 = kmeans3.labels_
centers3 = kmeans3.cluster_centers_

plt.figure()
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels3)
plt.scatter(centers3[:, 0], centers3[:, 1], marker='X', s=250)
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.title(f'K-Means Clustering (K={k})')
plt.show()

## 7) Predict the cluster for new points

In [None]:
new_points = np.array([
    [2.5, 1.0],
    [7.5, 3.5],
])

new_points_scaled = scaler.transform(new_points)
pred_clusters = kmeans3.predict(new_points_scaled)

for p, c in zip(new_points, pred_clusters):
    print(f'Point {p} -> cluster {c}')
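
`predict` does not refit anything; it assigns each new point to the nearest existing centroid. A minimal NumPy sketch of that rule is below. The centroid values are assumptions chosen for illustration (the means of the two obvious groups, in original units), not output from `kmeans3`.

```python
import numpy as np

# Assumed centroids for illustration only (original units, two clusters).
centers = np.array([[2.2, 0.8], [7.2, 3.4]])

new_points = np.array([
    [2.5, 1.0],
    [7.5, 3.5],
])

# Squared distance from every new point to every centroid, shape (n_points, n_centers).
d2 = ((new_points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)

# Each point gets the index of its closest centroid.
pred = d2.argmin(axis=1)

for p, c in zip(new_points, pred):
    print(f'Point {p} -> cluster {c}')
```

In the notebook itself the same comparison happens in scaled space, which is why new points go through `scaler.transform` before `predict`.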

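## 8) Notes / Limitations
- K-Means works best on roughly **round**, similarly sized clusters; it struggles with elongated or irregular shapes.
- It is sensitive to **outliers**, because each centroid is the mean of its points.
- Results depend on the random initialization, which is why the code sets `random_state` and `n_init`.
- K must be chosen in advance (the Elbow method helps, but the elbow is not always clear).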
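
The outlier sensitivity follows directly from centroids being means. As a small illustration, the sketch below adds one far-away point (a made-up value, `[30, 30]`) to the upper-right group of the toy data and measures how far the centroid moves.

```python
import numpy as np

# Upper-right group of the toy data from section 1.
group = np.array([[6, 3], [7, 3], [8, 4], [8, 3], [7, 4]], dtype=float)

# Hypothetical outlier, chosen only to make the effect visible.
outlier = np.array([[30.0, 30.0]])

centroid_clean = group.mean(axis=0)
centroid_with_outlier = np.vstack([group, outlier]).mean(axis=0)

# Euclidean distance the centroid moved because of a single point.
shift = np.linalg.norm(centroid_with_outlier - centroid_clean)

print('Centroid without outlier:', centroid_clean)
print('Centroid with outlier   :', centroid_with_outlier)
print(f'Centroid shift          : {shift:.2f}')
```

One extreme point is enough to drag the centroid well outside the original group, so it is worth inspecting or removing outliers before clustering.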
