# Customer Segmentation with Clustering

Segment customers into meaningful groups using unsupervised clustering.

**Dataset:** [https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset/data](https://www.kaggle.com/datasets/yasserh/customer-segmentation-dataset/data)  
**Type:** Unsupervised Clustering

> **TODO:** Download the dataset, place it in `../../data/raw/`, then update `DATA_PATH` below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.impute import SimpleImputer
sns.set_theme(style='whitegrid')

## 1. Load Data

In [None]:
DATA_PATH = "../../data/raw/customers.csv"

df = pd.read_csv(DATA_PATH)
print(f'Shape: {df.shape}')
df.head()

## 2. EDA

In [None]:
df.info()  # info() prints its own report and returns None, so no print() wrapper
print('\nNull counts:')
print(df.isnull().sum())
df.describe().T

In [None]:
# Pairplot of numeric features (sample if large)
num_df = df.select_dtypes(include='number')
sample = num_df.sample(min(500, len(num_df)), random_state=42)
sns.pairplot(sample, diag_kind='kde', plot_kws={'alpha': 0.3, 's': 10})
plt.suptitle('Feature Pairplot', y=1.01)
plt.tight_layout(); plt.show()

## 3. Feature Selection & Scaling

In [None]:
# TODO: Select relevant features for clustering.
# The default keeps every numeric column, so remove ID-like columns
# (and any numerically encoded dates) from this list if present.
feature_cols = df.select_dtypes(include='number').columns.tolist()
# feature_cols = ['col1', 'col2', ...]  # or specify manually

X = df[feature_cols].copy()
X = SimpleImputer(strategy='median').fit_transform(X)
X_scaled = StandardScaler().fit_transform(X)
print(f'Feature matrix shape: {X_scaled.shape}')
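
Selecting every numeric column also picks up row identifiers, which carry no segment information and can distort the distances K-Means relies on. The sketch below shows one possible filter. The helper name `drop_id_like_columns`, the `max_unique_ratio` threshold, and the toy column names are all hypothetical and not taken from the dataset; treat the heuristic as a starting point and check its output by hand.

```python
import pandas as pd


def drop_id_like_columns(frame, max_unique_ratio=0.95):
    """Return numeric column names that do not look like row identifiers.

    Heuristic (an assumption, not a rule): a column is treated as an ID if
    its name is 'id' or ends in '_id', or if it is integer-typed and almost
    every value is unique.
    """
    numeric = frame.select_dtypes(include="number")
    keep = []
    for col in numeric.columns:
        series = numeric[col].dropna()
        name = str(col).lower()
        name_hint = name == "id" or name.endswith("_id")
        unique_ratio = series.nunique() / max(len(series), 1)
        integer_like = pd.api.types.is_integer_dtype(series)
        if name_hint or (integer_like and unique_ratio > max_unique_ratio):
            continue  # looks like an identifier, skip it
        keep.append(col)
    return keep


# Toy frame with made-up columns, only to illustrate the filter
toy = pd.DataFrame({
    "customer_id": range(1, 7),
    "age": [25, 31, 25, 47, 52, 31],
    "spend": [120.5, 80.0, 300.2, 45.9, 210.0, 99.9],
})
selected = drop_id_like_columns(toy)
print(selected)
```

If the filter suits the real data, `feature_cols = drop_id_like_columns(df)` could replace the default selection before imputing and scaling.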

## 4. Determine Optimal K

In [None]:
inertias, silhouettes = [], []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertias.append(km.inertia_)
    silhouettes.append(silhouette_score(X_scaled, labels))

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(list(K_range), inertias, 'bo-')
axes[0].set_title('Elbow Method'); axes[0].set_xlabel('k')
axes[1].plot(list(K_range), silhouettes, 'ro-')
axes[1].set_title('Silhouette Score'); axes[1].set_xlabel('k')
plt.tight_layout(); plt.show()

best_k = list(K_range)[silhouettes.index(max(silhouettes))]
print(f'Best k by silhouette: {best_k}')

## 5. Final Clustering

In [None]:
K = best_k  # TODO: override if domain knowledge suggests otherwise

km_final = KMeans(n_clusters=K, random_state=42, n_init=10)
df['cluster'] = km_final.fit_predict(X_scaled)

sil = silhouette_score(X_scaled, df['cluster'])
db = davies_bouldin_score(X_scaled, df['cluster'])
print(f'Silhouette Score: {sil:.4f}')
print(f'Davies-Bouldin Score: {db:.4f}  (lower is better)')
print(df['cluster'].value_counts())

## 6. PCA Visualisation

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1],
                     c=df['cluster'], cmap='tab10', alpha=0.6, s=15)
plt.colorbar(scatter, label='Cluster')
plt.title(f'PCA Projection: {K} Clusters (var explained: {explained:.1%})')
plt.xlabel('PC1'); plt.ylabel('PC2')
plt.tight_layout(); plt.show()

## 7. Segment Profiles

In [None]:
cluster_profile = df.groupby('cluster')[feature_cols].mean().T
cluster_profile.columns = [f'Cluster {c}' for c in cluster_profile.columns]
print(cluster_profile.round(2))

# Heatmap
plt.figure(figsize=(10, max(4, len(feature_cols) * 0.4)))
sns.heatmap(cluster_profile, cmap='RdYlGn', annot=True, fmt='.2f', linewidths=0.5)
plt.title('Cluster Profiles (mean feature values)')
plt.tight_layout(); plt.show()
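
A heatmap of raw means shares one colour scale across all features, so features with large magnitudes tend to dominate it. One possible alternative is to standardise each feature first and then average per cluster, which expresses every cell as "standard deviations above or below the overall mean". The sketch below uses synthetic data; the names `toy`, `toy_features`, and `z_profile`, and the columns `income` and `visits`, are illustrative assumptions rather than part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df with a 'cluster' column (random labels, toy values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, 200),
    "visits": rng.poisson(5, 200).astype(float),
    "cluster": rng.integers(0, 3, 200),
})
toy_features = ["income", "visits"]

# Z-score each feature over the whole sample, then average within clusters
z = pd.DataFrame(
    StandardScaler().fit_transform(toy[toy_features]),
    columns=toy_features,
    index=toy.index,
)
z_profile = z.groupby(toy["cluster"]).mean().T
z_profile.columns = [f"Cluster {c}" for c in z_profile.columns]
print(z_profile.round(2))
```

On the real data, the same idea would apply to `df[feature_cols]` and `df['cluster']`; a diverging palette centred on zero, for example `sns.heatmap(z_profile, cmap='RdBu_r', center=0, annot=True, fmt='.2f')`, then makes above-average and below-average features easy to tell apart.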

## 8. Conclusion

| Cluster | Size | Interpretation |
|---|---|---|
| *(fill after running)* | | |

**Observations:**
- *(fill after running)*

**Next steps:**
- Try DBSCAN for density-based clustering
- Add categorical features via Gower distance
- Use clusters for downstream classification/regression tasks
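
As a starting point for the DBSCAN suggestion, here is a minimal sketch on synthetic blobs. Everything in it is an assumption for illustration: the toy data from `make_blobs`, `min_samples = 5`, and especially the `eps` choice, which takes the 95th percentile of each point's distance to its `min_samples`-th nearest neighbour (counting the point itself) as a crude substitute for reading the knee of a sorted k-distance plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X_scaled
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)

min_samples = 5  # assumed value; tune for the real data

# Distance from each point to its min_samples-th neighbour; querying the
# training data returns the point itself as the first neighbour.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)
k_dist = np.sort(distances[:, -1])

# Crude heuristic: a high quantile of the k-distances instead of a visual knee
eps = float(np.quantile(k_dist, 0.95))

db_labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_toy)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
n_noise = int((db_labels == -1).sum())
print(f"eps={eps:.3f}, clusters={n_clusters}, noise points={n_noise}")
```

Unlike K-Means, DBSCAN does not take a cluster count and may leave some customers unassigned. Note that `silhouette_score` requires at least two distinct labels, so check `n_clusters` (and decide how to treat noise points) before scoring the result.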