# 🛍️ Real-World Clustering Application: Customer Segmentation

This notebook demonstrates clustering on the **Mall Customers dataset** (available on Kaggle).

We will:
1. Load the dataset
2. Select and standardize the features
3. Apply dimensionality reduction
4. Cluster customers using KMeans and HDBSCAN
5. Evaluate and visualize results

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from src.clustering import run_kmeans, run_hdbscan
from src.evaluation import evaluate_clustering
from src.visualization import plot_embedding, plot_clusters

# === 1. Load dataset ===
### Download from: https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python


In [None]:
df = pd.read_csv('data/Mall_Customers.csv')
df.head()

# === 2. Select features ===

In [None]:
X = df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]].values
X_scaled = StandardScaler().fit_transform(X)
print('Shape:', X_scaled.shape)
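
The effect of `StandardScaler` can be checked directly. The sketch below uses synthetic values in roughly the same ranges as the three features (an assumption for illustration, not the real dataset; `X_demo` is a hypothetical name) and confirms that each scaled column ends up with mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three customer features (not the real dataset):
# age in years, annual income in k$, spending score on a 1-100 scale.
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.integers(18, 71, size=200),
    rng.integers(15, 138, size=200),
    rng.integers(1, 101, size=200),
]).astype(float)

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, every column has mean ~0 and standard deviation ~1,
# so no single feature dominates Euclidean distances.
col_means = X_demo_scaled.mean(axis=0)
col_stds = X_demo_scaled.std(axis=0)
print("means:", np.round(col_means, 6))
print("stds: ", np.round(col_stds, 6))
```

Without this step, income (tens to hundreds of k$) would likely outweigh age in any distance-based method such as KMeans.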

# === 3. PCA for 2D embedding ===


In [None]:
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_scaled)
plot_embedding(X_pca, labels=None, title='PCA embedding of customers').show()
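
Before clustering on a 2D projection, it can help to check how much variance the two components retain. A minimal sketch using `explained_variance_ratio_`, run on synthetic correlated data (the data and the `*_demo` names are assumptions for illustration, not values from the Mall Customers dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 3-feature data with one correlated pair (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[:, 2] += 0.8 * X_demo[:, 0]
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2)
X_demo_pca = pca_demo.fit_transform(X_demo_scaled)

# Fraction of total variance captured by each kept component, and their sum.
ratios = pca_demo.explained_variance_ratio_
retained = float(ratios.sum())
print("per component:", np.round(ratios, 3))
print("retained:", round(retained, 3))
```

If the retained fraction is low, clusters found in the 2D embedding may not reflect structure in the full feature space.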

# === 4. Clustering ===

In [None]:
labels_kmeans, _ = run_kmeans(X_pca, n_clusters=5)
labels_hdbscan = run_hdbscan(X_pca, min_cluster_size=20)
plot_clusters(X_pca, labels_kmeans,
              title='KMeans Clusters (PCA-reduced)').show()
plot_clusters(X_pca, labels_hdbscan,
              title='HDBSCAN Clusters (PCA-reduced)').show()
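
`run_kmeans` lives in `src.clustering` and is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming it wraps scikit-learn's `KMeans` and returns the labels together with the fitted model. The name `run_kmeans_sketch`, the return shape, and the synthetic blobs are all assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def run_kmeans_sketch(X, n_clusters, random_state=42):
    """Hypothetical helper: fit KMeans and return (labels, fitted model)."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = model.fit_predict(X)
    return labels, model


# Synthetic 2D points standing in for the PCA embedding (not the real data).
X_demo, _ = make_blobs(n_samples=200, centers=5, random_state=0)
labels_demo, model_demo = run_kmeans_sketch(X_demo, n_clusters=5)

# Cluster sizes and within-cluster sum of squares (inertia) for a quick sanity check.
sizes = sorted(int((labels_demo == k).sum()) for k in set(labels_demo.tolist()))
print("cluster sizes:", sizes)
print("inertia:", round(float(model_demo.inertia_), 2))
```

Fixing `random_state` keeps the assignment reproducible across runs, which matters when comparing against another method on the same embedding.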

# === 5. Evaluation ===

In [None]:
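print('KMeans metrics:', evaluate_clustering(X_pca, labels_kmeans))
# HDBSCAN labels may include -1 (noise); how those points are scored
# depends on evaluate_clustering in src.evaluation.
print('HDBSCAN metrics:', evaluate_clustering(X_pca, labels_hdbscan))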
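
The implementation of `evaluate_clustering` is not shown here. One plausible approach, sketched below under that caveat, computes the silhouette and Davies-Bouldin scores on non-noise points only and reports the noise fraction separately. The function name `evaluate_clustering_sketch`, the returned keys, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clustering_sketch(X, labels):
    """Hypothetical stand-in for evaluate_clustering; noise points (-1) are excluded."""
    labels = np.asarray(labels)
    mask = labels != -1  # keep only points assigned to a cluster
    n_clusters = len(set(labels[mask].tolist()))
    result = {
        "n_clusters": n_clusters,
        "noise_fraction": float(1.0 - mask.mean()),
        "silhouette": None,
        "davies_bouldin": None,
    }
    # Both scores are undefined for fewer than two clusters.
    if n_clusters >= 2:
        result["silhouette"] = float(silhouette_score(X[mask], labels[mask]))
        result["davies_bouldin"] = float(davies_bouldin_score(X[mask], labels[mask]))
    return result


# Synthetic blobs with the first 10 points relabeled as noise (not real results).
X_demo, y_demo = make_blobs(n_samples=200, centers=3, random_state=0)
labels_demo = y_demo.copy()
labels_demo[:10] = -1
metrics_demo = evaluate_clustering_sketch(X_demo, labels_demo)
print(metrics_demo)
```

Reporting the noise fraction alongside the scores matters when comparing methods: dropping hard-to-assign points tends to raise the silhouette score, so a high score with a large noise fraction is not directly comparable to a KMeans score computed on every point.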

# ✅ Conclusion
- KMeans partitions customers into a **fixed number of groups** (5 here, set via `n_clusters`).
- HDBSCAN determines the number of clusters **adaptively** and may label outliers as noise.
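- PCA reduces the three scaled features to two components, which makes the clusters easy to plot; whether it also improves separability depends on how much variance the discarded component carries.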