### 🧩 Problem Statement
- **Problem:** How do we interpret the results of K-Means clustering for business stakeholders?
- **Why it matters:** Raw cluster centers are expressed in scaled units (e.g., a z-score of 1.2), which mean little to marketing teams. We must translate them back into original units: dollars of income and spending scores.

### 🪜 Steps to Solve the Problem
1. Load Data
2. Scale Features (StandardScaler)
3. Fit K-Means ($K=5$)
4. **Inverse Transform Centroids** (The Key Step)
5. Profile and Visualize

### 🎯 Expected Output
- A Cluster Profile Table with readable averages.
- A PCA Plot showing the logical separation of customers.

### 🔹 Imports
#### 2.1 What the line does
Imports necessary libraries for data manipulation (pandas), math (numpy), plotting (matplotlib, seaborn), and machine learning (sklearn).
#### 2.2 Why it is used
We need `pandas` for tables, `StandardScaler` for preprocessing, `KMeans` for clustering, and `PCA` for visualization.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import os

### 🔹 Load Dataset
#### 2.1 What the line does
Loads 'Mall_Customers.csv' if it is available; otherwise creates a synthetic dataset for demonstration.
#### 2.2 Why it is used
To provide the input data for our analysis.


In [None]:
# Load the dataset from a local path if it exists; otherwise build sample data
data_path = '../data/Mall_Customers.csv'
if os.path.exists(data_path):
    df = pd.read_csv(data_path)
    print(f"Loaded dataset from {data_path}")
else:
    print("Dataset not found. Creating sample Mall Customer data.")
    np.random.seed(42)
    n_samples = 200
    ids = np.arange(1, n_samples + 1)
    genders = np.random.choice(['Male', 'Female'], n_samples)
    ages = np.random.randint(18, 70, n_samples)
    income = np.concatenate([
        np.random.normal(25, 5, 40), np.random.normal(55, 10, 80), np.random.normal(90, 10, 80)
    ]).astype(int)
    score = np.concatenate([
        np.random.normal(80, 10, 40), np.random.normal(50, 10, 80), 
        np.random.normal(20, 10, 40), np.random.normal(85, 10, 40)
    ]).astype(int)
    
    df = pd.DataFrame({
        'CustomerID': ids,
        'Gender': genders,
        'Age': ages,
        'Annual Income (k$)': income,
        'Spending Score (1-100)': score
    })

### 🔹 Feature Scaling
#### 2.1 What the line does
Selects the relevant features (Annual Income and Spending Score) and scales each to Mean=0, Std=1.
#### 2.2 Why it is used
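K-Means assigns points by Euclidean distance. Income (roughly 0-140 k$) spans a wider range than Score (1-100), so unscaled income differences would contribute more to each distance. Scaling puts both features on the same footing.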
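
As a rough illustration of what the scaler does, the sketch below standardizes three made-up customers (the values are illustrative and not taken from the dataset) and compares the result with the formula $z = (x - \mu) / \sigma$ computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only (not taken from Mall_Customers.csv)
# Columns: income (k$), spending score
X_demo = np.array([
    [15.0, 39.0],
    [60.0, 50.0],
    [137.0, 83.0],
])

demo_scaler = StandardScaler()
X_demo_scaled = demo_scaler.fit_transform(X_demo)

# StandardScaler computes z = (x - mean) / std per column,
# using the population standard deviation (ddof=0)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(X_demo_scaled, manual))
print(X_demo_scaled.mean(axis=0).round(6))  # per-column mean after scaling
print(X_demo_scaled.std(axis=0).round(6))   # per-column std after scaling
```

After scaling, a one-unit difference means "one standard deviation" for both features, so neither dominates the distance calculation.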


In [None]:
features = ['Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### 🔹 K-Means Clustering
#### 2.1 What the line does
Initializes K-Means with 5 clusters and fits it on the scaled features. $K=5$ is assumed to suit this data; an elbow or silhouette check can confirm it.
#### 2.6 How it works internally
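It places 5 initial centroids (scikit-learn seeds them with k-means++ by default rather than purely at random), assigns each point to its nearest centroid, moves each centroid to the mean of its assigned points, and repeats until the centroids stop moving.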
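
The loop described above can be sketched in plain NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `lloyd_kmeans` is hypothetical, it uses plain random initialization instead of k-means++, and the toy data is synthetic.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones K-Means with random initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (made-up data)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = lloyd_kmeans(X_toy, k=2)
print(centroids.round(2))
```

In practice `KMeans` is preferable: it adds smarter seeding and multiple restarts (`n_init`), and keeps the best run.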


In [None]:
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
df['Cluster'] = kmeans.fit_predict(X_scaled)

### 🔹 Inverse Transformation (VITAL STEP)
#### 2.1 What the line does
Takes the centroid coordinates (which are in Z-score format, e.g., 1.5) and transforms them back to original units (e.g., 90k).
#### 2.2 Why it is used
Business stakeholders cannot interpret Z-scores. They need to see real values to name the clusters.
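
For intuition, a fitted `StandardScaler` stores the per-feature mean and standard deviation in `mean_` and `scale_`, and `inverse_transform` applies $x = z \cdot \sigma + \mu$. The sketch below checks this on made-up numbers; the data and the centroid are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up incomes (k$) and spending scores, for illustration only
X_demo = np.array([[20.0, 80.0], [60.0, 50.0], [100.0, 20.0]])

demo_scaler = StandardScaler().fit(X_demo)

# A hypothetical centroid expressed in z-score units
centroid_z = np.array([[1.0, -1.0]])

# inverse_transform undoes the scaling: x = z * std + mean
by_sklearn = demo_scaler.inverse_transform(centroid_z)
by_hand = centroid_z * demo_scaler.scale_ + demo_scaler.mean_

print(by_sklearn)
print(np.allclose(by_sklearn, by_hand))
```

Because the transformation is linear, the back-transformed centroid equals the average of the original (unscaled) values of the points in that cluster.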


In [None]:
centroids_scaled = kmeans.cluster_centers_
centroids_original = scaler.inverse_transform(centroids_scaled)

# Create Profile Table
cluster_profile = pd.DataFrame(centroids_original, columns=features)
cluster_profile['Cluster_ID'] = range(kmeans.n_clusters)  # avoid hardcoding K
cluster_profile['Count'] = df['Cluster'].value_counts().sort_index().values
cluster_profile['Percent'] = (cluster_profile['Count'] / len(df)) * 100
cluster_profile = cluster_profile[['Cluster_ID', 'Count', 'Percent'] + features].round(2)

print("--- Cluster Profile (Original Scale) ---")
print(cluster_profile)
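
Once the profile is in original units, clusters can be given readable names. The sketch below shows one possible rule, assuming simple cutoffs: the helper `name_segment`, the cutoff values, and the sample table are all hypothetical and would be replaced by thresholds agreed with the business.

```python
import pandas as pd

def name_segment(income, score, income_mid, score_mid):
    """Hypothetical naming rule: compare a centroid with chosen cutoffs."""
    income_part = "High Income" if income >= income_mid else "Low Income"
    score_part = "High Spend" if score >= score_mid else "Low Spend"
    return f"{income_part} / {score_part}"

# Made-up profile table using the same feature column names
profile_demo = pd.DataFrame({
    'Annual Income (k$)': [25.0, 88.0, 86.0],
    'Spending Score (1-100)': [79.0, 17.0, 82.0],
})

# Assumed cutoffs; in practice these might be dataset averages
# or thresholds supplied by the marketing team
income_mid, score_mid = 60.0, 50.0

profile_demo['Segment'] = [
    name_segment(inc, sc, income_mid, score_mid)
    for inc, sc in zip(profile_demo['Annual Income (k$)'],
                       profile_demo['Spending Score (1-100)'])
]
print(profile_demo)
```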

### 🔹 PCA Visualization
#### 2.1 What the line does
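Projects the scaled features onto 2 principal components. With only two input features (Income, Score), PCA simply rotates the axes and discards no information; the same code reduces higher-dimensional data (e.g., 10 features) to 2D for plotting.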
#### 2.2 Why it is used
To plot the clusters on a flat screen and inspect separation.
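
When there are more than two features, it helps to check how much variance the 2D projection keeps. A minimal sketch on synthetic 10-feature data (made up for illustration) follows; `explained_variance_ratio_` is the fitted PCA attribute that reports the share of variance per component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 10-feature data (made up for illustration)
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(200, 10))

X_wide_scaled = StandardScaler().fit_transform(X_wide)
pca_demo = PCA(n_components=2)
X_2d = pca_demo.fit_transform(X_wide_scaled)

print(X_2d.shape)  # (200, 2)
# Share of total variance kept by the two components;
# a low value means the 2D plot hides much of the structure
print(pca_demo.explained_variance_ratio_.sum())
```

If the retained share is low, apparent overlap between clusters in the plot may be an artifact of the projection rather than of the clustering.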


In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
df['PCA1'] = X_pca[:, 0]
df['PCA2'] = X_pca[:, 1]

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='PCA1', y='PCA2', hue='Cluster', palette='viridis', s=100, alpha=0.8)

# Annotate Centroids
centroids_pca = pca.transform(centroids_scaled)
for i in range(kmeans.n_clusters):
    plt.text(centroids_pca[i, 0], centroids_pca[i, 1]+0.2, f'Cluster {i}', 
             fontsize=12, fontweight='bold', color='black', ha='center')

plt.title('Customer Segments (PCA Projection)')
plt.legend()
plt.show()