# 🧩 Problem Statement

## What is the Problem?
Imagine you are the manager of a big shopping mall. You have data about hundreds of customers: their age, how much money they make, and how much they spend. You want to group these customers into "clusters" so you can send them special offers. For example, "Big Spenders" might get VIP coupons, while "Careful Savers" might get discount codes.

The problem is: **We don't know how many groups (clusters) there should be.** Should we have 2 groups? 3? 5? 

If we pick the wrong number, the groups won't make sense. We need a mathematical way to check if our groups are "good" or "bad."

## Real-Life Analogy
Think of a school cafeteria. You want to group students so that friends sit together.
- **Good Grouping**: All the football players at one table, all the chess club members at another. Everyone is happy and close to their friends.
- **Bad Grouping**: You mix half the football players with half the chess club. Students feel "out of place" and want to move.

**Silhouette Score** is a measure of how "happy" a data point is in its group. It checks:
1. Is it close to its own group members? (Cohesion)
2. Is it far away from other groups? (Separation)
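
To make cohesion and separation concrete, here is a minimal sketch that computes the silhouette coefficient of each point by hand and compares it with scikit-learn's `silhouette_samples`. The toy data and the helper name `silhouette_by_hand` are assumptions made for illustration; they are not part of the lesson's dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two tight groups on a number line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels, i):
    """Silhouette coefficient of point i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(X - X[i], axis=1)  # distance from point i to every point
    own = labels == labels[i]
    own[i] = False                            # a point is not its own neighbour
    a = dists[own].mean()                     # cohesion: mean distance inside its group
    b = min(                                  # separation: mean distance to the nearest other group
        dists[labels == other].mean()
        for other in np.unique(labels)
        if other != labels[i]
    )
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(X_toy, labels_toy, i) for i in range(len(X_toy))])
reference = silhouette_samples(X_toy, labels_toy)
print(manual.round(3))
print(reference.round(3))
```

Both printouts should agree. Every point scores close to 1 here because each one is much nearer to its own group than to the other group.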

## 🪜 Steps to Solve the Problem

1.  **Get the Data**: We will use a "Mall Customers" dataset with Age, Income, and Spending Score.
2.  **Prepare the Data (Feature Engineering)**: We will select the features to cluster on (Income, Spending, Age) and "scale" them to a common range (0 to 1) so that no single feature dominates.
3.  **Try Different Groups (K-Means)**: We will try grouping customers into 2, 3, 4, and 5 groups.
4.  **Check the Quality (Silhouette Score)**: For each number of groups, we will calculate the Silhouette Score to see how well-separated the groups are.
5.  **Visualize**: We will draw charts to see the "shadow" (silhouette) of each cluster.
6.  **Decide**: We will pick the best number of groups and explain why.

## 🎯 Expected Output

1.  **A Clean Dataset**: Prepared numbers ready for the computer.
2.  **Silhouette Plots**: Beautiful diagrams showing the shape of our clusters.
3.  **A Decision**: A final recommendation (e.g., "We should use 5 groups").
4.  **Comparison Table**: Showing Inertia (another metric) vs. Silhouette Score.

### 🔹 Import Libraries
#### 2.1 What the line does
Imports the tools we need: `pandas` for tables, `numpy` for math, `matplotlib` and `seaborn` for plotting, `sklearn` for machine learning, and `os` for file-system operations.
#### 2.2 Why it is used
We need these specific tools to handle data and run algorithms. Python's standard library does not include K-Means.
#### 2.3 When to use it
At the very beginning of your notebook.
#### 2.4 Where to use it
Every data science project.
#### 2.5 How to use it
`import pandas as pd`
#### 2.6 How it works internally
Loads the code from these libraries into memory so we can use their functions.
#### 2.7 Output
No visible output, but the tools are now ready.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples
from sklearn.preprocessing import MinMaxScaler
import os

### 🔹 Environment Setup
#### 2.1 What the line does
Creates a folder named `outputs` if it doesn't already exist.
#### 2.2 Why it is used
To keep our project organized. We want to save our plots and tables in one place.
#### 2.3 When to use it
Before generating any files.
#### 2.4 Where to use it
Any script that produces artifacts.
#### 2.5 How to use it
`os.makedirs(path, exist_ok=True)`
#### 2.6 How it works internally
Asks the Operating System to check the file system and create a directory entry.
#### 2.7 Output
A new folder appears in your project directory.

In [None]:
output_dir = r"outputs"
os.makedirs(output_dir, exist_ok=True)
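
The effect of `exist_ok=True` can be checked with a small sketch. It assumes a throwaway temporary directory (the name `demo_dir` is hypothetical) so that nothing is left behind in the project folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    demo_dir = os.path.join(tmp, "outputs")

    os.makedirs(demo_dir, exist_ok=True)  # first call creates the folder
    os.makedirs(demo_dir, exist_ok=True)  # second call is a harmless no-op
    created = os.path.isdir(demo_dir)

    # Without exist_ok=True, an existing folder raises FileExistsError.
    try:
        os.makedirs(demo_dir)
        raised = False
    except FileExistsError:
        raised = True

print(created, raised)
```

This is why the cell above can be re-run any number of times without failing.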

### 🔹 Data Loading / Generation Function
#### 2.1 What the line does
Defines a function `load_or_create_data` that generates a simulated Mall Customers dataset. Despite its name, the function never loads a file; it always creates the data.
#### 2.2 Why it is used
To ensure we have data to work with, even if we can't download the file. It also makes the lesson reproducible.
#### 2.3 When to use it
When creating tutorials or testing code without dependency on external files.
#### 2.4 Where to use it
Teaching environments, unit tests.
#### 2.5 How to use it
`df = load_or_create_data()`
#### 2.6 How it works internally
Seeds `numpy`'s random number generator, draws Age as random integers, draws Income and Spending Score from three normal distributions each (so the data has built-in groups), and then builds a DataFrame.
#### 2.7 Output
A Pandas DataFrame containing customer data.

In [None]:
def load_or_create_data():
    """
    Creates a simulated Mall Customer dataset for teaching purposes.
    """
    np.random.seed(42)
    n_samples = 200
    
    customer_ids = np.arange(1, n_samples + 1)
    age = np.random.randint(18, 70, n_samples)
    
    # Generate income with clusters
    income = np.concatenate([
        np.random.normal(30, 10, 60),
        np.random.normal(70, 15, 80),
        np.random.normal(110, 10, 60)
    ])
    
    # Generate spending with clusters
    spending = np.concatenate([
        np.random.normal(80, 10, 60),
        np.random.normal(50, 15, 80),
        np.random.normal(20, 10, 60)
    ])
    
    df = pd.DataFrame({
        'CustomerID': customer_ids,
        'Age': age,
        'Annual Income (k$)': np.abs(income).astype(int),
        'Spending Score (1-100)': np.clip(spending, 1, 100).astype(int)
    })
    
    return df
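
The reproducibility claim rests on the fixed seed. A minimal sketch of that idea, using the same `np.random.seed` and `np.random.randint` calls as the function above (the variable names `first` and `second` are illustrative):

```python
import numpy as np

# Re-seeding with the same value replays the same sequence of random numbers.
np.random.seed(42)
first = np.random.randint(18, 70, 5)

np.random.seed(42)
second = np.random.randint(18, 70, 5)

print(first)
print(np.array_equal(first, second))
```

Note that the upper bound of `np.random.randint` is exclusive, so the generated ages run from 18 to 69.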

### 🔹 Use the Function
#### 2.1 What the line does
Calls the function we just created to get the data.
#### 2.2 Why it is used
To actually execute the data generation logic.
#### 2.3 When to use it
After defining the function.
#### 2.4 Where to use it
Main execution flow.
#### 2.5 How to use it
`df.head()` inspects the first 5 rows.
#### 2.6 How it works internally
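Runs the function body and binds the returned DataFrame to `df`. Because `df.head()` is the last expression in the cell, the notebook displays its result.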
#### 2.7 Output
A table showing the first few customers.

In [None]:
print("LOADING DATA...")
df = load_or_create_data()
print("Data Loaded Successfully!")
df.head()

### 🔹 Feature Engineering
#### 2.1 What the line does
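Creates a new feature `Spending_to_Income_Ratio`, then selects the three columns used for clustering. Note that the ratio column is not in the `features` list, so it does not influence the clusters in this lesson.
#### 2.2 Why it is used
Sometimes the **relationship** between numbers is more important than the numbers themselves. A person earning 100k with a spending score of 50 is different from a person earning 50k with a spending score of 50.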
#### 2.3 When to use it
To capture complex behaviors.
#### 2.4 Where to use it
Financial analysis, fraud detection.
#### 2.5 How to use it
`df['new'] = df['a'] / df['b']`
#### 2.6 How it works internally
Performs element-wise division. The `+ 1` in the denominator prevents division by zero when a customer's income is 0.
#### 2.7 Output
A new column added to the dataframe.

In [None]:
df['Spending_to_Income_Ratio'] = df['Spending Score (1-100)'] / (df['Annual Income (k$)'] + 1)

features = ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']
X = df[features]
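
As a quick illustration of the ratio, here is a sketch with two hypothetical customers who share a spending score of 50 but earn 100k and 50k. The `demo` DataFrame is an assumption for this example only.

```python
import pandas as pd

# Two hypothetical customers: same spending score, different income.
demo = pd.DataFrame({
    'Annual Income (k$)': [100, 50],
    'Spending Score (1-100)': [50, 50],
})

# Same formula as above; the +1 guards against division by zero.
demo['Spending_to_Income_Ratio'] = (
    demo['Spending Score (1-100)'] / (demo['Annual Income (k$)'] + 1)
)

print(demo)
```

The lower-income customer ends up with roughly twice the ratio, which is the difference the raw spending score alone would hide.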

### 🔹 Scaling Features
#### 2.1 What the line does
Scales all features to be between 0 and 1.
#### 2.2 Why it is used
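K-Means relies on Euclidean distances. A feature with a wide range, such as 'Income' (roughly 0-140), contributes more to those distances than a feature with a narrower range, such as 'Age' (18-69). Scaling gives every feature comparable weight.
#### 2.3 When to use it
Whenever a model relies on distances between points, such as K-Means, SVM, or KNN.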
#### 2.4 Where to use it
Preprocessing pipelines.
#### 2.5 How to use it
`MinMaxScaler().fit_transform(X)`
#### 2.6 How it works internally
Formula: `(value - min) / (max - min)`
#### 2.7 Output
A numpy array of scaled values.

In [None]:
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
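
The formula `(value - min) / (max - min)` can be verified by hand. The sketch below applies it column by column to a small hypothetical array (`X_demo`, standing in for Age and Income) and compares the result with `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical values for two columns: Age and Income.
X_demo = np.array([
    [18.0, 15.0],
    [44.0, 70.0],
    [70.0, 140.0],
])

# Apply the formula per column (axis=0).
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual_scaled = (X_demo - col_min) / (col_max - col_min)

sklearn_scaled = MinMaxScaler().fit_transform(X_demo)

print(manual_scaled)
print(np.allclose(manual_scaled, sklearn_scaled))
```

After scaling, each column's smallest value is 0 and its largest is 1, regardless of the original units.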

### 🔹 K-Means Loop Initialization
#### 2.1 What the line does
Prepares lists to store our results (scores and inertia) and defines the range of K (2, 3, 4, 5) we want to test.
#### 2.2 Why it is used
We don't know the best K yet, so we have to try multiple possibilities.
#### 2.3 When to use it
Hyperparameter tuning.
#### 2.4 Where to use it
Model selection phase.
#### 2.5 How to use it
Create empty lists `[]`.
#### 2.6 How it works internally
Allocates memory for lists.
#### 2.7 Output
Empty lists ready to be filled.

In [None]:
K_values = [2, 3, 4, 5]
silhouette_scores = []
inertia_values = []

### 🔹 The Main Loop: K-Means & Silhouette
This loop does the heavy lifting:
1.  **Train K-Means** for each K.
2.  **Calculate Silhouette Score**: How well separated are the clusters?
3.  **Calculate Inertia**: How tight are the clusters?
4.  **Plot Silhouette Diagram**: A detailed view of each cluster's quality.

**Note on Silhouette Score arguments:**
- **X**: The data itself.
- **labels**: The group assigned by K-Means.
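- **Returns**: The average silhouette coefficient over all data points, a number between -1 (points sit in the wrong group) and 1 (groups are compact and well separated). Values near 0 mean the groups overlap.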
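
A small sketch of how the two arguments affect the returned value. The four points and both labelings below are hypothetical; the only library calls used are `silhouette_score` and `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: two points near the origin, two points near (5, 5).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

good_labels = np.array([0, 0, 1, 1])  # neighbours share a group
bad_labels = np.array([0, 1, 0, 1])   # each group mixes both regions

good_score = silhouette_score(X_demo, good_labels)
bad_score = silhouette_score(X_demo, bad_labels)

# The overall score is the mean of the per-point coefficients.
mean_of_samples = silhouette_samples(X_demo, good_labels).mean()

print(f"good: {good_score:.3f}, bad: {bad_score:.3f}")
```

The sensible labeling scores close to 1, while the mixed labeling scores below 0, matching the "wrong group" end of the scale.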

In [None]:
for k in K_values:
    # Initialize and fit K-Means (n_init='auto' requires scikit-learn >= 1.2)
    kmeans = KMeans(n_clusters=k, random_state=42, n_init='auto')
    kmeans.fit(X_scaled)
    cluster_labels = kmeans.labels_
    
    # Calculate Metrics
    score = silhouette_score(X_scaled, cluster_labels)
    silhouette_scores.append(score)
    inertia_values.append(kmeans.inertia_)
    
    # PLOTTING SILHOUETTE DIAGRAM
    fig, ax1 = plt.subplots(1, 1)
    fig.set_size_inches(8, 6)
    ax1.set_xlim([-0.1, 1])
    ax1.set_ylim([0, len(X_scaled) + (k + 1) * 10])
    
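    # One silhouette coefficient per data point
    sample_silhouette_values = silhouette_samples(X_scaled, cluster_labels)
    y_lower = 10  # leave a gap below the first cluster

    for i in range(k):
        # Collect and sort the coefficients of the points in cluster i
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        # Draw the cluster as a filled band: one row per point, width = coefficient
        color = plt.cm.nipy_spectral(float(i) / k)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the band with its cluster number, then leave a gap before the next one
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10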
        
    ax1.set_title(f"Silhouette Plot for K={k}")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    ax1.axvline(x=score, color="red", linestyle="--")  # average silhouette score for this K
    ax1.set_yticks([])
    plt.show()
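
Inertia, the second metric collected in the loop, is the sum of squared distances from each point to its assigned cluster center. The sketch below recomputes it by hand on hypothetical two-blob data (`X_demo`) and compares it with `kmeans.inertia_`. It uses `n_init=10` instead of `'auto'` only so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two compact blobs in the unit square.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.2, 0.05, size=(30, 2)),
    rng.normal(0.8, 0.05, size=(30, 2)),
])

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_demo)

# Look up each point's assigned center, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X_demo - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

Because adding clusters can only bring points closer to a center, inertia keeps falling as K grows, which is why it is read alongside the silhouette score rather than on its own.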

### 🔹 Comparison Table
#### 2.1 What the line does
Puts our findings into a neat table.
#### 2.2 Why it is used
Numbers are easier to read in rows and columns than in a list.
#### 2.3 When to use it
Reporting final results.
#### 2.4 Where to use it
Dashboards, reports.
#### 2.5 How to use it
`pd.DataFrame({'col': list})`
#### 2.6 How it works internally
Constructs a structured object from dictionaries.
#### 2.7 Output
A table showing K, Inertia, and Score.

In [None]:
comparison_df = pd.DataFrame({
    'K': K_values,
    'Inertia': inertia_values,
    'Silhouette Score': silhouette_scores
})

comparison_df

### 🔹 Visual Comparison (Inertia vs Score)
#### 2.1 What the line does
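Plots two different metrics on the same chart using two Y-axes.
#### 2.2 Why it is used
Inertia and Silhouette Score live on different scales. The Silhouette Score always lies between -1 and 1, while inertia is a sum of squared distances that has no upper bound and depends on the data. A dual-axis plot lets us read them together.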
#### 2.3 When to use it
Comparing metrics with different units.
#### 2.4 Where to use it
Advanced plotting.
#### 2.5 How to use it
`ax2 = ax1.twinx()`
#### 2.6 How it works internally
Creates a transparent overlay plot that shares the X-axis.
#### 2.7 Output
A chart with lines for both metrics.

In [None]:
fig, ax1 = plt.subplots(figsize=(10, 6))

color = 'tab:red'
ax1.set_xlabel('Number of Clusters (K)')
ax1.set_ylabel('Inertia (Lower is Better)', color=color)
ax1.plot(K_values, inertia_values, color=color, marker='o')
ax1.tick_params(axis='y', labelcolor=color)

ax2 = ax1.twinx()
color = 'tab:blue'
ax2.set_ylabel('Silhouette Score (Higher is Better)', color=color)
ax2.plot(K_values, silhouette_scores, color=color, marker='o', linestyle='--')
ax2.tick_params(axis='y', labelcolor=color)

plt.title('Inertia vs Silhouette Score for Different K')
plt.show()

### 🔹 Final Interpretation
#### 2.1 What the line does
Automatically finds the best K and prints a decision.
#### 2.2 Why it is used
To automate the decision-making process.
#### 2.3 When to use it
At the end of analysis.
#### 2.4 Where to use it
Any model-selection step that ends with a recommendation.
#### 2.5 How to use it
`np.argmax(scores)` finds the index of the highest score.
#### 2.6 How it works internally
Scans the array for the max value and returns its position.
#### 2.7 Output
A text explaining the result.

In [None]:
best_k_idx = np.argmax(silhouette_scores)
best_k = K_values[best_k_idx]
best_score = silhouette_scores[best_k_idx]

print(f"Best K is {best_k} with Silhouette Score of {best_score:.3f}")
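
One detail of `np.argmax` is worth seeing in isolation: on a tie it returns the first position. The scores in this sketch are hypothetical values chosen to show that behaviour, not results from the dataset.

```python
import numpy as np

K_demo = [2, 3, 4, 5]
scores_demo = [0.41, 0.52, 0.47, 0.52]  # hypothetical scores with a tie at K=3 and K=5

best_idx = int(np.argmax(scores_demo))  # index of the first maximum
best_k_demo = K_demo[best_idx]

print(best_idx, best_k_demo)
```

With `K_values` listed in ascending order, a tie is therefore resolved in favour of the smaller number of clusters.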