# Clustering and Classification
## Side-by-Side Comparison with Fake Customer Data

This notebook demonstrates:
1. **Clustering (Unsupervised)**: Find natural groups without labels
2. **Classification (Supervised)**: Predict labels using a labeled dataset

## Setup

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Set seed for reproducibility
np.random.seed(42)

## Create Fake Customer Data

100 customers with two features:
- **Age**: 20-60 years
- **Annual Spending**: \$1,000 - \$100,000

The data is generated from 3 natural groups (but we'll pretend not to know that during clustering).

In [None]:
# Create 3 natural customer groups

# Group 1: Young, low spenders (students, budget-conscious)
group1_age = np.random.normal(25, 3, 30)
group1_spending = np.random.normal(5000, 1000, 30)

# Group 2: Middle-aged, medium spenders (working professionals)
group2_age = np.random.normal(40, 5, 40)
group2_spending = np.random.normal(30000, 5000, 40)

# Group 3: Older, high spenders (wealthy, established)
group3_age = np.random.normal(55, 4, 30)
group3_spending = np.random.normal(60000, 8000, 30)

# Combine all groups
age = np.concatenate([group1_age, group2_age, group3_age])
spending = np.concatenate([group1_spending, group2_spending, group3_spending])
spending = np.maximum(spending, 1000)  # Enforce a $1,000 floor (also rules out negative spending)

# Create DataFrame
df = pd.DataFrame({
    'age': age,
    'spending': spending
})

# For classification later: create TRUE labels (high/low spender)
# We'll hide these during clustering, then use them for classification
df['high_spender'] = (df['spending'] > df['spending'].median()).astype(int)

print(f"Dataset: {len(df)} customers")
print(f"Age range: {df['age'].min():.1f} - {df['age'].max():.1f}")
print(f"Spending range: ${df['spending'].min():,.0f} - ${df['spending'].max():,.0f}")
print("\nFirst 5 customers:")
print(df.head())

## Visualize Raw Data

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(df['age'], df['spending'], alpha=0.6, s=100)
plt.xlabel('Age', fontsize=12)
plt.ylabel('Annual Spending ($)', fontsize=12)
plt.title('100 Customers (unlabeled)', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Question 1: Are there natural groups here?")
print("Question 2: What are they? (we'll find out with clustering)")

---
# PART 1: CLUSTERING (Unsupervised)
## Find natural groups WITHOUT labels

## Step 1: Elbow Method - Find the Right k

In [None]:
# Try k = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
k_values = range(1, 11)
wcss = []  # Within-Cluster Sum of Squares

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(df[['age', 'spending']])
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS

# Print results
print("K-Means Results:")
print("-" * 40)
for k, w in zip(k_values, wcss):
    print(f"k={k:2d}: WCSS = {w:12,.0f}")
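
To make the `inertia_` value less opaque, the sketch below recomputes WCSS by hand (the sum of squared distances from each point to its assigned centroid) and compares it with the value scikit-learn reports. It uses a small hypothetical stand-in dataset (`X_demo`) rather than the notebook's `df`, and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in data: three tight blobs in two dimensions
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=[25, 5], scale=1.0, size=(20, 2)),
    rng.normal(loc=[40, 30], scale=1.0, size=(20, 2)),
    rng.normal(loc=[55, 60], scale=1.0, size=(20, 2)),
])

km_demo = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_wcss = float(np.sum((X_demo - assigned_centers) ** 2))

print(f"Manual WCSS:     {manual_wcss:.4f}")
print(f"kmeans.inertia_: {km_demo.inertia_:.4f}")
```

The two numbers should agree up to floating-point error, which is why the loop above can read WCSS straight from `inertia_`.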

## Plot the Elbow

In [None]:
plt.figure(figsize=(10, 6))
plt.plot(k_values, wcss, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (k)', fontsize=12)
plt.ylabel('Within-Cluster Sum of Squares (WCSS)', fontsize=12)
plt.title('Elbow Method: Finding the Right k', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)

# Mark the elbow (visual inspection suggests k=3)
plt.axvline(x=3, color='red', linestyle='--', linewidth=2, label='Elbow at k=3')
plt.legend(fontsize=11)

plt.tight_layout()
plt.show()

print("Notice the sharp drop from k=1 to k=3, after which the curve flattens out.")
print("The 'elbow' is at k=3 → a reasonable choice for the number of clusters.")

## Step 2: Fit K-Means with k=3

In [None]:
# Train k-means with k=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
df['cluster'] = kmeans.fit_predict(df[['age', 'spending']])

# Get cluster centers
centers = kmeans.cluster_centers_

print("Cluster Centers:")
print("-" * 60)
for i, center in enumerate(centers):
    print(f"Cluster {i}: Age={center[0]:.1f}, Spending=${center[1]:,.0f}")

print("\nCluster Assignments:")
print(df[['age', 'spending', 'cluster']].head(15))
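
K-means groups points by Euclidean distance, so a feature measured in tens of thousands (spending) outweighs one measured in tens (age). A common remedy is to standardize the features first. The sketch below is an optional variant, not part of the original notebook: it regenerates similar hypothetical data (`demo`) and uses `StandardScaler`, an assumption on my part. Whether scaling is appropriate depends on how much each feature should count.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data with the same three-group shape as the notebook's customers
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'age': np.concatenate([rng.normal(25, 3, 30),
                           rng.normal(40, 5, 40),
                           rng.normal(55, 4, 30)]),
    'spending': np.concatenate([rng.normal(5000, 1000, 30),
                                rng.normal(30000, 5000, 40),
                                rng.normal(60000, 8000, 30)]),
})

# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(demo[['age', 'spending']])

km_scaled = KMeans(n_clusters=3, random_state=42, n_init=10)
demo['cluster_scaled'] = km_scaled.fit_predict(X_scaled)

# Convert the centroids back to years and dollars so they stay readable
centers_original_units = scaler.inverse_transform(km_scaled.cluster_centers_)
for i, (c_age, c_spend) in enumerate(centers_original_units):
    print(f"Cluster {i}: Age={c_age:.1f}, Spending=${c_spend:,.0f}")
```

Comparing these centroids with the unscaled ones shows how much the grouping depends on the units of each feature.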

## Visualize Clusters

In [None]:
plt.figure(figsize=(10, 6))

# Plot each cluster in a different color
colors = ['red', 'green', 'blue']
for i in range(3):
    cluster_points = df[df['cluster'] == i]
    plt.scatter(cluster_points['age'], cluster_points['spending'],
               c=colors[i], label=f'Cluster {i}', alpha=0.6, s=100)

# Plot cluster centers
plt.scatter(centers[:, 0], centers[:, 1],
           c='black', marker='X', s=400, edgecolors='white', linewidth=2,
           label='Centroids')

plt.xlabel('Age', fontsize=12)
plt.ylabel('Annual Spending ($)', fontsize=12)
plt.title('K-Means Clustering (k=3): Found 3 Natural Customer Groups', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

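print("\n✓ Clustering discovered 3 natural customer segments!")
# K-means numbers its clusters arbitrarily, so name them by centroid spending (low to high)
names = ['Young, low spenders', 'Middle-aged, medium spenders', 'Older, high spenders']
for name, i in zip(names, np.argsort(centers[:, 1])):
    print(f"  Cluster {i}: {name}")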

---
# PART 2: CLASSIFICATION (Supervised)
## Predict labels using a labeled dataset

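Now imagine someone labeled each customer as a "high spender" or "low spender" (above or below the median spending). We'll use this labeled data to train a classifier. Because the label is derived directly from `spending`, which is also an input feature, the task is deliberately easy: the tree only needs to learn a threshold near the median.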

In [None]:
# Split into train/test
X = df[['age', 'spending']]
y = df['high_spender']  # 1 = high spender, 0 = low spender

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train decision tree classifier
clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
train_acc = accuracy_score(y_train, clf.predict(X_train))
test_acc = accuracy_score(y_test, clf.predict(X_test))

print("Classification Results:")
print("-" * 40)
print(f"Training accuracy: {train_acc:.1%}")
print(f"Testing accuracy: {test_acc:.1%}")
print("\n✓ The model learned to predict 'high spender' vs 'low spender'")
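
One way to see what a tree like this learns is to print its rules. The sketch below is a self-contained illustration on hypothetical uniform data (`demo`), not the notebook's `df`. It uses `export_text` from `sklearn.tree`, and the labels are built from the median of `spending` as in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical customers; the label is derived from spending, as in the notebook
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    'age': rng.uniform(20, 60, 200),
    'spending': rng.uniform(1000, 100000, 200),
})
demo['high_spender'] = (demo['spending'] > demo['spending'].median()).astype(int)

features = ['age', 'spending']
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(demo[features], demo['high_spender'])

# Print the learned if/else rules and how much each feature contributed
print(export_text(tree, feature_names=features))
for name, importance in zip(features, tree.feature_importances_):
    print(f"{name}: importance = {importance:.2f}")
```

Because the label is a direct function of spending, the tree needs only one split on that feature and ignores age. High accuracy here reflects how the label was built more than it reflects predictive power.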

## Make Predictions

In [None]:
# Show some test predictions
sample_predictions = X_test.head(8).copy()
sample_predictions['predicted'] = clf.predict(X_test.head(8))
sample_predictions['actual'] = y_test.head(8).values
sample_predictions['predicted'] = sample_predictions['predicted'].map({1: 'High', 0: 'Low'})
sample_predictions['actual'] = sample_predictions['actual'].map({1: 'High', 0: 'Low'})

print(sample_predictions)
print("\n✓ The model predicted whether each customer is a high or low spender")

## Visualize Classification Decision Boundary

In [None]:
# Create a mesh to visualize the decision boundary
age_min, age_max = df['age'].min() - 5, df['age'].max() + 5
spending_min, spending_max = df['spending'].min() - 5000, df['spending'].max() + 5000

xx, yy = np.meshgrid(np.linspace(age_min, age_max, 100),
                      np.linspace(spending_min, spending_max, 100))

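# Wrap the grid in a DataFrame so the column names match the training data
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['age', 'spending'])
Z = clf.predict(grid)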
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 6))

# Plot decision boundary
plt.contourf(xx, yy, Z, alpha=0.3, levels=[0, 0.5, 1], colors=['lightblue', 'lightcoral'])
plt.contour(xx, yy, Z, colors='black', linewidths=2, levels=[0.5])

# Plot actual data points
low_spenders = df[df['high_spender'] == 0]
high_spenders = df[df['high_spender'] == 1]

plt.scatter(low_spenders['age'], low_spenders['spending'],
           c='blue', label='Low Spender', alpha=0.6, s=100)
plt.scatter(high_spenders['age'], high_spenders['spending'],
           c='red', label='High Spender', alpha=0.6, s=100)

plt.xlabel('Age', fontsize=12)
plt.ylabel('Annual Spending ($)', fontsize=12)
plt.title('Decision Tree Classification: Predicting High vs Low Spenders', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("The shaded regions show what the model predicts:")
print("  - Blue region: Low Spender")
print("  - Red region: High Spender")
print("  - Black line: Decision boundary")

---
# COMPARISON: Clustering vs Classification

| Aspect | Clustering | Classification |
|--------|---|---|
| **Type** | Unsupervised | Supervised |
| **Input** | Unlabeled data | Labeled data |
| **Goal** | Find natural groups | Predict labels for new data |
| **Output** | Cluster assignments | Label predictions |
| **Example** | "These 100 customers form 3 groups" | "This new customer is a high spender" |
| **Question** | "What groups exist?" | "Which group does this belong to?" |

**Key Insight**: Clustering discovers structure. Classification uses known structure to make predictions.
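
The difference also shows up in how the two kinds of model are called. In the minimal sketch below, on hypothetical toy data (`X_toy`, `y_toy`, and `new_customer` are made-up values), the clusterer is fit on features alone while the classifier needs features and labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy customers: [age, spending]
X_toy = np.array([[22, 4000], [27, 6000], [41, 31000],
                  [38, 28000], [56, 62000], [53, 58000]], dtype=float)
# Hypothetical labels: 1 = spending above the median of these six rows
y_toy = np.array([0, 0, 1, 0, 1, 1])

# Clustering: fit on features only; the output is an arbitrary group ID per row
cluster_ids = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_toy)

# Classification: fit on features AND labels; the output is a label for unseen rows
clf_toy = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)
new_customer = np.array([[50, 55000]], dtype=float)
predicted_label = clf_toy.predict(new_customer)[0]

print("Cluster IDs:", cluster_ids)
print("Predicted label for new customer:", predicted_label)
```

The cluster IDs carry no meaning beyond "same group or different group", whereas the predicted label means whatever the training labels meant.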

## One More Thing: Clustering → Classification Workflow

In practice, you often:

1. **Cluster** the data to find natural groups
2. **Inspect** the clusters (a human looks at them)
3. **Label** a few examples from each cluster
4. **Train** a classifier on those labeled examples
5. **Predict** for all remaining unlabeled data

This cluster-then-label approach is a form of **semi-supervised learning** (it is sometimes loosely grouped with weak supervision). It is useful when you have a large amount of unlabeled data but limited resources to label it.
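
The five steps above can be sketched as follows. This is an illustration under stated assumptions, not a production recipe: the data (`X_pool`) is hypothetical, the human annotator is simulated by looking up a hidden `true_segment` array, and the choice of five labeled examples per cluster is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical unlabeled pool: three well-separated blobs of [age, spending]
X_pool = np.vstack([
    rng.normal([25, 5000], [3, 1000], size=(50, 2)),
    rng.normal([40, 30000], [4, 3000], size=(50, 2)),
    rng.normal([55, 60000], [4, 5000], size=(50, 2)),
])
# Hidden ground truth, used ONLY to simulate the human annotator (assumption)
true_segment = np.repeat(['budget', 'professional', 'established'], 50)

# 1. Cluster the (standardized) data
X_std = StandardScaler().fit_transform(X_pool)
cluster_ids = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_std)

# 2-3. Inspect each cluster and label a few examples from it
labeled_idx = np.concatenate([
    rng.choice(np.flatnonzero(cluster_ids == c), size=5, replace=False)
    for c in range(3)
])
human_labels = true_segment[labeled_idx]

# 4. Train a classifier on the small labeled subset
clf_seg = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_seg.fit(X_pool[labeled_idx], human_labels)

# 5. Predict labels for everything that was not hand-labeled
rest_idx = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
predicted = clf_seg.predict(X_pool[rest_idx])
agreement = np.mean(predicted == true_segment[rest_idx])

print(f"Hand-labeled: {len(labeled_idx)}, auto-labeled: {len(rest_idx)}")
print(f"Agreement with hidden ground truth: {agreement:.1%}")
```

In a real project there is no hidden ground truth to check against, so a held-out set of hand-labeled examples would be needed to estimate how well the classifier generalizes.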