# Class 4: Cross-Validation and Mini-Project

**Week 7: Unsupervised Learning and Advanced Data Analysis**

**Objective**: Learn cross-validation for model evaluation and apply unsupervised learning in a clustering mini-project.

**Agenda**:
- Understand cross-validation and its role in unsupervised learning.
- Combine k-means, PCA, and feature selection.
- Mini-Project: Cluster the mall customer dataset and interpret results.

Let’s synthesize our skills and uncover customer segments!

## 1. Cross-Validation in Unsupervised Learning

**Why Cross-Validation?**
- Ensures models are robust and not overly sensitive to data splits.
- In unsupervised learning, we evaluate metrics like clustering quality rather than prediction accuracy.

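**K-Fold Cross-Validation**:
- Split the data into *k* folds, fit on *k-1* folds, and evaluate on the held-out fold.
- Repeat *k* times so that each fold is held out once, then average the results.
- Note: this *k* (number of folds) is unrelated to the *k* (number of clusters) in k-means.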

**Evaluation Metric**:
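- **Silhouette Score**: For each point, compares the mean distance to the other points in its own cluster (*a*) with the mean distance to the points in the nearest other cluster (*b*): *s = (b - a) / max(a, b)*. The overall score is the mean of *s* over all points and ranges from -1 to 1.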
  - Higher score = better-defined clusters.
  - Useful for choosing the number of clusters (*k*) in k-means.
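
To make the definition concrete, here is a minimal sketch that computes the per-point silhouette by hand and checks it against scikit-learn's `silhouette_samples`. The toy data and the helper name `silhouette_by_hand` are illustrative assumptions, not part of the course material.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Toy 1-D data: two tight groups (values chosen purely for illustration)
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

def silhouette_by_hand(X, labels):
    """Return s(i) = (b - a) / max(a, b) for every point (hypothetical helper)."""
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from its own cluster
        if not same.any():
            continue  # a singleton cluster gets a score of 0
        a = dist[i, same].mean()  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            dist[i, labels == c].mean()
            for c in np.unique(labels) if c != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_by_hand(X_toy, labels_toy)
print("By hand:      ", manual.round(4))
print("scikit-learn: ", silhouette_samples(X_toy, labels_toy).round(4))
```

For the first point, *a* = 1 and *b* = 10.5, so *s* = 9.5 / 10.5, which is close to 1 because the two groups are far apart.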

**Application**:
- Validate k-means clusters on customer data.
- Compare different *k* values or preprocessing steps.
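
One way to combine k-fold splitting with the silhouette score is sketched below: fit k-means on the training folds, assign the held-out points to the learned centroids, and score only those held-out points. This is an illustrative sketch, not the only way to validate a clustering. It uses synthetic data from `make_blobs` as a stand-in for the scaled customer features, and the helper name `kfold_silhouette` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold

# Synthetic stand-in for already-scaled features (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

def kfold_silhouette(X, n_clusters, n_splits=5, seed=42):
    """Silhouette score on each held-out fold (hypothetical helper)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    fold_scores = []
    for train_idx, test_idx in kf.split(X):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        km.fit(X[train_idx])                       # learn centroids on k-1 folds
        held_out_labels = km.predict(X[test_idx])  # assign held-out points
        fold_scores.append(silhouette_score(X[test_idx], held_out_labels))
    return np.array(fold_scores)

for n_clusters in (2, 3, 4, 5):
    scores = kfold_silhouette(X_demo, n_clusters)
    print(f"clusters={n_clusters}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A small standard deviation across folds suggests the clustering is not overly sensitive to the split. On real data, the scaler should also be fit on the training folds only.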

## 2. Synthesizing Unsupervised Learning

**Recap**:
- **Class 1**: K-means clustering to group data.
- **Class 2**: PCA to reduce dimensions and visualize.
- **Class 3**: Feature selection and data exploration to clean and prepare data.

**Today’s Goal**:
- Combine these techniques in a mini-project.
- Cluster the mall customer dataset, visualize with PCA, and evaluate with silhouette score.

**Workflow**:
1. Load and preprocess data (use feature selection).
2. Apply k-means clustering.
3. Reduce dimensions with PCA for visualization.
4. Evaluate clusters using silhouette score.
5. Interpret results (e.g., what do clusters represent?).

## 3. Demo: Cross-Validation with Silhouette Score

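We’ll use the silhouette score to evaluate k-means clusters on the mall customer dataset for several values of *k*. For simplicity, each score in this demo is computed on the full dataset rather than on held-out folds.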

**Setup**: Ensure libraries are installed:
```bash
pip install numpy pandas scikit-learn matplotlib seaborn
```

**Dataset**: Use `Mall_Customers.csv` (download from [Kaggle](https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python) and place in your working directory).

```python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Load and preprocess data
data = pd.read_csv('Mall_Customers.csv')
data = data.drop(columns=['CustomerID'], errors='ignore')
data = data.rename(columns={'Annual Income (k$)': 'Income', 'Spending Score (1-100)': 'Spending'})

# Select numeric features (excluding Gender for simplicity)
X = data[['Age', 'Income', 'Spending']]

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Evaluate silhouette score for different k
sil_scores = []
K = range(2, 8)
for k in K:
    # Explicit n_init keeps results consistent across scikit-learn versions
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    score = silhouette_score(X_scaled, labels)
    sil_scores.append(score)
    print(f'Silhouette Score for k={k}: {score:.3f}')

# Plot silhouette scores
plt.plot(K, sil_scores, 'bo-')
plt.title('Silhouette Score vs. Number of Clusters')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Silhouette Score')
plt.show()
```

**Discussion**:
- Which *k* gives the highest silhouette score?
- How does this compare to the elbow method (Class 1)?
- Why might silhouette score be useful for our mini-project?
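
As a starting point for the elbow comparison, the sketch below tabulates inertia (the quantity used by the elbow method) next to the silhouette score for the same range of *k*. It runs on synthetic `make_blobs` data, an assumption made so the example is self-contained. Swap in `X_scaled` to try it on the customer data.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer features (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

rows = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    rows.append({
        'k': k,
        'inertia': km.inertia_,  # within-cluster sum of squares (elbow method)
        'silhouette': silhouette_score(X_demo, km.labels_),
    })

comparison = pd.DataFrame(rows).set_index('k')
print(comparison.round(3))
print('k with highest silhouette:', comparison['silhouette'].idxmax())
```

Inertia tends to keep falling as *k* grows, so the elbow has to be judged by eye, whereas the silhouette score has a maximum that can be read off directly.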

## 4. Mini-Project: Cluster the Mall Customer Dataset

Your task is to cluster the mall customer dataset, visualize the results, and interpret the clusters.

**Steps**:
1. Preprocess: Select features (use Class 3 insights) and standardize.
2. Cluster: Apply k-means with a chosen *k* (use silhouette score or elbow method).
3. Visualize: Use PCA to reduce to 2D and plot clusters.
4. Evaluate: Compute silhouette score.
5. Interpret: Describe what each cluster represents (e.g., customer types).

**Instructions**:
- Follow the code below, adjusting the feature list and *k* where the comments indicate.
- Choose *k* based on silhouette score or experimentation.
- Work in groups or individually, with instructor support.
- Save your results for the homework submission.

```python
# Step 1: Load and preprocess
data_mp = pd.read_csv('Mall_Customers.csv')
data_mp = data_mp.drop(columns=['CustomerID'], errors='ignore')
data_mp = data_mp.rename(columns={'Annual Income (k$)': 'Income', 'Spending Score (1-100)': 'Spending'})

# Select features (based on Class 3)
X_mp = data_mp[['Age', 'Income', 'Spending']]  # Adjust if you dropped features

# Standardize
scaler_mp = StandardScaler()
X_scaled_mp = scaler_mp.fit_transform(X_mp)
```

```python
# Step 2: Apply k-means
# Choose k (e.g., from silhouette score above)
k_chosen = 5  # Replace with your choice
# Explicit n_init keeps results consistent across scikit-learn versions
kmeans_mp = KMeans(n_clusters=k_chosen, n_init=10, random_state=42)
labels_mp = kmeans_mp.fit_predict(X_scaled_mp)

# Step 4: Evaluate with silhouette score (computed here because it needs only the labels)
sil_score_mp = silhouette_score(X_scaled_mp, labels_mp)
print(f'Silhouette Score for k={k_chosen}: {sil_score_mp:.3f}')
```

```python
# Step 3: Visualize with PCA
from sklearn.decomposition import PCA

pca_mp = PCA(n_components=2)
X_pca_mp = pca_mp.fit_transform(X_scaled_mp)

# Create DataFrame
pca_df_mp = pd.DataFrame(X_pca_mp, columns=['PC1', 'PC2'])
pca_df_mp['Cluster'] = labels_mp

# Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='PC1', y='PC2', hue='Cluster', palette='Set1', data=pca_df_mp, s=100, alpha=0.7)
plt.title(f'Customer Clusters (k={k_chosen}) in PCA Space')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

# Explained variance
print('Explained Variance Ratio:', pca_mp.explained_variance_ratio_)
```
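
To read the PCA axes, it can help to look at the loadings, i.e. how much each original feature contributes to each principal component. The sketch below uses randomly generated stand-in data with the same column names (an assumption, so the numbers carry no meaning). With the project variables you would pass `pca_mp.components_` and the columns of `X_mp` instead.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; replace with the real customer features
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.normal(40, 12, 200),
    'Income': rng.normal(60, 25, 200),
    'Spending': rng.normal(50, 25, 200),
})

X_std = StandardScaler().fit_transform(toy)
pca = PCA(n_components=2).fit(X_std)

# Rows = original features, columns = principal components
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=['PC1', 'PC2'])
print(loadings.round(2))

# Share of total variance kept by the 2D projection
retained = pca.explained_variance_ratio_.sum()
print(f'Variance retained by 2 components: {retained:.1%}')
```

A loading with a large absolute value means that feature strongly drives the component. If the retained variance is low, the 2D plot hides much of the structure that k-means used.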

```python
# Step 5: Interpret clusters
# Add cluster labels to original data
data_mp['Cluster'] = labels_mp

# Group by cluster to see feature means
cluster_summary = data_mp.groupby('Cluster')[['Age', 'Income', 'Spending']].mean()
print('Cluster Characteristics:\n', cluster_summary)

# Visualize feature distributions per cluster
plt.figure(figsize=(12, 4))
for i, col in enumerate(['Age', 'Income', 'Spending'], 1):
    plt.subplot(1, 3, i)
    sns.boxplot(x='Cluster', y=col, data=data_mp, palette='Set1')
    plt.title(f'{col} by Cluster')
plt.tight_layout()
plt.show()
```
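
A complementary view for interpretation is the cluster sizes together with the k-means centroids converted back to the original units. The sketch below does this with `StandardScaler.inverse_transform` on randomly generated stand-in data (an assumption, since it has no real cluster structure). With the project variables you would use `scaler_mp` and `kmeans_mp`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data; replace with the real customer features
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'Age': rng.integers(18, 70, 200),
    'Income': rng.integers(15, 140, 200),
    'Spending': rng.integers(1, 100, 200),
})

scaler = StandardScaler()
X_std = scaler.fit_transform(toy)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_std)

# Centroids live in standardized space; map them back to years, k$, and score
centroids = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=toy.columns,
)
centroids['Size'] = np.bincount(km.labels_, minlength=3)
print(centroids.round(1))
```

The centroids should be close to the per-cluster means from the `groupby` summary. The `Size` column flags very small clusters that may not deserve a segment of their own.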

**Your Interpretation**:
- What does each cluster represent? (e.g., "young high-spenders", "older low-income")
- Is the silhouette score high enough to trust the clusters?
- How does PCA visualization help understand the results?
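
When judging whether to trust the clusters, the overall silhouette score can hide one weak cluster among several good ones. The sketch below breaks the score down per cluster with `silhouette_samples`, again on synthetic `make_blobs` data (an assumption). With the project variables you would pass `X_scaled_mp` and `labels_mp`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled customer features (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

per_point = silhouette_samples(X_demo, labels)  # one score per customer
print(f'Overall silhouette: {silhouette_score(X_demo, labels):.3f}')

for c in np.unique(labels):
    s = per_point[labels == c]
    # Points with a negative score sit closer to another cluster than to their own
    print(f'cluster {c}: n={s.size}, mean={s.mean():.3f}, share below 0={(s < 0).mean():.1%}')
```

A cluster with a low mean or many negative scores is a candidate for merging, splitting, or trying a different *k*.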

## 5. Wrap-Up

**Key Takeaways**:
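- The silhouette score evaluates clustering quality without labels; repeating it across folds (cross-validation) checks that the result is not sensitive to the data split.
- Combining feature selection, k-means, and PCA gives a complete workflow: prepare the features, group the customers, and visualize the groups in 2D.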
- Interpreting clusters turns data into stories (e.g., customer segments).

**Discussion Questions**:
- What customer types did you find?
- How did PCA and feature selection help?
- What would you try differently (e.g., different *k*, features)?

**Homework**:
- Finalize the mini-project:
  - Submit your notebook with code, visualizations, and a short write-up.
  - Describe each cluster (1–2 sentences each) and why they make sense.
  - Suggest one business application (e.g., targeted marketing).
- Due date: [Insert your deadline].

Amazing work this week! You’ve mastered unsupervised learning!