# 📚 Lecture 17: Clustering and Unsupervised Learning Fundamentals

## Comprehensive Hands-on Practice Notebook

**Author:** Ho-min Park  
**Email:** homin.park@ghent.ac.kr | powersimmani@gmail.com

---

### 🎯 Learning Objectives

By the end of this notebook, you will be able to:

1. ✅ **Understand** the fundamental concepts of unsupervised learning
2. ✅ **Apply** multiple clustering algorithms (K-Means, DBSCAN, Hierarchical, GMM)
3. ✅ **Evaluate** clustering quality using internal and external metrics
4. ✅ **Reduce** dimensionality using PCA, t-SNE, and UMAP
5. ✅ **Detect** anomalies using statistical and ML-based methods
6. ✅ **Build** complete unsupervised learning pipelines
7. ✅ **Select** appropriate algorithms for different scenarios
8. ✅ **Interpret** and communicate results effectively

---

### 📋 Table of Contents

**Part 0:** Setup and Introduction  
**Part 1:** Clustering Fundamentals (K-Means, Hierarchical, DBSCAN)  
**Part 2:** Cluster Evaluation Metrics  
**Part 3:** Dimensionality Reduction (PCA, t-SNE, UMAP)  
**Part 4:** Anomaly Detection  
**Part 5:** Integrated Real-World Applications  
**Part 6:** Summary and Key Takeaways  

---

### ⚡ Quick Start

Run all cells in order, or jump to specific sections using the table of contents.

---

# Part 0: Setup and Configuration

Let's import all necessary libraries and set up our environment.

In [None]:
# Core Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Scikit-learn: Clustering
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering, MeanShift
from sklearn.mixture import GaussianMixture

# Scikit-learn: Dimensionality Reduction
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE

# Scikit-learn: Anomaly Detection
from sklearn.ensemble import IsolationForest
from sklearn.covariance import EllipticEnvelope
from sklearn.neighbors import LocalOutlierFactor

# Scikit-learn: Metrics and Utilities
from sklearn.metrics import (
    silhouette_score, silhouette_samples,
    davies_bouldin_score, calinski_harabasz_score,
    adjusted_rand_score, normalized_mutual_info_score,
    confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve
)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import make_blobs, make_moons, make_circles, load_iris

# Scipy
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import cdist
from scipy.stats import zscore

# Plotly for Interactive Visualizations
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Configuration
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("✅ All libraries imported successfully!")
print(f"📦 NumPy version: {np.__version__}")
print(f"📦 Pandas version: {pd.__version__}")
import sklearn
print(f"📦 Scikit-learn version: {sklearn.__version__}")

### Helper Functions

Let's define some utility functions for visualization and analysis.

In [None]:
def plot_clusters(X, labels, centers=None, title="Cluster Visualization", figsize=(10, 6)):
    """
    Plot 2D clusters with optional centroids
    
    Parameters:
    -----------
    X : ndarray, shape (n_samples, n_features)
        Data points (only the first 2 features are plotted)
    labels : array-like, shape (n_samples,)
        Cluster labels
    centers : array-like, optional
        Cluster centers
    title : str
        Plot title
    """
    plt.figure(figsize=figsize)
    
    # Plot points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', 
                         alpha=0.6, edgecolors='k', s=50)
    
    # Plot centers if provided
    if centers is not None:
        plt.scatter(centers[:, 0], centers[:, 1], 
                   c='red', marker='X', s=200, 
                   edgecolors='black', linewidths=2,
                   label='Centroids')
        plt.legend()
    
    plt.colorbar(scatter, label='Cluster')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.tight_layout()
    plt.show()


def plot_silhouette(X, labels, metric='euclidean'):
    """
    Create silhouette plot for cluster analysis
    """
    from matplotlib import cm
    
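    # Drop noise points (label = -1, e.g. from DBSCAN) before scoring
    labels = np.asarray(labels)
    keep = labels >= 0
    X, labels = np.asarray(X)[keep], labels[keep]

    n_clusters = len(np.unique(labels))
    silhouette_avg = silhouette_score(X, labels, metric=metric)
    sample_silhouette_values = silhouette_samples(X, labels, metric=metric)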
    
    fig, ax = plt.subplots(figsize=(10, 6))
    y_lower = 10
    
    for i in range(n_clusters):
        ith_cluster_silhouette_values = sample_silhouette_values[labels == i]
        ith_cluster_silhouette_values.sort()
        
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)
        
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    
    ax.set_title(f'Silhouette Plot (Average Score: {silhouette_avg:.3f})')
    ax.set_xlabel('Silhouette Coefficient')
    ax.set_ylabel('Cluster')
    ax.axvline(x=silhouette_avg, color="red", linestyle="--", label='Average')
    ax.legend()
    plt.tight_layout()
    plt.show()
    
    return silhouette_avg


def evaluate_clustering(X, labels, name="Algorithm"):
    """
    Calculate and display clustering evaluation metrics
    """
    # Filter out noise points (label = -1) for metrics that don't support them
    X, labels = np.asarray(X), np.asarray(labels)
    mask = labels >= 0
    X_filtered = X[mask]
    labels_filtered = labels[mask]
    
    if len(np.unique(labels_filtered)) > 1:
        silhouette = silhouette_score(X_filtered, labels_filtered)
        davies_bouldin = davies_bouldin_score(X_filtered, labels_filtered)
        calinski = calinski_harabasz_score(X_filtered, labels_filtered)
        
        print(f"\n{'='*50}")
        print(f"📊 {name} - Evaluation Metrics")
        print(f"{'='*50}")
        print(f"Silhouette Score:        {silhouette:.4f} (higher is better, range: [-1, 1])")
        print(f"Davies-Bouldin Index:    {davies_bouldin:.4f} (lower is better)")
        print(f"Calinski-Harabasz Score: {calinski:.2f} (higher is better)")
        print(f"Number of Clusters:      {len(np.unique(labels_filtered))}")
        if -1 in labels:
            print(f"Number of Noise Points:  {np.sum(labels == -1)}")
        print(f"{'='*50}\n")
        
        return {
            'silhouette': silhouette,
            'davies_bouldin': davies_bouldin,
            'calinski_harabasz': calinski
        }
    else:
        print(f"⚠️ {name}: Not enough clusters for evaluation")
        return None


print("✅ Helper functions defined successfully!")

### Load Sample Datasets

We'll use multiple datasets throughout this notebook.

In [None]:
# Load Iris dataset (load_iris was already imported in the setup cell)
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
iris_df['species_name'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

print("🌸 Iris Dataset Loaded")
print(f"   Shape: {iris_df.shape}")
print(f"   Features: {list(iris.feature_names)}")
print(f"   Classes: {list(iris.target_names)}")

# Create synthetic datasets
# Dataset 1: Well-separated blobs
X_blobs, y_blobs = make_blobs(n_samples=300, centers=4, n_features=2, 
                               cluster_std=0.6, random_state=42)

# Dataset 2: Non-linear patterns (moons)
X_moons, y_moons = make_moons(n_samples=300, noise=0.1, random_state=42)

# Dataset 3: Circles
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

print("\n🎲 Synthetic Datasets Created")
print(f"   Blobs: {X_blobs.shape}")
print(f"   Moons: {X_moons.shape}")
print(f"   Circles: {X_circles.shape}")

# Visualize datasets
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_blobs, cmap='viridis', alpha=0.6, edgecolors='k')
axes[0].set_title('Blobs Dataset')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='viridis', alpha=0.6, edgecolors='k')
axes[1].set_title('Moons Dataset')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

axes[2].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='viridis', alpha=0.6, edgecolors='k')
axes[2].set_title('Circles Dataset')
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n✅ All datasets loaded and ready!")

---

# Part 1: Clustering Fundamentals

## What is Clustering?

**Clustering** is an unsupervised learning technique that groups similar data points together based on their features. Unlike supervised learning, clustering doesn't require labeled data.

### Key Concepts:

- **Goal**: Partition *n* observations into *k* groups
- **Objective**: Maximize intra-cluster similarity, minimize inter-cluster similarity
- **Distance Metrics**: Euclidean, Manhattan, Cosine, Mahalanobis

### Types of Clustering:

1. **Hard Clustering**: Each point belongs to exactly one cluster (e.g., K-Means)
2. **Soft Clustering**: Points have probability distribution over clusters (e.g., GMM)
3. **Hierarchical**: Nested clusters forming tree structure (e.g., Agglomerative)
4. **Density-Based**: Clusters as high-density regions (e.g., DBSCAN)
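
The difference between hard and soft clustering can be seen directly in the output shapes. The sketch below is illustrative only: `X_demo` is a made-up two-blob dataset, and the parameter values are assumptions chosen for the demo rather than recommendations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Small synthetic dataset with two overlapping groups (illustrative only)
X_demo, _ = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

# Hard clustering: one integer label per point
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# Soft clustering: one membership probability per (point, component)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
soft_probs = gmm.predict_proba(X_demo)

print("Hard labels (first 5):", hard_labels[:5])
print("Soft memberships (first 5):")
print(np.round(soft_probs[:5], 3))
```

Each row of `soft_probs` sums to 1, so points near the overlap region show split memberships while K-Means still commits to a single label.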

---

## Exercise 1.1: K-Means Clustering

### 📖 Theory

**K-Means** is one of the most popular clustering algorithms. It partitions data into K clusters by:

1. **Initialize**: Randomly select K centroids
2. **Assignment**: Assign each point to nearest centroid
3. **Update**: Recompute centroids as cluster means
4. **Repeat**: Steps 2-3 until convergence
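
The four steps above can be sketched in plain NumPy. This is a minimal teaching version, not scikit-learn's implementation; `kmeans_sketch` and `X_toy` are hypothetical names introduced for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of each cluster (keep the old centroid if a cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_toy, centroids_toy = kmeans_sketch(X_toy, k=2)
print(centroids_toy.round(2))
```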

**Strengths:**
- ✅ Simple and fast: O(n·k·i·d)
- ✅ Scales to large datasets
- ✅ Easy to interpret

**Weaknesses:**
- ❌ Assumes spherical clusters
- ❌ Sensitive to initialization
- ❌ Requires pre-specified K

**Key Parameter:**
- `n_clusters` (k): Number of clusters to form

---

In [None]:
# === K-Means Implementation ===

# Use blobs dataset (well-separated clusters)
X = X_blobs.copy()

# Apply K-Means with k=4
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_kmeans = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# Visualize results
plot_clusters(X, labels_kmeans, centers, title="K-Means Clustering (k=4)")

# Evaluate
metrics = evaluate_clustering(X, labels_kmeans, name="K-Means (k=4)")

print("\n💡 Key Observations:")
print("   • K-Means successfully identified 4 well-separated clusters")
print("   • High silhouette score indicates good cluster separation")
print("   • Centroids (red X) represent the mean of each cluster")

### 🔍 Finding Optimal K: The Elbow Method

How do we choose the right number of clusters? The **Elbow Method** helps us find optimal K by plotting inertia (within-cluster sum of squares) against K.

**Inertia**: Sum of squared distances of samples to their closest cluster center
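
To make the definition concrete, the sketch below recomputes inertia by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. `X_demo` and `km` are names assumed for this example; the data mirrors the blobs settings used earlier.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_demo)

# Centroid of the cluster each sample was assigned to
assigned_centers = km.cluster_centers_[km.labels_]
# Sum of squared distances from samples to their assigned centroids
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(f"Manual inertia: {manual_inertia:.4f}")
print(f"km.inertia_:    {km.inertia_:.4f}")
```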

In [None]:
# === Elbow Method ===

# Test different values of K
K_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in K_range:
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_temp = kmeans_temp.fit_predict(X)
    
    inertias.append(kmeans_temp.inertia_)
    silhouette_scores.append(silhouette_score(X, labels_temp))

# Plot Elbow Curve
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Inertia plot
ax1.plot(K_range, inertias, marker='o', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters (K)', fontsize=12)
ax1.set_ylabel('Inertia (WCSS)', fontsize=12)
ax1.set_title('Elbow Method: Inertia vs K', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
ax1.legend()

# Silhouette score plot
ax2.plot(K_range, silhouette_scores, marker='s', linewidth=2, markersize=8, color='green')
ax2.set_xlabel('Number of Clusters (K)', fontsize=12)
ax2.set_ylabel('Silhouette Score', fontsize=12)
ax2.set_title('Silhouette Score vs K', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
ax2.legend()

plt.tight_layout()
plt.show()

# Find optimal K
optimal_k_silhouette = K_range[np.argmax(silhouette_scores)]

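print(f"\n📊 Analysis Results:")
print(f"   • Elbow point appears around K=4 (inertia decreases only slowly beyond it)")
print(f"   • Highest silhouette score at K={optimal_k_silhouette}")
if optimal_k_silhouette == 4:
    print("   • Recommended K: 4 (both criteria agree)")
else:
    print("   • The two criteria disagree; inspect both plots before choosing K")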

### 💪 Your Turn - Practice Task 1.1

**Task:** Apply K-Means to the Iris dataset

1. Load the first 2 features of the Iris dataset (sepal length and width)
2. Apply K-Means with k=3 (we know there are 3 species)
3. Visualize the clusters
4. Calculate evaluation metrics
5. Compare clustering results with true species labels using confusion matrix

**Hint:** Use `X_iris = iris.data[:, :2]` and `y_true = iris.target`

In [None]:
# === YOUR CODE HERE ===

# Step 1: Prepare data
X_iris = iris.data[:, :2]  # First 2 features
y_true = iris.target

# Step 2: Apply K-Means
# TODO: Create KMeans model with n_clusters=3


# Step 3: Visualize
# TODO: Use plot_clusters() function


# Step 4: Evaluate
# TODO: Use evaluate_clustering() function


# Step 5: Compare with true labels
# TODO: Create confusion matrix


# === END OF YOUR CODE ===

print("\n💡 Reflection Questions:")
print("   1. How well did K-Means identify the true species?")
print("   2. Which species was easiest/hardest to cluster?")
print("   3. Why might clustering not perfectly match species labels?")

In [None]:
# === SOLUTION (Uncomment to see) ===

# # Step 1: Prepare data
# X_iris = iris.data[:, :2]
# y_true = iris.target

# # Step 2: Apply K-Means
# kmeans_iris = KMeans(n_clusters=3, random_state=42, n_init=10)
# y_pred = kmeans_iris.fit_predict(X_iris)

# # Step 3: Visualize
# plot_clusters(X_iris, y_pred, kmeans_iris.cluster_centers_, 
#               title="K-Means on Iris (k=3)")

# # Step 4: Evaluate
# evaluate_clustering(X_iris, y_pred, name="K-Means on Iris")

# # Step 5: Compare with true labels
# from sklearn.metrics import confusion_matrix, adjusted_rand_score
# cm = confusion_matrix(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)

# plt.figure(figsize=(8, 6))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
#             xticklabels=['Cluster 0', 'Cluster 1', 'Cluster 2'],
#             yticklabels=iris.target_names)
# plt.title(f'Confusion Matrix (ARI: {ari:.3f})')
# plt.ylabel('True Species')
# plt.xlabel('Predicted Cluster')
# plt.tight_layout()
# plt.show()

# print(f"\n📊 Adjusted Rand Index: {ari:.3f}")
# print("   (1.0 = perfect match, 0.0 = random labeling)")

---

## Exercise 1.2: K-Means++ Initialization

### 📖 Theory

**K-Means++** is an improved initialization strategy that addresses K-Means' sensitivity to initial centroid placement.

**Algorithm:**
1. Choose first centroid randomly from data points
2. For each subsequent centroid:
   - Calculate distance D(x) to nearest existing centroid
   - Choose next centroid with probability ∝ D(x)²
3. Repeat step 2 until K centroids are chosen; this tends to spread the initial centroids far apart
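
The seeding procedure above can be sketched in a few lines of NumPy. `kmeans_pp_init` and `X_toy` are hypothetical names for this example; scikit-learn's `init='k-means++'` is a more elaborate variant, so treat this as an illustration of the idea rather than the library's code.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Sketch of k-means++ seeding (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # Step 1: first centroid chosen uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # D(x)^2: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Step 2: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Three toy blobs centered at (0, 0), (4, 4) and (8, 8)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in (0, 4, 8)])
init_centroids = kmeans_pp_init(X_toy, k=3)
print(init_centroids.round(2))
```

Already-chosen points have D(x)² = 0, so they can never be selected twice.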

**Benefits:**
- ✅ Faster convergence (fewer iterations)
- ✅ Better quality clusters
- ✅ O(log k) approximation guarantee (in expectation, relative to the optimal inertia)
- ✅ Default in scikit-learn

---

In [None]:
# === Compare Random vs K-Means++ Initialization ===

# Random initialization
kmeans_random = KMeans(n_clusters=4, init='random', n_init=1, max_iter=300, random_state=42)
kmeans_random.fit(X)

# K-Means++ initialization
kmeans_plusplus = KMeans(n_clusters=4, init='k-means++', n_init=1, max_iter=300, random_state=42)
kmeans_plusplus.fit(X)

# Compare results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Random init
axes[0].scatter(X[:, 0], X[:, 1], c=kmeans_random.labels_, cmap='viridis', alpha=0.6, edgecolors='k')
axes[0].scatter(kmeans_random.cluster_centers_[:, 0], 
                kmeans_random.cluster_centers_[:, 1],
                c='red', marker='X', s=200, edgecolors='black', linewidths=2)
axes[0].set_title(f'Random Init\nInertia: {kmeans_random.inertia_:.2f} | Iterations: {kmeans_random.n_iter_}')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# K-Means++ init
axes[1].scatter(X[:, 0], X[:, 1], c=kmeans_plusplus.labels_, cmap='viridis', alpha=0.6, edgecolors='k')
axes[1].scatter(kmeans_plusplus.cluster_centers_[:, 0], 
                kmeans_plusplus.cluster_centers_[:, 1],
                c='red', marker='X', s=200, edgecolors='black', linewidths=2)
axes[1].set_title(f'K-Means++ Init\nInertia: {kmeans_plusplus.inertia_:.2f} | Iterations: {kmeans_plusplus.n_iter_}')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\n📊 Comparison Results:")
print(f"\n{'Method':<20} {'Inertia':<15} {'Iterations':<15} {'Silhouette'}")
print("-" * 65)
print(f"{'Random Init':<20} {kmeans_random.inertia_:<15.2f} {kmeans_random.n_iter_:<15} "
      f"{silhouette_score(X, kmeans_random.labels_):.4f}")
print(f"{'K-Means++':<20} {kmeans_plusplus.inertia_:<15.2f} {kmeans_plusplus.n_iter_:<15} "
      f"{silhouette_score(X, kmeans_plusplus.labels_):.4f}")

print("\n💡 Key Observations:")
print("   • K-Means++ typically converges faster (fewer iterations)")
print("   • K-Means++ often achieves lower inertia (better clusters)")
print("   • Small initialization overhead → Large quality gains")

---

## Exercise 1.3: Hierarchical Clustering

### 📖 Theory

**Hierarchical Clustering** builds a hierarchy of clusters, represented as a **dendrogram** (tree diagram).

**Two Approaches:**

1. **Agglomerative (Bottom-Up)**: ⬆️
   - Start: Each point is its own cluster
   - Repeatedly merge closest clusters
   - End: All points in one cluster

2. **Divisive (Top-Down)**: ⬇️
   - Start: All points in one cluster
   - Repeatedly split clusters
   - End: Each point is its own cluster

**Linkage Criteria** (how to measure cluster distance):
- **Single**: Minimum distance between points
- **Complete**: Maximum distance between points
- **Average**: Average distance between all pairs
- **Ward** ⭐: Minimizes within-cluster variance (most common)
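
The linkage criteria differ only in how they summarize distances between two clusters. The sketch below computes each one for two tiny, made-up clusters (`cluster_a` and `cluster_b` are illustrative names and coordinates); for Ward it uses the standard merge cost, the increase in within-cluster sum of squares.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two tiny illustrative clusters on a line (made-up coordinates)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

pairwise = cdist(cluster_a, cluster_b)  # all point-to-point distances

single_link = pairwise.min()     # closest pair      -> 2.0
complete_link = pairwise.max()   # farthest pair     -> 5.0
average_link = pairwise.mean()   # mean of all pairs -> 3.5

# Ward: increase in within-cluster sum of squares caused by the merge
n_a, n_b = len(cluster_a), len(cluster_b)
centroid_gap = np.linalg.norm(cluster_a.mean(axis=0) - cluster_b.mean(axis=0))
ward_increase = (n_a * n_b) / (n_a + n_b) * centroid_gap ** 2  # -> 12.25

print(single_link, complete_link, average_link, ward_increase)
```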

**Advantages:**
- ✅ No need to specify K upfront
- ✅ Dendrogram provides insights
- ✅ Can create any number of clusters by cutting tree

**Disadvantages:**
- ❌ O(n³) time in the naive implementation and O(n²) memory - does not scale to large datasets
- ❌ Sensitive to noise and outliers
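
The "cut the tree anywhere" advantage can be sketched with SciPy's `fcluster`: the linkage matrix is built once and then cut at several levels without refitting. `X_demo`, `Z`, and `sizes_by_k` are names assumed for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=100, centers=4, cluster_std=0.6, random_state=42)
Z = linkage(X_demo, method='ward')  # build the tree once

# Cut the same tree into k flat clusters for several values of k
sizes_by_k = {}
for k in (2, 3, 4, 6):
    labels_k = fcluster(Z, t=k, criterion='maxclust')  # labels start at 1
    sizes_by_k[k] = np.bincount(labels_k)[1:]
    print(f"k={k}: cluster sizes = {sizes_by_k[k]}")
```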

---

In [None]:
# === Hierarchical Clustering with Dendrogram ===

# Use a smaller sample for better visualization
np.random.seed(42)
sample_indices = np.random.choice(len(X), size=100, replace=False)
X_sample = X[sample_indices]

# Compute linkage matrix
linkage_matrix = linkage(X_sample, method='ward')

# Create dendrogram
plt.figure(figsize=(14, 6))
dendrogram(linkage_matrix, 
           truncate_mode='lastp',  # Show only last p merged clusters
           p=12,
           leaf_rotation=90,
           leaf_font_size=10,
           show_contracted=True)

plt.axhline(y=10, color='r', linestyle='--', label='Cut at height=10 → 4 clusters')
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)', fontsize=14, fontweight='bold')
plt.xlabel('Sample Index or (Cluster Size)', fontsize=12)
plt.ylabel('Distance (Ward)', fontsize=12)
plt.legend()
plt.tight_layout()
plt.show()

print("\n📊 Dendrogram Interpretation:")
print("   • Height: Distance at which clusters merge")
print("   • Vertical lines: Represent clusters being merged")
print("   • Horizontal cut: Determines number of clusters")
print("   • Red dashed line: Cutting here gives 4 clusters")

In [None]:
# === Apply Agglomerative Clustering ===

# Fit hierarchical clustering with n_clusters=4
hierarchical = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels_hier = hierarchical.fit_predict(X)

# Visualize results
plot_clusters(X, labels_hier, title="Hierarchical Clustering (Ward, k=4)")

# Evaluate
evaluate_clustering(X, labels_hier, name="Hierarchical (Ward)")

print("\n💡 Key Observations:")
print("   • Ward linkage minimizes within-cluster variance")
print("   • Results similar to K-Means for well-separated clusters")
print("   • Dendrogram helps visualize cluster formation process")

### 💪 Your Turn - Practice Task 1.3

**Task:** Compare different linkage methods

1. Apply hierarchical clustering with three linkage methods: 'single', 'complete', 'average'
2. Use n_clusters=4 for all methods
3. Visualize results side-by-side
4. Calculate silhouette scores for each method
5. Which linkage method works best for this data?

In [None]:
# === YOUR CODE HERE ===

# TODO: Apply hierarchical clustering with different linkage methods
# linkages = ['single', 'complete', 'average']


# TODO: Visualize and compare


# TODO: Compare silhouette scores


# === END OF YOUR CODE ===

---

## Exercise 1.4: DBSCAN - Density-Based Clustering

### 📖 Theory

**DBSCAN** (Density-Based Spatial Clustering of Applications with Noise) discovers clusters of arbitrary shape and identifies outliers.

**Key Principle:** Clusters are high-density regions separated by low-density regions.

**Parameters:**
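- **ε (epsilon)**: Neighborhood radius (`eps` in scikit-learn)
- **minPts**: Minimum number of points within ε, counting the point itself, for a point to be a core point (`min_samples` in scikit-learn)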

**Point Types:**
1. **Core Points** 🟢: ≥ minPts neighbors within ε
2. **Border Points** 🟡: In neighborhood of core, but not core itself
3. **Noise Points** 🔴: Neither core nor border (outliers)
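
The three point types can be recovered from a fitted scikit-learn `DBSCAN` model using its `core_sample_indices_` and `labels_` attributes. This is a small sketch on a moons dataset; `X_demo`, `db`, and the mask names are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

db = DBSCAN(eps=0.3, min_samples=5).fit(X_demo)

core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # noise points
border_mask = ~core_mask & ~noise_mask      # assigned to a cluster but not core

print(f"Core: {core_mask.sum()}, Border: {border_mask.sum()}, Noise: {noise_mask.sum()}")
```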

**Advantages:**
- ✅ Discovers arbitrary-shaped clusters
- ✅ Automatically detects outliers
- ✅ No need to specify number of clusters
- ✅ Robust to outliers

**Disadvantages:**
- ❌ Sensitive to ε and minPts parameters
- ❌ Struggles with varying densities
- ❌ High-dimensional data challenges

---

In [None]:
# === DBSCAN on Non-linear Data ===

# Use moons dataset (non-linear clusters)
X_moons_scaled = StandardScaler().fit_transform(X_moons)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels_dbscan = dbscan.fit_predict(X_moons_scaled)

# Count noise points
n_clusters = len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)
n_noise = list(labels_dbscan).count(-1)

# Visualize
plt.figure(figsize=(12, 5))

# Original data
plt.subplot(1, 2, 1)
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='viridis', 
           alpha=0.6, edgecolors='k', s=50)
plt.title('True Labels (Moons Dataset)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')

# DBSCAN results
plt.subplot(1, 2, 2)
unique_labels = set(labels_dbscan)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Noise points in black
        col = 'black'
        marker = 'x'
        label = 'Noise'
    else:
        marker = 'o'
        label = f'Cluster {k}'
    
    class_member_mask = (labels_dbscan == k)
    xy = X_moons_scaled[class_member_mask]
    plt.scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, 
               alpha=0.6, edgecolors='k', s=50, label=label)

plt.title(f'DBSCAN Results\n{n_clusters} clusters, {n_noise} noise points')
plt.xlabel('Feature 1 (scaled)')
plt.ylabel('Feature 2 (scaled)')
plt.legend()
plt.tight_layout()
plt.show()

# Evaluate (excluding noise points)
if n_clusters > 1:
    evaluate_clustering(X_moons_scaled, labels_dbscan, name="DBSCAN")

print("\n💡 Key Observations:")
print("   • DBSCAN successfully identified non-linear (moon-shaped) clusters")
print("   • K-Means would fail on this data (assumes spherical clusters)")
print("   • Noise points (black X) automatically detected")
print("   • No need to specify number of clusters in advance")

### 🔧 Parameter Sensitivity Analysis

Let's see how ε (epsilon) affects clustering results.

In [None]:
# === DBSCAN Parameter Sensitivity ===

eps_values = [0.2, 0.3, 0.4, 0.5]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, eps in enumerate(eps_values):
    dbscan_temp = DBSCAN(eps=eps, min_samples=5)
    labels_temp = dbscan_temp.fit_predict(X_moons_scaled)
    
    n_clusters_temp = len(set(labels_temp)) - (1 if -1 in labels_temp else 0)
    n_noise_temp = list(labels_temp).count(-1)
    
    # Plot
    unique_labels = set(labels_temp)
    colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
    
    for k, col in zip(unique_labels, colors):
        if k == -1:
            col = 'black'
            marker = 'x'
        else:
            marker = 'o'
        
        class_member_mask = (labels_temp == k)
        xy = X_moons_scaled[class_member_mask]
        axes[idx].scatter(xy[:, 0], xy[:, 1], c=[col], marker=marker, 
                         alpha=0.6, edgecolors='k', s=40)
    
    axes[idx].set_title(f'ε = {eps}\nClusters: {n_clusters_temp}, Noise: {n_noise_temp}')
    axes[idx].set_xlabel('Feature 1 (scaled)')
    axes[idx].set_ylabel('Feature 2 (scaled)')

plt.tight_layout()
plt.show()

print("\n📊 Parameter Sensitivity:")
print("   • Small ε: More clusters, more noise points")
print("   • Large ε: Fewer clusters, points merge together")
print("   • Optimal ε: Balance between separation and connectivity")

### 💪 Your Turn - Practice Task 1.4

**Task:** Apply DBSCAN to the circles dataset

1. Scale the `X_circles` dataset using StandardScaler
2. Apply DBSCAN with different epsilon values: [0.1, 0.2, 0.3, 0.4]
3. For each epsilon, count the number of clusters and noise points
4. Visualize the best result
5. Compare with K-Means results on the same data

**Question:** Why does DBSCAN work better than K-Means for this dataset?

In [None]:
# === YOUR CODE HERE ===

# TODO: Scale the circles dataset


# TODO: Try different epsilon values


# TODO: Apply K-Means for comparison


# TODO: Visualize and compare


# === END OF YOUR CODE ===

---

# Part 2: Cluster Evaluation Metrics

## The Challenge: No Ground Truth

Unlike supervised learning, clustering has no "correct" answers. So how do we evaluate cluster quality?

### Two Types of Metrics:

1. **Internal Metrics** 📊: Use only the data itself
   - Silhouette Score
   - Davies-Bouldin Index  
   - Calinski-Harabasz Index
   - Inertia (WCSS)

2. **External Metrics** 🏷️: Compare with ground truth labels (when available)
   - Adjusted Rand Index (ARI)
   - Normalized Mutual Information (NMI)
   - Fowlkes-Mallows Index
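
External metrics compare groupings, not label values, so renaming cluster ids does not change the score. The sketch below illustrates this with two hand-made labelings (`y_true_demo`, `y_renamed`, and `y_one_error` are made-up arrays for this example).

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

y_true_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
# Same grouping as y_true_demo, but with the cluster ids renamed
y_renamed = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])
# One point moved into the wrong group
y_one_error = np.array([0, 0, 1, 1, 1, 1, 2, 2, 2])

scores = {}
for name, y_pred_demo in [("renamed ids", y_renamed), ("one error", y_one_error)]:
    ari = adjusted_rand_score(y_true_demo, y_pred_demo)
    nmi = normalized_mutual_info_score(y_true_demo, y_pred_demo)
    scores[name] = (ari, nmi)
    print(f"{name:<12} ARI={ari:.3f}  NMI={nmi:.3f}")
```

This is also why a raw confusion matrix between species and cluster ids may look "shuffled" even when the clustering is good.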

---

## Exercise 2.1: Internal Metrics

### 📖 Silhouette Score

**Definition:** Measures how similar an object is to its own cluster compared to other clusters.

**Formula for point i:**
```
s(i) = (b(i) - a(i)) / max(a(i), b(i))
```

Where:
- **a(i)**: Mean distance to other points in same cluster (cohesion)
- **b(i)**: Mean distance to points in nearest cluster (separation)
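
As a check on the formula, the sketch below computes a(i), b(i), and s(i) by hand for a single point and compares the result with `silhouette_samples`. The dataset and the names `X_demo`, `labels_demo`, `a_i`, `b_i`, and `s_i` are assumptions for this example.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X_demo, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=0)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

i = 0  # index of the point to inspect
dist_i = cdist(X_demo[[i]], X_demo)[0]  # distances from point i to every point

# a(i): mean distance to the other members of its own cluster (exclude i itself)
same = labels_demo == labels_demo[i]
same[i] = False
a_i = dist_i[same].mean()

# b(i): smallest mean distance to the members of any other cluster
b_i = min(dist_i[labels_demo == c].mean()
          for c in np.unique(labels_demo) if c != labels_demo[i])

s_i = (b_i - a_i) / max(a_i, b_i)
print(f"manual  s({i}) = {s_i:.4f}")
print(f"sklearn s({i}) = {silhouette_samples(X_demo, labels_demo)[i]:.4f}")
```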

**Range:** [-1, 1]
- +1: Point is well inside its own cluster and far from all others
- 0: Point lies on or near the boundary between two clusters
- -1: Point is probably assigned to the wrong cluster

**Interpretation (rule of thumb for the average score):**
- > 0.7: Strong structure
- 0.5 - 0.7: Reasonable structure
- 0.25 - 0.5: Weak structure
- < 0.25: No substantial structure

---

In [None]:
# === Comprehensive Silhouette Analysis ===

# Apply K-Means with different K values
K_values = [2, 3, 4, 5, 6]

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

metrics_comparison = []

for idx, k in enumerate(K_values):
    # Fit K-Means
    kmeans_temp = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_temp = kmeans_temp.fit_predict(X)
    
    # Calculate metrics
    silhouette_avg = silhouette_score(X, labels_temp)
    sample_silhouette_values = silhouette_samples(X, labels_temp)
    
    metrics_comparison.append({
        'K': k,
        'Silhouette': silhouette_avg,
        'Davies-Bouldin': davies_bouldin_score(X, labels_temp),
        'Calinski-Harabasz': calinski_harabasz_score(X, labels_temp),
        'Inertia': kmeans_temp.inertia_
    })
    
    # Plot silhouette
    ax = axes[idx]
    y_lower = 10
    
    for i in range(k):
        ith_cluster_silhouette_values = sample_silhouette_values[labels_temp == i]
        ith_cluster_silhouette_values.sort()
        
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        
        color = plt.cm.nipy_spectral(float(i) / k)
        ax.fill_betweenx(np.arange(y_lower, y_upper),
                        0, ith_cluster_silhouette_values,
                        facecolor=color, edgecolor=color, alpha=0.7)
        
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10
    
    ax.set_title(f'K={k} (Silhouette: {silhouette_avg:.3f})')
    ax.set_xlabel('Silhouette Coefficient')
    ax.set_ylabel('Cluster')
    ax.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax.set_xlim([-0.2, 1])

# Hide last subplot
axes[-1].axis('off')

plt.tight_layout()
plt.show()

# Display comparison table
metrics_df = pd.DataFrame(metrics_comparison)
print("\n📊 Clustering Metrics Comparison:\n")
print(metrics_df.to_string(index=False))
print("\n💡 Best K based on Silhouette Score:", 
      metrics_df.loc[metrics_df['Silhouette'].idxmax(), 'K'])

---

# Part 6: Summary and Key Takeaways

## 🎓 What We've Learned

### Clustering Algorithms

| Algorithm | Best For | Pros | Cons |
|-----------|----------|------|------|
| **K-Means** | Spherical, well-separated clusters | Fast, scalable | Needs K, assumes spherical |
| **Hierarchical** | Small datasets, dendrogram needed | No K needed, intuitive | O(n³), not scalable |
| **DBSCAN** | Arbitrary shapes, noise detection | Finds outliers, no K | Parameter sensitive |
| **GMM** | Elliptical clusters, soft assignment | Probabilistic, flexible | Assumes Gaussian |

### Dimensionality Reduction

| Method | Purpose | Pros | Cons |
|--------|---------|------|------|
| **PCA** | Linear projection | Fast, interpretable | Linear only |
| **t-SNE** | Visualization | Great for viz | Slow, not for general dim reduction |
| **UMAP** | Visualization + DR | Fast, preserves structure | Newer, parameter tuning |
| **Autoencoder** | Complex patterns | Non-linear, flexible | Needs more data, tuning |
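
As a quick illustration of the PCA row, the sketch below projects the scaled Iris features onto two principal components and reports how much variance they retain. `X_iris_scaled`, `pca`, and `X_2d` are names assumed for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so no single feature dominates the components
X_iris_scaled = StandardScaler().fit_transform(load_iris().data)

pca = PCA(n_components=2).fit(X_iris_scaled)
X_2d = pca.transform(X_iris_scaled)

print("Projected shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
print(f"Total variance kept: {pca.explained_variance_ratio_.sum():.3f}")
```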

### Anomaly Detection

| Method | Approach | Best For |
|--------|----------|----------|
| **Statistical** | Z-score, IQR | Gaussian data, simple cases |
| **Isolation Forest** | Tree-based isolation | High-dim, scalable |
| **DBSCAN** | Density | Spatial data, clusters + outliers |
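
The first two rows of the table can be compared on a toy example. This is a sketch under stated assumptions: the data, the three planted outliers, the |z| > 3 threshold, and `contamination=0.02` are all illustrative choices, not tuned values.

```python
import numpy as np
from scipy.stats import zscore
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, (200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])  # planted anomalies
X_demo = np.vstack([inliers, outliers])

# Statistical: flag points with |z| > 3 in any feature
z_flags = (np.abs(zscore(X_demo)) > 3).any(axis=1)

# Isolation Forest: predict() returns -1 for anomalies and 1 for inliers
iso = IsolationForest(contamination=0.02, random_state=0).fit(X_demo)
iso_flags = iso.predict(X_demo) == -1

print("Z-score flagged indices:         ", np.flatnonzero(z_flags))
print("Isolation Forest flagged indices:", np.flatnonzero(iso_flags))
```

The planted points sit at indices 200 to 202, so both lists can be checked against them.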

---

## 🎯 Algorithm Selection Guide

```
START
  ↓
Do you know the number of clusters?
  ├─ YES → K-Means or GMM
  └─ NO → DBSCAN or Hierarchical
       ↓
Are clusters spherical?
  ├─ YES → K-Means (fastest)
  └─ NO → DBSCAN (arbitrary shapes)
       ↓
Do you need probability assignments?
  ├─ YES → GMM (soft clustering)
  └─ NO → K-Means (hard clustering)
       ↓
Do you need to detect outliers?
  ├─ YES → DBSCAN or Isolation Forest
  └─ NO → K-Means or GMM
```

---

## 📚 Key Principles

1. **Always scale your data** before distance-based clustering (tree-based methods such as Isolation Forest are insensitive to feature scale)
2. **Try multiple algorithms** - no single algorithm works for all data
3. **Use multiple metrics** - single metric can be misleading
4. **Domain validation is crucial** - metrics don't tell the whole story
5. **Visualize results** - 2D/3D plots reveal insights
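
The first principle can be demonstrated on synthetic data where the informative features have a small scale and an uninformative feature has a large one. The dataset below is made up for this sketch, and `y_demo`, `signal`, and `noise` are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
# Two small-scale features separate the groups; one large-scale feature is pure noise
signal = y_demo[:, None] + rng.normal(0, 0.1, (200, 2))
noise = rng.normal(0, 1000, (200, 1))
X_demo = np.hstack([signal, noise])

raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
scaled_labels = make_pipeline(
    StandardScaler(), KMeans(n_clusters=2, n_init=10, random_state=0)
).fit_predict(X_demo)

ari_raw = adjusted_rand_score(y_demo, raw_labels)
ari_scaled = adjusted_rand_score(y_demo, scaled_labels)
print(f"ARI without scaling: {ari_raw:.3f}")
print(f"ARI with scaling:    {ari_scaled:.3f}")
```

Without scaling, Euclidean distance is dominated by the noisy feature, so K-Means splits along it and ignores the real group structure.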

---

## 🚀 Next Steps

### Advanced Topics (Lecture 18+):
- Deep Autoencoders for complex patterns
- Variational Autoencoders (VAE) for generation
- Self-supervised learning
- Graph-based clustering
- Time-series clustering

### Practice Recommendations:
1. Apply to real-world datasets (Kaggle, UCI ML Repository)
2. Combine multiple techniques (dim reduction → clustering)
3. Build end-to-end pipelines
4. Experiment with parameters systematically

---

## 📖 Resources

**Documentation:**
- Scikit-learn: https://scikit-learn.org/stable/modules/clustering.html
- UMAP: https://umap-learn.readthedocs.io/

**Books:**
- "Hands-On Machine Learning" by Aurélien Géron
- "Pattern Recognition and Machine Learning" by Christopher Bishop

**Papers:**
- K-Means++: Arthur & Vassilvitskii (2007)
- DBSCAN: Ester et al. (1996)
- t-SNE: van der Maaten & Hinton (2008)
- UMAP: McInnes et al. (2018)

---

## 🎉 Congratulations!

You've completed the comprehensive hands-on practice for **Lecture 17: Clustering and Unsupervised Learning Fundamentals**!

You now have practical experience with:
- ✅ 5+ clustering algorithms
- ✅ Multiple evaluation metrics
- ✅ Dimensionality reduction techniques
- ✅ Anomaly detection methods
- ✅ Real-world applications

Keep practicing and exploring! 🚀

---

**Created by:** Ho-min Park  
**Contact:** homin.park@ghent.ac.kr | powersimmani@gmail.com  
**Date:** 2024  
**License:** Educational Use  

---