# Clustering Tutorial: K-Means and DBSCAN

**Instructor:** Dr. Arun B Ayyar

**Date:** 17.02.2025 AN

---

This notebook contains practical exercises on K-Means and DBSCAN clustering algorithms with diverse datasets.

Each problem includes:
- **Problem statement** with dataset description
- **Hints** including scikit-learn functions to use
- **Solution cell** for your own code and visualizations

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import make_blobs, make_moons, make_circles, load_iris, load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, davies_bouldin_score
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 5)

## Problem 1: Iris Dataset (4D → 2D with PCA)

**Dataset:** Iris flower measurements (150 samples, 4 features)

**Tasks:** The data is already loaded for you, so focus on:
1. Standardize the features
2. Use the elbow method to find optimal k (test k=2 to 10)
3. Apply K-Means with optimal k
4. Reduce to 2D using PCA for visualization
5. Visualize clusters and calculate silhouette score

**Hints:**
- Use `load_iris()` from sklearn.datasets
- Use `StandardScaler()` for standardization
- Use `KMeans(n_clusters=k, random_state=42)` for clustering
- Plot inertia (WCSS) vs k for elbow method
- Use `PCA(n_components=2)` for dimensionality reduction
- Use `silhouette_score()` to evaluate clustering quality
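
The steps above could be sketched roughly as follows. This is one possible outline rather than the reference solution: the choice `k_chosen = 3` is an assumption to check against your own elbow plot, and `n_init=10` is set explicitly only to keep results stable across scikit-learn versions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Standardize so every feature contributes equally to Euclidean distances
X_scaled = StandardScaler().fit_transform(load_iris().data)

# Elbow method: record inertia (WCSS) for k = 2..10
inertias = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = model.inertia_

# Assumption: the elbow is read off at k = 3; confirm this on your own plot
k_chosen = 3
labels = KMeans(n_clusters=k_chosen, n_init=10, random_state=42).fit_predict(X_scaled)

# Score in the standardized 4D space; PCA is used only for plotting
score = silhouette_score(X_scaled, labels)
X_2d = PCA(n_components=2).fit_transform(X_scaled)

print(f"silhouette (k={k_chosen}): {score:.3f}")
```

Plot `inertias` against k for the elbow curve, and pass `X_2d[:, 0]` and `X_2d[:, 1]` with `c=labels` to `plt.scatter` for the cluster view.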

In [None]:
# Load and prepare data
iris = load_iris()
X_iris = iris.data
y_true = iris.target




In [None]:
# Standardize
# Your code here
# Apply clustering algorithm and visualize results


## Problem 2: Customer Segmentation (Synthetic Data)

**Dataset:** Customer data with Annual Income and Spending Score (200 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Use elbow method to find optimal k
2. Apply K-Means clustering
3. Visualize and interpret customer segments
4. Provide business insights for each segment

**Hints:**
- Use `make_blobs(n_samples=200, n_features=2, centers=5, cluster_std=1.0, random_state=42)`
- Features represent: [Annual Income (k$), Spending Score (1-100)]
- Use `KMeans(n_clusters=k, random_state=42)`
- Calculate cluster statistics (mean income, mean spending) for each segment
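
The last hint, computing per-segment statistics, might look something like the sketch below. It regenerates the data the same way as the notebook's data cell; `k = 5` is an assumption taken from the generator settings and should be confirmed with the elbow method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same generation and scaling as the notebook's data cell
X, _ = make_blobs(n_samples=200, n_features=2, centers=5, cluster_std=1.0, random_state=42)
X = X * 10 + 50

# Assumption: k = 5 matches the generator; verify with the elbow method
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Per-segment statistics: size, mean income, mean spending score
segment_stats = []
for seg in range(kmeans.n_clusters):
    members = X[kmeans.labels_ == seg]
    segment_stats.append((seg, len(members), members[:, 0].mean(), members[:, 1].mean()))

for seg, size, income, spending in segment_stats:
    print(f"segment {seg}: n={size:3d}  income={income:7.1f}  spending={spending:7.1f}")
```

Reading the means side by side (for example, high income with low spending) is a natural starting point for the business interpretation in task 4.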

In [None]:
# Generate customer data
X_customers, _ = make_blobs(n_samples=200, n_features=2, centers=5, cluster_std=1.0, random_state=42)

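# Shift and scale both features. make_blobs draws centers from [-10, 10] by
# default, so cluster centers land anywhere in [-50, 150] (negatives possible).
X_customers[:, 0] = X_customers[:, 0] * 10 + 50  # Annual Income (k$)
X_customers[:, 1] = X_customers[:, 1] * 10 + 50  # Spending Score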


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 3: Anisotropic Blobs (Stretched Clusters)

**Dataset:** 3 clusters with different shapes and orientations (300 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Apply K-Means with k=3
2. Visualize the results
3. Analyze why K-Means struggles with this data

**Hints:**
- Use `make_blobs()` with `cluster_std=[1.0, 2.5, 0.5]` for different cluster spreads
- Apply transformation matrix to create anisotropy
- Use `KMeans(n_clusters=3, random_state=42)`
- Compare cluster assignments with true labels
- Note: K-Means assumes spherical clusters!
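
A minimal sketch of the transformation and comparison hints follows. The 2x2 matrix values are an illustrative assumption, not taken from the notebook; any non-orthogonal matrix will stretch the blobs. The adjusted Rand index (`adjusted_rand_score`) is used here as one way to compare assignments with the true labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, y_true = make_blobs(n_samples=300, n_features=2, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=42)

# Hypothetical 2x2 matrix (illustrative values): shears and rotates the blobs
transformation = np.array([[0.6, -0.6],
                           [-0.4, 0.8]])
X_stretched = X @ transformation

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_stretched)

# ARI is 1.0 for a perfect match with the true labels and near 0 for random labels
ari = adjusted_rand_score(y_true, labels)
print(f"ARI vs. true labels: {ari:.3f}")
```

Plotting `X_stretched` twice, once colored by `y_true` and once by `labels`, makes any misassigned regions easy to spot.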

In [None]:
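# Generate blobs with unequal spreads. Note: no linear transformation is applied
# here, so the clusters are not stretched yet; apply a 2x2 matrix (see hints).
X_aniso, y_aniso = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=42)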


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 4: Concentric Circles (Non-Convex Clusters)

**Dataset:** Two concentric circles (400 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Apply K-Means with k=2
2. Visualize and analyze the failure
3. Explain why K-Means fails on this data

**Hints:**
- Use `make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)`
- Use `KMeans(n_clusters=2, random_state=42)`
- K-Means cannot handle non-convex shapes!
- This is where DBSCAN or Spectral Clustering excel
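
One way to quantify the failure, sketched under the assumption that the adjusted Rand index is an acceptable measure of agreement with the true rings: since both rings share a center, their centroids nearly coincide, and K-Means has no centroid pair that separates them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X, y_true = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
ari = adjusted_rand_score(y_true, kmeans.labels_)

# Both rings share the same center, so their true centroids nearly coincide
true_centroids = np.array([X[y_true == c].mean(axis=0) for c in (0, 1)])
centroid_gap = np.linalg.norm(true_centroids[0] - true_centroids[1])

print(f"ARI vs. true rings: {ari:.3f}")
print(f"distance between true ring centroids: {centroid_gap:.3f}")
```

K-Means assigns each point to its nearest centroid, which splits the plane with a straight boundary; no straight line can separate an inner ring from an outer one.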

In [None]:
# Generate concentric circles
X_circles, y_circles = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 5: Crescent Moons (Non-Convex Shapes)

**Dataset:** Two interleaving crescent moons (300 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Apply both K-Means and DBSCAN
2. Compare their performance
3. Tune DBSCAN parameters (eps and min_samples)

**Hints:**
- Use `make_moons(n_samples=300, noise=0.05, random_state=42)`
- For K-Means: `KMeans(n_clusters=2, random_state=42)`
- For DBSCAN: `DBSCAN(eps=0.2, min_samples=5)`
- Try different eps values: [0.1, 0.2, 0.3]
- Noise points in DBSCAN are labeled as -1
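
The comparison and the `eps` sweep from the hints could be sketched as below. The adjusted Rand index is an assumption about how to compare with the true moons; note that it treats the DBSCAN noise label `-1` as just another group.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=300, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
km_ari = adjusted_rand_score(y_true, km_labels)
print(f"K-Means  ARI={km_ari:.3f}")

results = {}
for eps in [0.1, 0.2, 0.3]:
    db_labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(db_labels) - {-1})   # -1 marks noise, not a cluster
    n_noise = int((db_labels == -1).sum())
    ari = adjusted_rand_score(y_true, db_labels)
    results[eps] = (n_clusters, n_noise, ari)
    print(f"DBSCAN eps={eps}: clusters={n_clusters} noise={n_noise} ARI={ari:.3f}")
```

Watching how the cluster and noise counts change with `eps` shows the trade-off: too small and the moons fragment or dissolve into noise, too large and they can merge.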

In [None]:
# Generate moons data
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 6: Concentric Circles with DBSCAN

**Dataset:** Two concentric circles (same as Problem 4)

**Tasks:** The data is already generated for you (reused from Problem 4), so focus on:
1. Use the concentric circles data
2. Apply DBSCAN with appropriate parameters
3. Compare with K-Means results
4. Experiment with different eps values

**Hints:**
- Use `make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)`
- Try `DBSCAN(eps=0.15, min_samples=5)`
- Visualize how eps affects clustering
- DBSCAN should separate inner and outer circles correctly
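
Besides trial and error, a common heuristic for choosing `eps` is to look at each point's distance to its k-th nearest neighbor. The sketch below is not part of the original hints; it uses `NearestNeighbors` from scikit-learn and assumes `min_samples=5` as suggested above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles
from sklearn.neighbors import NearestNeighbors

X, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=42)
min_samples = 5

# Querying the training data returns each point as its own nearest neighbor
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
distances, _ = NearestNeighbors(n_neighbors=min_samples).fit(X).kneighbors(X)
k_dist = np.sort(distances[:, -1])

for q in (50, 90, 99):
    print(f"{q}th percentile of k-distance: {np.percentile(k_dist, q):.3f}")

db = DBSCAN(eps=0.15, min_samples=min_samples).fit(X)
n_clusters = len(set(db.labels_) - {-1})
n_core = len(db.core_sample_indices_)
print(f"eps=0.15: clusters={n_clusters}, core points={n_core}, noise={(db.labels_ == -1).sum()}")
```

A point is a core point exactly when its k-distance is at most `eps`, so plotting the sorted `k_dist` values and picking `eps` near the bend keeps most points as core points while leaving sparse stragglers as noise.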

In [None]:
# We already have X_circles and y_circles from Problem 4

In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 7: Clusters with Varied Density

**Dataset:** 3 clusters with different densities (400 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Apply K-Means and DBSCAN
2. Compare their ability to handle density variations
3. Analyze the trade-offs

**Hints:**
- Use `make_blobs()` with `cluster_std=[0.5, 1.5, 3.0]` for different densities
- K-Means: `KMeans(n_clusters=3, random_state=42)`
- DBSCAN: `DBSCAN(eps=0.5, min_samples=5)`
- DBSCAN may struggle with varied densities (single eps for all clusters)
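
The single-`eps` limitation in the last hint can be made concrete by checking how much of each true cluster DBSCAN discards as noise. This is a sketch, not the reference solution; it passes `centers=None` because `make_blobs` requires that (or an explicit array of centers) when `n_samples` is a list.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# centers=None lets make_blobs draw one center per entry in n_samples
stds = [0.5, 1.5, 3.0]
X, y_true = make_blobs(n_samples=[100, 150, 150], n_features=2, centers=None,
                       cluster_std=stds, random_state=42)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(f"K-Means ARI: {adjusted_rand_score(y_true, km_labels):.3f}")

# Fraction of each true cluster that DBSCAN discards as noise (-1)
noise_fraction = [float(np.mean(db_labels[y_true == c] == -1)) for c in range(3)]
for std, frac in zip(stds, noise_fraction):
    print(f"cluster_std={std}: {frac:.0%} labeled as noise by DBSCAN")
```

An `eps` small enough to keep the tight cluster compact leaves few neighbors around points in the widest cluster, so most of that cluster fails the `min_samples` test.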

In [None]:
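# Generate varied density data
# With a list for n_samples, centers must be None or an array of centers
# (an int raises ValueError); None draws one random center per cluster.
X_varied, y_varied = make_blobs(n_samples=[100, 150, 150], n_features=2, centers=None,
                                cluster_std=[0.5, 1.5, 3.0], random_state=42)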


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


## Problem 8: Data with Outliers/Noise

**Dataset:** 3 well-separated clusters + random noise points (350 samples)

**Tasks:** The data is already generated for you, so focus on:
1. Apply K-Means and DBSCAN
2. Compare outlier detection capabilities
3. Analyze robustness to noise

**Hints:**
- Use `make_blobs()` for main clusters + `np.random.uniform()` for noise
- K-Means: `KMeans(n_clusters=3, random_state=42)`
- DBSCAN: `DBSCAN(eps=0.5, min_samples=5)`
- K-Means assigns all points to clusters (no outlier detection)
- DBSCAN labels outliers as -1 (noise)
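
The last two hints can be checked directly, as in the sketch below. It rebuilds the data the same way as the notebook's data cell; the boolean mask `is_injected` is a helper name introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Same generation as the notebook's data cell
np.random.seed(42)
X_clean, _ = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=0.6, random_state=42)
noise = np.random.uniform(low=X_clean.min() - 2, high=X_clean.max() + 2, size=(50, 2))
X = np.vstack([X_clean, noise])
is_injected = np.arange(len(X)) >= len(X_clean)   # True for the 50 added points

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
flagged = db_labels == -1

# K-Means never emits -1, so every injected point lands in some cluster
print("K-Means labels:", sorted(set(km_labels.tolist())))
print(f"DBSCAN flagged {flagged.sum()} points as noise; "
      f"{(flagged & is_injected).sum()} of the 50 injected points among them")
```

Comparing `flagged` with `is_injected` also reveals the two kinds of error: injected points that happen to fall inside a dense region and are absorbed, and genuine fringe points that DBSCAN discards.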

In [None]:
# Generate data with noise
np.random.seed(42)
X_clean, y_clean = make_blobs(n_samples=300, n_features=2, centers=3, cluster_std=0.6, random_state=42)

# Add noise points
noise_points = np.random.uniform(low=X_clean.min()-2, high=X_clean.max()+2, size=(50, 2))
X_noisy = np.vstack([X_clean, noise_points])
y_noisy = np.hstack([y_clean, np.full(50, -1)])  # -1 for noise


In [None]:
# Your code here
# Apply clustering algorithm and visualize results


---
# Summary: When to Use Each Algorithm

## K-Means
✅ **Use when:**
- Clusters are spherical/globular
- Clusters are similar in size
- Number of clusters is known
- Fast computation is needed
- Data has no outliers

❌ **Avoid when:**
- Clusters have arbitrary shapes (moons, circles)
- Clusters have very different sizes or densities
- Data contains many outliers
- Number of clusters is unknown

## DBSCAN
✅ **Use when:**
- Clusters have arbitrary shapes
- Number of clusters is unknown
- Data contains outliers/noise
- Outlier detection is important
- Clusters are well-separated by density

❌ **Avoid when:**
- Clusters have very different densities
- Data is high-dimensional (distances become less informative: the curse of dimensionality)
- Parameters (eps, min_samples) are hard to tune
- All points must be assigned to clusters

## Key Takeaways
1. **No single algorithm is best for all data**
2. **Visualize your data** before choosing an algorithm
3. **K-Means**: Fast, simple, but assumes spherical clusters
4. **DBSCAN**: Flexible shapes, detects outliers, but sensitive to parameters
5. **Always evaluate** with metrics (e.g., silhouette score) and visual inspection
6. **Consider domain knowledge** when interpreting results
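
Takeaway 5 could be put into practice with a small helper like the one below. `internal_scores` is a hypothetical name introduced for this sketch; it drops DBSCAN noise points before scoring, which is one reasonable convention rather than the only one.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import davies_bouldin_score, silhouette_score


def internal_scores(X, labels):
    """Return (silhouette, Davies-Bouldin) over non-noise points, or None."""
    keep = labels != -1                      # drop DBSCAN noise before scoring
    if len(set(labels[keep])) < 2:           # both metrics need >= 2 clusters
        return None
    return (silhouette_score(X[keep], labels[keep]),
            davies_bouldin_score(X[keep], labels[keep]))


X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
km_scores = internal_scores(X, KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X))
db_scores = internal_scores(X, DBSCAN(eps=0.2, min_samples=5).fit_predict(X))

print("K-Means (silhouette, Davies-Bouldin):", km_scores)
print("DBSCAN  (silhouette, Davies-Bouldin):", db_scores)
```

Both metrics reward compact, convex clusters, so on non-convex data they may rate a K-Means split above a DBSCAN result that follows the true shapes. That is why the numbers should always be read alongside a plot.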

---
**End of Tutorial**