# Class 1: Introduction to Unsupervised Learning and Clustering

**Week 7: Unsupervised Learning and Advanced Data Analysis**

**Objective**: Understand unsupervised learning and apply k-means clustering to group data.

**Agenda**:
- Learn what unsupervised learning is and how it differs from supervised learning.
- Explore k-means clustering: how it works and its applications.
- Demo: Apply k-means to synthetic data.
- Exercise: Cluster a dataset and visualize the results.

Let’s dive into finding patterns in data without labels!

## 1. What is Unsupervised Learning?

- **Definition**: Unsupervised learning finds patterns in data without predefined labels or target outputs.
- **Contrast with Supervised Learning**:
  - Supervised: Predicts labels (e.g., classify emails as spam/not spam using labeled data).
  - Unsupervised: Groups or simplifies data (e.g., segment customers based on behavior).
- **Applications**:
  - Customer segmentation (grouping similar customers).
  - Anomaly detection (identifying unusual patterns).
  - Dimensionality reduction (compressing data into fewer features).
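
A minimal sketch of the supervised/unsupervised contrast above, assuming scikit-learn is installed. `LogisticRegression` is only an illustrative choice of supervised model (it is not used elsewhere in this class), and the variable names are made up for this example. The point is what each model is given: the supervised one sees `X` and `y`, the unsupervised one sees only `X`.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: X holds the features, y holds the known group of each point
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised: the model is fitted on both the features and the labels
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
predicted_labels = clf.predict(X)

# Unsupervised: the model is fitted on the features only
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
cluster_ids = km.labels_

print(predicted_labels[:5])  # class labels learned from y
print(cluster_ids[:5])       # cluster ids; their numbering need not match y
```

Because k-means never sees `y`, its cluster ids are arbitrary: cluster 0 may correspond to any of the original groups.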

**Clustering** is a key unsupervised learning technique where we group similar data points together. Today, we’ll focus on **k-means clustering**.

## 2. K-Means Clustering: How It Works

**Goal**: Divide data into *k* groups (clusters) so that points in the same cluster are close to each other (k-means measures closeness with Euclidean distance).

**Algorithm Steps**:
1. Choose *k* (number of clusters).
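2. Initialize *k* centroids (cluster centers), for example by picking *k* random data points. (scikit-learn’s `KMeans` defaults to a smarter scheme called `k-means++`.)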
3. Assign each data point to the nearest centroid.
4. Update centroids by computing the mean of points in each cluster.
5. Repeat steps 3–4 until the centroids stop moving (or a maximum number of iterations is reached).
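
The steps above can be sketched in plain NumPy. This is a toy version for intuition only, not scikit-learn's implementation; the function name `kmeans_sketch`, its parameters, and the tiny `X_demo` array are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Toy k-means following the numbered steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Tiny example: two well-separated groups of three points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(labels_demo)
print(centroids_demo)
```

The first three points should end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random initialization.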

**Choosing k**:
- Use the **elbow method**: Plot the inertia (the sum of squared distances from each point to its nearest centroid) against *k* and look for an “elbow” where adding more clusters yields little further benefit.
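
To make "inertia" concrete, here is a small sketch that recomputes it by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The data and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Each point's own centroid, looked up through its cluster label
own_centroid = km.cluster_centers_[km.labels_]

# Inertia: sum over all points of the squared distance to that centroid
manual_inertia = np.sum((X - own_centroid) ** 2)

print(manual_inertia, km.inertia_)  # the two values should agree closely
```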

**Applications**:
- Segmenting customers by purchasing patterns.
- Organizing documents by topics.

Let’s see it in action with a demo!

## 3. Demo: K-Means on Synthetic Data

We’ll generate a simple 2D dataset with clear clusters, apply k-means, and visualize the results.

**Setup**: Ensure you have the required libraries installed:
```bash
pip install numpy pandas scikit-learn matplotlib
```

In [None]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

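# Generate synthetic data with 3 clusters
# (make_blobs also returns the true group of each point; we discard it
# because clustering works without labels)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)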

# Visualize the data
plt.scatter(X[:, 0], X[:, 1], c='blue', s=50, alpha=0.5)
plt.title('Synthetic Data (Before Clustering)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In [None]:
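# Apply k-means clustering
# n_init=10 runs 10 initializations and keeps the best result; setting it
# explicitly keeps behavior consistent across scikit-learn versions,
# whose default for this parameter has changed
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(X)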

# Get cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='viridis', alpha=0.5)
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

**Discussion**:
- Notice how k-means grouped similar points together.
- The red X’s mark the centroids: each one is the mean of the points assigned to its cluster.
- What happens if we change *k*? Let’s try the elbow method.

In [None]:
# Elbow method to choose k
inertia = []
K = range(1, 10)  # k = 1, 2, ..., 9
for k in K:
    # Use a separate name so the fitted k=3 model in `kmeans` is not overwritten
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertia.append(km.inertia_)

# Plot elbow curve
plt.plot(K, inertia, 'bo-')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

**Observation**:
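- The “elbow” appears around *k=3*, suggesting that 3 clusters is a good choice for this data (which matches how the data was generated).
- Inertia keeps decreasing as *k* increases, so the lowest inertia is not the goal. We look for the *k* beyond which extra clusters bring only small improvements.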
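
Reading an elbow off a plot can be subjective. As a rough complement, the sketch below prints how much the inertia drops each time *k* grows by one. It regenerates the demo data so it runs on its own, and the names `ks` and `inertias` are assumptions for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Same kind of synthetic data as in the demo
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

ks = list(range(1, 10))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
]

# Relative drop in inertia when going from k-1 to k clusters
for k, prev, curr in zip(ks[1:], inertias[:-1], inertias[1:]):
    drop = (prev - curr) / prev
    print(f"k={k}: inertia={curr:.1f}, drop from k={k - 1}: {drop:.0%}")
```

Large drops followed by consistently small ones indicate the elbow. This is a heuristic, so treat the result as a suggestion rather than a definitive answer.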

## 4. Exercise: Cluster Your Own Data

Now it’s your turn! Use k-means to cluster a dataset and visualize the results.

**Task**:
- Use the synthetic dataset below (or the iris dataset if you prefer).
- Apply k-means with *k=3* and visualize the clusters.
- Experiment with different *k* values (e.g., 2, 4) and observe the results.
- Bonus: Run the elbow method to confirm the best *k*.
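
If you prefer the iris dataset, a minimal sketch for loading and clustering it could look like this. It assumes scikit-learn's bundled `load_iris`; the variable names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Load the iris measurements; the species labels are deliberately ignored
iris = load_iris()
X_iris = iris.data  # shape (150, 4): sepal length/width, petal length/width

kmeans_iris = KMeans(n_clusters=3, n_init=10, random_state=123)
labels_iris = kmeans_iris.fit_predict(X_iris)

# Number of flowers assigned to each cluster
for cluster_id in range(3):
    print(cluster_id, (labels_iris == cluster_id).sum())
```

Iris has four features, so a 2D scatter plot needs two of them, for example `X_iris[:, 2]` and `X_iris[:, 3]` (petal length and petal width).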

**Dataset**:
- We’ll generate new synthetic data to keep it fresh.

**Instructions**:
1. Run the code below to generate data.
2. Apply k-means and plot the clusters.
3. Try different *k* values and note what changes.
4. (Optional) Plot the elbow curve.

In [None]:
# Generate new synthetic data
X_exercise, _ = make_blobs(n_samples=400, centers=3, random_state=123)

# Visualize the data
plt.scatter(X_exercise[:, 0], X_exercise[:, 1], c='blue', s=50, alpha=0.5)
plt.title('Exercise: Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

In [None]:
# Your code here: Apply k-means with k=3
# Hint: Follow the demo steps
kmeans_ex = KMeans(n_clusters=3, n_init=10, random_state=123)
kmeans_ex.fit(X_exercise)

# Get labels and centroids
labels_ex = kmeans_ex.labels_
centroids_ex = kmeans_ex.cluster_centers_

# Plot the clusters
plt.scatter(X_exercise[:, 0], X_exercise[:, 1], c=labels_ex, s=50, cmap='viridis', alpha=0.5)
plt.scatter(centroids_ex[:, 0], centroids_ex[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title('Your K-Means Clustering (k=3)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In [None]:
# Experiment with different k (e.g., k=2, k=4)
# Your code here:

# Example for k=4
kmeans_ex4 = KMeans(n_clusters=4, n_init=10, random_state=123)
kmeans_ex4.fit(X_exercise)
labels_ex4 = kmeans_ex4.labels_
centroids_ex4 = kmeans_ex4.cluster_centers_

plt.scatter(X_exercise[:, 0], X_exercise[:, 1], c=labels_ex4, s=50, cmap='viridis', alpha=0.5)
plt.scatter(centroids_ex4[:, 0], centroids_ex4[:, 1], c='red', s=200, marker='X', label='Centroids')
plt.title('Your K-Means Clustering (k=4)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()

In [None]:
# Bonus: Elbow method
# Your code here:

# Example
inertia_ex = []
K_ex = range(1, 10)  # k = 1, 2, ..., 9
for k in K_ex:
    # Use a separate name so earlier fitted models are not overwritten
    km = KMeans(n_clusters=k, n_init=10, random_state=123)
    km.fit(X_exercise)
    inertia_ex.append(km.inertia_)

plt.plot(K_ex, inertia_ex, 'bo-')
plt.title('Elbow Method for Your Data')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')
plt.show()

## 5. Wrap-Up

**Key Takeaways**:
- Unsupervised learning helps find patterns without labels.
- K-means clusters data by minimizing the sum of squared distances between points and their cluster centroids (the inertia).
- The elbow method helps choose a reasonable number of clusters.

**Discussion Questions**:
- What did you notice when you changed *k*?
- How might clustering apply to real-world problems (e.g., customer segmentation)?

**Homework**:
- Explore the mall customer dataset (`Mall_Customers.csv`).
- Hypothesize what clusters might exist (e.g., based on spending or age).
- Bring your ideas to Class 2!
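
For a first look at the homework file, something like the sketch below may help. It assumes `Mall_Customers.csv` sits in your working directory and makes no assumptions about its column names; the helper `first_look` is hypothetical and written just for this example.

```python
from pathlib import Path
import pandas as pd

def first_look(csv_path):
    """Load a CSV, print a quick overview, and return the DataFrame."""
    df = pd.read_csv(csv_path)
    print(df.shape)       # (rows, columns)
    print(df.dtypes)      # which columns are numeric and which are text
    print(df.describe())  # summary statistics for the numeric columns
    return df

# Assumption: the file is in the current working directory
path = Path("Mall_Customers.csv")
if path.exists():
    customers = first_look(path)
else:
    print(f"{path} not found; download it or adjust the path first.")
```

The numeric columns reported by `describe()` are natural candidates for clustering features.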

Great work today! Next, we’ll simplify high-dimensional data with PCA.