# DATASCI 503, Group Work 9: PCA and Clustering

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. During lab, feel free to flag down your GSI to ask questions at any point!

## Review of PCA and Clustering

First, let's review how to use Python for PCA and clustering. We start with some theory about PCA.

### PCA


- PCA replaces the original $p$ variables with $d<p$ linear combinations of the original variables that are a “good representation” of the data.


- Mathematical formulation of PCA: The problem is to find the $k$th new variable $Z_k$, a linear combination of the original variables $X_1, X_2, \cdots, X_p$ (i.e., $Z_k=\sum_{j=1}^p w_{k j} X_j$), that maximizes
  $$
  w_k^T \Sigma w_k \quad \text{subject to } w_k^T w_k=1, \; w_k^T w_{k^{\prime}}=0 \text{ for all } k^{\prime}<k,
  $$
  where $\Sigma$ is the covariance matrix of $(X_1, \ldots, X_p)$.

  The solution to the above problem is given by the eigendecomposition of $\Sigma$:
  $\Sigma=W \Lambda W^T$, where $W$ is a $p \times p$ matrix of (column) eigenvectors and $\Lambda$ is a $p \times p$ diagonal matrix of eigenvalues.


- PCA in practice: Let $X_{n \times p}$ now be the $n \times p$ data matrix (centered). Each data point is represented by a row.
    - Compute the sample covariance matrix $\hat{\Sigma}= \frac{1}{n-1}X^{\top} X$
    - Vectors $w_k$ 's are the eigenvectors of $\hat{\Sigma}$ and are called **PC directions**. The coordinates $w_{k j}$ are called **(factor) loadings**.
    - Vectors $z_k=X w_k$ $(k=1, \ldots, d)$ are called the **principal components** of $X$ and are projections of the data onto the PC directions. The entries of $X w_k$ are also called **scores**.
    - $\operatorname{var}\left(X w_k\right)=\lambda_k$, where $\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_p \geq 0$ are the eigenvalues of $\hat{\Sigma}$.
    
    
- How to pick the number of PCs?
  - For visualization, can only use 2 or 3
  - Can choose $d$ to explain a certain percentage of the variation: pick the smallest $d$ such that $\frac{\sum_{k=1}^d \lambda_k}{\sum_{k=1}^p \lambda_k} \geq (1-\alpha)$
    for some pre-specified small $\alpha$ (e.g., 0.1)
  - Scree plot: plot $\lambda_k$ or $\sqrt{\lambda_k}$ against $k$ and look for an 'elbow'
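
The "PCA in practice" steps above can be sketched directly in NumPy. This is a minimal illustration on synthetic data (the data and the names `Xc`, `Sigma_hat`, `W`, and `Z` are assumptions made for this example, not part of the lab's datasets): center the data, eigendecompose the sample covariance, and check that the variance of each score column matches its eigenvalue.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 points, 4 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Center the columns, then form the sample covariance matrix.
Xc = X - X.mean(axis=0)
n = Xc.shape[0]
Sigma_hat = Xc.T @ Xc / (n - 1)

# eigh returns eigenvalues in ascending order; reorder so lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
W = eigvecs[:, order]  # columns are the PC directions w_k

# Scores: projections of the centered data onto the PC directions.
Z = Xc @ W

# The sample variance of the k-th score column should equal lambda_k.
score_var = Z.var(axis=0, ddof=1)
print(np.round(eigvals, 4))
print(np.round(score_var, 4))
```

The two printed vectors should agree up to floating-point error, which is the statement $\operatorname{var}(X w_k)=\lambda_k$ from the list above.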

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

warnings.filterwarnings("ignore")

Now we use the crabs dataset to practice PCA.

In [None]:
crabs = pd.read_csv("crabs.csv", index_col=0)
crabs.index = crabs["index"]
crabs.drop(columns=["index"], inplace=True)

In [None]:
crabs.shape

``sp``: species - "B" or "O" for blue or orange.

``sex``: male or female

``index``: index 1:50 within each of the four groups.

``FL``: frontal lobe size (mm).

``RW``: rear width (mm).

``CL``: carapace length (mm).

``CW``: carapace width (mm).

``BD``: body depth (mm).

In [None]:
crabs.head()

In [None]:
crabs.describe()

In [None]:
crabs.describe(include="object")

Now, we use the 4 numerical (continuous) variables to perform PCA.

In [None]:
X = crabs.iloc[:, 3:]
X.head()

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(svd_solver="full")
pca.fit(X)

In [None]:
# With n_components unset, PCA keeps min(n_samples, n_features) components
pca.n_components_

We can get the amount of variance explained by each of the selected components by using the attribute ``explained_variance_``. Note that it corresponds to the eigenvalue $\lambda_k$.

In [None]:
pca.explained_variance_

In [None]:
plt.plot(range(1, pca.n_components_ + 1), pca.explained_variance_, "-o")
plt.xticks(
    range(1, pca.n_components_ + 1)
)  # This line makes sure that the x-axis only shows the integer values
plt.xlabel("k-th PC")
plt.ylabel(r"Variance by each component ($\lambda_k$) ")
plt.show()

The proportion of variance explained by each of the selected components is available in ``explained_variance_ratio_``:

In [None]:
np.round(pca.explained_variance_ratio_, 3)
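
These ratios can be accumulated to apply the "explain a certain percentage of variation" rule for choosing $d$. The sketch below uses made-up eigenvalues (an assumption for illustration, not values from the crabs data) and $\alpha = 0.1$.

```python
import numpy as np

# Hypothetical eigenvalues in decreasing order (not taken from the crabs data).
eigvals = np.array([6.0, 2.5, 1.0, 0.3, 0.2])
alpha = 0.1

# Proportion of variance per component, then the running total.
ratio = eigvals / eigvals.sum()
cum_ratio = np.cumsum(ratio)

# Smallest d whose cumulative proportion reaches 1 - alpha.
# argmax returns the first index where the condition is True.
d = int(np.argmax(cum_ratio >= 1 - alpha)) + 1
print(np.round(cum_ratio, 2))
print(d)
```

With a fitted scikit-learn model, `pca.explained_variance_ratio_` would play the role of `ratio`.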

The transformed data (the scores) can be calculated as follows:

In [None]:
Z = pca.transform(X)
print(Z.shape)
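
To connect `transform` back to the formula $z_k = X w_k$, the following sketch reproduces the scores by hand on synthetic data (the names `X_toy` and `pca_toy` are assumptions for this example). It relies on the fitted attributes `mean_` and `components_`, where each row of `components_` is one PC direction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (an assumption for illustration).
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

pca_toy = PCA(svd_solver="full").fit(X_toy)
Z_sklearn = pca_toy.transform(X_toy)

# Manual projection: center with the fitted mean, then multiply by the PC directions.
# components_ stores one direction per row, hence the transpose.
Z_manual = (X_toy - pca_toy.mean_) @ pca_toy.components_.T

print(np.allclose(Z_sklearn, Z_manual))
# The sample variance of each score column matches explained_variance_.
print(np.allclose(Z_sklearn.var(axis=0, ddof=1), pca_toy.explained_variance_))
```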

Now we try to visualize the first two dimensions based on the PCA results.

In [None]:
sns.scatterplot(x=Z[:, 0], y=Z[:, 1], hue=crabs["sex"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

In [None]:
sns.scatterplot(x=Z[:, 0], y=Z[:, 1], hue=crabs["sp"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

### Clustering

A dendrogram is a diagram representing a tree. Plotly's figure factory function `create_dendrogram` performs hierarchical clustering on the data and draws the resulting tree. Values on the tree depth axis correspond to distances between clusters.

Dendrogram plots are commonly used in computational biology to show the clustering of genes or samples, sometimes in the margin of heatmaps.

In [None]:
import numpy as np
import plotly.figure_factory as ff

np.random.seed(1)

dendrogram_data = np.random.rand(15, 12)  # 15 samples, with 12 dimensions each
fig = ff.create_dendrogram(dendrogram_data)
fig.update_layout(width=800, height=500)
fig.show()

We can also manually set the height at which branches are colored as separate clusters, via the `color_threshold` argument.

In [None]:
dendrogram_data = np.random.rand(15, 10)  # 15 samples, with 10 dimensions each
fig = ff.create_dendrogram(dendrogram_data, color_threshold=1.5)
fig.update_layout(width=800, height=500)
fig.show()
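
A dendrogram only draws the tree; to obtain actual cluster labels, the tree has to be cut at some height. The sketch below shows one way to do this with SciPy's `linkage` and `fcluster`. Complete linkage, the cut height of 5, and the synthetic two-group data are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated groups of 10 points each in 2 dimensions (synthetic).
group_a = rng.normal(loc=0.0, scale=0.5, size=(10, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(10, 2))
points = np.vstack([group_a, group_b])

# Agglomerative clustering with complete linkage on Euclidean distances.
merge_tree = linkage(points, method="complete", metric="euclidean")

# Cut the tree at height 5: points stay together only if they merged below that height.
cut_labels = fcluster(merge_tree, t=5.0, criterion="distance")
print(cut_labels)
```

Lowering the cut height splits the data into more clusters, and raising it above the final merge height returns a single cluster.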

Now we practice another clustering method: K-means clustering. We work with a small synthetic dataset.

In [None]:
import matplotlib.pyplot as plt

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()

"Inertia," also known as the within-cluster sum of squares (WSS), is the sum of squared distances between each point and the centroid of its cluster. It is equal to one-half the within-cluster variation (i.e., the sum of squared distances between all ordered pairs of points in the same cluster, divided by the size of the cluster, summed over all clusters).

Now we use the elbow method to visualize the inertia for different values of K:

In [None]:
from sklearn.cluster import KMeans

data = list(zip(x, y))
inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker="o")
plt.title("Elbow method")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")
plt.show()

The elbow method shows that 2 is a good value for K, so we retrain and visualize the result using K=2.

In [None]:
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()
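
As a numerical check of the inertia definition given earlier in this section, the sketch below compares three quantities on synthetic data: scikit-learn's `inertia_`, a hand-computed WSS, and one-half of the pairwise within-cluster variation. The data and variable names are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

# Two synthetic, well-separated groups (an assumption for illustration).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, size=(20, 2)), rng.normal(6, 1, size=(25, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

wss_manual = 0.0
pairwise_variation = 0.0
for k in np.unique(km.labels_):
    members = pts[km.labels_ == k]
    # Squared distances from each member to the cluster mean.
    wss_manual += np.sum((members - members.mean(axis=0)) ** 2)
    # Squared distances over all ordered pairs in the cluster, divided by cluster size.
    sq_dists = pairwise_distances(members) ** 2
    pairwise_variation += sq_dists.sum() / len(members)

print(km.inertia_, wss_manual, pairwise_variation / 2)
```

All three printed numbers should agree up to floating-point error.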

### Clustering Evaluation Metrics: TSS, WSS, and BSS

When evaluating clustering quality, we use several metrics based on sum of squares:

**Notation:**
- Let $x_j\in\mathbb{R}^d$, for $j=1,\ldots,n$, be data points.
- Let $C_1,\ldots,C_K$ be a partition of the indices $1,\ldots,n$ into disjoint clusters.
- For cluster $C_k$ with size $n_k=|C_k|$, define the cluster mean: $\bar{x}_{(k)} = \frac{1}{n_k}\sum_{j\in C_k} x_j$
- Define the grand mean: $\bar{x} = \frac{1}{n} \sum_{j=1}^n x_j$

**Metrics:**

- **Total Sum of Squares (TSS)**: measures spread of points from the grand mean
$$\mathrm{TSS} = \sum_{j=1}^n \|x_j - \bar{x}\|^2$$

- **Within-Cluster Sum of Squares (WSS)**: measures spread of points around their respective cluster means
$$\mathrm{WSS} = \sum_{k=1}^K \sum_{j\in C_k} \|x_j - \bar{x}_{(k)}\|^2$$

- **Between-Cluster Sum of Squares (BSS)**: measures spread of cluster means around the grand mean
$$\mathrm{BSS} = \sum_{k=1}^K n_k \|\bar{x}_{(k)} - \bar{x}\|^2$$

These satisfy the identity: $\mathrm{TSS} = \mathrm{WSS} + \mathrm{BSS}$

**Key insight:** TSS is fixed for a dataset, while WSS and BSS depend on cluster assignment. Good clustering has low WSS (tight clusters) and high BSS (well-separated centers).
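
The identity $\mathrm{TSS} = \mathrm{WSS} + \mathrm{BSS}$ follows from the definitions above. As a short derivation sketch, split each deviation from the grand mean at the cluster mean. For $j \in C_k$,

$$
\|x_j - \bar{x}\|^2 = \|x_j - \bar{x}_{(k)}\|^2 + \|\bar{x}_{(k)} - \bar{x}\|^2 + 2\,(x_j - \bar{x}_{(k)})^\top(\bar{x}_{(k)} - \bar{x}).
$$

Summing over $j \in C_k$, the cross term vanishes because $\sum_{j\in C_k}(x_j - \bar{x}_{(k)}) = 0$, which leaves

$$
\sum_{j\in C_k} \|x_j - \bar{x}\|^2 = \sum_{j\in C_k}\|x_j - \bar{x}_{(k)}\|^2 + n_k \|\bar{x}_{(k)} - \bar{x}\|^2 .
$$

Summing this over $k = 1, \ldots, K$ gives $\mathrm{TSS}$ on the left and $\mathrm{WSS} + \mathrm{BSS}$ on the right.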

In [None]:
# Visualizing different clustering scenarios
np.random.seed(42)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top-Left: High BSS, Low WSS (OPTIMAL)
ax = axes[0, 0]
c1 = np.random.randn(50, 2) * 0.3 + [1, 1]
c2 = np.random.randn(50, 2) * 0.3 + [4, 1]
c3 = np.random.randn(50, 2) * 0.3 + [2.5, 4]
ax.scatter(c1[:, 0], c1[:, 1], c="red", s=30, alpha=0.6)
ax.scatter(c2[:, 0], c2[:, 1], c="blue", s=30, alpha=0.6)
ax.scatter(c3[:, 0], c3[:, 1], c="green", s=30, alpha=0.6)
ax.set_title("High BSS, Low WSS\n(OPTIMAL)", fontsize=12, fontweight="bold")
ax.set_xlim(-1, 6)
ax.set_ylim(-1, 6)

# Top-Right: High BSS, High WSS (SEPARATED BUT LOOSE)
ax = axes[0, 1]
c1 = np.random.randn(50, 2) * 0.8 + [1, 1]
c2 = np.random.randn(50, 2) * 0.8 + [4, 1]
c3 = np.random.randn(50, 2) * 0.8 + [2.5, 4]
ax.scatter(c1[:, 0], c1[:, 1], c="red", s=30, alpha=0.6)
ax.scatter(c2[:, 0], c2[:, 1], c="blue", s=30, alpha=0.6)
ax.scatter(c3[:, 0], c3[:, 1], c="green", s=30, alpha=0.6)
ax.set_title("High BSS, High WSS\n(SEPARATED BUT LOOSE)", fontsize=12, fontweight="bold")
ax.set_xlim(-1, 6)
ax.set_ylim(-1, 6)

# Bottom-Left: Low BSS, Low WSS (TIGHT BUT OVERLAPPING)
ax = axes[1, 0]
c1 = np.random.randn(50, 2) * 0.3 + [2, 2]
c2 = np.random.randn(50, 2) * 0.3 + [2.5, 2.5]
c3 = np.random.randn(50, 2) * 0.3 + [3, 2]
ax.scatter(c1[:, 0], c1[:, 1], c="red", s=30, alpha=0.6)
ax.scatter(c2[:, 0], c2[:, 1], c="blue", s=30, alpha=0.6)
ax.scatter(c3[:, 0], c3[:, 1], c="green", s=30, alpha=0.6)
ax.set_title("Low BSS, Low WSS\n(TIGHT BUT OVERLAPPING)", fontsize=12, fontweight="bold")
ax.set_xlim(-1, 6)
ax.set_ylim(-1, 6)

# Bottom-Right: Low BSS, High WSS (POOR)
ax = axes[1, 1]
c1 = np.random.randn(50, 2) * 0.8 + [2, 2]
c2 = np.random.randn(50, 2) * 0.8 + [2.5, 2.5]
c3 = np.random.randn(50, 2) * 0.8 + [3, 2]
ax.scatter(c1[:, 0], c1[:, 1], c="red", s=30, alpha=0.6)
ax.scatter(c2[:, 0], c2[:, 1], c="blue", s=30, alpha=0.6)
ax.scatter(c3[:, 0], c3[:, 1], c="green", s=30, alpha=0.6)
ax.set_title("Low BSS, High WSS\n(POOR CLUSTERING)", fontsize=12, fontweight="bold")
ax.set_xlim(-1, 6)
ax.set_ylim(-1, 6)

plt.tight_layout()
plt.show()

### How K-Means Works

K-Means is one of the most popular clustering algorithms. The goal is to partition $n$ data points into $K$ clusters, where each point belongs to the cluster with the nearest **mean** (center).

**The Algorithm:**

Given data points $x_1, \ldots, x_n \in \mathbb{R}^d$ and desired number of clusters $K$:

1. **Initialize**: Randomly select $K$ points as initial cluster centers $\mu_1, \ldots, \mu_K$

2. **Assignment**: Assign each point to the nearest center:
$$C_k = \{i : \|x_i - \mu_k\| \leq \|x_i - \mu_j\| \text{ for all } j\}$$

3. **Update**: Recalculate centers as the mean of assigned points:
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$

4. **Repeat** Steps 2-3 until convergence (labels stop changing)

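**Connection to WSS:** K-Means is a descent method for WSS: neither the assignment step nor the update step can increase WSS, so WSS is non-increasing across iterations until convergence. The result is a local minimum of WSS, which is not guaranteed to be the global minimum.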
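
The behavior of WSS across iterations can be observed directly. The sketch below is a compact version of the loop described above, run on synthetic data, that records WSS after every assignment step and every update step. The data, the helper `wss_given`, and the fixed 20 iterations are assumptions made for this illustration.

```python
import numpy as np

# Three synthetic groups along the diagonal (an assumption for illustration).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(c, 1.0, size=(40, 2)) for c in (0.0, 5.0, 10.0)])
K = 3


def wss_given(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return float(np.sum((points - centers[labels]) ** 2))


# Initialize centers as K distinct data points.
centers = pts[rng.choice(len(pts), size=K, replace=False)].copy()
history = []
for _ in range(20):
    # Assignment step: nearest center for every point.
    sq_dist = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dist.argmin(axis=1)
    history.append(wss_given(pts, labels, centers))
    # Update step: move the center of each non-empty cluster to its mean.
    for k in range(K):
        if np.any(labels == k):
            centers[k] = pts[labels == k].mean(axis=0)
    history.append(wss_given(pts, labels, centers))

print(np.round(history[:6], 2))
```

The recorded values should never increase from one entry to the next, and they stop changing once the labels have converged.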

---

**Problem 1:** Implement Clustering Metrics

Implement three functions to compute the clustering evaluation metrics:
- `tss(X)`: Total Sum of Squares
- `wss(X, labels)`: Within-Cluster Sum of Squares  
- `bss(X, labels)`: Between-Cluster Sum of Squares

where `X` is an $n \times d$ NumPy array and `labels` is a 1-D array of length $n$ with integer cluster labels.

In [None]:
# Generate demo data for testing
np.random.seed(0)
cluster_a = np.random.randn(30, 2) + np.array([3, 0])
cluster_b = np.random.randn(30, 2) + np.array([-2, 2])
cluster_c = np.random.randn(30, 2) + np.array([0, -3])

X_demo = np.vstack([cluster_a, cluster_b, cluster_c])
labels_demo = np.array([0] * 30 + [1] * 30 + [2] * 30)


def tss(data):
    """
    Compute Total Sum of Squares.

    Parameters
    ----------
    data : np.ndarray
        Data matrix of shape (n, d).

    Returns
    -------
    float
        Total sum of squared distances from the grand mean.
    """
    # BEGIN SOLUTION
    grand_mean = data.mean(axis=0)
    return np.sum((data - grand_mean) ** 2)
    # END SOLUTION


def wss(data, labels):
    """
    Compute Within-Cluster Sum of Squares.

    Parameters
    ----------
    data : np.ndarray
        Data matrix of shape (n, d).
    labels : np.ndarray
        Cluster labels of shape (n,).

    Returns
    -------
    float
        Sum of squared distances from each point to its cluster mean.
    """
    # BEGIN SOLUTION
    total = 0.0
    for k in np.unique(labels):
        cluster_points = data[labels == k]
        cluster_mean = cluster_points.mean(axis=0)
        total += np.sum((cluster_points - cluster_mean) ** 2)
    return total
    # END SOLUTION


def bss(data, labels):
    """
    Compute Between-Cluster Sum of Squares.

    Parameters
    ----------
    data : np.ndarray
        Data matrix of shape (n, d).
    labels : np.ndarray
        Cluster labels of shape (n,).

    Returns
    -------
    float
        Weighted sum of squared distances from cluster means to the grand mean.
    """
    # BEGIN SOLUTION
    grand_mean = data.mean(axis=0)
    total = 0.0
    for k in np.unique(labels):
        cluster_points = data[labels == k]
        cluster_mean = cluster_points.mean(axis=0)
        n_k = len(cluster_points)
        total += n_k * np.sum((cluster_mean - grand_mean) ** 2)
    return total
    # END SOLUTION

In [None]:
# Test assertions
tss_val = tss(X_demo)
wss_val = wss(X_demo, labels_demo)
bss_val = bss(X_demo, labels_demo)

print(f"TSS: {tss_val:.4f}")
print(f"WSS: {wss_val:.4f}")
print(f"BSS: {bss_val:.4f}")
print(f"WSS + BSS: {wss_val + bss_val:.4f}")

assert abs(wss_val - 179.1138) < 0.01, f"WSS should be ~179.11, got {wss_val}"
assert abs(bss_val - 789.3281) < 0.01, f"BSS should be ~789.33, got {bss_val}"
assert abs(wss_val + bss_val - tss_val) < 1e-7, "TSS should equal WSS + BSS"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert tss_val > 0, "TSS should be positive"
assert wss_val > 0, "WSS should be positive"
assert bss_val > 0, "BSS should be positive"
assert bss_val / tss_val > 0.8, "BSS/TSS ratio should be high for well-separated clusters"
# END HIDDEN TESTS

---

**Problem 2:** Implement K-Means from Scratch

Complete the K-Means implementation below. The initialization is provided. You need to implement:
1. The **assignment step**: compute distances and assign each point to the nearest center
2. The **update step**: recalculate centers as the mean of assigned points

Track `labels_history` and `centers_history` so we can visualize how the algorithm progresses.

**Hint:** Use [`sklearn.metrics.pairwise_distances`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html) to compute distances efficiently. This function returns a distance matrix where entry `[i, j]` is the distance between `data[i]` and `centers[j]`.

In [None]:
from sklearn.metrics import pairwise_distances


def kmeans_custom(data, n_clusters, max_iter=100, random_state=42):
    """
    Custom K-Means implementation with history tracking.

    Parameters
    ----------
    data : np.ndarray
        Data matrix of shape (n, d).
    n_clusters : int
        Number of clusters.
    max_iter : int
        Maximum number of iterations.
    random_state : int
        Random seed for reproducibility.

    Returns
    -------
    labels : np.ndarray
        Final cluster assignments of shape (n,).
    centers : np.ndarray
        Final cluster centers of shape (n_clusters, d).
    labels_history : list of np.ndarray
        Labels after each iteration (for visualization).
    centers_history : list of np.ndarray
        Centers after each iteration (for visualization).
    """
    np.random.seed(random_state)
    n, d = data.shape

    # Step 1: Initialize centers by randomly selecting n_clusters data points
    indices = np.random.choice(n, n_clusters, replace=False)
    centers = data[indices].copy()

    labels_history = []
    centers_history = []

    for iteration in range(max_iter):
        # Step 2: Assignment - assign each point to nearest center
        # BEGIN SOLUTION
        dist_matrix = pairwise_distances(data, centers)
        labels = np.argmin(dist_matrix, axis=1)
        # END SOLUTION

        # Save current state to history
        labels_history.append(labels.copy())
        centers_history.append(centers.copy())

        # Check for convergence (if labels didn't change)
        if iteration > 0 and np.array_equal(labels, labels_history[-2]):
            break

        # Step 3: Update - recalculate centers as mean of assigned points
        new_centers = np.zeros((n_clusters, d))
        # BEGIN SOLUTION
        for k in range(n_clusters):
            mask = labels == k
            if np.any(mask):
                new_centers[k] = data[mask].mean(axis=0)
            else:
                new_centers[k] = centers[k]  # Keep old center if cluster is empty
        # END SOLUTION

        centers = new_centers

    return labels, centers, labels_history, centers_history

In [None]:
# Test assertions
# Generate test data
from sklearn import datasets

X_test, y_test = datasets.make_blobs(n_samples=150, centers=3, random_state=42, cluster_std=2.0)

labels_result, centers_result, labels_hist, centers_hist = kmeans_custom(
    X_test, n_clusters=3, random_state=42
)

print(f"Number of iterations: {len(labels_hist)}")
print(f"Final labels shape: {labels_result.shape}")
print(f"Final centers shape: {centers_result.shape}")

# Visualize final result
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1], c=labels_result, s=30, cmap="viridis", alpha=0.6)
plt.scatter(
    centers_result[:, 0],
    centers_result[:, 1],
    c="red",
    s=200,
    marker="*",
    edgecolors="black",
    linewidths=2,
    label="Centers",
)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("K-Means Result")
plt.legend()
plt.show()

assert labels_result.shape == (150,), "Labels should have shape (150,)"
assert centers_result.shape == (3, 2), "Centers should have shape (3, 2)"
assert len(labels_hist) > 1, "Should have recorded multiple iterations"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(np.unique(labels_result)) == 3, "Should have 3 unique cluster labels"
assert len(labels_hist) == len(centers_hist), "History lengths should match"
# END HIDDEN TESTS

---

**Problem 3:** Visualize K-Means Algorithm Progress

Complete the visualization function to see how K-Means evolves iteration by iteration. Create a subplot grid showing the cluster assignments at each iteration.

In [None]:
def visualize_kmeans_steps(data, labels_history, centers_history):
    """
    Visualize K-Means algorithm progress across iterations.

    Parameters
    ----------
    data : np.ndarray
        Data matrix of shape (n, 2) - must be 2D for visualization.
    labels_history : list of np.ndarray
        Labels after each iteration.
    centers_history : list of np.ndarray
        Centers after each iteration.
    """
    n_iterations = len(labels_history)

    # Create subplot grid
    n_cols = min(4, n_iterations)
    n_rows = (n_iterations + n_cols - 1) // n_cols

    _fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
    if n_iterations == 1:
        axes = np.array([axes])
    axes = axes.flatten()

    for idx in range(n_iterations):
        ax = axes[idx]
        labels = labels_history[idx]
        centers = centers_history[idx]

        # BEGIN SOLUTION
        ax.scatter(data[:, 0], data[:, 1], c=labels, s=30, cmap="viridis", alpha=0.6)
        ax.scatter(
            centers[:, 0],
            centers[:, 1],
            c="red",
            s=200,
            marker="*",
            edgecolors="black",
            linewidths=2,
        )
        ax.set_title(f"Iteration {idx}")
        ax.set_xlabel("x1")
        ax.set_ylabel("x2")
        # END SOLUTION

    # Hide unused subplots
    for idx in range(n_iterations, len(axes)):
        axes[idx].axis("off")

    plt.tight_layout()
    plt.show()

In [None]:
# Test assertions
# Run K-Means with different random seeds to see different convergence behaviors
for seed in [0, 42]:
    print(f"\nRandom seed: {seed}")
    labels, centers, labels_hist, centers_hist = kmeans_custom(
        X_test, n_clusters=3, random_state=seed
    )
    print(f"Converged in {len(labels_hist)} iterations")
    visualize_kmeans_steps(X_test, labels_hist, centers_hist)

assert len(labels_hist) >= 1, "Should have at least 1 iteration recorded"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert callable(visualize_kmeans_steps), "visualize_kmeans_steps should be a function"
# END HIDDEN TESTS

---

**Problem 4:** K-Means Convergence (free response)

Based on the visualizations above, answer the following:

1. Does K-Means always converge to the same result with different random initializations?
2. How does the number of iterations to convergence vary with different seeds?
3. What does this tell you about the importance of initialization in K-Means?

> BEGIN SOLUTION

1. No, K-Means does not always converge to the same result. Different random initializations can lead to different local minima of the WSS objective function.

2. The number of iterations varies depending on how close the initial centers are to the final cluster structure. Some seeds may start with centers that are already near optimal, while others require more iterations.

3. Initialization matters: poor initialization can lead to suboptimal clustering or slower convergence. In practice, k-means++ initialization or running K-Means multiple times with different seeds can help.

> END SOLUTION
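
The two remedies mentioned in the solution, k-means++ seeding and multiple restarts, are both available in scikit-learn's `KMeans` through the `init` and `n_init` parameters. The sketch below, on synthetic blobs chosen for illustration, compares the inertia of several single random initializations with one multi-start fit.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 groups (an assumption for illustration).
X_blobs, _ = make_blobs(n_samples=300, centers=5, cluster_std=1.0, random_state=0)

# One random initialization per fit, repeated over several seeds.
single_run = [
    KMeans(n_clusters=5, init="random", n_init=1, random_state=seed).fit(X_blobs).inertia_
    for seed in range(10)
]

# k-means++ seeding with 10 restarts; scikit-learn keeps the run with the lowest inertia.
multi_start = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X_blobs)

print(np.round(single_run, 1))
print(round(multi_start.inertia_, 1))
```

If the single-run inertias differ from seed to seed, those runs ended in different local minima. The multi-start fit reports only its best run.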

We have demonstrated how PCA works, but really, reducing the handful of crab measurements to 2 principal components is a trivial exercise. PCA is most effective when the data are truly high-dimensional.

---

**Problem 5:** Synthetic Data Generation

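Generate data that is approximately of rank 3. Create 250 observations of latent features $z \sim \text{MVN}([5, 3, 1], I)$ and stack them as the rows of a $250 \times 3$ matrix $Z$. Then, create a low-rank data matrix $X$ such that $X = ZL + \epsilon$, where $L$ is a $3 \times 50$ matrix of standard normal values and $\epsilon$ is a $250 \times 50$ matrix of independent $N(0, 0.25)$ noise.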

Store your results in the following variables:
- `latent_samples`: the 250x3 matrix of multivariate normal samples (Z)
- `projection_matrix`: the 3x50 matrix of standard normal values (L)
- `data_matrix`: the final 250x50 low-rank data matrix with noise (X)

In [None]:
import numpy as np

mean = [5, 3, 1]
cov = np.identity(3)

# BEGIN SOLUTION
# Generate multivariate normal samples for the latent variables
latent_samples = np.random.multivariate_normal(mean, cov, 250)

# Generate random projection matrix
projection_matrix = np.random.normal(0, 1, (3, 50))

# Create low-rank data matrix with noise
data_matrix = np.matmul(latent_samples, projection_matrix)
data_matrix += np.random.normal(0, 0.25, data_matrix.shape)
# END SOLUTION

In [None]:
# Test assertions
assert latent_samples.shape == (
    250,
    3,
), f"Expected latent_samples shape (250, 3), got {latent_samples.shape}"
assert projection_matrix.shape == (
    3,
    50,
), f"Expected projection_matrix shape (3, 50), got {projection_matrix.shape}"
assert data_matrix.shape == (
    250,
    50,
), f"Expected data_matrix shape (250, 50), got {data_matrix.shape}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert latent_samples.dtype == np.float64, "latent_samples should be float64"
assert data_matrix.dtype == np.float64, "data_matrix should be float64"
assert (
    np.abs(latent_samples.mean(axis=0) - np.array([5, 3, 1])).max() < 1.0
), "Latent samples mean should be approximately [5, 3, 1]"
# END HIDDEN TESTS
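
To see in what sense this construction is "rank 3", the sketch below rebuilds a matrix of the same form with its own random generator (the names `Z_toy`, `L_toy`, `signal`, and `noisy` are assumptions for this example) and inspects the numerical rank and the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_toy = rng.multivariate_normal(mean=[5, 3, 1], cov=np.identity(3), size=250)
L_toy = rng.normal(size=(3, 50))
signal = Z_toy @ L_toy
noisy = signal + rng.normal(0, 0.25, size=signal.shape)

# The noise-free product has rank 3; adding noise makes the matrix numerically full rank.
rank_signal = np.linalg.matrix_rank(signal)
rank_noisy = np.linalg.matrix_rank(noisy)
print(rank_signal, rank_noisy)

# Singular values of the noisy matrix: three large ones, then a sharp drop.
sing_vals = np.linalg.svd(noisy, compute_uv=False)
print(np.round(sing_vals[:5], 1))
```

The noisy matrix is only approximately low-rank: its structure shows up as a gap after the third singular value rather than as exact zeros.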

---

**Problem 6:** Clustering on Raw Data

Create a 2x3 subplot grid showing K-means clustering results with $K \in \{2, 3, 4, 5, 6, 7\}$ on the first two features of the synthetic data matrix `data_matrix` from Problem 5. Each subplot should show a scatter plot of the data colored by cluster assignment.

**Note:** Your plot must be organized as a subplot grid. Six separate plots arranged vertically will not receive credit.

In [None]:
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

# BEGIN SOLUTION
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
num_rows, num_cols = 2, 3

for cluster_count in range(2, 8):
    kmeans = KMeans(n_clusters=cluster_count, random_state=0)
    kmeans.fit(data_matrix)

    row_idx = (cluster_count - 2) // num_cols
    col_idx = (cluster_count - 2) % num_cols
    axes[row_idx, col_idx].scatter(
        data_matrix[:, 0], data_matrix[:, 1], c=kmeans.labels_, cmap="viridis", alpha=0.7
    )
    axes[row_idx, col_idx].set_title(f"K={cluster_count}")

plt.tight_layout()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert fig is not None, "Figure should be created"
assert axes.shape == (2, 3), f"Expected 2x3 subplot grid, got {axes.shape}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(fig.get_axes()) == 6, "Figure should have exactly 6 subplots"
# END HIDDEN TESTS

The 3 clusters we know to exist from construction do not look very separable. Perhaps using just 2 random features in the input space (which we know is a subset of features from a low-rank matrix) was not ideal. Instead, let us see if projecting to a lower-dimensional space (creating latent features) yields more separable clusters.

---

**Problem 7:** Recovering Principal Components

Run PCA with a reasonable number of components on `data_matrix`. Assess the total variance explained by the principal components. Based on this assessment, decide how many PCs to use for downstream clustering and record your answer in the markdown cell below.

Create a plot of the cumulative sum of variance explained by PCA.

Store your results in the following variables:
- `pca_model`: the fitted PCA model
- `explained_variance_cumsum`: a NumPy array of cumulative variance explained ratios

In [None]:
# BEGIN SOLUTION
pca_model = PCA(svd_solver="full", n_components=10)
pca_model.fit(data_matrix)
explained_variance_cumsum = np.cumsum(pca_model.explained_variance_ratio_)

plt.plot(range(1, len(explained_variance_cumsum) + 1), explained_variance_cumsum, marker="o")
plt.title("Cumulative Variance Explained by PCs")
plt.xlabel("Number of PCs")
plt.ylabel("Cumulative Variance Explained")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert pca_model is not None, "PCA model should be created"
assert len(explained_variance_cumsum) == 10, "Should have 10 cumulative variance values"
assert explained_variance_cumsum[-1] <= 1.0, "Cumulative variance should not exceed 1.0"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert explained_variance_cumsum[0] > 0.3, "First PC should explain significant variance"
assert (
    explained_variance_cumsum[2] > 0.9
), "First 3 PCs should explain most variance (data is rank 3)"
# END HIDDEN TESTS

> BEGIN SOLUTION

**Number of PCs to use:** 3. The first 3 principal components explain nearly all the variance in the data, which makes sense because the data was constructed to be rank 3.
> END SOLUTION
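
As an alternative to reading the number of PCs off the plot, scikit-learn's `PCA` accepts a float between 0 and 1 for `n_components` (with `svd_solver="full"`) and then keeps just enough components to explain that share of the variance. The sketch below applies this to freshly generated low-rank data. The 0.95 threshold and the names `low_rank` and `pca_auto` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic rank-3 signal plus small noise (an assumption for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(250, 3))
low_rank = latent @ rng.normal(size=(3, 50)) + rng.normal(0, 0.25, size=(250, 50))

# A float in (0, 1) asks PCA to keep enough components to explain that share of variance.
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(low_rank)
print(pca_auto.n_components_)
print(np.round(np.cumsum(pca_auto.explained_variance_ratio_), 3))
```

The chosen number of components depends on the threshold, so it is worth checking the cumulative ratios rather than relying on a single cutoff.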


---

**Problem 8a:** Clustering on PCA-Transformed Data

Now that we have established how many principal components to use, visualize how many clusters are separable in the reduced PCA space. Create a 2x3 subplot grid showing K-means clustering results with $K \in \{2, 3, 4, 5, 6, 7\}$ on the PCA-transformed data.

**Note:** Your plot must be organized as a subplot grid. Six separate plots arranged vertically will not receive credit.

In [None]:
# BEGIN SOLUTION
fig_pca, axes_pca = plt.subplots(2, 3, figsize=(15, 10))
num_rows, num_cols = 2, 3

# Transform data using 3 principal components (done once, outside the loop)
pca_transform = PCA(svd_solver="full", n_components=3)
pca_data = pca_transform.fit_transform(data_matrix)

for cluster_count in range(2, 8):
    kmeans = KMeans(n_clusters=cluster_count, random_state=0)
    kmeans.fit(pca_data)

    row_idx = (cluster_count - 2) // num_cols
    col_idx = (cluster_count - 2) % num_cols
    axes_pca[row_idx, col_idx].scatter(
        pca_data[:, 0], pca_data[:, 1], c=kmeans.labels_, cmap="viridis", alpha=0.7
    )
    axes_pca[row_idx, col_idx].set_title(f"K={cluster_count}")

plt.tight_layout()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert fig_pca is not None, "Figure should be created"
assert axes_pca.shape == (2, 3), f"Expected 2x3 subplot grid, got {axes_pca.shape}"
assert pca_data.shape[1] == 3, "PCA should reduce to 3 components"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(fig_pca.get_axes()) == 6, "Figure should have exactly 6 subplots"
assert pca_data.shape[0] == 250, "Should have 250 observations after PCA"
# END HIDDEN TESTS

---

**Problem 8b:** Interpreting PCA Space

In 1-2 sentences, explain how to interpret the axes of the cluster plots using PCA data and why the clusters were visually separable in PCA space.

> BEGIN SOLUTION

The axes represent the principal component scores (projections onto the eigenvectors of the covariance matrix), which capture the directions of maximum variance in the data. The clusters are more separable in PCA space because PCA concentrates the true underlying structure into fewer dimensions, removing noisy features that obscure the latent groupings.
> END SOLUTION


---

## Working with Gene Expression Data

Gene expression data is a good example of real data that typically requires dimensionality reduction. Some datasets have tens of thousands of genes. We will work with a dataset that has been reduced for us thanks to the [scanpy](https://scanpy.readthedocs.io/en/stable/generated/scanpy.datasets.pbmc68k_reduced.html) library.

We have included the code needed to load the dataset into the notebook. The data is stored as an [annotated data matrix](https://anndata.readthedocs.io/en/stable/) (`AnnData`), a class that handles data organization, usually for single-cell data.

Here is a quick summary of each anndata component (you will not need to use them all for this lab):

- **X**: The primary two-dimensional data matrix (e.g., gene expression) where rows usually correspond to cells and columns to features (genes).
- **obs**: A DataFrame storing per-observation (often per-cell) annotations such as cluster labels or metadata.
- **var**: A DataFrame storing per-variable (often per-gene) annotations like gene symbols or feature quality metrics.
- **uns**: A dictionary-like structure for unstructured annotations, typically holding things like color schemes, parameter settings, or additional metadata.
- **obsm**: A dictionary of matrices aligned with observations (cells), commonly used for embeddings (e.g., PCA, UMAP coordinates).
- **varm**: A dictionary of matrices aligned with variables (genes), often used for storing feature loadings in dimensionality reduction.
- **layers**: A dictionary of additional data layers (e.g., raw counts, imputed data) that share the same dimensionality as X but may differ in values.
- **raw**: An optional structure holding the unmodified or "raw" version of the data matrix (plus corresponding var), often used to preserve counts before normalization or filtering.

---

**Problem 9:** Working with Scanpy Datasets

We have included the scanpy import and the code that loads the dataset.

(a) Using the data matrix `X` and the PCA matrix `X_pca`, determine the number of observations and features in each dataset. Fill in the correct values for `num_cells_x`, `num_genes_x`, `num_cells_pca`, and `num_pcs`.

(b) The `bulk_labels` object has the annotations of each cell's cell type. In the markdown cell below, state how many unique cell types there are.

In [None]:
# JUST RUN, DO NOT EDIT
import scanpy as sc

# Load the pbmc68k dataset
adata = sc.datasets.pbmc68k_reduced()

# Inspect the AnnData object
print(adata)

In [None]:
# BEGIN SOLUTION
# Extract shapes from the data
gene_expression_pca = adata.obsm["X_pca"]
print(f"X shape: {adata.X.shape}")
print(f"X_pca shape: {gene_expression_pca.shape}")
print(f"Number of unique cell types: {adata.obs['bulk_labels'].nunique()}")

# Fill in the values
num_cells_x = adata.X.shape[0]
num_genes_x = adata.X.shape[1]
num_cells_pca = adata.obsm["X_pca"].shape[0]
num_pcs = adata.obsm["X_pca"].shape[1]
# END SOLUTION

In [None]:
# Test assertions
assert num_cells_x == 700, f"Expected 700 cells in X, got {num_cells_x}"
assert num_genes_x == 765, f"Expected 765 genes in X, got {num_genes_x}"
assert num_cells_pca == 700, f"Expected 700 cells in X_pca, got {num_cells_pca}"
assert num_pcs == 50, f"Expected 50 PCs in X_pca, got {num_pcs}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert adata.X.shape == (700, 765), "X shape should be (700, 765)"
assert adata.obsm["X_pca"].shape == (700, 50), "X_pca shape should be (700, 50)"
# END HIDDEN TESTS

> BEGIN SOLUTION

**Answer:** There are 10 unique cell types.
> END SOLUTION


---

**Problem 10a:** K-Means Clustering on Raw Gene Expression Data

Perform K-means clustering on the entire gene expression dataset (`adata.X`). Use $K \in \{5, 10, 15\}$. Store the cluster assignments in the per-observation annotations (`obs`) with keys `kmeans5_ALL`, `kmeans10_ALL`, and `kmeans15_ALL`. The $K=5$ example is provided for you.
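
The pattern of fitting K-means for several values of $K$ and storing each label vector under its own key can be sketched on synthetic data. This is only an illustration under assumptions: `toy_X` and `toy_obs` are hypothetical stand-ins for `adata.X` and `adata.obs`, the blobs come from `make_blobs` rather than real expression data, and `n_init=10` is set explicitly so the behavior does not depend on the scikit-learn version's default.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for an expression matrix: 300 "cells" x 20 "genes".
toy_X, _ = make_blobs(n_samples=300, n_features=20, centers=4, random_state=0)

# Hypothetical per-observation table playing the role of adata.obs.
toy_obs = pd.DataFrame(index=[f"cell_{i}" for i in range(toy_X.shape[0])])

for k in (5, 10, 15):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy_X)
    # Labels are stored as strings so they read as categories, not numbers.
    toy_obs[f"kmeans{k}_ALL"] = km.labels_.astype(str)

# Number of distinct labels found for each K.
print(toy_obs.nunique())
```

Storing labels as strings mirrors the provided example; a numeric column could otherwise be interpreted as a continuous quantity by plotting code.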

In [None]:
from sklearn.cluster import KMeans  # harmless if already imported earlier in the notebook

gene_data = adata.X

# kmeans with k=5 (provided example)
kmeans = KMeans(n_clusters=5, random_state=0).fit(gene_data)
adata.obs["kmeans5_ALL"] = kmeans.labels_.astype(str)

# BEGIN SOLUTION
# kmeans with k=10
kmeans = KMeans(n_clusters=10, random_state=0).fit(gene_data)
adata.obs["kmeans10_ALL"] = kmeans.labels_.astype(str)

# kmeans with k=15
kmeans = KMeans(n_clusters=15, random_state=0).fit(gene_data)
adata.obs["kmeans15_ALL"] = kmeans.labels_.astype(str)
# END SOLUTION

# Visualize clusters in UMAP space
sc.pl.umap(adata, color=["kmeans5_ALL", "kmeans10_ALL", "kmeans15_ALL", "bulk_labels"])

In [None]:
# Test assertions
assert "kmeans5_ALL" in adata.obs.columns, "kmeans5_ALL should be in obs"
assert "kmeans10_ALL" in adata.obs.columns, "kmeans10_ALL should be in obs"
assert "kmeans15_ALL" in adata.obs.columns, "kmeans15_ALL should be in obs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(adata.obs["kmeans5_ALL"].unique()) == 5, "kmeans5_ALL should have 5 clusters"
assert len(adata.obs["kmeans10_ALL"].unique()) == 10, "kmeans10_ALL should have 10 clusters"
assert len(adata.obs["kmeans15_ALL"].unique()) == 15, "kmeans15_ALL should have 15 clusters"
# END HIDDEN TESTS

---

**Problem 10b:** K-Means Clustering on PCA-Transformed Gene Data

Now perform the same K-means clustering ($K \in \{5, 10, 15\}$) but using the principal components as clustering input. Store results with keys `kmeans5_PC`, `kmeans10_PC`, and `kmeans15_PC`.

In [None]:
# Extract PCA coordinates
gene_pca_data = adata.obsm["X_pca"]

# BEGIN SOLUTION
# kmeans with k=5
kmeans = KMeans(n_clusters=5, random_state=0).fit(gene_pca_data)
adata.obs["kmeans5_PC"] = kmeans.labels_.astype(str)

# kmeans with k=10
kmeans = KMeans(n_clusters=10, random_state=0).fit(gene_pca_data)
adata.obs["kmeans10_PC"] = kmeans.labels_.astype(str)

# kmeans with k=15
kmeans = KMeans(n_clusters=15, random_state=0).fit(gene_pca_data)
adata.obs["kmeans15_PC"] = kmeans.labels_.astype(str)
# END SOLUTION

# Visualize clusters in UMAP space
sc.pl.umap(adata, color=["kmeans5_PC", "kmeans10_PC", "kmeans15_PC", "bulk_labels"])

In [None]:
# Test assertions
assert "kmeans5_PC" in adata.obs.columns, "kmeans5_PC should be in obs"
assert "kmeans10_PC" in adata.obs.columns, "kmeans10_PC should be in obs"
assert "kmeans15_PC" in adata.obs.columns, "kmeans15_PC should be in obs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(adata.obs["kmeans5_PC"].unique()) == 5, "kmeans5_PC should have 5 clusters"
assert len(adata.obs["kmeans10_PC"].unique()) == 10, "kmeans10_PC should have 10 clusters"
assert len(adata.obs["kmeans15_PC"].unique()) == 15, "kmeans15_PC should have 15 clusters"
# END HIDDEN TESTS

---

**Problem 11a:** Hierarchical Clustering on Raw Gene Expression Data

Perform agglomerative (hierarchical) clustering on the entire gene expression dataset (`gene_data`). Use `n_clusters` $\in \{5, 10, 15\}$ with Euclidean distance and Ward linkage. Store the cluster assignments in `obs` with keys `hclust_5_ALL`, `hclust_10_ALL`, and `hclust_15_ALL`. The `n_clusters=5` example is provided for you.

See [`AgglomerativeClustering`](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html) for documentation.
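
Because agglomerative clustering builds a single merge tree, the partitions at different cluster counts are cuts of that same tree. The sketch below illustrates this with SciPy's `linkage` and `fcluster` instead of scikit-learn's `AgglomerativeClustering`, which is a swapped-in tool and not the method this problem asks for. The data are synthetic (`toy_X` is a hypothetical stand-in for `gene_data`), and label numbering will not match scikit-learn's.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the expression matrix: 200 points in 10 dimensions.
toy_X, _ = make_blobs(n_samples=200, n_features=10, centers=5, random_state=0)

# Build the Ward merge tree once (Ward linkage uses Euclidean distance).
tree = linkage(toy_X, method="ward")

# Cut the same tree so that at most k clusters remain, for several k.
cuts = {k: fcluster(tree, t=k, criterion="maxclust") for k in (5, 10, 15)}

for k, labels in cuts.items():
    print(k, len(np.unique(labels)))

# Every cluster of the finer cut should sit inside one cluster of the coarser cut.
nested = all(
    len(np.unique(cuts[5][cuts[10] == c])) == 1 for c in np.unique(cuts[10])
)
print("10-cluster cut nested inside 5-cluster cut:", nested)
```

This nesting is a property of hierarchical clustering that K-means does not share: K-means solutions for different $K$ are fit independently and need not refine one another.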

In [None]:
from sklearn.cluster import AgglomerativeClustering

# Hierarchical clustering with n_clusters=5 (provided example)
cluster = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="ward")
adata.obs["hclust_5_ALL"] = cluster.fit_predict(gene_data).astype(str)

# BEGIN SOLUTION
# Hierarchical clustering with n_clusters=10
cluster = AgglomerativeClustering(n_clusters=10, metric="euclidean", linkage="ward")
adata.obs["hclust_10_ALL"] = cluster.fit_predict(gene_data).astype(str)

# Hierarchical clustering with n_clusters=15
cluster = AgglomerativeClustering(n_clusters=15, metric="euclidean", linkage="ward")
adata.obs["hclust_15_ALL"] = cluster.fit_predict(gene_data).astype(str)
# END SOLUTION

# Visualize clusters in UMAP space
sc.pl.umap(adata, color=["hclust_5_ALL", "hclust_10_ALL", "hclust_15_ALL", "bulk_labels"])

In [None]:
# Test assertions
assert "hclust_5_ALL" in adata.obs.columns, "hclust_5_ALL should be in obs"
assert "hclust_10_ALL" in adata.obs.columns, "hclust_10_ALL should be in obs"
assert "hclust_15_ALL" in adata.obs.columns, "hclust_15_ALL should be in obs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(adata.obs["hclust_5_ALL"].unique()) == 5, "hclust_5_ALL should have 5 clusters"
assert len(adata.obs["hclust_10_ALL"].unique()) == 10, "hclust_10_ALL should have 10 clusters"
assert len(adata.obs["hclust_15_ALL"].unique()) == 15, "hclust_15_ALL should have 15 clusters"
# END HIDDEN TESTS

---

**Problem 11b:** Hierarchical Clustering on PCA-Transformed Gene Data

Now perform the same hierarchical clustering (`n_clusters` $\in \{5, 10, 15\}$) but using the principal components as clustering input. Store results with keys `hclust_5_PC`, `hclust_10_PC`, and `hclust_15_PC`.

In [None]:
# BEGIN SOLUTION
# Hierarchical clustering with n_clusters=5
cluster = AgglomerativeClustering(n_clusters=5, metric="euclidean", linkage="ward")
adata.obs["hclust_5_PC"] = cluster.fit_predict(gene_pca_data).astype(str)

# Hierarchical clustering with n_clusters=10
cluster = AgglomerativeClustering(n_clusters=10, metric="euclidean", linkage="ward")
adata.obs["hclust_10_PC"] = cluster.fit_predict(gene_pca_data).astype(str)

# Hierarchical clustering with n_clusters=15
cluster = AgglomerativeClustering(n_clusters=15, metric="euclidean", linkage="ward")
adata.obs["hclust_15_PC"] = cluster.fit_predict(gene_pca_data).astype(str)
# END SOLUTION

# Visualize clusters in UMAP space
sc.pl.umap(adata, color=["hclust_5_PC", "hclust_10_PC", "hclust_15_PC", "bulk_labels"])

In [None]:
# Test assertions
assert "hclust_5_PC" in adata.obs.columns, "hclust_5_PC should be in obs"
assert "hclust_10_PC" in adata.obs.columns, "hclust_10_PC should be in obs"
assert "hclust_15_PC" in adata.obs.columns, "hclust_15_PC should be in obs"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(adata.obs["hclust_5_PC"].unique()) == 5, "hclust_5_PC should have 5 clusters"
assert len(adata.obs["hclust_10_PC"].unique()) == 10, "hclust_10_PC should have 10 clusters"
assert len(adata.obs["hclust_15_PC"].unique()) == 15, "hclust_15_PC should have 15 clusters"
# END HIDDEN TESTS

---

**Problem 12:** Evaluating Cluster Assignments

We have run all the clustering algorithms, but eyeballing the cluster assignments on a plot is not a reliable way to evaluate them. One cluster evaluation metric is the [adjusted Rand index (ARI)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html). You do not need to fully understand the formula; just know that a score of 1 means the two partitions are identical (up to relabeling), a score near 0 means agreement no better than chance, and a negative score means agreement worse than chance.
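
A tiny hand-checkable example may help build intuition for these values. The four-point label vectors below are made up for illustration and have nothing to do with the dataset.

```python
from sklearn.metrics import adjusted_rand_score

truth = ["B", "B", "T", "T"]

# Same grouping under different label names: perfect agreement.
same_grouping = adjusted_rand_score(truth, [1, 1, 0, 0])
print(same_grouping)  # 1.0

# Each predicted cluster mixes one "B" and one "T": worse than chance.
mixed_grouping = adjusted_rand_score(truth, [0, 1, 0, 1])
print(mixed_grouping)  # -0.5

# The score is symmetric in its two arguments.
swapped = adjusted_rand_score([0, 1, 0, 1], truth)
print(swapped == mixed_grouping)
```

The first case shows that ARI compares groupings rather than label names, which is why cluster IDs stored as strings can be compared directly against cell-type names.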

Use [`adjusted_rand_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html) to evaluate all previous clusterings against the true `bulk_labels`. Determine which clustering had the highest ARI. In 1-2 sentences, state which clustering method performed best and explain whether this conclusion makes sense.

In [None]:
from sklearn.metrics import adjusted_rand_score

all_clusters = [
    "kmeans5_ALL",
    "kmeans10_ALL",
    "kmeans15_ALL",
    "kmeans5_PC",
    "kmeans10_PC",
    "kmeans15_PC",
    "hclust_5_ALL",
    "hclust_10_ALL",
    "hclust_15_ALL",
    "hclust_5_PC",
    "hclust_10_PC",
    "hclust_15_PC",
]

# BEGIN SOLUTION
# Find the clustering with the highest ARI
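max_ari = float("-inf")  # ARI can be negative, so 0 is not a safe starting value
best_clustering = None

for cluster_name in all_clusters:
    # ARI is symmetric; the reference labels go first to match the documented argument order
    ari = adjusted_rand_score(adata.obs["bulk_labels"], adata.obs[cluster_name])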
    print(f"{cluster_name}: ARI = {ari:.4f}")
    if ari > max_ari:
        max_ari = ari
        best_clustering = cluster_name

print(f"\nThe best clustering was {best_clustering} with an ARI of {max_ari:.4f}")
# END SOLUTION

In [None]:
# Test assertions
assert best_clustering is not None, "best_clustering should be identified"
assert max_ari > 0, "max_ari should be positive"
assert best_clustering in all_clusters, "best_clustering should be one of the cluster methods"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert max_ari > 0.3, "Best ARI should be reasonably high (> 0.3)"
assert isinstance(best_clustering, str), "best_clustering should be a string"
# END HIDDEN TESTS

> BEGIN SOLUTION

The best clustering was `kmeans10_PC` with an ARI of approximately 0.59. This makes sense because (1) the true labels contain 10 cell types, so $K=10$ matches the number of annotated groups, and (2) clustering on the principal components discards low-variance directions that are mostly noise and concentrates the signal in fewer dimensions, which tends to give better cluster separation.
> END SOLUTION
