# DATASCI 503, Group Work 4: Linear and Quadratic Discriminant Analysis

**Instructions:** During lab section, and afterward as necessary, you will collaborate in two-person teams (assigned by the GSI) to complete the problems that are interspersed below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. **During lab, feel free to flag down your GSI to ask questions at any point!** Upon completion, one member of the team should submit their team's work through Canvas as HTML.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis as LDA,
)
from sklearn.discriminant_analysis import (
    QuadraticDiscriminantAnalysis as QDA,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

## Bayes Classifier

The **Bayes Classifier** is the theoretically optimal classifier: it minimizes the probability of misclassification. It assigns a new observation $ X = x $ to the class $ k $ with the highest posterior probability, which Bayes' Theorem expresses as:

$$
P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k) P(Y = k)}{P(X = x)}
$$

The **Bayes Classifier** therefore assigns $ X $ to the class:

$$
\hat{Y} = \arg\max_k P(Y = k \mid X = x)
$$

where:
- $ P(Y = k) $ is the prior probability of class $ k $,
- $ P(X = x \mid Y = k) $ is the class-conditional density,
- $ P(X = x) $ is the marginal density of $ X $.

If we know the true class-conditional distributions, the Bayes Classifier is optimal. However, in practice, these distributions are unknown, and we approximate them using models like **LDA** and **QDA**.
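
As a minimal sketch of this rule, assume a one-dimensional problem with two classes whose class-conditional densities are known Gaussians. The means, standard deviations, and priors below are illustrative values rather than quantities from this lab, and `gaussian_pdf` and `bayes_classify` are hypothetical helper names.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


def bayes_classify(x, means, stds, priors):
    """Return the posterior probabilities and the Bayes-optimal class for scalar x."""
    # Numerator of Bayes' theorem for each class: likelihood * prior
    joint = np.array([gaussian_pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)])
    # Dividing by the marginal density P(X = x) normalizes the posteriors
    posterior = joint / joint.sum()
    return posterior, int(np.argmax(posterior))


# Illustrative (assumed) parameters: two classes with known distributions
means, stds, priors = [0.0, 3.0], [1.0, 1.0], [0.6, 0.4]
posterior, y_hat = bayes_classify(1.0, means, stds, priors)
print(posterior, y_hat)
```

Note that the marginal density only rescales the scores, so the argmax could equally be taken over `joint` directly.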

### On Bayes Classifier Error

Even though the Bayes Classifier is optimal, it is not perfect unless the classes are completely separable. The Bayes Error Rate is the probability that the classifier makes a mistake.

This error arises because the class distributions overlap. Suppose two classes have multivariate normal likelihoods with mean vectors $\mu_0$ and $\mu_1$ and covariances $\Sigma_0, \Sigma_1$. The posterior probabilities are computed using Bayes' Theorem, and the Bayes decision boundary is the set of points where:

$$
P(Y = 0 \mid X = x) = P(Y = 1 \mid X = x)
$$

which, in the case of multivariate normal likelihoods, leads to the quadratic equation:

$$
(x - \mu_0)^T \Sigma_0^{-1} (x - \mu_0) + \log|\Sigma_0| - 2 \log P(Y = 0) = (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + \log|\Sigma_1| - 2 \log P(Y = 1)
$$
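
As a numerical sanity check of this boundary, the sketch below chooses a class in two ways: by comparing unnormalized posteriors (density times prior), and by comparing a quadratic score per class. The score used here is the squared Mahalanobis distance plus $\log|\Sigma_k|$ minus twice the log prior, which equals $-2\log$ of the unnormalized posterior up to a constant shared by both classes. The helper name `quadratic_score` and the uniformly sampled test points are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Same parameter values as the simulation in this section
mu_0, sigma_0, prior_0 = np.array([2.0, 2.0]), np.array([[1.0, 0.5], [0.5, 1.0]]), 0.6
mu_1, sigma_1, prior_1 = np.array([5.0, 5.0]), np.array([[1.0, -0.3], [-0.3, 1.0]]), 0.4


def quadratic_score(x, mu, sigma, prior):
    """Squared Mahalanobis distance + log|Sigma| - 2 log(prior); smaller is better."""
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)
    _, logdet = np.linalg.slogdet(sigma)
    return maha + logdet - 2 * np.log(prior)


rng = np.random.default_rng(0)
points = rng.uniform(-1, 8, size=(200, 2))

# Class chosen by comparing unnormalized posteriors: density * prior
joint_0 = multivariate_normal.pdf(points, mean=mu_0, cov=sigma_0) * prior_0
joint_1 = multivariate_normal.pdf(points, mean=mu_1, cov=sigma_1) * prior_1
by_posterior = (joint_1 > joint_0).astype(int)

# Class chosen by comparing the quadratic scores
score_0 = np.array([quadratic_score(p, mu_0, sigma_0, prior_0) for p in points])
score_1 = np.array([quadratic_score(p, mu_1, sigma_1, prior_1) for p in points])
by_score = (score_1 < score_0).astype(int)

print(np.array_equal(by_posterior, by_score))
```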

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import multivariate_normal

# Set random seed for reproducibility
np.random.seed(42)

# Parameters for Class 0
mu_0 = np.array([2, 2])
sigma_0 = np.array([[1, 0.5], [0.5, 1]])

# Parameters for Class 1
mu_1 = np.array([5, 5])
sigma_1 = np.array([[1, -0.3], [-0.3, 1]])

# Generate data
n_samples = 400  # Total number of samples
prior_0 = 0.6  # Prior probability for class 0
prior_1 = 1 - prior_0  # Prior probability for class 1
# Sample class labels according to the prior
assignment = np.random.choice([0, 1], size=n_samples, p=[prior_0, prior_1])
# Count samples per class
n_samples_0 = np.sum(assignment == 0)
n_samples_1 = np.sum(assignment == 1)
X0 = np.random.multivariate_normal(mu_0, sigma_0, n_samples_0)
X1 = np.random.multivariate_normal(mu_1, sigma_1, n_samples_1)

# set up figure size
plt.figure(figsize=(10, 6))
# Plot data
plt.scatter(X0[:, 0], X0[:, 1], label="Class 0", alpha=0.5)
plt.scatter(X1[:, 0], X1[:, 1], label="Class 1", alpha=0.5)
# plot density contours
x, y = np.linspace(-3, 10, 100), np.linspace(-3, 10, 100)
X, Y = np.meshgrid(x, y)
pos = np.dstack((X, Y))
density0 = multivariate_normal(mu_0, sigma_0).pdf(pos)
density1 = multivariate_normal(mu_1, sigma_1).pdf(pos)
plt.contour(X, Y, density0, levels=[0.01, 0.05, 0.1], colors="r", linestyles="dashed")
plt.contour(X, Y, density1, levels=[0.01, 0.05, 0.1], colors="b", linestyles="dashed")
# plot means as crosses
plt.scatter(mu_0[0], mu_0[1], marker="x", color="r", s=100, label="Mean Class 0")
plt.scatter(mu_1[0], mu_1[1], marker="x", color="b", s=100, label="Mean Class 1")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Simulated Data with Class Means")
plt.legend()
# restrict both axes to [-1, 8]
plt.xlim([-1, 8])
plt.ylim([-1, 8])
plt.show()

To illustrate how the error changes as a function of $x$, we compute the pointwise error of the Bayes classifier, $1 - \max_k P(Y = k \mid X = x)$, along the segment connecting $\mu_0$ and $\mu_1$.

In [None]:
# Generate points along the line connecting mu_0 and mu_1
t_values = np.linspace(0, 1, 100)
points = np.outer(1 - t_values, mu_0) + np.outer(t_values, mu_1)

# Compute class conditional densities with different covariances
pdf_0 = multivariate_normal.pdf(points, mean=mu_0, cov=sigma_0)
pdf_1 = multivariate_normal.pdf(points, mean=mu_1, cov=sigma_1)

# Compute posterior probabilities using the priors prior_0 and prior_1 defined above
posterior_0 = (pdf_0 * prior_0) / (pdf_0 * prior_0 + pdf_1 * prior_1)
posterior_1 = (pdf_1 * prior_1) / (pdf_0 * prior_0 + pdf_1 * prior_1)

# Compute Bayes error rate at each point
bayes_error = 1 - np.maximum(posterior_0, posterior_1)

# Plot
plt.figure(figsize=(8, 5))
plt.plot(t_values, bayes_error, label="Bayes Error Rate", color="blue", linewidth=2)
plt.xlabel(r"Position along segment from $\mu_0$ to $\mu_1$")
plt.ylabel("Bayes Error Rate")
plt.title("Bayes Error Rate with Different Covariance Matrices")
plt.legend()
plt.grid(visible=True)
plt.show()

Unfortunately, in practice we do not know the priors and likelihoods needed to perform these calculations and obtain the Bayes optimal classifier. We therefore have to make assumptions about the likelihoods and priors, and then estimate their parameters, in order to approximate the Bayes Classifier.

## Linear Discriminant Analysis

LDA assumes that the different classes generate data based on Gaussian distributions with means that are distinct but share the same covariance matrix. This assumption allows LDA to find a linear combination of features that characterizes or separates two or more classes of objects or events.

### Assumptions

- Data is drawn from $k$ multivariate normal distributions, where each distribution can have a different mean vector $\mu_k$, but all share the same covariance matrix $\Sigma$.
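
To make the assumption concrete, here is a small sketch that draws three classes from Gaussians with different means and one shared covariance, then checks that the empirical covariance of each class is close to the shared one. The means, covariance, and sample size are illustrative choices, not values prescribed by the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters: three class means, one shared covariance
class_means = np.array([[-1.0, -4.0], [0.0, 0.0], [3.0, 4.0]])
shared_cov = np.array([[3.0, 1.0], [1.0, 2.0]])
n_per_class = 5000

# Draw each class from N(mu_k, Sigma) with the same Sigma
samples = [rng.multivariate_normal(m, shared_cov, size=n_per_class) for m in class_means]

# The empirical covariance of every class should be close to shared_cov
for k, cluster in enumerate(samples):
    print(k, np.round(np.cov(cluster, rowvar=False), 2))
```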




In [None]:
# Step 1: Generate cluster centers
centers = [[-1, -4], [0, 0], [3, 4]]  # Define centers for 3 clusters
X, labels = make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=42)

# Step 2: Apply the same linear transformation to every cluster
# All three matrices are identical, so the clusters share one covariance structure
covariances = [
    np.array([[3, 1], [1, 2]]),  # Transformation for the first cluster
    np.array([[3, 1], [1, 2]]),  # Transformation for the second cluster
    np.array([[3, 1], [1, 2]]),  # Transformation for the third cluster
]


# Initialize an empty array for transformed data
X_transformed = np.zeros(X.shape)

for label, cov in zip(range(len(centers)), covariances):
    # Select data points belonging to the current cluster
    cluster_data = X[labels == label]
    # Apply the matrix as a linear map; unit-variance blobs end up with covariance cov.T @ cov
    transformed_cluster_data = cluster_data.dot(cov)
    # Store the transformed data back
    X_transformed[labels == label] = transformed_cluster_data

# Plotting
plt.figure(figsize=(8, 6))
for label in range(len(centers)):
    # Plot each cluster using transformed data
    plt.scatter(
        X_transformed[labels == label][:, 0],
        X_transformed[labels == label][:, 1],
        label=f"Cluster {label + 1}",
    )

plt.title("3 Clusters with a Shared Covariance Structure")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.grid(visible=True)
plt.show()

You can show that the Bayes classifier assigns class $k$ to observation $X=x$ if:

$$
\delta_k(x) = x^{T}\Sigma^{-1}\mu_k-\frac{1}{2} \mu_k^{T}\Sigma^{-1}\mu_k+ \log \pi_k
$$
is the largest among $\{\delta_1(x), \delta_2(x), \cdots, \delta_K(x)\}$, where $K$ is the number of classes. Because each $\delta_k$ is linear in $x$, these comparisons partition the feature space into regions separated by linear decision boundaries, and a new point is classified according to the region that contains it.
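
For completeness, here is a brief sketch of where $\delta_k$ comes from under the LDA assumptions, writing $f_k$ for the Gaussian density of class $k$ and $d$ for the dimension of $x$. Taking the log of the numerator in Bayes' Theorem gives:

$$
\log\big(\pi_k f_k(x)\big) = \log \pi_k - \frac{d}{2}\log(2\pi) - \frac{1}{2}\log|\Sigma| - \frac{1}{2}(x-\mu_k)^{T}\Sigma^{-1}(x-\mu_k)
$$

Expanding the quadratic form yields $-\frac{1}{2}x^{T}\Sigma^{-1}x + x^{T}\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k$. The terms $-\frac{d}{2}\log(2\pi)$, $-\frac{1}{2}\log|\Sigma|$, and $-\frac{1}{2}x^{T}\Sigma^{-1}x$ do not depend on the class, so dropping them leaves $\delta_k(x)$ without changing the argmax. Setting $\delta_k(x) = \delta_l(x)$ for two classes $k$ and $l$ gives the boundary between them:

$$
x^{T}\Sigma^{-1}(\mu_k - \mu_l) = \frac{1}{2}\left(\mu_k^{T}\Sigma^{-1}\mu_k - \mu_l^{T}\Sigma^{-1}\mu_l\right) - \log\frac{\pi_k}{\pi_l}
$$

This equation is linear in $x$, so it describes a hyperplane (a line when $d = 2$).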




In [None]:
# Step 3: Train LDA on the transformed data
lda = LDA()
lda.fit(X_transformed, labels)

# Step 4: Visualize the decision boundaries
# Create a mesh to plot the decision boundaries
x_min, x_max = X_transformed[:, 0].min() - 1, X_transformed[:, 0].max() + 1
y_min, y_max = X_transformed[:, 1].min() - 1, X_transformed[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

# Predict the class for each mesh point
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundaries
plt.contourf(xx, yy, Z, alpha=0.4)
plt.scatter(X_transformed[:, 0], X_transformed[:, 1], c=labels, s=20, edgecolor="k")

plt.title("LDA Decision Boundaries")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

## Problems: LDA

---

**Problem 1:** Bayes Classifier Definition

What is the Bayes Classifier? Give the definition.

> BEGIN SOLUTION

The Bayes Classifier is the theoretical optimal classifier that minimizes the probability of misclassification. It assigns a new observation $X$ to the class $k$ that has the highest posterior probability:

$$\hat{Y} = \arg\max_k P(Y = k \mid X = x)$$

By Bayes' Theorem, this can be expressed as:

$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k) P(Y = k)}{P(X = x)}$$

where $P(Y = k)$ is the prior probability of class $k$, $P(X = x \mid Y = k)$ is the class-conditional density (likelihood), and $P(X = x)$ is the marginal density of $X$.
> END SOLUTION


---

**Problem 2:** LDA Discriminant Function

Implement the discriminant function for LDA. For class $k$, the discriminant function is:

$$\delta_k(x) = x^{T}\Sigma^{-1}\mu_k - \frac{1}{2} \mu_k^{T}\Sigma^{-1}\mu_k + \log \pi_k$$

where $\Sigma$ is the shared covariance matrix, $\mu_k$ is the mean vector for class $k$, and $\pi_k$ is the prior probability of class $k$.

In [None]:
# Discriminant function for LDA
def discriminant(x: np.ndarray, mu_k: np.ndarray, sigma: np.ndarray, pi: float) -> float:
    """
    Computes the LDA discriminant function for class k.

    Parameters:
    x (np.ndarray): The input data point (d-dimensional).
    mu_k (np.ndarray): Mean vector of class k (d-dimensional).
    sigma (np.ndarray): Shared covariance matrix (d x d).
    pi (float): Prior probability of class k.

    Returns:
    float: The discriminant score for class k.
    """
    # BEGIN SOLUTION
    # Compute inverse of shared covariance matrix
    sigma_inv = np.linalg.inv(sigma)
    # Compute linear term: x^T @ Sigma^{-1} @ mu_k
    linear_term = np.dot(sigma_inv, mu_k)
    # Compute constant term: -0.5 * mu_k^T @ Sigma^{-1} @ mu_k + log(pi)
    constant_term = -0.5 * np.dot(mu_k.T, np.dot(sigma_inv, mu_k)) + np.log(pi)
    return np.dot(x, linear_term) + constant_term
    # END SOLUTION

In [None]:
# Test assertions
# Test case 1: Simple 2D identity covariance
x1 = np.array([1, 2])
mu_k1 = np.array([0, 0])
sigma1 = np.array([[1, 0], [0, 1]])
pi1 = 0.5
result1 = discriminant(x1, mu_k1, sigma1, pi1)
expected1 = -0.69314718056
assert np.isclose(result1, expected1), f"Test case 1 failed: {result1} != {expected1}"

# Test case 2: Non-zero mean
x4 = np.array([1, 2])
mu_k4 = np.array([2, 2])
sigma4 = np.array([[1, 0], [0, 1]])
pi4 = 1.0
result4 = discriminant(x4, mu_k4, sigma4, pi4)
expected4 = 2
assert np.isclose(result4, expected4), f"Test case 2 failed: {result4} != {expected4}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test case: 3D identity covariance
x3 = np.array([1, 2, 3])
mu_k3 = np.array([0, 0, 0])
sigma3 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
pi3 = 0.3
result3 = discriminant(x3, mu_k3, sigma3, pi3)
expected3 = -1.2039728043259361
assert np.isclose(result3, expected3), f"Hidden test 1 failed: {result3} != {expected3}"

# Test case: Edge case - zero input
x5 = np.array([0, 0])
mu_k5 = np.array([1, 1])
sigma5 = np.array([[1, 0], [0, 1]])
pi5 = 0.5
result5 = discriminant(x5, mu_k5, sigma5, pi5)
expected5 = -0.5 * (1 + 1) + np.log(0.5)
assert np.isclose(result5, expected5), f"Hidden test 2 failed: {result5} != {expected5}"

# Test case: High-dimensional with random covariance
d = 20
np.random.seed(42)
x_hd = np.random.normal(size=d)
mu_k_hd = np.random.normal(size=d)
sigma_hd = np.random.normal(size=(d, d))
sigma_hd = sigma_hd.dot(sigma_hd.T)
pi_hd = np.random.rand()
result6 = discriminant(x_hd, mu_k_hd, sigma_hd, pi_hd)
expected6 = -0.22459130958043083
assert np.isclose(result6, expected6), f"Hidden test 3 failed: {result6} != {expected6}"
# END HIDDEN TESTS

---

**Problem 3:** LDA Prediction

Implement a function that predicts the class by taking the argmax of the discriminant function over all classes.

In [None]:
def predict(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray, pi: np.ndarray) -> int:
    """
    Computes the predicted class index using LDA.

    Parameters:
    x (np.ndarray): The input data point. shape = (d,)
    mu (np.ndarray): Mean vectors of classes. shape = (k, d)
    sigma (np.ndarray): Shared covariance matrix. shape = (d, d)
    pi (np.ndarray): Prior probabilities. shape = (k,)

    Returns:
    int: The index of the predicted class.
    """
    # BEGIN SOLUTION
    num_classes = mu.shape[0]
    scores = np.zeros(num_classes)
    for class_idx in range(num_classes):
        scores[class_idx] = discriminant(x, mu[class_idx], sigma, pi[class_idx])
    return np.argmax(scores)
    # END SOLUTION

In [None]:
# Test assertions
# Test case 1: Simple 2D example with two classes
x_test = np.array([1, 2])
mu_test = np.array([[0, 0], [2, 2]])
sigma_test = np.array([[1, 0.2], [0.2, 1]])
pi_test = np.array([0.5, 0.5])
expected_class_1 = 1
result_1 = predict(x_test, mu_test, sigma_test, pi_test)
assert (
    result_1 == expected_class_1
), f"Test case 1 failed: expected {expected_class_1}, got {result_1}"

# Test case 2: Higher-dimensional case with three classes
x_test2 = np.array([3, 2, 1])
mu_test2 = np.array([[1, 1, 1], [4, 4, 4], [0, 0, 0]])
sigma_test2 = np.array([[1, 0.1, 0.2], [0.1, 1, 0.3], [0.2, 0.3, 1]])
pi_test2 = np.array([0.4, 0.4, 0.2])
expected_class_2 = 0
result_2 = predict(x_test2, mu_test2, sigma_test2, pi_test2)
assert (
    result_2 == expected_class_2
), f"Test case 2 failed: expected {expected_class_2}, got {result_2}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test case 3: Imbalanced priors
x_test3 = np.array([-1, -1])
mu_test3 = np.array([[1, 1], [-1, -1]])
sigma_test3 = np.array([[1, 0], [0, 1]])
pi_test3 = np.array([0.9, 0.1])
expected_class_3 = 1
result_3 = predict(x_test3, mu_test3, sigma_test3, pi_test3)
assert (
    result_3 == expected_class_3
), f"Hidden test 1 failed: expected {expected_class_3}, got {result_3}"

# Test case 4: Priors affect decision
x_test4 = np.array([0.5, 0.5])
mu_test4 = np.array([[0, 0], [1, 1]])
sigma_test4 = np.array([[1, 0], [0, 1]])
pi_test4 = np.array([0.1, 0.9])
result_4 = predict(x_test4, mu_test4, sigma_test4, pi_test4)
assert result_4 == 1, f"Hidden test 2 failed: expected 1, got {result_4}"
# END HIDDEN TESTS

---

**Problem 4:** Vectorized Discriminant Scores

Python `for` loops over classes are slow compared with vectorized NumPy operations. Implement a vectorized function that directly returns an array of discriminant scores for all classes.

**Hint:** You can use `np.einsum` to perform the computation of each of the two terms of the discriminants in one single call. For example, `np.einsum('ij,jk->ik', A, B)` computes matrix multiplication. See the [NumPy einsum documentation](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html) for more details.
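
To illustrate the index notation on generic arrays, here is a short sketch. The arrays `A`, `B`, `M`, and `v` are made-up examples and are unrelated to the LDA parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
v = rng.normal(size=4)
M = rng.normal(size=(4, 4))

# The repeated index j is summed over: ordinary matrix multiplication
matmul = np.einsum("ij,jk->ik", A, B)

# Three operands in one call: the scalar bilinear form v^T M v
bilinear = np.einsum("i,ij,j->", v, M, v)

# Keeping an index in the output avoids summing over it:
# row-wise quadratic forms a_r^T M a_r for every row a_r of A
row_forms = np.einsum("ri,ij,rj->r", A, M, A)

print(np.allclose(matmul, A @ B))
```

An index that appears in the inputs but not after `->` is summed out, while an index that appears in the output is kept as an axis of the result.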

In [None]:
def discriminant_scores_single(
    x: np.ndarray, mu: np.ndarray, sigma: np.ndarray, pi: np.ndarray
) -> np.ndarray:
    """
    Computes all discriminant scores for LDA for a single data point.

    Parameters:
    x (np.ndarray): The input data point. shape = (d,)
    mu (np.ndarray): Mean vectors of classes. shape = (k, d)
    sigma (np.ndarray): Shared covariance matrix. shape = (d, d)
    pi (np.ndarray): Prior probabilities. shape = (k,)

    Returns:
    np.ndarray: An array of discriminant scores. shape = (k,)
    """
    # BEGIN SOLUTION
    # Compute inverse of shared covariance matrix
    sigma_inv = np.linalg.inv(sigma)
    # Compute linear term using einsum: x^T @ Sigma^{-1} @ mu_k for all k
    linear_term = np.einsum("i, ij, kj -> k", x, sigma_inv, mu)
    # Compute constant term: -0.5 * mu_k^T @ Sigma^{-1} @ mu_k + log(pi_k)
    constant_term = -0.5 * np.einsum("ki, ij, kj -> k", mu, sigma_inv, mu) + np.log(pi)
    return linear_term + constant_term
    # END SOLUTION

In [None]:
def predict_vectorized_single(
    x: np.ndarray, mu: np.ndarray, sigma: np.ndarray, pi: np.ndarray
) -> int:
    """Helper function that uses discriminant_scores_single to predict."""
    return np.argmax(discriminant_scores_single(x, mu, sigma, pi))

In [None]:
# Test assertions
# Test case 1: Simple 2D example with two classes
x_test = np.array([1, 2])
mu_test = np.array([[0, 0], [2, 2]])
sigma_test = np.array([[1, 0.2], [0.2, 1]])
pi_test = np.array([0.5, 0.5])
expected_class_1 = 1
result_1 = predict_vectorized_single(x_test, mu_test, sigma_test, pi_test)
assert (
    result_1 == expected_class_1
), f"Test case 1 failed: expected {expected_class_1}, got {result_1}"

# Test case 2: Higher-dimensional case with three classes
x_test2 = np.array([3, 2, 1])
mu_test2 = np.array([[1, 1, 1], [4, 4, 4], [0, 0, 0]])
sigma_test2 = np.array([[1, 0.1, 0.2], [0.1, 1, 0.3], [0.2, 0.3, 1]])
pi_test2 = np.array([0.4, 0.4, 0.2])
expected_class_2 = 0
result_2 = predict_vectorized_single(x_test2, mu_test2, sigma_test2, pi_test2)
assert (
    result_2 == expected_class_2
), f"Test case 2 failed: expected {expected_class_2}, got {result_2}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test case 3: Imbalanced priors
x_test3 = np.array([-1, -1])
mu_test3 = np.array([[1, 1], [-1, -1]])
sigma_test3 = np.array([[1, 0], [0, 1]])
pi_test3 = np.array([0.9, 0.1])
expected_class_3 = 1
result_3 = predict_vectorized_single(x_test3, mu_test3, sigma_test3, pi_test3)
assert (
    result_3 == expected_class_3
), f"Hidden test 1 failed: expected {expected_class_3}, got {result_3}"

# Test case 4: Verify scores match non-vectorized version
np.random.seed(123)
x_rand = np.random.normal(size=5)
mu_rand = np.random.normal(size=(3, 5))
sigma_rand = np.random.normal(size=(5, 5))
sigma_rand = sigma_rand.dot(sigma_rand.T)
pi_rand = np.array([0.3, 0.4, 0.3])
scores_vec = discriminant_scores_single(x_rand, mu_rand, sigma_rand, pi_rand)
for class_idx in range(3):
    score_loop = discriminant(x_rand, mu_rand[class_idx], sigma_rand, pi_rand[class_idx])
    assert np.isclose(
        scores_vec[class_idx], score_loop
    ), f"Hidden test 2 failed for class {class_idx}"
# END HIDDEN TESTS

---

**Problem 5:** Batch Discriminant Scores

Now vectorize the function over multiple data points to handle batched inputs.

**Hint:** You can use `np.einsum` to perform the computation of each of the two terms of the discriminants in one single call. See the [NumPy einsum documentation](https://numpy.org/doc/stable/reference/generated/numpy.einsum.html) for more details.
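
Two ingredients are useful for batching, sketched below on made-up arrays: an extra output index in `np.einsum` that pairs every row of one array with every row of another, and NumPy broadcasting, which adds a per-class vector of shape `(k,)` to every row of an `(n, k)` array. The names `X`, `C`, `W`, and `offset` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 4, 3, 2
X = rng.normal(size=(n, d))  # n points
C = rng.normal(size=(k, d))  # k class-specific vectors
W = rng.normal(size=(d, d))  # a d x d weight matrix

# Index n is kept in the output, so every point is paired with every class
pairwise = np.einsum("ni,ij,kj->nk", X, W, C)  # same as X @ W @ C.T

# Shapes (n, k) + (k,) broadcast: the per-class offset is added to every row
offset = np.array([10.0, 20.0, 30.0])
total = pairwise + offset
print(total.shape)  # (4, 3)
```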

In [None]:
def discriminant_scores(
    x: np.ndarray, mu: np.ndarray, sigma: np.ndarray, pi: np.ndarray
) -> np.ndarray:
    """
    Computes all discriminant scores for LDA at several data points.

    Parameters:
    x (np.ndarray): The input data points. shape = (n, d)
    mu (np.ndarray): Mean vectors of classes. shape = (k, d)
    sigma (np.ndarray): Shared covariance matrix. shape = (d, d)
    pi (np.ndarray): Prior probabilities. shape = (k,)

    Returns:
    np.ndarray: An array of discriminant scores. shape = (n, k)
    """
    # BEGIN SOLUTION
    # Compute inverse of shared covariance matrix
    sigma_inv = np.linalg.inv(sigma)
    # Compute linear term (n, k) using einsum
    linear_term = np.einsum("li, ij, kj -> lk", x, sigma_inv, mu)
    # Compute constant term (k,)
    constant_term = -0.5 * np.einsum("ki, ij, kj -> k", mu, sigma_inv, mu) + np.log(pi)
    return linear_term + constant_term
    # END SOLUTION

In [None]:
# Test assertions
def predict_vectorized(
    x: np.ndarray, mu: np.ndarray, sigma: np.ndarray, pi: np.ndarray
) -> np.ndarray:
    """Helper function to predict classes for batched inputs."""
    return np.argmax(discriminant_scores(x, mu, sigma, pi), axis=1)


# Test case 1: Basic shape test
num_samples = 4
num_features = 3
num_classes = 2
np.random.seed(503)
x_batch = np.random.normal(size=(num_samples, num_features))
mu_batch = np.random.normal(size=(num_classes, num_features))
sigma_batch = np.random.normal(size=(num_features, num_features))
sigma_batch = sigma_batch.dot(sigma_batch.T)
pi_batch = np.random.rand(num_classes)

result = discriminant_scores(x_batch, mu_batch, sigma_batch, pi_batch)
expected = np.array(
    [
        [-0.79650917, -1.79941157],
        [-0.63978301, 2.06795532],
        [0.10567327, 1.54700591],
        [0.04314746, 1.44296768],
    ]
)
assert np.allclose(result, expected), f"Test case 1 failed: {result} != {expected}"

# Test case 2: Larger test with shape and statistics check
num_samples = 100
num_features = 20
num_classes = 5
np.random.seed(42)
x_large = np.random.normal(size=(num_samples, num_features))
mu_large = np.random.normal(size=(num_classes, num_features))
sigma_large = np.random.normal(size=(num_features, num_features))
sigma_large = sigma_large.dot(sigma_large.T)
pi_large = np.random.rand(num_classes)

result = discriminant_scores(x_large, mu_large, sigma_large, pi_large)
assert result.shape == (
    num_samples,
    num_classes,
), f"Shape test failed: {result.shape} != {(num_samples, num_classes)}"
assert np.isclose(result.mean(), -41.130043376305416), f"Mean test failed: {result.mean()}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
assert np.isclose(result.std(), 95.2601173142212), f"Hidden test 1 failed: std {result.std()}"
assert np.isclose(result.min(), -525.4935904371674), f"Hidden test 2 failed: min {result.min()}"
assert np.isclose(result.max(), 252.6084971655994), f"Hidden test 3 failed: max {result.max()}"
# END HIDDEN TESTS

---

**Problem 6:** LDA Parameter Count

We now know how to predict data points given a list of means, a covariance matrix, and a vector of prior probabilities.

LDA starts by estimating all of these. How many parameters are estimated with $k$ classes and $d$ dimensions?

> BEGIN SOLUTION

LDA estimates the following parameters:

- **Mean vectors**: $k$ class means, each of dimension $d$, giving $k \cdot d$ parameters.
- **Shared covariance matrix**: One $d \times d$ symmetric matrix, giving $\frac{d(d+1)}{2}$ unique parameters.
- **Prior probabilities**: $k$ priors, but since they sum to 1, only $k-1$ are free parameters.

**Total parameters**: $k \cdot d + \frac{d(d+1)}{2} + (k-1)$

This simplifies to: $kd + \frac{d^2 + d}{2} + k - 1$
> END SOLUTION
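
As a quick arithmetic check of this count, the formula can be wrapped in a small helper. `lda_num_params` is a hypothetical name used only for illustration.

```python
def lda_num_params(k: int, d: int) -> int:
    """Free parameters in LDA: k means, one symmetric covariance, k - 1 priors."""
    num_means = k * d
    num_cov = d * (d + 1) // 2
    num_priors = k - 1
    return num_means + num_cov + num_priors


print(lda_num_params(k=3, d=2))  # 6 + 3 + 2 = 11
print(lda_num_params(k=2, d=30))  # 60 + 465 + 1 = 526
```

The covariance term grows quadratically in $d$, so it dominates the count once there are more than a handful of features.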


---

**Problem 7:** LDA Parameter Estimation

Write a function that takes in $X$, $y$ and computes all LDA parameters using the following formulas:

$$
\mu_k = \frac{1}{\#\{y_i = k\}}\sum_{y_i = k} x_i
$$

$$
\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{y_i = k} (x_i - \mu_k)(x_i - \mu_k)^T
$$

$$
\pi_k = \frac{\#\{y_i = k\}}{n}
$$

where $\mu_k$ is the mean vector for class $k$, $\Sigma$ is the shared covariance matrix, and $\pi_k$ is the prior probability of class $k$.
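
One way to read the formula for $\Sigma$: it is a weighted average of the per-class sample covariances, with weights $(n_k - 1)/(n - K)$, where $n_k$ is the number of samples in class $k$. The sketch below checks this identity on random data. The data, class sizes, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.repeat([0, 1, 2], [10, 20, 30])
n, K = len(y), 3

# Direct formula: within-class scatter summed over classes, divided by n - K
scatter = np.zeros((3, 3))
for c in range(K):
    centered = X[y == c] - X[y == c].mean(axis=0)
    scatter += centered.T @ centered
sigma_pooled = scatter / (n - K)

# Equivalent view: weighted average of per-class sample covariances,
# with weights (n_c - 1) / (n - K); np.cov divides by n_c - 1 by default
sigma_weighted = sum(
    (np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False) for c in range(K)
) / (n - K)

print(np.allclose(sigma_pooled, sigma_weighted))
```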

In [None]:
def lda_estimator(X: np.ndarray, y: np.ndarray) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Estimates the parameters for Linear Discriminant Analysis (LDA).

    Parameters:
    X: np.ndarray, shape = (n, d). Feature matrix for n samples.
    y: np.ndarray, shape = (n,). Class labels for n samples.

    Returns:
    mu: np.ndarray, shape = (k, d) - Mean vectors for each class.
    sigma: np.ndarray, shape = (d, d) - Shared covariance matrix.
    pi: np.ndarray, shape = (k,) - Prior probabilities for each class.
    """
    # BEGIN SOLUTION
    # Map arbitrary labels to indices 0..k-1 so they can be used to index into mu
    classes, y_idx = np.unique(y, return_inverse=True)
    num_classes = len(classes)
    num_samples = X.shape[0]

    # Compute priors: pi_k = (# samples in class k) / (total samples)
    pi = np.array([np.mean(y == c) for c in classes])

    # Compute means: mu_k = mean of X for samples in class k
    mu = np.array([X[y == c].mean(axis=0) for c in classes])

    # Compute shared covariance matrix using pooled within-class scatter
    X_centered = X - mu[y_idx]
    sigma = (X_centered.T @ X_centered) / (num_samples - num_classes)

    return mu, sigma, pi
    # END SOLUTION

In [None]:
# Test assertions
# Generate random test dataset
np.random.seed(42)
n_samples_per_class = 50
num_features = 2
num_classes = 2

# Generate random means for each class
mu_true = np.array([[2, 3], [9, 8]])
sigma_true = np.array([[1, 0.5], [0.5, 1]])

# Generate random data
X_class_0 = np.random.multivariate_normal(mu_true[0], sigma_true, n_samples_per_class)
X_class_1 = np.random.multivariate_normal(mu_true[1], sigma_true, n_samples_per_class)

# Combine data
X_test = np.vstack([X_class_0, X_class_1])
y_test = np.hstack(
    [np.zeros(n_samples_per_class, dtype=int), np.ones(n_samples_per_class, dtype=int)]
)

# Compute LDA parameters
mu_test, sigma_test, pi_test = lda_estimator(X_test, y_test)

# Expected values
expected_mu = np.array([[2.15350725, 3.08148984], [9.01263359, 8.15269564]])
expected_pi = np.array([0.5, 0.5])
expected_sigma = np.array([[0.82666485, 0.30615014], [0.30615014, 0.78206076]])

# Test means
assert np.allclose(mu_test, expected_mu, atol=1e-2), f"Mean test failed: {mu_test}"

# Test priors
assert np.allclose(pi_test, expected_pi, atol=1e-2), f"Prior test failed: {pi_test}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test covariance
assert np.allclose(
    sigma_test, expected_sigma, atol=1e-2
), f"Hidden test 1 (covariance) failed: {sigma_test}"

# Test with different class sizes
np.random.seed(99)
X_imbalanced = np.vstack(
    [
        np.random.multivariate_normal([0, 0], np.eye(2), 30),
        np.random.multivariate_normal([3, 3], np.eye(2), 70),
    ]
)
y_imbalanced = np.array([0] * 30 + [1] * 70)
mu_imb, sigma_imb, pi_imb = lda_estimator(X_imbalanced, y_imbalanced)
assert np.isclose(pi_imb[0], 0.3), f"Hidden test 2 (priors) failed: {pi_imb[0]}"
assert np.isclose(pi_imb[1], 0.7), f"Hidden test 3 (priors) failed: {pi_imb[1]}"
# END HIDDEN TESTS

---

**Problem 8:** LDA Class Implementation

Now implement a complete LDA classifier class. Use the functions we wrote above to fill in this class.

In [None]:
class LDAClassifier:
    """Linear Discriminant Analysis classifier."""

    def __init__(self):
        self.mu = None
        self.sigma = None
        self.pi = None

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """
        Estimates the parameters for Linear Discriminant Analysis (LDA).

        Parameters:
        X: np.ndarray, shape = (n, d). Feature matrix for n samples.
        y: np.ndarray, shape = (n,). Class labels for n samples.

        Returns:
        None
        """
        # BEGIN SOLUTION
        self.mu, self.sigma, self.pi = lda_estimator(X, y)
        # END SOLUTION

    def discriminant_scores(self, X: np.ndarray) -> np.ndarray:
        """
        Computes the discriminant scores for LDA.

        Parameters:
        X: np.ndarray, shape = (n, d). Feature matrix for n samples

        Returns:
        np.ndarray, shape = (n, k) - Discriminant scores for each class.
        """
        # BEGIN SOLUTION
        return discriminant_scores(X, self.mu, self.sigma, self.pi)
        # END SOLUTION

    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predicts class labels for given data points.

        Parameters:
        X: np.ndarray, shape = (n, d). Feature matrix for n samples.

        Returns:
        np.ndarray, shape = (n,). Predicted class labels for n samples.
        """
        # BEGIN SOLUTION
        return np.argmax(self.discriminant_scores(X), axis=1)
        # END SOLUTION

In [None]:
# Test assertions
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and preprocess data
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit our LDA classifier
lda_clf = LDAClassifier()
lda_clf.fit(X_train, y_train)

# Predict and compute accuracy
y_pred = lda_clf.predict(X_test)
accuracy = np.mean(y_pred == y_test)

# Test that accuracy is reasonable (above 90%)
assert accuracy > 0.90, f"Accuracy too low: {accuracy}"

# Visualize predictions
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred, cmap="viridis")
plt.title("Predicted Classes on Test Set")
plt.xlabel("Feature 1 (scaled)")
plt.ylabel("Feature 2 (scaled)")
plt.text(
    0.95,
    0.05,
    f"Test Accuracy: {accuracy:.2f}",
    transform=plt.gca().transAxes,
    ha="right",
    va="bottom",
)
plt.show()

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test that fit properly sets attributes
assert lda_clf.mu is not None, "Hidden test 1 failed: mu not set"
assert lda_clf.sigma is not None, "Hidden test 2 failed: sigma not set"
assert lda_clf.pi is not None, "Hidden test 3 failed: pi not set"

# Test shapes
assert lda_clf.mu.shape == (2, 30), f"Hidden test 4 failed: mu shape {lda_clf.mu.shape}"
assert lda_clf.sigma.shape == (
    30,
    30,
), f"Hidden test 5 failed: sigma shape {lda_clf.sigma.shape}"
assert lda_clf.pi.shape == (2,), f"Hidden test 6 failed: pi shape {lda_clf.pi.shape}"
# END HIDDEN TESTS

## Quadratic Discriminant Analysis

QDA relaxes the LDA assumption that classes share the same covariance matrix. Instead, each class has its own covariance matrix $\Sigma_k$.

### QDA Assumptions

- Data is drawn from $k$ multivariate normal distributions, where each distribution can have a different mean vector $\mu_k$ and a different covariance structure $\Sigma_k$.

Giving each class its own covariance matrix makes QDA more flexible than LDA, so it can capture more complex class structure. The price is that many more parameters must be estimated, which makes the estimates noisier, so QDA may perform worse than LDA when training data is limited.

QDA assigns observation $x$ to class $k$ if the discriminant function:

$$
\delta_k(x) = -\frac{1}{2}x^{T}\Sigma^{-1}_k x + x^{T}\Sigma^{-1}_k\mu_k - \frac{1}{2} \mu_k^{T}\Sigma^{-1}_k\mu_k - \frac{1}{2}\log|\Sigma_k| + \log \pi_k
$$

is the largest among all classes. This induces quadratic decision boundaries.
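
A minimal sketch of this score, written term by term, is given below. The function names `qda_discriminant` and `lda_discriminant` and the numeric inputs are assumptions for illustration. The final check uses the fact that when two classes share the same covariance, the quadratic and log-determinant terms cancel, so the gap between QDA scores equals the gap between LDA scores.

```python
import numpy as np


def qda_discriminant(x, mu_k, sigma_k, pi_k):
    """QDA score for one class, following the formula term by term."""
    sigma_inv = np.linalg.inv(sigma_k)
    _, logdet = np.linalg.slogdet(sigma_k)  # numerically stable log-determinant
    quadratic = -0.5 * x @ sigma_inv @ x
    linear = x @ sigma_inv @ mu_k
    constant = -0.5 * mu_k @ sigma_inv @ mu_k - 0.5 * logdet + np.log(pi_k)
    return quadratic + linear + constant


def lda_discriminant(x, mu_k, sigma, pi_k):
    """LDA score for one class with a shared covariance matrix."""
    sigma_inv = np.linalg.inv(sigma)
    return x @ sigma_inv @ mu_k - 0.5 * mu_k @ sigma_inv @ mu_k + np.log(pi_k)


# Illustrative (assumed) inputs: two classes that happen to share a covariance
x = np.array([1.0, 2.0])
shared = np.array([[2.0, 0.3], [0.3, 1.0]])
mu_a, mu_b = np.array([0.0, 0.0]), np.array([2.0, 2.0])

gap_qda = qda_discriminant(x, mu_a, shared, 0.5) - qda_discriminant(x, mu_b, shared, 0.5)
gap_lda = lda_discriminant(x, mu_a, shared, 0.5) - lda_discriminant(x, mu_b, shared, 0.5)
print(np.isclose(gap_qda, gap_lda))
```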

## Problems: QDA

---

**Problem 9:** QDA Parameter Count

How many parameters do you have to estimate in QDA with $k$ classes and $d$ dimensions?

> BEGIN SOLUTION

QDA estimates the following parameters:

- **Mean vectors**: $k$ class means, each of dimension $d$, giving $k \cdot d$ parameters.
- **Class-specific covariance matrices**: $k$ symmetric $d \times d$ matrices, each with $\frac{d(d+1)}{2}$ unique parameters, giving $k \cdot \frac{d(d+1)}{2}$ total.
- **Prior probabilities**: $k$ priors, but since they sum to 1, only $k-1$ are free parameters.

**Total parameters**: $k \cdot d + k \cdot \frac{d(d+1)}{2} + (k-1)$

This simplifies to: $k \left( d + \frac{d(d+1)}{2} \right) + k - 1 = k \cdot \frac{d^2 + 3d + 2}{2} - 1$

Compared to LDA, QDA has $(k-1) \cdot \frac{d(d+1)}{2}$ more parameters due to the class-specific covariance matrices.
> END SOLUTION
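
To see how quickly the gap between the two models grows with the number of features, the counts can be tabulated for a few dimensions. The helper names below are hypothetical, and the choice of $k = 3$ and the listed values of $d$ is arbitrary.

```python
def lda_num_params(k: int, d: int) -> int:
    """k means, one shared symmetric covariance, k - 1 free priors."""
    return k * d + d * (d + 1) // 2 + (k - 1)


def qda_num_params(k: int, d: int) -> int:
    """k means, k symmetric covariances, k - 1 free priors."""
    return k * d + k * d * (d + 1) // 2 + (k - 1)


k = 3
for d in [2, 10, 30]:
    lda_count, qda_count = lda_num_params(k, d), qda_num_params(k, d)
    print(f"d={d:>2}: LDA={lda_count:>4}, QDA={qda_count:>4}, extra={qda_count - lda_count}")
```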


---

**Problem 10:** QDA on NHANES Dataset

Fit QDA using `sklearn` on the NHANES dataset. The data files are located in `data/NHANES/`. Pick two features (BMI and RatioToPoverty) and focus on HDL as an outcome. Threshold HDL to generate three class labels:

- Group 0: HDL < 40
- Group 1: 40 <= HDL < 70
- Group 2: HDL >= 70

Before fitting the model, scale your data and split it into train and test sets.
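
The thresholding step can be sketched with `np.digitize`, which maps each value to the index of the bin it falls into. The HDL values below are made up to exercise the bin edges and are not taken from NHANES.

```python
import numpy as np

# Made-up HDL values chosen to sit on and around the thresholds
hdl_values = np.array([25.0, 39.9, 40.0, 55.0, 69.9, 70.0, 90.0])

# With the default right=False, bin i satisfies bins[i-1] <= x < bins[i],
# so 40 falls in group 1 and 70 falls in group 2
groups = np.digitize(hdl_values, bins=[40, 70])
print(groups)  # [0 0 1 1 1 2 2]
```

Nested `np.where` calls give the same labels; `np.digitize` is simply shorter when there are several thresholds.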

In [None]:
# BEGIN SOLUTION
# Load NHANES data from data folder
data_path = "data/NHANES/"
hdl = pd.read_sas(data_path + "HDL_L.xpt")
bmx = pd.read_sas(data_path + "BMX_L.xpt")
demo = pd.read_sas(data_path + "DEMO_L.xpt")

# Join on SEQN
df = pd.merge(hdl, bmx, on="SEQN", how="inner")
df = pd.merge(df, demo, on="SEQN", how="inner")

# Keep relevant features
my_df = df[["SEQN", "LBDHDD", "BMXBMI", "RIDAGEYR", "INDFMPIR"]]
my_df = my_df.rename(
    columns={
        "LBDHDD": "HDL",
        "BMXBMI": "BMI",
        "INDFMPIR": "RatioToPoverty",
        "RIDAGEYR": "ScreeningAge",
    }
)

# Remove NAs
my_df = my_df.dropna()

# Threshold HDL into three groups
my_df["HDL_group"] = np.where(my_df["HDL"] < 40, 0, np.where(my_df["HDL"] < 70, 1, 2))

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(my_df.drop(columns=["SEQN"]).corr(), annot=True, cmap="coolwarm")
plt.title("Correlation Matrix with Outcome")
plt.show()

# Features and labels
X = my_df[["BMI", "RatioToPoverty"]].values
y = my_df["HDL_group"].values

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Fit QDA
qda = QDA()
qda.fit(X_train, y_train)

# Predict and evaluate
y_pred = qda.predict(X_test)
print(f"Accuracy: {np.mean(y_pred == y_test):.4f}")
# END SOLUTION

In [None]:
# Test assertions
# Verify QDA model was fitted
assert hasattr(qda, "means_"), "QDA model should be fitted"
assert len(qda.means_) == 3, "Should have 3 class means"

# Verify accuracy is reasonable (better than random guessing with 3 classes)
qda_accuracy = np.mean(y_pred == y_test)
assert qda_accuracy > 0.4, f"Accuracy should be better than random: {qda_accuracy}"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify data was properly scaled
assert np.abs(X_scaled.mean()) < 0.1, "Data should be centered near 0"
assert np.abs(X_scaled.std() - 1) < 0.5, "Data should be standardized"

# Verify train/test split
assert len(X_train) > len(X_test), "Training set should be larger than test set"
assert len(X_train) + len(X_test) == len(X_scaled), "Total samples should match"
# END HIDDEN TESTS

---

**Problem 11:** QDA Decision Boundaries

Plot the decision boundaries for the QDA model fitted in Problem 10.
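
The usual recipe is to evaluate the classifier on a dense grid and color each grid point by its predicted class. The shape bookkeeping is sketched below on a tiny grid, with a hypothetical threshold rule standing in for a fitted model's `predict` method.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([10.0, 20.0])
xx, yy = np.meshgrid(xs, ys)  # both have shape (len(ys), len(xs)) = (2, 3)

# Flatten the grid into an (n_points, 2) array, one row per grid point
grid_points = np.c_[xx.ravel(), yy.ravel()]

# Stand-in for model.predict: any function returning one label per row
labels = (grid_points[:, 0] + grid_points[:, 1] > 12).astype(int)

# Reshape back to the grid shape so a contour plot can draw the regions
Z = labels.reshape(xx.shape)
print(grid_points.shape, Z.shape)  # (6, 2) (2, 3)
```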

In [None]:
# BEGIN SOLUTION
# Set min and max values with padding
x_min, x_max = X_scaled[:, 0].min() - 1, X_scaled[:, 0].max() + 1
y_min, y_max = X_scaled[:, 1].min() - 1, X_scaled[:, 1].max() + 1
step_size = 0.02

# Generate a grid of points
xx, yy = np.meshgrid(np.arange(x_min, x_max, step_size), np.arange(y_min, y_max, step_size))

# Predict the class for the whole grid
Z = qda.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

# Plot the decision boundaries together with all scaled data points (train and test)
plt.figure(figsize=(10, 8))
plt.contourf(xx, yy, Z, alpha=0.5, cmap="viridis")
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, s=20, edgecolor="k", cmap="viridis")
plt.title("QDA Decision Boundaries")
plt.xlabel("BMI (scaled)")
plt.ylabel("RatioToPoverty (scaled)")
plt.colorbar(label="Class")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
# Verify the grid was created properly
assert xx.shape == yy.shape, "Grid shapes should match"
assert Z.shape == xx.shape, "Prediction grid shape should match input grid"

# Verify predictions are valid class labels
assert set(np.unique(Z)).issubset({0, 1, 2}), "Predictions should be valid class labels"

print("All tests passed!")

# BEGIN HIDDEN TESTS
# Verify grid covers the data range
assert xx.min() < X_scaled[:, 0].min(), "Grid should extend beyond data range"
assert xx.max() > X_scaled[:, 0].max(), "Grid should extend beyond data range"
assert yy.min() < X_scaled[:, 1].min(), "Grid should extend beyond data range"
assert yy.max() > X_scaled[:, 1].max(), "Grid should extend beyond data range"
# END HIDDEN TESTS