# DATASCI 503, Homework 4: Classification

This assignment covers Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and logistic regression for classification problems.

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sympy import solve, symbols

warnings.filterwarnings("ignore")

---

**Problem 1:** LDA Discriminant Function Derivation

It was stated in the text that classifying an observation to the class for which
$$ p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^K \pi_l \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)} $$
is largest is equivalent to classifying an observation to the class for which
$$ \delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) $$
is largest.

Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $\mathcal{N}(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

**Hint:** Take the log of the numerator and simplify, noting that terms that don't depend on $k$ can be ignored when comparing across classes.

> BEGIN SOLUTION

The denominator is the same for every class $k$, and $\log$ is strictly increasing, so the class that maximizes $p_k(x)$ is the class that maximizes the log of the numerator:

$$\log\left(\pi_k \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)\right)$$

Using log properties:
$$\log(\pi_k) + \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(\exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)\right)$$

$$= \log(\pi_k) - \frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(x - \mu_k)^2$$

Expanding $(x - \mu_k)^2$:
$$= \log(\pi_k) - \frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}x^2 + \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2}$$

The terms $-\frac{1}{2}\log(2\pi\sigma^2)$ and $-\frac{1}{2\sigma^2}x^2$ do not depend on $k$, so they do not affect which class maximizes the expression. Dropping them leaves exactly the discriminant function, so maximizing $p_k(x)$ over $k$ is equivalent to maximizing
$$\delta_k(x) = \log(\pi_k) + \frac{\mu_k}{\sigma^2}x - \frac{\mu_k^2}{2\sigma^2}$$
> END SOLUTION
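
The equivalence can also be checked numerically. The sketch below is not part of the assignment: the means, priors, and variance are hypothetical values chosen only for illustration, and the helper names `posterior` and `delta` are made up here. It compares the class picked by $p_k(x)$ with the class picked by $\delta_k(x)$ on a grid of $x$ values.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration (not from the assignment).
mus = np.array([-1.0, 0.5, 3.0])
priors = np.array([0.2, 0.5, 0.3])
sigma2 = 1.7


def posterior(x):
    """Return p_k(x) for every class k; one row per x value."""
    x = np.asarray(x, dtype=float)[:, None]
    numer = priors * np.exp(-((x - mus) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return numer / numer.sum(axis=1, keepdims=True)


def delta(x):
    """Return the LDA discriminant delta_k(x) for every class k; one row per x value."""
    x = np.asarray(x, dtype=float)[:, None]
    return x * mus / sigma2 - mus**2 / (2 * sigma2) + np.log(priors)


x_grid = np.linspace(-6, 8, 2001)
class_from_posterior = posterior(x_grid).argmax(axis=1)
class_from_delta = delta(x_grid).argmax(axis=1)
same_class = np.array_equal(class_from_posterior, class_from_delta)
print("Same predicted class everywhere on the grid:", same_class)
```

Agreement on a grid is only a sanity check; the algebra above is what proves the result.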


---

**Problem 2:** QDA is Quadratic

This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e., there is only one feature.

Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim \mathcal{N}(\mu_k, \sigma^2_k)$. Prove that in this case, the Bayes classifier is *not linear*. Argue that it is in fact quadratic.

**Hint:** Follow the arguments laid out in Problem 1, but without making the assumption that $\sigma^2_1 = \dots = \sigma^2_K$.

> BEGIN SOLUTION

Following the same approach as Problem 1, but with class-specific variances $\sigma_k^2$:

$$\log\left(\pi_k \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(x - \mu_k)^2}{2\sigma_k^2}\right)\right)$$

$$= \log(\pi_k) - \frac{1}{2}\log(2\pi\sigma_k^2) - \frac{(x - \mu_k)^2}{2\sigma_k^2}$$

$$= \log(\pi_k) - \frac{1}{2}\log(2\pi\sigma_k^2) - \frac{x^2}{2\sigma_k^2} + \frac{x\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2}$$

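Computing the log odds between class $k$ and class $k'$ (the shared denominator cancels):
$$\begin{aligned}
\log\frac{p_k(x)}{p_{k'}(x)} = {} & \log(\pi_k) - \frac{1}{2}\log(2\pi\sigma_k^2) - \frac{x^2}{2\sigma_k^2} + \frac{x\mu_k}{\sigma_k^2} - \frac{\mu_k^2}{2\sigma_k^2} \\
& - \left(\log(\pi_{k'}) - \frac{1}{2}\log(2\pi\sigma_{k'}^2) - \frac{x^2}{2\sigma_{k'}^2} + \frac{x\mu_{k'}}{\sigma_{k'}^2} - \frac{\mu_{k'}^2}{2\sigma_{k'}^2}\right)
\end{aligned}$$

Collecting the $x^2$ terms gives the coefficient $\frac{1}{2}\left(\frac{1}{\sigma_{k'}^2} - \frac{1}{\sigma_k^2}\right)$, which is nonzero whenever $\sigma_k^2 \neq \sigma_{k'}^2$. The decision boundary, where the log odds equals zero, is therefore the solution set of a quadratic equation in $x$ rather than a linear one.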
> END SOLUTION
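
As a numerical illustration of the argument above, the following sketch fits a degree-2 polynomial to the log odds and reads off the $x^2$ coefficient. The parameter values and the helper name `log_odds_quadratic_coeff` are assumptions made for this example only.

```python
import numpy as np
from scipy.stats import norm


def log_odds_quadratic_coeff(mu_k, var_k, mu_j, var_j, prior_k=0.5, prior_j=0.5):
    """Fit a degree-2 polynomial to the log odds and return its x^2 coefficient."""
    x = np.linspace(-5.0, 5.0, 201)
    log_num_k = np.log(prior_k) + norm.logpdf(x, loc=mu_k, scale=np.sqrt(var_k))
    log_num_j = np.log(prior_j) + norm.logpdf(x, loc=mu_j, scale=np.sqrt(var_j))
    # The log odds is exactly quadratic in x, so the fit recovers its coefficients.
    return np.polyfit(x, log_num_k - log_num_j, deg=2)[0]


# Hypothetical parameters: unequal variances (QDA setting) vs. equal variances (LDA setting).
unequal = log_odds_quadratic_coeff(0.0, 1.0, 2.0, 4.0)
equal = log_odds_quadratic_coeff(0.0, 1.0, 2.0, 1.0)
expected_unequal = 0.5 * (1 / 4.0 - 1 / 1.0)

print(f"x^2 coefficient, unequal variances: {unequal:.6f} (formula: {expected_unequal:.6f})")
print(f"x^2 coefficient, equal variances:   {equal:.6f}")
```

With equal variances the fitted $x^2$ coefficient vanishes up to rounding, which recovers the linear rule of Problem 1.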


---

**Problem 3:** LDA vs QDA Performance

We now examine the differences between LDA and QDA.

(a) If the Bayes decision boundary is linear, do we expect LDA or QDA to perform better on the training set? On the test set?

> BEGIN SOLUTION

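If the Bayes decision boundary is linear, QDA will typically do at least as well as LDA on the training set, because its extra flexibility lets it fit the training data (including its noise) more closely. On the test set we expect LDA to do better: its linear model already matches the true boundary, so QDA's additional flexibility adds variance without reducing bias.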

> END SOLUTION

(b) If the Bayes decision boundary is non-linear, do we expect LDA or QDA to perform better on the training set? On the test set?

> BEGIN SOLUTION

If the Bayes decision boundary is non-linear, we expect QDA to perform better on both the training and test sets, because it can represent a non-linear (quadratic) boundary while LDA cannot.

> END SOLUTION

(c) True or False: Even if the Bayes decision boundary for a given problem is linear, we will probably achieve a superior test error rate using QDA rather than LDA because QDA is flexible enough to model a linear decision boundary. Justify your answer.

> BEGIN SOLUTION

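False. QDA can represent a linear decision boundary, but it estimates a separate covariance matrix for every class and therefore has more parameters than LDA. When the true boundary is linear, that extra flexibility only adds variance, so with limited training data QDA tends to overfit and give a worse test error than LDA.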

> END SOLUTION
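
To make "more parameters to estimate" concrete, here is a small sketch that counts free parameters for $p$ features and $K$ classes: class means, covariance entries (one shared symmetric matrix for LDA, one per class for QDA), and $K - 1$ free priors. The function names are hypothetical, and the chosen values of $p$ are only examples.

```python
def n_params_lda(p, K):
    """K mean vectors, one shared symmetric covariance matrix, K - 1 free priors."""
    return K * p + p * (p + 1) // 2 + (K - 1)


def n_params_qda(p, K):
    """K mean vectors, K symmetric covariance matrices, K - 1 free priors."""
    return K * p + K * p * (p + 1) // 2 + (K - 1)


for p in (2, 10, 50):
    print(f"p={p:>2}, K=2: LDA has {n_params_lda(p, 2)} parameters, QDA has {n_params_qda(p, 2)}")
```

The gap grows roughly like $(K - 1)\,p(p + 1)/2$, which is why QDA needs more data as the number of features increases.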

---

**Problem 4:** Dividend Prediction

Suppose that we wish to predict whether a given stock will issue a dividend this year ("Yes" or "No") based on $X$, last year's percent profit. We examine a large number of companies and discover that:
- The mean value of $X$ for companies that issued a dividend was $\bar{X} = 10$
- The mean for those that didn't was $\bar{X} = 0$
- The variance of $X$ for both groups was $\sigma^2 = 36$
- 80% of companies issued dividends

Assuming that $X$ follows a normal distribution, compute the probability that a company will issue a dividend this year given that its percentage profit was $X = 4$ last year. Store your answer (as a decimal between 0 and 1) in a variable called `prob_dividend`.

**Hint:** Use Bayes' theorem: $P(D|X) = \frac{P(X|D)P(D)}{P(X)}$

In [None]:
# Compute P(D|X=4) using Bayes' theorem
# BEGIN SOLUTION
# Parameters
mu_dividend = 10  # Mean for companies that issued dividend
mu_no_dividend = 0  # Mean for companies that did not
sigma = np.sqrt(36)  # Standard deviation (same for both)
prior_dividend = 0.8  # P(D)
prior_no_dividend = 0.2  # P(not D)
x_observed = 4  # Observed profit

# P(X=4 | D) and P(X=4 | not D) using Gaussian pdf
p_x_given_d = norm.pdf(x_observed, loc=mu_dividend, scale=sigma)
p_x_given_not_d = norm.pdf(x_observed, loc=mu_no_dividend, scale=sigma)

# P(X=4) using law of total probability
p_x = p_x_given_d * prior_dividend + p_x_given_not_d * prior_no_dividend

# P(D | X=4) using Bayes' theorem
prob_dividend = (p_x_given_d * prior_dividend) / p_x
# END SOLUTION

print(f"Probability of dividend given X=4: {prob_dividend:.4f}")

In [None]:
# Test assertions
assert 0.75 < prob_dividend < 0.76, f"Expected ~0.7519, got {prob_dividend}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(prob_dividend - 0.7518524532975261) < 1e-6
), "prob_dividend should be approximately 0.7519"
# END HIDDEN TESTS
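
The same posterior can be obtained from the discriminant functions of Problem 1: with two classes, $P(D \mid X = x)$ is the logistic function of $\delta_{\text{yes}}(x) - \delta_{\text{no}}(x)$, because the terms dropped from each $\delta_k$ are common to both classes. A minimal sketch of that cross-check follows; the helper name `lda_discriminant` is an assumption for this example.

```python
import math


def lda_discriminant(x, mu, sigma_sq, prior):
    """One-dimensional LDA discriminant delta_k(x) from Problem 1."""
    return x * mu / sigma_sq - mu**2 / (2 * sigma_sq) + math.log(prior)


# Values from the problem statement: means 10 and 0, variance 36, prior 0.8 for "Yes".
delta_yes = lda_discriminant(4, mu=10, sigma_sq=36, prior=0.8)
delta_no = lda_discriminant(4, mu=0, sigma_sq=36, prior=0.2)

# Posterior for "Yes" is the logistic function of the discriminant difference.
prob_from_discriminants = 1 / (1 + math.exp(-(delta_yes - delta_no)))
print(f"P(D | X=4) via discriminants: {prob_from_discriminants:.4f}")
```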

## LDA Decision Boundary

Suppose you have one continuous predictor $X$ and a binary categorical response $Y$, which can take values 1 or 2. Suppose you collected training data from the two classes and obtained class-specific sample means $\hat{\mu}_1 = -1$ and $\hat{\mu}_2 = 3$, along with the pooled variance estimate over the two classes, $\hat{\sigma}^2 = 1$.

---

**Problem 5:** Equal Priors

Assume equal class priors and derive the LDA classification rule for this problem. Using `scipy.stats.norm.pdf` to compute the necessary probability density functions, show both of the estimated class-conditional densities in the same plot. Also show the estimated Bayes decision boundary in this plot. Make sure you label the axes.

Let $c$ denote the position of the decision boundary. Store the numerical value of $c$ in a variable called `boundary_c`.

In [None]:
# Plot class-conditional densities and compute decision boundary
# BEGIN SOLUTION
# Parameters
mu1, mu2 = -1, 3
sigma = 1

# Generate x values for plotting
x_vals = np.linspace(-5, 7, 1000)

# Class-conditional densities
pdf_class1 = norm.pdf(x_vals, mu1, sigma)
pdf_class2 = norm.pdf(x_vals, mu2, sigma)

# Decision boundary with equal priors: midpoint between means
boundary_c = (mu1 + mu2) / 2

# Plot
plt.figure(figsize=(8, 5))
plt.plot(x_vals, pdf_class1, label="Class 1 ($Y=1$)")
plt.plot(x_vals, pdf_class2, label="Class 2 ($Y=2$)")
plt.axvline(
    x=boundary_c, color="red", linestyle="--", label=f"Decision Boundary ($c={boundary_c}$)"
)
plt.title("Class-Conditional Densities and Decision Boundary")
plt.xlabel("$X$")
plt.ylabel("Density")
plt.legend()
plt.show()
# END SOLUTION

print(f"Decision boundary position: c = {boundary_c}")

In [None]:
# Test assertions
assert boundary_c == 1.0, f"Expected boundary at 1.0, got {boundary_c}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert boundary_c == (-1 + 3) / 2, "boundary_c should be the midpoint of mu1 and mu2"
# END HIDDEN TESTS
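
With equal priors the boundary is the point where the two class-conditional densities are equal. As a cross-check on the midpoint formula, this sketch finds that point numerically with `scipy.optimize.brentq`; the function name `density_gap` is an assumption for this example.

```python
from scipy.optimize import brentq
from scipy.stats import norm

mu1, mu2, sigma = -1.0, 3.0, 1.0


def density_gap(x):
    """Difference between the class 1 and class 2 densities at x."""
    return norm.pdf(x, loc=mu1, scale=sigma) - norm.pdf(x, loc=mu2, scale=sigma)


# The gap is positive at mu1 and negative at mu2, so a root lies between the means.
numeric_boundary = brentq(density_gap, mu1, mu2)
print(f"Numerically located boundary: {numeric_boundary:.6f}")
```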

---

**Problem 6:** Unequal Priors (Conceptual)

Suppose the estimates were obtained from 100 training points, 40 from class 1 and 60 from class 2, and that you now estimate the class priors from the data.

Without calculating the new boundary $\hat{c}$, would you expect it to be the same as, less than, or greater than $c$? Explain your reasoning.

> BEGIN SOLUTION

The new boundary $\hat{c}$ will be less than $c$.

Class 2 has the higher prior probability (0.6 vs. 0.4), so the decision boundary shifts toward the mean of the class with the lower prior (class 1). This enlarges the region assigned to class 2, reflecting that class 2 observations are more likely a priori.

Mathematically, the LDA decision boundary with priors is:
$$\hat{c} = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_2 - \mu_1}\log\left(\frac{\pi_1}{\pi_2}\right)$$

Since $\pi_1 < \pi_2$ the log term is negative, and since $\mu_2 > \mu_1$ its coefficient is positive, so $\hat{c}$ lies to the left of the equal-prior boundary $c = \frac{\mu_1 + \mu_2}{2}$.
> END SOLUTION
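
For reference, the boundary formula in the solution above follows from setting the two discriminant functions of Problem 1 equal. A short derivation, assuming $\mu_1 \neq \mu_2$ and writing $\hat{c}$ for the boundary:

$$\begin{aligned}
\hat{c}\,\frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(\pi_1) &= \hat{c}\,\frac{\mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(\pi_2) \\
\hat{c}\,\frac{\mu_2 - \mu_1}{\sigma^2} &= \frac{\mu_2^2 - \mu_1^2}{2\sigma^2} + \log\left(\frac{\pi_1}{\pi_2}\right) \\
\hat{c} &= \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2}{\mu_2 - \mu_1}\log\left(\frac{\pi_1}{\pi_2}\right)
\end{aligned}$$

The last step uses $\mu_2^2 - \mu_1^2 = (\mu_2 - \mu_1)(\mu_2 + \mu_1)$.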


---

**Problem 7:** Calculate Boundary with Unequal Priors

Now calculate the new boundary value $\hat{c}$ using the priors estimated from the data (40 from class 1, 60 from class 2). Store your answer in a variable called `boundary_c_hat`.

In [None]:
# Calculate boundary with unequal priors
# BEGIN SOLUTION
mu1, mu2 = -1, 3
sigma_sq = 1
prior_1, prior_2 = 0.4, 0.6

# LDA decision boundary formula with priors
boundary_c_hat = (mu1 + mu2) / 2 + (np.log(prior_1 / prior_2) * sigma_sq) / (mu2 - mu1)
# END SOLUTION

print(f"New decision boundary: c_hat = {boundary_c_hat:.4f}")

In [None]:
# Test assertions
assert 0.89 < boundary_c_hat < 0.91, f"Expected ~0.899, got {boundary_c_hat}"
assert boundary_c_hat < 1.0, "boundary_c_hat should be less than 1.0"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(boundary_c_hat - 0.898633722972959) < 1e-6
), "boundary_c_hat should be approximately 0.8986"
# END HIDDEN TESTS
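
To see how the boundary responds to the prior more generally, the sketch below evaluates the same formula over a range of class 1 priors. The function name `lda_boundary` and the grid of priors are assumptions for illustration; the formula assumes two classes with $\mu_2 > \mu_1$.

```python
import numpy as np


def lda_boundary(mu1, mu2, sigma_sq, prior_1):
    """Two-class, one-feature LDA boundary; prior_2 is taken as 1 - prior_1."""
    prior_2 = 1.0 - prior_1
    return (mu1 + mu2) / 2 + sigma_sq * np.log(prior_1 / prior_2) / (mu2 - mu1)


# Sweep the class 1 prior from 0.1 to 0.9 (hypothetical grid for illustration).
prior_grid = np.linspace(0.1, 0.9, 9)
boundaries = lda_boundary(-1, 3, 1, prior_grid)

for prior_1, boundary in zip(prior_grid, boundaries):
    print(f"prior_1 = {prior_1:.1f} -> boundary = {boundary:.4f}")
```

The boundary increases with the class 1 prior: the more likely class 1 is a priori, the larger the region of $x$ values assigned to it.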

---

**Problem 8:** LDA vs QDA Recommendation

Suppose that, in addition to the pooled variance estimate $\hat{\sigma}^2 = 1$, you are told that the class-specific variances were estimated as $\hat{\sigma}_1^2 = 0.25$ and $\hat{\sigma}_2^2 = 1.5$.

Based on this new information, would you recommend using LDA or QDA, and why?

> BEGIN SOLUTION

QDA should be used because the class-specific variances are substantially different ($\hat{\sigma}_1^2 = 0.25$ vs $\hat{\sigma}_2^2 = 1.5$). LDA assumes equal covariances across classes, which is violated here. QDA can model these different variances and will produce a quadratic decision boundary that better captures the true class structure.
> END SOLUTION


---

**Problem 9:** QDA Decision Boundary

Derive the QDA rule if $\hat{\sigma}_1^2 = 0.25$ and $\hat{\sigma}_2^2 = 1.5$, $\hat{\mu}_1 = -1$ and $\hat{\mu}_2 = 3$, assuming equal class priors. Calculate the numerical values for the decision boundary. Store the two boundary values in a list called `qda_boundaries` (sorted from smallest to largest).

In [None]:
# Calculate QDA decision boundary
# BEGIN SOLUTION
# With equal priors, the decision boundary is where the discriminant functions are equal
# delta_1(x) = delta_2(x), which means N(x; mu1, sigma1^2) = N(x; mu2, sigma2^2)

x = symbols("x")
sigma1_sq = 0.25
sigma2_sq = 1.5
mu1, mu2 = -1, 3

# Log of each class density; the equal log-priors cancel and are omitted
delta1 = -0.5 * np.log(2 * np.pi * sigma1_sq) - ((x - mu1) ** 2) / (2 * sigma1_sq)
delta2 = -0.5 * np.log(2 * np.pi * sigma2_sq) - ((x - mu2) ** 2) / (2 * sigma2_sq)

# Solve delta1 = delta2
boundary_solutions = solve(delta1 - delta2, x)
qda_boundaries = sorted([float(sol) for sol in boundary_solutions])
# END SOLUTION

print(f"QDA decision boundaries: {qda_boundaries}")

In [None]:
# Test assertions
assert len(qda_boundaries) == 2, "Should have two boundary points"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert qda_boundaries[0] < 0, "First boundary should be negative"
assert qda_boundaries[1] > 0, "Second boundary should be positive"
assert abs(qda_boundaries[0] - (-3.892254248596096)) < 1e-4, "First boundary incorrect"
assert abs(qda_boundaries[1] - 0.29225424859609667) < 1e-4, "Second boundary incorrect"
# END HIDDEN TESTS
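
The boundaries can also be obtained without symbolic algebra. Expanding $\delta_1(x) - \delta_2(x) = 0$ gives a quadratic $a x^2 + b x + c = 0$, which `np.roots` solves directly. This is an alternative sketch under the same equal-prior assumption, not the method used in the solution cell.

```python
import numpy as np

mu1, var1 = -1.0, 0.25
mu2, var2 = 3.0, 1.5

# Coefficients of delta_1(x) - delta_2(x) = a*x^2 + b*x + c (equal priors cancel).
a = -1 / (2 * var1) + 1 / (2 * var2)
b = mu1 / var1 - mu2 / var2
c = -(mu1**2) / (2 * var1) + mu2**2 / (2 * var2) + 0.5 * np.log(var2 / var1)

# Both roots are real here; .real drops any zero imaginary part before sorting.
qda_roots = np.sort(np.roots([a, b, c]).real)
print(f"QDA boundaries from the quadratic: {qda_roots}")
```

Because class 1 has the smaller variance, it is predicted for $x$ between the two roots, and class 2 is predicted outside that interval.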

## Stock Market Prediction

The `Smarket` dataset consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days (`Lag1` through `Lag5`), the trading `Volume` (in billions of shares) for the previous trading day, and the return and direction (Up or Down) of the market on the date in question (`Today` and `Direction`).

In [None]:
# Load the Smarket data
smarket = pd.read_csv("./data/Smarket.csv")
smarket.head()

---

**Problem 10:** Exploratory Data Analysis

Produce some numerical and graphical summaries of the `smarket` data. Do there appear to be any patterns? Create at least one visualization (e.g., boxplots comparing Up vs Down days).

In [None]:
# Exploratory analysis of smarket data
# BEGIN SOLUTION
# Create side-by-side boxplots for each feature by Direction
fig, axes = plt.subplots(2, 4, figsize=(14, 8))
features = ["Lag1", "Lag2", "Lag3", "Lag4", "Lag5", "Volume", "Today"]

for i, nm in enumerate(features):
    ax = axes[i // 4, i % 4]
    up_values = smarket.loc[smarket["Direction"] == "Up", nm]
    down_values = smarket.loc[smarket["Direction"] == "Down", nm]
    ax.boxplot([up_values, down_values])
    ax.set_ylabel(nm)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["Up", "Down"])

axes[1, 3].axis("off")  # Hide empty subplot
plt.tight_layout()
plt.show()

print()
print("Interpretation:")
print("Based on the boxplots, there is very little visible difference between Up and Down")
print("days for the lag variables and volume. Only the Today variable shows a clear")
print("difference (which is expected since Direction is derived from Today). This suggests")
print("that the lag variables may not be strong predictors of Direction.")
# END SOLUTION

In [None]:
# Test assertions
assert len(plt.get_fignums()) > 0, "Expected at least one plot to be created"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# EDA problem - graded on effort and interpretation
# END HIDDEN TESTS
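
The problem also asks for numerical summaries. A minimal sketch of two common ones, class counts and per-class means, is shown below on a tiny made-up frame; `toy` and its values are hypothetical stand-ins so the example runs without the data file. The same calls would apply to `smarket`.

```python
import pandas as pd

# Hypothetical stand-in for the Smarket data; values are invented for illustration.
toy = pd.DataFrame(
    {
        "Lag1": [0.4, -0.2, 1.0, -0.6],
        "Volume": [1.2, 1.3, 1.1, 1.4],
        "Direction": ["Up", "Down", "Up", "Down"],
    }
)

# How many days fall in each class.
class_counts = toy["Direction"].value_counts()

# Mean of each numeric feature within each class.
group_means = toy.groupby("Direction")[["Lag1", "Volume"]].mean()

print(class_counts)
print(group_means)
```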

---

**Problem 11:** LDA on Stock Market Data

Fit an LDA model using training data from 2001 to 2004, with `Direction` as the response and `Lag1` and `Lag2` as predictors. Use the model to predict on the held-out test data (2005). Store the test accuracy in a variable called `lda_accuracy`.

In [None]:
# Fit LDA model and evaluate on test set
# BEGIN SOLUTION
# Split data by year
train_smarket = smarket.loc[smarket["Year"] <= 2004]
test_smarket = smarket.loc[smarket["Year"] > 2004]

# Prepare features and labels
x_train_sm = train_smarket[["Lag1", "Lag2"]]
y_train_sm = train_smarket["Direction"]
x_test_sm = test_smarket[["Lag1", "Lag2"]]
y_test_sm = test_smarket["Direction"]

# Fit LDA
lda_smarket = LinearDiscriminantAnalysis().fit(x_train_sm, y_train_sm)
y_pred_lda = lda_smarket.predict(x_test_sm)

# Compute accuracy
lda_accuracy = accuracy_score(y_test_sm, y_pred_lda)
# END SOLUTION

print(f"LDA Test Accuracy: {lda_accuracy:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_sm, y_pred_lda))

In [None]:
# Test assertions
assert 0.5 < lda_accuracy < 0.6, f"Expected accuracy around 0.56, got {lda_accuracy}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(lda_accuracy - 0.5595238095238095) < 1e-6, "LDA accuracy should be approximately 0.5595"
# END HIDDEN TESTS
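
An accuracy only slightly above 0.5 is easiest to judge against a majority-class baseline, i.e. the accuracy of always predicting the most frequent class. The sketch below shows how accuracy relates to the confusion matrix and how such a baseline can be computed; the label arrays are hypothetical and are not the Smarket predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array(["Up", "Down", "Up", "Up", "Down", "Up"])
y_pred = np.array(["Up", "Up", "Up", "Down", "Down", "Up"])

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=["Down", "Up"])

# Accuracy is the share of observations on the diagonal.
accuracy = np.trace(cm) / cm.sum()

# Baseline: always predict the most frequent true class.
_, counts = np.unique(y_true, return_counts=True)
baseline_accuracy = counts.max() / counts.sum()

print(cm)
print(f"accuracy = {accuracy:.4f}, majority-class baseline = {baseline_accuracy:.4f}")
```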

---

**Problem 12:** QDA on Stock Market Data

Repeat the analysis using QDA instead of LDA. Store the test accuracy in a variable called `qda_accuracy`.

In [None]:
# Fit QDA model and evaluate on test set
# BEGIN SOLUTION
qda_smarket = QuadraticDiscriminantAnalysis().fit(x_train_sm, y_train_sm)
y_pred_qda = qda_smarket.predict(x_test_sm)
qda_accuracy = accuracy_score(y_test_sm, y_pred_qda)
# END SOLUTION

print(f"QDA Test Accuracy: {qda_accuracy:.4f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_sm, y_pred_qda))

In [None]:
# Test assertions
assert 0.59 < qda_accuracy < 0.61, f"Expected accuracy around 0.60, got {qda_accuracy}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert abs(qda_accuracy - 0.5992063492063492) < 1e-6, "QDA accuracy should be approximately 0.5992"
# END HIDDEN TESTS

## Auto MPG Classification

In this problem, you will develop a model to predict whether a given car will be classified as having high or low gas mileage based on the Auto dataset.

In [None]:
# Load the Auto data
auto_df = pd.read_csv("./data/auto_nonan.csv", index_col=0).set_index("name")
auto_df.head()

---

**Problem 13:** Create Binary Variable

Create a binary variable `mpg01` that equals 1 if the value of `mpg` for that car is above 25, and 0 otherwise. Add this variable as a new column to your data frame.

In [None]:
# Create mpg01 column
# BEGIN SOLUTION
auto_df["mpg01"] = np.where(auto_df["mpg"] > 25, 1, 0)
# END SOLUTION

auto_df[["mpg", "mpg01"]].head(10)

In [None]:
# Test assertions
assert "mpg01" in auto_df.columns, "mpg01 column not found"
assert auto_df["mpg01"].isin([0, 1]).all(), "mpg01 should only contain 0 and 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert auto_df["mpg01"].sum() == 156, "Number of cars with mpg > 25 should be 156"
assert (auto_df["mpg01"] == 0).sum() == 236, "Number of cars with mpg <= 25 should be 236"
# END HIDDEN TESTS

---

**Problem 14:** Exploratory Analysis

Make some exploratory plots to investigate the association between `mpg01` and other variables. Besides `mpg` itself, which four quantitative features do you think are most likely to be useful in predicting `mpg01`? Defend your argument with plots (e.g., side-by-side boxplots).

In [None]:
# Exploratory analysis
# BEGIN SOLUTION
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
features_to_plot = ["displacement", "horsepower", "weight", "acceleration", "year", "cylinders"]

for i, nm in enumerate(features_to_plot):
    ax = axes[i // 3, i % 3]
    vals0 = auto_df.loc[auto_df["mpg01"] == 0, nm]
    vals1 = auto_df.loc[auto_df["mpg01"] == 1, nm]
    ax.boxplot([vals0, vals1])
    ax.set_ylabel(nm)
    ax.set_xticks([1, 2])
    ax.set_xticklabels(["mpg01=0", "mpg01=1"])

plt.tight_layout()
plt.show()

print()
print("Based on the boxplots, the four most useful features for predicting mpg01 are:")
print("1. cylinders - Clear separation between low/high mpg cars")
print("2. displacement - Strong separation, low mpg cars have higher displacement")
print("3. horsepower - Clear difference, high mpg cars have lower horsepower")
print("4. weight - Strong predictor, lighter cars tend to have better mpg")
print()
print("These features show the clearest separation between the two groups.")
# END SOLUTION

In [None]:
# Test assertions
assert len(plt.get_fignums()) > 0, "Expected at least one plot to be created"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# EDA problem - graded on effort and interpretation
# END HIDDEN TESTS

---

**Problem 15:** Train/Test Split

Split the data into training and test sets using `train_test_split` with:
- `random_state=123`
- `train_size=0.8`
- `stratify` set to the values of `mpg01`

Store the number of training samples where `mpg01` is 1 in a variable called `n_mpg01_train`.

In [None]:
# Split data into train and test sets
# BEGIN SOLUTION
# Use the four selected features
features = ["cylinders", "displacement", "horsepower", "weight"]
x_auto = auto_df[features]
y_auto = auto_df["mpg01"]

x_train_auto, x_test_auto, y_train_auto, y_test_auto = train_test_split(
    x_auto, y_auto, train_size=0.8, random_state=123, stratify=y_auto
)

n_mpg01_train = (y_train_auto == 1).sum()
# END SOLUTION

print(f"Training samples with mpg01=1: {n_mpg01_train}")

In [None]:
# Test assertions
assert n_mpg01_train == 125, f"Expected 125, got {n_mpg01_train}"
assert len(x_train_auto) == 313, "Training set should have 313 samples"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(x_test_auto) == 79, "Test set should have 79 samples"
# END HIDDEN TESTS
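
The effect of `stratify` can be seen on synthetic labels with the same class counts as `mpg01` (156 ones and 236 zeros). In this sketch the feature matrix `X_demo` is a hypothetical placeholder; only the labels matter for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as mpg01; features are placeholders.
y_demo = np.array([1] * 156 + [0] * 236)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=123, stratify=y_demo
)

# Stratification keeps the share of ones nearly identical in every subset.
overall_rate = y_demo.mean()
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"share of ones: overall={overall_rate:.4f}, train={train_rate:.4f}, test={test_rate:.4f}")
```

Without `stratify`, the class shares in the two subsets would vary from one random split to another.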

---

**Problem 16:** LDA on Auto Data

Fit an LDA model on the training data using the four selected features. Report the misclassification rate on both training and test data. Store the test misclassification rate in a variable called `lda_misclass_test`.

In [None]:
# Fit LDA model
# BEGIN SOLUTION
lda_auto = LinearDiscriminantAnalysis().fit(x_train_auto, y_train_auto)
y_train_pred_lda = lda_auto.predict(x_train_auto)
y_test_pred_lda = lda_auto.predict(x_test_auto)

lda_misclass_train = 1 - accuracy_score(y_train_auto, y_train_pred_lda)
lda_misclass_test = 1 - accuracy_score(y_test_auto, y_test_pred_lda)
# END SOLUTION

print(f"LDA Training Misclassification Rate: {lda_misclass_train:.4f}")
print(f"LDA Test Misclassification Rate: {lda_misclass_test:.4f}")

In [None]:
# Test assertions
assert 0.1 < lda_misclass_test < 0.25, f"Expected ~0.20, got {lda_misclass_test}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(lda_misclass_test - 0.20253164556962022) < 1e-6
), "LDA test misclass rate should be ~0.2025"
# END HIDDEN TESTS

---

**Problem 17:** QDA on Auto Data

Repeat the analysis using QDA. Store the test misclassification rate in a variable called `qda_misclass_test`.

In [None]:
# Fit QDA model
# BEGIN SOLUTION
qda_auto = QuadraticDiscriminantAnalysis().fit(x_train_auto, y_train_auto)
y_train_pred_qda = qda_auto.predict(x_train_auto)
y_test_pred_qda = qda_auto.predict(x_test_auto)

qda_misclass_train = 1 - accuracy_score(y_train_auto, y_train_pred_qda)
qda_misclass_test = 1 - accuracy_score(y_test_auto, y_test_pred_qda)
# END SOLUTION

print(f"QDA Training Misclassification Rate: {qda_misclass_train:.4f}")
print(f"QDA Test Misclassification Rate: {qda_misclass_test:.4f}")

In [None]:
# Test assertions
assert 0.15 < qda_misclass_test < 0.2, f"Expected ~0.18, got {qda_misclass_test}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(qda_misclass_test - 0.17721518987341767) < 1e-6
), "QDA test misclass rate should be ~0.1772"
# END HIDDEN TESTS

---

**Problem 18:** Compare LDA and QDA

Compare and contrast the performance of LDA and QDA. What do your results suggest about the class-specific covariances?

> BEGIN SOLUTION

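QDA performs better than LDA on both training and test data (lower misclassification rates; on the test set about 17.7% vs. 20.3%, i.e. 14 vs. 16 errors out of 79 cars). This suggests that the class-specific covariances are not equal between the two groups (high vs. low mpg cars), so the equal-covariance assumption of LDA appears to be violated and QDA's more flexible model fits this data better. The test-set gap is only two observations, so this evidence is suggestive rather than conclusive.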
> END SOLUTION
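
A more direct way to examine the equal-covariance assumption is to estimate the covariance matrix within each class and compare them. The sketch below does this on synthetic data; the frame `toy`, its column names, and the simulated spreads are all hypothetical and are not the Auto data.

```python
import numpy as np
import pandas as pd

# Synthetic two-class data: class 1 is simulated with three times the spread of class 0.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "x1": np.concatenate([rng.normal(0, 1.0, 200), rng.normal(0, 3.0, 200)]),
        "x2": np.concatenate([rng.normal(0, 1.0, 200), rng.normal(0, 3.0, 200)]),
        "label": [0] * 200 + [1] * 200,
    }
)

# One covariance matrix per class, stacked with a (label, feature) row index.
class_covs = toy.groupby("label")[["x1", "x2"]].cov()

# Ratio of the class-specific variances of x1.
var_ratio = class_covs.loc[(1, "x1"), "x1"] / class_covs.loc[(0, "x1"), "x1"]
print(class_covs)
print(f"variance ratio for x1 (class 1 / class 0): {var_ratio:.2f}")
```

Large differences between the per-class matrices would support using QDA; similar matrices would favor the simpler LDA model.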


---

**Problem 19:** Logistic Regression

Fit a logistic regression model on the training data. Store the test misclassification rate in a variable called `logistic_misclass_test`.

In [None]:
# Fit logistic regression model
# BEGIN SOLUTION
logistic_auto = LogisticRegression(max_iter=1000).fit(x_train_auto, y_train_auto)
y_train_pred_log = logistic_auto.predict(x_train_auto)
y_test_pred_log = logistic_auto.predict(x_test_auto)

logistic_misclass_train = 1 - accuracy_score(y_train_auto, y_train_pred_log)
logistic_misclass_test = 1 - accuracy_score(y_test_auto, y_test_pred_log)
# END SOLUTION

print(f"Logistic Training Misclassification Rate: {logistic_misclass_train:.4f}")
print(f"Logistic Test Misclassification Rate: {logistic_misclass_test:.4f}")

In [None]:
# Test assertions
assert 0.1 < logistic_misclass_test < 0.2, f"Expected ~0.15, got {logistic_misclass_test}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(logistic_misclass_test - 0.15189873417721522) < 1e-6
), "Logistic test misclass should be ~0.1519"
# END HIDDEN TESTS

---

**Problem 20:** K-Nearest Neighbors

Fit KNN models for various values of K (from 1 to 35). Plot the training and test classification error vs K. Store the best K (based on test error) in a variable called `best_k`.

In [None]:
# Fit KNN models for various K
# BEGIN SOLUTION
k_values = range(1, 36)
train_errors_knn = []
test_errors_knn = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train_auto, y_train_auto)

    y_train_pred_knn = knn.predict(x_train_auto)
    y_test_pred_knn = knn.predict(x_test_auto)

    train_errors_knn.append(1 - accuracy_score(y_train_auto, y_train_pred_knn))
    test_errors_knn.append(1 - accuracy_score(y_test_auto, y_test_pred_knn))

best_k = k_values[np.argmin(test_errors_knn)]

# Plot
plt.figure(figsize=(10, 6))
plt.plot(k_values, train_errors_knn, label="Training Error")
plt.plot(k_values, test_errors_knn, label="Test Error")
plt.xlabel("Number of Neighbors K")
plt.ylabel("Classification Error")
plt.title("KNN Classification Error vs. Number of Neighbors")
plt.legend()
# Reverse the x-axis so that model flexibility (smaller K) increases to the right
plt.gca().invert_xaxis()
plt.show()
# END SOLUTION

print(f"Best K based on test error: {best_k}")

In [None]:
# Test assertions
assert 1 <= best_k <= 35, "best_k should be between 1 and 35"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert best_k == 23, f"Best K should be 23, got {best_k}"
# END HIDDEN TESTS
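
One detail of selecting K with `np.argmin`: when several values of K share the minimum test error, `np.argmin` returns the first one, so the smallest such K is chosen. The sketch below illustrates this with made-up error values.

```python
import numpy as np

# Hypothetical test errors for K = 1, 2, 3, 4; K = 2 and K = 4 tie for the minimum.
demo_k_values = range(1, 5)
demo_test_errors = [0.30, 0.20, 0.25, 0.20]

# np.argmin returns the index of the first minimum.
first_best_k = demo_k_values[int(np.argmin(demo_test_errors))]

# All K values that attain the minimum error.
min_error = min(demo_test_errors)
all_best_k = [k for k, err in zip(demo_k_values, demo_test_errors) if np.isclose(err, min_error)]

print(f"argmin picks K = {first_best_k}; all minimizers: {all_best_k}")
```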

---

**Problem 21:** Model Comparison Summary

Compare the test performance of all methods (LDA, QDA, Logistic Regression, KNN with best K). Which method performs best on this dataset?

> BEGIN SOLUTION

Test misclassification rates:
- LDA: ~20.3%
- QDA: ~17.7%
- Logistic Regression: ~15.2%
- KNN (K=23): ~16.5%

Logistic regression performs best on this dataset, followed closely by KNN and QDA. LDA performs worst, which is consistent with our earlier observation that the class-specific covariances are unequal. Logistic regression provides a good balance between flexibility and simplicity for this classification task.
> END SOLUTION
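
The comparison can be tabulated and converted back to error counts on the 79-car test set. The sketch below uses the rounded rates listed above; the KNN rate is taken as the approximate 16.5% reported there, so its count is approximate as well.

```python
import pandas as pd

n_test = 79  # size of the test set from Problem 15

# Rounded test misclassification rates from the summary above.
rates = pd.Series(
    {
        "LDA": 0.2025,
        "QDA": 0.1772,
        "Logistic Regression": 0.1519,
        "KNN (K=23)": 0.165,
    }
)

summary = pd.DataFrame(
    {
        "test_error": rates,
        "approx_errors": (rates * n_test).round().astype(int),
    }
).sort_values("test_error")

best_method = summary.index[0]
print(summary)
print(f"Lowest test error: {best_method}")
```

The methods differ by only a handful of test observations, so the ranking could change under a different random split.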
