# Chapter 4 Exercises: Classification

Solutions to selected exercises from Chapter 4 of *An Introduction to Statistical Learning with Applications in Python*. The exercises cover three classification topics: K-nearest neighbors (KNN), odds, and linear discriminant analysis (LDA).

## Exercise 4: The Curse of Dimensionality

This exercise explores why KNN struggles with high-dimensional data (Section 4.5.2, p. 164).

### (a) p = 1

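For one feature uniformly distributed on [0, 1], a prediction uses only the observations within 10% of the range around the test point, an interval of width 0.1. Ignoring edge effects near 0 and 1, the fraction of observations used is: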

\[ \text{Fraction} = \frac{0.1}{1} = 0.1 \]



**Answer**: 10% of observations.

### (b) p = 2

For two features in [0, 1] × [0, 1], the fraction is:

\[ \text{Fraction} = 0.1 \times 0.1 = 0.01 \]

**Answer**: 1% of observations.

### (c) p = 100

For 100 features, the fraction is:

\[ \text{Fraction} = 0.1^{100} = 10^{-100} \]

**Answer**: A negligible fraction, $10^{-100}$.
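The fractions in parts (a) to (c) can be checked by simulation. The following is a minimal sketch, assuming a test point at the center of the unit hypercube and a window of half-width 0.05 in each coordinate. The function name, sample size, and seed are illustrative choices, not part of the exercise.

```python
import random

def fraction_in_neighborhood(p, n=100_000, half_width=0.05, seed=0):
    """Estimate the share of uniform points whose every coordinate lies
    within half_width of a test point at the center of [0, 1]^p."""
    rng = random.Random(seed)
    center = 0.5
    hits = 0
    for _ in range(n):
        # A point counts only if all p coordinates fall inside the window.
        if all(abs(rng.random() - center) <= half_width for _ in range(p)):
            hits += 1
    return hits / n

# Compare simulated and theoretical fractions for small p.
estimates = {p: fraction_in_neighborhood(p) for p in (1, 2, 3)}
for p, est in estimates.items():
    print(f"p = {p}: simulated {est:.4f}, theoretical {0.1 ** p:.4f}")
```

For $p = 100$ the theoretical fraction is so small that a simulation of any practical size would almost surely find no points in the neighborhood at all.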

### (d) Drawback of KNN with Large p

As $p$ increases, the fraction of observations falling in the neighborhood, $0.1^p$, shrinks exponentially, so few or no training observations lie near a given test point and predictions become unreliable (the curse of dimensionality, pp. 115, 193, 266).

**Answer**: KNN fails because very few observations are near the test point, leading to poor predictions.
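One way to see this sparsity is to measure how far the single nearest training point lies from a test point as $p$ grows. The sketch below assumes 500 uniformly drawn training points and a test point at the center of the hypercube. Both choices, and the function name, are illustrative.

```python
import math
import random

def nearest_neighbor_distance(p, n=500, seed=0):
    """Euclidean distance from the center of [0, 1]^p to the closest of
    n uniformly drawn training points."""
    rng = random.Random(seed)
    center = [0.5] * p
    best = math.inf
    for _ in range(n):
        point = [rng.random() for _ in range(p)]
        best = min(best, math.dist(center, point))
    return best

# The nearest neighbor drifts away from the test point as p increases.
nn_distances = {p: nearest_neighbor_distance(p) for p in (1, 2, 10, 100)}
for p, d in nn_distances.items():
    print(f"p = {p:>3}: nearest neighbor at distance {d:.3f}")
```

With the sample size held fixed, even the closest point ends up far from the test point in high dimensions, so its response value says little about the test observation.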

### (e) Hypercube Side Length for 10% of Observations

To include 10% of the observations on average, the hypercube must have volume 0.1. For a hypercube with side length $s$ in $p$ dimensions:

\[ s^p = 0.1 \]
\[ s = 0.1^{1/p} \]

Compute for different $p$:

In [None]:
# Side length s that solves s**p = 0.1, for several dimensions p
p_values = [1, 2, 100]
for p in p_values:
    s = 0.1 ** (1 / p)
    print(f'p = {p}: Side length = {s:.3f}')


**Answer**:
- $p = 1$: Side length = 0.1
- $p = 2$: Side length ≈ 0.316
- $p = 100$: Side length ≈ 0.977

As $p$ increases, the neighborhood must span nearly the entire range of every feature to capture 10% of the observations, so the "nearest" neighbors are no longer local.

## Exercise 9: Odds

Explores odds in the context of logistic regression (Section 4.3, pp. 138–145).

### (a) Fraction Defaulting with Odds of 0.37

\[ \text{Odds} = \frac{P}{1 - P} = 0.37 \]

Solve for $P$:

In [None]:
odds = 0.37
P = odds / (1 + odds)
print(f'Fraction defaulting: {P:.3f} or {P*100:.1f}%')

**Answer**: About 27% of people with odds of 0.37 will default.
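The expression `odds / (1 + odds)` used in the cell above follows from rearranging the definition of odds. The steps are:

\[ \frac{P}{1 - P} = 0.37 \;\Rightarrow\; P = 0.37\,(1 - P) \;\Rightarrow\; 1.37\,P = 0.37 \;\Rightarrow\; P = \frac{0.37}{1.37} \approx 0.270 \]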

### (b) Odds for 16% Default Probability

\[ P = 0.16 \]
\[ \text{Odds} = \frac{P}{1 - P} \]

In [None]:
P = 0.16
odds = P / (1 - P)
print(f'Odds of default: {odds:.4f}')

**Answer**: Odds = 0.1905.
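Parts (a) and (b) are inverse operations, which can be made explicit with a pair of helper functions. This is a small sketch. The function names and the input checks are illustrative additions, not part of the exercise.

```python
def prob_to_odds(p):
    """Convert a probability in [0, 1) to odds p / (1 - p)."""
    if not 0 <= p < 1:
        raise ValueError("probability must lie in [0, 1)")
    return p / (1 - p)

def odds_to_prob(odds):
    """Convert non-negative odds to a probability odds / (1 + odds)."""
    if odds < 0:
        raise ValueError("odds must be non-negative")
    return odds / (1 + odds)

# Round trip: converting to odds and back should recover the probability.
p_back = odds_to_prob(prob_to_odds(0.16))
print(f"{p_back:.2f}")
```

A probability of exactly 1 is excluded because the corresponding odds are infinite.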

## Exercise 14: Predicting Gas Mileage with Auto Dataset

Uses LDA to classify cars as having high or low gas mileage (Section 4.4, pp. 146–155).

### (a) Create Binary Variable mpg01

Create `mpg01`: 1 if `mpg` > median, 0 otherwise.

In [None]:
import pandas as pd
from ISLP import load_data

Auto = load_data('Auto')
median_mpg = Auto['mpg'].median()
Auto['mpg01'] = (Auto['mpg'] > median_mpg).astype(int)
print(Auto[['mpg', 'mpg01']].head())

**Answer**: `mpg01` created and added to the dataset.
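Because the comparison with the median is strict, any observation whose `mpg` equals the median is labeled 0, so the two classes need not be exactly balanced. The sketch below illustrates this on made-up values. The `toy` frame is hypothetical and not taken from the `Auto` data.

```python
import pandas as pd

# Toy mpg values (hypothetical, not taken from the Auto data).
toy = pd.DataFrame({"mpg": [15.0, 18.0, 22.0, 22.0, 30.0, 35.0]})
toy_median = toy["mpg"].median()

# Strict inequality: rows equal to the median are labeled 0.
toy["mpg01"] = (toy["mpg"] > toy_median).astype(int)

print(f"median = {toy_median}")
print(toy["mpg01"].value_counts().sort_index())
```

Running `value_counts()` on the real `mpg01` column is a quick way to confirm how balanced the two classes are before modeling.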

### (b) Graphical Exploration

Explore associations with `mpg01` using boxplots and scatterplots.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Boxplots
quant_vars = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year']
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
for i, var in enumerate(quant_vars):
    sns.boxplot(x='mpg01', y=var, data=Auto, ax=axes[i])
    axes[i].set_title(f'{var} vs mpg01')
plt.tight_layout()
plt.show()

# Scatterplot matrix
sns.pairplot(Auto, vars=['displacement', 'horsepower', 'weight', 'acceleration'], hue='mpg01')
plt.show()

**Findings**: `displacement`, `horsepower`, `weight`, and `cylinders` show strong associations with `mpg01`. Lower values correspond to `mpg01 = 1` (high mileage).

### (c) Split Data

Split into 70% training, 30% test.

In [None]:
from sklearn.model_selection import train_test_split

X = Auto[['displacement', 'horsepower', 'weight', 'cylinders']]
y = Auto['mpg01']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f'Training set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}')

**Answer**: Data split into training and test sets.
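A plain random split does not guarantee the same share of each class in the training and test sets. One option, shown here only as a sketch on synthetic stand-in data, is the `stratify` argument of `train_test_split`. The arrays `X_demo` and `y_demo` are assumptions for illustration and the exercise itself does not require stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 4 features, balanced labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.repeat([0, 1], 50)

# stratify=y_demo keeps class proportions approximately equal in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(f"Train share of class 1: {yd_train.mean():.2f}")
print(f"Test share of class 1:  {yd_test.mean():.2f}")
```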

### (d) Perform LDA and Compute Test Error

Use LDA to predict `mpg01` and compute test error.

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
y_pred = lda.predict(X_test)
test_error = 1 - accuracy_score(y_test, y_pred)
print(f'Test error: {test_error:.4f}')

**Answer**: The test error is approximately 0.12, meaning roughly 88% of test cars are classified correctly. The exact value depends on the split, which is fixed here by `random_state=42`.
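A single error rate hides which class is being misclassified. A confusion matrix breaks the errors down by true and predicted class. The sketch below uses synthetic two-class Gaussian data as a stand-in (an assumption, so that it runs without the `Auto` data). The same calls should apply to `y_test` and `y_pred` from the cell above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic two-class data (assumption): Gaussian classes with shifted means.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
X1 = rng.normal(loc=1.5, scale=1.0, size=(200, 4))
X_syn = np.vstack([X0, X1])
y_syn = np.repeat([0, 1], 200)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

lda_syn = LinearDiscriminantAnalysis().fit(Xs_train, ys_train)
cm = confusion_matrix(ys_test, lda_syn.predict(Xs_test))

# Rows are true classes, columns are predicted classes.
# Off-diagonal entries are the misclassified test observations.
syn_error = (cm[0, 1] + cm[1, 0]) / cm.sum()
print(cm)
print(f"Test error: {syn_error:.4f}")
```

The test error computed from the off-diagonal counts equals `1 - accuracy_score(...)`, so the matrix adds detail without changing the headline number.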