# DATASCI 503, Homework 2: K-Nearest Neighbors and Bias-Variance Tradeoff

This assignment covers **K-Nearest Neighbors (KNN)**, a non-parametric method for classification and regression, and the **bias-variance tradeoff**, which describes how model complexity affects prediction error.

## K-Nearest Neighbors

Consider the following dataset $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{6}$ where each $x^{(i)} \in \mathbb{R}^3$ and $y^{(i)} \in \{\text{Red}, \text{Green}\}$:

| $i$ | $x^{(i)}$ | $y^{(i)}$ |
|-----|-----------|----------|
| 1 | $(0, 3, 0)$ | Red |
| 2 | $(2, 0, 0)$ | Red |
| 3 | $(0, 1, 3)$ | Red |
| 4 | $(0, 1, 2)$ | Green |
| 5 | $(-1, 0, 1)$ | Green |
| 6 | $(1, 1, 2)$ | Green |

We want to classify a test point $x^{(te)} = (0, 0, 0)$ using K-nearest neighbors with squared Euclidean distance $d(a, b) = \sum_{j=1}^{3}(a_j - b_j)^2$.

---

**Problem 1a:** Compute Distances

Compute the squared Euclidean distance from the test point $x^{(te)} = (0, 0, 0)$ to each of the six training points.

> BEGIN SOLUTION

$d(x^{(te)}, x^{(1)}) = (0-0)^2 + (0-3)^2 + (0-0)^2 = 9$

$d(x^{(te)}, x^{(2)}) = (0-2)^2 + (0-0)^2 + (0-0)^2 = 4$

$d(x^{(te)}, x^{(3)}) = (0-0)^2 + (0-1)^2 + (0-3)^2 = 10$

$d(x^{(te)}, x^{(4)}) = (0-0)^2 + (0-1)^2 + (0-2)^2 = 5$

$d(x^{(te)}, x^{(5)}) = (0-(-1))^2 + (0-0)^2 + (0-1)^2 = 2$

$d(x^{(te)}, x^{(6)}) = (0-1)^2 + (0-1)^2 + (0-2)^2 = 6$
> END SOLUTION
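
The hand computation above can be cross-checked numerically. The following is a minimal sketch using NumPy; the names `train_points`, `test_point`, and `squared_distances` are illustrative and not part of the assignment.

```python
import numpy as np

# Training points from the table above (rows are x^(1), ..., x^(6)).
train_points = np.array([
    [0, 3, 0],
    [2, 0, 0],
    [0, 1, 3],
    [0, 1, 2],
    [-1, 0, 1],
    [1, 1, 2],
])
test_point = np.array([0, 0, 0])

# Squared Euclidean distance: sum of squared coordinate differences per row.
squared_distances = ((train_points - test_point) ** 2).sum(axis=1)
print(squared_distances.tolist())  # [9, 4, 10, 5, 2, 6]
```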

---

**Problem 1b:** K=3 Classification

Using your computed distances, what is the predicted class $\hat{y}^{(3)}(x^{(te)}; \mathcal{D})$ when $K = 3$? Explain your reasoning.

> BEGIN SOLUTION

For $K = 3$, $\hat{y}^{(3)}(x^{(te)}; \mathcal{D}) = \mathrm{Green}$. The three closest points are $x^{(5)}$ (distance 2, Green), $x^{(2)}$ (distance 4, Red), and $x^{(4)}$ (distance 5, Green), so Green wins the majority vote 2 to 1.
> END SOLUTION


---

**Problem 1c:** K=1 Classification

What is the predicted class $\hat{y}^{(1)}(x^{(te)}; \mathcal{D})$ when $K = 1$? Explain your reasoning.

> BEGIN SOLUTION

For $K = 1$, $\hat{y}^{(1)}(x^{(te)}; \mathcal{D}) = \mathrm{Green}$ as well: the single closest point is $x^{(5)}$ (distance 2, Green), so the prediction is that point's label.
> END SOLUTION
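
The majority votes in Problems 1b and 1c can be reproduced with a few lines of code. This is a sketch only: `knn_predict` is a hypothetical helper written for illustration, and it breaks distance ties by original index order, a choice the assignment does not specify.

```python
from collections import Counter

import numpy as np

train_points = np.array([
    [0, 3, 0],
    [2, 0, 0],
    [0, 1, 3],
    [0, 1, 2],
    [-1, 0, 1],
    [1, 1, 2],
])
train_labels = ["Red", "Red", "Red", "Green", "Green", "Green"]


def knn_predict(query, points, labels, k):
    """Return the majority label among the k nearest points (hypothetical helper)."""
    squared_distances = ((points - query) ** 2).sum(axis=1)
    # Stable sort so that ties in distance keep the original index order.
    nearest = np.argsort(squared_distances, kind="stable")[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


query = np.array([0, 0, 0])
pred_k3 = knn_predict(query, train_points, train_labels, k=3)
pred_k1 = knn_predict(query, train_points, train_labels, k=1)
print(pred_k3, pred_k1)  # Green Green
```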


---

**Problem 1d:** Comparing K Values

True or False: In a typical data-generating process where outliers and noise are present, $K = 3$ tends to give more consistent predictions than $K = 1$. Explain your reasoning.

> BEGIN SOLUTION

True.

With $K = 1$, the prediction copies the label of the single nearest neighbor, so one mislabeled or outlying training point is enough to change it. With $K = 3$, the prediction is a majority vote over three neighbors, which smooths out the effect of any single noisy point and makes the prediction vary less from one training set to another.
> END SOLUTION
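
The claim in Problem 1d can be illustrated with a small simulation. The setup below is an assumption made for illustration and is not part of the assignment: one-dimensional features, a noise-free class boundary at $x = 0.5$, labels flipped independently with probability 0.2, and a query point at $x = 0.9$. The helper `knn_vote` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def knn_vote(train_x, train_y, query, k):
    """Majority vote (labels in {0, 1}) among the k nearest 1-D training points."""
    nearest = np.argsort(np.abs(train_x - query))[:k]
    return int(train_y[nearest].sum() * 2 > k)


n_trials, n_train, flip_prob, query = 2000, 50, 0.2, 0.9
errors = {1: 0, 3: 0}
for _ in range(n_trials):
    train_x = rng.uniform(0, 1, n_train)
    clean_y = (train_x > 0.5).astype(int)            # noise-free class
    flipped = rng.uniform(size=n_train) < flip_prob  # independent label noise
    noisy_y = np.where(flipped, 1 - clean_y, clean_y)
    for k in errors:
        # The noise-free class at query = 0.9 is 1; count disagreements with it.
        errors[k] += knn_vote(train_x, noisy_y, query, k) != 1

error_rate = {k: count / n_trials for k, count in errors.items()}
print(error_rate)
```

Under these assumptions, a single neighbor carries a flipped label about 20% of the time, whereas a three-neighbor majority is wrong only when at least two of the three labels are flipped, which happens with probability $3(0.2)^2(0.8) + (0.2)^3 \approx 0.10$.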


## Bias-Variance Tradeoff

---

**Problem 2a:** Training Error and Flexibility

True or False: The predictive error on training data generally decreases as the model becomes more flexible. Explain your reasoning.

> BEGIN SOLUTION

True.

A more flexible model can fit the training data more closely, so the training error generally decreases as flexibility increases. A sufficiently flexible model can even pass through every training point, driving the training error toward zero.
> END SOLUTION


---

**Problem 2b:** Test Error and Flexibility

Describe how the predictive error on test data typically changes as model flexibility increases. What phenomenon explains this behavior?

> BEGIN SOLUTION

In typical settings, the predictive error on the test data follows a U-shaped curve as the model becomes more flexible. Starting from a very inflexible model, test error first decreases because added flexibility reduces bias. Past a certain point, the increase in variance outweighs the further reduction in bias, and test error rises again; that is, the model overfits the training data. This behavior is the bias-variance tradeoff.
> END SOLUTION
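
The behavior described in Problems 2a and 2b can be observed by fitting polynomials of increasing degree. The sketch below assumes a hypothetical data-generating process, $y = \sin(2\pi x)$ plus Gaussian noise, chosen only for illustration; polynomial degree stands in for model flexibility.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n):
    """Draw n points from an assumed process: y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
    return x, y


x_train, y_train = simulate(30)
x_test, y_test = simulate(1000)

train_mse, test_mse = {}, {}
for degree in [1, 3, 5, 7]:
    # Higher polynomial degree means a more flexible model.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse[degree] = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse[degree]:.4f}  test MSE={test_mse[degree]:.4f}")
```

Because lower-degree polynomials are special cases of higher-degree ones, the least-squares training error cannot increase with degree. The test error is not guaranteed to trace a clean U in any single draw, so the shape is best judged by averaging over several simulated datasets.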


## Bias-Variance Decomposition: A Simulation Study

Consider the data-generating process:
- $X \sim \mathrm{Uniform}[0, 1]$
- $Y | X = x \sim \mathrm{Uniform}[x + \cos(2\pi x) - 0.1, x + \cos(2\pi x) + 0.1]$

We will investigate the bias-variance tradeoff by fitting an ordinary least squares (OLS) linear regression model to data generated from this process.

**Resources:**
- [sklearn LinearRegression documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- [ISL Chapter 2.2: Bias-Variance Tradeoff](https://www.statlearning.com/)

---

**Problem 3a:** Conditional Expectation

Find $f(x) = \mathbb{E}[Y|X=x]$ and compute $f(0.5)$.

> BEGIN SOLUTION

$f(x) = \mathbb{E}[Y|X=x] = x + \cos(2\pi x)$

$f(0.5) = 0.5 + \cos(2 \pi \cdot 0.5) = 0.5 + (-1) = -0.5$
> END SOLUTION


---

**Problem 3b:** Conditional Variance

Compute $\mathrm{Var}(Y | X = 0.5)$. Recall that for a uniform distribution on $[a, b]$, the variance is $(b-a)^2/12$.

> BEGIN SOLUTION

When $X = 0.5$, $[Y |X = 0.5] \sim \mathrm{Uniform}[0.5 + \cos(2\pi \cdot 0.5) - 0.1, 0.5 + \cos(2\pi \cdot 0.5) + 0.1]$.

Therefore $[Y |X = 0.5] \sim \mathrm{Uniform}[-0.6, -0.4]$

$\mathrm{Var}(Y |X = 0.5) = \frac{(-0.4 - (-0.6))^2}{12} = \frac{0.2^2}{12} = \frac{0.04}{12} = \frac{1}{300} \approx 0.00333$
> END SOLUTION
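
The closed-form answers to Problems 3a and 3b can be checked by Monte Carlo. This is a minimal sketch; the sample size of 100,000 and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

x = 0.5
center = x + np.cos(2 * np.pi * x)  # f(0.5) = -0.5

# Draw many Y values from Uniform[f(x) - 0.1, f(x) + 0.1].
samples = rng.uniform(center - 0.1, center + 0.1, size=100_000)

sample_mean = samples.mean()
sample_var = samples.var()
print(f"sample mean     = {sample_mean:.4f}  (exact: -0.5)")
print(f"sample variance = {sample_var:.5f}  (exact: {1 / 300:.5f})")
```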

---

**Problem 3c:** Data Generation and OLS Fit

Generate 100 samples from the data-generating process. Store the features in a variable `features` and the targets in a variable `targets`. Then fit an OLS linear regression model and store it in a variable `ols_model`. Finally, create a plot showing:
1. The data points as a scatter plot
2. The true regression function $f(x)$
3. The estimated OLS fit $\hat{f}(x)$

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)


# Define the true regression function
def true_function(x):
    return x + np.cos(2 * np.pi * x)

In [None]:
# BEGIN SOLUTION
# Generate X and Y from the data-generating process
features = np.random.uniform(0, 1, 100)
targets = np.random.uniform(
    true_function(features) - 0.1,
    true_function(features) + 0.1,
    100,
)

# Fit ordinary least squares model
features_reshaped = features.reshape(-1, 1)
ols_model = LinearRegression().fit(features_reshaped, targets)

# Create the plot
x_range = np.linspace(0, 1, 100)
ols_predictions = ols_model.predict(x_range.reshape(-1, 1))

plt.figure(figsize=(10, 6))
plt.scatter(features, targets, color="blue", label="Data points")
plt.plot(x_range, true_function(x_range), color="green", label="True function f")
plt.plot(x_range, ols_predictions, color="red", label="Estimated function f_hat")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Data points vs True function vs Estimated function")
plt.legend()
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert len(features) == 100, "features should have 100 samples"
assert len(targets) == 100, "targets should have 100 samples"
assert hasattr(ols_model, "coef_"), "ols_model should be a fitted LinearRegression model"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert features.min() >= 0 and features.max() <= 1, "features should be in [0, 1]"
assert hasattr(ols_model, "intercept_"), "ols_model should have intercept_"
# END HIDDEN TESTS

---

**Problem 3d:** Sampling Distribution of Predictions

To understand the variance of the OLS estimator, repeat the following 500 times:
1. Generate a new dataset of 100 samples from the data-generating process
2. Fit an OLS model
3. Store the prediction $\hat{f}(0.5; \mathcal{D}_i)$

Store all 500 predictions in a list called `predictions` and plot a histogram of these predictions.

In [None]:
# BEGIN SOLUTION
predictions = []
test_point = np.array([[0.5]])

for _ in range(500):
    # Generate the dataset from the data-generating process
    sample_features = np.random.uniform(0, 1, 100)
    sample_targets = np.random.uniform(
        true_function(sample_features) - 0.1,
        true_function(sample_features) + 0.1,
        100,
    )
    sample_features_reshaped = sample_features.reshape(-1, 1)

    # Fit the linear regression model
    model = LinearRegression().fit(sample_features_reshaped, sample_targets)

    # Store the prediction at x = 0.5
    predictions.append(model.predict(test_point)[0])

# Plot a histogram of predictions
plt.figure(figsize=(10, 6))
plt.hist(predictions, bins=30, edgecolor="black")
plt.xlabel("Predicted f_hat(0.5, D_i)")
plt.ylabel("Frequency")
plt.title("Histogram of Predictions for f_hat(0.5, D_i)")
plt.show()
# END SOLUTION

In [None]:
# Test assertions
assert len(predictions) == 500, "predictions should have 500 values"
assert all(
    -1 < p < 2 for p in predictions
), "Predictions should be in a reasonable range for this regression problem"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 0.4 < np.mean(predictions) < 0.6, "Mean of predictions should be around 0.5"
assert np.var(predictions) < 0.02, "Variance of predictions should be small"
# END HIDDEN TESTS

---

**Problem 3e:** Estimator Bias

Compute the mean of your 500 predictions at $x = 0.5$ and store it in a variable called `mean_prediction`. Compare this to the true value $f(0.5) = -0.5$ from Problem 3a. What does this suggest about the bias of the OLS estimator for this problem?

In [None]:
# BEGIN SOLUTION
mean_prediction = np.mean(predictions)
print(f"Mean prediction at x=0.5: {mean_prediction:.4f}")
print("True value f(0.5): -0.5")
print(f"Difference: {mean_prediction - (-0.5):.4f}")
print()
print("The mean prediction (~0.5) differs greatly from the true value (-0.5).")
print("This large difference suggests high bias in the OLS estimator for this nonlinear problem.")
# END SOLUTION

In [None]:
# Test assertions
assert -1 < mean_prediction < 2, "mean_prediction should be in a reasonable range"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(mean_prediction - np.mean(predictions)) < 1e-10
), "mean_prediction should equal np.mean(predictions)"
assert 0.4 < mean_prediction < 0.6, "Mean prediction should be around 0.5"
# END HIDDEN TESTS

---

**Problem 3f:** Estimator Variance

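Compute the variance of your 500 predictions at $x = 0.5$ and store it in a variable called `estimator_variance`. What does this value suggest about the consistency of the OLS estimator's predictions across training sets (that is, their stability from sample to sample, not consistency in the asymptotic sense)?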

In [None]:
# BEGIN SOLUTION
estimator_variance = np.var(predictions)
print(f"Variance of predictions: {estimator_variance:.6f}")
print()
print("The variance is quite small (~0.005), suggesting the OLS estimator is consistent")
print("across different samples. However, this consistency does not imply accuracy:")
print("as shown in 3e, the estimator may be consistently biased.")
# END SOLUTION

In [None]:
# Test assertions
assert estimator_variance > 0, "estimator_variance should be positive"
assert estimator_variance < 0.02, f"estimator_variance should be small, got {estimator_variance}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (
    abs(estimator_variance - np.var(predictions)) < 1e-10
), "estimator_variance should equal np.var(predictions)"
# END HIDDEN TESTS

---

**Problem 3g:** MSPE Decomposition

Using the bias-variance decomposition:

$$\mathrm{MSPE} = (\text{irreducible error}) + (\text{estimator variance}) + (\text{estimator bias})^2$$

Compute the estimated MSPE at $x = 0.5$ and store it in a variable called `mspe`. Use:
- Irreducible error = $\mathrm{Var}(Y | X = 0.5)$ from Problem 3b (store in `irreducible_error`)
- Estimator variance from Problem 3f (store in `estimator_variance`)
- Estimator bias = mean prediction - $f(0.5)$ from Problem 3e (store in `estimator_bias`)

In [None]:
# BEGIN SOLUTION
# Irreducible error (variance of uniform on interval of width 0.2)
irreducible_error = (0.1 - (-0.1)) ** 2 / 12

# Estimator variance from predictions
estimator_variance = np.var(predictions)

# Estimator bias: difference between mean prediction and true f(0.5)
estimator_bias = np.mean(predictions) - (-0.5)

# MSPE decomposition
mspe = irreducible_error + estimator_variance + estimator_bias**2
mspe
# END SOLUTION

In [None]:
# Test assertions
assert abs(irreducible_error - 0.00333) < 0.001, "irreducible_error should be approximately 0.00333"
assert estimator_variance > 0, "estimator_variance should be positive"
assert estimator_bias > 0.9, "estimator_bias should be approximately 1 (high bias)"
assert mspe > 0.9, "MSPE should be dominated by the squared bias term"
print("All tests passed!")

# BEGIN HIDDEN TESTS
expected_mspe = irreducible_error + estimator_variance + estimator_bias**2
assert abs(mspe - expected_mspe) < 1e-10, "MSPE should equal the sum of components"
assert mspe < 1.1, "MSPE should be less than 1.1"
# END HIDDEN TESTS
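
The decomposition in Problem 3g can be cross-checked by estimating the MSPE at $x = 0.5$ directly: repeatedly draw a fresh training set, fit a line, draw a fresh response at $x = 0.5$, and average the squared prediction error. The sketch below is an illustration under stated assumptions rather than part of the graded solution. It uses `np.polyfit` with degree 1 in place of `LinearRegression` (both compute a least-squares line with intercept), and the helper `draw_targets` and the 2000 repetitions are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_function(x):
    return x + np.cos(2 * np.pi * x)


def draw_targets(x):
    """Sample Y | X = x from Uniform[f(x) - 0.1, f(x) + 0.1]."""
    return rng.uniform(true_function(x) - 0.1, true_function(x) + 0.1)


squared_errors = []
for _ in range(2000):
    # Fresh training set of 100 points; degree-1 polyfit is a least-squares line.
    train_x = rng.uniform(0, 1, 100)
    slope, intercept = np.polyfit(train_x, draw_targets(train_x), 1)
    prediction = slope * 0.5 + intercept
    # Fresh response at x = 0.5, independent of the training set.
    squared_errors.append((draw_targets(0.5) - prediction) ** 2)

direct_mspe = np.mean(squared_errors)
print(f"Direct Monte Carlo MSPE at x=0.5: {direct_mspe:.4f}")
```

If the decomposition holds, this direct estimate should land close to the value of `mspe` computed from the three components, with the squared bias term (about 1) dominating.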