
# Cross-Validation with Linear Regression — Exercises

> **Dataset:** We'll use `load_diabetes()` from scikit-learn (regression, 10 numeric features).  
> **Metric:** Use **RMSE** (root mean squared error); lower is better.  
> **CV Scheme:** Use **KFold** (e.g., 5 folds, shuffling with a fixed random state for reproducibility).
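
To make the CV scheme concrete, here is a minimal sketch of how a shuffled `KFold` partitions the data. It uses a ten-sample toy array (`X_toy`, an illustrative name, not the diabetes data) so the index sets are easy to read. It checks two properties the exercises rely on: every sample is held out exactly once, and a fixed `random_state` reproduces the same folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, just to make the index sets easy to read
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = list(kf.split(X_toy))

for i, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx}, test={test_idx}")

# Every sample lands in exactly one test fold
all_test = np.sort(np.concatenate([test_idx for _, test_idx in folds]))
covers_all = np.array_equal(all_test, np.arange(10))

# Re-splitting with the same random_state reproduces the same folds
folds_again = list(KFold(n_splits=5, shuffle=True, random_state=42).split(X_toy))
same = all(np.array_equal(a[1], b[1]) for a, b in zip(folds, folds_again))

print("covers all samples:", covers_all)
print("reproducible:", same)
```

With 442 samples and 5 folds, each held-out fold in the exercises will contain roughly 88 samples.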

Run the **Setup** cell once, then work through Exercises 1–3. Fill in the `# TODO` parts.


In [None]:

# ===== Setup (Run once) =====
import numpy as np
import pandas as pd

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt

# scikit-learn scorers follow a "higher is better" convention, so RMSE is
# exposed as its negative; negate the returned scores to recover RMSE
RMSE_SCORING = "neg_root_mean_squared_error"

# Load data
data = load_diabetes()
X = data.data          # shape (442, 10)
y = data.target        # shape (442,), disease progression one year after baseline

# Optional: peek at the data
pd.DataFrame(X, columns=data.feature_names).head()



## Exercise 1 — Train/Test Split vs. K-Fold Cross-Validation

**Goal:** Compare the RMSE from a single 80/20 train–test split with a **5-fold cross-validation** estimate using **Linear Regression**. Explain why the CV estimate is typically more reliable.

**What to learn**
- Why a single split can be noisy.
- How K-fold reduces variance in the estimate.
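
The two points above can be illustrated with a small experiment. This sketch uses synthetic data from `make_regression` as a stand-in (an assumption, chosen so it does not give away the diabetes results) and repeats both procedures over 20 random seeds. The names `X_syn`, `split_rmses`, and `cv_rmses` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (an assumption, not the diabetes set)
X_syn, y_syn = make_regression(n_samples=200, n_features=10, noise=25.0, random_state=0)

# RMSE from 20 different single 80/20 splits
split_rmses = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_rmses.append(np.sqrt(np.mean((y_te - model.predict(X_te)) ** 2)))
split_rmses = np.array(split_rmses)

# Mean RMSE from 20 different 5-fold CV runs (each averages five held-out folds)
cv_rmses = []
for seed in range(20):
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X_syn, y_syn,
                             scoring="neg_root_mean_squared_error", cv=kf)
    cv_rmses.append(-scores.mean())
cv_rmses = np.array(cv_rmses)

print(f"single split: mean {split_rmses.mean():.2f}, SD across seeds {split_rmses.std():.2f}")
print(f"5-fold CV:    mean {cv_rmses.mean():.2f}, SD across seeds {cv_rmses.std():.2f}")
```

On data like this, the single-split RMSE should move noticeably from seed to seed, while the 5-fold mean stays comparatively stable because every sample contributes to the estimate.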

**Your task**
1. Do an **80/20 split**, fit `LinearRegression`, compute **test RMSE**.  
2. Do **5-fold CV** on the whole dataset and compute **mean RMSE** across folds.  
3. Compare and briefly explain the difference.


In [None]:

# --- Part A: 80/20 split RMSE ---

# TODO: create X_train, X_test, y_train, y_test with test_size=0.2 and random_state=42
# X_train, X_test, y_train, y_test = ...

# TODO: build a Pipeline with StandardScaler() and LinearRegression()
# pipe = Pipeline(steps=[("scaler", StandardScaler()), ("lr", LinearRegression())])

# TODO: fit the pipeline on the training data
# pipe.fit(...)

# TODO: predict on test, compute RMSE
# y_pred = pipe.predict(...)
# rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
# print(f"Test RMSE (single 80/20 split): {rmse:.2f}")


# --- Part B: 5-fold CV RMSE ---

# TODO: define KFold with n_splits=5, shuffle=True, random_state=42
# kf = KFold(...)

# TODO: use cross_val_score with the same pipeline, scoring=RMSE_SCORING and cv=kf
# scores = cross_val_score(pipe, X, y, scoring=RMSE_SCORING, cv=kf)

# Convert to positive RMSE
# rmses = -scores
# mean_rmse = rmses.mean()
# std_rmse = rmses.std()
# print(f"CV RMSE (5-fold): {mean_rmse:.2f} ± {std_rmse:.2f}")



**Reflection (write here):**  
Explain in 2–4 sentences why the CV estimate can be more reliable than a single split.



## Exercise 2 — How Many Features Help? (Feature Count vs. CV Error)

**Goal:** Examine how performance changes as you vary the **number of input features** (e.g., first 2, 4, 6, 8, 10 features).

**What to learn**
- Adding features can reduce bias but can also increase variance, especially when the added features carry little signal.
- CV is a fair way to compare feature sets.
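
As a rough illustration of the first point, the sketch below builds hypothetical data in which only the first two columns influence the target and the remaining columns are pure noise. The data, coefficients, and names (`X_syn`, `cv_rmse_by_k`) are assumptions made for this example and are unrelated to the diabetes features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_columns = 60, 35

# Hypothetical data: only the first two columns drive y; the rest are pure noise
X_syn = rng.normal(size=(n_samples, n_columns))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=1.0, size=n_samples)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_rmse_by_k = {}
for k in [1, 2, 10, 35]:
    scores = cross_val_score(LinearRegression(), X_syn[:, :k], y_syn,
                             scoring="neg_root_mean_squared_error", cv=kf)
    cv_rmse_by_k[k] = -scores.mean()
    print(f"k={k:2d}: CV RMSE = {cv_rmse_by_k[k]:.2f}")
```

Here `k=1` underfits because it omits a relevant feature, while `k=35` fits many noise coefficients from only 48 training samples per fold, so both should score worse than `k=2`. Real datasets rarely show such a clean pattern, which is why the comparison is done with CV.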

**Your task**
1. For `k` in `{2, 4, 6, 8, 10}`, select `X[:, :k]`.  
2. Use a **Pipeline(StandardScaler → LinearRegression)**.  
3. Compute **5-fold CV RMSE** for each `k`.  
4. Plot `k` vs **mean RMSE**.  
5. Briefly discuss the trend.


In [None]:

# --- Feature Count vs. CV Error ---

k_values = [2, 4, 6, 8, 10]
mean_rmses, std_rmses = [], []

# TODO: set up KFold(n_splits=5, shuffle=True, random_state=42)
# kf = KFold(...)

for k in k_values:
    Xk = X[:, :k]  # use the first k features (for simplicity)

    # TODO: define Pipeline with StandardScaler and LinearRegression
    # pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])

    # TODO: cross_val_score with scoring=RMSE_SCORING and the KFold above
    # scores = cross_val_score(pipe, Xk, y, scoring=RMSE_SCORING, cv=kf)

    # Convert to positive RMSE
    # rmses = -scores
    # mean_rmses.append(rmses.mean())
    # std_rmses.append(rmses.std())

# Plot (do not set custom colors/styles); plt was imported in the Setup cell

plt.figure()
# TODO: uncomment after computing mean_rmses/std_rmses
# plt.errorbar(k_values, mean_rmses, yerr=std_rmses, fmt='o-')
plt.xlabel("Number of features (k)")
plt.ylabel("CV RMSE (5-fold)")
plt.title("Effect of Feature Count on CV Error")
plt.show()



**Reflection (write here):**  
In 2–4 sentences: does adding more features always help here? Why might performance worsen or plateau as `k` increases?



## Exercise 3 — Linear Model Complexity via Polynomial Features

**Goal:** Keep a **linear model** (`LinearRegression`) but change the input representation by adding **polynomial features**. Compare degrees `{1, 2, 3}` using CV.

> Even with polynomial features, `LinearRegression` is still **linear in parameters** (we’re just expanding the feature space). This isolates the effect of **feature engineering** on bias/variance.

**Your task**
1. Start with only the **first 2 original features**.  
2. Build pipelines: `StandardScaler → PolynomialFeatures(degree=d, include_bias=False) → LinearRegression`.  
3. For degrees `d ∈ {1, 2, 3}`, compute **5-fold CV RMSE**.  
4. Plot degree vs mean RMSE.  
5. Briefly discuss bias/variance trade-offs you observe.


In [None]:

# --- Polynomial Features vs. CV Error ---

degrees = [1, 2, 3]
mean_rmses_deg, std_rmses_deg = [], []

# Use only first 2 features
X2 = X[:, :2]

# TODO: set up KFold(n_splits=5, shuffle=True, random_state=42)
# kf = KFold(...)

for d in degrees:
    # TODO: define Pipeline with StandardScaler, PolynomialFeatures(degree=d, include_bias=False), LinearRegression
    # pipe = Pipeline([
    #     ("scaler", StandardScaler()),
    #     ("poly", PolynomialFeatures(degree=d, include_bias=False)),
    #     ("lr", LinearRegression())
    # ])

    # TODO: cross_val_score with scoring=RMSE_SCORING and the KFold above
    # scores = cross_val_score(pipe, X2, y, scoring=RMSE_SCORING, cv=kf)

    # Convert to positive RMSE
    # rmses = -scores
    # mean_rmses_deg.append(rmses.mean())
    # std_rmses_deg.append(rmses.std())

# Plot (do not set custom colors/styles)
plt.figure()
# TODO: uncomment after computing mean_rmses_deg/std_rmses_deg
# plt.errorbar(degrees, mean_rmses_deg, yerr=std_rmses_deg, fmt='o-')
plt.xlabel("Polynomial degree")
plt.ylabel("CV RMSE (5-fold)")
plt.title("Polynomial Features vs. CV Error (Linear Regression)")
plt.xticks(degrees)
plt.show()



**Reflection (write here):**  
In ~3 sentences, relate your results to bias (underfitting) and variance (overfitting). When might `degree=2` help? When might `degree=3` hurt?



### (Optional) Extension — Fold Diagnostics

Inspect fold-by-fold RMSE to see variability:


In [None]:

# Example: variability check for k=10 features in Exercise 2

# kf = KFold(n_splits=5, shuffle=True, random_state=42)
# pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])

# scores = cross_val_score(pipe, X[:, :10], y, scoring=RMSE_SCORING, cv=kf)
# print("Fold RMSEs:", -scores)
# print("Mean ± SD:", (-scores).mean(), "±", (-scores).std())
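
What `cross_val_score` computes for each fold can also be written as an explicit loop, which is sometimes easier to inspect. The sketch below uses synthetic data and a hypothetical helper `make_pipe` (both assumptions for illustration) and checks that the manual per-fold RMSEs match the negated scores.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def make_pipe():
    # Hypothetical helper: a fresh, unfitted pipeline for each fold
    return Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])

# Synthetic stand-in data (an assumption, not the diabetes set)
X_syn, y_syn = make_regression(n_samples=150, n_features=10, noise=20.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fit on the training indices, score on the held-out indices
manual_rmses = []
for train_idx, test_idx in kf.split(X_syn):
    pipe = make_pipe()
    pipe.fit(X_syn[train_idx], y_syn[train_idx])
    residuals = y_syn[test_idx] - pipe.predict(X_syn[test_idx])
    manual_rmses.append(np.sqrt(np.mean(residuals ** 2)))
manual_rmses = np.array(manual_rmses)

# Same folds through cross_val_score; scores come back negated
scores = cross_val_score(make_pipe(), X_syn, y_syn,
                         scoring="neg_root_mean_squared_error", cv=kf)

print("manual:", np.round(manual_rmses, 2))
print("cross_val_score (negated):", np.round(-scores, 2))
```

Because the scaler sits inside the pipeline, it is refit on each training fold only, so the held-out fold never leaks into the scaling statistics.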
