
# Cross Validation with Linear Regression — **Solutions**

This notebook contains **worked solutions** for the three exercises on cross-validation using **Linear Regression** in scikit-learn.  
Each solution wraps preprocessing and the model in a `Pipeline` so that the scaler is refit on the training portion of every fold, splits the data with a shuffled `KFold`, and reports RMSE. Plots are generated with matplotlib (no custom styles).
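
As a quick sanity check on the metric, the sketch below compares a hand-computed RMSE with scikit-learn's `neg_root_mean_squared_error` scorer, which returns the same quantity with the sign flipped. The split seed (`random_state=0`) and the variable names are illustrative assumptions, not part of the exercises.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Illustrative split; the seed is an arbitrary assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE computed by hand and via sklearn's MSE helper.
rmse_manual = np.sqrt(np.mean((y_test - y_pred) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_test, y_pred))

# The scorer is called as scorer(estimator, X, y) and returns -RMSE.
neg_rmse = get_scorer("neg_root_mean_squared_error")(model, X_test, y_test)

print(f"manual RMSE:  {rmse_manual:.4f}")
print(f"sklearn RMSE: {rmse_sklearn:.4f}")
print(f"scorer value: {neg_rmse:.4f}")
```

The scorer value should be the negative of the two RMSE values, which is why the solutions negate `cross_val_score` output before reporting it.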


In [None]:

# ===== Setup (Run once) =====
import numpy as np
import pandas as pd

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt

# scikit-learn scorers follow a "higher is better" convention, so RMSE is
# exposed as its negative; negate the returned scores to recover RMSE.
RMSE_SCORING = "neg_root_mean_squared_error"

# Load data
data = load_diabetes()
X = data.data          # shape (442, 10)
y = data.target        # quantitative target

# Optional: peek at the data
pd.DataFrame(X, columns=data.feature_names).head()



## Exercise 1 — Train/Test Split vs. K-Fold Cross-Validation (Solution)

**Goal:** Compare an 80/20 train–test split to a 5-fold CV estimate with Linear Regression.


In [None]:

# --- Part A: 80/20 split RMSE ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("lr", LinearRegression())])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
rmse_split = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f"Test RMSE (single 80/20 split): {rmse_split:.2f}")

# --- Part B: 5-fold CV RMSE ---
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, scoring=RMSE_SCORING, cv=kf)

rmses = -scores
mean_rmse = rmses.mean()
std_rmse = rmses.std()
print(f"CV RMSE (5-fold): {mean_rmse:.2f} ± {std_rmse:.2f}")
print("Fold RMSEs:", np.round(rmses, 2))



**Why CV is often more reliable:**  
A single train/test split evaluates the model on one partition only, so its error estimate is **high-variance**: a different `random_state` can move the test RMSE noticeably.  
**K-fold cross-validation** uses every observation for testing exactly once and averages the error over the `k` folds, which gives a **more stable estimate** of out-of-sample error. The spread of the fold RMSEs (the `±` value printed above) also indicates how much that estimate depends on the split. On small-to-moderate datasets such as `load_diabetes` (442 samples), this stability is especially valuable.
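
One way to see the variance argument in practice is to repeat both procedures with several seeds and compare how much each estimate moves. The sketch below is a minimal illustration under assumed settings: the 20 seeds and the names `split_rmses` and `cv_mean_rmses` are arbitrary choices, not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])

split_rmses, cv_mean_rmses = [], []
for seed in range(20):  # arbitrary number of repetitions
    # Single 80/20 split with this seed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    pipe.fit(X_tr, y_tr)
    resid = y_te - pipe.predict(X_te)
    split_rmses.append(np.sqrt(np.mean(resid ** 2)))

    # 5-fold CV with the same seed controlling the shuffle.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_val_score(
        pipe, X, y, scoring="neg_root_mean_squared_error", cv=kf
    )
    cv_mean_rmses.append(-scores.mean())

split_rmses = np.array(split_rmses)
cv_mean_rmses = np.array(cv_mean_rmses)

print(f"Single split: mean {split_rmses.mean():.2f}, "
      f"SD across seeds {split_rmses.std():.2f}")
print(f"5-fold CV:    mean {cv_mean_rmses.mean():.2f}, "
      f"SD across seeds {cv_mean_rmses.std():.2f}")
```

If the reasoning above holds for this dataset, the SD across seeds should be clearly smaller for the CV means than for the single-split RMSEs.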



## Exercise 2 — How Many Features Help? (Feature Count vs. CV Error) — Solution

We evaluate the **first** `k ∈ {2, 4, 6, 8, 10}` columns of `X` (in the dataset's column order) using the same pipeline and 5-fold CV.


In [None]:

k_values = [2, 4, 6, 8, 10]
mean_rmses, std_rmses = [], []

kf = KFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for k in k_values:
    Xk = X[:, :k]  # use first k features
    pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
    scores = cross_val_score(pipe, Xk, y, scoring=RMSE_SCORING, cv=kf)
    rmses = -scores
    mean_rmses.append(rmses.mean())
    std_rmses.append(rmses.std())
    rows.append({"k_features": k, "mean_RMSE": rmses.mean(), "std_RMSE": rmses.std()})

results_df = pd.DataFrame(rows)
print(results_df)

# Plot (matplotlib, no custom styles/colors)
plt.figure()
plt.errorbar(k_values, mean_rmses, yerr=std_rmses, fmt='o-')
plt.xlabel("Number of features (k)")
plt.ylabel("CV RMSE (5-fold)")
plt.title("Effect of Feature Count on CV Error")
plt.show()



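**Interpretation:**  
- Adding features can **help at first** by capturing more signal (lower bias).  
- After some point, extra features may add **noise** or redundancy, and the model can become less stable, so CV RMSE may **plateau or even worsen**.  
- The best `k` is the one that **minimizes CV RMSE**, not necessarily “all features.”  
- Because features are added in the dataset's column order rather than by relevance, the curve describes this particular ordering; a different ordering could produce a different best `k`.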
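
The selection rule in the last point can be sketched directly: compute the mean CV RMSE for each candidate `k` and take the minimizer. The names `cv_by_k` and `best_k` are hypothetical helpers introduced for illustration; the candidate values and CV settings mirror the exercise.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])

# Mean CV RMSE for the first k columns, keyed by k.
cv_by_k = {}
for k in (2, 4, 6, 8, 10):
    scores = cross_val_score(
        pipe, X[:, :k], y, scoring="neg_root_mean_squared_error", cv=kf
    )
    cv_by_k[k] = -scores.mean()

# Pick the k whose mean CV RMSE is smallest.
best_k = min(cv_by_k, key=cv_by_k.get)

for k, rmse in cv_by_k.items():
    print(f"k={k:2d}  mean CV RMSE={rmse:.2f}")
print("k with lowest mean CV RMSE:", best_k)
```

When two values of `k` differ by less than the fold-to-fold standard deviation, the smaller model may be a reasonable choice as well.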



## Exercise 3 — Linear Model Complexity via Polynomial Features — Solution

We keep a Linear Regression model but expand inputs via `PolynomialFeatures(degree=d)` for `d ∈ {1, 2, 3}` on the **first 2 features** only.


In [None]:

degrees = [1, 2, 3]
mean_rmses_deg, std_rmses_deg = [], []

X2 = X[:, :2]

kf = KFold(n_splits=5, shuffle=True, random_state=42)

rows_poly = []
for d in degrees:
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("lr", LinearRegression())
    ])
    scores = cross_val_score(pipe, X2, y, scoring=RMSE_SCORING, cv=kf)
    rmses = -scores
    mean_rmses_deg.append(rmses.mean())
    std_rmses_deg.append(rmses.std())
    rows_poly.append({"degree": d, "mean_RMSE": rmses.mean(), "std_RMSE": rmses.std()})

poly_df = pd.DataFrame(rows_poly)
print(poly_df)

# Plot (matplotlib, no custom colors/styles)
plt.figure()
plt.errorbar(degrees, mean_rmses_deg, yerr=std_rmses_deg, fmt='o-')
plt.xlabel("Polynomial degree")
plt.ylabel("CV RMSE (5-fold)")
plt.title("Polynomial Features vs. CV Error (Linear Regression)")
plt.xticks(degrees)
plt.show()



**Interpretation:**  
- `degree = 1` is the baseline linear model (2 input columns).  
- `degree = 2` adds the squares and the interaction term (5 columns in total) and can improve fit if there are **nonlinear relationships** in the first two features.  
- `degree = 3` increases flexibility further (9 columns in total) and may **overfit**, which often shows up as **higher CV RMSE** than `degree = 2`.  
- Cross-validation helps identify the **sweet spot** where added complexity improves generalization rather than harming it.
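
Overfitting is easier to spot when training error is reported next to validation error. The sketch below uses `cross_validate` with `return_train_score=True` for the same three degrees; the dictionaries `train_rmse` and `val_rmse` are illustrative names, and this is one possible diagnostic rather than part of the original solution.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True)
X2 = X[:, :2]  # first two features, as in the exercise
kf = KFold(n_splits=5, shuffle=True, random_state=42)

train_rmse, val_rmse = {}, {}
for d in (1, 2, 3):
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("lr", LinearRegression()),
    ])
    res = cross_validate(
        pipe, X2, y,
        scoring="neg_root_mean_squared_error",
        cv=kf,
        return_train_score=True,
    )
    # Scores are negative RMSE; flip the sign before averaging.
    train_rmse[d] = -res["train_score"].mean()
    val_rmse[d] = -res["test_score"].mean()

for d in (1, 2, 3):
    gap = val_rmse[d] - train_rmse[d]
    print(f"degree={d}  train RMSE={train_rmse[d]:.2f}  "
          f"CV RMSE={val_rmse[d]:.2f}  gap={gap:.2f}")
```

Training RMSE cannot increase as the degree grows, because each higher-degree model contains the lower-degree one. A widening gap between training and CV RMSE is the usual sign that the extra flexibility is fitting noise.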



### (Optional) Extension — Fold Diagnostics

Checking fold-by-fold RMSE can reveal variance across splits:


In [None]:

kf = KFold(n_splits=5, shuffle=True, random_state=42)
pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
scores = cross_val_score(pipe, X, y, scoring=RMSE_SCORING, cv=kf)  # all 10 features
rmses = -scores  # flip the sign to recover RMSE
print("Fold RMSEs:", np.round(rmses, 2))
print(f"Mean ± SD: {rmses.mean():.2f} ± {rmses.std():.2f}")
