# Regression - Cross-Validation

This notebook introduces the Scikit-Learn interface for cross-validation.

## Setup

Load the packages used throughout the notebook.

In [None]:
import numpy as np
import pandas as pd

This notebook uses the Boston data set from HW1. The response is `crim`; the predictors are all columns from `zn` onward.

In [None]:
# download the data set directly from the web using pandas
url = "https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/data/Boston.csv"
boston = pd.read_csv(url)
# predictors: every column from 'zn' onward
X = boston.loc[:, 'zn':]
# response: crim, kept as a one-column DataFrame
y = boston[['crim']]

In [None]:
X.head()

In [None]:
y.head()

## Proper Dataset Splitting

First, hold out a test set for final assessment.

In [None]:
from sklearn.model_selection import train_test_split

# X_train_val and y_train_val are for all model development (training and CV)
# X_test and y_test are reserved for final assessment of the resulting model
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

## `cross_val_score`

Use `cross_val_score` for a quick estimate of test performance with a single metric. It returns one score per fold.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

model = LinearRegression()

# Simple 5-fold cross-validation, scored with R²
scores = cross_val_score(model, X_train_val, y_train_val, cv=5, scoring='r2')
print(scores)
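
The per-fold scores are usually summarized by their mean and spread. A minimal sketch, assuming synthetic stand-in data from `make_regression` so that it runs without the download; the names `X_demo`, `y_demo`, and `demo_scores` are hypothetical and not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the Boston set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# One R² value per fold
demo_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=5, scoring='r2')

# Report the individual folds, then the mean and standard deviation across folds
print(f"R² per fold: {np.round(demo_scores, 3)}")
print(f"Mean R²: {demo_scores.mean():.3f} (std {demo_scores.std():.3f})")
```

The same two summary lines should work on the `scores` array computed above.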

## `cross_validate`

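Use `cross_validate` for a more thorough analysis. It accepts multiple metrics and returns a dict of fit times, score times, and test scores for each fold; training scores are included when `return_train_score=True`.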

In [None]:
from sklearn.model_selection import cross_validate
# use pretty print to make structures easier to read
from pprint import pprint as pp

# Multiple metrics and training scores
results = cross_validate(model, X_train_val, y_train_val, cv=5, 
                        scoring=['r2', 'neg_root_mean_squared_error', 'neg_median_absolute_error'],
                        return_train_score=True)
pp(results)
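
The returned dict can be easier to read as a table. A minimal sketch, again assuming synthetic stand-in data from `make_regression`; the names `demo_results` and `results_df` are hypothetical. Because Scikit-Learn negates error metrics so that larger is always better, the sketch flips the sign of the `neg_` columns back to ordinary (positive) errors.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption, not the Boston set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_results = cross_validate(
    LinearRegression(), X_demo, y_demo, cv=5,
    scoring=['r2', 'neg_root_mean_squared_error'],
    return_train_score=True)

# Each key maps to an array with one entry per fold,
# so the dict converts directly to a DataFrame (one row per fold)
results_df = pd.DataFrame(demo_results)

# Flip the sign of the negated error metrics and drop 'neg_' from their names
neg_cols = [c for c in results_df.columns if '_neg_' in c]
results_df[neg_cols] = -results_df[neg_cols]
results_df = results_df.rename(columns=lambda c: c.replace('_neg_', '_'))

print(results_df.round(3))
```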

Note that the `test_r2` scores match those from `cross_val_score` above. By default, these functions assign folds based on the order of the data provided. For example, with 5 folds, the first 20% of the rows form the first fold, the second 20% the second, and so on. Because the rows were already shuffled by `train_test_split` (with a fixed seed), the folds inherit that random order and are reproducible. In Scikit-Learn terms, the default splitter uses `shuffle=False`.
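
The fold assignment described above can be checked on a toy array. A minimal sketch using `KFold`, the splitter class that an integer `cv` resolves to for a regressor; the array `toy` and the names `ordered_folds` and `shuffled_folds` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(10)  # ten rows, identified by position

# Default KFold (shuffle=False): each test fold is a contiguous block of rows
ordered_folds = [test_idx.tolist()
                 for _, test_idx in KFold(n_splits=5).split(toy)]
print(ordered_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# shuffle=True with a fixed seed: rows are permuted before being split into folds
shuffled_folds = [test_idx.tolist()
                  for _, test_idx in KFold(n_splits=5, shuffle=True,
                                           random_state=42).split(toy)]
print(shuffled_folds)
```

In both cases every row appears in exactly one test fold; only the grouping differs.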

For greater control over this process, you can create a CV splitter with `shuffle=True` and a fixed `random_state`.

In [None]:
from sklearn.model_selection import KFold, RepeatedKFold

# For k-fold cross-validation on the training set
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# For repeated k-fold to get more stable estimates
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

This code does not perform the split; it only initializes the splitters, which can then be passed to `cross_val_score` or `cross_validate` through the `cv` argument.

In [None]:
# Pass this CV splitter to cross_val_score
scores = cross_val_score(model, X_train_val, y_train_val, cv=kf)
print(scores)
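
A repeated splitter is passed the same way. A minimal sketch, assuming synthetic stand-in data from `make_regression`; the names `rkf_demo` and `rkf_scores` are hypothetical. With 5 splits and 10 repeats, the result holds one score per fold per repeat, so averaging over all of them gives a more stable estimate than a single 5-fold run.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the Boston set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

rkf_demo = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)

# No scoring argument: the estimator's own score method is used (R² for LinearRegression)
rkf_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=rkf_demo)

print(len(rkf_scores))  # 5 folds x 10 repeats = 50 scores
print(f"Mean R²: {rkf_scores.mean():.3f} (std {rkf_scores.std():.3f})")
```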

## Fit and Evaluate the Final Model on Test Data

Once we are happy with the scores, we have to refit the model, because the CV functions used above do not return the models fit on each fold. We want to fit on all of the `train_val` data anyway before finally evaluating the result on the held-out `test` set.

In [None]:
# After selecting the best model through cross-validation
# Fit the final model on all training data
final_model = LinearRegression()
final_model.fit(X_train_val, y_train_val)

# Evaluate on the held-out test set
test_score = final_model.score(X_test, y_test)
print(f"Final model R² on test set: {test_score:.4f}")

# You can also calculate other metrics on the test set
from sklearn.metrics import mean_squared_error, median_absolute_error

y_pred = final_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
# named medae to avoid confusion with mean absolute error (MAE)
medae = median_absolute_error(y_test, y_pred)

print(f"Root Mean Squared Error on test set: {rmse:.4f}")
print(f"Median Absolute Error on test set: {medae:.4f}")

# Optional: Examine the model coefficients
# (coef_ is 2D here because y is a one-column DataFrame, hence coef_[0])
coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': final_model.coef_[0]
}).sort_values('Coefficient', ascending=False)

print("\nModel Coefficients:")
print(coefficients)