# Introduction to Cross-Validation - Lab

## Introduction

In this lab, you'll be able to practice your cross-validation skills!


## Objectives

You will be able to:

- Perform cross-validation on a model
- Compare and contrast model validation strategies

## Let's Get Started

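The code to preprocess the Ames Housing dataset is included below. Preprocessing the full dataset before splitting is done for the sake of expediency, although it may result in data leakage (the normalization statistics are computed from rows that later end up in the test set) and therefore overly optimistic model metrics.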

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

ames = pd.read_csv('ames.csv')

continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']

ames_cont = ames[continuous]

# log features
log_names = [f'{column}_log' for column in ames_cont.columns]

ames_log = np.log(ames_cont)
ames_log.columns = log_names

# normalize (subtract mean and divide by std)

def normalize(feature):
    return (feature - feature.mean()) / feature.std()

ames_log_norm = ames_log.apply(normalize)

# one hot encode categoricals
ames_ohe = pd.get_dummies(ames[categoricals], prefix=categoricals, drop_first=True)

preprocessed = pd.concat([ames_log_norm, ames_ohe], axis=1)

X = preprocessed.drop('SalePrice_log', axis=1)
y = preprocessed['SalePrice_log']
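
One common way to avoid the leakage mentioned above is to put the scaling step inside a scikit-learn pipeline, so that it is refit on the training portion of every split. The sketch below is only an illustration of that pattern: it assumes synthetic stand-in data, and `X_demo`, `y_demo`, and `leak_free_model` are hypothetical names that are not part of this lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is NOT the Ames dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=1.0, size=200)

# The scaler lives inside the pipeline, so its mean and std are
# learned from the training rows of each fold only
leak_free_model = make_pipeline(StandardScaler(), LinearRegression())

neg_mse = cross_val_score(leak_free_model, X_demo, y_demo,
                          cv=5, scoring='neg_mean_squared_error')
leak_free_mse = -neg_mse
print("Per-fold MSE:", leak_free_mse)
```

The rest of this lab keeps the simpler "preprocess everything first" approach, so the metrics below should be read with that caveat in mind.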

## Train-Test Split

Perform a train-test split with a test set of 20% and a random state of 4.

In [2]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [3]:
# Split the data into training and test sets (assign 20% to test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

### Fit a Model

Fit a linear regression model on the training set.

In [4]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression


In [5]:
# Instantiate the linear regression model
lin_reg = LinearRegression()

# Fit the model to the training data
lin_reg.fit(X_train, y_train)


### Calculate MSE

Calculate the mean squared error on the test set.

In [6]:
# Import mean_squared_error from sklearn.metrics
from sklearn.metrics import mean_squared_error


In [7]:
# Generate predictions on the test set
y_test_pred = lin_reg.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse_test = mean_squared_error(y_test, y_test_pred)

# Display the MSE result
print("Test Set Mean Squared Error:", mse_test)


Test Set Mean Squared Error: 0.1523399721070817


## Cross-Validation using Scikit-Learn

Now let's compare that single test MSE to a cross-validated test MSE.

In [8]:
# Import cross_val_score from sklearn.model_selection
from sklearn.model_selection import cross_val_score

In [9]:
# Find MSE scores for a 5-fold cross-validation
mse_scores = -cross_val_score(lin_reg, X, y, cv=5, scoring='neg_mean_squared_error')

In [10]:
# Get the average MSE score
avg_mse = mse_scores.mean()

# Display the MSE scores and the average MSE
print("Cross-Validation MSE Scores:", mse_scores)
print("Average Cross-Validation MSE:", avg_mse)


Cross-Validation MSE Scores: [0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]
Average Cross-Validation MSE: 0.1770283421000112
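
The minus sign in front of `cross_val_score` above is needed because scikit-learn scorers follow a "higher is better" convention, so `'neg_mean_squared_error'` returns negated MSE values. The following sketch shows that on synthetic data and also reports the spread across folds, which is often as informative as the mean. `X_demo`, `y_demo`, `raw_scores`, and `demo_mse` are illustrative names, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Ames dataset)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(150, 3))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=150)

# The raw scores are negative (or zero) because the MSE is negated
raw_scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                             cv=5, scoring='neg_mean_squared_error')

# Flip the sign to get ordinary MSE values back
demo_mse = -raw_scores
print("Raw scores:", raw_scores)
print("MSE mean:", demo_mse.mean(), "MSE std:", demo_mse.std())
```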


Compare and contrast the results. What is the difference between the train-test split and cross-validation results? Do you "trust" one more than the other?


### 🏠 Comparing Train-Test Split vs. Cross-Validation Results

Now that we've calculated both:  
✅ **Train-test split MSE** (single test set evaluation).  
✅ **Cross-validation MSE** (average of 5 different test set evaluations).  

---

#### 🔹 Key Differences

##### **Train-Test Split**  
- Evaluates the model on **one** test set.  
- **Strengths:**  
  - Fast and simple.  
  - Mimics real-world scenarios where models are trained and tested separately.  
- **Weaknesses:**  
  - The results depend on how the data is split.  
  - Can be unstable if the split is **lucky or unlucky** (leading to misleading results).  
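
To see how much a single split can move the result, one option is to repeat the split with several random states and compare the test MSEs. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `split_mses` are hypothetical names, and the size of the spread will differ on the Ames data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the Ames dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=100)

# Same model, same data, ten different 80/20 splits
split_mses = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    split_mses.append(mean_squared_error(y_te, model.predict(X_te)))

print("Lowest test MSE: ", min(split_mses))
print("Highest test MSE:", max(split_mses))
```
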

##### **Cross-Validation (5-fold)**  
- Evaluates the model on **5 different test sets** and **averages the results**.  
- **Strengths:**  
  - More reliable and stable evaluation.  
  - Every record is used for both training and testing across the folds, so the estimate does not hinge on one fixed holdout set.  
  - Reduces the variance caused by lucky or unlucky splits.  
- **Weaknesses:**  
  - Computationally expensive (takes longer to run).  

---

#### 🔹 Do We Trust One More?
Yes! **Cross-validation is more trustworthy** 🚀  

- **Why?** It reduces the impact of **random luck** in train-test splits.  
- Instead of relying on **one test set**, cross-validation **averages over multiple test sets** for a more **stable** and **reliable** error estimate.  
- If the **train-test split MSE is very different from the cross-validation MSE**, the train-test split might be misleading.  

---

#### 🔹 When to Use Each Method?
- **Use Train-Test Split** when you need a quick evaluation or when computational efficiency is a concern.  
- **Use Cross-Validation** when you want a more reliable evaluation before making final decisions, especially in small datasets.  

---

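#### 💡 **Final Thought:**  
If the train-test MSE and cross-validation MSE are similar, the model is likely stable.  
If they are **very different**, the train-test split was probably **lucky or unlucky**, so the cross-validation estimate is the better one to rely on.  
Here the single-split MSE (about 0.152) sits below the cross-validation average (about 0.177) but within the range of the individual fold scores (about 0.124 to 0.207), so the single split looks somewhat optimistic rather than wildly off.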

## Level Up: Let's Build It from Scratch!

### Create a Cross-Validation Function

Write a function `kfolds(data, k)` that splits a dataset into `k` evenly sized pieces. If the full dataset is not divisible by `k`, make the first few folds one record larger than the later ones.

For example, if you had this dataset:

In [11]:
example_data = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
example_data

    color
0     red
1  orange
2  yellow
3   green
4    blue
5  indigo
6  violet


`kfolds(example_data, 3)` should return:

* a dataframe with `red`, `orange`, `yellow`
* a dataframe with `green`, `blue`
* a dataframe with `indigo`, `violet`

The example dataframe has 7 records, which is not evenly divisible by 3, so the "leftover" 1 record extends the length of the first dataframe.

In [12]:
def kfolds(data, k):
    folds = []
    
    # Calculate the base size of each fold
    fold_size = len(data) // k
    remainder = len(data) % k  # Extra records to distribute among first few folds

    start_idx = 0
    for i in range(k):
        # First 'remainder' folds get an extra record
        extra = 1 if i < remainder else 0
        end_idx = start_idx + fold_size + extra
        folds.append(data.iloc[start_idx:end_idx])  # Append the slice to folds
        start_idx = end_idx  # Update start index for the next fold

    
    return folds

In [13]:
results = kfolds(example_data, 3)
for result in results:
    print(result, "\n")

    color
0     red
1  orange
2  yellow 

   color
3  green
4   blue 

    color
5  indigo
6  violet 
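
As a cross-check, NumPy's `np.array_split` follows the same "first few pieces get one extra" rule, so it can be used to build the folds from row positions. The sketch below is an alternative under that assumption, not a replacement for the exercise; `kfolds_via_array_split`, `colors`, and `chunks` are hypothetical names.

```python
import numpy as np
import pandas as pd

def kfolds_via_array_split(data, k):
    """Hypothetical helper: split row positions with np.array_split, then slice."""
    # array_split gives the first len(data) % k chunks one extra element
    index_chunks = np.array_split(np.arange(len(data)), k)
    return [data.iloc[chunk] for chunk in index_chunks]

colors = pd.DataFrame({
    "color": ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
})
chunks = kfolds_via_array_split(colors, 3)
print([len(chunk) for chunk in chunks])  # [3, 2, 2]
```

Splitting the positions rather than the dataframe itself keeps the helper independent of how pandas and NumPy interact when a dataframe is passed to `np.array_split` directly.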



### Apply Your Function to the Ames Housing Data

Get folds for both `X` and `y`.

In [14]:
# Apply kfolds() to the Ames Housing data (X and y) with 5 folds

# Get folds for features (X) and target variable (y)
X_folds = kfolds(X, 5)
y_folds = kfolds(y, 5)

# Display the number of records in each fold
print("X Folds Sizes:", [len(fold) for fold in X_folds])
print("Y Folds Sizes:", [len(fold) for fold in y_folds])


X Folds Sizes: [292, 292, 292, 292, 292]
Y Folds Sizes: [292, 292, 292, 292, 292]


### Perform a Linear Regression for Each Fold and Calculate the Test Error

Remember that for each fold you will need to concatenate all but one of the folds to represent the training data, while the one remaining fold represents the test data.

In [15]:
# Collect the test MSE from each of the k folds
test_errs = []
k = 5

for n in range(k):
    # Split into train and test for the fold
    X_train = pd.concat([X_folds[i] for i in range(k) if i != n])  # Combine all but the nth fold
    X_test = X_folds[n]  # The nth fold is the test set
    y_train = pd.concat([y_folds[i] for i in range(k) if i != n])  # Combine all but the nth fold
    y_test = y_folds[n]  # The nth fold is the test set
    
    # Fit a linear regression model
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)

    
    # Evaluate test errors
    y_pred = lin_reg.predict(X_test)
    test_errs.append(mean_squared_error(y_test, y_pred))

print(test_errs)

[0.1243154614843743, 0.19350064631313127, 0.18910530431311184, 0.17079325250026917, 0.2074270458891695]


If your code was written correctly, these should be the same errors as scikit-learn produced with `cross_val_score` (within rounding error). Test this out below:

In [16]:
# Compare your results with sklearn results
# Perform 5-fold cross-validation using scikit-learn's cross_val_score
sklearn_mse_scores = -cross_val_score(lin_reg, X, y, cv=5, scoring='neg_mean_squared_error')

# Compare results
print("Manual K-Fold Test Errors:", test_errs)
print("Scikit-Learn Cross-Validation Test Errors:", sklearn_mse_scores)

# Check if the results are similar within a small rounding difference
difference = np.abs(np.array(test_errs) - sklearn_mse_scores)
print("Difference between Manual and Scikit-Learn MSEs:", difference)

Manual K-Fold Test Errors: [0.1243154614843743, 0.19350064631313127, 0.18910530431311184, 0.17079325250026917, 0.2074270458891695]
Scikit-Learn Cross-Validation Test Errors: [0.12431546 0.19350065 0.1891053  0.17079325 0.20742705]
Difference between Manual and Scikit-Learn MSEs: [0. 0. 0. 0. 0.]
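
The exact match is expected. When `cv` is an integer and the estimator is a regressor, `cross_val_score` splits the data with `KFold` without shuffling, which yields the same contiguous folds as `kfolds()`. The short sketch below shows the test indices `KFold` produces for a 7-row dataset and 3 folds; `n_samples`, `kf`, and `test_index_sets` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, k = 7, 3

# shuffle defaults to False, so folds are contiguous blocks of rows
kf = KFold(n_splits=k)

# Only the number of rows matters for the split, so a dummy array is enough
dummy_X = np.zeros((n_samples, 1))
test_index_sets = [test_idx.tolist() for _, test_idx in kf.split(dummy_X)]
print(test_index_sets)  # [[0, 1, 2], [3, 4], [5, 6]]
```

If the rows of a dataset are ordered in a meaningful way, passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` is one way to randomize the folds, at the cost of no longer matching the manual version above.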


This was a bit of work! Hopefully you have a clearer understanding of the underlying logic for cross-validation if you attempted this exercise.

## Summary

Congratulations! You are now familiar with cross-validation and know how to use `cross_val_score()`. Remember that the results obtained from cross-validation are more robust than those from a single train-test split.