# Chapter 2.1: Train/Test Split

Goal: Practice proper data splitting and understand why the validation set matters.

### Topics:
- Creating train/validation/test splits
- Using `train_test_split` correctly
- Implementing cross-validation
- Understanding why we need three sets

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression

## Quick Recap

- **Training set**: Model learns from this data
- **Validation set**: Used to tune hyperparameters and compare models
- **Test set**: Final evaluation only - touched ONCE at the very end

Why three sets? If you tune on the test data, the test score stops being an honest estimate of performance on unseen data: you've effectively given the model the answers to the test.
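To see how strong this effect can be, here is a minimal simulation. It is a sketch under stated assumptions, not part of the activity: the labels are pure noise, and the "tuning" consists of trying random linear rules and keeping whichever scores best on the test set. All names (`accuracy`, `candidates`, `X_fresh`, and so on) are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure-noise binary problem (assumption): the features carry no information
# about the labels, so no rule can truly beat 50% accuracy.
n_features = 5
X_test = rng.normal(size=(100, n_features))
y_test = rng.integers(0, 2, size=100)
X_fresh = rng.normal(size=(10_000, n_features))
y_fresh = rng.integers(0, 2, size=10_000)


def accuracy(weights, X, y):
    """Accuracy of the linear rule 'predict 1 when X @ weights > 0'."""
    predictions = (X @ weights > 0).astype(int)
    return np.mean(predictions == y)


# "Tune" by trying 500 random weight vectors and keeping the one that
# scores best on the TEST set.
candidates = rng.normal(size=(500, n_features))
test_scores = np.array([accuracy(w, X_test, y_test) for w in candidates])
best_weights = candidates[test_scores.argmax()]

best_test_acc = test_scores.max()
fresh_acc = accuracy(best_weights, X_fresh, y_fresh)

print(f"Accuracy on the test set used for tuning: {best_test_acc:.3f}")
print(f"Accuracy on fresh, untouched data:        {fresh_acc:.3f}")
```

The winning rule looks better than chance on the test set only because it was selected there. On data it has never seen, it falls back to roughly 50%.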

In [2]:
# For this activity, we'll use a simpler dataset - California Housing
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame
df.head()

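|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |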


In [None]:
# Set the y (MedHouseVal) and X (everything else)


## Practice

### 1. Split data into train (60%), validation (20%), test (20%)

You'll need to call `train_test_split` twice:
1. First split: 80% train+val, 20% test
2. Second split: divide the 80% train+val portion so that you end up with 60% train and 20% validation, both measured as fractions of the original dataset. What `test_size` does that require?

*Hint: let $x$ be the fraction of the train+val portion that stays in training. Then $0.8x = 0.6$; solve for $x$.*
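The same reasoning works for any target proportions: the second `test_size` is the validation fraction divided by whatever is left after the test set is removed. Here is a sketch with different numbers (70/15/15) on synthetic stand-in data, so the exercise is still yours to solve. The `_demo` variables and the `val_share` name are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1,000 rows, 3 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.normal(size=1000)

# Target fractions of the ORIGINAL data: 70% train, 15% val, 15% test.
val_frac, test_frac = 0.15, 0.15

# First split: hold out the test set.
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=test_frac, random_state=42
)

# Second split: test_size is now a fraction of the REMAINING data,
# so the validation fraction has to be rescaled.
val_share = val_frac / (1 - test_frac)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=val_share, random_state=42
)

print(f"Second test_size: {val_share:.4f}")
print("train / val / test:", len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Setting `random_state` on both calls makes the split reproducible, which matters when you compare results across runs.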

In [None]:
# Step 1: Split into train+val (80%) and test (20%)



# Step 2: Split the train+val data into train (60% of the original) and validation (20% of the original)


### 2. Assert that no data was lost in the split

The total number of samples across all three sets should equal the original dataset size.
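As an illustration of this kind of check, here is a sketch that splits a toy array of row IDs instead of the housing data. It also verifies that no row landed in two sets, which goes beyond what the exercise asks. The array and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy row IDs (assumption): 50 rows, split by absolute counts.
row_ids = np.arange(50)
rest_ids, test_ids = train_test_split(row_ids, test_size=10, random_state=0)
train_ids, val_ids = train_test_split(rest_ids, test_size=10, random_state=0)

# Check 1: the three sets together account for every row.
total = len(train_ids) + len(val_ids) + len(test_ids)
assert total == len(row_ids), f"Expected {len(row_ids)} rows, got {total}"

# Check 2 (extra): no row appears in more than one set.
assert not set(train_ids) & set(val_ids)
assert not set(train_ids) & set(test_ids)
assert not set(val_ids) & set(test_ids)

print(f"train={len(train_ids)}, val={len(val_ids)}, test={len(test_ids)}, total={total}")
```

An `assert` with a message fails loudly and tells you what went wrong, which is more useful than a silent mismatch discovered later.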

In [None]:
# Step 1: Calculate total samples across train, val, and test


# Step 2: Assert it equals len(X)


# Step 3: Print all sizes to verify everything looks correct


### 3. Use AI - Fit LinearRegression on train, predict on val AND test
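If you want something to check AI-generated code against, here is a minimal sketch of the pattern on synthetic data from `make_regression` (an assumption; it is not the housing data, so your numbers will differ). It shows two equivalent ways to get R²: the model's own `score` method and `r2_score` from `sklearn.metrics`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), NOT the housing data.
X_syn, y_syn = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=0)

# Split by absolute counts: 300 train, 100 validation, 100 test.
X_rest, X_te, y_rest, y_te = train_test_split(X_syn, y_syn, test_size=100, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=100, random_state=0)

# Fit on the training set ONLY.
model = LinearRegression()
model.fit(X_tr, y_tr)

# For regressors, .score() returns R².
val_r2 = model.score(X_va, y_va)

# Same metric computed explicitly from predictions.
test_r2 = r2_score(y_te, model.predict(X_te))

print(f"Validation R²: {val_r2:.4f}")
print(f"Test R²:       {test_r2:.4f}")
```

The key thing to verify in any generated solution is that `fit` only ever sees the training data.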

In [None]:
# Step 1: Create and fit a LinearRegression model on training data


# Step 2: Calculate R² score on validation set


# Step 3: Calculate R² score on test set


### 4. Use AI - Compare val score vs test score - are they similar?

If the validation and test scores are similar, that's a good sign - it means the validation set is giving us a reliable estimate of how the model will perform on unseen data.
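How close counts as "similar" depends partly on the luck of the split. The sketch below repeats the split with several `random_state` values on synthetic data (an assumption, not the housing data) and prints the gap each time, so you can see how much it moves. The variable names are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption), NOT the housing data.
X_syn, y_syn = make_regression(n_samples=1000, n_features=5, noise=20.0, random_state=0)

gaps = []
for seed in range(5):
    # 600 train / 200 validation / 200 test, reshuffled for each seed.
    X_rest, X_te, y_rest, y_te = train_test_split(
        X_syn, y_syn, test_size=200, random_state=seed
    )
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_rest, y_rest, test_size=200, random_state=seed
    )

    model = LinearRegression().fit(X_tr, y_tr)
    val_r2 = model.score(X_va, y_va)
    test_r2 = model.score(X_te, y_te)
    gaps.append(abs(val_r2 - test_r2))
    print(f"seed={seed}  val R²={val_r2:.4f}  test R²={test_r2:.4f}  gap={gaps[-1]:.4f}")
```

In this notebook you should still evaluate on the test set only once. Repeating the split here is just a way to build intuition on throwaway data.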

In [None]:
# Print both scores and calculate the difference


# Are they within 0.02 of each other? That's pretty good!

**Your interpretation:** Are the validation and test scores similar? What does this tell you?

(Write your answer here)

### 5. Use AI - Use `cross_val_score` with 5 folds on training data

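Cross-validation gives you several estimates of model performance, one per fold. Their average is usually more reliable than a single validation score, and their spread shows how much the score depends on the particular split.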
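For reference, here is a minimal sketch of the call on synthetic data (an assumption; not the housing data). With `cv=5` and a regressor, `cross_val_score` uses 5 folds and the estimator's default score, which is R² for `LinearRegression`. The test set is split off first and never passed to cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption), NOT the housing data.
X_syn, y_syn = make_regression(n_samples=500, n_features=4, noise=15.0, random_state=1)

# Hold out a test set first; cross-validation only sees the training portion.
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

# One R² score per fold: fit on 4 folds, score on the remaining one, 5 times.
cv_scores = cross_val_score(LinearRegression(), X_tr, y_tr, cv=5)

print("Fold scores:", cv_scores.round(4))
print(f"Mean: {cv_scores.mean():.4f}")
print(f"Std:  {cv_scores.std():.4f}")
```

A small standard deviation relative to the mean suggests the score is stable across folds. A large one means a single validation split could easily mislead you.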

In [None]:
# Step 1: Create a new LinearRegression model
# Step 2: Use cross_val_score with cv=5 on the TRAINING data (not all data!)


# Print all 5 scores


### 6. Use AI - What's the mean and std of cross-validation scores?

In [None]:
# Calculate mean and standard deviation of the CV scores


**Your interpretation:** How does the CV mean compare to your single validation score from earlier? Is the standard deviation large or small?

(Write your answer here)

## Discussion Question

Why shouldn't you tune hyperparameters on the test set? What would happen if you did?

(Discuss with a neighbor)