# Lesson 5.8: Cross-Validation

## Why One Split Isn't Enough

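A single train/test split can be lucky or unlucky: the score depends on which rows happen to land in the test set. Cross-validation evaluates your model several times on different splits and combines the results, so one odd split cannot dominate the estimate.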

### K-Fold Cross-Validation
1. Split the data into K roughly equal parts (folds)
2. Train on K-1 folds and test on the remaining fold
3. Repeat K times, so that each fold serves as the test set exactly once
4. Average the K scores
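
The four steps above can be written out by hand. The following is a minimal sketch, not the lesson's own implementation: it assumes scikit-learn's `KFold` splitter and uses made-up toy data (`X_demo` and `y_demo` are hypothetical names, not part of the water filter dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy data for illustration only (hypothetical): 100 samples, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Step 1: split the data into K folds (shuffled so row order does not matter)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X_demo):   # Step 3: one iteration per fold
    clf = LogisticRegression()                 # fresh model for every fold
    clf.fit(X_demo[train_idx], y_demo[train_idx])   # Step 2: train on K-1 folds...
    fold_scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))  # ...test on the held-out fold

mean_score = np.mean(fold_scores)              # Step 4: average the K scores
print(f"Fold scores: {np.round(fold_scores, 3)}")
print(f"Mean: {mean_score:.3f}")
```

Creating a new model inside the loop matters: reusing a fitted model would let information from earlier folds carry over.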

### Analogy
It's like rotating the code reviewer role on a team: everyone gets a turn, and you get a more complete picture than any single review would give.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

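# Synthetic water filter data: as a filter ages, output TDS rises and flow rate drops
np.random.seed(42)
n = 300
age = np.random.randint(10, 365, n)  # filter age in days (10 to 364)
tds = 30 + age * 0.25 + np.random.randn(n) * 15  # TDS output: trends up with age, plus noise
flow = 2.5 - age * 0.004 + np.random.randn(n) * 0.3  # flow rate: trends down with age, plus noise

X = pd.DataFrame({'tds_output': tds, 'flow_rate': flow, 'age_days': age})
# Label: 1 = bad filter (TDS above 80 or flow below 1.0), 0 = OK
y = ((tds > 80) | (flow < 1.0)).astype(int)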

In [None]:
# Cross-validation in ONE line!
model = LogisticRegression(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')  # 5-fold

print(f"5-Fold CV Scores: {scores}")
print(f"Mean: {scores.mean():.3f} (+/- {scores.std():.3f})")
# The +/- is the standard deviation across the 5 folds:
# a small value means the score is STABLE from split to split
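
When `cv` is an integer and the model is a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same class balance as the full dataset. The sketch below makes that choice explicit by passing a `StratifiedKFold` object instead of `cv=5`. It uses hypothetical toy data (`X_demo`, `y_demo`) with a deliberately imbalanced label, and adds shuffling, which the integer shortcut does not do.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Toy imbalanced data for illustration only (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 1.0).astype(int)  # positives are the minority class

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positives in each test fold: should be nearly identical across folds
fold_pos_rates = [y_demo[test_idx].mean() for _, test_idx in cv.split(X_demo, y_demo)]
print("Positive rate per fold:", np.round(fold_pos_rates, 3))

# Pass the splitter object wherever an integer cv would go
scores_demo = cross_val_score(LogisticRegression(), X_demo, y_demo, cv=cv, scoring='accuracy')
print(f"Mean: {scores_demo.mean():.3f} (+/- {scores_demo.std():.3f})")
```

Passing a splitter object is also the way to control shuffling and the random seed, which can matter if the rows are stored in a meaningful order.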

In [None]:
# Compare multiple models with CV
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

print("Model Comparison (5-Fold CV):")
print(f"{'Model':<25} {'Mean Accuracy':>15} {'Std':>10}")
print("-" * 52)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name:<25} {scores.mean():>15.3f} {scores.std():>10.3f}")

In [None]:
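# GridSearchCV - find the best hyperparameters automatically!
# Like A/B testing different configs: every combination in the grid is scored with CV.
# Here that is 3 x 4 = 12 combinations x 5 folds = 60 fits, plus a final refit on all the data.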

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='recall',  # We care about catching bad filters!
    n_jobs=-1  # Use all CPU cores
)

grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best mean CV recall: {grid_search.best_score_:.3f}")
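
The best parameters are only part of the picture. A fitted `GridSearchCV` also stores the score of every combination in its `cv_results_` attribute, which helps show whether the winner is clearly ahead or roughly tied with simpler settings. The sketch below is illustrative only: it uses hypothetical toy data and a deliberately small grid so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data for illustration only (hypothetical)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Small grid (2 x 2 = 4 combinations) to keep the example fast
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [10, 25], 'max_depth': [2, None]},
    cv=3,
    scoring='recall',
)
search.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easier to read
results = pd.DataFrame(search.cv_results_)
cols = ['param_n_estimators', 'param_max_depth',
        'mean_test_score', 'std_test_score', 'rank_test_score']
summary = results[cols].sort_values('rank_test_score')
print(summary.to_string(index=False))
```

If several rows have nearly the same `mean_test_score`, the smaller or shallower model is often a reasonable pick, since the differences may be within the fold-to-fold noise shown in `std_test_score`.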

## Exercise

1. Compare the models using 'recall' instead of 'accuracy' as the scoring metric
2. Use GridSearchCV on DecisionTreeClassifier to find the best max_depth
3. Which model would you choose for the water filter project? Why?

In [None]:
# YOUR CODE HERE