# Lesson 5.7: Overfitting & Underfitting

## The Core ML Challenge

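- **Underfitting**: The model is too simple to capture the pattern, like fitting a straight line to curved data. Both train and test scores are poor.
- **Overfitting**: The model is too complex and memorizes the training data, noise included, instead of learning the pattern. Train score is high, test score is much lower.
- **Just right**: Captures the pattern without memorizing noise. Train and test scores are both good and close together.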

### PHP Parallel
- Underfitting: Validation with only `required` - too loose, misses bad data
- Overfitting: Validation so specific it only matches YOUR test data
- Just right: Smart validation rules that generalize

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

%matplotlib inline

In [None]:
# Generate synthetic non-linear data: a noisy sine wave that rises and falls over 0-10
# (a stand-in for a curved TDS signal)
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
y = 3 * np.sin(X).ravel() + np.random.randn(30) * 0.5

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Visualize: Underfitting vs Just Right vs Overfitting
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)

degrees = [1, 4, 15]  # Simple, Good, Too complex
titles = ['Underfitting (degree=1)', 'Just Right (degree=4)', 'Overfitting (degree=15)']

for ax, degree, title in zip(axes, degrees, titles):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    
    train_score = r2_score(y_train, model.predict(X_train))
    test_score = r2_score(y_test, model.predict(X_test))
    
    ax.scatter(X_train, y_train, color='blue', alpha=0.5, label='Train')
    ax.scatter(X_test, y_test, color='red', alpha=0.5, label='Test')
    ax.plot(X_plot, model.predict(X_plot), 'g-', linewidth=2)
    ax.set_title(f'{title}\nTrain R²={train_score:.2f}, Test R²={test_score:.2f}')
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()
print("Notice: Overfitting has GREAT train score but TERRIBLE test score!")

In [None]:
# How to detect overfitting: big gap between train and test scores
train_scores = []
test_scores = []
degrees_range = range(1, 16)

for d in degrees_range:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_scores.append(r2_score(y_train, model.predict(X_train)))
    test_scores.append(r2_score(y_test, model.predict(X_test)))

plt.figure(figsize=(8, 5))
plt.plot(degrees_range, train_scores, 'b-o', label='Train Score')
plt.plot(degrees_range, test_scores, 'r-o', label='Test Score')
plt.xlabel('Model Complexity (polynomial degree)')
plt.ylabel('R² Score')
plt.title('Overfitting: train keeps rising, test eventually falls')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
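The gap in the plot above can also be read off as numbers. The following is a minimal sketch that rebuilds the same synthetic data and split, then prints train R², test R², and their difference for each degree. The names `rows` and `best_degree` are introduced here for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data and split as the cells above
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
y = 3 * np.sin(X).ravel() + np.random.randn(30) * 0.5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rows = []  # one (degree, train R², test R², gap) tuple per degree
for d in range(1, 16):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train))
    test_r2 = r2_score(y_test, model.predict(X_test))
    rows.append((d, train_r2, test_r2, train_r2 - test_r2))

for d, train_r2, test_r2, gap in rows:
    print(f"degree={d:2d}  train={train_r2:8.2f}  test={test_r2:12.2f}  gap={gap:12.2f}")

# Degree with the highest test R² on this particular split
best_degree = max(rows, key=lambda r: r[2])[0]
print(f"Highest test R² at degree {best_degree}")
```

A large positive gap is the warning sign. Keep in mind that picking the degree by test score means the test set is no longer a fully independent check, and with only 30 points the result will shift with the random split.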

## How to Fix Overfitting

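1. **More data** - noise is harder to memorize when there are more examples
2. **Simpler model** - reduce depth, polynomial degree, or number of features
3. **Regularization** - add a penalty for complexity (e.g. large coefficients)
4. **Cross-validation** - does not fix overfitting by itself, but gives a more robust evaluation for choosing the right complexity (next lesson!)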
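As a rough illustration of item 3, the sketch below fits the same degree-15 polynomial twice: once with plain `LinearRegression` and once with `Ridge`, which penalizes large coefficients. The `StandardScaler` step, `alpha=1.0`, and the helper `poly_model` are choices made for this example, not something the lesson prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Same synthetic data and split as the cells above
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 30)).reshape(-1, 1)
y = 3 * np.sin(X).ravel() + np.random.randn(30) * 0.5
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

def poly_model(regressor, degree=15):
    """Hypothetical helper: polynomial features -> scaling -> regressor."""
    return make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),  # put all powers of x on a comparable scale
        regressor,
    )

plain = poly_model(LinearRegression()).fit(X_train, y_train)
ridge = poly_model(Ridge(alpha=1.0)).fit(X_train, y_train)  # alpha is an assumed value

# Size of the fitted coefficient vectors (the last pipeline step is the regressor)
plain_norm = np.linalg.norm(plain[-1].coef_)
ridge_norm = np.linalg.norm(ridge[-1].coef_)

plain_train = r2_score(y_train, plain.predict(X_train))
ridge_train = r2_score(y_train, ridge.predict(X_train))
plain_test = r2_score(y_test, plain.predict(X_test))
ridge_test = r2_score(y_test, ridge.predict(X_test))

print(f"No penalty: |coef|={plain_norm:.2f}  train R²={plain_train:.2f}  test R²={plain_test:.2f}")
print(f"Ridge:      |coef|={ridge_norm:.2f}  train R²={ridge_train:.2f}  test R²={ridge_test:.2f}")
```

The penalty shrinks the coefficients and gives up some training fit in exchange for a smoother curve. Whether the test score improves, and by how much, depends on `alpha` and on the split, so try a few values rather than treating 1.0 as a recommendation.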

## Exercise

1. Train a decision tree (from `sklearn.tree`) with `max_depth=2` and with `max_depth=None`. Which one overfits?
2. Plot train vs. test accuracy for `max_depth` values from 1 to 20.
3. Find the "sweet spot" depth where the test score is highest.

In [None]:
# YOUR CODE HERE