# Chapter 2.5: Polynomial Regression

Goal: Create polynomial features and select the optimal degree to avoid overfitting.

### Topics:
- Creating polynomial features with `PolynomialFeatures`
- Comparing models of different degrees
- Identifying overfitting (train-val gap)
- Selecting optimal polynomial degree

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

## Quick Recap

- **Polynomial features** turn X into [X, X², X³, ...] so that a linear model can fit curves
- Higher degree = more flexible, but higher risk of **overfitting**
- Overfitting: high training score, noticeably lower validation score
- Use validation set to choose the best degree

In [None]:
# Load diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Use a subset for faster training
diamonds_sample = diamonds.sample(n=5000, random_state=42)

# We'll predict price from carat (single feature for visualization)
X = diamonds_sample[['carat']].values
y = diamonds_sample['price'].values

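# Three-way split: 60% train, 20% val, 20% test
# First hold out 20% of the data as the test set
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 of the total)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)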

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")
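
The second `test_size` is 0.25 rather than 0.2 because it applies to the 80% left over after the test split. A quick check of the resulting proportions on placeholder data might look like the sketch below; `X_demo` and `y_demo` are stand-ins, not the diamonds sample.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the sampled data
X_demo = np.arange(5000).reshape(-1, 1)
y_demo = np.arange(5000)

# Same two-step split as in the cell above
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Fraction of the full data that ends up in each part
fractions = {
    "train": len(X_tr) / len(X_demo),
    "val": len(X_va) / len(X_demo),
    "test": len(X_te) / len(X_demo),
}
print(fractions)
```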

## Practice

### 1. Create degree-2 polynomial features from `carat`

In [None]:
# Step 1: Create PolynomialFeatures with degree=2
poly2 = PolynomialFeatures(degree=2, include_bias=False)

# Step 2: Fit on training data and transform train, val, test
X_train_poly2 = poly2.fit_transform(X_train)
X_val_poly2 = poly2.transform(X_val)
X_test_poly2 = poly2.transform(X_test)

# See what features were created
print(f"Original features: {X_train.shape[1]}")
print(f"Polynomial features: {X_train_poly2.shape[1]}")
print(f"Feature names: {poly2.get_feature_names_out()}")

### 2. Fit linear regression on polynomial features, calculate train and val R²

In [None]:
# Step 1: Fit LinearRegression on polynomial features
model_poly2 = LinearRegression()
model_poly2.fit(X_train_poly2, y_train)

# Step 2: Calculate R² on train and validation
train_r2 = model_poly2.score(X_train_poly2, y_train)
val_r2 = model_poly2.score(X_val_poly2, y_val)

print(f"Degree 2 - Train R²: {train_r2:.4f}, Val R²: {val_r2:.4f}")
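
For a `LinearRegression`, `model.score` returns R², the same quantity that the imported `r2_score` computes: 1 minus the ratio of the residual sum of squares to the total sum of squares. As a rough sketch on synthetic data (the data-generating formula below is an arbitrary assumption), all three calculations should agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 3, size=(200, 1))
y_demo = 2.0 * X_demo[:, 0] ** 2 + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R² by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"manual:   {r2_manual:.6f}")
print(f"r2_score: {r2_score(y_demo, y_pred):.6f}")
print(f"score():  {model.score(X_demo, y_demo):.6f}")
```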

### 3. Repeat for degree 3, 4, 5

In [None]:
# Degree 3
# Step 1: Create PolynomialFeatures(degree=3)
# Step 2: Transform data
# Step 3: Fit model and calculate scores



In [None]:
# Degree 4



In [None]:
# Degree 5



### 4. Create a table showing degree, train R², val R²

In [None]:
# Let's do this systematically with a loop
results = []

for degree in range(1, 8):  # Degrees 1 through 7
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_p = poly.fit_transform(X_train)
    X_val_p = poly.transform(X_val)
    
    # Fit model
    model = LinearRegression()
    model.fit(X_train_p, y_train)
    
    # Calculate scores
    train_score = model.score(X_train_p, y_train)
    val_score = model.score(X_val_p, y_val)
    
    results.append({
        'Degree': degree,
        'Train R²': train_score,
        'Val R²': val_score,
        'Gap': train_score - val_score
    })

results_df = pd.DataFrame(results)
results_df

### 5. Which degree shows the largest train-val gap (overfitting)?

In [None]:
# Visualize the results
plt.figure(figsize=(10, 6))
plt.plot(results_df['Degree'], results_df['Train R²'], 'o-', label='Train R²')
plt.plot(results_df['Degree'], results_df['Val R²'], 's-', label='Validation R²')
plt.xlabel('Polynomial Degree')
plt.ylabel('R² Score')
plt.title('Train vs Validation R² by Polynomial Degree')
plt.legend()
plt.grid(True)
plt.show()

**Your observation:** Which degree has the largest gap between train and val R²? A train R² that sits well above the val R² is a sign of overfitting.

(Write your answer here)
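
With thousands of rows and a single feature, the gap on the diamonds sample may turn out to be small. The effect tends to be easier to see when the training set is tiny. The sketch below uses synthetic data (a noisy sine curve, chosen arbitrarily) and only 20 training points; the degrees tried and the helper `make_data` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

def make_data(n):
    """Hypothetical data generator: noisy sine curve on [-1, 1]."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(0, 0.3, size=n)
    return x, y

X_tr, y_tr = make_data(20)    # deliberately small training set
X_va, y_va = make_data(200)   # larger validation set

scores = {}
for degree in (1, 3, 12):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    train_r2 = model.score(poly.transform(X_tr), y_tr)
    val_r2 = model.score(poly.transform(X_va), y_va)
    scores[degree] = (train_r2, val_r2)
    print(f"Degree {degree:2d} - Train R²: {train_r2:.3f}, "
          f"Val R²: {val_r2:.3f}, Gap: {train_r2 - val_r2:.3f}")
```

Train R² cannot decrease as the degree grows, because each higher-degree model contains the lower-degree ones as special cases. Validation R² has no such guarantee, which is why the gap is informative.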

### 6. Which degree would you choose and why?

In [None]:
# Find the degree with highest validation R²
best_degree = results_df.loc[results_df['Val R²'].idxmax(), 'Degree']
best_val_r2 = results_df['Val R²'].max()

print(f"Best degree based on validation: {best_degree}")
print(f"Validation R² at best degree: {best_val_r2:.4f}")

**Your recommendation:** Which degree would you choose for this model? Explain your reasoning.

(Write your answer here - consider both validation performance AND simplicity)
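
One possible way to weigh validation performance against simplicity is to pick the lowest degree whose validation R² is within a small tolerance of the best one. This is only a sketch of one heuristic, not the required answer. The scores in `demo_results` are made up for illustration and are not results from the diamonds data, and the `tolerance` value is an arbitrary assumption.

```python
import pandas as pd

# Made-up validation scores, for illustration only
demo_results = pd.DataFrame({
    "Degree": [1, 2, 3, 4, 5],
    "Val R²": [0.800, 0.850, 0.864, 0.870, 0.866],
})

tolerance = 0.01  # hypothetical: how much validation R² we are willing to give up

best_val = demo_results["Val R²"].max()

# Keep every degree that is "close enough" to the best validation score
candidates = demo_results[demo_results["Val R²"] >= best_val - tolerance]

# Among those, prefer the simplest (lowest-degree) model
simplest_degree = int(candidates["Degree"].min())

print(f"Best validation R²: {best_val:.3f}")
print(f"Simplest degree within {tolerance} of the best: {simplest_degree}")
```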

## Bonus: Visualize the polynomial fits

In [None]:
# Create a smooth line for plotting
X_line = np.linspace(X_train.min(), X_train.max(), 100).reshape(-1, 1)

plt.figure(figsize=(12, 8))
plt.scatter(X_val, y_val, alpha=0.3, label='Validation Data')

for degree in [1, 2, 3, 5]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    X_train_p = poly.fit_transform(X_train)
    X_line_p = poly.transform(X_line)
    
    model = LinearRegression()
    model.fit(X_train_p, y_train)
    y_line = model.predict(X_line_p)
    
    plt.plot(X_line, y_line, label=f'Degree {degree}')

plt.xlabel('Carat')
plt.ylabel('Price ($)')
plt.title('Polynomial Regression Fits of Different Degrees')
plt.legend()
plt.show()

## Discussion Question

If you were presenting this model to a stakeholder, would you prefer degree 2 with validation R² = 0.85 or degree 5 with validation R² = 0.87? Why?

(Discuss with a neighbor)
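
The three-way split also sets aside a test set, which is meant to be scored only once, after the degree has been chosen on the validation set. A minimal sketch of that last step follows, using synthetic data rather than the diamonds sample. The helper `fit_degree`, the data-generating formula, and the use of `make_pipeline` to bundle the two steps are assumptions made for brevity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data, for illustration only
rng = np.random.default_rng(7)
X_demo = rng.uniform(0, 3, size=(1000, 1))
y_demo = 2.0 * X_demo[:, 0] ** 2 + rng.normal(0, 1.0, size=1000)

# Same 60/20/20 split pattern as in the notebook
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

def fit_degree(degree):
    """Hypothetical helper: polynomial expansion + linear regression, fit on train."""
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    return pipe.fit(X_tr, y_tr)

# Choose the degree using the validation set only
val_scores = {d: fit_degree(d).score(X_va, y_va) for d in range(1, 6)}
chosen = max(val_scores, key=val_scores.get)

# Score the chosen model on the test set once, as the final estimate
final_model = fit_degree(chosen)
test_r2 = final_model.score(X_te, y_te)

print(f"Chosen degree: {chosen}")
print(f"Test R²: {test_r2:.4f}")
```

Because the test set played no part in choosing the degree, its R² is a less optimistic estimate of how the model behaves on new data than the validation score used for selection.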