# Chapter 2.5: Polynomial Regression

Goal: Create polynomial features and select the optimal degree to avoid overfitting.

### Topics:
- Creating polynomial features with `PolynomialFeatures`
- Comparing models of different degrees
- Identifying overfitting (train-val gap)
- Selecting optimal polynomial degree

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score

## Quick Recap

- **Polynomial features** turn X into [X, X², X³, ...] to capture curves
- Higher degree = more flexible, but risk of **overfitting**
- Overfitting shows up as a high training score paired with a noticeably lower validation score
- Use the validation set to choose the best degree
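
To make the first bullet concrete, here is a minimal sketch of what `PolynomialFeatures` produces for a single feature. The array `X_toy` holds made-up values (not the diamonds data) and the variable names are only illustrative. Note that the default `include_bias=True` adds a leading column of ones; passing `include_bias=False` keeps just the powers of X.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy single-feature input (made-up values, not the diamonds data)
X_toy = np.array([[1.0], [2.0], [3.0]])

# Default include_bias=True: columns are [1, x, x^2, x^3]
poly_with_bias = PolynomialFeatures(degree=3)
X_with_bias = poly_with_bias.fit_transform(X_toy)

# include_bias=False: columns are [x, x^2, x^3]
# (LinearRegression fits its own intercept by default)
poly_no_bias = PolynomialFeatures(degree=3, include_bias=False)
X_no_bias = poly_no_bias.fit_transform(X_toy)

print(X_with_bias.shape)  # (3, 4)
print(X_no_bias)
```

Either setting works with `LinearRegression`; the bias column is simply redundant with the intercept the model already fits.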

In [None]:
# Load diamonds dataset
diamonds = sns.load_dataset('diamonds')

# Use a subset for faster training
diamonds_sample = diamonds.sample(n=5000, random_state=42)

# We'll predict price from carat (single feature for visualization)
...

# Three-way split: 60% train, 20% val, 20% test
...

## Practice

### 1. Create degree-2 polynomial features from `carat`

In [None]:
# Step 1: Create PolynomialFeatures with degree=2
...

# Step 2: Fit on training data and transform train, val, test
...


### 2. Fit linear regression on polynomial features, calculate train and val R²

In [None]:
# Step 1: Fit LinearRegression on polynomial features
...

# Step 2: Calculate R² on train and validation
...


### 3. Repeat for degrees 3, 4, and 5

In [None]:
# Degree 3
# Step 1: Create PolynomialFeatures(degree=3)


# Step 2: Transform data


# Step 3: Fit model and calculate scores



In [None]:
# Degree 4



In [None]:
# Degree 5



### 4. Create a table showing degree, train R², val R²

In [None]:
# Create a DataFrame with one row per polynomial degree and columns for train R² and val R²
...

### 5. Which degree shows the largest train-val gap (overfitting)?

In [None]:
# Visualize the results


**Your observation:** Which degree has the largest gap between train and val R²? A large gap is a sign of overfitting.

(Write your answer here)
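
As a rough illustration of the train-val gap, the sketch below fits degrees 1 through 5 on synthetic data and reports train R² minus val R² for each. Everything here is an assumption for demonstration: the data are generated from a noisy quadratic, and the names (`gaps`, `train_scores`) are hypothetical. It is not the diamonds exercise, and the numbers it prints say nothing about that dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): a noisy quadratic in one feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = 2.0 * X[:, 0] ** 2 + rng.normal(0, 1.0, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

train_scores = {}
gaps = {}
for degree in [1, 2, 3, 4, 5]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    # Fit the transformer on training data only, then reuse it for validation
    X_train_poly = poly.fit_transform(X_train)
    X_val_poly = poly.transform(X_val)

    model = LinearRegression().fit(X_train_poly, y_train)
    train_r2 = r2_score(y_train, model.predict(X_train_poly))
    val_r2 = r2_score(y_val, model.predict(X_val_poly))

    train_scores[degree] = train_r2
    gaps[degree] = train_r2 - val_r2
    print(f"degree={degree}  train R2={train_r2:.3f}  "
          f"val R2={val_r2:.3f}  gap={gaps[degree]:.3f}")
```

Training R² can only stay flat or rise as the degree grows, because each higher-degree model contains the lower-degree one. The validation score has no such guarantee, which is why the gap is informative.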

### 6. Which degree would you choose and why?

In [None]:
# Pick the model with the highest validation R²


**Your recommendation:** Which degree would you choose for this model? Explain your reasoning.

(Write your answer here - consider both validation performance AND simplicity)
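
One way to weigh validation performance against simplicity is to pick the lowest degree whose validation R² is within a small tolerance of the best one. The sketch below is just one possible rule, not a prescribed method: the helper `choose_degree`, its `tolerance` default, and the scores in `example_scores` are all hypothetical and made up for illustration.

```python
def choose_degree(val_scores, tolerance=0.01):
    """Return the smallest degree whose validation R2 is within
    `tolerance` of the best validation R2."""
    best = max(val_scores.values())
    candidates = [d for d, score in val_scores.items()
                  if best - score <= tolerance]
    return min(candidates)


# Made-up scores for illustration only (not results from the diamonds data)
example_scores = {1: 0.80, 2: 0.85, 3: 0.852, 4: 0.851}

print(choose_degree(example_scores))               # 2
print(choose_degree(example_scores, tolerance=0))  # 3
```

With `tolerance=0` the rule reduces to taking the highest validation R²; a larger tolerance trades a little validation score for a simpler model.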

## Visualize the polynomial fits

In [None]:
# Plot the data points as a scatter and overlay the fitted polynomial curve