# Polynomial Regression and Overfitting

In this module, you will learn how to work with polynomial regression in scikit-learn. We will also introduce the concept of overfitting and learn how to avoid it.

<b>Functions and attributes in this lecture:</b>
- `pandas` - Pandas package with alias `pd`
  - `.copy()` - Make a copy of a pandas DataFrame
- `sklearn.preprocessing` - Submodule for preprocessing tools like PolynomialFeatures
  - `PolynomialFeatures` - For creating polynomial features
  - `.fit_transform()` - Fitting and transforming the data to form new polynomial features
- `sklearn.metrics` - Submodule for metrics used to evaluate models
  - `mean_absolute_error` - Computes the mean absolute error between true and predicted values
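
As a quick illustration of the metric listed above, here is a minimal sketch of `mean_absolute_error` on a few made-up numbers. The values in `y_true` and `y_pred` are hypothetical and chosen only so the result is easy to check by hand.

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical toy values, not taken from the diabetes dataset
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 7.0, 2.0]

# Absolute errors are 1, 2 and 0, so the mean is (1 + 2 + 0) / 3 = 1.0
mae = mean_absolute_error(y_true, y_pred)
print(mae)
```

The metric is in the same units as the target, which makes it easy to interpret.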

In [None]:
# Non-sklearn packages
import numpy as np
import pandas as pd

# Sklearn packages
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

In [None]:
# Load in the diabetes dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Get the description of the dataset
print(load_diabetes()["DESCR"])

## Manually Create Polynomial Features

We begin by manually creating polynomial features in pandas and measuring how they affect the error of our models.
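
One possible way to carry out these steps is sketched below. The four columns in `cols`, the particular squared and cross terms, the helper `holdout_mae`, and the split settings (`test_size=0.2`, `random_state=42`) are all assumptions made for illustration; the lecture may use different choices.

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Assumption: an illustrative subset of four columns
cols = ["bmi", "bp", "s4", "s5"]
X_small = X[cols]

# Copy so that the selected frame is left untouched
X_poly = X_small.copy()

# Hand-made squared terms and one cross term (illustrative choice)
X_poly["bmi_sq"] = X_poly["bmi"] ** 2
X_poly["s5_sq"] = X_poly["s5"] ** 2
X_poly["bmi_x_s5"] = X_poly["bmi"] * X_poly["s5"]


def holdout_mae(features, target):
    """Hypothetical helper: fit a linear model and return the test MAE."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    model = LinearRegression().fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


mae_linear = holdout_mae(X_small, y)
mae_manual = holdout_mae(X_poly, y)
print(f"Linear features only:   {mae_linear:.2f}")
print(f"With hand-made features: {mae_manual:.2f}")
```

Using the same `random_state` for both calls gives both models the same train/test rows, so the two scores are directly comparable.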

In [None]:
# Check the correlation with the target variable


In [None]:
# Selecting only some of the columns


In [None]:
# Creating a copy for further modification


In [None]:
# Manually creating cross_terms


In [None]:
# Shows the new columns we have created


In [None]:
# Showing the correlation again


In [None]:
# Splitting into test sets and training sets


In [None]:
# Creating, training, and predicting the polynomial model


In [None]:
# Use mean-absolute-error to measure the model


In [None]:
# Create a linear model


In [None]:
# Shows score of the purely linear model


## Using Scikit-Learn's Polynomial Features

We now introduce scikit-learn's built-in `PolynomialFeatures` for handling polynomial features.

In [None]:
# Importing polynomial features


In [None]:
# We are working with four features


In [None]:
# Creates polynomial features automatically


In [None]:
# Split into training and testing set


In [None]:
# Create model, train, and test


In [None]:
# Getting the mean-absolute-error score


## Fitting Everything Into a Pipeline

We now combine the previous steps into a pipeline, which improves reproducibility and helps avoid data leakage.
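
The idea can be sketched as follows, with the scaler placed after the polynomial step. The step names (`"poly"`, `"scaler"`, `"model"`), the four columns, the degree, and the split settings are assumptions made for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Assumption: the same illustrative subset of four columns
X_small = X[["bmi", "bp", "s4", "s5"]]

X_train, X_test, y_train, y_test = train_test_split(
    X_small, y, test_size=0.2, random_state=42
)

# Polynomial expansion first, then scaling, then the regression
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])

# Every step is fitted on the training data only
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
print(f"Pipeline test MAE: {mae:.2f}")
```

Because the scaler's mean and variance are learned inside `pipe.fit` from the training rows alone, no information from the test set enters preprocessing. Ordinary least squares predictions should not change when the features are standardized, so this score is expected to agree with an unscaled fit up to floating-point error.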

In [None]:
# We are working with four features


In [None]:
# Dividing into training and testing sets


In [None]:
# Creating a pipeline (use scaler after the polynomial features!)


In [None]:
# Fitting and predicting


In [None]:
# We get the same score


## Checking if You Are Overfitting

We will now loop through several polynomial degrees and compare their errors to check which one is best.
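
A sketch of such a loop is given below. The range of degrees (1 to 5), the four columns, and the split settings are assumptions for illustration. Recording the training error next to the test error is an addition of this sketch: a training error that keeps falling while the test error rises is the usual sign of overfitting.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Assumption: the same illustrative subset of four columns
X_small = X[["bmi", "bp", "s4", "s5"]]

X_train, X_test, y_train, y_test = train_test_split(
    X_small, y, test_size=0.2, random_state=42
)

train_mae = {}
test_mae = {}

# Assumption: degrees 1 through 5 are enough to see the trend
for degree in range(1, 6):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ])
    pipe.fit(X_train, y_train)

    train_mae[degree] = mean_absolute_error(y_train, pipe.predict(X_train))
    test_mae[degree] = mean_absolute_error(y_test, pipe.predict(X_test))
    print(
        f"degree={degree}  "
        f"train MAE={train_mae[degree]:.2f}  "
        f"test MAE={test_mae[degree]:.2f}"
    )

# Degree with the lowest error on the held-out data
best_degree = min(test_mae, key=test_mae.get)
print(f"Lowest test MAE at degree {best_degree}")
```

Picking the degree from a single test split is itself a form of tuning on the test set; a separate validation split or cross-validation would give a less optimistic estimate.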

In [None]:
# We are working with four features


In [None]:
# Split into training and testing


In [None]:
# Checking the error for various degrees
