In this lab, you'll practice your knowledge of the bias-variance trade-off!
You will be able to:
- Look at an example where Polynomial regression leads to overfitting
- Understand how bias-variance trade-off relates to underfitting and overfitting
We'll try to predict movie revenues based on certain factors, such as budget and ratings.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
df = pd.read_excel('./movie_data_detailed_with_ols.xlsx')
df.head()
# Only keep four predictors and transform them with MinMaxScaler
scale = MinMaxScaler()
df = df[[ "domgross", "budget", "imdbRating", "Metascore", "imdbVotes"]]
transformed = scale.fit_transform(df)
pd_df = pd.DataFrame(transformed, columns = df.columns)
pd_df.head()
# domgross is the outcome variable
# Your code here
# Your code here
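As a sketch of what this step might look like (the DataFrame below is a made-up stand-in for the scaled `pd_df`; the column names follow the cell above), you could split the data and fit an ordinary least squares model:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the scaled DataFrame pd_df built above
rng = np.random.default_rng(0)
pd_df = pd.DataFrame(rng.random((30, 5)),
                     columns=["domgross", "budget", "imdbRating",
                              "Metascore", "imdbVotes"])

# domgross is the outcome variable; the other four columns are predictors
y = pd_df["domgross"]
X = pd_df.drop("domgross", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42)

# Fit a plain linear regression and predict on both splits
linreg = LinearRegression().fit(X_train, y_train)
y_hat_train = linreg.predict(X_train)
y_hat_test = linreg.predict(X_test)
```

The `random_state` values here are arbitrary; any split into train and test sets works the same way.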
Let's plot our result for the train data. Because we have multiple predictors, we cannot simply plot the input variables X on the x-axis and target y on the y-axis. Let's plot:
- a line showing the diagonal of y_train. The actual y_train values are on this line
- next, a scatter plot that takes the actual y_train on the x-axis and the model's predictions on the y-axis. You will see points scattered around the line. The horizontal distances between the points and the line are the errors.
import matplotlib.pyplot as plt
%matplotlib inline
# your code here
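One way this plot could be sketched (the arrays below are made-up stand-ins for y_train and the model's predictions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt

# Made-up stand-ins for y_train and the model's predictions y_hat_train
y_train = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
y_hat_train = np.array([0.15, 0.25, 0.55, 0.65, 0.95])

plt.figure(figsize=(6, 6))
# The diagonal: every actual value sits on this line
plt.plot(y_train, y_train, label="actual = predicted")
# Predictions scattered around the diagonal; their distance to it is the error
plt.scatter(y_train, y_hat_train, label="model predictions")
plt.xlabel("actual y_train")
plt.ylabel("predicted y_train")
plt.legend()
```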
Do the same thing for the test data.
# your code here
Write a formula to calculate the bias of a model's predictions given the actual data:
(The expected value can simply be taken as the mean or average value.)
import numpy as np
def bias(y, y_hat):
    pass
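A minimal sketch, taking the expected value as a simple mean as suggested above, so that bias is the mean difference between predictions and actuals:

```python
import numpy as np

def bias(y, y_hat):
    """Bias as the mean difference between predictions and actuals:
    E[y_hat - y], with the expected value taken as a plain mean."""
    return np.mean(y_hat - y)
```

For example, if every prediction overshoots the actual value by 1, the bias is 1; predictions that overshoot and undershoot symmetrically have a bias near 0.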
Write a formula to calculate the variance of a model's predictions:
def variance(y_hat):
    pass
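One way to sketch this, using the identity E[ŷ²] − (E[ŷ])², which is the same value `np.var` returns:

```python
import numpy as np

def variance(y_hat):
    """Variance of the predictions around their own mean:
    E[y_hat**2] - E[y_hat]**2 (the same value np.var returns)."""
    return np.mean(y_hat ** 2) - np.mean(y_hat) ** 2
```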
Use your functions to calculate the bias and variance of your model. Do this separately for the train and test sets.
# code for train set bias and variance
# code for test set bias and variance
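Putting the two functions together on made-up predictions (stand-ins for the fitted model's output) might look like this; the test set is handled identically with y_test and y_hat_test:

```python
import numpy as np

def bias(y, y_hat):
    return np.mean(y_hat - y)

def variance(y_hat):
    return np.mean(y_hat ** 2) - np.mean(y_hat) ** 2

# Made-up stand-ins for y_train and the model's training predictions
y_train = np.array([0.2, 0.4, 0.6, 0.8])
y_hat_train = np.array([0.25, 0.35, 0.65, 0.75])

b_train = bias(y_train, y_hat_train)
v_train = variance(y_hat_train)
print(f"Train bias: {b_train}")
print(f"Train variance: {v_train}")
```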
Your description here (this cell is formatted using markdown)
Use PolynomialFeatures with degree 3.
Important note: by including this, you don't only take polynomials of single variables, you also combine variables, e.g.:
$ \text{Budget} * \text{MetaScore} ^ 2 $
What you're essentially doing is creating interactions and polynomials at the same time! Have a look at how many columns we get using np.shape. Quite a few!
from sklearn.preprocessing import PolynomialFeatures
# your code here
# your code here
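As a sketch (X below is a hypothetical stand-in for the four scaled predictors), PolynomialFeatures with degree 3 expands 4 columns into 35, including a constant column plus every polynomial and interaction term up to total degree 3:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the four scaled predictors and the target
rng = np.random.default_rng(1)
X = rng.random((20, 4))
y = rng.random(20)

# degree=3 builds every product of up to three predictors,
# e.g. budget * Metascore**2, plus a constant column
poly = PolynomialFeatures(degree=3)
X_poly = poly.fit_transform(X)
print(np.shape(X_poly))  # 4 predictors become 35 columns

# Refit linear regression on the expanded feature matrix
linreg = LinearRegression().fit(X_poly, y)
y_hat = linreg.predict(X_poly)
```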
Wow, we almost get a perfect fit!
# your code here
# your code here
# your code here
The bias and variance for the test set both increased drastically in the overfit model.
In this lab, we went from 4 predictors to 35 by adding polynomials and interactions using PolynomialFeatures. That being said, where 35 leads to overfitting, there are probably ways to improve the model by adding just a few polynomial terms. Feel free to experiment and see how bias and variance improve!
This lab gave you insight into how bias and variance change between a training and test set, using a pretty "simple" model and a very complex one.