## 2.73 Machine Learning - Overfitting Regression Example

We are going to explore data sampled from a true (usually unknown) function that relates, say, house price (y, in dollars) to house size (x, in square feet).

Let's assume the true function is a simple curve:

$y = a \log(bx^2) + c$

Suppose our measurements carry normally distributed noise with standard deviation $d$, so the measured data is sampled from this data generating model:

$y = a \log(bx^2) + c + \mathcal{N}(0, d^2)$

Because of this measurement noise, even a perfect model in our hypothesis set, fit to a fixed set of data, could not predict y exactly. This is known as the irreducible error: we will always have some uncertainty. Here is the data and the underlying data generating function:

In [31]:
import numpy as np
import pandas as pd
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import Range1d

# render Bokeh plots inline in the notebook
output_notebook()

# ground truth - our function (start just above 0, since log(0) is undefined)
xt = np.arange(0.1, 20, 0.1)
yt = np.log(10 * xt ** 2) + 2

n=20

# build train and test data
np.random.seed(seed=11)
x = np.random.random(n)*20
y = np.log(10 * x ** 2) + 2 + np.random.normal(scale=1, size=n)
df = pd.DataFrame({'x': x, 'y': y})

p = figure()
# noisy samples (orange) over the true function (red)
p.scatter(x=x, y=y, size=10, color='orange')
p.line(x=xt, y=yt, color='red')
show(p)

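To make the irreducible error concrete, here is a minimal sketch that scores the true function itself against noisy samples. It assumes the same constants as the cell above (a = 1, b = 10, c = 2, d = 1); the sample size, the seed, and the names `true_f` and `mse_of_truth` are illustrative choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed, not the notebook's

def true_f(x):
    # Same curve as the notebook: a = 1, b = 10, c = 2
    return np.log(10 * x ** 2) + 2

d = 1.0  # noise standard deviation

# Draw many noisy samples so the average error is stable
x = rng.uniform(0.1, 20, size=100_000)
y = true_f(x) + rng.normal(scale=d, size=x.size)

# Even the exact generating function cannot explain the noise:
# its mean squared error settles near the noise variance d**2
mse_of_truth = np.mean((y - true_f(x)) ** 2)
print(round(mse_of_truth, 2))
```

No fitted model can be expected to score below this floor on fresh data, however flexible it is.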

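Typically in machine learning we work with a fixed sample of data of size n. For this demonstration we fixed n at 20 and generated 20 random points from our data generating model above. Next we split those points into equal training and test halves, fit polynomial regressions of degree 1 to 10 to the training half, and compare the mean squared error (MSE) on each half.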
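In symbols (the notation $w_k$, $p$, and $m$ is introduced here for illustration and is not from the original), a polynomial regression of degree $p$ and the mean squared error used to score it over $m$ points can be written as:

$\hat{y}(x) = \sum_{k=0}^{p} w_k x^k$

$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}(x_i))^2$

The weights $w_k$ are chosen to minimise the MSE on the training points only; the test MSE is then computed with those same weights on the held-out points.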

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

df_train, df_test = train_test_split(df, test_size = 0.5, random_state=72)

def mse(a,b):
    return np.sum(np.power(np.subtract(a,b),2)) / len(a)

def fitPolynomialRegression(x_train, y_train, x_test, y_test, xt, degree=1):    
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(x_train, y_train) 
    return {
        'mse_train' : mse(y_train, model.predict(x_train)),
        'mse_test' : mse(y_test, model.predict(x_test)),
        'pred_yt' : model.predict(xt),
        'pred_train' : model.predict(x_train),
        'pred_test' : model.predict(x_test)
        
    } 


# fit models of increasing flexibility
mse_test = []
mse_train = []
for degree in range(1,11):
    # scikit-learn expects 2-D feature arrays, hence .values.reshape(-1,1)
    fit = fitPolynomialRegression(
            df_train.x.values.reshape(-1,1), 
            df_train.y, 
            df_test.x.values.reshape(-1,1), 
            df_test.y, 
            xt.reshape(-1,1),
            degree)
    mse_train.append(fit['mse_train'])
    mse_test.append(fit['mse_test'])

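# plot MSE against polynomial degree (train in blue, test in red)
degrees = list(range(1,11))
p = figure(width=300, height=300, title="Test vs Train MSE")
p.line(x=degrees, y=mse_train, color='blue', legend_label='train')
p.line(x=degrees, y=mse_test, color='red', legend_label='test')
p.y_range = Range1d(0,20)
show(p)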

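The pattern to look for in this plot is that training error does not increase as the model becomes more flexible, while test error need not follow it down. A minimal, self-contained sketch of the same experiment follows. It assumes NumPy only, substitutes `numpy.polynomial.Polynomial.fit` for the scikit-learn pipeline above, uses its own random draw and split rather than the notebook's, and stops at degree 9 because ten training points determine at most ten coefficients.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(11)  # illustrative seed; draws differ from the notebook's

def true_f(x):
    # Same curve as the notebook: log(10 * x^2) + 2
    return np.log(10 * x ** 2) + 2

def mse(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

# 20 noisy points, split at random into 10 for training and 10 for testing
x = rng.uniform(0.1, 20, size=20)
y = true_f(x) + rng.normal(scale=1.0, size=x.size)
idx = rng.permutation(x.size)
train, test = idx[:10], idx[10:]

degrees = list(range(1, 10))
train_mse, test_mse = [], []
for degree in degrees:
    # Least-squares polynomial fit on the training half only
    poly = Polynomial.fit(x[train], y[train], deg=degree)
    train_mse.append(mse(y[train], poly(x[train])))
    test_mse.append(mse(y[test], poly(x[test])))

for degree, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {degree}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Because each higher-degree polynomial contains the lower-degree ones as special cases, the least-squares training error can only stay level or fall. The test error has no such guarantee, and a gap between the two is the signature of overfitting.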

Let's plot the training and test data, overlaying the fitted model of each degree:

In [28]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import Range1d
from bokeh.layouts import gridplot

output_notebook()

plots_train = []
plots_test = []

for degree in range(1,11):
    fit = fitPolynomialRegression(
            df_train.x.values.reshape(-1,1), 
            df_train.y, 
            df_test.x.values.reshape(-1,1), 
            df_test.y, 
            xt.reshape(-1,1),
            degree)
    # training points with the fitted curve
    p = figure(width=200, height=200, title="Train " + str(degree))
    p.title.text_font_size = '8pt'
    p.scatter(x=df_train.x, y=df_train.y, size=8, color="orange", alpha=0.8)
    p.line(x=xt, y=fit['pred_yt'], color = 'red')
    p.y_range = Range1d(0,12)
    p.x_range = Range1d(0,20)
    plots_train.append(p)
    # test points with the same fitted curve
    p = figure(width=200, height=200, title="Test " + str(degree))
    p.title.text_font_size = '8pt'
    p.scatter(x=df_test.x, y=df_test.y, size=8, color="orange", alpha=0.8)
    p.line(x=xt, y=fit['pred_yt'], color = 'red')
    p.y_range = Range1d(0,12)
    p.x_range = Range1d(0,20)
    plots_test.append(p)

# two rows of five plots: degrees 1-5 on top, 6-10 below
show(gridplot([plots_train[:5], plots_train[5:]]))

show(gridplot([plots_test[:5], plots_test[5:]]))

