We are going to explore data sampled from a true (usually unknown) function that relates house price (y in dollars) to house size (x in square feet).

Let's assume the true function is a simple curve:

y = a log(b x^2) + c, defined for x and y in the positive real numbers.

Let's also assume that our measurements carry normally distributed noise with standard deviation d, so our measured data is sampled from this data generating model:

y = a log(b x^2) + c + e, where e ~ N(0, d^2)

Because of this measurement noise, even with a perfect model in our hypothesis set we could not predict y exactly from a fixed set of data. This is known as the irreducible error: we will always have some uncertainty.
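
As a rough illustration of the irreducible error, the sketch below scores the true function itself against a large noisy sample. The parameter values (a = 1, b = 10, c = 2, d = 1), the seed, the sample size and the helper name `true_f` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for repeatability

# assumed parameter values for illustration
a, b, c, d = 1.0, 10.0, 2.0, 1.0

def true_f(x):
    # hypothetical helper: the noise-free curve y = a*log(b*x^2) + c
    return a * np.log(b * x ** 2) + c

# a large sample, so the estimate of the noise variance is stable
m = 100_000
x = rng.random(m) * 20
y = true_f(x) + rng.normal(scale=d, size=m)

# MSE of the true function against the noisy observations
mse_true = np.mean((y - true_f(x)) ** 2)
print(mse_true)
```

Under these assumptions the printed value should sit close to d^2, the variance of the noise: no model, however well chosen, can be expected to do better than that on new noisy observations.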

Typically in machine learning we work with a fixed sample of data of size n. For demonstration let's fix n at 20 and generate 20 random points from our data generating model, taking a = 1, b = 10, c = 2 and d = 1.

In [42]:
import numpy as np
import pandas as pd

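n = 20
# house sizes drawn uniformly from [0, 20)
x = np.random.random(n)*20
# true curve (a=1, b=10, c=2) plus Gaussian noise with standard deviation d=1
y = np.log(10 * x ** 2) + 2 + np.random.normal(scale=1, size=n)
# a dict of columns works across pandas versions; a bare zip iterator can fail on older ones
df = pd.DataFrame({'x': x, 'y': y})
df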

           x          y
0   9.588784   8.983029
1  12.572444   9.926790
2  10.441444   8.909004
3  13.287254   8.244418
4   3.881669   6.334197
5  12.122953   8.893576
6   9.442330   9.610408
7  13.098537   9.204448
8  15.469460  10.115136
9   6.928323   8.344253


For machine learning we further divide this data into a training set and a test set. In this case let's take a 50:50 split. We will keep the test data to measure the out-of-sample fit of the models we fit to the training data.

In [77]:
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

df_train, df_test = train_test_split(df, test_size = 0.5, random_state=72)

print('Training data shape : ' + str(df_train.shape))
print('Test data shape : ' + str(df_test.shape))

Training data shape : (10, 2)
Test data shape : (10, 2)


Let's plot the training data and overlay the true data generating curve:

In [78]:
from bokeh.plotting import figure, output_notebook, show

# start just above 0, since log(0) is undefined
xt = np.arange(0.1,20.0,0.1)
yt = np.log(10 * xt ** 2) + 2

# render plots inline in the notebook
output_notebook()

# width/height replace plot_width/plot_height, which Bokeh 3 no longer accepts
p = figure(width=400, height=400)
# points in data set
p.scatter(x=df_train.x, y=df_train.y, size=10, color="navy", alpha=0.5)
# add the line
p.line(x=xt, y=yt, line_width=2)

# show the results
show(p)



Let's fit polynomials of different degrees to the training data:

In [89]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline


def poly(degree, X, y):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return model.fit(X, y)

# scikit-learn expects 2-D feature arrays, so reshape the underlying NumPy values
X_train = df_train.x.values.reshape(-1,1)

xt = np.arange(min(df_train.x),max(df_train.x),0.1)
yt = np.log(10 * xt ** 2) + 2
Xt = xt.reshape(-1,1)

l1 = poly(1, X_train, df_train.y).predict(Xt)
l2 = poly(2, X_train, df_train.y).predict(Xt)
l3 = poly(3, X_train, df_train.y).predict(Xt)
#...
l10 = poly(10, X_train, df_train.y).predict(Xt)

from bokeh.models import Range1d

p = figure(width=400, height=400)
# points in data set
p.scatter(x=df_train.x, y=df_train.y, size=10, color="navy", alpha=0.5)
# add the lines: the true curve, then the fits of degree 1, 2, 3 and 10
p.line(x=xt, y=yt, line_width=2)
p.line(x=xt, y=l1, color = 'red')
p.line(x=xt, y=l2, color = 'orange')
p.line(x=xt, y=l3, color = 'yellow')
p.line(x=xt, y=l10, color = 'grey')
# clip the y-axis so the high-degree fit does not dominate, then show the results
p.y_range = Range1d(0,12)
show(p)


Let's compute the mean squared error (MSE) over the training observations as a function of model complexity (here, the polynomial degree):
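
One way this could be done is sketched below. It is self-contained, so it regenerates the data with a fixed seed (an assumption, so the values will not match the table earlier on), and the choice of degrees 1 to 5 and the name `train_mse` are likewise illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # assumed seed, only for repeatability

# regenerate 20 points from the data generating model
n = 20
x = rng.random(n) * 20
y = np.log(10 * x ** 2) + 2 + rng.normal(scale=1, size=n)

# 50:50 split; scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.5, random_state=72)

# training MSE for each polynomial degree (degrees 1..5 are an arbitrary choice)
train_mse = {}
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))

for degree, mse in train_mse.items():
    print(degree, mse)
```

Because each polynomial family contains all lower-degree polynomials, the least-squares training MSE should not increase as the degree grows.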

And add the MSE over the test observations:
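
A minimal sketch of the comparison follows, assuming data regenerated with a fixed seed (so the numbers will differ from the table earlier on) and an arbitrary range of degrees 1 to 5. The names `rows` and `mse` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # assumed seed, only for repeatability

# regenerate 20 points from the data generating model
n = 20
x = rng.random(n) * 20
y = np.log(10 * x ** 2) + 2 + rng.normal(scale=1, size=n)

# 50:50 split; scikit-learn expects a 2-D feature array
X_train, X_test, y_train, y_test = train_test_split(
    x.reshape(-1, 1), y, test_size=0.5, random_state=72)

# fit on the training set only, then score on both sets
rows = []
for degree in range(1, 6):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    rows.append({
        'degree': degree,
        'train_mse': mean_squared_error(y_train, model.predict(X_train)),
        'test_mse': mean_squared_error(y_test, model.predict(X_test)),
    })

mse = pd.DataFrame(rows).set_index('degree')
print(mse)
```

Unlike the training error, the test error carries no guarantee of falling as the degree increases, which is what makes it useful for spotting overfitting. With only 10 test points the estimates are noisy, so the exact pattern depends on the sample drawn.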