## 2.71 Machine Learning - Intuition

Let's start by creating a simulated data set:

In [56]:
from sklearn import datasets
import numpy as np

n = 50
x, y, coef = datasets.make_regression(n_samples = n, n_features = 1, 
                                      n_informative = 1, noise = 10, 
                                      coef = True, random_state = 0)
x = x.flatten()
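
As a quick sanity check on what the generator returns, the sketch below regenerates the same data and prints the array shapes together with the true slope used by the simulation. It assumes only the call shown above; `true_slope` is an illustrative name that does not appear in the original notebook.

```python
from sklearn import datasets
import numpy as np

# Same call as above: 50 points, one informative feature, Gaussian noise
x, y, coef = datasets.make_regression(n_samples=50, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)
x = x.flatten()  # turn the (50, 1) feature matrix into a flat vector

# coef holds the slope of the underlying noise-free linear model
true_slope = np.asarray(coef).item()  # illustrative name, not from the notebook

print(x.shape, y.shape)  # (50,) (50,)
print(true_slope)
```

The noise added to `y` is why the points will scatter around a line rather than sit exactly on it.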

Let's plot the data:

In [57]:
from bokeh.plotting import figure, output_notebook, show
output_notebook(hide_banner=True)
p = figure(plot_width = 400, plot_height = 400)
p.circle(x, y, size=10, color="orange")
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)

The scatter plot suggests a roughly linear relationship, so a straight line is a reasonable model for the unknown function that relates x to y.

The equation of a line is:

$y = \alpha + \beta x$

By adjusting the two parameters $\alpha$ (the intercept) and $\beta$ (the slope) we can produce infinitely many different lines. For example, we could set $\alpha = 3.5$ and $\beta = 10$:

In [58]:
a = 3.5
b = 10
xl = np.arange(-3,3,0.1)
yl = a + b * xl
yp = a + b * x

output_notebook(hide_banner=True)

p = figure(plot_width = 400, plot_height = 400)
p.circle(x, y, size=10, color="orange")
p.line(xl, yl, line_width=5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)

For any line we can compute the difference between each observed $y_i$ in our data and the value $\hat{y}_i = \alpha + \beta x_i$ predicted by the equation. From these differences, averaged over the $N$ data points, we can define a simple metric that measures how well the line explains the data:

$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$
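
The formula translates almost directly into NumPy. The following is a minimal sketch: the helper name `mse_for_line` and the three demo points are made up for illustration and are not part of the notebook's data.

```python
import numpy as np

def mse_for_line(a, b, x, y):
    """Mean squared error of the line y_hat = a + b * x on the data (x, y)."""
    y_hat = a + b * x
    return np.mean((y - y_hat) ** 2)

# Tiny hand-checkable example (made-up points, not the simulated data set)
x_demo = np.array([0.0, 1.0, 2.0])
y_demo = np.array([1.0, 3.0, 6.0])

# With a = 1 and b = 2 the predictions are 1, 3, 5, so the residuals
# are 0, 0, 1 and the MSE is (0 + 0 + 1) / 3 = 1/3
print(mse_for_line(1.0, 2.0, x_demo, y_demo))
```

Squaring the residuals means that points above and below the line both add to the error instead of cancelling out.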

For example, with $\alpha = 3.5$ and $\beta = 10$ we can plot the vertical distance between each observed point ($y$) and the value predicted by the equation ($\hat{y}$):

In [59]:
lx = [ [x1, x1] for x1 in x]
ly = [ [y1, y2] for (y1, y2) in zip(y, yp)]
p = figure(plot_width = 400, plot_height = 400)
p.multi_line(xs=lx, ys=ly, line_width = 2, color="gray")
p.circle(x, y, size=10, color="orange")
p.line(xl, yl, line_width=5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)

Averaging the squared lengths of these gray segments gives the Mean Squared Error.

Each choice of $\alpha$ and $\beta$ gives a different MSE; the best-fitting line is the one with the lowest MSE.

In [67]:
from bokeh.io import push_notebook
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Starting guess for the intercept (a) and slope (b)
a = 0
b = 1
xl = np.arange(-3,3,0.1)
yl = a + b * xl
yp = a + b * x

mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
title='MSE = {0:6.2f}'.format(mse)

output_notebook(hide_banner=True)

# Data sources let the line and the MSE label be updated in place
source_mse = ColumnDataSource(data=dict(text=[title]))
source_line = ColumnDataSource(data=dict(x=xl,y=yl))

p = figure(plot_width = 400, plot_height = 400)
p.circle(x, y, size=10, color="orange")
# Refer to the source columns by name rather than passing the arrays again
p.line('x', 'y', source=source_line, line_width=5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
p.text(x=-3, y=30, text='text', source=source_mse)

def update(a, b):
    # Recompute the line and its MSE, then push the change to the plot
    xl = np.arange(-3,3,0.1)
    yl = a + b * xl
    yp = a + b * x
    mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
    title='MSE = {0:6.2f}'.format(mse)
    source_line.data = dict(x=xl, y=yl)
    source_mse.data = dict(text=[title])
    push_notebook(handle=handle)

# Newer Bokeh releases only return a handle when notebook_handle=True
handle = show(p, notebook_handle=True)

In [68]:
from ipywidgets import interact
interact(update, a = (-40, 40, 0.5), b = (-20, 20, 0.5))
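
Moving the sliders by hand amounts to a manual search over $\alpha$ and $\beta$. As a rough sketch of the same idea done exhaustively, the code below evaluates the MSE on a grid covering the slider ranges and picks the smallest value. The grid step of 0.5 mirrors the sliders; the names `a_grid`, `b_grid` and `mse_grid` are illustrative and not from the notebook.

```python
import numpy as np
from sklearn import datasets

# Regenerate the simulated data so the example is self-contained
x, y = datasets.make_regression(n_samples=50, n_features=1, n_informative=1,
                                noise=10, random_state=0)
x = x.flatten()

# Candidate intercepts and slopes, matching the slider ranges and step
a_grid = np.arange(-40, 40.5, 0.5)
b_grid = np.arange(-20, 20.5, 0.5)

# Broadcasting gives predictions of shape (len(a_grid), len(b_grid), n)
y_hat = a_grid[:, None, None] + b_grid[None, :, None] * x[None, None, :]
mse_grid = np.mean((y[None, None, :] - y_hat) ** 2, axis=2)

# Locate the grid cell with the smallest MSE
i, j = np.unravel_index(np.argmin(mse_grid), mse_grid.shape)
best_a, best_b = a_grid[i], b_grid[j]
print(best_a, best_b, mse_grid[i, j])
```

A grid like this can only get as close to the best line as its step size allows, and its cost grows quickly as more parameters are added.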

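Rather than searching through parameter values by hand, we can use 'machine learning' to find them for us. Linear regression picks the $\alpha$ and $\beta$ that minimize the MSE on the data: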

In [65]:
from sklearn.linear_model import LinearRegression

xm = x.reshape(-1,1)
model = LinearRegression()
model.fit(xm,y)
print("a   = {0:4.2f}".format(float(model.intercept_)))
print("b   = {0:4.2f}".format(float(model.coef_[0])))  # coef_ is a length-1 array

yp = model.predict(xm)
mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
print('MSE = {0:4.2f}'.format(mse))

a   = 1.71
b   = 13.92
MSE = 64.11
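
For a single feature, the least-squares line has a well-known closed form, which is one way to see why no search is needed:

$\hat{\beta} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N}(x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$

The sketch below computes these estimates directly with NumPy and compares them with the fitted `LinearRegression` model. The names `a_hat` and `b_hat` are illustrative, and this is a check on the result rather than a description of how scikit-learn computes it internally.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Regenerate the simulated data so the example is self-contained
x, y = datasets.make_regression(n_samples=50, n_features=1, n_informative=1,
                                noise=10, random_state=0)
x = x.flatten()

# Closed-form least-squares estimates for a single feature
b_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a_hat = y.mean() - b_hat * x.mean()

# Fit the same model with scikit-learn and compare the parameters
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(np.isclose(a_hat, model.intercept_), np.isclose(b_hat, model.coef_[0]))  # True True
```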
