## 2.71 Machine Learning - Intuition

Let's start by creating a simulated data set:

In [121]:
from sklearn import datasets
import numpy as np

n = 50
x, y, coef = datasets.make_regression(n_samples = n, n_features = 1, 
                                      n_informative = 1, noise = 10, 
                                      coef = True, random_state = 0)
x = x.flatten()

Let's plot the data:

In [122]:
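# figure and show are imported here so that this cell runs on its own
from bokeh.plotting import figure, show

p = figure(plot_width = 400, plot_height = 400)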
p.circle(x, y, size=10, color="orange")
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)


The scatter plot suggests that a linear model is a reasonable description of the true, unknown function that relates $x$ to $y$.

The equation of a line is:

$y = \alpha + \beta x$

By adjusting the two parameters, the intercept $\alpha$ and the slope $\beta$, we can produce an infinite number of possible lines.

For any line we can compute the differences between the observed values $y_i$ in our data and the values $\hat{y}_i = \alpha + \beta x_i$ predicted by the equation. From these differences we can define a simple metric, the mean squared error (MSE) over the $N$ data points, that measures how well the line explains the data:

$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2$

Each choice of $\alpha$ and $\beta$ gives a different MSE; the best-fitting line is the one with the lowest MSE.
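
To make the comparison concrete, here is a minimal sketch that scores two candidate lines on a tiny made-up data set. The helper name `mse_for_line` and the toy values are assumptions introduced for illustration; they are not part of the data set used in this section.

```python
import numpy as np

def mse_for_line(x, y, a, b):
    """Mean squared error of the line y_hat = a + b * x (hypothetical helper)."""
    y_hat = a + b * x
    return np.mean((y - y_hat) ** 2)

# Toy data, made up for illustration: points scattered around y = 1 + 2x
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.1, 2.9, 5.2, 6.8])

mse_flat = mse_for_line(x_toy, y_toy, a=0.0, b=1.0)   # a poor guess
mse_close = mse_for_line(x_toy, y_toy, a=1.0, b=2.0)  # close to the trend

print('MSE for a=0, b=1: {0:6.3f}'.format(mse_flat))
print('MSE for a=1, b=2: {0:6.3f}'.format(mse_close))
```

The second line follows the trend of the points much more closely, so its MSE is far lower than that of the first.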

In [123]:
from bokeh.io import output_notebook, show
from bokeh.io import push_notebook
from bokeh.plotting import ColumnDataSource, figure

a = 0
b = 1
xl = np.arange(-3,3,0.1)
yl = a + b * xl
yp = a + b * x

mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
title='MSE = {0:6.2f}'.format(mse)

output_notebook(hide_banner=True)

source_mse = ColumnDataSource(data=dict(text=[title]))
source_line = ColumnDataSource(data=dict(x=xl,y=yl))

p = figure(plot_width = 400, plot_height = 400)
p.circle(x, y, size=10, color="orange")
# Refer to the source columns by name so the line follows updates to source_line
p.line('x', 'y', source=source_line)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
# Take the label from the 'text' column so it follows updates to source_mse
p.text(x=-3, y=30, text='text', source=source_mse)

def update(a, b):
    xl = np.arange(-3,3,0.1)
    yl = a + b * xl
    yp = a + b * x
    mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
    title='MSE = {0:6.2f}'.format(mse)
    source_line.data['x'] = xl
    source_line.data['y'] = yl
    source_mse.data['text'] = [title]
    push_notebook()
    
show(p)


In [124]:
from ipywidgets import interact
interact(update, a = (-40, 40, 0.5), b = (-20, 20, 0.5))

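
Moving the sliders by hand is a manual search over $\alpha$ and $\beta$. As a rough sketch of the same search done exhaustively, the block below evaluates the MSE at every slider position and keeps the lowest. It uses stand-in data generated with NumPy (an assumption, so that the block runs on its own); the names `x_demo`, `y_demo`, `a_grid` and `b_grid` are illustrative.

```python
import numpy as np

# Stand-in data, assumed for illustration; the x and y from above would work the same way
rng = np.random.default_rng(0)
x_demo = rng.normal(size=50)
y_demo = 2.0 + 14.0 * x_demo + rng.normal(scale=10.0, size=50)

# Same ranges and step as the two sliders
a_grid = np.arange(-40, 40.5, 0.5)
b_grid = np.arange(-20, 20.5, 0.5)

# A[i, j] = a_grid[i] and B[i, j] = b_grid[j]
A, B = np.meshgrid(a_grid, b_grid, indexing='ij')

# Predictions for every (a, b) pair: shape (len(a_grid), len(b_grid), n)
y_hat = A[..., np.newaxis] + B[..., np.newaxis] * x_demo
mse_grid = np.mean((y_demo - y_hat) ** 2, axis=-1)

# Position of the lowest MSE on the grid
i, j = np.unravel_index(np.argmin(mse_grid), mse_grid.shape)
best_a, best_b, best_mse = a_grid[i], b_grid[j], mse_grid[i, j]
print('a = {0:5.2f}, b = {1:5.2f}, MSE = {2:6.2f}'.format(best_a, best_b, best_mse))
```

A grid like this can only be as precise as its step size, and its cost grows quickly with the number of parameters, which is one reason to prefer a method that solves for the parameters directly.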

Rather than searching through parameter values by hand, we can use 'machine learning' to find the best-fitting parameters directly:

In [120]:
from sklearn.linear_model import LinearRegression

x = x.reshape(-1,1)
model = LinearRegression()
model.fit(x,y)
# a is the intercept (alpha) and b is the slope (beta), matching y = a + b * x
print("a   = {0:4.2f}".format(model.intercept_))
print("b   = {0:4.2f}".format(model.coef_[0]))

yp = model.predict(x)
mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
print('MSE = {0:4.2f}'.format(mse))

a   = 1.71
b   = 13.92
MSE = 64.11
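
`LinearRegression` fits by ordinary least squares, and for a single feature the least-squares line has a well-known closed form: $\beta = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$ and $\alpha = \bar{y} - \beta \bar{x}$. The sketch below applies these formulas to stand-in data generated with NumPy (an assumption, so that the block is self-contained) and cross-checks the result against `np.polyfit`. The variable names are illustrative.

```python
import numpy as np

# Stand-in data, assumed for illustration; not the data set used above
rng = np.random.default_rng(0)
x_demo = rng.normal(size=50)
y_demo = 2.0 + 14.0 * x_demo + rng.normal(scale=10.0, size=50)

# Closed-form least-squares estimates for one feature
x_mean, y_mean = x_demo.mean(), y_demo.mean()
beta = np.sum((x_demo - x_mean) * (y_demo - y_mean)) / np.sum((x_demo - x_mean) ** 2)
alpha = y_mean - beta * x_mean

# Cross-check against NumPy's degree-1 polynomial fit (returns slope, then intercept)
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, 1)

mse_fit = np.mean((y_demo - (alpha + beta * x_demo)) ** 2)
print('a   = {0:4.2f}'.format(alpha))
print('b   = {0:4.2f}'.format(beta))
print('MSE = {0:4.2f}'.format(mse_fit))
```

No search is needed: the parameters that minimise the MSE are computed in one step, and nudging either of them away from these values can only increase the MSE.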
