## 2.71 Machine Learning - Intuition Logistic Regression

Let's start by creating a simulated data set:

In [138]:
import numpy as np
import pandas as pd
import math

n = 50
# Logistic function parameters
b0 = -7
b1 = 0.15
# Measurement noise on x
s = 10
np.random.seed(seed=1973)
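# True (noise-free) x values, drawn uniformly between 18 and 80
df = pd.DataFrame(np.random.uniform(18,80,size=n),columns=['x'])
# Probability that y = 1 under the logistic function
df['p_x'] = df.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
# Label each point 1 when its probability exceeds 0.5, otherwise 0
df['y'] = df.p_x.map(lambda p_x: 1 if p_x > 0.5 else 0)
# Add measurement noise to x after the labels have been assigned
df['x'] = df.x.map(lambda x: x + np.random.normal(0, s))
df['color'] = df.y.map(lambda y: 'Red' if y == 1 else 'Blue')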
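
Because `y` is set to 1 exactly when `p_x > 0.5`, the two classes are separated at the value of x where the logistic function crosses 0.5, that is, where `b0 + b1*x = 0`. The following is a minimal sketch of that calculation using the same `b0` and `b1` as the simulation; the names `logistic` and `x_boundary` are introduced here for illustration only.

```python
import math

# Same logistic function parameters as the simulation above
b0 = -7
b1 = 0.15

def logistic(x, b0, b1):
    """Logistic function: probability that y = 1 at a given x."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

# p(x) = 0.5 exactly when b0 + b1*x = 0, so the boundary is x = -b0/b1
x_boundary = -b0 / b1
print("boundary at x  = {0:5.2f}".format(x_boundary))
print("p 5 units below = {0:4.2f}".format(logistic(x_boundary - 5, b0, b1)))
print("p 5 units above = {0:4.2f}".format(logistic(x_boundary + 5, b0, b1)))
```

Since the noise is added to x after the labels are assigned, the two classes can be expected to overlap around this value in the plot rather than split cleanly.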

Let's plot the data:

In [139]:
from bokeh.plotting import figure, output_notebook, show
output_notebook(hide_banner=True)
p = figure(plot_width = 400, plot_height = 400)
p.circle(df.x, df.y, size=10, color=df.color, alpha=0.5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)


Because $y$ only takes the values 0 and 1, a straight line is a poor description of the relationship between x and y. An S-shaped logistic function, which maps any x to a probability between 0 and 1, is a more plausible model of the true unknown function.

The equation of the logistic function is:

$p(x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$

By adjusting the two parameters $\beta_0$ and $\beta_1$ (`b0` and `b1` in the code) we get an infinite number of possible curves.

For any curve we can compute the difference between the $y$'s in our data and the probabilities $\hat{y}$ predicted by the equation. From this difference we can define a simple metric that measures how well the curve explains the data:

$MSE = \frac{1}{N}\sum_{i=1}^{N}{(y_i - \hat{y}_i)^2}$

As we vary $\beta_0$ and $\beta_1$ we get a different MSE; the best fitting curve is the one with the lowest MSE.
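
As a minimal sketch of this metric applied to the logistic curves used in this section, the block below computes the MSE for two candidate parameter pairs. The helper names `logistic` and `mse` and the six-point data set are assumptions made for illustration; they are not the simulated `df` from above.

```python
import numpy as np

def logistic(x, b0, b1):
    """Predicted probability that y = 1 for each x.

    1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).
    """
    z = b0 + b1 * x
    return 1.0 / (1.0 + np.exp(-z))

def mse(y, y_hat):
    """Mean squared error between observed labels and predictions."""
    return np.mean((y - y_hat) ** 2)

# Small hypothetical data set (not the simulated df above)
x = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Two candidate parameter pairs: the curve with the lower MSE fits better
mse_a = mse(y, logistic(x, -7.0, 0.15))
mse_b = mse(y, logistic(x, -5.0, 0.05))
print("MSE (b0=-7, b1=0.15) = {0:5.3f}".format(mse_a))
print("MSE (b0=-5, b1=0.05) = {0:5.3f}".format(mse_b))
```

The first pair places the curve's midpoint between the two classes, so it should score the lower MSE of the two.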

In [147]:
from bokeh.io import push_notebook
from bokeh.plotting import ColumnDataSource, figure

b0 = -5
b1 = 0.1

dfl = pd.DataFrame(np.arange(0,100,0.5),columns=['x'])
dfl['y'] = dfl.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))

# df['pred'] = ...  # add predicted value based on this fit
# compute MSE: mse = np.sum(np.power(np.subtract(df.y, yp), 2)) / n
# title = 'MSE = {0:6.2f}'.format(mse)

output_notebook(hide_banner=True)

#source_mse = ColumnDataSource(data=dict(text=[title]))
source_line = ColumnDataSource(data=dict(x=dfl.x,y=dfl.y))

p = figure(plot_width = 600, plot_height = 400)
p.circle(df.x, df.y, size=10, color=df.color, alpha=0.5)
# Reference the columns by name so that updates to source_line redraw the curve
p.line('x', 'y', source=source_line, line_width=5, color='grey')
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
#p.text(-3,30, text=[title], source = source_mse)

def update(b0, b1):
    dfl['y'] = dfl.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
    source_line.data['y'] = dfl.y
    #source_mse.data['text'] = [title]
    push_notebook()
    
# notebook_handle=True returns the handle that push_notebook() updates
show(p, notebook_handle=True)


In [148]:
from ipywidgets import interact
interact(update, b0 = (-30, 30, 0.5), b1 = (-10, 10, 0.05))
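
The manual search done with the sliders can also be written as a brute-force grid search over the same ranges. The block below is only a sketch under stated assumptions: it builds a stand-in data set in the same way as the simulation above rather than reusing `df`, and the names `logistic`, `best_b0`, `best_b1` and `best_mse` are introduced for illustration.

```python
import numpy as np

def logistic(x, b0, b1):
    """Predicted probability that y = 1; z is clipped to avoid overflow in exp."""
    z = np.clip(b0 + b1 * x, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in data: label from the noise-free x, then add measurement noise
rng = np.random.RandomState(1973)
x_true = rng.uniform(18, 80, size=50)
y = (logistic(x_true, -7, 0.15) > 0.5).astype(int)
x = x_true + rng.normal(0, 10, size=50)

# Try every (b0, b1) pair on the slider grid and keep the lowest MSE
best_b0, best_b1, best_mse = None, None, np.inf
for b0 in np.linspace(-30, 30, 121):        # step 0.5
    for b1 in np.linspace(-10, 10, 401):    # step 0.05
        m = np.mean((y - logistic(x, b0, b1)) ** 2)
        if m < best_mse:
            best_b0, best_b1, best_mse = b0, b1, m

print("b0  = {0:5.2f}".format(best_b0))
print("b1  = {0:5.2f}".format(best_b1))
print("MSE = {0:5.3f}".format(best_mse))
```

This evaluates tens of thousands of parameter pairs for just two parameters, which illustrates why an exhaustive search becomes impractical as the grid gets finer or the number of parameters grows.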

We can use 'machine learning' to fit the parameters automatically instead of searching for them by hand:

In [7]:
from sklearn.linear_model import LogisticRegression

# scikit-learn expects a 2-D feature array and a 1-D label array
x = df.x.values.reshape(-1,1)
y = df.y.values
model = LogisticRegression()
model.fit(x,y)
print("b0  = {0:4.2f}".format(model.intercept_[0]))
print("b1  = {0:4.2f}".format(model.coef_[0][0]))

# Predicted probability that y = 1 for each observation
yp = model.predict_proba(x)[:, 1]
mse = np.sum(np.power(np.subtract(y, yp), 2)) / n
print('MSE = {0:4.2f}'.format(mse))
