## 2.71 Machine Learning - Intuition Logistic Regression

Let's start by creating a simulated data set:

In [1]:
import numpy as np
import pandas as pd
import math

n = 50
# Logistic function parameters
b0 = -7
b1 = 0.15
# Measurement noise on x
s = 10
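np.random.seed(seed=1973)
# Draw n values of x uniformly between 18 and 80
df = pd.DataFrame(np.random.uniform(18,80,size=n),columns=['x'])
# True probability of y=1 under the logistic model
df['p_x'] = df.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
# Label is 1 when the true probability exceeds 0.5
df['y'] = df.p_x.map(lambda p_x: 1 if p_x > 0.5 else 0)
# Add Gaussian measurement noise to x after the labels have been assigned
df['x'] = df.x.map(lambda x: x + np.random.normal(0, s))
df['color'] = df.y.map(lambda y: 'Red' if y == 1 else 'Blue')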

Let's plot the data:

In [2]:
from bokeh.plotting import figure, output_notebook, show
output_notebook(hide_banner=True)
p = figure(plot_width = 400, plot_height = 400)
p.circle(df.x, df.y, size=10, color=df.color, alpha=0.5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)


## Fitting a linear regression model

What happens if we fit a linear regression model to this data?

In [3]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# scikit-learn expects a 2-D feature array, so reshape the underlying NumPy values
model.fit(df.x.values.reshape(-1,1), df.y)

x = np.arange(0,100,0.5)
y = model.predict(x.reshape(-1,1))
p = figure(plot_width = 400, plot_height = 400)
p.circle(df.x, df.y, size=10, color=df.color, alpha=0.5)
p.line(x,y, color='grey', line_width=5)
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
show(p)


We can see the problem: for low and high values of x the fitted line predicts values below 0.0 and above 1.0, which cannot be interpreted as probabilities.
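
To make the problem concrete, the following minimal sketch fits a least-squares line to a small made-up data set and evaluates it at both ends of the plotted range. The values in `xs` and `ys` and the helper `predict_linear` are illustrative assumptions, not the simulated data above.

```python
# Hypothetical toy data: binary labels that switch from 0 to 1 as x grows
xs = [20, 30, 40, 50, 60, 70]
ys = [0, 0, 0, 1, 1, 1]

n_pts = len(xs)
mean_x = sum(xs) / n_pts
mean_y = sum(ys) / n_pts

# Closed-form simple linear regression (ordinary least squares)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict_linear(x):
    """Value of the fitted straight line at x."""
    return intercept + slope * x

print(predict_linear(0))    # below 0
print(predict_linear(100))  # above 1
```

Nothing in the least-squares fit knows that the target is a probability, so the line simply keeps going past 0 and 1.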

## Logistic Regression

Logistic regression recognises that we are trying to model a probability, specifically $p(y=1 \mid x)$. Linear regression assumes the following functional form for this relationship:

$p(y=1 \mid x) = \beta_0 + \beta_1 x$

This form is unbounded, so nothing stops the predicted value from leaving the range (0.0, 1.0). Logistic regression addresses this problem by choosing a function that constrains the probability to lie in the range (0.0, 1.0). The function chosen is known as the logistic function:

$p(y=1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1+e^{\beta_0 + \beta_1 x}}$
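
As a quick numerical check of this formula, here is a minimal sketch that evaluates the logistic function for the parameters used to simulate the data. The helper name `logistic` and the variables `beta0`, `beta1`, `boundary` and `probs` are introduced here for illustration only.

```python
import math

def logistic(x, beta0, beta1):
    """p(y=1 | x) for the logistic model with intercept beta0 and slope beta1."""
    z = beta0 + beta1 * x
    # Algebraically equal to exp(z) / (1 + exp(z)), arranged to avoid overflow in exp()
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

beta0, beta1 = -7, 0.15      # parameters used to simulate the data
boundary = -beta0 / beta1    # x at which beta0 + beta1 * x = 0, i.e. p = 0.5

probs = [logistic(x, beta0, beta1) for x in range(0, 101, 10)]
print(all(0.0 < p < 1.0 for p in probs))   # every value is a valid probability
print(boundary, logistic(boundary, beta0, beta1))
```

The point where the probability crosses 0.5 is $x = -\beta_0 / \beta_1$, which is roughly 46.7 for these parameters; $\beta_1$ controls how sharply the curve rises around that point.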

Let's take a look at this function.

In [12]:
from bokeh.io import push_notebook
from bokeh.plotting import ColumnDataSource, figure

b0 = -5
b1 = 0.1

dfl = pd.DataFrame(np.arange(0,100,0.5),columns=['x'])
dfl['y'] = dfl.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
df['pred_p'] = df.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
df['pred_y'] = df.pred_p.map(lambda p_x: 1 if p_x > 0.5 else 0)  

# compute MSE 
mse = np.sum(np.power(np.subtract(df.y, df.pred_p), 2)) / n
title_mse='MSE = {0:6.2f}'.format(mse)

output_notebook(hide_banner=True)

source_mse = ColumnDataSource(data=dict(text=[title_mse]))
source_line = ColumnDataSource(data=dict(x=dfl.x,y=dfl.y))

p = figure(plot_width = 600, plot_height = 400)
p.circle(df.x, df.y, size=10, color=df.color, alpha=0.5)
# Refer to the source columns by name so that later updates to source_line redraw the curve
p.line('x', 'y', source=source_line, line_width=5, color='grey')
p.xaxis.axis_label='x'
p.yaxis.axis_label='y'
p.text(x=0, y=0.9, text='text', source=source_mse)

def update(b0, b1):
    dfl['y'] = dfl.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
    df['pred_p'] = df.x.map(lambda x: math.exp(b0+b1*x)/(1+(math.exp(b0+b1*x))))
    mse = np.sum(np.power(np.subtract(df.y, df.pred_p), 2)) / n
    title_mse='MSE = {0:6.2f}'.format(mse)
    source_line.data['y'] = dfl.y
    source_mse.data['text'] = [title_mse]
    push_notebook()
    
# notebook_handle=True lets push_notebook() update this plot in place
show(p, notebook_handle=True)


In [15]:
from ipywidgets import interact
interact(update, b0 = (-30, 30, 0.5), b1 = (-10, 10, 0.05))
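
Moving the sliders amounts to a manual search over $(\beta_0, \beta_1)$. As a rough sketch of what automating that search could look like, the code below scans the same slider ranges on a grid and keeps the pair with the lowest MSE. The toy data in `xs` and `ys` and the helpers `sigmoid` and `mse_for` are illustrative assumptions, not part of the notebook above.

```python
import math

def sigmoid(z):
    """Logistic function, arranged to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def mse_for(xs, ys, c0, c1):
    """Mean squared error between the labels and the predicted probabilities."""
    return sum((y - sigmoid(c0 + c1 * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data with one label out of order, so no curve fits perfectly
xs = [20, 30, 40, 50, 60, 70]
ys = [0, 0, 1, 0, 1, 1]

# Same ranges and step sizes as the sliders: b0 in [-30, 30], b1 in [-10, 10]
b0_grid = [-30 + 0.5 * i for i in range(121)]
b1_grid = [-10 + 0.05 * j for j in range(401)]

# Exhaustive search: evaluate every pair and keep the one with the lowest MSE
best_mse, best_b0, best_b1 = min(
    (mse_for(xs, ys, c0, c1), c0, c1) for c0 in b0_grid for c1 in b1_grid
)
print('best b0 = {0:.2f}, best b1 = {1:.2f}, MSE = {2:.3f}'.format(best_b0, best_b1, best_mse))
```

A grid like this is easy to write but scales poorly: the number of evaluations grows with the product of the grid sizes, and the answer is only as precise as the step size.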

Rather than searching through the parameters by hand, we can let a machine learning library estimate them from the data:

In [16]:
from sklearn.linear_model import LogisticRegression

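# This is a binary problem, so the default settings are sufficient
model = LogisticRegression()
# scikit-learn expects a 2-D feature array
X = df.x.values.reshape(-1,1)
model.fit(X=X, y=df.y)
yp = model.predict(X)

# Accuracy: the fraction of predictions that match the true labels
acc = float(np.mean(yp == df.y.values))
print('ACC = {0:4.2f}'.format(acc))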

ACC = 0.82
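
To give some intuition for what the fit does, here is a minimal sketch of estimating the two parameters by plain gradient descent on the log-loss. This is not how scikit-learn does it: `LogisticRegression` uses a dedicated solver and applies an L2 penalty by default. The toy data, the helper names (`sigmoid`, `log_loss`, `fit_logistic`) and the learning rate and step count are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, arranged to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def log_loss(xs, ys, beta0, beta1):
    """Average negative log-likelihood of the labels under the model."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(beta0 + beta1 * x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Unregularised gradient descent, starting from beta0 = beta1 = 0."""
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(beta0 + beta1 * x) - y  # gradient of the log-loss w.r.t. z
            g0 += err
            g1 += err * x
        beta0 -= lr * g0 / len(xs)
        beta1 -= lr * g1 / len(xs)
    return beta0, beta1

# Hypothetical toy data with one label out of order
xs_raw = [20, 30, 40, 50, 60, 70]
ys = [0, 0, 1, 0, 1, 1]

# Standardise x so that a single learning rate suits both parameters
mean_x = sum(xs_raw) / len(xs_raw)
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs_raw) / len(xs_raw))
xs = [(x - mean_x) / std_x for x in xs_raw]

loss_start = log_loss(xs, ys, 0.0, 0.0)
beta0_hat, beta1_hat = fit_logistic(xs, ys)
loss_end = log_loss(xs, ys, beta0_hat, beta1_hat)
print('log-loss: {0:.3f} -> {1:.3f}'.format(loss_start, loss_end))
```

Note that the objective here is the log-loss rather than the MSE shown in the interactive plot; the log-loss is the standard choice for logistic regression because it corresponds to maximising the likelihood of the observed labels.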
