# A Simple Neural network
## Part 2: Logistic regression (classification)

This tutorial is part 2 of the [previous tutorial on neural networks](http://peterroelants.github.io/posts/2015/05/18/Simple_neural_network_part01/). While the previous tutorial described a very simple one-input-one-output linear regression model, this tutorial describes a 2-class classification neural network with two input dimensions. This model is known in statistics as the [logistic regression](http://en.wikipedia.org/wiki/Logistic_regression) model.

![Image of the logistic model](https://dl.dropboxusercontent.com/u/8938051/Blog_images/SimpleANN02.png)

In [None]:
# Python imports
import numpy as np # Matrix and vector computation package
from mpl_toolkits.mplot3d import Axes3D # 3D plotting
import matplotlib.pyplot as plt  # Plotting library
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)

## Define the class distributions

In this example the target classes $t$ will be generated from 2 class distributions: blue ($t=1$) and red ($t=0$). The samples of each class are drawn from a two-dimensional Gaussian distribution with a class-specific mean and a standard deviation shared by both classes.



In [None]:
nb_of_samples_per_class = 20  # The number of samples in each class
red_mean = [-1,0]  # The mean of the red class
blue_mean = [1,0]  # The mean of the blue class
std_dev = 1.2  # standard deviation of both classes
# Generate samples from both classes
x_red = np.random.randn(nb_of_samples_per_class, 2) * std_dev + red_mean
x_blue = np.random.randn(nb_of_samples_per_class, 2) * std_dev + blue_mean

x = np.vstack((x_red, x_blue))
t = np.vstack((np.zeros((nb_of_samples_per_class,1)), np.ones((nb_of_samples_per_class,1))))
print(t)
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.axis([-4, 4, -4, 4])
plt.title('red vs blue classes in the input space')
plt.show()

## Logistic function

The goal is to predict the target class $t$ from the input values $x$. The network is defined as having an input $x = [x_1, x_2]$ which gets transformed by the weights $w = [w_1, w_2]$ to generate the probability that sample $x$ belongs to class $t=1$. This probability $P(t=1| x,w)$ is represented by the output $y$ of the network, computed as $y = \sigma(x * w^T)$. $\sigma$ is the [logistic function](http://en.wikipedia.org/wiki/Logistic_function) and is defined as:
$$ \sigma(z) = \frac{1}{1+e^{-z}} $$

This logistic function maps the input $z$ to an output between $0$ and $1$ as is illustrated in the figure below.

We can write the probabilities that the class is $t=1$ or $t=0$ given input $x$ as:

$$ P(t=1| x) = \sigma(x * w^T) = \frac{1}{1+e^{-x * w^T}} $$
$$ P(t=0| x) = 1 - \sigma(x * w^T) = \frac{e^{-x * w^T}}{1+e^{-x * w^T}} $$

Note that the logistic function is derived from the log [odds ratio](http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm) of $P(t=1|x)$ over $P(t=0|x)$.

$$ log \frac{P(t=1|x)}{P(t=0|x)} = log \frac{\frac{1}{1+e^{-x * w^T}}}{\frac{e^{-x * w^T}}{1+e^{-x * w^T}}} = log \frac{1}{e^{-x * w^T}} = $$
$$log(1) - log(e^{-x * w^T}) = x * w^T$$

This means that the log odds ratio $log(P(t=1|x)/P(t=0|x))$ changes linearly with the parameters $w$.
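
The derivation above can be checked numerically. The following is a minimal sketch, assuming one made-up input sample and one made-up weight vector (`x_example` and `w_example` are illustrative names and values, not part of the tutorial's dataset). It computes both class probabilities and confirms that their log ratio equals $x * w^T$.

```python
import numpy as np

def logistic(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example values, not taken from the tutorial's dataset
x_example = np.array([0.5, -1.5])  # one input sample [x1, x2]
w_example = np.array([2.0, 1.0])   # one weight vector [w1, w2]

z = x_example.dot(w_example)  # x * w^T
p1 = logistic(z)              # P(t=1|x)
p0 = 1.0 - p1                 # P(t=0|x)
log_odds = np.log(p1 / p0)    # log odds ratio

# The two printed values should agree up to floating-point rounding
print(z, log_odds)
```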

In [None]:
# Define the logistic function
def logistic(z): return 1 / (1 + np.exp(-z))

# Plot the logistic function
z = np.linspace(-6,6,100)
plt.plot(z, logistic(z), 'b-')
plt.xlabel('z')
plt.ylabel(r'$\sigma(z)$')
plt.title('logistic function')
plt.grid()
plt.show()

## Define the cost function

As described above, the output of the model $y = \sigma(x * w^T)$ can be interpreted as a probability $y$ that sample $x$ belongs to one class $(t=1)$, or probability $1-y$ that $x$ belongs to the other class $(t=0)$ in a two-class classification problem. We write this as: $P(t=1| x; w) = \sigma(x * w^T) = y$.

This model will be optimized by maximizing the [likelihood](http://en.wikipedia.org/wiki/Likelihood_function) that a given set of parameters $w$ can predict the correct class given an input set $x$ of size $n$, and corresponding labels $t$:

$$\underset{w}{\text{argmax}}\; \mathcal{L}(w|t,x) = \underset{w}{\text{argmax}} \prod_{i=1}^{n} \mathcal{L}(w|t_i,x_i)$$

We can rewrite the likelihood $\mathcal{L}(w|t,x)$ as the [joint probability](http://en.wikipedia.org/wiki/Joint_probability_distribution) of generating $t$ and $x$ given the parameters $w$: $P(t,x|w)$. Since $P(A,B) = P(A|B)*P(B)$ this can be written as:

$$P(t,x|w) = P(t|x;w)P(x|w)$$

Since we are not interested in the probability of $x$, and $x$ is independent of $w$ we can reduce this to: $\mathcal{L}(w|t,x) = P(t|x;w) = \prod_{i=1}^{n} P(t_i|x_i;w)$. 
Since $t_i$ is a [Bernoulli variable](http://en.wikipedia.org/wiki/Bernoulli_distribution) we can rewrite this as: 

$$\begin{split}
P(t|x;w) & = \prod_{i=1}^{n} P(t_i=1|x_i;w)^{t_i} * (1 - P(t_i=1|x_i;w))^{1-t_i} \\
& = \prod_{i=1}^{n} y_i^{t_i} * (1 - y_i)^{1-t_i} \end{split}$$

Since the logarithmic function is a monotonically increasing function we can optimize the log-likelihood function $\underset{w}{\text{argmax}}\; log \mathcal{L}(w|t,x)$ instead. It reaches its maximum at the same $w$ as the regular likelihood function. The log-likelihood function can be written as:

$$\begin{split} log \mathcal{L}(w|t,x) & = log \prod_{i=1}^{n} y_i^{t_i} * (1 - y_i)^{1-t_i} \\
& = \sum_{i=1}^{n} t_i log(y_i) + (1-t_i) log(1 - y_i)
\end{split}$$
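
As a small sanity check of this step, the sketch below compares the logarithm of the product form with the sum form. The targets and outputs (`t_example`, `y_example`) are made-up values chosen only for illustration.

```python
import numpy as np

# Hypothetical targets and model outputs, for illustration only
t_example = np.array([1, 0, 1, 1, 0])
y_example = np.array([0.9, 0.2, 0.7, 0.6, 0.1])

# Likelihood: product of y_i^t_i * (1 - y_i)^(1 - t_i)
likelihood = np.prod(
    y_example**t_example * (1 - y_example)**(1 - t_example))

# Log-likelihood: sum of t_i*log(y_i) + (1 - t_i)*log(1 - y_i)
log_likelihood = np.sum(
    t_example * np.log(y_example)
    + (1 - t_example) * np.log(1 - y_example))

# Both printed values should agree up to floating-point rounding
print(np.log(likelihood), log_likelihood)
```

Working with the sum has a practical benefit as well: a product of many probabilities quickly becomes too small to represent in floating point, while the sum of their logarithms does not.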

Minimizing the negative of this function (minimizing the negative log likelihood) corresponds to maximizing the likelihood. This error function $\xi(t,y)$ is typically known as the [cross-entropy error function](http://en.wikipedia.org/wiki/Cross_entropy) (also known as log-loss):

$$\begin{split}
\xi(t,y) & = - log \mathcal{L}(w|t,x) \\
& = - \sum_{i=1}^{n} \left[ t_i log(y_i) + (1-t_i)log(1-y_i) \right] \\
& = - \sum_{i=1}^{n} \left[ t_i log(\sigma(x_i * w^T)) + (1-t_i)log(1-\sigma(x_i * w^T)) \right]
\end{split}$$

This function looks complicated, but besides the previous derivation there are a couple of intuitions for why this function is used as a cost function for logistic regression. First of all it can be rewritten as:

$$ \xi(t_i,y_i) = 
   \begin{cases}
   -log(y_i) & \text{if } t_i = 1 \\
   -log(1-y_i) & \text{if } t_i = 0
  \end{cases}$$
  
Which in the case of $t_i=1$ is $0$ if $y_i=1$ $(-log(1)=0)$ and goes to infinity as $y_i \rightarrow 0$ $(\underset{y \rightarrow 0}{\text{lim}}  -log(y) = +\infty)$. The reverse effect is happening if $t_i=0$.  
So what we end up with is a cost function that is $0$ if the probability to predict the correct class is $1$, and goes to infinity as the probability to predict the correct class goes to $0$.
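
To make this behaviour concrete, here is a minimal sketch that evaluates the per-sample cost for a few predicted probabilities. The function name `cross_entropy_sample` and the listed probabilities are assumptions made for this illustration.

```python
import numpy as np

def cross_entropy_sample(y_i, t_i):
    """Cost of one sample: -log(y_i) if t_i = 1, else -log(1 - y_i)."""
    return -np.log(y_i) if t_i == 1 else -np.log(1 - y_i)

# Hypothetical predicted probabilities for a sample whose target is t_i = 1
for y_i in [1.0, 0.9, 0.5, 0.1, 0.001]:
    # The cost is 0 for y_i = 1 and grows as y_i moves towards 0
    print(y_i, cross_entropy_sample(y_i, 1))
```

For a sample with $t_i=0$ the roles are mirrored: `cross_entropy_sample(0.1, 0)` gives the same cost as `cross_entropy_sample(0.9, 1)`.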

Notice that the cost function $\xi(t_i,y_i)$ is equal to the negative [log probability](http://en.wikipedia.org/wiki/Log_probability) that $x_i$ is classified as its correct class:  
$-log(P(t_i=1| x_i,w)) = -log(y_i)$,  
$-log(P(t_i=0| x_i,w)) = -log(1-y_i)$.

By minimizing the negative log probability we will maximize the log probability.

Note that since $t_i$ can only be $0$ or $1$, we can write $\xi(t_i,y_i)$ as:
$$ \xi(t_i,y_i) = -t_i log(y_i) - (1-t_i)log(1-y_i) $$

Which will give $\xi(t,y) = - \sum_{i=1}^{n} \left[ t_i log(y_i) + (1-t_i)log(1-y_i) \right]$ if we sum over all $n$ samples.



Another reason to use the cross-entropy function is that in simple logistic regression this results in a [convex](http://en.wikipedia.org/wiki/Convex_function) cost function, of which the global minimum will be easy to find. Note that this is not necessarily the case anymore in multilayer neural networks.
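
The convexity claim can be probed numerically with a midpoint test: for a convex function, the cost at the midpoint of two weight vectors never exceeds the average of the costs at the two endpoints. The sketch below is only an illustration under several assumptions: the data is random and made up, the seed and the weight range are arbitrary, and the cost is written in the algebraically equivalent form $\sum_i \left[ log(1 + e^{z_i}) - t_i z_i \right]$ with $z_i = x_i * w^T$, chosen here to avoid $log(0)$ for large $|z_i|$.

```python
import numpy as np

def cost_stable(w, x, t):
    """Cross-entropy written as sum_i [log(1 + exp(z_i)) - t_i * z_i].

    Equal to -sum_i [t_i log(y_i) + (1 - t_i) log(1 - y_i)] with
    y_i = sigma(z_i), but it never evaluates log(0).
    """
    z = x.dot(w)
    return np.sum(np.logaddexp(0, z) - t * z)

rng = np.random.RandomState(0)          # assumed seed
x = rng.randn(40, 2)                    # hypothetical inputs
t = (rng.rand(40) < 0.5).astype(float)  # hypothetical 0/1 targets

# Midpoint test on randomly drawn pairs of weight vectors
violations = 0
for _ in range(1000):
    w_a, w_b = rng.uniform(-5, 5, size=(2, 2))
    mid = cost_stable((w_a + w_b) / 2, x, t)
    avg = (cost_stable(w_a, x, t) + cost_stable(w_b, x, t)) / 2
    if mid > avg + 1e-9:  # small tolerance for rounding errors
        violations += 1

print(violations)
```

A passing test on random pairs is not a proof of convexity, but a single violation would disprove it.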

In [None]:
# Ignore floating-point warnings such as log(0) while evaluating the cost grid
np.seterr(all='ignore')

# Define the neural network function y = 1 / (1 + numpy.exp(-x*w))
def nn(x, w): return logistic(x.dot(w.T))

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(x,w): return np.around(nn(x,w))

# Define the cost function for a single sample:
#  -log(yi) if ti = 1, -log(1-yi) if ti = 0
def cost_sample(yi, ti):
    if ti > 0.5:
        return -np.log(yi)
    else:
        return -np.log(1-yi)

# Define the cost function as the sum of the costs of all samples.
# The per-sample branch avoids the nan that 0 * log(0) would produce
#  in the vectorized form t*log(y) + (1-t)*log(1-y).
def cost(y, t):
    return np.sum([cost_sample(yi, ti) for yi, ti in zip(y,t)])

# Evaluate the cost on a grid of weight values (w1, w2)
nb_of_ws = 100  # number of grid points per weight
ws1 = np.linspace(-5, 5, num=nb_of_ws)  # values of w1
ws2 = np.linspace(-5, 5, num=nb_of_ws)  # values of w2
# ws_x[i,j] = ws1[i] and ws_y[i,j] = ws2[j]
ws_x, ws_y = np.meshgrid(ws1, ws2, indexing='ij')
cost_ws = np.zeros((nb_of_ws, nb_of_ws))  # cost for each (w1, w2) pair
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        cost_ws[i,j] = cost(nn(x, np.array([[ws_x[i,j], ws_y[i,j]]])) , t)

# Plot the cost as a function of the two weights
plt.contourf(ws_x, ws_y, cost_ws, 20)
plt.colorbar()
plt.xlabel('$w_1$')
plt.ylabel('$w_2$')
plt.title('cost as a function of the weights')
plt.grid()
plt.show()

## Gradient descent optimization of the cost function

In [None]:
from matplotlib.colors import colorConverter, ListedColormap

# Build a grid over the input space on which to evaluate the classifier
nb_of_xs = 100
xs1 = np.linspace(-4, 4, num=nb_of_xs)
xs2 = np.linspace(-4, 4, num=nb_of_xs)
xx, yy = np.meshgrid(xs1, xs2)

# Predict the class at each grid point for a fixed example weight vector
im = np.zeros((nb_of_xs, nb_of_xs))
w = np.asmatrix([1,2])
for i in range(nb_of_xs):
    for j in range(nb_of_xs):
        # nn_predict returns a 1x1 result, so take its single element
        im[i,j] = nn_predict(np.asmatrix([xx[i,j], yy[i,j]]) , w)[0,0]

# Color map with transparent red for class 0 and transparent blue for class 1
cmap = ListedColormap([
        colorConverter.to_rgba('r', alpha=0.40),
        colorConverter.to_rgba('b', alpha=0.40)])

# Plot the predicted class regions together with the samples
plt.contourf(xx, yy, im, cmap=cmap)
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='target red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='target blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('red vs blue classification boundary')
plt.show()
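
The cell above draws the boundary for a fixed weight vector and does not yet optimize anything. As a minimal sketch of how gradient descent could be applied to the cross-entropy cost, the code below uses the standard gradient $\partial \xi / \partial w = \sum_{i=1}^{n} (y_i - t_i) x_i$. The starting point, the learning rate and the number of iterations are assumptions chosen for illustration, and the data is regenerated with a local random generator using the same parameters as the class distributions defined earlier.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn(x, w):
    """Model output y = sigma(x * w^T) for inputs x (n, 2) and weights w (1, 2)."""
    return logistic(x.dot(w.T))

def cost(y, t):
    """Cross-entropy summed over all samples."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, x, t):
    """Gradient of the cost with respect to w: sum_i (y_i - t_i) * x_i."""
    return (nn(x, w) - t).T.dot(x)

# Two Gaussian classes with the same parameters as earlier in this tutorial
rng = np.random.RandomState(1)  # local generator (assumption)
n = 20
x_red = rng.randn(n, 2) * 1.2 + [-1, 0]
x_blue = rng.randn(n, 2) * 1.2 + [1, 0]
x = np.vstack((x_red, x_blue))
t = np.vstack((np.zeros((n, 1)), np.ones((n, 1))))

w = np.array([[-2.0, -1.0]])  # arbitrary starting point (assumption)
learning_rate = 0.02          # assumed step size
costs = [cost(nn(x, w), t)]   # cost before and after every update
for _ in range(50):           # assumed number of iterations
    w = w - learning_rate * gradient(w, x, t)
    costs.append(cost(nn(x, w), t))

print(w, costs[0], costs[-1])
```

Because the cost is convex, the recorded costs should decrease towards the single global minimum as long as the step size is small enough. The resulting `w` could then replace the fixed weight vector when plotting the classification boundary.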

In [None]:
# Scratch test of matplotlib 3D surface plotting on z = sin(sqrt(x^2 + y^2))
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
# Build the (x, y) grid on which the surface is evaluated
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
print(X.shape)
print(Y.shape)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.coolwarm,
        linewidth=0, antialiased=False)
ax.set_zlim(-1.01, 1.01)

ax.zaxis.set_major_locator(LinearLocator(10))
ax.zaxis.set_major_formatter(FormatStrFormatter('%.02f'))

fig.colorbar(surf, shrink=0.5, aspect=5)

plt.show()



In [None]:
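# log(0) is -inf, and 0 * (-inf) evaluates to nan
print(np.log(0) * 0)
# The literal 0.99999999999999999999 is rounded to 1.0 in floating point
np.isclose(0.99999999999999999999, 1)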
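
The first line above evaluates to `nan`, because $0 \cdot (-\infty)$ is undefined in floating point. The vectorized formula $t_i log(y_i) + (1-t_i)log(1-y_i)$ runs into exactly this when a prediction is exactly $0$ or $1$. One common workaround, sketched here as an assumption and not as the method used in this tutorial, is to clip the predictions away from $0$ and $1$. The function name `cost_clipped` and the value of `eps` are arbitrary choices.

```python
import numpy as np

def cost_clipped(y, t, eps=1e-12):
    """Vectorized cross-entropy with predictions clipped to [eps, 1 - eps]."""
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

y = np.array([0.0, 1.0, 0.8])  # hypothetical predictions, two of them saturated
t = np.array([0.0, 1.0, 1.0])  # hypothetical targets

# Naive vectorized formula: 0 * log(0) makes the whole sum nan
with np.errstate(all='ignore'):
    naive = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

print(naive, cost_clipped(y, t))
```

Clipping slightly changes the cost of saturated predictions, so `eps` trades a small bias for a finite result.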