The simplest neural network is a logistic regressor.  A logistic regressor takes in values of any range but only outputs values between 0 and 1.  If X is the input and y is the desired output, then we want to find a close approximation of the function f where

y = f(X)

Let's say X is an input vector with 3 components X1, X2, X3, and W is a vector of three weights.  To compute the output of the regressor, we must first do a **linear step**:

z = X.W + b

where b is the **bias**, which shifts the output by a constant value.

Next, we perform a **nonlinear step**:

o = A(z)

where A is an **activation function**, in this case a sigmoid function that transforms *z* into a value *o* between 0 and 1.
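
The squashing behaviour of the nonlinear step can be checked in isolation. The sketch below is only an illustration: the sample values in `z` are arbitrary and chosen here, not taken from the model. It applies a sigmoid to inputs of very different magnitudes and prints the results, which all fall strictly between 0 and 1.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

# Arbitrary example inputs (an assumption for illustration only)
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
o = sigmoid(z)
print(o)
```

Large negative inputs land close to 0, large positive inputs land close to 1, and an input of 0 maps to exactly 0.5.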


In [9]:
import numpy as np
np.random.seed(1)

# Define input X and desired output y
X = np.array([[0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

# Define sigmoid function
def sigmoid(x):
    return 1/(1+np.exp(-x))

# Start with random weights and bias = 0
W = 2*np.random.random((3, 1)) - 1  # Uniform random values in [-1, 1), so the mean is 0
b = 0

# Attempt 1:
# Linear step
z = X.dot(W) + b
# Nonlinear step
o = sigmoid(z)
print(o)


[[0.60841366]
 [0.45860596]
 [0.3262757 ]
 [0.36375058]]


## Optimising Model Parameters
We need to search for better weights and bias (collectively known as the parameters) to arrive at a closer approximation f^ of our desired function f.

We know that:

y = f(X)

and 

y^ = f^(X)

We can try to find f^ by minimising:

D(y, y^)

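where f^ belongs to H, the **hypothesis space** (the set of candidate functions we are willing to consider).

D is referred to as the **loss function**; it measures how far the predictions y^ are from the desired outputs y.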
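
To make "minimising D over H" concrete, here is a minimal sketch that assumes a deliberately naive strategy, random search, which is not the method this notebook goes on to use. Each randomly drawn pair (W, b) is one member of the hypothesis space; the sketch scores it and keeps the best one seen. The helper names `predict` and `distance`, the use of a mean squared difference as a stand-in for D, and the choice of 200 candidates are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy data as in the cell above
X = np.array([[0, 1, 0], [1, 0, 0], [1, 1, 1], [0, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(X, W, b):
    # One member of the hypothesis space: a logistic regressor with parameters (W, b)
    return sigmoid(X.dot(W) + b)

def distance(y, y_hat):
    # Stand-in for D: mean squared difference (an assumption for this sketch)
    return float(np.mean((y - y_hat) ** 2))

best_loss, best_params = np.inf, None
for _ in range(200):
    # Draw a candidate at random and keep it if it beats the best so far
    W = rng.uniform(-1, 1, size=(3, 1))
    b = rng.uniform(-1, 1)
    loss = distance(y, predict(X, W, b))
    if loss < best_loss:
        best_loss, best_params = loss, (W, b)

print(best_loss)
```

Random search ignores everything the loss could tell us about which direction to move in, which is why a gradient-based method is normally preferred.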


## Loss function
For a binary classification problem, we can use the binary cross-entropy loss.
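
Written out under the usual mean-over-samples convention, binary cross-entropy takes the following form. The symbols N (the number of samples), y_i (the i-th desired output) and \hat{y}_i (the i-th prediction, i.e. y^) are notation introduced here for the formula.

$$
D(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log\big(1 - \hat{y}_i\big) \Big]
$$

Each term is small when the prediction agrees with the label and grows without bound as a confident prediction turns out to be wrong.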

In [10]:
def bce_loss(y, y_hat):
    # Mean binary cross-entropy over the N samples (a single scalar)
    N = y.shape[0]
    loss = -1/N * np.sum(y*np.log(y_hat) + (1 - y)*np.log(1 - y_hat))
    return loss

# Loss for our initial approximation
print(f"{bce_loss(y, o):.4f}")

0.8223


## Gradient descent
The most popular optimization algorithm for neural networks is **gradient descent**.  This method requires that the loss function has a derivative with respect to the parameters that we want to optimize.  **Backpropagation** computes these derivatives, which allows us to apply gradient updates to the parameters of a model.
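
The update rule itself is easiest to see on a toy problem. The sketch below is not the logistic regressor: it assumes a made-up one-parameter loss, (w - 3)^2, whose derivative is known in closed form, and an arbitrarily chosen learning rate of 0.1. Each step moves the parameter a small distance against the gradient.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (an assumption for illustration)
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

w = 0.0              # initial guess
learning_rate = 0.1  # step size, chosen arbitrarily for this sketch

for step in range(100):
    # Gradient descent update: step against the gradient
    w = w - learning_rate * grad(w)

print(w, loss(w))
```

After the loop, `w` sits very close to 3, where the toy loss is smallest. The same update applies to the weights and bias of a network once the derivatives of the loss with respect to them are available.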

Let's say **dz** is the derivative of loss wrt 