## Multi-Layer Neural Networks

<br>
A single perceptron is a linear classifier, so it cannot correctly classify the outputs of an [XOR operation](https://en.wikipedia.org/wiki/Exclusive_or). A multi-layer neural network with one or more hidden layers can go beyond a linear classifier. A simple multi-layer neural network can be broken down into a set of computational procedures, carried out in order and in cycles. Each step has its own purpose, so for clarity we often write a separate function for each of them. The main terms are introduced below.
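
To make the XOR limitation concrete, here is a minimal sketch that brute-forces a single linear threshold unit over a small grid of weights and biases. The grid range, the step size, and the helper name `linear_accuracy` are illustrative assumptions, not part of the original notebook.

```python
import itertools
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def linear_accuracy(w1, w2, b):
    """Accuracy of a threshold unit that predicts 1 when w1*x1 + w2*x2 + b > 0."""
    pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
    return np.mean(pred == y)

# illustrative grid (assumption): values from -2 to 2 in steps of 0.25
grid = np.linspace(-2, 2, 17)
best = max(linear_accuracy(w1, w2, b)
           for w1, w2, b in itertools.product(grid, repeat=3))
print(best)  # 0.75: a single linear boundary gets at most 3 of the 4 points right
```

No choice of weights reaches an accuracy of 1.0, because the XOR outputs are not linearly separable.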

### 1. Activation Function 

An activation function defines the transformation of input to output at a single node. The activation function has to be nonlinear; otherwise each node can only carry out a linear transformation, and a stack of such layers collapses into a single linear map. Several popular activation functions are:
* Sigmoid 
$$
\begin{equation*}
\sigma(z) = \frac{1}{1 + e^{-z}}
\end{equation*}
$$
* Hyperbolic tangent 
$$
\begin{equation*}
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
\end{equation*}
$$
* Rectified linear unit (ReLU)
$$
\begin{equation*}
f(z) = 
\begin{cases}
z  & \text{if } z \geq 0 \\
0  & \text{if } z < 0
\end{cases}
\end{equation*}
$$


In [25]:
# sigmoid example
import numpy as np

def sigmoid(x):
    """
    Args
        x: numerical value or array of values
    Returns 
        sigmoid transformation
    """
    return 1 / (1 + np.exp(-x))
    

In [26]:
# test
sigmoid(1)

0.7310585786300049
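
The other two activation functions listed above can be sketched the same way. The function names `tanh` and `relu` and the sample input `z` are assumptions chosen for illustration; `np.tanh` and `np.maximum` are standard NumPy functions.

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent, output in (-1, 1)."""
    return np.tanh(z)

def relu(z):
    """Rectified linear unit: z where z >= 0, otherwise 0."""
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])  # illustrative input
tanh_out = tanh(z)
relu_out = relu(z)
print(relu_out)  # [0. 0. 2.]
```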

### 2. Cost Function 

A cost function, also called a loss function or error function, measures the error between the model output and the target output. Every variation of ANN needs a cost function. By minimizing the cost function, an ANN iteratively updates its weights and achieves better predictive power (at least on the training data). Some popular cost functions are:
* Mean squared error 
$$
\begin{equation*}
J = \frac{1}{N}\sum_i (y_i - \hat{y_i})^{2}
\end{equation*}
$$
* Cross entropy
$$
\begin{equation*}
J = -\sum_i p_i \log \hat{p_i}
\end{equation*}
$$


In [27]:
# example calculate MSE loss

def calculate_error(target, model_output):
    """
    Args
        target (array): target values
        model_output (array): model predictions, same shape as target
    Returns
        MSE loss  
    """
    return np.sum((target - model_output)**2) / len(target)
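
The cross-entropy formula above can be sketched in the same style. The function name `cross_entropy`, the clipping constant `eps`, and the example probabilities are assumptions added for illustration; clipping is only there to avoid `log(0)`.

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """
    Args
        p (array): target distribution (for example, a one-hot label)
        p_hat (array): predicted probabilities, same shape as p
        eps (float): lower clip for p_hat to avoid log(0) (assumption)
    Returns
        cross-entropy loss
    """
    p = np.asarray(p, dtype=float)
    p_hat = np.clip(np.asarray(p_hat, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(p_hat))

# illustrative values: a confident correct prediction vs. an uncertain one
loss_confident = cross_entropy([0, 1, 0], [0.05, 0.9, 0.05])
loss_uncertain = cross_entropy([0, 1, 0], [0.3, 0.4, 0.3])
```

With a one-hot target, only the predicted probability of the true class contributes, so the confident prediction yields the smaller loss.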
    

### 3. Forward Propagation 

Forward propagation is the process in which input signals propagate forward through the network and generate an output at the last layer. Between two adjacent layers we have the calculations:
<br>
$$
\begin{equation*}
Z = XW + b \\
A = \sigma(Z) 
\end{equation*}
$$
<br>
where $X$ is the input (or the output of the preceding layer), $W$ the weights, $b$ the bias, and $\sigma$ the activation function. For simplicity, the code below omits the bias term $b$.

In [28]:
def forward_propagation(X, W):
    """
    Args 
        X (array): m x n matrix  
                   m is number of observations, n is dimension of input  
        W (array): n x p matrix 
                   n is number of nodes in preceding layer, p is number of nodes in next layer
    Returns
        (array): m x p matrix of sigmoid activations (no bias term)
    """
    return sigmoid(np.dot(X, W))
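
The `forward_propagation` function above computes $\sigma(XW)$ without the bias. A minimal sketch of the full $Z = XW + b$ step might look like the following; the name `forward_with_bias`, the row-vector shape of `b`, and the zero-valued example weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_with_bias(X, W, b):
    """
    Args
        X (array): m x n input matrix
        W (array): n x p weight matrix
        b (array): 1 x p bias row, broadcast across the m observations (assumption)
    Returns
        (array): m x p matrix of sigmoid activations
    """
    return sigmoid(np.dot(X, W) + b)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W = np.zeros((2, 3))  # illustrative weights
b = np.zeros((1, 3))  # illustrative bias
A = forward_with_bias(X, W, b)
print(A.shape)  # (4, 3)
```

With all weights and biases at zero, every activation equals `sigmoid(0) = 0.5`, which is a quick sanity check on the shapes and the broadcasting.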

### 4. Backward Propagation and Update Weights 

Many methods have been tried for finding a good set of weights. The standard approach is to examine the relationship between the cost function $J$ and each individual weight $w_i$: we take the derivative of $J$ with respect to $w_i$ and, at each iteration, move the weight in the direction that reduces the error (gradient descent). Each weight $w_i$ and bias $b$ is updated as follows:
<br>
$$
\begin{equation*}
w_i = w_i - \alpha \frac{\partial J}{\partial w_i} \\
b = b - \alpha \frac{\partial J}{\partial b}
\end{equation*}
$$
<br>
with $\alpha$ being the learning rate, a hyperparameter that determines the size of the change at each iteration.
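
As a worked example of the derivative, consider a single sigmoid node with $z = \sum_i x_i w_i + b$ and $a = \sigma(z)$, and assume the squared-error cost $J = \frac{1}{2}(a - y)^2$ for one observation (the factor $\frac{1}{2}$ is an assumption made here to simplify the derivative). Using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, the chain rule gives:

$$
\begin{equation*}
\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial a} \, \frac{\partial a}{\partial z} \, \frac{\partial z}{\partial w_i} = (a - y) \, a (1 - a) \, x_i
\end{equation*}
$$

$$
\begin{equation*}
\frac{\partial J}{\partial b} = (a - y) \, a (1 - a)
\end{equation*}
$$

The factor $(a - y)\,a(1 - a)$ corresponds to the `diff * (A * (1.0 - A))` term in `back_propagation`.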

In [29]:

def back_propagation(diff, A, X):
    """
    Args 
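        diff: error signal at this layer's output
              (layer output minus target for the last layer)
        A: sigmoid output of this layer
        X: input to this layer (output of the preceding layer)
    Returns
        gradient of the cost with respect to this layer's weights
    """
    # sigmoid derivative written in terms of its output: A * (1 - A)
    delta = diff * (A * (1.0 - A)) 
    # sum over observations: (n, m) x (m, p) -> (n, p)
    gradient = np.dot(X.T, delta) 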
    return gradient
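
One way to gain confidence in a gradient formula like the one above is a finite-difference check. The sketch below does this for a single sigmoid layer, assuming a cost of half the sum of squared errors; the helper names, the random test data, and the step size `h` are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W, X, y):
    # assumed cost: half the sum of squared errors
    return 0.5 * np.sum((sigmoid(X @ W) - y) ** 2)

def analytic_gradient(W, X, y):
    A = sigmoid(X @ W)
    delta = (A - y) * A * (1.0 - A)  # same form as in back_propagation
    return X.T @ delta

def numerical_gradient(W, X, y, h=1e-6):
    # central difference on each weight, one at a time
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[idx] += h
        W_minus[idx] -= h
        grad[idx] = (loss(W_plus, X, y) - loss(W_minus, X, y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)  # illustrative random data
X = rng.normal(size=(5, 2))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W = rng.normal(size=(2, 1))

max_diff = np.max(np.abs(analytic_gradient(W, X, y) - numerical_gradient(W, X, y)))
```

If the analytic formula is right, `max_diff` should be tiny, limited only by floating-point error.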


### 5. Construct Neural Network to Learn XOR Operation  

Using the above pieces we can construct a simple 2-layer neural network (one hidden layer plus one output layer; the input layer does not count). The network has an input layer of 2 nodes, a hidden layer of 3 nodes, and an output layer of one node.

In [30]:
np.random.seed(0)
num_nodes = 3

X_data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]) # matrix with dimensions (4, 2)
y_data = np.array([[0, 1, 1, 0]]).T  # dim (4, 1)

W1 = 0.1 * np.random.randn(X_data.shape[1], num_nodes) # dim (2, 3)
W2 = 0.1 * np.random.randn(num_nodes, 1) # dim (3, 1)


In [31]:
epoch = 10000 
learning_rate = 1

for i in range(epoch):

    # forward
    a1 = forward_propagation(X_data, W1)
    a2 = forward_propagation(a1, W2)

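    # back: error at the output layer
    d2 = (a2 - y_data) 
    gradient2 = back_propagation(d2, a2, a1)

    # error passed back to the hidden layer through W2
    # (note: this uses the raw output error d2, not d2 scaled by a2 * (1 - a2))
    d1 = np.dot(d2, W2.T)
    gradient1 = back_propagation(d1, a1, X_data)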

    # update weights
    W2 -= learning_rate * gradient2
    W1 -= learning_rate * gradient1 

In [32]:

a1 = forward_propagation(X_data, W1)
a2 = forward_propagation(a1, W2)
print(a2)

[[ 0.03408027]
 [ 0.97995751]
 [ 0.97995751]
 [ 0.00142586]]
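
The outputs above are continuous values between 0 and 1. To read them as XOR labels, one option is to threshold them. The sketch below copies the printed values and assumes a cutoff of 0.5, which is a conventional choice and not something the notebook specifies.

```python
import numpy as np

# network outputs copied from the printed result above
a2 = np.array([[0.03408027], [0.97995751], [0.97995751], [0.00142586]])
y_data = np.array([[0, 1, 1, 0]]).T

# assumed decision threshold of 0.5
predictions = (a2 > 0.5).astype(int)
print(predictions.ravel())                  # [0 1 1 0]
print(np.array_equal(predictions, y_data))  # True
```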
