## Multi-Layer Neural Networks

<br>
A single perceptron is a linear classifier, so it cannot correctly classify the outputs of an [XOR operation](https://en.wikipedia.org/wiki/Exclusive_or), which are not linearly separable. A multi-layer neural network, consisting of an input layer, one or more hidden layers, and an output layer, can go beyond a linear classifier. A simple multi-layer neural network can be broken down into a set of computational steps, carried out in order and in cycles. Each step has its own purpose, so for clarity we usually write a separate function for each one. The main terms are introduced below.

### 1. Activation Function 

An activation function defines the transformation from input to output at a single node. It has to be nonlinear; otherwise the node can only carry out a linear transformation, and stacking such nodes still yields a linear model. Several popular activation functions are:
* Sigmoid 
$$
\begin{equation*}
\sigma(z) = \frac{1}{1 + e^{-z}}
\end{equation*}
$$
* Hyperbolic tangent 
$$
\begin{equation*}
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}
\end{equation*}
$$
* Rectified linear unit (ReLU)
$$
\begin{equation*}
f(z) = 
\begin{cases}
z  & \text{if } z \geq 0 \\
0  & \text{if } z < 0 \\
\end{cases}
\end{equation*}
$$


In [14]:
# sigmoid example
import numpy as np

def sigmoid(x):
    """
    Args
        x: numerical value or array of values
    Returns 
        output after sigmoid transformation
    """
    return 1 / (1 + np.exp(-x))
    

In [15]:
sigmoid(1)

0.7310585786300049
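
The hyperbolic tangent and ReLU from the list above can be sketched in the same way as the sigmoid. This is a minimal illustration; the function names `tanh` and `relu` are chosen here and are not part of the original notebook.

```python
import numpy as np

def tanh(z):
    """Hyperbolic tangent, written out from its definition."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    """Rectified linear unit: z where z >= 0, otherwise 0."""
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(tanh(z))  # agrees with NumPy's built-in np.tanh(z)
print(relu(z))  # negative entries are replaced by 0
```

In practice `np.tanh` is the more numerically robust choice for large inputs; the explicit form is shown only to mirror the formula.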

### 2. Cost Function 

A cost function, also called a loss function or error function, measures the error between the model output and the target output. Every variant of ANN needs a cost function. By minimizing it, an ANN learns to update its weights incrementally and achieve better predictive power. Whenever the network gives an output that is far from the target, the weights have to change substantially in the next iteration. Some popular cost functions are:
* Mean squared error 
$$
\begin{equation*}
J = \frac{1}{N}\sum_i (y_i - \hat{y}_i)^{2}
\end{equation*}
$$
* Cross entropy
$$
\begin{equation*}
J = -\sum_i p_i \log \hat{p}_i
\end{equation*}
$$


In [16]:
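def calculate_error(target, model_output):
    """
    Args
        target (array): target values
        model_output (array): values predicted by the model
    Returns
        mean squared error divided by 2
        (the extra factor 1/2 cancels the 2 that appears when differentiating)
    """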
    return np.sum((model_output - target)**2) / (2 * len(target))
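
Cross entropy, the second cost function listed above, can be sketched along the same lines. This is only an illustration under stated assumptions: the name `cross_entropy` and the clipping constant `eps` are introduced here and do not appear in the original notebook.

```python
import numpy as np

def cross_entropy(p, p_hat, eps=1e-12):
    """
    Args
        p (array): target probabilities (for example a one-hot vector)
        p_hat (array): predicted probabilities
        eps (float): assumed small constant that keeps log() away from 0
    Returns
        cross entropy J = -sum_i p_i * log(p_hat_i)
    """
    p_hat = np.clip(p_hat, eps, 1.0)
    return -np.sum(p * np.log(p_hat))

p = np.array([0.0, 1.0, 0.0])                          # true class is the second one
good = cross_entropy(p, np.array([0.1, 0.8, 0.1]))     # most mass on the true class
bad = cross_entropy(p, np.array([0.8, 0.1, 0.1]))      # most mass on a wrong class
print(good, bad)                                       # the second value is larger
```

With a one-hot target only the predicted probability of the true class contributes, so the cost reduces to the negative log of that probability.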
    

### 3. Forward Propagation 

Forward propagation starts from the input layer, multiplies the values by the corresponding weights, applies the activation function at each hidden layer, and finally arrives at the output layer. Between two consecutive layers we have
$$
\begin{equation*}
Z = WX + b \\
a = \sigma(Z) 
\end{equation*}
$$

In [13]:
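def forward_propagation(W, X, b):
    """
    Args 
        W (array): q X n matrix 
                   q is number of nodes in latter layer, n is number of nodes in former layer
        X (array): m X n matrix  
                   m is number of observations, n is dimension of input  
        b (float or array): bias, broadcast against the q X m product
    Returns
        q X m array of activations, one column per observation
    """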
    return sigmoid(np.dot(W, X.T) + b)
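
To make the shapes concrete, here is a small self-contained sketch of one forward step. The layer sizes, the random values, and the seed are arbitrary assumptions picked for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
X = rng.random((5, 3))           # 5 observations, 3 input features
W = rng.normal(size=(4, 3))      # 4 nodes in the latter layer, 3 in the former
b = 0.1                          # a single bias shared by the whole layer

Z = np.dot(W, X.T) + b           # linear step, shape (4, 5)
a = sigmoid(Z)                   # element-wise activation, same shape

print(Z.shape, a.shape)          # (4, 5) (4, 5)
```

Each column of `a` holds the activations of the four nodes for one observation, so the result has to be transposed before it is passed to the next layer as `X`.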

### 4. Backward Propagation and Update Weights 

Many methods have been tried for finding a good set of weights. The standard approach is to exploit the relationship between the cost function $J$ and each individual weight $w_i$: take the derivative of $J$ with respect to $w_i$ and adjust $w_i$ according to the error at each iteration (gradient descent). Each weight $w_i$ and bias $b$ is updated as follows:
<br>
$$
\begin{equation*}
w_i = w_i - \alpha \frac{\partial J}{\partial w_i} \\
b = b - \alpha \frac{\partial J}{\partial b}
\end{equation*}
$$
<br>
with $\alpha$ being the learning rate, a hyperparameter that determines the size of the change at each iteration.
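
The update rule can be illustrated on a single sigmoid neuron with one observation. The sketch below uses hypothetical toy values for `w`, `b`, `x`, `y`, and `alpha`, computes the derivatives with the chain rule, and compares $\partial J / \partial w$ against a finite-difference estimate as a sanity check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(w, b, x, y):
    """Halved squared error of a single sigmoid neuron on one observation."""
    return 0.5 * (sigmoid(w * x + b) - y) ** 2

# hypothetical toy values, not taken from the original notebook
w, b = 0.5, 0.0
x, y = 1.0, 1.0
alpha = 0.1

a = sigmoid(w * x + b)
# chain rule, using sigmoid'(z) = a * (1 - a)
dJ_dw = (a - y) * a * (1 - a) * x
dJ_db = (a - y) * a * (1 - a)

# central finite difference as an independent check of dJ/dw
h = 1e-6
numeric = (cost(w + h, b, x, y) - cost(w - h, b, x, y)) / (2 * h)
print(dJ_dw, numeric)  # the two values should agree closely

# one gradient descent step
w_new = w - alpha * dJ_dw
b_new = b - alpha * dJ_db
print(cost(w, b, x, y), cost(w_new, b_new, x, y))  # the cost is lower after the step
```

Backward propagation applies the same chain rule layer by layer, reusing the error term of the later layer to compute the gradient of the earlier one.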

### 5. Construct Neural Network to Learn XOR Operation  

Using the individual pieces above, we can construct a simple 2-layer neural network (one hidden layer plus one output layer; the input layer does not count).

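In [None]:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # inputs, shape (4, 2)
y = np.array([[0, 1, 1, 0]]).T                  # XOR targets, shape (4, 1)

# initial weights as float arrays so they can be updated in place
# (rows: nodes in former layer, columns: nodes in latter layer)
W1 = np.array([[0.1, -0.1], [0.1, -0.1]])
b1 = 0.1
# W2 has two columns, i.e. two output nodes; both are trained toward y by broadcasting
W2 = np.array([[-0.1, -0.1], [0.1, 0.1]])
b2 = -0.1

learning_rate = 0.5  # assumed value, not given in the original
n_epochs = 10000     # assumed value, not given in the original

for epoch in range(n_epochs):
    # forward propagation on all four observations at once; the transposes
    # convert forward_propagation's (nodes, observations) layout to (observations, nodes)
    a1 = forward_propagation(W1.T, X, b1).T   # hidden activations, shape (4, 2)
    a2 = forward_propagation(W2.T, a1, b2).T  # output activations, shape (4, 2)

    e = calculate_error(y, a2)

    # backward propagation (chain rule, sigmoid'(z) = a * (1 - a));
    # the constant 1/N of the cost is absorbed into the learning rate
    a2_delta = (a2 - y) * a2 * (1.0 - a2)
    W2_gradient = np.dot(a1.T, a2_delta)
    a1_delta = np.dot(a2_delta, W2.T) * a1 * (1.0 - a1)
    W1_gradient = np.dot(X.T, a1_delta)

    # update weights and the shared scalar bias of each layer
    W2 -= learning_rate * W2_gradient
    b2 -= learning_rate * np.sum(a2_delta)
    W1 -= learning_rate * W1_gradient
    b1 -= learning_rate * np.sum(a1_delta)

print(e)  # error after the last epoch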