<br>
<font size = '6'><b>Neural Networks</b></font>

- Ryan Harris
    - https://www.youtube.com/playlist?list=PL29C61214F2146796
    - https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6
    - <a href="./files/BackPropagation.pdf" target="_blank">Backpropagation Slides</a> 
    
- Gene Kogan   
    - http://ml4a.github.io/classes/neural-aesthetic/
    - https://github.com/ml4a/ml4a-guides/blob/master/notebooks/simple_neural_networks.ipynb

<table style="border-style: hidden; border-collapse: collapse;" width = "90%"> 
    <tr style="border-style: hidden; border-collapse: collapse;">
        <td width = 60% style="border-style: hidden; border-collapse: collapse;">
             
        </td>
        <td width = 30%>
        Collected by Prof. Seungchul Lee<br>
        iSystems Design Lab.<br>
        http://isystems.unist.ac.kr/<br>
        UNIST
        </td>
    </tr>
</table>

Table of Contents
<div id="toc"></div>


    

# 1. Structure of Neural Networks

__The neuron__

- The sigmoid function is typically used as the transfer function between neurons. It is similar to the step function, but it is continuous and differentiable.

$$ \sigma(x) = \frac{1}{1+e^{-x}}$$

- One useful property of this transfer function is the simplicity of computing its derivative.

$$\frac{d}{dx}\sigma(x) = \sigma' = \sigma(x) (1-\sigma(x))$$
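The derivative identity above can be checked numerically. This is a minimal sketch: the helper names `sigmoid` and `sigmoid_deriv` and the sample points are assumptions made for illustration, not part of the original material.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form with a central finite difference.
x = np.linspace(-5.0, 5.0, 11)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
max_abs_diff = np.max(np.abs(numeric - sigmoid_deriv(x)))
print(max_abs_diff)  # should be tiny if the identity holds
```

The closed form needs only one evaluation of $\sigma(x)$, which is why it is convenient during backpropagation.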

__Single input neuron__

<img src="./image_files/single_neuron.png" width = 300>

$$ O = \sigma(\xi \omega + \theta) $$

__Multiple input neuron__

<img src="./image_files/multiple_neuron.png" width = 300>

$$ O = \sigma(\xi_1 \omega_1 + \xi_2 \omega_2 + \xi_3 \omega_3 +\theta) $$
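As a rough illustration of the formula above, a neuron with several inputs can be sketched in a few lines. The function name `neuron` and the numeric values are assumptions chosen so that the weighted sum is exactly zero; they are not taken from the figures.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + np.exp(-x))

def neuron(xi, omega, theta):
    """Output O = sigma(sum_i xi_i * omega_i + theta) of a single neuron."""
    return sigmoid(np.dot(xi, omega) + theta)

# Illustrative values only.
xi = np.array([1.0, 2.0, 3.0])        # inputs
omega = np.array([0.5, -0.25, 0.0])   # weights
theta = 0.0                           # bias

# Weighted sum: 1*0.5 + 2*(-0.25) + 3*0 + 0 = 0, and sigma(0) = 0.5
O = neuron(xi, omega, theta)
print(O)
```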

__A neural network__

<img src="./image_files/nn_03.png" width = 300>

# 2. Learning: Backpropagation Algorithm

__Notation__

- $x_j^\ell$: Input to node $j$ of layer $\ell$

- $W_{ij}^\ell$: Weight from layer $\ell - 1$ node $i$ to layer $\ell$ node $j$

- $\sigma(x) = \frac{1}{1+e^{-x}}$: Sigmoid transfer function

- $\theta_j^{\ell}$: Bias of node $j$ of layer $\ell$

- $O_j^{\ell}$: Output of node $j$ in layer $\ell$

- $t_j$: Target value of node $j$ of the output layer

<br>
<font size='4'><b>The error calculation</b></font>

Given a set of training data points $t_k$ and output layer output $O_k$ we can write the error as

$$ E = \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2$$

We want to calculate $\frac{\partial E}{\partial W_{jk}^{\ell}}$, the rate of change of the error with respect to the given connective weight, so we can minimize it.

Now we consider two cases: the node is an output node, or it is in a hidden layer.

__1) Output layer node__

\begin{align*}
\frac{\partial E}{\partial W_{jk}} &= \frac{\partial}{\partial W_{jk}} \frac{1}{2} (O_k - t_k)^2 = (O_k - t_k)\frac{\partial}{\partial W_{jk}} O_k = (O_k - t_k)\frac{\partial}{\partial W_{jk}} \sigma(x_k)\\
&= (O_k - t_k) \sigma(x_k) (1-\sigma(x_k)) \frac{\partial}{\partial W_{jk}} x_k \\
&= (O_k - t_k) O_k (1 - O_k) O_j
\end{align*}

$\quad$For notation purposes, I will define $\delta_k$ to be the expression $(O_k - t_k) O_k (1 - O_k)$, so we can rewrite the equation above as

$$\frac{\partial E}{\partial W_{jk}} = O_j \delta_k $$

__2) Hidden layer node__

\begin{align*}
\frac{\partial E}{\partial W_{ij}} &= \frac{\partial}{\partial W_{ij}} \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2 = \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} O_k = \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} \sigma(x_k)\\
&= \sum_{k \in K} (O_k - t_k) \sigma(x_k) (1-\sigma(x_k)) \frac{\partial}{\partial W_{ij}} x_k \\
&= \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) \frac{\partial x_k}{\partial O_j}\cdot \frac{\partial O_j}{\partial W_{ij}} = \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\cdot \frac{\partial O_j}{\partial W_{ij}}\\
&= \frac{\partial O_j}{\partial W_{ij}} \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_j (1-O_j)\frac{\partial x_j}{\partial W_{ij}} \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_j (1-O_j)O_i \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_i O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}
\end{align*}

$\quad$Similar to before we will now define all terms besides $O_i$ to be $\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$, so we have

$$\frac{\partial E}{\partial W_{ij}} = O_i \delta_j$$


__How weights affect errors__

- For an output layer node $k \in K$

$$\frac{\partial E}{\partial W_{jk}} = O_j \delta_k $$

$\quad \;\,$where $$\delta_k = (O_k - t_k) O_k (1 - O_k)$$

- For a hidden layer node $j \in J$

$$\frac{\partial E}{\partial W_{ij}} = O_i \delta_j$$

$\quad \;\,$where $$\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$$

__What about the bias?__

If we incorporate the bias term $\theta$ into the input of a node, $x = \sum_i O_i W_i + \theta$, we find that

$$ \frac{\partial x}{\partial \theta} = 1$$

This is why we can view the bias term as the weight on the output of a node that is always one. This holds for any layer $\ell$; substituting into the previous equations gives

$$ \frac{\partial E}{\partial \theta} = \delta_{\ell}$$

<br>
<font size='4'><b>The backpropagation algorithm using gradient descent</b></font>

1. Run the network forward with your input data to get the network output

2. For each output node compute
$$\delta_k = (O_k - t_k) O_k (1 - O_k)$$
3. For each hidden node calculate
$$\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$$
4. Update the weights and biases as follows<br>
Given
$$\begin{align*}
\Delta W &= -\eta \delta_{\ell} O_{\ell -1}\\
\Delta \theta &= -\eta \delta_{\ell}
\end{align*}$$
apply
$$\begin{align*}
W &\leftarrow W + \Delta W \\
\theta &\leftarrow \theta + \Delta \theta
\end{align*}$$
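The four steps above can be sketched for a single training example. This is a minimal sketch under stated assumptions: the layer sizes (3 inputs, 4 hidden nodes, 2 outputs), the input and target values, the random initial weights, the learning rate `eta = 0.1`, and the helper names `forward` and `error` are all illustrative. A finite-difference check on one weight is included to confirm that $O_i \delta_j$ matches the numerical gradient.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(O_i, W_ij, theta_j, W_jk, theta_k):
    """Forward pass through one hidden layer (J) and the output layer (K)."""
    O_j = sigmoid(O_i @ W_ij + theta_j)
    O_k = sigmoid(O_j @ W_jk + theta_k)
    return O_j, O_k

def error(O_k, t):
    """E = 1/2 * sum_k (O_k - t_k)^2"""
    return 0.5 * np.sum((O_k - t) ** 2)

rng = np.random.default_rng(0)
O_i = np.array([0.2, -0.4, 0.9])   # inputs (illustrative values)
t = np.array([1.0, 0.0])           # targets (illustrative values)
W_ij = rng.normal(size=(3, 4))     # W_ij[i, j]: input i -> hidden j
W_jk = rng.normal(size=(4, 2))     # W_jk[j, k]: hidden j -> output k
theta_j = np.zeros(4)
theta_k = np.zeros(2)
eta = 0.1                          # learning rate (assumed value)

# Steps 1-3: forward pass, then the output and hidden deltas.
O_j, O_k = forward(O_i, W_ij, theta_j, W_jk, theta_k)
delta_k = (O_k - t) * O_k * (1 - O_k)
delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)   # sum over k of delta_k * W_jk

# Gradients: (output of the previous layer) times delta.
grad_W_jk = np.outer(O_j, delta_k)
grad_W_ij = np.outer(O_i, delta_j)

# Finite-difference check of one hidden-layer weight.
h = 1e-6
W_plus, W_minus = W_ij.copy(), W_ij.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (error(forward(O_i, W_plus, theta_j, W_jk, theta_k)[1], t)
           - error(forward(O_i, W_minus, theta_j, W_jk, theta_k)[1], t)) / (2 * h)
grad_check_diff = abs(numeric - grad_W_ij[1, 2])

# Step 4: gradient-descent update of weights and biases.
E_before = error(O_k, t)
W_jk -= eta * grad_W_jk
theta_k -= eta * delta_k
W_ij -= eta * grad_W_ij
theta_j -= eta * delta_j
E_after = error(forward(O_i, W_ij, theta_j, W_jk, theta_k)[1], t)
print(grad_check_diff, E_before, E_after)
```

With a small enough learning rate, the error after the update should be lower than before it.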



In [1]:
%%html
<iframe src="https://www.youtube.com/embed/aVId8KMsdUU?list=PL29C61214F2146796" 
width="560" height="315" frameborder="0" allowfullscreen></iframe>

In [2]:
%%html
<iframe src="https://www.youtube.com/embed/zpykfC4VnpM?list=PL29C61214F2146796" 
width="560" height="315" frameborder="0" allowfullscreen></iframe>

# 3. Implementation in python

This section tries to be very explicit about which parts are "up in the air" (i.e. modifiable) so you get a sense of where you can experiment with new neural networks. For a more in-depth treatment, see:

- The [ml4a chapter on neural networks](http://ml4a.github.io/ml4a/neural_networks/)

- Michael Nielsen's [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/)

- Goodfellow, Bengio, and Courville's [Deep Learning](http://www.deeplearningbook.org/) book

- Yoav Goldberg's "[A Primer on Neural Network Models for Natural Language Processing](http://arxiv.org/abs/1510.00726)"

- These [notes](http://frnsys.com/ai_notes/machine_learning/neural_nets.html) on neural networks, which include a lot more details and additional resources

For the other neural network guides we will mostly rely on the excellent [Keras](http://keras.io/) library, which makes it very easy to build neural networks and can take advantage of [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/)'s optimizations and speed. However, to demonstrate the basics of neural networks, we'll use `numpy` so we can see exactly what's happening every step of the way.

Each unit (neuron) has a vector of weights - these are the parameters that the network learns.

The inner operation of the basic unit is straightforward. We collapse the weight vector $w$ and input vector $v$ into a scalar by taking their dot product. Often a _bias_ term $b$ is added to this dot product; this bias is also learned. Then we pass the result through an _activation function_ $f$, which also returns a scalar. Activation functions are typically nonlinear so that neural networks can learn nonlinear functions. I'll mention a few common activation functions in a bit, but for now let's see what a basic unit is doing:

```python
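import numpy as np
activation_function = np.tanh  # placeholder choice; any scalar nonlinearity works
def unit(inputs, weights, b):
    return activation_function(np.dot(inputs, weights) + b)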
```

Note that the output units often do not have an activation function.

## 3.1. A basic neural network with `numpy`

First we'll import `numpy`:

In [3]:
import numpy as np

With machine learning we are trying to find a hidden function that describes data that we have. Here we are going to cheat a little and define the function ourselves and then use that to generate data. Then we'll try to "reverse engineer" our data and see if we can recover our original function.

In [4]:
def unknown_function(X):
    coeff = np.array([[2., -1., 5.]])
    return np.dot(X, coeff.T)

X = np.array([
    [4.,9.,1.],
    [2.,5.,6.],
    [1.,8.,3.]
])

t = unknown_function(X) # target
print(t)

[[  4.]
 [ 29.]
 [  9.]]


Now we are going to set up our simple neural network. It will have just one hidden layer with two units (which we will refer to as unit 1 and unit 2).

<img src="./image_files/simple_nn_structure.png" width = 250>

First we have to define the weights (i.e. parameters) of our network.

- We have three inputs each going into two units, then one bias value for each unit, so we have eight parameters for the hidden layer.

- Then we have the output of those two hidden layer units going to the output layer, which has only one unit - this gives us two more parameters, plus one bias value.

- So in total, we have eleven parameters.
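The count above can be reproduced from the layer sizes alone. A minimal sketch, where `count_parameters` is a hypothetical helper introduced only for this illustration:

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per connection
        total += n_out          # one bias per unit
    return total

# 3 inputs -> 2 hidden units -> 1 output unit
n_params = count_parameters([3, 2, 1])
print(n_params)  # (3*2 + 2) + (2*1 + 1) = 11
```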

Let's set them to arbitrary fixed values for now (in practice, weights are usually initialized randomly).

In [5]:
# initial hidden layer weights
hidden_layer_weights = np.array([
    [0.5, 0.5, 0.5],    # unit 1
    [0.1, 0.1, 0.1]     # unit 2
])
hidden_layer_biases = np.array([1. ,1.])

# initial output layer weights
output_weights = np.array([[1., 1.]])
output_biases = np.array([1.])

We'll use $\tanh$ activations for our hidden units, so let's define that real quick:

In [6]:
def activation(X):
    return np.tanh(X)

$\tanh$ activations are quite common, but you may also encounter sigmoid activations and, more recently, ReLU activations (which output 0 when $x \leq 0$ and output $x$ otherwise). These activation functions have different benefits: 
- ReLUs in particular are less prone to the vanishing-gradient problem that makes deeper networks hard to train.
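To make the comparison concrete, the three activations mentioned above can be evaluated side by side. This is only a sketch: `sigmoid` and `relu` are helper names assumed here, and the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    """Squashes inputs into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Outputs 0 when x <= 0 and x otherwise."""
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
tanh_out = np.tanh(x)       # values in (-1, 1), zero-centered
sigmoid_out = sigmoid(x)    # values in (0, 1)
relu_out = relu(x)          # unbounded above, exactly 0 for negative inputs

print(tanh_out)
print(sigmoid_out)
print(relu_out)
```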

To make things clearer later on, we'll also define the linear function that combines a unit's input with its weights:

In [7]:
def linear(input, weights, biases):
    return np.dot(input, weights.T) + biases

Now we can do a forward pass with our inputs $X$ to see what the predicted outputs are.

## 3.2. Forward pass

First, we'll pass the input through the hidden layer:

In [8]:
hidden_linout = linear(X, hidden_layer_weights, hidden_layer_biases)
hidden_output = activation(hidden_linout)

print('hidden output')
print(hidden_output)

hidden output
[[ 0.99999977  0.98367486]
 [ 0.99999939  0.9800964 ]
 [ 0.99999834  0.97574313]]


(We're keeping the unit's intermediate value, `hidden_linout`, for use in backpropagation.)

Then we'll take the hidden layer's output and pass it through the output layer to get our predicted outputs:

In [9]:
output_linout = linear(hidden_output, output_weights, output_biases)
output_output = output_linout # no activation function on output layer

predicted = output_output
print('predicted')
print(predicted)

predicted
[[ 2.98367463]
 [ 2.98009578]
 [ 2.97574147]]


Now let's compute the mean squared error of our predictions:

In [10]:
mse = np.mean((t - predicted)**2)
print('mean squared error')
print(mse)

mean squared error
238.120007837


Now we can take this error and backpropagate it through the network. This will tell us how to update our weights.

## 3.3. Backpropagation

Since backpropagation is essentially a chain of derivatives (that is used for gradient descent), we'll need the derivative of our activation function, so let's define that first:

In [11]:
def activation_deriv(X):
    return 1 - np.tanh(X)**2
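For reference, the expression used in `activation_deriv` is the standard derivative of $\tanh$. It follows from the quotient rule, written out here as a short worked step:

$$\frac{d}{dx}\tanh(x) = \frac{d}{dx}\frac{\sinh(x)}{\cosh(x)} = \frac{\cosh^2(x) - \sinh^2(x)}{\cosh^2(x)} = 1 - \tanh^2(x)$$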

Then we want to set a learning rate. This is a value, typically between 0 and 1, that controls how much we adjust our parameters in each training iteration.

- You don't want to set this to be too large or else training will never converge (your parameters might get really big and you'll start seeing a lot of `nan` values).

- You don't want to set this to be too small either, otherwise training will be very slow. There are more sophisticated forms of gradient descent that deal with this, but those are beyond the scope of this guide.
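The effect of the learning rate is easiest to see on a toy problem. The sketch below is an illustration only and is not part of the network in this guide: it minimizes the assumed function $f(w) = w^2$ with plain gradient descent, using a hypothetical helper `descend` and three arbitrarily chosen learning rates.

```python
def descend(eta, steps=50, w=1.0):
    """Minimize f(w) = w**2 by gradient descent; the gradient is 2*w."""
    for _ in range(steps):
        w -= eta * 2.0 * w
    return w

w_small = descend(0.001)   # barely moves away from the start value 1.0
w_good = descend(0.1)      # approaches the minimum at 0
w_large = descend(1.1)     # overshoots on every step and diverges

print(w_small, w_good, w_large)
```

Each step multiplies `w` by `1 - 2 * eta`, so the iteration shrinks toward zero only when that factor has magnitude below one.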

In [12]:
learning_rate = 0.001

First we'll propagate the error through the output layer (I won't go through the derivation of each step but they are straightforward to work out if you know a bit about derivatives):

In [13]:
# derivative of the squared error w.r.t. the prediction
# (constant factors are absorbed into the learning rate)
error = predicted - t

# delta for the output layer (no activation on output layer)
delta_output = error

# output layer updates
output_weights_update = delta_output.T.dot(hidden_output)
output_biases_update = delta_output.sum(axis=0)

Then through the hidden layer:

In [14]:
# push back the delta to the hidden layer
delta_hidden = delta_output*output_weights*activation_deriv(hidden_linout)

# hidden layer updates
hidden_weights_update = delta_hidden.T.dot(X)
hidden_biases_update = delta_hidden.sum(axis=0)
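The elementwise product used to push the delta back relies on NumPy broadcasting, which can be easy to miss. A small sketch of the shapes involved, using made-up delta values (the shapes match the network above, the numbers do not come from it):

```python
import numpy as np

# One row per training sample, one column for the single output unit.
delta_output = np.array([[-1.0], [-26.0], [-6.0]])   # shape (3, 1)
# One row for the output unit, one column per hidden unit.
output_weights = np.array([[1.0, 1.0]])              # shape (1, 2)

# Broadcasting stretches both operands to shape (3, 2):
# entry [n, j] is delta_output[n] * output_weights[0, j].
pushed_back = delta_output * output_weights

# With a single output unit this equals the matrix product (3,1) x (1,2).
same_as_matmul = delta_output.dot(output_weights)

print(pushed_back.shape)
```

With more than one output unit the broadcast form would no longer apply, and the matrix product would be the one to use.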

Then we can apply the updates:

In [15]:
output_weights -= output_weights_update*learning_rate
output_biases -= output_biases_update*learning_rate

hidden_layer_weights -= hidden_weights_update*learning_rate
hidden_layer_biases -= hidden_biases_update*learning_rate

That's one training iteration! In reality, you would do this many, many times - feedforward, backpropagate, update weights, then rinse and repeat. That's the basics of a neural network - at least, the "vanilla" kind. There are other more sophisticated kinds (recurrent and convolutional neural networks are two of the most common) that are covered in other guides.

In [16]:
print(predicted)  # computed before the update, so the values are unchanged

[[ 2.98367463]
 [ 2.98009578]
 [ 2.97574147]]


## 3.4. Neural Networks in python

In [17]:
# hidden layer weights
hidden_layer_weights = np.array([
    [0.5, 0.5, 0.5],    # unit 1
    [0.1, 0.1, 0.1]     # unit 2
])
hidden_layer_biases = np.array([1. ,1.])

# output layer weights
output_weights = np.array([[1., 1.]])
output_biases = np.array([1.])

for i in range(1000):
    
    hidden_linout = linear(X, hidden_layer_weights, hidden_layer_biases)
    hidden_output = activation(hidden_linout)
    
    output_linout = linear(hidden_output, output_weights, output_biases)
    output_output = output_linout # no activation function on output layer

    predicted = output_output

    # derivative of the squared error w.r.t. the prediction
    # (constant factors are absorbed into the learning rate)
    error = predicted - t

    # delta for the output layer (no activation on output layer)
    delta_output = error

    # output layer updates
    output_weights_update = delta_output.T.dot(hidden_output)
    output_biases_update = delta_output.sum(axis = 0)

    # push back the delta to the hidden layer
    delta_hidden = delta_output*output_weights*activation_deriv(hidden_linout)

    # hidden layer updates
    hidden_weights_update = delta_hidden.T.dot(X)
    hidden_biases_update = delta_hidden.sum(axis = 0)

    output_weights -= output_weights_update*learning_rate
    output_biases -= output_biases_update*learning_rate

    hidden_layer_weights -= hidden_weights_update*learning_rate
    hidden_layer_biases -= hidden_biases_update*learning_rate

print(t)
print(predicted)

[[  4.]
 [ 29.]
 [  9.]]
[[  5.54772718]
 [ 26.56860045]
 [  9.73852491]]
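To see how training progresses rather than only the final prediction, the same loop can record the mean squared error at every iteration. This is a sketch under stated assumptions: `train` and `history` are hypothetical names, and the data, initial weights, learning rate and update rules are copied from the cells above.

```python
import numpy as np

def train(X, t, steps=1000, learning_rate=0.001):
    """Repeat the forward/backward/update cycle and record the MSE per step."""
    W1 = np.array([[0.5, 0.5, 0.5], [0.1, 0.1, 0.1]])   # hidden layer weights
    b1 = np.array([1.0, 1.0])                            # hidden layer biases
    W2 = np.array([[1.0, 1.0]])                          # output layer weights
    b2 = np.array([1.0])                                 # output layer bias
    history = []
    for _ in range(steps):
        # forward pass
        hidden_linout = X.dot(W1.T) + b1
        hidden_output = np.tanh(hidden_linout)
        predicted = hidden_output.dot(W2.T) + b2          # no output activation
        history.append(np.mean((t - predicted) ** 2))

        # backward pass
        delta_output = predicted - t
        delta_hidden = delta_output * W2 * (1 - np.tanh(hidden_linout) ** 2)

        # parameter updates
        W2 -= learning_rate * delta_output.T.dot(hidden_output)
        b2 -= learning_rate * delta_output.sum(axis=0)
        W1 -= learning_rate * delta_hidden.T.dot(X)
        b1 -= learning_rate * delta_hidden.sum(axis=0)
    return history

X = np.array([[4., 9., 1.], [2., 5., 6.], [1., 8., 3.]])
t = X.dot(np.array([[2., -1., 5.]]).T)   # same targets as unknown_function(X)

history = train(X, t)
print(history[0], history[-1])   # error before training vs. at the last iteration
```

Plotting `history` is a quick way to spot a learning rate that is too large (the curve blows up) or too small (the curve is nearly flat).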


In [18]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
