<font size = '6'><b>Neural Networks</b></font>

- Ryan Harris
    - https://www.youtube.com/playlist?list=PL29C61214F2146796
    - https://www.youtube.com/playlist?list=PLRyu4ecIE9tibdzuhJr94uQeKnOFkkbq6
    - <a href="./files/BackPropagation.pdf" target="_blank">Backpropagation Slides</a> 
    
<table style="border-style: hidden; border-collapse: collapse;" width = "90%"> 
    <tr style="border-style: hidden; border-collapse: collapse;">
        <td width = 60% style="border-style: hidden; border-collapse: collapse;">
             
        </td>
        <td width = 30%>
        Prof. Seungchul Lee<br>
        iSystems<br>
        UNIST<br>
        http://isystems.unist.ac.kr/
        </td>
    </tr>
</table>

Table of Contents
<div id="toc"></div>


    

# 1. Structure of Neural Networks

__The neuron__

- The sigmoid function is typically used as the transfer function between neurons. It is similar to the step function, but is continuous and differentiable.

$$ \sigma(x) = \frac{1}{1+e^{-x}}$$

- One useful property of this transfer function is the simplicity of computing its derivative.

$$\frac{d}{dx}\sigma(x) = \sigma' = \sigma(x) (1-\sigma(x))$$
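The derivative identity above can be checked numerically. The following is a minimal sketch in plain Python; the function names `sigmoid`, `sigmoid_prime`, and `numerical_derivative` are illustrative choices, not part of the original notes.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Compare the closed-form derivative with the finite-difference estimate.
for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numerical_derivative(sigmoid, x))
```

The two columns should agree to many decimal places, which is what makes the sigmoid convenient: the derivative is obtained from the already computed output without evaluating another exponential.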

__Single input neuron__

<img src="./image_files/single_neuron.png" width = 300>

$$ O = \sigma(\xi \omega + \theta) $$

__Multiple input neuron__

<img src="./image_files/multiple_neuron.png" width = 300>

$$ O = \sigma(\xi_1 \omega_1 + \xi_2 \omega_2 + \xi_3 \omega_3 +\theta) $$
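As a rough illustration of the formula above, a neuron's output can be computed as a weighted sum of its inputs plus the bias, passed through the sigmoid. The helper `neuron_output` and the numeric values below are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    """O = sigma(sum_i xi_i * omega_i + theta)."""
    x = sum(xi * w for xi, w in zip(inputs, weights)) + bias
    return sigmoid(x)

# Illustrative values only (not from the notes).
xi = [1.0, 0.5, -1.0]      # inputs xi_1, xi_2, xi_3
omega = [0.2, -0.4, 0.1]   # weights omega_1, omega_2, omega_3
theta = 0.1                # bias

print(neuron_output(xi, omega, theta))
```

The single-input neuron is the special case with one-element lists for `inputs` and `weights`.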

__A neural network__

<img src="./image_files/nn_03.png" width = 300>

# 2. Learning: Backpropagation Algorithm

__Notation__

- $x_j^\ell$: Input to node $j$ of layer $\ell$

- $W_{ij}^\ell$: Weight from layer $\ell - 1$ node $i$ to layer $\ell$ node $j$

- $\sigma(x) = \frac{1}{1+e^{-x}}$: Sigmoid transfer function

- $\theta_j^{\ell}$: Bias of node $j$ of layer $\ell$

- $O_j^{\ell}$: Output of node $j$ in layer $\ell$

- $t_j$: Target value of node $j$ of the output layer

<br>
<font size='4'><b>The error calculation</b></font>

Given the target values $t_k$ from the training data and the corresponding output-layer outputs $O_k$, we can write the error as

$$ E = \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2$$
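A small sketch of this error for one training example, assuming the outputs and targets are given as equal-length Python lists (the name `squared_error` and the numbers are illustrative):

```python
def squared_error(outputs, targets):
    """E = 1/2 * sum_k (O_k - t_k)^2 over the output nodes."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

O = [0.8, 0.3]   # example network outputs O_k
t = [1.0, 0.0]   # example targets t_k
print(squared_error(O, t))
```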

We want to calculate $\frac{\partial E}{\partial W_{jk}^{\ell}}$, the rate of change of the error with respect to the given connective weight, so we can minimize it.

Now we consider two cases: the node is an output node, or it is in a hidden layer.

__1) Output layer node__

\begin{align*}
\frac{\partial E}{\partial W_{jk}} &= \frac{\partial}{\partial W_{jk}} \frac{1}{2} (O_k - t_k)^2 = (O_k - t_k)\frac{\partial}{\partial W_{jk}} O_k = (O_k - t_k)\frac{\partial}{\partial W_{jk}} \sigma(x_k)\\
&= (O_k - t_k) \sigma(x_k) (1-\sigma(x_k)) \frac{\partial}{\partial W_{jk}} x_k \\
&= (O_k - t_k) O_k (1 - O_k) O_j
\end{align*}

$\quad$For notation purposes, I will define $\delta_k$ to be the expression $(O_k - t_k) O_k (1 - O_k)$, so we can rewrite the equation above as

$$\frac{\partial E}{\partial W_{jk}} = O_j \delta_k $$
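One way to gain confidence in this result is to compare $O_j \delta_k$ with a finite-difference estimate of the gradient. The sketch below assumes a layer of two hidden outputs feeding two output nodes, with $x_k = \sum_j O_j W_{jk} + \theta_k$; all numeric values and helper names are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_output(O_hidden, W, theta):
    """Output-layer activations O_k = sigma(sum_j O_j * W[j][k] + theta[k])."""
    return [sigmoid(sum(O_hidden[j] * W[j][k] for j in range(len(O_hidden))) + theta[k])
            for k in range(len(theta))]

def error(O_hidden, W, theta, t):
    """E = 1/2 * sum_k (O_k - t_k)^2."""
    O = forward_output(O_hidden, W, theta)
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(O, t))

# Illustrative values (assumptions, not from the notes).
O_hidden = [0.6, 0.1]            # outputs O_j of the previous layer
W = [[0.3, -0.2], [0.5, 0.4]]    # W[j][k]: weight from node j to output node k
theta = [0.1, -0.1]              # output-layer biases
t = [1.0, 0.0]                   # targets

# Analytic gradient: dE/dW_jk = O_j * delta_k.
O = forward_output(O_hidden, W, theta)
delta = [(O[k] - t[k]) * O[k] * (1.0 - O[k]) for k in range(2)]
analytic = [[O_hidden[j] * delta[k] for k in range(2)] for j in range(2)]

# Numerical gradient by central differences on each weight.
h = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for j in range(2):
    for k in range(2):
        w0 = W[j][k]
        W[j][k] = w0 + h
        e_plus = error(O_hidden, W, theta, t)
        W[j][k] = w0 - h
        e_minus = error(O_hidden, W, theta, t)
        W[j][k] = w0  # restore the original weight
        numeric[j][k] = (e_plus - e_minus) / (2.0 * h)

print(analytic)
print(numeric)
```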

__2) Hidden layer node__

\begin{align*}
\frac{\partial E}{\partial W_{ij}} &= \frac{\partial}{\partial W_{ij}} \frac{1}{2} \sum_{k \in K} (O_k - t_k)^2 = \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} O_k = \sum_{k \in K} (O_k - t_k)\frac{\partial}{\partial W_{ij}} \sigma(x_k)\\
&= \sum_{k \in K} (O_k - t_k) \sigma(x_k) (1-\sigma(x_k)) \frac{\partial}{\partial W_{ij}} x_k \\
&= \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) \frac{\partial x_k}{\partial O_j}\cdot \frac{\partial O_j}{\partial W_{ij}} = \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\cdot \frac{\partial O_j}{\partial W_{ij}}\\
&= \frac{\partial O_j}{\partial W_{ij}} \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_j (1-O_j)\frac{\partial x_j}{\partial W_{ij}} \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_j (1-O_j)O_i \cdot \sum_{k \in K} (O_k - t_k) O_k (1 - O_k) W_{jk}\\
&= O_i O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}
\end{align*}

$\quad$As before, we collect all terms other than $O_i$ into $\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$, so we have

$$\frac{\partial E}{\partial W_{ij}} = O_i \delta_j$$
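The hidden-layer formula can be checked against a finite-difference gradient as well. This is a sketch under assumed values: a network with two inputs, two hidden nodes, and two output nodes, where the helper names (`layer`, `error`) and all numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(O_prev, W, theta):
    """Outputs of one layer: O_j = sigma(sum_i O_i * W[i][j] + theta[j])."""
    return [sigmoid(sum(O_prev[i] * W[i][j] for i in range(len(O_prev))) + theta[j])
            for j in range(len(theta))]

def error(O_i, W_ij, theta_j, W_jk, theta_k, t):
    """E = 1/2 * sum_k (O_k - t_k)^2 for a network with one hidden layer."""
    O_k = layer(layer(O_i, W_ij, theta_j), W_jk, theta_k)
    return 0.5 * sum((o - tk) ** 2 for o, tk in zip(O_k, t))

# Illustrative values (assumptions, not from the notes).
O_i = [0.5, -1.0]                   # network inputs, treated as outputs of layer i
W_ij = [[0.1, 0.4], [-0.3, 0.2]]    # input -> hidden weights
theta_j = [0.05, -0.05]             # hidden biases
W_jk = [[0.3, -0.2], [0.5, 0.4]]    # hidden -> output weights
theta_k = [0.1, -0.1]               # output biases
t = [1.0, 0.0]                      # targets

# Analytic gradient: dE/dW_ij = O_i * delta_j.
O_j = layer(O_i, W_ij, theta_j)
O_k = layer(O_j, W_jk, theta_k)
delta_k = [(O_k[k] - t[k]) * O_k[k] * (1.0 - O_k[k]) for k in range(2)]
delta_j = [O_j[j] * (1.0 - O_j[j]) * sum(delta_k[k] * W_jk[j][k] for k in range(2))
           for j in range(2)]
analytic = [[O_i[i] * delta_j[j] for j in range(2)] for i in range(2)]

# Numerical gradient by central differences on each hidden-layer weight.
h = 1e-6
numeric = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        w0 = W_ij[i][j]
        W_ij[i][j] = w0 + h
        e_plus = error(O_i, W_ij, theta_j, W_jk, theta_k, t)
        W_ij[i][j] = w0 - h
        e_minus = error(O_i, W_ij, theta_j, W_jk, theta_k, t)
        W_ij[i][j] = w0  # restore the original weight
        numeric[i][j] = (e_plus - e_minus) / (2.0 * h)

print(analytic)
print(numeric)
```

Note that $\delta_j$ reuses the already computed $\delta_k$ values, which is the sense in which the error is propagated backward through the network.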


__How weights affect errors__

- For an output layer node $k \in K$

$$\frac{\partial E}{\partial W_{jk}} = O_j \delta_k $$

$\quad \;\,$where $$\delta_k = (O_k - t_k) O_k (1 - O_k)$$

- For a hidden layer node $j \in J$

$$\frac{\partial E}{\partial W_{ij}} = O_i \delta_j$$

$\quad \;\,$where $$\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$$

__What about the bias?__

If we incorporate the bias term $\theta$ into the input of a node, $x = \sum_i O_i W_i + \theta$, we find that

$$ \frac{\partial x}{\partial \theta} = 1$$

This is why we can view the bias term as a weight on the output of a node that is always one. This holds for any layer $\ell$; substituting into the previous equations gives, for a node in layer $\ell$ with error term $\delta_{\ell}$,

$$ \frac{\partial E}{\partial \theta} = \delta_{\ell}$$
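The bias result can be illustrated on the single-input neuron $O = \sigma(\xi \omega + \theta)$ treated as an output node. The sketch below compares $\delta = (O - t)\,O\,(1 - O)$ with a finite-difference estimate of $\partial E / \partial \theta$; the numeric values are arbitrary assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error(xi, omega, theta, t):
    """E = 1/2 * (O - t)^2 for a single-input, single-output neuron."""
    O = sigmoid(xi * omega + theta)
    return 0.5 * (O - t) ** 2

# Illustrative values (assumptions, not from the notes).
xi, omega, theta, t = 0.7, -0.3, 0.2, 1.0

# Analytic bias gradient: dE/dtheta = delta.
O = sigmoid(xi * omega + theta)
delta = (O - t) * O * (1.0 - O)

# Numerical bias gradient by central differences.
h = 1e-6
numeric = (error(xi, omega, theta + h, t) - error(xi, omega, theta - h, t)) / (2.0 * h)

print(delta, numeric)
```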

<br>
<font size='4'><b>The backpropagation algorithm using gradient descent</b></font>

1. Run the network forward with your input data to get the network output

2. For each output node compute
$$\delta_k = (O_k - t_k) O_k (1 - O_k)$$
3. For each hidden node calculate
$$\delta_j = O_j (1-O_j) \sum_{k \in K} \delta_k W_{jk}$$
4. Update the weights and biases as follows<br>
Given
\begin{align*}
\Delta W &= -\eta \delta_{\ell} O_{\ell -1}\\
\Delta \theta &= -\eta \delta_{\ell}
\end{align*}
apply
\begin{align*}
W &\leftarrow W + \Delta W \\
\theta &\leftarrow \theta + \Delta \theta
\end{align*}
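The four steps above can be sketched in plain Python as follows. This is an illustrative sketch, not a reference implementation: it assumes one hidden layer, updates the weights after every training example, and uses a made-up dataset (the logical OR function), hand-picked initial weights, and $\eta = 0.5$. The names `layer`, `train_step`, and `total_error` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(O_prev, W, theta):
    """Outputs of one layer: O_j = sigma(sum_i O_i * W[i][j] + theta[j])."""
    return [sigmoid(sum(O_prev[i] * W[i][j] for i in range(len(O_prev))) + theta[j])
            for j in range(len(theta))]

def train_step(x, t, W1, th1, W2, th2, eta):
    """One backpropagation update for a single example (modifies weights in place)."""
    # Step 1: run the network forward.
    O_j = layer(x, W1, th1)
    O_k = layer(O_j, W2, th2)
    # Step 2: delta for each output node.
    d_k = [(O_k[k] - t[k]) * O_k[k] * (1.0 - O_k[k]) for k in range(len(O_k))]
    # Step 3: delta for each hidden node (computed before W2 is changed).
    d_j = [O_j[j] * (1.0 - O_j[j]) * sum(d_k[k] * W2[j][k] for k in range(len(d_k)))
           for j in range(len(O_j))]
    # Step 4: gradient-descent updates, Delta W = -eta * delta * O_prev.
    for j in range(len(O_j)):
        for k in range(len(d_k)):
            W2[j][k] -= eta * d_k[k] * O_j[j]
    for k in range(len(d_k)):
        th2[k] -= eta * d_k[k]
    for i in range(len(x)):
        for j in range(len(d_j)):
            W1[i][j] -= eta * d_j[j] * x[i]
    for j in range(len(d_j)):
        th1[j] -= eta * d_j[j]

def total_error(data, W1, th1, W2, th2):
    """Sum of E = 1/2 * sum_k (O_k - t_k)^2 over all training examples."""
    total = 0.0
    for x, t in data:
        O_k = layer(layer(x, W1, th1), W2, th2)
        total += 0.5 * sum((o - tk) ** 2 for o, tk in zip(O_k, t))
    return total

# Hypothetical training data: logical OR of two inputs.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]

# Small hand-picked initial weights (an assumption; random values are more usual).
W1 = [[0.1, -0.2], [0.3, 0.4]]   # input -> hidden
th1 = [0.0, 0.0]
W2 = [[0.2], [-0.1]]             # hidden -> output
th2 = [0.0]

initial_error = total_error(data, W1, th1, W2, th2)
for epoch in range(2000):
    for x, t in data:
        train_step(x, t, W1, th1, W2, th2, eta=0.5)
final_error = total_error(data, W1, th1, W2, th2)

print(initial_error, final_error)
```

Computing the hidden deltas before touching `W2` matters: step 3 needs the weights that were used in the forward pass, so all deltas are evaluated first and the updates are applied afterwards.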



In [2]:
%%html
<iframe src="https://www.youtube.com/embed/aVId8KMsdUU?list=PL29C61214F2146796" 
width="560" height="315" frameborder="0" allowfullscreen></iframe>

In [3]:
%%html
<iframe src="https://www.youtube.com/embed/zpykfC4VnpM?list=PL29C61214F2146796" 
width="560" height="315" frameborder="0" allowfullscreen></iframe>

In [4]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
