# Artificial Neural Networks

Artificial Neural Networks (ANNs) are inspired by the network structure of the human brain.
They are built from individual *artificial neurons*, mathematical models of biological neurons
that can be pictured as follows (cf. Russell & Norvig, Figure 18.19).

<img src="an.png" width="256px"/>

Notes:
- The neuron receives *activation* from other input neurons ($a_i$).
- A *bias* input can be added to each neuron ($a_0$).
- Each input activation is weighted (i.e., the weight for input $i$ to neuron $j$ is $w_{ij}$).
- The output activation is computed for each neuron ($a_j = g(in_j) = g(\sum_{i=0}^n w_{ij} \cdot a_i)$).
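
The notes above can be sketched in a few lines of plain Python. The function names (`neuron_output`, `relu`, `sigmoid`) and the sample numbers are illustrative assumptions; the source does not prescribe a particular activation function $g$ at this point.

```python
import math

def neuron_output(weights, activations, g):
    """Return a_j = g(in_j), where in_j = sum_i w_ij * a_i.

    activations[0] plays the role of the bias input a_0 and
    weights[0] is the weight attached to it.
    """
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values: bias input a_0 = 1.0 followed by two input activations.
activations = [1.0, 0.5, -1.0]
weights = [0.1, 0.4, 0.2]

# The weighted sum in_j is 0.1 + 0.2 - 0.2 = 0.1 (up to floating-point rounding).
print(neuron_output(weights, activations, relu))
print(neuron_output(weights, activations, sigmoid))
```

Passing $g$ as a parameter keeps the weighted sum separate from the choice of activation function, so the same neuron can be tried with different non-linearities.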

Artificial neurons can be configured into *artificial neural networks*
comprising multiple layers. Here is an example.

<img src="ann.png" width="256px"/>

Notes:
- This network architecture is:
    - *Feed-forward*: The activation flows from the input (top) to the output (bottom).
    - *Fully/Densely inter-connected*: The output of each node is fed into all nodes in the next layer.
- Layers in such networks can be seen as non-linear functions of the outputs of the preceding layers, i.e.:
    $o = layer_h(layer_i(i))$
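
The idea of layers as nested functions can be sketched as follows. This is a minimal illustration, assuming weights are stored as lists of rows with `weights[i][j]` connecting input `i` to node `j`; the helper names and the sample weights are hypothetical, and bias inputs are left out for brevity.

```python
def relu(x):
    return max(0.0, x)

def dense_layer(inputs, weights, g):
    """Compute one fully connected layer.

    weights[i][j] is the weight from input i to node j of this layer.
    """
    n_out = len(weights[0])
    return [
        g(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]

def feed_forward(inputs, layers, g):
    """Apply each layer to the output of the one before it."""
    activation = inputs
    for weights in layers:
        activation = dense_layer(activation, weights, g)
    return activation

# Hypothetical network with 2 inputs, 2 hidden nodes and 1 output.
w_hidden = [[0.5, -0.5],
            [0.25, 1.0]]
w_output = [[1.0],
            [2.0]]

print(feed_forward([1.0, 2.0], [w_hidden, w_output], relu))  # [4.0]
```

Because every node in a layer reads every output of the previous layer, the loop in `dense_layer` is exactly the "fully inter-connected" property; the loop in `feed_forward` is the "feed-forward" property.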

We train these ANNs using the back-propagation algorithm:

**1. Initialization**

Set the ANN weights to "small", "random" numbers.
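
One way to read "small" and "random" is to draw each weight uniformly from a narrow interval around zero. The sketch below assumes that reading; the interval width (`scale=0.1`), the uniform distribution, and the function name `init_weights` are illustrative choices, not something the source specifies.

```python
import random

def init_weights(n_in, n_out, scale=0.1, seed=None):
    """Return an n_in x n_out weight matrix with entries in [-scale, scale]."""
    rng = random.Random(seed)  # a private generator keeps runs reproducible
    return [[rng.uniform(-scale, scale) for _ in range(n_out)]
            for _ in range(n_in)]

# Weight matrices for a hypothetical network with 2 inputs, 2 hidden nodes, 1 output.
w_ih = init_weights(2, 2, seed=0)
w_ho = init_weights(2, 1, seed=1)

print(w_ih)
print(w_ho)
```

Random starting values matter because identical weights would give every hidden node the same update, so the nodes could never learn different features.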

**2. Feed-Forward**

Take an example input from the training set and feed its activation through the network.

$
o_1 = 
\left[
\begin{array}{c c}
i_1 & i_2 \\ 
\end{array}
\right]
\cdot
\left[
\begin{array}{c c}
w_{i_1,h_1} & w_{i_1,h_2} \\ 
w_{i_2,h_1} & w_{i_2,h_2} \\
\end{array}
\right]
\cdot
\left[
\begin{array}{c}
w_{h_1, o_1} \\ 
w_{h_2, o_1} \\ 
\end{array}
\right]
$
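
The product above shows only the weighted sums. Written with the activation function $g$ from the neuron definition applied element-wise after each layer (an assumption about how the two pieces fit together; bias inputs are still left out), the same computation would read:

$
o_1 = g\left( g\left(
\left[
\begin{array}{c c}
i_1 & i_2 \\
\end{array}
\right]
\cdot W_{i,h} \right) \cdot W_{h,o} \right)
$

When $g$ is ReLU and every weighted sum is positive, $g$ leaves the values unchanged and both forms give the same result.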

**3. Error Computation**

Compute the error using a common error function (e.g., L2 error) by comparing the desired output ($y_j$) with the actual output ($o_j$).

$
L_2Error = \sum_{j=1}^n {(y_j - o_j)}^2 
$
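
As a small sketch, the error and the raw per-output differences $y_j - o_j$ can be computed as below. The function names and the sample vectors are hypothetical.

```python
def l2_error(desired, actual):
    """Sum of squared differences between desired and actual outputs."""
    return sum((y - o) ** 2 for y, o in zip(desired, actual))

def output_deltas(desired, actual):
    """Raw error y_j - o_j for each output node."""
    return [y - o for y, o in zip(desired, actual)]

# Hypothetical desired and actual outputs for a network with two output nodes.
desired = [1.0, 0.0]
actual = [0.5, 0.5]

print(l2_error(desired, actual))       # 0.5
print(output_deltas(desired, actual))  # [0.5, -0.5]
```

The squared error is always non-negative, while the deltas keep their sign, which is what tells the weight update which direction to move in.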

**4. Back-Propagation** 

Modify the weights by propagating updates back through the network using a given learning rate ($rate$) and the raw errors generated by the output layer ($\Delta_{o_j} = y_j - o_j$).

$
\begin{aligned}
W_{i,h}^\ast &\leftarrow W_{i,h} + rate \cdot A_{i} \cdot g'(in_{h}) \odot \sum_k (W_{h, o} \cdot \Delta_{o_k}) \\
W_{h,o}^\ast &\leftarrow W_{h,o} + rate \cdot A_{h} \cdot g'(in_{o}) \cdot \Delta_{o}
\end{aligned}
$
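
The two update rules can be sketched in plain Python for a network with one hidden layer. This follows the rules as written above, one weight at a time; the function name `backprop_step`, the list-of-rows weight layout, the ReLU default, and the convention $g'(0) = 0$ are assumptions made for illustration.

```python
def relu(x):
    return max(0.0, x)

def relu_prime(x):
    # Derivative of ReLU; the value at exactly 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

def backprop_step(x, y, w_ih, w_ho, rate, g=relu, g_prime=relu_prime):
    """One feed-forward pass and weight update for a single training example.

    w_ih[i][h]: weight from input i to hidden node h.
    w_ho[h][k]: weight from hidden node h to output node k.
    Returns new (w_ih, w_ho); the matrices passed in are not modified.
    """
    n_in, n_hid, n_out = len(x), len(w_ho), len(y)

    # Feed-forward, keeping the weighted sums because g' needs them.
    in_h = [sum(x[i] * w_ih[i][h] for i in range(n_in)) for h in range(n_hid)]
    a_h = [g(v) for v in in_h]
    in_o = [sum(a_h[h] * w_ho[h][k] for h in range(n_hid)) for k in range(n_out)]
    a_o = [g(v) for v in in_o]

    # Raw output errors: delta_o[k] = y[k] - o[k].
    delta_o = [y[k] - a_o[k] for k in range(n_out)]

    # Error reaching hidden node h: sum over outputs k of W_ho[h][k] * delta_o[k],
    # computed with the weights from before this update.
    back_h = [sum(w_ho[h][k] * delta_o[k] for k in range(n_out))
              for h in range(n_hid)]

    new_w_ho = [[w_ho[h][k] + rate * a_h[h] * g_prime(in_o[k]) * delta_o[k]
                 for k in range(n_out)] for h in range(n_hid)]
    new_w_ih = [[w_ih[i][h] + rate * x[i] * g_prime(in_h[h]) * back_h[h]
                 for h in range(n_hid)] for i in range(n_in)]
    return new_w_ih, new_w_ho

# Values taken from the worked example in the Example section.
new_w_ih, new_w_ho = backprop_step(
    x=[0.0, 1.0], y=[1.0],
    w_ih=[[0.11, 0.12], [0.21, 0.08]],
    w_ho=[[0.14], [0.15]],
    rate=0.05,
)
print(new_w_ih)
print(new_w_ho)
```

Run on the values from the Example section, the printed weights agree with the hand calculation there up to rounding.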

These back-propagation formulae follow from the partial derivatives of the error with respect to each weight matrix: the first, for the hidden layer(s), is more involved; the second, for the output layer, is simpler.

$
\begin{aligned}
{{\partial{Error_o}} \over {\partial{W_{i,h}}}} &= \ldots = -A_i \cdot g'(in_{h}) \odot \sum_k (W_{h, o} \cdot \Delta_{o_k}) \\
{{\partial{Error_o}} \over {\partial{W_{h,o}}}} &= \ldots = -A_h \cdot g'(in_{o}) \cdot \Delta_{o}
\end{aligned}
$
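
The elided steps for the output layer can be sketched with the chain rule. This sketch rests on two assumptions that are not stated above: the error for a single output is taken as $Error_o = \frac{1}{2}(y - o)^2$ (without the $\frac{1}{2}$, the resulting constant factor 2 can be absorbed into $rate$), and $o = g(in_o)$ with $in_o = \sum_h W_{h,o} \cdot A_h$.

$
\begin{aligned}
{{\partial{Error_o}} \over {\partial{W_{h,o}}}}
&= {{\partial{Error_o}} \over {\partial{o}}} \cdot {{\partial{o}} \over {\partial{in_o}}} \cdot {{\partial{in_o}} \over {\partial{W_{h,o}}}} \\
&= -(y - o) \cdot g'(in_{o}) \cdot A_h \\
&= -A_h \cdot g'(in_{o}) \cdot \Delta_{o}
\end{aligned}
$

Gradient descent moves each weight against this derivative, which is why the update rule adds $rate \cdot A_{h} \cdot g'(in_{o}) \cdot \Delta_{o}$ rather than subtracting it.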

Interestingly, deep neural networks appear to work in practice without significant problems caused by local minima.

## Example

In class, we ran the following example (inspired by [Backpropagation Step by Step](https://hmkcode.github.io/ai/backpropagation-step-by-step/)).

1. Fill in random weights.

    $\begin{aligned}
    &\begin{bmatrix}
    w_{i_1,h_1} & w_{i_1,h_2} \\
    w_{i_2,h_1} & w_{i_2,h_2}
    \end{bmatrix}
    \leftarrow
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} \\
    &\begin{bmatrix}
    w_{h_1, o_1} \\ 
    w_{h_2, o_1} 
    \end{bmatrix}
    \leftarrow
    \begin{bmatrix}
    0.14 \\
    0.15
    \end{bmatrix}
    \end{aligned}$
    
2. Compute the output for one sample (XOR: `[0, 1]` &rarr; `1`).

    $\begin{aligned}
    o_1 &= 
    \begin{bmatrix}
    0 & 1 \\ 
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
    0.14 \\
    0.15
    \end{bmatrix}
    \\ &=
    \begin{bmatrix}
    0 * 0.11 + 1 * 0.21 & 0 * 0.12 + 1 * 0.08
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
    0.14 \\ 
    0.15
    \end{bmatrix}
    \\ &=
    \begin{bmatrix}
    0.21 & 0.08
    \end{bmatrix}
    \cdot
    \begin{bmatrix}
    0.14 \\ 
    0.15 
    \end{bmatrix}
    \\ &=
    \begin{bmatrix}
    0.21 * 0.14 + 0.08 * 0.15
    \end{bmatrix}
    \\ &= 0.0414 
    \end{aligned}$

3. Compute the error (and, more importantly, the delta).

    $\begin{aligned}
    L_2Error &= (1 - 0.0414)^2 \\
    &= 0.9189 \\
    \Delta_{o_1} &= (1 - 0.0414) \\
    &= 0.9586 \\
    \end{aligned}$

4. Back-propagate the updates through the network, assuming:
    $learning\_rate = 0.05$;
    ReLU activation functions for all nodes (so $g'(in) = 1.0$ here, because every weighted sum in this example is positive).
     
    $\begin{aligned}
    \begin{bmatrix}
    w_{h_1, o_1} \\ 
    w_{h_2, o_1}
    \end{bmatrix} &\leftarrow 
    \begin{bmatrix}
    0.14 \\ 
    0.15 
    \end{bmatrix} + 0.05 \cdot 
    \begin{bmatrix}
    0.21 \\ 
    0.08 
    \end{bmatrix} \cdot 1.0 \cdot 0.9586 \\\\
    &= 
    \begin{bmatrix}
    0.14 \\ 
    0.15 
    \end{bmatrix} + 
    \begin{bmatrix}
    0.05 * 0.21 * 1.0 * 0.9586 \\
    0.05 * 0.08 * 1.0 * 0.9586 
    \end{bmatrix} \\
    &= 
    \begin{bmatrix}
    0.14 \\ 
    0.15 
    \end{bmatrix} +
    \begin{bmatrix}
    0.0101 \\
    0.0038 
    \end{bmatrix} \\
    &=
    \begin{bmatrix}
    0.1501 \\ 
    0.1538
    \end{bmatrix}
    \end{aligned}$

    $\begin{aligned}
    \begin{bmatrix}
    w_{i_1,h_1} & w_{i_1,h_2} \\ 
    w_{i_2,h_1} & w_{i_2,h_2}
    \end{bmatrix} &\leftarrow 
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} + 0.05 \cdot
    \begin{bmatrix}
    0 & 0 \\ 
    1 & 1
    \end{bmatrix} \cdot 1.0 \odot
    \begin{bmatrix}
    0.14 & 0.15 \\ 
    0.14 & 0.15
    \end{bmatrix} \cdot 0.9586 \\ &=
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} + 
    \begin{bmatrix}
    0.05 * 0 * 1.0 & 0.05 * 0 * 1.0 \\ 
    0.05 * 1 * 1.0 & 0.05 * 1 * 1.0 \\ 
    \end{bmatrix} \odot 
    \begin{bmatrix}
    0.14 * 0.9586 & 0.15 * 0.9586\\ 
    0.14 * 0.9586 & 0.15 * 0.9586 
    \end{bmatrix} \\ &=
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} + 
    \begin{bmatrix}
    0.00 & 0.00 \\ 
    0.05 & 0.05 
    \end{bmatrix} \odot 
    \begin{bmatrix}
    0.1342 & 0.1438 \\
    0.1342 & 0.1438
    \end{bmatrix} \\ &=
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} + 
    \begin{bmatrix}
    0.00 * 0.1342 & 0.00 * 0.1438 \\
    0.05 * 0.1342 & 0.05 * 0.1438
    \end{bmatrix} \\ &=
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.21 & 0.08
    \end{bmatrix} +     
    \begin{bmatrix}
    0.0 & 0.0 \\
    0.0067 & 0.0072
    \end{bmatrix} \\ &= 
    \begin{bmatrix}
    0.11 & 0.12 \\
    0.2167 & 0.0872
    \end{bmatrix}  
    \end{aligned}$

Notes:
- We had to do 2 *broadcasts* (in the first line of the $W_{i,h}$ update) to get the
    required matrix dimensions.
    - $A_i$: laid out along the vertical (i.e., as a column, with a duplicate column added).
    - $W_{h,o}$: laid out along the horizontal (i.e., as a row, with a duplicate row added).
    - This allows us to use element-wise multiplication, known as the 
        [Hadamard Product](https://en.wikipedia.org/wiki/Hadamard_product_(matrices)) 
        and denoted by $\odot$.
- This process:
    - works recursively for multi-layered networks.
    - is more efficient than adjusting each weight individually.
    - is generally run using *Mini-batch Stochastic Gradient Descent*.             
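
The two broadcasts and the Hadamard product from the notes can be spelled out explicitly. The sketch below reuses the numbers from the hidden-layer update in the example; the helper names are hypothetical, and the factor $g'(in_h) = 1.0$ is left out because it does not change the result here.

```python
def broadcast_columns(column, n_cols):
    """Repeat a column vector n_cols times: result[r][c] = column[r]."""
    return [[value] * n_cols for value in column]

def broadcast_rows(row, n_rows):
    """Repeat a row vector n_rows times: result[r][c] = row[c]."""
    return [list(row) for _ in range(n_rows)]

def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

rate, delta_o = 0.05, 0.9586
a_i = [0.0, 1.0]        # input activations
w_ho = [0.14, 0.15]     # hidden-to-output weights, before their own update
w_ih = [[0.11, 0.12],
        [0.21, 0.08]]

a_i_b = broadcast_columns(a_i, len(w_ho))   # [[0.0, 0.0], [1.0, 1.0]]
w_ho_b = broadcast_rows(w_ho, len(a_i))     # [[0.14, 0.15], [0.14, 0.15]]

step = hadamard(a_i_b, w_ho_b)
new_w_ih = [[w + rate * s * delta_o for w, s in zip(w_row, s_row)]
            for w_row, s_row in zip(w_ih, step)]

for row in new_w_ih:
    print([round(w, 4) for w in row])
```

Array libraries such as NumPy perform this kind of broadcasting implicitly when a column-shaped array is multiplied by a row-shaped one, so the explicit helpers here exist only to make the duplicated rows and columns visible.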