# Artificial Neural Networks

Artificial Neural Networks (ANNs) are inspired by the network structure of the human brain.
They are built from individual artificial neurons, which are mathematical models of biological neurons
that can be depicted as follows.

<img src="https://cs.calvin.edu/courses/cs/344/kvlinden/07regression/images/an.png" width="256px"/>

Notes:
- The neuron receives *activation* from other input neurons ($a_i$).
- A *bias* input can be added to each neuron ($a_0$).
- Each input activation is weighted (i.e., the weight for input $i$ to neuron $j$ is $w_{ij}$).
- The output activation is computed for each neuron ($a_j = g(in_j) = g(\sum_{i=0}^n w_{ij} \cdot a_i)$).
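As a minimal sketch of the neuron model described in the notes above, the following computes $a_j$ for one neuron. The sigmoid is assumed here as the activation function $g$ (the notes do not fix a particular $g$), and the names `sigmoid` and `neuron_output` as well as the numeric values are illustrative only.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here as one common choice for g."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_output(weights, activations, g=sigmoid):
    """Compute a_j = g(in_j) = g(sum_i w_ij * a_i).

    Index 0 is the bias: activations[0] is the bias input a_0
    and weights[0] is the weight attached to it.
    """
    in_j = sum(w * a for w, a in zip(weights, activations))
    return g(in_j)


# Bias input a_0 = 1.0 followed by two input activations (illustrative values).
activations = [1.0, 0.5, -1.0]
weights = [0.1, 0.4, 0.2]  # w_0j, w_1j, w_2j
a_j = neuron_output(weights, activations)
```

Treating the bias as an ordinary input fixed at a constant value keeps the computation a single weighted sum, which matches the $i = 0$ lower bound in the formula.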

Artificial neurons can be configured into feed-forward networks comprising multiple layers.
Here is an example.

<img src="https://cs.calvin.edu/courses/cs/344/kvlinden/07regression/images/ann.png" width="256px"/>

We train these ANNs using the backpropagation algorithm:

**1. Initialization**

Set the ANN weights to small random numbers.
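One way to sketch this initialization step is shown below. The uniform range of $\pm 0.1$, the helper name `init_weights`, and the 2-2-1 layer sizes are assumptions made for illustration; the notes only ask for small random values.

```python
import random


def init_weights(n_in, n_out, scale=0.1, rng=None):
    """Return an n_in x n_out weight matrix with entries drawn
    uniformly from [-scale, scale] (scale is an assumed value)."""
    if rng is None:
        rng = random.Random()
    return [[rng.uniform(-scale, scale) for _ in range(n_out)]
            for _ in range(n_in)]


rng = random.Random(0)            # seeded only to make the sketch repeatable
W_ih = init_weights(2, 2, rng=rng)  # input -> hidden weights
W_ho = init_weights(2, 1, rng=rng)  # hidden -> output weights
```

Starting from random rather than identical values lets different hidden units receive different updates during training.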

**2. Feed-Forward**

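Take an example input from the training set and feed its activation through the network. For a network with two inputs, two hidden units, and a single output unit (so $j = 1$), the weighted sums can be written as the matrix product below. Per the neuron model above, the activation function $g$ is also applied at each layer; the matrix form omits it for brevity.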

$
o_j = 
\left[
\begin{array}{c c}
i_1 & i_2 \\ 
\end{array}
\right]
\cdot
\left[
\begin{array}{c c}
w_{i_1,h_1} & w_{i_1,h_2} \\ 
w_{i_2,h_1} & w_{i_2,h_2} \\
\end{array}
\right]
\cdot
\left[
\begin{array}{c}
w_{h_1, o_1} \\ 
w_{h_2, o_1} \\ 
\end{array}
\right]
$
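The feed-forward pass for the 2-2-1 network in the matrix expression above might be sketched as follows. Unlike the matrix product, this sketch applies an activation function at each layer, as the neuron model requires; the sigmoid, the function names, and the weight values are assumptions, and bias inputs are left out to mirror the matrices shown.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here as the activation g."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, W, g=sigmoid):
    """Activations of the next layer; W[i][j] is the weight from
    unit i of the current layer to unit j of the next layer."""
    n_out = len(W[0])
    return [g(sum(inputs[i] * W[i][j] for i in range(len(inputs))))
            for j in range(n_out)]


def feed_forward(inputs, W_ih, W_ho):
    """Propagate an input vector through the hidden and output layers."""
    hidden = layer_forward(inputs, W_ih)
    outputs = layer_forward(hidden, W_ho)
    return hidden, outputs


W_ih = [[0.1, -0.2],   # weights from i_1 to h_1, h_2 (illustrative)
        [0.3, 0.4]]    # weights from i_2 to h_1, h_2
W_ho = [[0.5],         # weight from h_1 to o_1
        [-0.5]]        # weight from h_2 to o_1
hidden, outputs = feed_forward([1.0, 0.0], W_ih, W_ho)
```

Returning the hidden activations along with the outputs is a design choice made here because the later weight updates need them.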

**3. Error Computation**

Compute the error using a common error function (e.g., L2 error) by comparing the desired output ($y_j$) with the actual output ($o_j$).

$
L2\_Error = \sum_{j=1}^n {(y_j - o_j)}^2
$
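The error computation can be sketched in a few lines. The function name `l2_error` and the sample target and output values are illustrative, not taken from the notes.

```python
def l2_error(targets, outputs):
    """Sum of squared differences between desired outputs y_j
    and actual outputs o_j."""
    return sum((y - o) ** 2 for y, o in zip(targets, outputs))


# Two output units with illustrative values: (0.2)^2 + (0.3)^2, about 0.13.
error = l2_error([1.0, 0.0], [0.8, 0.3])
```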

**4. Backpropagation** 

Modify the weights by propagating updates back through the network using a given learning rate ($rate$) and the raw errors generated by the output layer ($\Delta_{o_j} = y_j - o_j$).

$
W_{h,o}^\ast \leftarrow W_{h,o} + rate \cdot A_{h} \cdot g'(in_{o}) \cdot \Delta_{o}
$
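As a rough sketch of the output-layer update rule above, the following applies one update to the hidden-to-output weights. The sigmoid activation (whose derivative is $g'(x) = g(x)(1 - g(x))$), the function names, the learning rate of 1.0, and all numeric values are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, assumed here as the activation g."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid: g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def update_output_weights(W_ho, hidden_acts, in_o, deltas, rate):
    """Return new weights W[h][o] + rate * A_h * g'(in_o) * Delta_o."""
    return [[W_ho[h][o] + rate * hidden_acts[h] * sigmoid_prime(in_o[o]) * deltas[o]
             for o in range(len(deltas))]
            for h in range(len(hidden_acts))]


hidden_acts = [0.5, 1.0]          # A_h (illustrative hidden activations)
W_ho = [[0.4], [-0.2]]            # current hidden -> output weights
in_o = [sum(hidden_acts[h] * W_ho[h][0] for h in range(2))]  # weighted input
output = [sigmoid(v) for v in in_o]
deltas = [1.0 - output[0]]        # raw error y - o for target y = 1.0
W_ho_new = update_output_weights(W_ho, hidden_acts, in_o, deltas, rate=1.0)
```

Because the target exceeds the output here, both weights attached to positive hidden activations increase, which pushes the output toward the target on the next pass.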


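The backpropagation update rules are derived by computing the following partial derivatives of the error with respect to the weights, the first for the hidden layer(s) (more complicated) and the second for the output layer (less complicated). Each weight is then moved against its gradient, scaled by the learning rate, which is why the minus signs below become plus signs in the update.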

$
{\partial{Error_o} \over {\partial{W_{i,h}}}} = \ldots = -A_i \cdot g'(in_{h}) \cdot \sum_k (W_{h, o_k} \cdot g'(in_{o_k}) \cdot \Delta_{o_k}) \\
{\partial{Error_o} \over {\partial{W_{h,o}}}} = \ldots = -A_h \cdot g'(in_{o}) \cdot \Delta_{o}
$
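A finite-difference check is one way to sanity-test derivative expressions like these. The sketch below assumes a 2-2-1 network with sigmoid activations, a single output, and the squared error scaled by $\frac{1}{2}$ (so that no factor of 2 appears in the gradient). The hidden-layer expression is written with $g'(in_o)$ applied to the raw output error $\Delta_o = y - o$. All names and values are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, W_ih, W_ho):
    """Return hidden activations and the single output o."""
    a_h = [sigmoid(sum(x[i] * W_ih[i][h] for i in range(len(x))))
           for h in range(len(W_ho))]
    o = sigmoid(sum(a_h[h] * W_ho[h] for h in range(len(W_ho))))
    return a_h, o


def error(x, y, W_ih, W_ho):
    # Assumption: squared error scaled by 1/2.
    _, o = forward(x, W_ih, W_ho)
    return 0.5 * (y - o) ** 2


def analytic_gradients(x, y, W_ih, W_ho):
    a_h, o = forward(x, W_ih, W_ho)
    delta_o = y - o            # raw output error
    gp_o = o * (1.0 - o)       # g'(in_o) for the sigmoid
    grad_ho = [-a_h[h] * gp_o * delta_o for h in range(len(W_ho))]
    grad_ih = [[-x[i] * a_h[h] * (1.0 - a_h[h]) * W_ho[h] * gp_o * delta_o
                for h in range(len(W_ho))]
               for i in range(len(x))]
    return grad_ih, grad_ho


def numeric_gradient(f, weights, index, eps=1e-6):
    """Central-difference estimate of d f / d weights[index]."""
    original = weights[index]
    weights[index] = original + eps
    up = f()
    weights[index] = original - eps
    down = f()
    weights[index] = original  # restore the weight
    return (up - down) / (2 * eps)


x, y = [1.0, -0.5], 1.0
W_ih = [[0.1, -0.2], [0.3, 0.4]]
W_ho = [0.5, -0.5]
grad_ih, grad_ho = analytic_gradients(x, y, W_ih, W_ho)
f = lambda: error(x, y, W_ih, W_ho)
num_ho = [numeric_gradient(f, W_ho, h) for h in range(2)]
num_ih = [[numeric_gradient(f, W_ih[i], h) for h in range(2)] for i in range(2)]
```

If the analytic and numeric values agree to several decimal places, the derivative expressions and the code implementing them are very likely consistent with each other.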

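Because backpropagation is a form of gradient descent, it can in principle get stuck in local minima of the error function. Interestingly, deep neural networks appear to work in practice without significant problems caused by local minima.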