## Chapter 10. Introduction to Artificial Neural Networks

The key idea behind _artificial neural networks_ (ANNs): study the brain's architecture for inspiration on how to build an intelligent machine.

ANNs are the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks.

### From Biological to Artificial Neurons

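ANNs were first introduced in 1943 by Warren McCulloch and Walter Pitts, who presented a simplified computational model of how biological neurons might work together to perform complex computations using _propositional logic_.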

In the early 1980s there was a revival of interest in ANNs as new network architectures were invented and better training techniques were developed. But by the 1990s, powerful alternative Machine Learning techniques such as Support Vector Machines were favored by most researchers.

Reasons to believe this wave of interest in ANNs is different and will have a much more profound impact on our lives:
 - Huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
 - Computing power (Moore's Law, GPUs)
 - The training algorithms have been improved.
 - Some theoretical limitations of ANNs have turned out to be benign in practice.
 - ANNs seem to have entered a virtuous circle of funding and progress.
 
#### Biological Neurons

Each neuron is typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-1.png" width=400px alt="fig10-1" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-1. Biological neuron_</div>

#### Logical Computations with Neurons

_Artificial neuron_: has one or more binary (on/off) inputs and one binary output. Even with such a simplified model, it is possible to build a network of artificial neurons that computes any logical proposition.
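
As a minimal sketch of this idea, a binary threshold neuron can express AND, OR, and NOT. The function names and the way inhibitory inputs are modeled here are illustrative assumptions, not notation from the original model.

```python
def mp_neuron(excitatory, inhibitory=(), threshold=1):
    """Binary neuron: outputs 1 when enough excitatory inputs are on.

    Assumption for this sketch: any active inhibitory input blocks firing.
    """
    if any(inhibitory):
        return 0
    return int(sum(excitatory) >= threshold)


def logical_and(a, b):
    # Fires only when both inputs are active.
    return mp_neuron([a, b], threshold=2)


def logical_or(a, b):
    # Fires when at least one input is active.
    return mp_neuron([a, b], threshold=1)


def logical_not(a):
    # A constant "always on" input that is inhibited by `a`.
    return mp_neuron([1], inhibitory=[a], threshold=1)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Since AND, OR, and NOT are enough to express any propositional formula, chaining such neurons is one way to see why a network of them can compute any logical proposition.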

#### The Perceptron

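_Perceptron_: one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a _linear threshold unit_ (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs ($z = w_1 x_1 + \dots + w_n x_n = \mathbf{w}^T \mathbf{x}$), then applies a step function to that sum and outputs the result.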

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-4.png" width=400px alt="fig10-4" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-4. Linear threshold unit_</div>

_Common step functions used in Perceptrons_

$$\operatorname{heaviside}(z) = \begin{cases}
0 & \text{if } z < 0 \\
1 & \text{if } z \ge 0
\end{cases}
\qquad
\operatorname{sgn}(z) = \begin{cases}
-1 & \text{if } z < 0 \\
0 & \text{if } z = 0 \\
+1 & \text{if } z > 0
\end{cases}$$
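
The two step functions and a single LTU can be sketched in NumPy as follows. The helper names (`heaviside`, `sgn`, `ltu`) and the example weights are assumptions chosen for illustration.

```python
import numpy as np


def heaviside(z):
    # 0 for z < 0, 1 for z >= 0
    return np.where(np.asarray(z) >= 0, 1, 0)


def sgn(z):
    # -1 for z < 0, 0 for z == 0, +1 for z > 0
    return np.sign(z)


def ltu(x, w, step=heaviside):
    # Weighted sum of the inputs followed by a step function.
    z = np.dot(w, x)
    return step(z)


x = np.array([1.0, 2.0, 3.0])   # example inputs (hypothetical)
w = np.array([1.0, -1.0, 0.5])  # example weights (hypothetical)
print(ltu(x, w), ltu(x, w, step=sgn))
```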

A Perceptron is simply composed of a single layer of LTUs, with each neuron connected to all the inputs.
These connections are often represented using special passthrough neurons called _input neurons_: they just
output whatever input they are fed. Moreover, an extra bias feature is generally added ($x_0 = 1$). This bias
feature is typically represented using a special type of neuron called a _bias neuron_, which just outputs 1
all the time.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-5.png" width=400px alt="fig10-5" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-5. Perceptron diagram. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput
classifier._</div>
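
A Perceptron layer like the one in the figure can be sketched as a single matrix product: prepend the bias feature $x_0 = 1$ to every instance, multiply by a weight matrix with one column per output neuron, and apply the step function. The weight values below are hand-picked assumptions for the sketch (the three outputs happen to compute AND, OR, and NOR of two binary inputs); they are not learned.

```python
import numpy as np


def add_bias(X):
    # Prepend the constant bias feature x0 = 1 to every instance.
    return np.c_[np.ones((X.shape[0], 1)), X]


def perceptron_predict(X, W):
    # One column of W per output LTU; Heaviside step on the weighted sums.
    z = add_bias(X) @ W
    return (z >= 0).astype(int)


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

# Rows: bias weight, weight for x1, weight for x2.
# Columns: three output neurons (AND, OR, NOR), chosen by hand.
W = np.array([[-1.5, -0.5,  0.5],
              [ 1.0,  1.0, -1.0],
              [ 1.0,  1.0, -1.0]])

print(perceptron_predict(X, W))
```

Each instance gets three independent binary predictions, which is what makes this layer a multioutput classifier.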

How is a Perceptron trained?

Hebb's rule (Hebbian learning): the connection weight between two neurons is increased whenever they have the same
output.

Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it does not reinforce connections that lead to the wrong output. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is:

_Perceptron learning rule (weight update)_

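$$ w_{i,j}^{next\_step} = w_{i,j} + \eta(y_j - \hat{y}_j)x_i$$

where $w_{i,j}$ is the connection weight between the $i$-th input neuron and the $j$-th output neuron, $x_i$ is the $i$-th input value of the current training instance, $\hat{y}_j$ is the output of the $j$-th output neuron, $y_j$ is its target output, and $\eta$ is the learning rate.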

_Perceptron convergence theorem_: if the training instances are linearly separable, this algorithm will converge to a solution.
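
The training procedure can be sketched as a short NumPy loop for a single output neuron. This is an illustrative sketch, not a reference implementation: the function names, the zero initialization, the stopping criterion, and the toy dataset (logical AND, which is linearly separable) are all assumptions. The update uses the target minus the prediction, so a neuron that wrongly output 0 has its weights pushed toward the input.

```python
import numpy as np


def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Train one LTU with the Perceptron learning rule (sketch)."""
    Xb = np.c_[np.ones(len(X)), X]      # add bias feature x0 = 1
    w = np.zeros(Xb.shape[1])           # assumption: start from zero weights
    for _ in range(max_epochs):
        errors = 0
        for x_i, target in zip(Xb, y):  # one training instance at a time
            y_hat = int(w @ x_i >= 0)   # Heaviside step prediction
            if y_hat != target:
                w += eta * (target - y_hat) * x_i
                errors += 1
        if errors == 0:                 # a full pass with no mistakes
            break
    return w


def predict(X, w):
    Xb = np.c_[np.ones(len(X)), X]
    return (Xb @ w >= 0).astype(int)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])              # logical AND

w = train_perceptron(X, y)
print(predict(X, w))
```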

Logistic Regression is often preferable to Perceptrons: instead of outputting a class probability, Perceptrons just make predictions based on a hard threshold.

Single-layer Perceptrons are incapable of solving some trivial problems, such as the exclusive OR (XOR) classification problem. Stacking multiple Perceptrons into a _Multi-Layer Perceptron_ (MLP) eliminates some of these limitations.
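
XOR is a classic example of a problem that a single layer of LTUs cannot solve but a small MLP can. The sketch below uses hand-picked weights (an assumption for illustration, not learned values): one hidden unit computes OR, another computes AND, and the output unit fires when OR is on and AND is off.

```python
import numpy as np


def step(z):
    # Heaviside step: 1 for z >= 0, else 0
    return (z >= 0).astype(int)


def xor_mlp(X):
    # Hidden layer: column 0 acts as OR, column 1 acts as AND.
    W_hidden = np.array([[1.0, 1.0],
                         [1.0, 1.0]])
    b_hidden = np.array([-0.5, -1.5])
    h = step(X @ W_hidden + b_hidden)

    # Output unit: on when OR is on and AND is off.
    w_out = np.array([1.0, -1.0])
    b_out = -0.5
    return step(h @ w_out + b_out)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(xor_mlp(X))
```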

#### Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called _hidden layers_, and one final layer of LTUs called the _output layer_. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. 

_Deep neural network_ (DNN): an ANN with two or more hidden layers.

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-7.png" width=400px alt="fig10-7" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-7. Multi-Layer Perceptron_</div>

The _backpropagation_ training algorithm can be described as Gradient Descent using reverse-mode autodiff. For each training instance, the algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
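
The three phases can be sketched by hand in NumPy for a tiny MLP with one hidden layer. Everything specific here is an assumption made for the sketch: logistic activations in both layers, a squared-error loss, the layer sizes, the learning rate, the random seed, and the XOR toy data. In practice a framework would compute the gradients automatically.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def backprop_step(x, y, params, eta=0.5):
    """One forward pass, reverse pass, and Gradient Descent step."""
    W1, b1, W2, b2 = params

    # Forward pass: compute and keep each layer's output.
    h = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ h + b2)

    # Measure the error (loss = 0.5 * sum((y_hat - y)^2)).
    err = y_hat - y

    # Reverse pass: error signal at the output, then at the hidden layer.
    delta2 = err * y_hat * (1.0 - y_hat)
    delta1 = (W2.T @ delta2) * h * (1.0 - h)

    # Gradient Descent step (arrays are updated in place).
    W2 -= eta * np.outer(delta2, h)
    b2 -= eta * delta2
    W1 -= eta * np.outer(delta1, x)
    b1 -= eta * delta1
    return 0.5 * float(np.sum(err ** 2))


rng = np.random.default_rng(0)
params = [rng.normal(size=(4, 2)), np.zeros(4),   # hidden layer: 4 units
          rng.normal(size=(1, 4)), np.zeros(1)]   # output layer: 1 unit

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

losses = []
for epoch in range(2000):
    losses.append(sum(backprop_step(x, y, params) for x, y in zip(X, Y)))

print("first epoch loss:", losses[0], "last epoch loss:", losses[-1])
```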

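_Activation functions_ that replace the step function so that Gradient Descent has a usable gradient (the step function is flat everywhere, so its gradient is zero):
 - _Logistic function_, $\sigma (z)=1/(1+\exp(-z)) \ \ \in (0,1)$, with a well-defined nonzero derivative everywhere.
 - _Hyperbolic tangent function_, $\tanh (z) = 2\sigma (2z)-1 \ \ \in (-1,1)$. Its output is centered around 0, which tends to keep each layer's output roughly normalized at the beginning of training and often speeds up convergence.
 - _ReLU function_, $ReLU(z)=\max(0,z) \ \  \in [0,\infty)$. It is not differentiable at $z=0$ and its derivative is 0 for $z<0$, but it is fast to compute and works well in practice.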
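
The three activation functions and their derivatives can be sketched in NumPy as below, using the standard logistic definition $\sigma(z) = 1/(1+\exp(-z))$. The function names are illustrative, and the value returned for the ReLU derivative at $z = 0$ (here 0) is a convention assumed for the sketch.

```python
import numpy as np


def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))


def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)


def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    # Assumption: use 0 as the derivative at z == 0, where it is undefined.
    return (z > 0).astype(float)


z = np.linspace(-3.0, 3.0, 7)

# tanh can be written in terms of the logistic function.
print(np.allclose(np.tanh(z), 2.0 * logistic(2.0 * z) - 1.0))
print(relu(z), relu_grad(z))
```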
 
<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-8.png" width=400px alt="fig10-8" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-8. Activation functions and their derivatives_</div>

An MLP is often used for classification, with each output corresponding to a different binary class. When the classes are exclusive, the output layer is typically modified by replacing the individual activation
functions by a shared _softmax_ function. The output of each neuron corresponds to the estimated probability of the corresponding class. Note that the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a _feedforward neural network_ (FNN).
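
A softmax output layer can be sketched as follows. Subtracting the per-row maximum before exponentiating is a common numerical-stability trick and is an addition of this sketch; the example scores are made up.

```python
import numpy as np


def softmax(logits):
    # Shift by the row maximum so np.exp never sees large positive values.
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)


# Two instances, three exclusive classes (hypothetical output-layer scores).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs)               # each row is a probability distribution
print(probs.argmax(axis=1))  # predicted class per instance
```

Each row sums to 1, so every output can be read as the estimated probability of the corresponding class.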

<div style="width:400 px; font-size:100%; text-align:center;"> <center><img src="img/fig10-9.png" width=400px alt="fig10-9" style="padding-bottom:1.0em;padding-top:2.0em;"></center>_Figure 10-9. A modern MLP (including ReLU and softmax) for classification_</div>

<font color=blue>_NOTE_</font>
>Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs.

### Training an MLP with TensorFlow's High-Level API



