Perceptron
=======

The perceptron algorithm was invented in 1957 by Frank Rosenblatt.


The learning rule
--------------------

The learning rule of the Perceptron is based on [Hebbian learning](https://en.wikipedia.org/wiki/Hebbian_theory). Rosenblatt was inspired by Donald Hebb and his 1949 book called *"The Organization of Behavior"*. The core idea of the book is the following:

*When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.*

This simple but powerful rule is known as **Hebb's rule**, and it is the learning principle used in the original Perceptron. The three rules behind the learning phase can be described as follows:

1. If the output unit is correct, leave its weights unchanged
2. If the output unit incorrectly outputs a zero, add the input vector to the weight vector
3. If the output unit incorrectly outputs a one, subtract the input vector from the weight vector
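The three rules can be sketched in a few lines of Python. This is only an illustration, not the original implementation: the function names `predict` and `update` are hypothetical, the bias is omitted, and the unit is assumed to output one when the weighted sum is greater than or equal to zero.

```python
def predict(weights, inputs):
    """Binary threshold unit (assumed convention: output 1 when the sum is >= 0)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0


def update(weights, inputs, target):
    """Apply one step of the learning rule and return the new weight vector."""
    output = predict(weights, inputs)
    if output == target:
        # Rule 1: the output is correct, keep the weights as they are.
        return list(weights)
    if target == 1:
        # Rule 2: the unit output 0 but should output 1, add the input vector.
        return [w + x for w, x in zip(weights, inputs)]
    # Rule 3: the unit output 1 but should output 0, subtract the input vector.
    return [w - x for w, x in zip(weights, inputs)]


weights = update([0.0, 0.0], [1.0, 2.0], target=0)
print(weights)  # [-1.0, -2.0]
```

In the usage line the initial weights give a weighted sum of zero, so the unit outputs one while the target is zero, and rule 3 subtracts the input vector.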




Geometric interpretation
----------------------------

The geometric interpretation of the learning rule can be understood in terms of vectors and planes. We start from the **weight space**, the space containing all possible configurations of the Perceptron's weights. If the network has only two weights, the weight space is a two-dimensional Cartesian plane. If the network has 14 weights, the weight space is a 14-dimensional space. To simplify, we do not take the bias into account. In this simplified version, each training case can be represented as a hyperplane passing through the origin. This hyperplane separates the weight configurations that lead to a correct output from those that lead to a wrong output. A point in the space represents a particular setting of all the weights, and the vector from the origin to that point identifies the same setting. The following image (taken from Hinton's course on Coursera) shows this geometric representation:


<p align="center">
<img src="../etc/img/perceptron_geometry.png" width="300">
</p>


The black line is the hyperplane that separates the weight space into two parts. The **blue vector** is an input vector passed to the perceptron, and the hyperplane is orthogonal to it. The correct answer associated with the blue vector is zero. A specific configuration of weights is represented by the **red vector**. The red vector represents a wrong configuration of weights because the scalar (dot) product between it and the blue vector has the wrong sign. The **green vector**, instead, represents a correct configuration of weights because its dot product with the blue vector has the correct sign.

Recall the definition of the dot product between two vectors $\boldsymbol{a}$ and $\boldsymbol{b}$, where $\theta$ is the angle between them:

$$ \boldsymbol{a} \cdot \boldsymbol{b} = |\boldsymbol{a}| \, |\boldsymbol{b}| \cos \theta $$

The cosine determines the sign of the resulting scalar, because the two magnitudes are never negative. When $0 \le \theta < \frac{\pi}{2}$ the sign is positive, when $\theta = \frac{\pi}{2}$ the result is zero, and when $\frac{\pi}{2} < \theta \le \pi$ the sign is negative. Since the angle between two vectors is taken in $[0, \pi]$, these three cases cover every configuration. The red vector represents a bad configuration of weights because its dot product with the blue vector is a positive scalar, which gives an output of one once passed through the binary threshold activation function. The green vector, on the other hand, yields a negative scalar, which leads to the correct output of zero once passed through the activation function.
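The relation between the angle and the sign of the dot product can be checked numerically. The vectors below are hypothetical values chosen for illustration; they are not read from the figure, and the helper names `dot` and `angle_between` are assumptions.

```python
import math


def dot(a, b):
    """Dot product of two vectors given as lists of equal length."""
    return sum(x * y for x, y in zip(a, b))


def angle_between(a, b):
    """Angle in radians, in [0, pi], between two non-zero vectors."""
    cos_theta = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))


blue = [1.0, 1.0]    # hypothetical input vector (correct answer: zero)
red = [1.0, 0.0]     # hypothetical bad weights, angle below pi/2
green = [-1.0, 0.0]  # hypothetical good weights, angle above pi/2

for name, weights in [("red", red), ("green", green)]:
    theta = angle_between(weights, blue)
    print(name, "angle:", round(theta, 3), "dot:", dot(weights, blue))
```

With these values the red vector gives a positive dot product (output one, wrong) and the green vector a negative one (output zero, correct).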

The **goal** of learning is to find a configuration of weights that satisfies all the training cases. For instance, if another training case has a correct answer of 1, then we have two hyperplanes and the good weight configurations lie between them. This example is represented in the following image (taken from Hinton's Coursera slides):


<p align="center">
<img src="../etc/img/perceptron_geometry_2.png" width="300">
</p>

An important point here is that a configuration of weights satisfying all the training cases **may not exist**! However, if such a weight vector exists, then it lies in a hypercone with its apex at the origin, and all the vectors inside that cone are good solutions to the problem.
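The cone can be illustrated with a small check. The sketch below uses two hypothetical training cases and hand-picked weight vectors, and it assumes a bias-free unit that outputs one when the weighted sum is greater than or equal to zero. Scaling a good weight vector by a positive factor, or averaging two good weight vectors, should again give a good weight vector.

```python
def output(weights, inputs):
    """Bias-free threshold unit (assumed convention: 1 when the sum is >= 0)."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= 0 else 0


def satisfies_all(weights, cases):
    """True when the weights give the correct answer on every training case."""
    return all(output(weights, inputs) == target for inputs, target in cases)


# Hypothetical training cases: (input vector, correct answer).
cases = [([1.0, 1.0], 0), ([-1.0, 1.0], 1)]

good_a = [-2.0, 0.5]
good_b = [-3.0, -1.0]
bad = [1.0, 1.0]
scaled = [10.0 * w for w in good_a]                        # same direction, longer
average = [(a + b) / 2.0 for a, b in zip(good_a, good_b)]  # between two solutions

for name, w in [("good_a", good_a), ("good_b", good_b), ("scaled", scaled),
                ("average", average), ("bad", bad)]:
    print(name, satisfies_all(w, cases))
```

Only `bad` fails the check here; the other four vectors all lie inside the cone defined by the two cases.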

At **training time** the weight vector is moved closer to or farther from the input vector (blue vector). If we consider again the three rules discussed in the previous section, we can give them a geometric interpretation. Here the rules are shown again:

1. If the output unit is correct, leave its weights unchanged
2. If the output unit incorrectly outputs a zero, add the input vector to the weight vector
3. If the output unit incorrectly outputs a one, subtract the input vector from the weight vector


Applying the three rules, the weight vector is moved only when the output is wrong: rule 2 moves it closer to the input vector and rule 3 moves it farther away. If there is a hypercone where all the training cases are satisfied, then the weight vector will reach that cone and stay there (repeating rule 1).
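A minimal training loop makes this behaviour concrete. It is a sketch under stated assumptions: the function name `train`, the zero initialisation, the `max_epochs` cap, and the two small data sets are all hypothetical, the bias is omitted, and the unit outputs one when the weighted sum is greater than or equal to zero.

```python
def train(cases, n_weights, max_epochs=100):
    """Cycle through the cases until a full pass produces no mistakes.

    Returns (weights, epoch), where epoch is the index of the first
    mistake-free pass, or None if the cap was reached.
    """
    weights = [0.0] * n_weights
    for epoch in range(max_epochs):
        mistakes = 0
        for inputs, target in cases:
            total = sum(w * x for w, x in zip(weights, inputs))
            output = 1 if total >= 0 else 0
            if output == target:
                continue  # rule 1: leave the weights unchanged
            # Rule 2 adds the input vector, rule 3 subtracts it.
            sign = 1.0 if target == 1 else -1.0
            weights = [w + sign * x for w, x in zip(weights, inputs)]
            mistakes += 1
        if mistakes == 0:
            return weights, epoch  # the weight vector is inside the cone
    return weights, None


solvable = [([1.0, 1.0], 0), ([-1.0, 1.0], 1)]
print(train(solvable, n_weights=2))  # ([-1.0, -1.0], 1)

# The same input with two different answers: no weight vector can satisfy both.
conflicting = [([1.0, 1.0], 0), ([1.0, 1.0], 1)]
print(train(conflicting, n_weights=2, max_epochs=20)[1])  # None
```

On the first data set the weights settle after a single correction and then stay fixed. On the second one the two cases undo each other at every pass, so the loop stops only because of the epoch cap.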

