# Chapter.07 Multilayer perceptron
---

### 7.1. Multilayer perceptron
7.1.1. Limitations of Rosenblatt's Perceptron<br>
- Works only for two-class problems with linearly separable patterns
- Cannot solve nonlinear classification problems (e.g., the XOR problem)
<br>

7.1.2. Model<br>

<img src="./res/ch07/fig_1_1.png" width="600" height="300"><br>
<div align="center">
  Figure.7.1.1
</div>

The network in the figure above contains multiple hidden layers that perform feature extraction. Because each neuron applies a nonlinear activation function to its weighted input, the network can realize nonlinear transformations:

$$
\varphi\left(\sum_i w_i x_i + b\right)
$$

The network also has high (or full) connectivity between layers. It can be used for both regression and classification.
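
As a minimal sketch of the single-neuron computation in the formula above, the following assumes a logistic sigmoid for $\varphi$; the function names and the sample numbers are illustrative and not part of the original notes.

```python
import math

def sigmoid(v):
    """Logistic activation, one common choice for the nonlinearity (an assumption here)."""
    return 1.0 / (1.0 + math.exp(-v))

def neuron_output(weights, inputs, bias, activation=sigmoid):
    """Return activation(sum_i w_i * x_i + b) for a single neuron."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(v)

# The weighted sum is 0.5 - 0.5 + 0.0 = 0, so the sigmoid returns 0.5.
y = neuron_output([0.5, -0.5], [1.0, 1.0], 0.0)
print(y)
```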

7.1.3. Cost functions<br>

The training set consists of $N$ samples, that is,
$$
\mathfrak{I} = \{ \mathbf{x}(n), \,\ \mathbf{d}(n)\}_{n = 1}^{N}
$$

Let $C$ be the set of output nodes, so that $|C|$ is the number of output nodes.

- MSE cost functions
    - Instantaneous MSE (online learning)
$$
E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n)
$$
    - Average MSE (batch learning)
$$
E_{av}(N) = \frac{1}{N} \sum_{n = 1}^N E(n) = \frac{1}{2N} \sum_{n = 1}^{N} \sum_{j \in C} e_j^2(n) \quad \text{where} \,\ e_j(n) = d_j(n) - o_j(n), \,\ o_j \,\ \text{is the } j\text{-th output of the neural network.}
$$

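For regression, the MSE is the usual cost function. The derivations in this chapter use the instantaneous MSE $E(n)$, because dropping the sum over samples leaves only the sum over output nodes and keeps the gradient simple.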
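
The two MSE cost functions above can be sketched as follows. The helper names and the toy targets and outputs are assumptions made for illustration.

```python
def instantaneous_mse(d, o):
    """E(n) = 1/2 * sum_j (d_j - o_j)^2 for one sample."""
    return 0.5 * sum((dj - oj) ** 2 for dj, oj in zip(d, o))

def average_mse(targets, outputs):
    """E_av = (1/N) * sum_n E(n) over all N samples."""
    n_samples = len(targets)
    return sum(instantaneous_mse(d, o) for d, o in zip(targets, outputs)) / n_samples

# Two samples, two output nodes each (made-up values).
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
print(instantaneous_mse(targets[0], outputs[0]))  # 0.25
print(average_mse(targets, outputs))              # 0.125
```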

- Cross-Entropy cost functions
    - Instantaneous cross-entropy
$$
E(n) = -\sum_{j \in C} d_j(n) \log(o_j(n))
$$
    - Average cross-entropy
$$
E_{av}(N) = \frac{1}{N} \sum_{n = 1}^N E(n) = -\frac{1}{N} \sum_{n = 1}^{N} \sum_{j \in C} d_j(n) \log(o_j(n))
$$

$$
\text{where} \,\ d_j(n) \in \{0,1\}, \,\ o_j(n) \in [0, 1] \,\ \text{and} \,\ \sum_jo_j(n) = 1
$$

In classification, $d_j(n)$ is a one-hot encoded label and $o_j(n)$ is the predicted probability of class $j$. The output layer should therefore use the softmax activation function, which guarantees $ \sum_j o_j(n) = 1 \quad (\because \,\ \text{Axiom.A.3.3})$.
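
A minimal sketch of a softmax output combined with the cross-entropy cost is given here, using the standard form $-\sum_j d_j \log o_j$. The max-shift inside `softmax` and the small `eps` inside `cross_entropy` are numerical safeguards added as assumptions; they are not part of the formulas above.

```python
import math

def softmax(v):
    """Map induced local fields to probabilities that sum to 1."""
    shift = max(v)  # subtracting the max avoids overflow in exp
    exps = [math.exp(vj - shift) for vj in v]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(d, o, eps=1e-12):
    """Return -sum_j d_j * log(o_j); eps guards against log(0)."""
    return -sum(dj * math.log(oj + eps) for dj, oj in zip(d, o))

o = softmax([2.0, 1.0, 0.1])  # made-up induced local fields
d = [1, 0, 0]                 # one-hot label for class 0
print(sum(o))                 # sums to 1 up to rounding
print(cross_entropy(d, o))
```

Only the term of the true class contributes to the sum, so the cost decreases as the predicted probability of that class approaches 1.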

7.1.4. Batch and online learning<br>
- Batch Learning
    - Using all the training samples for weight updates
    - Cost function : $ E_{av}(N) $
    - Corresponds to standard gradient descent
    - Requires more memory but behaves more stably
- Online Learning
    - Weight updates are on an example-by-example basis
    - Cost function : $ E(n) $
    - Simple to implement and effective
    - Requires much less storage but behaves less stably
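
The difference between the two schemes can be sketched on a toy problem: a single linear neuron $o = w x$ fitted to data generated by $d = 2x$. The data, the learning rate, and the function names are assumptions chosen so that both loops are easy to follow.

```python
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, d) pairs with d = 2x
eta = 0.1                                       # learning rate (illustrative value)

def train_batch(w, epochs):
    """One weight update per epoch, using the gradient of E_av over all samples."""
    for _ in range(epochs):
        grad = -sum((d - w * x) * x for x, d in samples) / len(samples)
        w -= eta * grad
    return w

def train_online(w, epochs):
    """One weight update per sample, using the gradient of E(n)."""
    for _ in range(epochs):
        for x, d in samples:
            grad = -(d - w * x) * x
            w -= eta * grad
    return w

w_batch = train_batch(0.0, 50)
w_online = train_online(0.0, 50)
print(w_batch, w_online)  # both approach the true weight 2
```

The batch loop needs every sample before it can update once, while the online loop updates after each sample and only keeps the current sample in memory.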

7.1.5. Back-Propagation algorithm : Overview<br>
We consider the online learning method in this chapter. <br>
The back-propagation algorithm computes the gradient of the cost function with respect to every weight, which is what is needed to train a neural network. <br><br>

Let $ n $ be the index of a training sample and $ w_{ji} $ be the weight of the edge (connection) from neuron $ i $ to neuron $ j $.<br>
$ v_j $ is the induced local field of neuron $ j $ and $ y_j $ is the output of neuron $ j $.<br>
Neuron $j$ is a node in the $ l $th layer and neuron $i$ is a node in the $ (l-1) $th layer.<br>

$$
v_j(n) = \sum_{i = 0}^m w_{ji}(n) y_i(n) 
$$

$$
y_j(n) = \varphi(v_j(n))
$$

In online learning,
$$
\begin{align*}
w_{ji}(n + 1) &= w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)} \\
              &= w_{ji}(n) + \Delta w_{ji}(n) \\
\end{align*}
$$

$$
\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial v_j(n)} \frac{\partial v_j (n)}{\partial w_{ji}(n)} \quad (\because \,\ \text{Chain rule})
$$
<br>
Let $  \frac{\partial E(n)}{\partial v_j(n)} = - \delta_j(n) $. <br>

Therefore, 

$$ 
\frac{\partial E(n)}{\partial w_{ji}(n)} = -\delta_j(n) y_i(n) \quad \left(\because \,\ \frac{\partial v_j(n)}{\partial w_{ji}(n)} = \frac{\partial}{\partial w_{ji}(n)}\left( \sum_{i = 0}^m w_{ji}(n) y_i(n)  \right) = y_i(n)\right)
$$

$ \delta_j(n) $ is called the __local gradient__ . The weight update is proportional to the product of the local gradient $ \delta_j(n) $ and the input signal $ y_i(n) $.<br><br>


Therefore, 
$$
\begin{align*}
\Delta w_{ji}(n) &= - \eta \frac{\partial E(n)}{\partial w_{ji} (n)} \\
                 &= \eta \delta_j(n) y_i(n) \\
\end{align*}
$$

It means, 
$$
\left(\text{Weight correction} \,\ \Delta w_{ji}(n)\right) = \left(\text{learning-rate parameter} \,\ \eta \right) \times \left(\text{local gradient} \,\ \delta_j(n)\right) \times \left(\text{input signal of neuron } \,\ j, \,\ y_i(n)\right)
$$
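
The weight correction rule above can be sketched for a whole layer at once. The function name `weight_corrections` and the numbers are hypothetical; `delta` holds the local gradients $\delta_j(n)$ of the layer and `y_prev` holds the input signals $y_i(n)$ coming from the previous layer.

```python
def weight_corrections(eta, delta, y_prev):
    """Return the matrix dw[j][i] = eta * delta[j] * y_prev[i]."""
    return [[eta * dj * yi for yi in y_prev] for dj in delta]

# Two neurons in the layer, two input signals (made-up values).
dw = weight_corrections(0.1, [0.5, -0.2], [1.0, 2.0])
for row in dw:
    print(row)
```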

7.1.6. Back-Propagation algorithm : Local gradients at output nodes<br>
Suppose neuron $j$ is an output node and the cost function is MSE.

<img src="./res/ch07/fig_1_2.png" width="600" height="300"><br>
<div align="center">
  Figure.7.1.2
</div>

$$
\begin{align*}
\delta_j(n) &= -\frac{\partial E(n)}{\partial v_j(n)} \quad (\because \,\ \text{Definition}) \\
            &= - \frac{\partial E(n)}{\partial e_j(n)} \frac{\partial e_j(n)}{\partial y_j(n)} \frac{\partial y_j (n)}{\partial v_j(n)} \quad (\because \,\ \text{Chain rule}) \\
\end{align*}
$$

$$
\frac{\partial E(n)}{\partial e_j(n)} = e_j(n) \quad \left( \because \,\ E(n) = \frac{1}{2} \sum_{j \in C} e_j^2(n) \right) \quad \cdots (1)
$$

$$
\frac{\partial e_j(n)}{\partial y_j(n)} = -1 \quad \left( \because \,\ e_j(n) = d_j(n) - y_j(n) \right) \quad \cdots (2)
$$

$$
\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j^\prime (v_j(n)) \quad \left( \because \,\ y_j(n) = \varphi_j(v_j(n)) \right) \quad \cdots (3)
$$

$$
\delta_j(n) = \varphi_j^\prime (v_j(n)) e_j(n) \qquad (\because \,\ (1), \,\ (2), \,\ \text{and} \,\ (3))
$$

For a linear output neuron, $ \varphi_j(v) = v $ and $ \varphi_j^\prime(v) = 1 $, so $ \delta_j(n) = e_j(n) $.
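
As a sanity check of $\delta_j(n) = \varphi_j^\prime(v_j(n)) e_j(n)$, the sketch below compares the formula with a central-difference estimate of $-\partial E(n) / \partial v_j(n)$. It assumes a sigmoid output activation; the induced local fields and targets are made-up values.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    s = sigmoid(v)
    return s * (1.0 - s)

def output_delta(v, d):
    """delta_j = e_j * phi'(v_j) with e_j = d_j - phi(v_j)."""
    return [(dj - sigmoid(vj)) * sigmoid_prime(vj) for vj, dj in zip(v, d)]

def error(v, d):
    """Instantaneous MSE as a function of the output induced local fields."""
    return 0.5 * sum((dj - sigmoid(vj)) ** 2 for vj, dj in zip(v, d))

def numeric_delta(v, d, j, h=1e-6):
    """Central-difference estimate of -dE/dv_j."""
    up, down = list(v), list(v)
    up[j] += h
    down[j] -= h
    return -(error(up, d) - error(down, d)) / (2 * h)

v, d = [0.3, -1.2], [1.0, 0.0]
analytic = output_delta(v, d)
numeric = [numeric_delta(v, d, j) for j in range(len(v))]
print(analytic)
print(numeric)
```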

7.1.7. Back-Propagation algorithm : Local gradients at hidden nodes<br>
Suppose neuron $j$ is a hidden node in the $(L - 1)$th layer and the cost function is MSE.

<img src="./res/ch07/fig_1_3.png" width="600" height="300"><br>
<div align="center">
  Figure.7.1.3
</div>

$$
\delta_j(n) = - \frac{\partial E(n)}{\partial v_j(n)} = - \frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j (n)}
$$

We already know that, 
$$
\frac{\partial y_j(n)}{\partial v_j(n)} = \varphi_j^\prime (v_j(n)) \quad (\because \,\ y_j(n) = \varphi_j(v_j(n)))
$$

So we only have to work out the following term.

$$
\begin{align*}
\frac{\partial E(n)}{\partial y_j(n)} &= \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_j(n)} \quad (\because \,\ E(n) = \frac{1}{2} \sum_k e_k^2(n) ) \\
                                      &= \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} \quad (\because \,\ \text{Chain rule}) 
\end{align*}
$$

$$
\frac{\partial e_k(n)}{\partial v_k(n)} = - \varphi_k^\prime (v_k(n)) \quad (\because \,\ e_k(n) = d_k(n) - y_k(n) = d_k(n) - \varphi_k(v_k(n))) \quad \cdots (1)
$$

$$
\frac{\partial v_k(n)}{\partial y_j(n)} = w_{kj}(n) \quad (\because \,\ v_k(n) = \sum_{j = 1}^{m}w_{kj}(n)y_j(n)) \quad \cdots (2)
$$

Therefore,
$$
\begin{align*}
\frac{\partial E(n)}{\partial y_j(n)} &= \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_j(n)} \\
                                      &= - \sum_k e_k(n) \varphi_k^\prime (v_k(n))w_{kj}(n) \quad (\because \,\ (1) \,\ \text{and} \,\ (2))
\end{align*}
$$

Therefore,
$$
\begin{align*}
\delta_j(n) &= - \frac{\partial E(n)}{\partial v_j(n)} = - \frac{\partial E(n)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j (n)} \\
            &= \varphi_j^\prime (v_j(n)) \cdot \sum_k e_k(n) \varphi_k^\prime (v_k(n)) w_{kj}(n) \\
            &= \varphi_j^\prime(v_j(n)) \cdot \sum_k \delta_k(n) w_{kj}(n) \\
\end{align*}
$$
The formula above shows that the local gradient of a hidden neuron is the weighted sum of the local gradients $ \delta_k $ of the next layer, scaled by $ \varphi_j^\prime $. To generalize this to every hidden layer, we check the $ (L-2) $th layer; the general case then follows by mathematical induction.

Suppose $ i $ is a hidden node in the $(L - 2)$th layer and the cost function is MSE. That is, <br>
$ k $ : a neuron in the $L$th (i.e., output) layer<br>
$ j $ : a neuron in the $(L-1)$th (i.e., hidden) layer<br>
$ i $ : a neuron in the $(L-2)$th (i.e., hidden) layer<br>

$$
\delta_i(n) = - \frac{\partial E(n)}{\partial v_i(n)} = - \frac{\partial E(n)}{\partial y_i(n)} \frac{\partial y_i(n)}{\partial v_i (n)}
$$

We already know that, 
$$
\frac{\partial y_i(n)}{\partial v_i(n)} = \varphi_i^\prime (v_i(n)) \quad (\because \,\ y_i(n) = \varphi_i(v_i(n)))
$$

$$
\begin{align*}
\frac{\partial E(n)}{\partial y_i(n)} &= \sum_k e_k(n) \frac{\partial e_k(n)}{\partial y_i(n)} \quad (\because \,\ E(n) = \frac{1}{2} \sum_k e_k^2(n)) \\
                                      &= \sum_k e_k(n) \frac{\partial e_k(n)}{\partial v_k(n)} \frac{\partial v_k(n)}{\partial y_i(n)} \\
                                      &= \sum_k e_k(n) \left( - \varphi_k^\prime (v_k(n)) \right) \frac{\partial v_k(n)}{\partial y_i(n)} \quad (\because \,\ e_k(n) = d_k(n) - \varphi_k(v_k(n)))
\end{align*}
$$

Here, the induced local field of a neuron in the output layer is
$$
\begin{align*}
v_k(n) &= \sum_{j = 1}^m w_{kj}(n)y_j(n) = \sum_{j = 1}^m w_{kj}(n) \varphi_j (v_j(n)) \\
       &= \sum_{j = 1}^m w_{kj}(n) \varphi_j \left( \sum_{i = 1}^l w_{ji}(n) y_i(n) \right) 
\end{align*}
$$

Therefore,
$$
\begin{align*}
\frac{\partial v_k(n)}{\partial y_i(n)} &= \sum_{j = 1}^m w_{kj}(n) \varphi_j^\prime \left( \sum_{i = 1}^l w_{ji}(n) y_i(n) \right) w_{ji}(n) \\
                                        &= \sum_{j = 1}^m w_{kj}(n) \varphi_j^\prime \left( v_j(n) \right) w_{ji}(n) \quad \cdots (1)
\end{align*}
$$

$$
\begin{align*}
\frac{\partial E(n)}{\partial y_i(n)} &= - \sum_k e_k(n) \varphi_k^\prime (v_k(n)) \frac{\partial v_k(n)}{\partial y_i(n)} \\
                                      &= - \sum_k e_k(n) \varphi_k^\prime (v_k(n)) \sum_{j} w_{kj}(n) \varphi_j^\prime \left( v_j(n) \right) w_{ji}(n)  \quad (\because \,\ (1))\\
                                      &= - \sum_j \left[\sum_k \delta_k(n) w_{kj}(n) \varphi_j^\prime (v_j(n)) \right] w_{ji}(n) \\
                                      &= - \sum_j \delta_j(n)w_{ji}(n)
\end{align*}
$$


$$
\therefore \quad \delta_i(n) = \varphi_i^\prime (v_i(n)) \cdot \sum_j \delta_j(n) w_{ji}(n)
$$

This has the same form as the result for the $(L-1)$th layer. $ \qquad \blacksquare $
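
The recursion for hidden local gradients can be checked numerically in the same spirit. The sketch below assumes a tanh activation in both layers and a made-up weight matrix `W_out` from two hidden neurons to two output neurons; it compares $\varphi_j^\prime(v_j) \sum_k \delta_k w_{kj}$ with a central-difference estimate of $-\partial E / \partial v_j$.

```python
import math

def phi(v):
    return math.tanh(v)

def phi_prime(v):
    return 1.0 - math.tanh(v) ** 2

# W_out[k][j] is the weight from hidden neuron j to output neuron k (made-up values).
W_out = [[0.4, -0.7], [0.2, 0.9]]
d = [1.0, -1.0]

def output_fields(v_hidden):
    """Induced local fields v_k of the output layer."""
    y_hidden = [phi(v) for v in v_hidden]
    return [sum(w * y for w, y in zip(row, y_hidden)) for row in W_out]

def error(v_hidden):
    """Instantaneous MSE as a function of the hidden induced local fields."""
    return 0.5 * sum((dk - phi(vk)) ** 2 for dk, vk in zip(d, output_fields(v_hidden)))

def hidden_delta(v_hidden):
    """delta_j = phi'(v_j) * sum_k delta_k * w_kj."""
    v_out = output_fields(v_hidden)
    delta_out = [(dk - phi(vk)) * phi_prime(vk) for dk, vk in zip(d, v_out)]
    return [
        phi_prime(vj) * sum(delta_out[k] * W_out[k][j] for k in range(len(W_out)))
        for j, vj in enumerate(v_hidden)
    ]

def numeric_delta(v_hidden, j, h=1e-6):
    """Central-difference estimate of -dE/dv_j for hidden neuron j."""
    up, down = list(v_hidden), list(v_hidden)
    up[j] += h
    down[j] -= h
    return -(error(up) - error(down)) / (2 * h)

v_hidden = [0.5, -0.3]
analytic = hidden_delta(v_hidden)
numeric = [numeric_delta(v_hidden, j) for j in range(len(v_hidden))]
print(analytic)
print(numeric)
```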

7.1.8. Back-Propagation algorithm : Algorithm<br>

1. __Initialization__  : Initialize weights and randomly shuffle training samples.
2. __Forward Computation__  : 
$$
v_j^{(l)}(n) = \sum_i w_{ji}^{(l)}(n) y_i^{(l-1)}(n) 
$$

$$
y_j^{(l)}(n) = 
\begin{cases}
x_j(n), & \text{for neuron} \,\ j \,\ \text{in the input layer, i.e.,} \,\ l = 1 \\
\varphi_j\left(v_j^{(l)}(n)\right), & \text{for neuron} \,\ j \,\ \text{in the hidden layer} \\
o_j(n), & \text{for neuron} \,\ j \,\ \text{in the output layer, i.e.,} \,\ l = L \\
\end{cases}
$$

$$
\text{And we can get} \quad e_j(n) = d_j(n) - o_j(n)
$$

3. __Backward Computation__  :
$$
\delta_j^{(l)}(n) = 
\begin{cases}
e_j^{(L)}(n) \varphi_j^\prime \left( v_j^{(L)}(n) \right) , & \text{for neuron} \,\ j \,\ \text{in the output layer} \,\ L \\
\varphi_j^\prime \left( v_j^{(l)}(n) \right) \sum_k \delta_k^{(l + 1)}(n) w_{kj}^{(l + 1)} (n), & \text{for neuron} \,\ j \,\ \text{in the hidden layer} \,\ l \\
\end{cases}
$$

4. __Weights update__  :

$$
w_{ji}^{(l)}(n + 1) = w_{ji}^{(l)}(n) + \eta \delta_j^{(l)}(n) y_i^{(l - 1)}(n)
$$

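When updating the weights, we can additionally include a momentum term, where $ \alpha $ is the momentum constant and $ \Delta w_{ji}^{(l)}(n - 1) $ is the previous weight correction: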

$$
w_{ji}^{(l)}(n + 1) = w_{ji}^{(l)}(n) + \eta \delta_j^{(l)}(n) y_i^{(l - 1)}(n) + \alpha \Delta w_{ji}^{(l)}(n - 1)
$$
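
A minimal sketch of the momentum update for a single weight follows. The function name `momentum_step` and all numeric values are illustrative assumptions; `prev_dw` plays the role of $\Delta w_{ji}^{(l)}(n - 1)$.

```python
def momentum_step(w, eta, delta_j, y_i, alpha, prev_dw):
    """Return (new weight, applied correction) for one weight."""
    dw = eta * delta_j * y_i + alpha * prev_dw
    return w + dw, dw

# With a constant gradient term, the correction grows: 0.1, 0.15, 0.175.
w, dw = 1.0, 0.0
for _ in range(3):
    w, dw = momentum_step(w, eta=0.1, delta_j=1.0, y_i=1.0, alpha=0.5, prev_dw=dw)
print(w, dw)
```

When successive gradient terms point in the same direction, the corrections accumulate, which is the accelerating effect momentum is meant to provide.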

7.1.9. XOR Problem<br>
Let's solve the __XOR problem__ with the above model and algorithm!
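
One possible way to do this is sketched below: a 2-4-1 network with sigmoid neurons, trained online with the four steps of the algorithm. The number of hidden neurons, the learning rate, the number of epochs, the seed, and all names are assumptions chosen for illustration, and convergence can depend on the random initialization.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, W1, W2):
    """Return (hidden outputs, network output); index 0 of each weight row is the bias."""
    h = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x))) for row in W1]
    o = sigmoid(W2[0] + sum(w * hj for w, hj in zip(W2[1:], h)))
    return h, o

def average_error(data, W1, W2):
    """Average MSE over the four XOR patterns."""
    return sum(0.5 * (d - forward(x, W1, W2)[1]) ** 2 for x, d in data) / len(data)

def train_xor(n_hidden=4, eta=0.5, epochs=10000, seed=0):
    rng = random.Random(seed)
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    # 1. Initialization: random weights in [-1, 1]
    W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial = average_error(data, W1, W2)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, d in data:
            # 2. Forward computation
            h, o = forward(x, W1, W2)
            # 3. Backward computation; for the sigmoid, phi'(v) = y * (1 - y)
            delta_o = (d - o) * o * (1 - o)
            delta_h = [hj * (1 - hj) * delta_o * W2[j + 1] for j, hj in enumerate(h)]
            # 4. Weight update (the bias input is the constant 1)
            W2[0] += eta * delta_o
            for j, hj in enumerate(h):
                W2[j + 1] += eta * delta_o * hj
            for j, row in enumerate(W1):
                row[0] += eta * delta_h[j]
                for i, xi in enumerate(x):
                    row[i + 1] += eta * delta_h[j] * xi
    return W1, W2, initial, average_error(data, W1, W2)

W1, W2, initial_error, final_error = train_xor()
print(initial_error, final_error)
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, round(forward(x, W1, W2)[1], 3))
```

Note that the hidden local gradients are computed before the output weights are changed, so the backward pass uses the same weights as the forward pass.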


<strong>Reference.</strong><br>
Simon Haykin, Neural networks and learning machines<br>
Yoshua Bengio, Deep Learning<br>