<script type="text/x-mathjax-config">
MathJax.Hub.Config({
    displayAlign: "center"
});
</script>

# MLP (Multi-Layer Perceptron)

## Perceptron
$$\begin{cases}
y = τ(s) \\
s = w_0 + \sum_{i=1}^d w_i x_i, & τ(s) = 
\begin{cases}
1 & s \ge 0 \\
-1 & s < 0
\end{cases}
\end{cases}$$
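
The perceptron above can be sketched in a few lines of Python. This is only an illustration: the function names and the example weights `w_or` (chosen by hand so that the unit behaves like a logical OR) are assumptions, not part of the original notes.

```python
def step(s):
    """Step activation: +1 when s >= 0, otherwise -1."""
    return 1 if s >= 0 else -1


def perceptron(w, x):
    """w = [w_0, w_1, ..., w_d] with w_0 the bias; x = [x_1, ..., x_d]."""
    s = w[0] + sum(w_i * x_i for w_i, x_i in zip(w[1:], x))
    return step(s)


# Hypothetical hand-picked weights: behaves like OR on inputs in {0, 1}
w_or = [-0.5, 1.0, 1.0]
outputs = [perceptron(w_or, x) for x in ([0, 0], [0, 1], [1, 0], [1, 1])]
print(outputs)
```

Only the input `[0, 0]` gives a negative sum, so it is the only one mapped to -1.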

## Objective function and Delta Rule
$w = (w_0, w_1, w_2, ..., w_d)^T$  
Objective function: $J(w)$  
 * $J(w) \ge 0$
 * if $w$ is optimal, $J(w) = 0$
 * the larger the error of $w$, the larger $J(w)$

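Let $Y$ be the set of samples misclassified by $w$, where each $x_k$ is augmented with $x_{k0} = 1$ and $y_k \in \{1, -1\}$ is its label:  
$$J(w) = \sum_{x_k \in Y} -y_k(w^T x_k)$$  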

Taking the partial derivative of $J(w)$ with respect to $w_i$ gives the components of the gradient $g$:  

$$\frac{\partial J(w)}{\partial w_i} = \sum_{x_k \in Y} -y_k x_{ki}, \quad i = 0, 1, ..., d$$

Delta Rule (move $w$ against the gradient with learning rate $ρ$)  
$$w_i = w_i + ρ \sum_{x_k \in Y} y_k x_{ki}, \quad i = 0, 1, ..., d$$
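
A minimal sketch of batch learning with the delta rule, on a toy AND problem. The data, the function names, and the choice to treat $s = 0$ as an error (so that learning can start from $w = 0$) are assumptions made for this example.

```python
def train_perceptron(samples, labels, rho=1.0, max_epochs=100):
    """Batch perceptron learning; labels are +1 / -1."""
    d = len(samples[0])
    w = [0.0] * (d + 1)  # w[0] is the bias weight w_0
    for _ in range(max_epochs):
        # Collect Y: samples the current w gets wrong (x_0 = 1 is prepended).
        # Assumption: s == 0 also counts as an error so that w = 0 can start learning.
        wrong = []
        for x, y in zip(samples, labels):
            xa = [1.0] + list(x)
            s = sum(wi * xi for wi, xi in zip(w, xa))
            if y * s <= 0:
                wrong.append((xa, y))
        if not wrong:
            break  # Y is empty, so J(w) = 0
        # Delta rule: w_i = w_i + rho * sum over Y of y_k * x_ki
        for i in range(d + 1):
            w[i] += rho * sum(y * xa[i] for xa, y in wrong)
    return w


def predict(w, x):
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s >= 0 else -1


# Toy data: logical AND with labels in {-1, +1}
samples = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels = [-1, -1, -1, 1]
w = train_perceptron(samples, labels)
predictions = [predict(w, x) for x in samples]
print(w, predictions)
```

Because AND is linearly separable, the loop stops once no sample is misclassified; for data that is not separable it would simply run until `max_epochs`.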

## Activation Function
|Function name|Function|First derivative|Range|
|:-------:|:---------------------:|:-----------------------------------:|:---:|
|Step|$$τ(s) = \begin{cases} 1 & s \ge 0 \\ -1 & s < 0 \end{cases}$$|$$τ'(s) = \begin{cases} 0 & s \ne 0 \\ \text{none} & s = 0 \end{cases}$$|-1 and 1|
|Logistic Sigmoid|$$τ(s) = {1 \over 1 + e^{-as}}$$|$$τ'(s) = aτ(s)(1 - τ(s))$$|(0, 1)|
|Hyperbolic Tanh|$$τ(s) = {2 \over 1 + e^{-as}} - 1$$|$$τ'(s) = {a \over 2}(1 - τ(s)^2)$$|(-1, 1)|
|Softplus|$$τ(s) = \ln(1 + e^s)$$|$$τ'(s) = {1 \over 1 + e^{-s}}$$|(0, ∞)|
|ReLU|$$τ(s) = \max(0, s)$$|$$τ'(s) = \begin{cases} 0 & s < 0 \\ 1 & s > 0 \\ \text{none} & s = 0 \end{cases}$$|[0, ∞)|
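
The differentiable rows of the table can be checked numerically. The sketch below implements each function and its derivative as written in the table and compares the derivative with a central finite difference. The function names are made up for this example, and returning 0 for the ReLU derivative at $s = 0$ is a common convention rather than something the table defines.

```python
import math


def logistic(s, a=1.0):
    return 1.0 / (1.0 + math.exp(-a * s))


def logistic_prime(s, a=1.0):
    t = logistic(s, a)
    return a * t * (1.0 - t)


def tanh_like(s, a=1.0):
    # The "Hyperbolic Tanh" row: 2 / (1 + e^{-as}) - 1, which equals tanh(a * s / 2)
    return 2.0 / (1.0 + math.exp(-a * s)) - 1.0


def tanh_like_prime(s, a=1.0):
    t = tanh_like(s, a)
    return 0.5 * a * (1.0 - t * t)


def softplus(s):
    return math.log(1.0 + math.exp(s))


def softplus_prime(s):
    return 1.0 / (1.0 + math.exp(-s))


def relu(s):
    return max(0.0, s)


def relu_prime(s):
    # Undefined at s = 0; 0 is returned there by convention (assumption)
    return 1.0 if s > 0 else 0.0


def numeric_derivative(f, s, h=1e-6):
    """Central finite difference approximation of f'(s)."""
    return (f(s + h) - f(s - h)) / (2.0 * h)


pairs = [(logistic, logistic_prime), (tanh_like, tanh_like_prime),
         (softplus, softplus_prime), (relu, relu_prime)]
for f, f_prime in pairs:
    for s in (-1.5, 0.3, 2.0):
        print(f.__name__, s, f_prime(s), numeric_derivative(f, s))
```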


## Multi-Layer Perceptron
Assume a two-layer perceptron consisting of an input layer, a hidden layer, and an output layer:  
<img src="./img/1_MLP.png" width="40%" height="40%">

$j^{th}$ hidden node computation:  
$$z_j = τ(zsum_j), \quad j = 1, 2, ..., p$$
$$zsum_j = \mathbf{u}_j^1 \mathbf{x}$$    
$k^{th}$ output node computation:  
$$o_k = τ(osum_k), \quad k = 1, 2, ..., c$$  
$$osum_k = \mathbf{u}_k^2 \mathbf{z}$$  

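Here $\mathbf{u}_j^1$ is the $j$-th row of $\mathbf{U}^1$ and $\mathbf{u}_k^2$ is the $k$-th row of $\mathbf{U}^2$; the bias is handled by the constant entries $x_0 = 1$ and $z_0 = 1$, which is why the index ranges in the gradient formulas start at 0. In matrix form:  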
$$\mathbf{o} = \mathbf{τ}(\mathbf{U}^2\mathbf{τ}_h(\mathbf{U}^1\mathbf{x}))$$
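
A minimal sketch of this forward pass in plain Python. It assumes the logistic sigmoid for both layers and stores the bias weights in column 0 of each matrix (matching $x_0 = 1$ and $z_0 = 1$); the weight values are arbitrary numbers chosen for illustration.

```python
import math


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(U1, U2, x, act=logistic):
    """U1 is p x (d+1), U2 is c x (p+1); column 0 of each holds the bias weights."""
    xa = [1.0] + list(x)                                  # x_0 = 1
    zsum = [sum(u * xi for u, xi in zip(row, xa)) for row in U1]
    z = [1.0] + [act(s) for s in zsum]                    # z_0 = 1
    osum = [sum(u * zj for u, zj in zip(row, z)) for row in U2]
    o = [act(s) for s in osum]
    return zsum, z, osum, o


# Hypothetical network with d = 2 inputs, p = 2 hidden nodes, c = 1 output
U1 = [[0.0, 1.0, -1.0],
      [0.0, -1.0, 1.0]]
U2 = [[0.0, 1.0, 1.0]]
zsum, z, osum, o = forward(U1, U2, [0.0, 0.0])
print(z, o)
```

With a zero input both hidden sums are 0, so each hidden node outputs 0.5 and the output node receives `osum = 1.0`.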



Objective function $J$ and the given data $\{\mathbb{X}, \mathbb{Y}\}$  
$\mathbb{X} = \{\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_n\}, \mathbb{Y} = \{\mathbf{y}_1, \mathbf{y}_2, ..., \mathbf{y}_n\}$  
They can be written as a feature matrix $\mathbf{X}$ of size $n \times d$ and a label matrix $\mathbf{Y}$ of size $n \times c$  


$$
\mathbf{X} = \begin{pmatrix}
{\mathbf{x}_1}^T \\
{\mathbf{x}_2}^T \\
\vdots \\
{\mathbf{x}_n}^T
\end{pmatrix},
\mathbf{Y} = \begin{pmatrix}
{\mathbf{y}_1}^T \\
{\mathbf{y}_2}^T \\
\vdots \\
{\mathbf{y}_n}^T
\end{pmatrix}
$$

The goal is to find the function $\mathbf{f}$ that maps $\mathbf{X}$ to $\mathbf{Y}$ perfectly. In other words, it is to find a classifier $\mathbf{f}$ that classifies every sample correctly.  


$$\mathbf{Y} = \mathbf{f}(\mathbf{X})$$

So learning amounts to computing the following, where $Θ = \{\mathbf{U}^1, \mathbf{U}^2\}$:  
$$\widehat{Θ} = \underset{Θ}{\text{argmin}} \parallel\mathbf{f}(\mathbf{X};Θ) - \mathbf{Y}\parallel_2^2$$  

The objective function for a single sample $(\mathbf{x}, \mathbf{y})$ with network output $\mathbf{o}$ is
$$J(Θ) = {1 \over 2}\parallel\mathbf{y} - \mathbf{o}(Θ)\parallel_2^2$$  
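
As a small worked example of this objective, the sketch below evaluates $J$ for one sample. The target and output values are invented for illustration.

```python
def objective(y, o):
    """J = 0.5 * ||y - o||_2^2 for one sample."""
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))


# Hypothetical target and network output with c = 2 output nodes
y = [1.0, 0.0]
o = [0.8, 0.3]
J = objective(y, o)  # 0.5 * (0.2^2 + 0.3^2) = 0.065 up to rounding
print(J)
```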

### Backpropagation
$$
\mathbf{U}^1 = \mathbf{U}^1 - ρ \frac{\partial J}{\partial \mathbf{U}^1} \\
\mathbf{U}^2 = \mathbf{U}^2 - ρ \frac{\partial J}{\partial \mathbf{U}^2}
$$


Backpropagation computes $\frac{\partial J}{\partial \mathbf{U}^1}$ and $\frac{\partial J}{\partial \mathbf{U}^2}$, the gradients needed for these updates.  


First, compute $\frac{\partial J}{\partial u_{kj}^2}$, because $\mathbf{U}^2$ is directly connected to the output layer.  

$$
\begin{align}
\frac{\partial J}{\partial u_{kj}^2}
& = \frac{\partial (0.5 \parallel\mathbf{y} - \mathbf{o}(\mathbf{U}^1, \mathbf{U}^2)\parallel_2^2)}{\partial u_{kj}^2} \\
& = \frac{\partial (0.5(y_k - o_k)^2)}{\partial u_{kj}^2} \\
& = -(y_k - o_k)\frac{\partial o_k}{\partial u_{kj}^2} \\
& = -(y_k - o_k)\frac{\partial τ(osum_k)}{\partial u_{kj}^2} \\
& = -(y_k - o_k)τ'(osum_k)\frac{\partial osum_k}{\partial u_{kj}^2} \\
& = -(y_k - o_k)τ'(osum_k)z_j
\end{align}
$$

Thus,   
$$δ_k = (y_k - o_k)τ'(osum_k), \quad 1 \le k \le c$$
$$\frac{\partial J}{\partial u_{kj}^2} = Δu_{kj}^2 = -δ_k z_j, \quad 0 \le j \le p, \; 1 \le k \le c$$  
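
The output-layer result can be sanity-checked against a finite difference. The sketch below assumes the logistic sigmoid with $a = 1$, so that $τ'(osum_k) = o_k(1 - o_k)$; all numeric values and function names are made up for the example.

```python
import math


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def outputs(U2, z):
    """o_k = tau(u_k^2 . z); z already contains the bias entry z_0 = 1."""
    return [logistic(sum(u * zj for u, zj in zip(row, z))) for row in U2]


def objective(y, o):
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))


def grad_U2(U2, z, y):
    o = outputs(U2, z)
    # delta_k = (y_k - o_k) * tau'(osum_k), with tau' = o_k * (1 - o_k) for the logistic
    delta = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, o)]
    # dJ/du2_kj = -delta_k * z_j
    return [[-dk * zj for zj in z] for dk in delta]


# Hypothetical values: p = 2 hidden nodes, c = 2 output nodes
z = [1.0, 0.2, 0.7]
y = [1.0, 0.0]
U2 = [[0.1, -0.3, 0.5],
      [0.2, 0.4, -0.1]]
analytic = grad_U2(U2, z, y)

# Central finite difference on a single weight (0-based row k, column j)
h, k, j = 1e-6, 1, 2
U2_plus = [row[:] for row in U2]
U2_plus[k][j] += h
U2_minus = [row[:] for row in U2]
U2_minus[k][j] -= h
numeric = (objective(y, outputs(U2_plus, z))
           - objective(y, outputs(U2_minus, z))) / (2 * h)
print(analytic[k][j], numeric)
```

The two printed numbers should agree to several decimal places if the formula for $δ_k$ is implemented correctly.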



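Next, compute $\frac{\partial J}{\partial u_{ji}^1}$. Unlike $u_{kj}^2$, which affects only output node $k$, $u_{ji}^1$ affects every output node through $z_j$, so the derivative sums over all $c$ outputs.  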

$$
\begin{align}
\frac{\partial J}{\partial u_{ji}^1}
& = \frac{\partial (0.5 \parallel\mathbf{y} - \mathbf{o}(\mathbf{U}^1, \mathbf{U}^2)\parallel_2^2)}{\partial u_{ji}^1} \\
& = - \sum_{q=1}^c (y_q - o_q) \frac{\partial o_q}{\partial u_{ji}^1} \\
& = - \sum_{q=1}^c (y_q - o_q)τ'(osum_q) \frac{\partial osum_q}{\partial u_{ji}^1} \\
& = - \sum_{q=1}^c (y_q - o_q)τ'(osum_q) \frac{\partial osum_q}{\partial z_j} \frac{\partial z_j}{\partial u_{ji}^1} \\
& = - \sum_{q=1}^c (y_q - o_q)τ'(osum_q)u_{qj}^2 \frac{\partial z_j}{\partial u_{ji}^1} \\
& = -τ'(zsum_j)x_i \sum_{q=1}^c (y_q - o_q)τ'(osum_q)u_{qj}^2
\end{align}
$$

Thus,   
$$η_j = τ'(zsum_j) \sum_{q=1}^c δ_q u_{qj}^2, \quad 1 \le j \le p$$
$$\frac{\partial J}{\partial u_{ji}^1} = Δu_{ji}^1 = -η_j x_i, \quad 0 \le i \le d, \; 1 \le j \le p$$
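
Putting both results together, the following is a minimal sketch of one backpropagation step for a single sample. It assumes the logistic sigmoid with $a = 1$ in both layers, bias weights in column 0 of each matrix, and randomly initialised weights; the names `backprop` and `sgd_step`, the sizes, and the sample are all hypothetical.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(U1, U2, x):
    xa = [1.0] + list(x)                                   # x_0 = 1
    z = [1.0] + [logistic(sum(u * xi for u, xi in zip(row, xa))) for row in U1]
    o = [logistic(sum(u * zj for u, zj in zip(row, z))) for row in U2]
    return xa, z, o


def objective(U1, U2, x, y):
    _, _, o = forward(U1, U2, x)
    return 0.5 * sum((yk - ok) ** 2 for yk, ok in zip(y, o))


def backprop(U1, U2, x, y):
    xa, z, o = forward(U1, U2, x)
    # delta_k = (y_k - o_k) * tau'(osum_k)
    delta = [(yk - ok) * ok * (1.0 - ok) for yk, ok in zip(y, o)]
    # eta_j = tau'(zsum_j) * sum_q delta_q * u2_qj for hidden nodes j = 1..p
    # (z[0] is the constant bias node and has no incoming weights)
    eta = [z[j] * (1.0 - z[j]) * sum(delta[q] * U2[q][j] for q in range(len(U2)))
           for j in range(1, len(z))]
    dU2 = [[-dk * zj for zj in z] for dk in delta]         # dJ/du2_kj = -delta_k z_j
    dU1 = [[-ej * xi for xi in xa] for ej in eta]          # dJ/du1_ji = -eta_j x_i
    return dU1, dU2


def sgd_step(U1, U2, x, y, rho):
    """U = U - rho * dJ/dU for both weight matrices."""
    dU1, dU2 = backprop(U1, U2, x, y)
    new_U1 = [[u - rho * g for u, g in zip(ur, gr)] for ur, gr in zip(U1, dU1)]
    new_U2 = [[u - rho * g for u, g in zip(ur, gr)] for ur, gr in zip(U2, dU2)]
    return new_U1, new_U2


random.seed(0)
d, p, c = 2, 3, 2
U1 = [[random.uniform(-1, 1) for _ in range(d + 1)] for _ in range(p)]
U2 = [[random.uniform(-1, 1) for _ in range(p + 1)] for _ in range(c)]
x, y = [0.5, -1.0], [1.0, 0.0]

# Gradient check on one first-layer weight (0-based row, column)
dU1, dU2 = backprop(U1, U2, x, y)
h, row, col = 1e-6, 1, 1
U1_plus = [r[:] for r in U1]
U1_plus[row][col] += h
U1_minus = [r[:] for r in U1]
U1_minus[row][col] -= h
numeric = (objective(U1_plus, U2, x, y) - objective(U1_minus, U2, x, y)) / (2 * h)

# One small gradient step on the same sample
before = objective(U1, U2, x, y)
U1_new, U2_new = sgd_step(U1, U2, x, y, rho=0.1)
after = objective(U1_new, U2_new, x, y)
print(dU1[row][col], numeric, before, after)
```

The finite-difference check guards against sign and index mistakes in $η_j$, and comparing `before` with `after` shows whether a small step against the gradient lowers $J$ on that sample.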