# Optimization

## Objective function
The simplest error function is the mean squared error (MSE), $e = {1 \over 2} \lVert \mathbf{y} - \mathbf{o} \rVert_2^2$. The gradient acts as a penalty that corrects the weights and biases to reduce the error during learning. With a single output node, the error can be written as  
$$e = {1 \over 2} (y - o)^2 = {1 \over 2} (y - \sigma(wx + b))^2, \text{ where } \sigma \text{ is the logistic sigmoid function}$$  
The gradients are  
$$
\frac {\partial e} {\partial w} = -(y - o)x\sigma'(wx + b) \\
\frac {\partial e} {\partial b} = -(y - o)\sigma'(wx + b)
$$  
Using MSE with the logistic sigmoid makes learning slow. The derivative of the logistic sigmoid is largest when its input is 0 and converges to 0 as the input increases or decreases. So the larger $\lvert wx + b \rvert$ is, the smaller the gradient, even when the error itself is large.  
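
The effect can be seen in a minimal sketch, assuming a single input $x = 1$, a target $y = 0$, and a zero bias (these values are chosen only for illustration). As $wx + b$ grows, the output gets more wrong, yet the gradient gets smaller.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def mse_gradients(w, b, x, y):
    """Gradients of e = 0.5 * (y - sigmoid(w*x + b))**2 w.r.t. w and b."""
    z = w * x + b
    o = sigmoid(z)
    d_sigma = o * (1.0 - o)  # sigma'(z) = sigma(z) * (1 - sigma(z))
    grad_w = -(y - o) * x * d_sigma
    grad_b = -(y - o) * d_sigma
    return grad_w, grad_b

# Same target and input, but increasingly wrong pre-activations z = w*x + b.
for w in (0.0, 2.0, 5.0, 10.0):
    gw, gb = mse_gradients(w, b=0.0, x=1.0, y=0.0)
    print(f"z = {w:5.1f}  de/dw = {gw:.6f}")
```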

### Cross-Entropy function
The cross-entropy between two distributions $P$ and $Q$ is  
$$H(P, Q) = -\textstyle \sum_{y \in \{0,1\}} P(y)\log_2Q(y)$$  

For a binary label $y$ and an output $o$, the two distributions are  
$$P(0) = 1 - y \quad Q(0) = 1 - o \\
P(1) = y \quad Q(1) = o$$  

So  
$$e = -(y\log_2 o + (1 - y) \log_2 (1 - o)), \quad o = \sigma(z), z = wx + b$$  

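Taking the logarithm as the natural logarithm (a base-2 logarithm only scales the result by the constant $1 / \ln 2$), the gradient of the cross-entropy objective function is  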
$$
\begin{alignat}{4}
\frac {\partial e} {\partial w} & = - ({y \over o} - {1 - y \over 1 - o})\frac {\partial o} {\partial w} \\
& = - ({y \over o} - {1 - y \over 1 - o})x\sigma'(z) \\
& = -x ({y \over o} - {1 - y \over 1 - o})o(1 - o) \\
& = x(o - y)
\end{alignat}
$$  
Thus,  
$$\frac {\partial e}{\partial w} = x(o - y) \\
\frac {\partial e}{\partial b} = (o - y)$$  
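
The closed form no longer contains $\sigma'(z)$, so the gradient stays large while the output is wrong. A minimal sketch that checks $x(o - y)$ against a finite-difference estimate is given here; it assumes the natural logarithm, and the step size `h` and the sample values are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(w, b, x, y):
    """e = -(y ln o + (1 - y) ln(1 - o)) with o = sigmoid(w*x + b)."""
    o = sigmoid(w * x + b)
    return -(y * math.log(o) + (1.0 - y) * math.log(1.0 - o))

def analytic_grad_w(w, b, x, y):
    """Closed form from the derivation: de/dw = x * (o - y)."""
    return x * (sigmoid(w * x + b) - y)

def numeric_grad_w(w, b, x, y, h=1e-6):
    """Central finite difference, used only as a sanity check."""
    return (cross_entropy(w + h, b, x, y) - cross_entropy(w - h, b, x, y)) / (2 * h)

# A badly wrong output (o is close to 1 while y = 0) still gets a large gradient.
w, b, x, y = 5.0, 0.0, 1.0, 0.0
print(analytic_grad_w(w, b, x, y), numeric_grad_w(w, b, x, y))
```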

These are the gradients of the cross-entropy objective function for a single output node. For an output vector $\mathbf{o} = (o_1, o_2, ..., o_c)^T$, the objective function is  
$$e = - \sum_{i=1,c} (y_i\log_2o_i + (1 - y_i) \log_2(1 - o_i))$$  

### Log likelihood function
For several reasons, the output nodes may use a different activation function from the hidden nodes. One such activation function is softmax.  
$$o_j = {e^{s_j} \over \textstyle \sum_{i=1,c} e^{s_i}}$$  
The softmax function amplifies the maximum value and suppresses the smaller ones. Its outputs always sum to 1.
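
A minimal sketch of softmax in plain Python follows. Subtracting the maximum score before exponentiating is an addition made here for numerical stability; it is not part of the formula above and does not change the result.

```python
import math

def softmax(s):
    """Softmax of a list of scores s_1..s_c."""
    m = max(s)  # stability shift (an assumption of this sketch)
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs)       # the largest score gets the largest share
print(sum(probs))  # the outputs sum to 1 (up to rounding)
```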

The log likelihood function uses only one output node, $o_y$.  
$$e = - \log_2 o_y$$  
$o_y$ is the output value of the node that corresponds to the label of the sample.  


Softmax is designed to suppress non-maximal values and push them closer to zero. Therefore, the softmax function goes well with the log likelihood objective function, which looks only at the output node of the class indicated by the label. For this reason, deep learning often uses the combination of the softmax activation function and the log likelihood objective function.  
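
The combination can be sketched as follows. The scores and labels are made-up values, and the helper names are hypothetical.

```python
import math

def softmax(s):
    m = max(s)  # stability shift; does not change the result
    exps = [math.exp(v - m) for v in s]
    total = sum(exps)
    return [e / total for e in exps]

def log_likelihood_loss(outputs, label):
    """e = -log2(o_y): only the output of the labelled class is used."""
    return -math.log2(outputs[label])

o = softmax([2.0, 1.0, 0.1])
print(log_likelihood_loss(o, 0))  # small loss: class 0 has the largest output
print(log_likelihood_loss(o, 2))  # larger loss: class 2 has the smallest output
```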


## Optimization for performance improvement
$$
\text{"}\cdots \text{the wisdom distilled here should be taken as a guideline, to be tried and challenged, not as a practice set in stone.}\cdots\text{"} \\
\text{- } \ulcorner \text{Neural Networks: Tricks of the Trade} \lrcorner
$$  

### Data preprocessing
Features are measured in different units, and some features have only positive values. Such data can be slow to converge. When all inputs are positive, multiple weights increase or decrease together, so the path to the lowest point zigzags, leading to a slower convergence rate. To avoid this problem, we normalize the data. Normalization makes the mean of each feature 0 and its standard deviation 1.  
$$x_i^{new} = {x_i^{old} - \mu_i \over \sigma_i}$$  
Then, what about nominal values (categorical features)? We one-hot encode them. One-hot encoding sets the bit that corresponds to the value to hot (1) and the rest to cold (0).  
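
Both preprocessing steps can be sketched in a few lines. The feature values and category names are invented for illustration, and a constant column (zero standard deviation) is not handled.

```python
import math

def standardize(column):
    """Return (x - mean) / std for one feature column (population std)."""
    n = len(column)
    mu = sum(column) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in column) / n)
    return [(v - mu) / sigma for v in column]

def one_hot(value, categories):
    """Encode a nominal value as a list with a single 1."""
    return [1 if value == c else 0 for c in categories]

heights_cm = [150.0, 160.0, 170.0, 180.0]
print(standardize(heights_cm))
print(one_hot("green", ["red", "green", "blue"]))  # [0, 1, 0]
```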

### Initialize weight
We need to initialize the weights **randomly** to break symmetry, that is, to keep the nodes of a layer from having identical weights. It does not matter much whether the random numbers are drawn from a Gaussian or a uniform distribution. However, the range of the random numbers is important. If the weights are too close to zero, the gradients will be very small, leading to very slow learning. On the other hand, if the weights are too large, the network may overfit.  


There are several rules of thumb for determining the range of random numbers when using a uniform distribution.  


$$r = {1 \over \sqrt{n_{in}}} \\
r = {\sqrt{6} \over \sqrt{n_{in} + n_{out}}}$$  
After choosing one of the expressions above to determine $r$, generate random numbers in $[-r, r]$. $n_{in}$ is the number of edges coming into the node, and $n_{out}$ is the number of edges going out of the node.  
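
A minimal sketch of both rules of thumb, using only the standard library. The function names, the layer sizes, and the fixed seed are assumptions made for the example.

```python
import math
import random

def init_range(n_in, n_out=None):
    """r = 1/sqrt(n_in) if n_out is None, else sqrt(6)/sqrt(n_in + n_out)."""
    if n_out is None:
        return 1.0 / math.sqrt(n_in)
    return math.sqrt(6.0) / math.sqrt(n_in + n_out)

def init_weights(n_in, n_out, rng, use_fan_out=True):
    """n_out x n_in weight matrix drawn uniformly from [-r, r]."""
    r = init_range(n_in, n_out if use_fan_out else None)
    return [[rng.uniform(-r, r) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
W = init_weights(n_in=4, n_out=3, rng=rng)
print(init_range(4), init_range(4, 3))
print(W)
```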

### Momentum
Momentum smooths the current gradient using a vector $\mathbf{v}$ representing velocity. In physics, the product of mass and velocity is momentum. For neural networks, only the velocity is used assuming the mass is 1.  
$$
\begin{array}{lcl}
\mathbf{v} = \alpha\mathbf{v} - \rho \frac {\partial J}{\partial \mathbf{\Theta}} \\
\mathbf{\Theta} = \mathbf{\Theta} + \mathbf{v}
\end{array}
$$  
The velocity vector $\mathbf{v}$ is an accumulation of the previous gradients. The range of $\alpha$ is $[0,1]$, and the larger $\alpha$ is, the greater the weight on the previous gradient information and the smoother the trajectory that $\Theta$ draws. Usually, $\alpha$ is 0.5, 0.9, or 0.99. Alternatively, $\alpha$ starts at 0.5 and is gradually increased to 0.99 as the number of epochs increases. Momentum reduces overshooting compared with plain gradient descent, and consequently finds the solution in far fewer iterations.  
Nesterov momentum is an improvement on momentum. It uses the current $\mathbf{v}$ to predict the next location $\tilde{\mathbf{\Theta}}$ and then uses the gradient $\frac {\partial J}{\partial \mathbf{\Theta}}\mid_{\tilde{\mathbf{\Theta}}}$ at the predicted location.  
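
The momentum update can be sketched on a toy quadratic objective. The objective, the learning rate $\rho = 0.1$, and the iteration count are assumptions chosen only so that the example runs.

```python
def momentum_step(theta, v, grad, rho=0.1, alpha=0.9):
    """One update: v = alpha*v - rho*grad, then theta = theta + v."""
    v = [alpha * vi - rho * gi for vi, gi in zip(v, grad)]
    theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta, v

def grad_J(theta):
    """Gradient of the toy objective J = 0.5*(theta_0^2 + 10*theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]

theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    theta, v = momentum_step(theta, v, grad_J(theta))
print(theta)  # both coordinates end up close to the minimum at (0, 0)
```
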
$$
\begin{array}{lcl}
\tilde{\mathbf{\Theta}} = \mathbf{\Theta} + \alpha \mathbf{v} \\
\mathbf{v} = \alpha \mathbf{v} - \rho \frac {\partial J}{\partial \mathbf{\Theta}}\mid_{\tilde{\mathbf{\Theta}}} \\
\mathbf{\Theta} = \mathbf{\Theta} + \mathbf{v}
\end{array}
$$  
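
The three Nesterov equations can be sketched the same way. The only difference from plain momentum is where the gradient is evaluated; the toy objective and the hyperparameter values are again assumptions.

```python
def nesterov_step(theta, v, grad_fn, rho=0.1, alpha=0.9):
    """Evaluate the gradient at the look-ahead point theta + alpha*v."""
    lookahead = [ti + alpha * vi for ti, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v = [alpha * vi - rho * gi for vi, gi in zip(v, g)]
    theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta, v

def grad_J(theta):
    """Gradient of the toy objective J = 0.5*(theta_0^2 + 10*theta_1^2)."""
    return [theta[0], 10.0 * theta[1]]

theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_J)
print(theta)
```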

### Adaptive learning rate
AdaGrad (Adaptive Gradient) uses an adaptive learning rate.
$$
\begin{array}{lcl}
\mathbf{r} = \mathbf{r} + (\mathbf{g} \odot \mathbf{g}) \\
\Delta \mathbf{\Theta} = -{\rho \mathbf{g} \over \epsilon + \sqrt{\mathbf{r}}} \\
\mathbf{\Theta} = \mathbf{\Theta} + \Delta \mathbf{\Theta}
\end{array}
$$  
$\mathbf{r}$ is a vector that accumulates the squares of the previous gradients, and $\Delta \mathbf{\Theta}$ is the update value. The $\epsilon$ (usually between $10^{-7}$ and $10^{-5}$) prevents the denominator from becoming zero. If $r_i$ is large, $\left\vert \Delta \theta_i \right\vert$ is small, so $\theta_i$ moves a little. On the contrary, if the accumulated value of the previous gradients is small, it moves a lot. Thus, ${\rho \over \epsilon + \sqrt{r_i}}$ is an adaptive learning rate.  
Looking at $\mathbf{r} = \mathbf{r} + (\mathbf{g} \odot \mathbf{g})$, the old and new gradients carry the same weight until the algorithm finishes. As a result, $\mathbf{r}$ keeps growing, and the adaptive learning rate may approach zero before the parameters have converged sufficiently.  
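
A minimal sketch of one AdaGrad step; the values of $\rho$ and $\epsilon$ and the example gradient are arbitrary. In the first step both coordinates move by roughly $\rho$ even though their gradients differ by a factor of 10, which illustrates the per-parameter learning rate.

```python
import math

def adagrad_step(theta, r, grad, rho=0.5, eps=1e-7):
    """r accumulates squared gradients; each parameter gets its own step size."""
    r = [ri + gi * gi for ri, gi in zip(r, grad)]
    delta = [-rho * gi / (eps + math.sqrt(ri)) for gi, ri in zip(grad, r)]
    theta = [ti + di for ti, di in zip(theta, delta)]
    return theta, r

theta, r = [1.0, 1.0], [0.0, 0.0]
theta, r = adagrad_step(theta, r, grad=[1.0, 10.0])
print(theta, r)
```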


RMSProp uses a weighted moving average to exponentially reduce the influence of old gradients.  
$$
\mathbf{r} = \alpha \mathbf{r} + (1 - \alpha)\mathbf{g} \odot \mathbf{g}
$$  
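
A sketch of RMSProp, assuming the parameter update itself is the same as in AdaGrad and only the accumulation of $\mathbf{r}$ changes to an exponentially weighted moving average. The hyperparameter values are illustrative.

```python
import math

def rmsprop_step(theta, r, grad, rho=0.01, alpha=0.9, eps=1e-7):
    """r = alpha*r + (1 - alpha)*g*g, then the AdaGrad-style update."""
    r = [alpha * ri + (1.0 - alpha) * gi * gi for ri, gi in zip(r, grad)]
    theta = [ti - rho * gi / (eps + math.sqrt(ri))
             for ti, gi, ri in zip(theta, grad, r)]
    return theta, r

theta, r = [1.0], [0.0]
for _ in range(100):
    theta, r = rmsprop_step(theta, r, grad=[1.0])
print(r)  # stays bounded under a constant gradient instead of growing
```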


Adam (Adaptive Moment Estimation) is an algorithm that adds momentum to RMSProp.  
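
The text gives no update rule for Adam, so the following is only a sketch of the commonly used formulation: a momentum-style average `v`, an RMSProp-style average `r`, and a bias correction for both. The bias correction, the variable names, and the default hyperparameters are assumptions not taken from this document.

```python
import math

def adam_step(theta, v, r, grad, t, rho=0.001, a1=0.9, a2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    # Momentum part: moving average of the gradient.
    v = [a1 * vi + (1 - a1) * gi for vi, gi in zip(v, grad)]
    # RMSProp part: moving average of the squared gradient.
    r = [a2 * ri + (1 - a2) * gi * gi for ri, gi in zip(r, grad)]
    # Bias correction for the zero initialisation of v and r.
    v_hat = [vi / (1 - a1 ** t) for vi in v]
    r_hat = [ri / (1 - a2 ** t) for ri in r]
    theta = [ti - rho * vh / (math.sqrt(rh) + eps)
             for ti, vh, rh in zip(theta, v_hat, r_hat)]
    return theta, v, r

theta, v, r = [1.0], [0.0], [0.0]
theta, v, r = adam_step(theta, v, r, grad=[10.0], t=1)
print(theta)  # the first step has size close to rho, whatever the gradient scale
```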

<img src="./img/3_Optimizers.gif" width="30%" height="30%">  
reference : http://cs231n.github.io/neural-networks-3/

### Activation function
Forward propagation:  
$$
z = \mathbf{w}^T\tilde{\mathbf{x}} + b \\
y = \tau(z)
$$  
$\tilde{\mathbf{x}}$ is the signal from the $(l - 1)^{th}$ layer to the $l^{th}$ layer. If the activation function is linear, every layer is a linear computation, so a stack of layers has the same effect as a single layer. Therefore we should use a non-linear activation function. The step function, tanh, and ReLU are examples of non-linear functions.  
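
As a worked example, consider two stacked layers with the identity activation. The symbols $\mathbf{W}_1, \mathbf{b}_1, \mathbf{W}_2, \mathbf{b}_2$ are introduced here only for illustration.  
$$
\mathbf{W}_2(\mathbf{W}_1\mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2 = (\mathbf{W}_2\mathbf{W}_1)\mathbf{x} + (\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2)
$$  
The right-hand side is a single linear layer with weight $\mathbf{W}_2\mathbf{W}_1$ and bias $\mathbf{W}_2\mathbf{b}_1 + \mathbf{b}_2$, so the extra layer adds no expressive power.  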


The tanh function has the range $(-1, 1)$ and is differentiable over the entire interval. When the magnitude of the input grows beyond some point, the output saturates close to 1 or -1 and the derivative approaches zero. Parameter updates happen very slowly when the derivative is close to zero.  


ReLU(Rectified Linear Unit) function is  
$$
z = \mathbf{w}^T\tilde{\mathbf{x}} + b \\
y = ReLU(z) = \max(0, z)
$$  
ReLU is linear in the positive region, so saturation does not occur there. It also makes the neural network sparse, because the output is zero in the negative region. A sparse neural network is good at disentangling different factors of variation. There are several variants of ReLU, such as leaky ReLU and PReLU (parametric ReLU).
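
A short sketch comparing the derivatives of tanh and ReLU. The leaky ReLU slope of 0.01 is a common choice assumed here, not a value given in the text.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, slope=0.01):
    """Leaky ReLU: a small non-zero slope on the negative side."""
    return z if z > 0 else slope * z

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2, which saturates for large |z|."""
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    """Derivative of ReLU: 1 on the positive side, 0 on the negative side."""
    return 1.0 if z > 0 else 0.0

for z in (-5.0, 0.5, 5.0):
    print(z, relu(z), leaky_relu(z), tanh_grad(z), relu_grad(z))
```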

### Batch normalization
We use batch normalization to mitigate covariate shift, the phenomenon in which the distribution of the inputs to a layer changes during learning. Covariate shift is one of the reasons why a neural network with many layers is hard to train. It is more efficient to normalize each mini-batch of the training set than the entire training set at once. So, for a mini-batch of size $m$,  


$$
\begin{array}{lcl}
\mu_B = {1 \over m} \sum_{i=1}^m z_i  \\
\sigma_B^2 = {1 \over m} \sum_{i=1}^m (z_i - \mu_B)^2 \\
\tilde{z}_i = {z_i - \mu_B \over \sqrt{\sigma_B^2 + \epsilon}}, \quad i = 1, 2, ..., m \\
z'_i = \gamma \tilde{z}_i + \beta, \quad i = 1, 2, ..., m
\end{array}
$$  

$\gamma$ and $\beta$ are learnable parameters that are updated during training together with the weights.  
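
The four equations can be sketched for one feature over one mini-batch. This covers the training-time computation only; the batch values and the values of $\gamma$ and $\beta$ are invented for the example.

```python
import math

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch z = [z_1, ..., z_m]."""
    m = len(z)
    mu = sum(z) / m                              # mini-batch mean
    var = sum((zi - mu) ** 2 for zi in z) / m    # mini-batch variance
    z_tilde = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    return [gamma * zt + beta for zt in z_tilde]  # scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
print(batch_norm(batch, gamma=2.0, beta=0.5))
```
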
Batch normalization has two positive effects:  
 - Training is less sensitive to the initial values of the parameters.
 - A larger learning rate can be used, which improves the convergence speed.  
 
 
## Regularization
### Weight penalty  
$$J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) = J(\Theta;\mathbb{X}, \mathbb{Y}) + \lambda R(\Theta)$$  
$J$ is the objective function, and $J_{regularized}$ is the regularized one. $R(\Theta)$ is a penalty on the weights; it encodes prior knowledge that is independent of the dataset. The parameter $\Theta$ includes both weights and biases. However, we do not regularize the biases: each bias is tied to a single node, so it does not need to be constrained, and regularizing it can lead to underfitting.  
 - **L2 Norm(weight decay)**  
 The penalty is the squared L2 norm.  
 $$J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) = J(\Theta;\mathbb{X}, \mathbb{Y}) + \lambda \lVert \Theta \rVert_2^2$$  
 The gradient is  
 $$\nabla J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) = \nabla J(\Theta;\mathbb{X}, \mathbb{Y}) + 2 \lambda \Theta
 $$  
 So the parameters are updated using this gradient.  
 $$
 \begin{alignat}{3}
 \Theta & = \Theta - \rho\nabla J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) \\
 & = \Theta - \rho(\nabla J(\Theta;\mathbb{X}, \mathbb{Y}) + 2\lambda \Theta) \\
 & = (1 - 2 \rho \lambda)\Theta - \rho \nabla J(\Theta;\mathbb{X}, \mathbb{Y})
 \end{alignat}
 $$  
 $\rho$ is the learning rate, and $\lambda$ is the coefficient of the L2 penalty. The learning rate is usually set to a number much smaller than 1, so $2\rho\lambda$ is also smaller than 1. Thus, each update first shrinks the parameter by the factor $(1 - 2\rho\lambda)$ and then adds $-\rho\nabla J$.  
 
 
 - **L1 Norm**  
 It is the sum of the absolute values of the parameter values.  
 $$J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) = J(\Theta;\mathbb{X}, \mathbb{Y}) + \lambda \lVert \Theta \rVert_1$$  
 The gradient is  
 $$\nabla J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) = \nabla J(\Theta;\mathbb{X}, \mathbb{Y}) + \lambda\mathbf{sign}(\Theta)
 $$  
 $\mathbf{sign}(\Theta)$ is the vector of the signs of the parameters: an element is 1 if the corresponding parameter is positive and -1 if it is negative.  
 So the parameters are updated using this gradient.  
 $$
 \begin{alignat}{3}
 \Theta & = \Theta - \rho\nabla J_{regularized}(\Theta;\mathbb{X}, \mathbb{Y}) \\
 & = \Theta - \rho(\nabla J(\Theta;\mathbb{X}, \mathbb{Y}) + \lambda\mathbf{sign}(\Theta)) \\
 & = \Theta - \rho \nabla J(\Theta;\mathbb{X}, \mathbb{Y}) - \rho \lambda\mathbf{sign}(\Theta)
 \end{alignat}
 $$  
 $\Theta$ moves by $-\rho\nabla J$ and additionally by $-\rho\lambda\mathbf{sign}(\Theta)$, a step of constant size toward zero. With the L1 norm, sparsity can occur, a phenomenon in which a large number of parameters become zero.  
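
The two update rules can be compared in a small sketch. The gradient of $J$ is set to zero to isolate the effect of the penalty; the values of $\rho$ and $\lambda$ are arbitrary, and returning 0 for the sign of an exactly zero parameter is an assumption of this sketch.

```python
def sign(t):
    """1 for positive, -1 for negative, 0 at exactly zero."""
    return (t > 0) - (t < 0)

def l2_step(theta, grad, rho=0.1, lam=0.01):
    """theta = (1 - 2*rho*lam)*theta - rho*grad"""
    return [(1.0 - 2.0 * rho * lam) * t - rho * g for t, g in zip(theta, grad)]

def l1_step(theta, grad, rho=0.1, lam=0.01):
    """theta = theta - rho*grad - rho*lam*sign(theta)"""
    return [t - rho * g - rho * lam * sign(t) for t, g in zip(theta, grad)]

theta = [2.0, -0.5]
zero_grad = [0.0, 0.0]  # isolate the effect of the penalty term
print(l2_step(theta, zero_grad))  # each weight shrinks by the factor 0.998
print(l1_step(theta, zero_grad))  # each weight moves 0.001 toward zero
```

The L2 step is proportional to the size of the weight, while the L1 step has the same size for every non-zero weight, which is why small weights are driven all the way to zero.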

### Early stopping
The longer a model is trained, the closer it gets to the optimum on the training set. But beyond some point, the model starts to memorize the training data, and its performance on the validation set gets worse. In other words, its generalization ability begins to fall. Therefore, an effective strategy is to stop training at the point where the generalization ability is highest, that is, where the validation error is lowest.  
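
One common way to implement this is with a patience counter, sketched below. The `patience` parameter and the list of validation errors are assumptions for illustration; the text does not prescribe a particular stopping rule.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the best validation error, stopping once the
    error has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical validation errors: improving at first, then getting worse.
val_errors = [0.50, 0.40, 0.35, 0.33, 0.34, 0.36, 0.39, 0.45]
print(early_stopping_epoch(val_errors))  # 3
```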

### Data augmentation
The most obvious way to prevent overfitting is to use a sufficiently large training set. One practical way to increase the amount of data at low cost is to artificially transform the data you already have, for example with affine transformations. Note that if the degree of transformation is too large, a sample can be changed into another class.  
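
A minimal sketch with two simple affine transformations, a translation and a horizontal flip, applied to a tiny image stored as a list of rows. The image and the helper names are made up. Whether a transformation preserves the class depends on the data: for handwritten digits, for instance, a 180-degree rotation can turn a 6 into a 9.

```python
def translate_right(image, dx, fill=0):
    """Shift a 2-D image (list of rows) dx >= 0 pixels to the right."""
    width = len(image[0])
    dx = min(dx, width)  # shifting by more than the width empties the image
    return [[fill] * dx + row[:width - dx] for row in image]

def flip_horizontal(image):
    """Mirror each row."""
    return [row[::-1] for row in image]

image = [[1, 2, 3],
         [4, 5, 6]]
augmented = [image, translate_right(image, 1), flip_horizontal(image)]
print(translate_right(image, 1))  # [[0, 1, 2], [0, 4, 5]]
print(flip_horizontal(image))     # [[3, 2, 1], [6, 5, 4]]
```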

### Dropout
Dropout randomly selects and removes some nodes (together with their edges) of the input layer and the hidden layers at a predetermined ratio. Dropout can be seen as a kind of ensemble, because each random removal yields a different thinned network. However, it is very difficult to set appropriate hyperparameters for each of these networks and train them separately. Weight sharing solves this problem: all the thinned networks share one set of weights.
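
A minimal sketch of a dropout mask applied to the activations of one layer. Scaling the surviving activations by $1 / (1 - \text{ratio})$ during training ("inverted dropout") is an assumption of this sketch, chosen so that nothing needs to be rescaled at test time; the text does not specify it.

```python
import random

def dropout(activations, drop_ratio, rng, training=True):
    """Zero each activation with probability drop_ratio during training."""
    if not training or drop_ratio == 0.0:
        return list(activations)
    keep = 1.0 - drop_ratio
    # Survivors are scaled by 1/keep so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed only to make the sketch reproducible
h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h, 0.5, rng))                  # some entries zeroed, the rest doubled
print(dropout(h, 0.5, rng, training=False))  # unchanged at test time
```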