## Smart Optimization Schemes

- **Why use SGD and/or BGD instead of full-sample gradient calculations?**

    - When the **data set is too large**, including the full sample in the gradient computation (which is usually the sum of the gradients at each sample point) is time-consuming.
    - By introducing randomness, SGD and BGD can **keep the optimization from getting stuck in local minima**.
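
As a minimal sketch of the cost and noise trade-off above, the following compares a full-sample gradient with a mini-batch gradient on a toy linear-regression problem. The data, the sizes, and the helper `mse_gradient` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data (hypothetical sizes, for illustration only).
n_samples, n_features = 10_000, 5
X = rng.normal(size=(n_samples, n_features))
w_true = rng.normal(size=n_features)
y = X @ w_true + 0.1 * rng.normal(size=n_samples)


def mse_gradient(w, X_part, y_part):
    """Gradient of the mean squared error over the rows of X_part."""
    residual = X_part @ w - y_part
    return 2.0 * X_part.T @ residual / len(y_part)


w = np.zeros(n_features)

# Full-sample gradient: touches all n_samples rows.
full_grad = mse_gradient(w, X, y)

# Mini-batch gradient: touches only batch_size rows, so it is cheap but noisy.
batch_size = 64
idx = rng.choice(n_samples, size=batch_size, replace=False)
batch_grad = mse_gradient(w, X[idx], y[idx])

# Averaging many mini-batch gradients approaches the full-sample gradient,
# i.e. each mini-batch gradient is a noisy estimate of the same quantity.
batch_grads = []
for _ in range(2000):
    idx = rng.choice(n_samples, size=batch_size, replace=False)
    batch_grads.append(mse_gradient(w, X[idx], y[idx]))
mean_batch_grad = np.mean(batch_grads, axis=0)

print("full gradient:       ", full_grad)
print("one mini-batch:      ", batch_grad)
print("mean of mini-batches:", mean_batch_grad)
```

Each mini-batch gradient here costs roughly `batch_size / n_samples` of the full computation, and its deviation from the full gradient is the source of the randomness mentioned above.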

- **What are the methods that extend SGD?**
     
    - **SGD with Nesterov's Accelerated Momentum**. Maintain an exponentially decaying running average of past steps, then **update with the average instead of the current descent direction**.
        - The intuition is that in directions where convergence is smooth, the average will have a large value, while in directions where the estimate swings, the positive and negative swings cancel out in the average (see the comment below about the condition number of the Hessian matrix). This **effectively mitigates the zigzag behavior around the target minimum**.
        - *The difference between Nesterov's method and traditional momentum is that it changes the order of operations*: Nesterov first takes a step in the previous momentum direction, evaluates the gradient there, and then applies the correction. Nesterov showed that this accelerates convergence. The intuition is that the momentum vector generally points in the right direction (i.e., toward the optimum), so the gradient measured a bit farther in that direction is slightly more accurate than the gradient at the original position.

    \begin{align}
    \Delta \textbf{W}^{(k)} &= \mu \Delta \textbf{W}^{(k-1)} - \eta \nabla_{\textbf{W}} L(\textbf{W}^{(k-1)} + \mu \Delta \textbf{W}^{(k-1)})\\
    \textbf{W}^{(k)} &= \textbf{W}^{(k-1)} + \Delta \textbf{W}^{(k)}
    \end{align}
     
    - **RMSProp**: **scale the learning rate by an estimate of the mean-squared derivative**: 
    
    \begin{align}
    E[(\partial_w D)^2]_k &= \gamma E[(\partial_w D)^2]_{k-1} + (1-\gamma)(\partial_w D)^2_k\\
    w_{k+1} &= w_k - \frac{\eta}{\sqrt{E[(\partial_w D)^2]_k+\epsilon}}(\partial_w D)_k
    \end{align}
     
     See typical parameters in the implementation below.
     
     A similar optimization scheme to RMSProp is **AdaGrad**, where the scaling term is simply the running *sum* of past squared derivatives rather than a decaying average. Because this sum only grows, the effective learning rate often decays too fast, usually before the global minimum is reached, and the loss function stops decreasing early (see visualization below). Thus AdaGrad works best when the loss function is close to quadratic, which is generally not the case for deep neural networks.
     
    - **Adam**, which stands for adaptive moment estimation, **combines the ideas of Momentum optimization and RMSProp**. 
        - See typical parameters in the implementation below, which are also the ones recommended by the paper. 
        - Usually **Adam is the go-to method**, but other learning-rate methods can still be superior for the problem at hand.
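
The update rules for the methods listed above can be sketched in a few lines of NumPy. This is an illustrative sketch, not a reference implementation: the toy quadratic loss, the shared learning rate of 0.01, the step count, and all function names are assumptions. The decay constants (`mu=0.9`, `gamma=0.9`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`) are commonly used values.

```python
import numpy as np

# Toy ill-conditioned quadratic L(w) = 0.5 * w^T A w (assumed for illustration).
A = np.diag([1.0, 100.0])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


def nesterov_step(w, delta, lr, mu=0.9):
    # The gradient is evaluated at the look-ahead point w + mu * delta.
    delta = mu * delta - lr * grad(w + mu * delta)
    return w + delta, delta


def adagrad_step(w, sq_sum, lr, eps=1e-8):
    g = grad(w)
    sq_sum = sq_sum + g**2  # running SUM of squared gradients (never shrinks)
    return w - lr * g / np.sqrt(sq_sum + eps), sq_sum


def rmsprop_step(w, sq_avg, lr, gamma=0.9, eps=1e-8):
    g = grad(w)
    sq_avg = gamma * sq_avg + (1.0 - gamma) * g**2  # decaying AVERAGE instead
    return w - lr * g / np.sqrt(sq_avg + eps), sq_avg


def adam_step(w, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad(w)
    m = beta1 * m + (1.0 - beta1) * g      # momentum-style first moment
    v = beta2 * v + (1.0 - beta2) * g**2   # RMSProp-style second moment
    m_hat = m / (1.0 - beta1**t)           # bias correction for zero init
    v_hat = v / (1.0 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v


def run_all(n_steps=500, lr=0.01):
    w0 = np.array([1.0, 1.0])
    w_nag, delta = w0.copy(), np.zeros(2)
    w_ada, sq_sum = w0.copy(), np.zeros(2)
    w_rms, sq_avg = w0.copy(), np.zeros(2)
    w_adam, m, v = w0.copy(), np.zeros(2), np.zeros(2)
    for t in range(1, n_steps + 1):
        w_nag, delta = nesterov_step(w_nag, delta, lr)
        w_ada, sq_sum = adagrad_step(w_ada, sq_sum, lr)
        w_rms, sq_avg = rmsprop_step(w_rms, sq_avg, lr)
        w_adam, m, v = adam_step(w_adam, m, v, t, lr)
    return {
        "nesterov": loss(w_nag),
        "adagrad": loss(w_ada),
        "rmsprop": loss(w_rms),
        "adam": loss(w_adam),
    }


final_losses = run_all()
for name, value in final_losses.items():
    print(f"{name:>8s}: final loss = {value:.3e}")
```

With the same base learning rate, the AdaGrad run ends with a much larger loss than the RMSProp run on this toy problem, which illustrates how the ever-growing sum shrinks the effective step size. The comparison is only indicative, since in practice each method would be tuned separately.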
     
 - **Why do we need to do these?** The problems these methods try to solve are: (a) getting stuck in a narrow local minimum; (b) zigzag steps caused by a high condition number of the parameter covariance or Hessian matrix (i.e., a step can easily be too big in one direction and too small in the others), without having to compute and normalize by the Hessian matrix.
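
To make point (b) concrete, here is a small sketch of plain gradient descent on a quadratic whose Hessian has condition number 100. The Hessian `H`, the learning rate 0.019, and the starting point are hypothetical values chosen so that the step is almost too large in the steep direction and far too small in the flat one.

```python
import numpy as np

# Hessian of the toy loss L(w) = 0.5 * w^T H w (assumed for illustration).
H = np.diag([1.0, 100.0])
cond = np.linalg.cond(H)  # ratio of largest to smallest singular value

lr = 0.019  # just below the stability limit 2 / 100 of the steep direction
w = np.array([1.0, 1.0])
path = [w.copy()]
for _ in range(50):
    w = w - lr * (H @ w)  # plain gradient descent step
    path.append(w.copy())
path = np.array(path)

# Count how often each coordinate changes sign along the path.
sign_flips_flat = int(np.sum(np.diff(np.sign(path[:, 0])) != 0))
sign_flips_steep = int(np.sum(np.diff(np.sign(path[:, 1])) != 0))

print("condition number:", cond)
print("sign flips (flat direction): ", sign_flips_flat)
print("sign flips (steep direction):", sign_flips_steep)
print("remaining distance in flat direction:", path[-1, 0])
```

In this setup the steep coordinate is multiplied by about -0.9 at every step, so it overshoots and changes sign each time, while the flat coordinate shrinks by only about 2% per step. A single learning rate cannot suit both directions, which is the gap that momentum and per-parameter scaling try to close.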
 
**Note**: these methods are **typically concerned only with enhancing first-order derivative methods**. This is because deep neural networks typically have at least tens of thousands of parameters (often far more), so the Hessian is expensive or slow to compute and may not even fit in memory.
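
A rough back-of-the-envelope sketch of the memory argument: a dense Hessian has one entry per pair of parameters, so its size grows quadratically. The helper names and the assumption of 4 bytes per entry (32-bit floats) are illustrative.

```python
def hessian_entries(n_params):
    """Number of entries in a dense n_params x n_params Hessian."""
    return n_params ** 2


def memory_gib(n_entries, bytes_per_entry=4):
    """Memory in GiB, assuming 32-bit floats by default."""
    return n_entries * bytes_per_entry / 2**30


for n_params in (10_000, 1_000_000):
    grad_gib = memory_gib(n_params)
    hess_gib = memory_gib(hessian_entries(n_params))
    print(f"{n_params:>9,d} params: gradient {grad_gib:.6f} GiB, "
          f"dense Hessian {hess_gib:,.2f} GiB")
```

Under these assumptions, the gradient stays small while the dense Hessian grows from a fraction of a GiB at ten thousand parameters to thousands of GiB at one million.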