<a name="cell-TOC-opt"></a>

### 8. [Optimizers: Newton's Method from various pragmatic perspectives](#cell-opt-fx-)

1. [Altering Newton's Methods](#cell-opt-fx-3)
2. [Newton-Like Methods](#cell-opt-fx-newtonlike)
3. [Ensuring Monotonic Iteration](#cell-opt-fx-newton-like-ascent)
4. [Quasi-Newton Methods](#cell-opt-fx-quasi-newton)
5. [Gradient Methods](#cell-opt-fx-common-optimization-algorithms)
6. [Hessian Diagonal Alternatives](#cell-opt-fx-common-optimization-algorithms2)
7. [Nelder-Mead](#cell-opt-fx-nelder-mead)


<!--
1. ~Combinatorial (Discrete) Optimzation~
    - ~Simulated Annealing~
2. ~Constrained optimization~
    - ~Expectation-Maximization~
    - ~Interior and Exterior Point Algorithms~
-->


<a name="cell-opt-fx-3"></a>

## 8.1 Altering Newton's Methods ([Return to TOC](#cell-TOC-opt)) 

---

***Iterative methods*** which optimize $g(x^{(t)})$ are ***fixed point iteration methods*** and take the form $x^{(t+1)} = x^{(t)} + \alpha h(x^{(t)})$ for $\alpha \neq 0$. E.g., for ***positive definite Hessian*** $H_{g(x)}$ examples include

- ***Newton's method*** $x^{(t+1)} = x^{(t)} - \alpha H_{g(x)}^{-1}(x^{(t)})\nabla_x g(x^{(t)})$ with ***learning rate*** $\alpha$
  > which converges if $||\alpha H_{g(x)}^{-1}(x^{(t)})H_{g(x)}(x^*)||=\lambda_\max < 1$, which is true for constant $H_{g(x)}$ and $0<\alpha<1$ since then $\left|\left| \alpha H_{g(x)}^{-1}(x^{(t)}) H_{g(x)}(x^*) \right|\right| = ||\alpha I||<1$

- ***Gradient Descent*** $x^{(t+1)} = x^{(t)} - \alpha I \nabla_x g(x^{(t)})$
  > which (following from the above) converges if $||\alpha H_{g(x)}(x^*)||=\lambda_\max < 1$; and, the smaller $\left|\left| I - \alpha H_{g(x)}(x^*) \right|\right| < 1$ the faster the convergence will be

- or most generally $x^{(t+1)} = x^{(t)} - \alpha [M^{(t)}]^{-1}\nabla_x g(x^{(t)})$ 
  > which converges if $||\alpha [M^{(t)}]^{-1}H_{g(x)}(x^*)||=\lambda_\max < 1$; and, the smaller $\left|\left| I - \alpha [M^{(t)}]^{-1}H_{g(x)}(x^*) \right|\right| < 1$ the faster the convergence will be

The convergence statements made here were shown and considered more closely in `STA410_W24_Week7_Extra_NewtonVariantsConvergence.ipynb`.
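As a rough numerical illustration of the general update $x^{(t+1)} = x^{(t)} - \alpha [M]^{-1}\nabla_x g(x^{(t)})$, the sketch below compares $M = H_{g(x)}$ (Newton's method) with $M = I$ (gradient descent) on a small quadratic. The matrix `H`, the vector `b`, the value `alpha = 0.25`, and the helper names are all assumptions made up for this example; for a quadratic the error is multiplied by $I - \alpha M^{-1}H$ at every step, so the norm of that matrix is reported as well.

```python
import numpy as np

# Hypothetical quadratic g(x) = 0.5 x^T H x - b^T x with positive definite H,
# so the gradient is H x - b and the minimizer x* solves H x* = b.
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = np.linalg.solve(H, b)

def grad(x):
    return H @ x - b

def iterate(M, alpha, steps=50):
    """Run x <- x - alpha * M^{-1} grad(x) starting from the origin."""
    x = np.zeros(2)
    for _ in range(steps):
        x = x - alpha * np.linalg.solve(M, grad(x))
    return x

def contraction_factor(M, alpha):
    """Spectral norm of I - alpha * M^{-1} H, the per-step error multiplier."""
    return np.linalg.norm(np.eye(2) - alpha * np.linalg.solve(M, H), 2)

alpha = 0.25
newton_error = np.linalg.norm(iterate(H, alpha) - x_star)
gd_error = np.linalg.norm(iterate(np.eye(2), alpha) - x_star)
newton_factor = contraction_factor(H, alpha)       # equals 1 - alpha when M = H
gd_factor = contraction_factor(np.eye(2), alpha)   # depends on the eigenvalues of H
print(newton_factor, gd_factor, newton_error, gd_error)
```

Both factors are below one for this particular choice, so both iterations converge; which one is faster depends on $\alpha$ and on the eigenvalues of `H`.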


<a name="cell-opt-fx-newton-like-ascent"></a>

## 8.3 Ensuring Monotonic Iteration ([Return to TOC](#cell-TOC-opt)) 

---

***Newton-like updates*** $x^{(t+1)} = x^{(t)} - \alpha M^{-1}\nabla_x g(x^{(t)})$ move towards $x^*$ such that $\nabla_x g(x^*) = 0$, but only in a monotonically decreasing (increasing) manner if $M$ is ***positive*** (***negative***) ***definite*** and the ***learning rate*** $\alpha > 0$ is sufficiently small.

Since 
- $M+cI$ will (have all positive ***eigenvalues*** and) be positive definite if $c>0$ is large enough   
- $M+cI$ will (have all negative ***eigenvalues*** and) be negative definite if $c<0$ is sufficiently negative 

  > and these forms of definiteness could also be achieved with diagonal $D_c$ with $[D_c]_{ii} = c_i$ which minimizes alteration of the original $M$ through the so-called [***modified Cholesky decomposition***](https://nhigham.com/2020/12/22/what-is-a-modified-cholesky-factorization/)

***modified Newton methods*** guarantee monotonic convergence by instead updating

$$x^{(t+1)} = x^{(t)} - \alpha \underset{\text{or } M+D_c}{(M+cI)}^{-1}\nabla_x g(x^{(t)})$$

This guarantees monotonic $g(x^{(t+1)})<g(x^{(t)})$ for positive definite $M+cI$ since then $[\nabla_x g(x^{(t)})]^T(M+cI)^{-1}\nabla_x g(x^{(t)}) > 0$, i.e., $-(M+cI)^{-1}\nabla_x g(x^{(t)})$ is a descent direction, and so 
$$g\left(x^{(t)} - \alpha {(M+cI)}^{-1}\nabla_x g(x^{(t)})\right)$$

must be less than $g(x^{(t)})$ for some small $\alpha>0$.

> ***Gradient descent*** is the special case $M = I$: since $I$ is positive definite, moving in the negative direction of the gradient (the direction of steepest descent of $g$ at $x^{(t)}$) with some sufficiently small step size factor $\alpha^{(t)} > 0$ gives $g(x^{(t+1)}) < g(x^{(t)})$.

After ensuring the necessary ***positive (negative) definiteness***, the ***step size factor*** $\alpha$ can be found by the ***line search method*** of ***backtracking***: if monotonic convergence is violated for a specific $\alpha^{(t)}$ then it can be made smaller.
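The combination of an eigenvalue shift $M + cI$ and backtracking on $\alpha$ can be sketched as follows for minimization. The test function, the starting point, the eigenvalue floor `floor=1.0`, and the halving rule are assumptions chosen for illustration only; a modified Cholesky decomposition would be the more economical way to pick the shift in practice.

```python
import numpy as np

def g(x):
    # Hypothetical non-convex test function with minima at (+1, 0) and (-1, 0).
    return x[0]**4 - 2.0 * x[0]**2 + x[1]**2

def grad(x):
    return np.array([4.0 * x[0]**3 - 4.0 * x[0], 2.0 * x[1]])

def hess(x):
    return np.array([[12.0 * x[0]**2 - 4.0, 0.0],
                     [0.0, 2.0]])

def modified_newton_step(x, floor=1.0):
    """One update x - alpha (M + cI)^{-1} grad(x) with backtracking on alpha."""
    M = hess(x)
    smallest = np.linalg.eigvalsh(M).min()
    # Shift only when needed so every eigenvalue of M + cI is at least `floor`.
    c = max(0.0, floor - smallest)
    direction = -np.linalg.solve(M + c * np.eye(len(x)), grad(x))
    alpha = 1.0
    # Backtracking: halve alpha until the objective strictly decreases.
    while g(x + alpha * direction) >= g(x) and alpha > 1e-12:
        alpha *= 0.5
    return x + alpha * direction

x = np.array([0.3, 1.0])   # the Hessian is indefinite here since 12 * 0.09 - 4 < 0
values = [g(x)]
for _ in range(30):
    if np.linalg.norm(grad(x)) < 1e-8:
        break
    x = modified_newton_step(x)
    values.append(g(x))
print(x, values[-1])
```

The recorded `values` should never increase from one iteration to the next, which is the monotonic behaviour the shift and the backtracking are meant to secure.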

<a name="cell-opt-fx-quasi-newton"></a>

## 8.4 Quasi-Newton Methods ([Return to TOC](#cell-TOC-opt)) 

---

As noted previously, the naive generalization of the ***secant method*** to scalar valued multivariate functions as a discrete approximation to ***Newton's method*** does not offer any computational incentives over simply computing the Hessian itself.  However, for 

- sequential iterations $\quad x^{(t)}$ and $x^{(t-1)}$
- sequential gradients  $\quad \nabla_{x}g(x^{(t)})$ and $\nabla_{x}g(x^{(t-1)})$

an $M^{(t)}$ satisfying the so-called ***secant condition***

$$ \underbrace{\nabla_{x}g(x^{(t)}) - \nabla_{x}g(x^{(t-1)})}_{\Delta^{(t)}_{\nabla_{x}g}} = M^{(t)}\underbrace{(x^{(t)} - x^{(t-1)})}_{\Delta_x^{(t)}}$$

provides the discrete ***secant*** approximation of the Hessian where

$$x^{(t+1)} = x^{(t)} - \left[M^{(t)}\right]^{-1}\nabla_x g(x^{(t)}) \quad \text{ replaces } \quad x^{(t+1)} = x^{(t)} - \left[H_{g(x)}(x^{(t)})\right]^{-1}\nabla_x g(x^{(t)})$$

and where $M^{(t)}$ which satisfies the ***secant condition*** can be derived iteratively on the basis of $M^{(t-1)}$ which itself already satisfies the ***secant condition***. This is shown in the following material, but understanding that there are available calculations here is more important than understanding the specific computations themselves.

> This ***secant condition*** alone does not provide an efficient calculation of $M^{(t)}$, but the (computationally inexpensive) ***rank-one update*** [derived here](https://personal.math.ubc.ca/~loew/m604/web-ho/sr1.pdf)
>
> $$M^{(t)} = M^{(t-1)} + \underbrace{\frac{v^{(t)}[v^{(t)}]^T}{[v^{(t)}]^T\Delta_x^{(t)}c}}_{\text{rank-one update}} \quad \text{ where } \quad v^{(t)} = \left(\Delta^{(t)}_{\nabla_{x}g} - M^{(t-1)}\Delta_x^{(t)}\right)$$
>
> results in $M^{(t)}$ which satisfies the ***secant condition***, subject to the following caveats.
> - If the denominator $[v^{(t)}]^T\Delta_x^{(t)} \approx 0$, the update might need to be skipped by setting $M^{(t)} = M^{(t-1)}$.
> - If the denominator $[v^{(t)}]^T\Delta_x^{(t)}<0$ $(>0)$ and $M^{(t-1)}$ is ***negative*** $($***positive***$)$ ***definite***, then $M^{(t)}$ will be as well.
>   - Thus, this update only guarantees ***hereditary positive $($negative$)$ definiteness*** under the above conditions; however,
>   - scaling $[v^{(t)}]^T\Delta_x^{(t)}$ in the denominator by some large factor $c$ so the update contributes less to $M^{(t)}$ can maintain the definiteness state.
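
The rank-one update can be checked numerically with a short sketch. The function name `sr1_update`, the relative skip tolerance `tol`, and the quadratic test problem with constant Hessian `H` are assumptions for illustration; with $c = 1$ the secant condition should hold after every update that is not skipped.

```python
import numpy as np

def sr1_update(M_prev, dx, dgrad, c=1.0, tol=1e-8):
    """Rank-one update of M; returns M_prev unchanged if the denominator is ~0."""
    v = dgrad - M_prev @ dx
    denom = v @ dx
    # Skip rule: a relative test for denom ~ 0 (the tolerance is an assumption).
    if abs(denom) <= tol * np.linalg.norm(v) * np.linalg.norm(dx):
        return M_prev
    return M_prev + np.outer(v, v) / (denom * c)

# Hypothetical quadratic with constant Hessian H, so gradient differences are H @ dx.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
rng = np.random.default_rng(0)
M = np.eye(3)
secant_gaps = []
for _ in range(3):
    dx = rng.normal(size=3)        # stand-in for x^(t) - x^(t-1)
    dgrad = H @ dx                 # stand-in for the gradient difference
    M = sr1_update(M, dx, dgrad)
    secant_gaps.append(np.linalg.norm(M @ dx - dgrad))
print(secant_gaps)
print(np.round(M, 6))
```

On this quadratic the approximation `M` should also coincide with `H` after three linearly independent steps, provided no update was skipped; that behaviour is specific to quadratics and is not something to expect in general.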

> A ***rank-two version*** of the above which both satisfies the ***secant condition*** and confers ***hereditary definiteness*** is the so-called ***BFGS*** update (named after its authors). The (*rank-two*) ***BFGS*** update is just the (*rank-two*) ***Broyden class update*** 
> 
> $$\begin{align*} M^{(t)} = {} & M^{(t-1)} - \frac{M^{(t-1)}\Delta_x^{(t)} [M^{(t-1)}\Delta_x^{(t)}]^T}{[\Delta_x^{(t)}]^TM^{(t-1)}\Delta_x^{(t)}} + \frac{\Delta^{(t)}_{\nabla_{x}g}[\Delta^{(t)}_{\nabla_{x}g}]^T}{[\Delta_x^{(t)}]^T\Delta^{(t)}_{\nabla_{x}g}} + \delta^{(t)}\left([\Delta_x^{(t)}]^TM^{(t-1)}\Delta_x^{(t)} \right)d^{(t)}[d^{(t)}]^T\\
> {} & \text{where } d^{(t)} = \frac{\Delta^{(t)}_{\nabla_{x}g}}{[\Delta_x^{(t)}]^T\Delta^{(t)}_{\nabla_{x}g}} - \frac{M^{(t-1)}\Delta_x^{(t)}}{[\Delta_x^{(t)}]^TM^{(t-1)}\Delta_x^{(t)}}
> \end{align*}$$
>
> with $\delta^{(t)}=0$. 
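
A minimal sketch of the ***BFGS*** form of the update (the Broyden class with $\delta^{(t)}=0$) is given below. The function name `bfgs_update` and the quadratic test problem are assumptions for illustration, and this direct form is used only for readability; it checks the ***secant condition*** and the ***hereditary positive definiteness*** numerically when $[\Delta_x^{(t)}]^T\Delta^{(t)}_{\nabla_{x}g} > 0$.

```python
import numpy as np

def bfgs_update(M_prev, dx, dgrad):
    """Broyden class update with delta = 0 (BFGS) for the Hessian approximation."""
    Mdx = M_prev @ dx
    return (M_prev
            - np.outer(Mdx, Mdx) / (dx @ Mdx)
            + np.outer(dgrad, dgrad) / (dx @ dgrad))

# Hypothetical positive definite Hessian, so dx @ dgrad = dx^T H dx > 0 always.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
rng = np.random.default_rng(0)
M = np.eye(3)                      # positive definite starting value
secant_gaps, smallest_eigenvalues = [], []
for _ in range(5):
    dx = rng.normal(size=3)        # stand-in for x^(t) - x^(t-1)
    dgrad = H @ dx                 # stand-in for the gradient difference
    M = bfgs_update(M, dx, dgrad)
    secant_gaps.append(np.linalg.norm(M @ dx - dgrad))
    smallest_eigenvalues.append(np.linalg.eigvalsh(M).min())
print(secant_gaps)
print(smallest_eigenvalues)
```

Every entry of `secant_gaps` should be at rounding level and every entry of `smallest_eigenvalues` should stay positive.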

> A few points to note regarding ***quasi-Newton methods*** are:
> - many authors find the ***rank-one update*** to have superior performance to ***Broyden class updates***, including ***BFGS***
> - the above ***BFGS*** update is numerically unstable, and is better approached through a ***Cholesky decomposition*** 
> - ***quasi-Newton methods*** are very sensitive to the scale of the $x_i$ comprising $x$, with performance tending to be better for similarly scaled $x_i$
> - ***quasi-Newton methods*** are very sensitive to the initial choice $M^{(0)}$ though for similarly scaled $x_i$ starting with $I$ (for minimization) or $-I$ (for maximization) is usually sufficient; however, in maximum likelihood estimation contexts starting with the negative ***Fisher information*** $-I(\theta^{(0)})$ is usually a better choice
> - the ***observed information*** (i.e., the negative Hessian) provides a point estimate of the ***precision*** (i.e., inverse covariance) structure of $p(\hat \theta) \approx N(\theta, \Sigma^{-1} = -H_{l(\theta)}(\hat \theta))$, but quasi-Newton methods  (intentionally) do not provide close estimates of the Hessian; so, for statistical purposes, re-estimating the ***observed information*** upon convergence is an obligatory final step, e.g., with the ***central difference approximation***
>
>   $$ \widehat{[H_{l(\theta)}(\theta^{(t)})]}_{ij} = \frac{[\nabla_\theta l(\theta^{(t)} + h_{ij}e_j)]_i - [\nabla_\theta l(\theta^{(t)} - h_{ij}e_j)]_i}{2h_{ij}} $$
>   perhaps with $h_{ij} = h = \epsilon^{\frac{1}{3}}$ where $\epsilon$ is the available computer precision. 
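
The ***central difference approximation*** of the Hessian from gradient evaluations can be sketched as below, using the step $h = \epsilon^{\frac{1}{3}}$. The logistic regression log-likelihood, the simulated data, the evaluation point `theta_hat` (an arbitrary point standing in for a converged estimate), and the final symmetrization are assumptions made for this illustration.

```python
import numpy as np

def central_difference_hessian(grad, theta, h=None):
    """Approximate the Hessian column by column from central gradient differences."""
    theta = np.asarray(theta, dtype=float)
    if h is None:
        h = np.finfo(float).eps ** (1.0 / 3.0)
    p = theta.size
    H_hat = np.empty((p, p))
    for j in range(p):
        step = np.zeros(p)
        step[j] = h                               # h * e_j
        H_hat[:, j] = (grad(theta + step) - grad(theta - step)) / (2.0 * h)
    return 0.5 * (H_hat + H_hat.T)                # symmetrize (an added convenience)

# Hypothetical logistic regression log-likelihood with simulated data.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = rng.integers(0, 2, size=50).astype(float)

def score(theta):
    """Gradient of the log-likelihood: X^T (y - p)."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (y - p)

theta_hat = np.array([0.5, -0.25])                # stand-in for a converged estimate
H_hat = central_difference_hessian(score, theta_hat)
observed_information = -H_hat

# Analytic Hessian -X^T W X with W = diag(p (1 - p)), for comparison only.
p_hat = 1.0 / (1.0 + np.exp(-X @ theta_hat))
H_exact = -(X.T * (p_hat * (1.0 - p_hat))) @ X
print(np.abs(H_hat - H_exact).max())
```

Inverting `observed_information` would then give the usual estimate of the covariance of $\hat\theta$.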

In [1]:
import numpy as np
from scipy.optimize import fmin_l_bfgs_b
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.legacy import SGD, Adagrad, RMSprop

In [2]:
np.random.seed(3)
alpha,K = 0.01,10
d,q1,q2 = 3,2,3
# these are the data
x = np.random.normal(size=(d,1))
y = np.random.normal(size=(d,1))
# these are all the parameters
A1 = np.random.normal(size=(q1,d))
b1 = np.random.normal(size=(q1,1))
A2 = np.random.normal(size=(q2,q1))
b2 = np.random.normal(size=(q2,1))

# the parameters are passed into the function as a vector
# https://stackoverflow.com/questions/8672005/correct-usage-of-fmin-l-bfgs-b-for-fitting-model-parameters
def objective(parameters):
    # parameters get unpacked into their model form
    A1 = parameters[0:(q1*d)].reshape(q1,d)
    b1 = parameters[(q1*d):(q1*d+q1)].reshape(q1,1)
    A2 = parameters[(q1*d+q1):(q1*d+q1+q2*q1)].reshape(q2,q1)
    b2 = parameters[(q1*d+q1+q2*q1):].reshape(q2,1)
    # this is the model form
    x1 = A1@x+b1
    x1 = x1*(x1>0)
    x2 = A2@x1+b2
    # here's the residual from the prediction of this model
    epsilon = y-x2
    # and here's the loss function
    return epsilon.T.dot(epsilon)[0,0]**0.5

# https://stackoverflow.com/questions/8672005/correct-usage-of-fmin-l-bfgs-b-for-fitting-model-parameters
fmin_l_bfgs_b(func=objective, x0=np.ones(q1*d+q1+q2*q1+q2), approx_grad=True, m=4)
# m=4 sets the number of stored correction pairs in the limited-memory Hessian approximation (m=1 or m=2 does not work here)

(array([-0.96574895,  0.52026429,  0.89394714, -0.96574895,  0.52026429,
         0.89394714, -0.09902554, -0.09902554, -0.2844477 , -0.2844477 ,
         0.11431975,  0.11431975,  0.0948675 ,  0.0948675 , -1.86349271,
        -0.2773882 , -0.35475898]),
 4.94918505807411e-09,
 {'grad': array([0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.        , 0.        , 0.        , 0.04897387,
         0.7243671 , 0.68328701]),
  'task': 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH',
  'funcalls': 972,
  'nit': 24,
  'warnflag': 0})

In [3]:
# Indeed this converges for L-BFGS with m=4 stored correction pairs
fit_parameters = fmin_l_bfgs_b(func=objective, x0=np.ones(q1*d+q1+q2*q1+q2), approx_grad=True, m=4)[0]
A1 = fit_parameters[0:(q1*d)].reshape(q1,d)
b1 = fit_parameters[(q1*d):(q1*d+q1)].reshape(q1,1)
A2 = fit_parameters[(q1*d+q1):(q1*d+q1+q2*q1)].reshape(q2,q1)
b2 = fit_parameters[(q1*d+q1+q2*q1):].reshape(q2,1)
x1 = A1@x+b1
x1 = x1*(x1>0)
x2 = A2@x1+b2
np.c_[y,x2]

array([[-1.8634927 , -1.86349271],
       [-0.2773882 , -0.2773882 ],
       [-0.35475898, -0.35475898]])

In [4]:
# These popular iterative optimization techniques are readily
# available in modern computational frameworks like tensorflow

np.random.seed(3)
alpha,K = 0.01,10
d,q1,q2 = 3,2,3
x = tf.constant(np.random.normal(size=(d,1)))
y = tf.constant(np.random.normal(size=(d,1)))
A1 = tf.Variable(np.random.normal(size=(q1,d)))
b1 = tf.Variable(np.random.normal(size=(q1,1)))
A2 = tf.Variable(np.random.normal(size=(q2,q1)))
b2 = tf.Variable(np.random.normal(size=(q2,1)))

@tf.function()
def objective():
    x1 = A1@x+b1
    x1 = x1*tf.cast(x1>0, tf.float64)
    x2 = A2@x1+b2
    epsilon = y-x2   
    return tf.tensordot(tf.transpose(epsilon), epsilon, axes=1)**(0.5)

# https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules
alpha_t, epsilon_t = 0.1,0.5
sgd = SGD(learning_rate=alpha_t, momentum=epsilon_t, nesterov=True)
adagrad = Adagrad(learning_rate=alpha_t)
rho_t = 0.5
rmsprop = RMSprop(learning_rate=alpha_t, rho=rho_t, momentum=epsilon_t)
rho_v_t, rho_s_t = epsilon_t, rho_t
adam = Adam(learning_rate=alpha_t, beta_1=rho_v_t, beta_2=rho_s_t)

steps = 10
for t in range(steps):
  sgd.minimize(objective, var_list=[A1, b1, A2, b2])
  #adagrad.minimize(objective, var_list=[A1, b1, A2, b2])
  #rmsprop.minimize(objective, var_list=[A1, b1, A2, b2])
  #adam.minimize(objective, var_list=[A1, b1, A2, b2])
print(objective())

x1 = A1@x+b1
x1 = x1*tf.cast(x1>0, tf.float64)
x2 = A2@x1+b2
x2,y



tf.Tensor([[1.18176167]], shape=(1, 1), dtype=float64)


(<tf.Tensor: shape=(3, 1), dtype=float64, numpy=
 array([[-1.52092405],
        [ 0.37256329],
        [ 0.57085996]])>,
 <tf.Tensor: shape=(3, 1), dtype=float64, numpy=
 array([[-1.8634927 ],
        [-0.2773882 ],
        [-0.35475898]])>)

<a name="cell-opt-fx-common-optimization-algorithms"></a>

## 8.5 Gradient Methods ([Return to TOC](#cell-TOC-opt)) 

---

***Gradient methods*** dispense with focussing on the second order derivatives of the ***Hessian*** and focus on improving computational performance using only the first order derivatives of the ***gradient*** $\nabla_\theta g(\theta)$. This makes a lot of sense in the context of ***non-convex*** functions where the ***Hessian*** $H_{g(\theta)}$ cannot reliably support convergence towards a global optimum. And, since many modern optimization contexts involve ***non-convex*** functions, ***gradient methods*** have become ubiquitous features of the modern optimization toolkit. 

> Even for ***convex*** $g(\theta)$ the compelling argument for ***gradient methods*** over ***Newton's method*** is that they may be more computationally efficient overall.  If the computation of the ***Hessian*** is expensive relative to the computations of the ***gradient*** (as is generally the case), then skipping the ***Hessian*** computation means more update steps based on cheaper ***gradient*** computations can be made, and this may end up being more computationally efficient overall. 

The following ***gradient methods*** are frequently encountered in modern optimization contexts.

- ***Stochastic Gradient Descent*** $\theta_t = \theta_{t-1} - \alpha_t I \widehat{\nabla_\theta g(\theta_{t-1})}$
  - replaces the Hessian $H_{g(\theta)}$ with the identity matrix $I$
  - uses a step size factor $\alpha_t$ which may evolve according to a prescribed schedule
  - and makes steps using gradients estimated from small ***batches*** (e.g., $m=32$ observations) rather than all of the available data

    $$\frac{1}{m} \sum_{i=1}^m \nabla_\theta g_{x_i}(\theta_{t-1}) = \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1}) \quad \text{ estimates } \quad E_x\left[ \nabla_\theta g_x(\theta_{t-1}) \right]$$
<!-- = \nabla_\theta g(\theta_{t-1}) -->

  Sequences of ***batches*** constructed from the full data comprise one ***epoch***, and a sequence of ***stochastic gradient descent*** steps are often constructed from multiple ***epochs***, i.e., many passes through many batches of data.

  > ***Stochastic gradient descent*** drastically reduces computation because roughly accurate estimates of the gradients can be easily computed without having to use all the data.
  >
  > More than that, however, estimating gradients from batches is empirically observed to outperform calculating gradients based on the full data set. 
  > - ***Stochastic gradient descent*** introduces noise into the iterative trajectory which increases the exploration potential of the $\theta_t$ sequence, making it less likely to converge (***overfit***) on a suboptimal local minimum; and, even within an attractive region of a local minimum the $\theta_t$ sequence is ***regularized*** in the sense that it will never actually achieve the local minimum value since the estimated gradient will be different for each ***batch*** of data. 

  

- ***Momentum*** replaces the gradient with a running average using prescribed ***learning rate*** $\alpha_t$ and ***historic decay*** weighting schedules $\epsilon_t$ as

  $$\begin{align*}
  v_t = {} & \epsilon_t v_{t-1} + \alpha_t \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})\\
  \theta_t = {} & \theta_{t-1} - v_t
  \end{align*}$$

  - ***Nesterov Momentum*** is a slight variant based on the "look ahead" update $$\begin{align*}
  v_t = {} & \epsilon_t v_{t-1} + \alpha_t \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1} - \epsilon_{t} v_{t-1})
  \end{align*}$$

    though this typically has equivalent performance to the original ***momentum*** specification.

  The idea of ***momentum*** is to use the ***gradient*** history information instead of the ***Hessian*** at each update step to predict the shape of the function being optimized.  The assumption underlying ***momentum*** is that previous ***gradients*** are good estimates of future ***gradients***. 
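
The batch, epoch, and momentum mechanics described above can be sketched on a small least squares problem. The simulated data, the constant schedules `alpha=0.05` and `eps=0.5`, and the function names are assumptions for illustration. Because $v_t$ is subtracted from $\theta_{t-1}$ in the update, the Nesterov look-ahead point is taken here as $\theta_{t-1} - \epsilon_t v_{t-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 256, 3, 32                       # observations, parameters, batch size
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)

def batch_gradient(theta, idx):
    """Gradient of 0.5 * mean((X theta - y)^2) computed on one batch only."""
    residual = X[idx] @ theta - y[idx]
    return X[idx].T @ residual / len(idx)

def sgd_momentum(alpha=0.05, eps=0.5, epochs=30, nesterov=False):
    theta, v = np.zeros(p), np.zeros(p)
    for _ in range(epochs):
        order = rng.permutation(n)         # one epoch is one pass over all batches
        for start in range(0, n, m):
            idx = order[start:start + m]
            # Nesterov evaluates the gradient at the look-ahead point.
            point = theta - eps * v if nesterov else theta
            v = eps * v + alpha * batch_gradient(point, idx)
            theta = theta - v
    return theta

theta_sgd = sgd_momentum(eps=0.0)          # eps = 0 reduces to plain minibatch SGD
theta_momentum = sgd_momentum()
theta_nesterov = sgd_momentum(nesterov=True)
print(theta_sgd, theta_momentum, theta_nesterov)
```

All three estimates should land close to `theta_true`, though never exactly on the least squares solution since each batch yields a slightly different gradient estimate.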

> ***Momentum*** takes the form
>
> $$v_t = \sum_{j=1}^{t} \underbrace{\left[\prod_{k=j}^{t-1} \epsilon_{k+1} \right]}_{=1 \text{ if } j=t \text{ (empty product)}} \alpha_j \underbrace{\nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{j-1})}_{\widehat{\nabla_\theta g(\theta_{j-1})}}$$
>
> for which there exists some vector $d_t$ that accounts for the difference between $v_t$ and $\epsilon_t v_{t-1}$ such that
>
> $$ v_t = \alpha_t \underbrace{d_t \odot \overbrace{\nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})}^{\widehat{\nabla_\theta g(\theta_{t-1})}}}_{\text{element-wise multiplication}}$$
>
> The ***momentum*** update 
> 
> $$\theta_t = \theta_{t-1} - \alpha_t d_t \odot \widehat{\nabla_\theta g(\theta_{t-1})} = \underbrace{\theta_{t-1} - \alpha_t D_t^{-1} \widehat{\nabla_\theta g(\theta_{t-1})}}_{\text{expressed as diagonal matrix } D_t^{-1}}$$ 
>
> can thus be viewed as a diagonal alternative to ***Newton's method*** where the $D_t$ replaces the diagonal elements of the ***Hessian*** $H_{g(\theta)}$ with values based on the decay weighted history of ***gradients*** which force a trajectory that is a decay weighted average of the ***gradient*** history.
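
The equivalence between the recursive and the unrolled form of $v_t$, and the existence of the element-wise factor $d_t$, can be checked numerically. The randomly drawn schedules and the random vectors standing in for the batch gradient estimates are assumptions for illustration, with $v_0 = 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p = 6, 3
# Index 0 is unused so that alphas[t], epsilons[t], grads[t] match the subscript t.
alphas = rng.uniform(0.01, 0.1, size=T + 1)
epsilons = rng.uniform(0.1, 0.9, size=T + 1)
grads = rng.normal(size=(T + 1, p))    # grads[j] stands in for the estimate at theta_{j-1}

# Recursive form: v_t = epsilon_t v_{t-1} + alpha_t * gradient estimate, v_0 = 0.
v = np.zeros(p)
for t in range(1, T + 1):
    v = epsilons[t] * v + alphas[t] * grads[t]

# Unrolled form: each past gradient is weighted by epsilon_{j+1} * ... * epsilon_T.
v_unrolled = np.zeros(p)
for j in range(1, T + 1):
    weight = np.prod(epsilons[j + 1:T + 1])   # empty product equals 1 when j = T
    v_unrolled += weight * alphas[j] * grads[j]

# Element-wise factor d_T such that v_T = alpha_T * d_T * (current gradient estimate).
d = v / (alphas[T] * grads[T])
D_inverse = np.diag(d)                         # the diagonal matrix D_T^{-1}
print(np.abs(v - v_unrolled).max())
```

Note that `d` is only defined when no element of the current gradient estimate is exactly zero, which is the case for these random stand-ins.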

<a name="cell-opt-fx-common-optimization-algorithms2"></a>

## 8.6 Hessian Diagonal Alternatives ([Return to TOC](#cell-TOC-opt)) 

---

***Diagonal approximations*** $H_{g(\theta)} \approx I \odot H_{g(\theta)} = D_\gamma$ which ignore the off-diagonal elements of the Hessian attempt to approximate the second order partial derivatives as 

$$[D_{\gamma}]_{ii} = \gamma_i \approx \frac{\partial^2 g (\theta)}{\partial\theta_i\partial\theta_i} \quad \text{ and } \quad [D_{\gamma}]_{ij} = 0 \approx \frac{\partial^2 g( \theta)}{\partial\theta_i\partial\theta_j} \; \text{ for } i \neq j$$ 

and approximate ***Newton's method*** as 

$$ \theta^{(t+1)} = \theta^{(t)} - D_\gamma^{-1} \nabla_\theta g(\theta^{(t)}) $$

As the example of ***momentum*** above shows, other alternatives for the diagonal of the ***Hessian*** are possible.
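
The diagonal approximation to Newton's method can be sketched on a quadratic whose Hessian has comparatively small off-diagonal elements. The matrix `H`, the vector `b`, and the iteration count are assumptions for illustration; convergence here relies on the diagonal of `H` dominating its off-diagonal entries and is not guaranteed otherwise.

```python
import numpy as np

# Hypothetical quadratic g(theta) = 0.5 theta^T H theta - b^T theta.
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 2.0]])
b = np.array([1.0, 2.0, 3.0])
theta_star = np.linalg.solve(H, b)        # exact minimizer, for comparison only

def grad(theta):
    return H @ theta - b

D_gamma = np.diag(H)                      # keep only the diagonal second derivatives
theta = np.zeros(3)
for _ in range(100):
    # theta <- theta - D_gamma^{-1} grad(theta), applied element-wise.
    theta = theta - grad(theta) / D_gamma
diagonal_newton_error = np.linalg.norm(theta - theta_star)
print(diagonal_newton_error)
```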

- ***Adagrad*** enables coordinate-specific learning rates by weighting the elements of a gradient step inversely to the square root of the squared gradients accumulated along each axis so far as

  $$\begin{align*}
  g_t = {} & \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})\\
  r_t = {} & r_{t-1} + g_t \odot g_t \quad \text{  element-wise product}\\
  \theta_t = {} & \theta_{t-1} - \frac{\alpha_t}{\delta_t+\sqrt{r_t}} \odot g_t \quad \text{  element-wise square root, division, and product}
  \end{align*}$$

  where the square root transforms the sum back to original units, and $\delta_t>0$ ensures division by $0$ is avoided. 
  
  > The effect of the method is to make step sizes smaller along fast-moving axes and larger along slow-moving axes.


- ***RMSprop*** replaces the accumulation in ***Adagrad*** with a decaying running average

  $$\begin{align*}
  r_t = {} & \rho r_{t-1} + (1-\rho) g_t \odot g_t
  \end{align*}$$

  which allows for the application of locally varying coordinate-specific learning rates.

  ***RMSprop*** also admits the subsequent incorporation of ***Momentum*** as $$\begin{align*}
  v_t = {} & \epsilon_t v_{t-1} + \frac{\alpha_t}{\delta_t+\sqrt{r_t}} \odot g_t\\
  \theta_t = {} & \theta_{t-1} - v_t
  \end{align*}$$

  > ***RMSprop*** provides the coordinate-specific learning rates of ***Adagrad*** and can also incorporate a ***momentum*** effect.

- ***Adam*** is a slight variant of ***RMSprop*** which directly incorporates ***momentum*** as another decaying running average

  $$\begin{align*}
  g_t = {} & \nabla_\theta \frac{1}{m} \sum_{i=1}^m g_{x_i}(\theta_{t-1})\\
  v_t = {} & \rho_v v_{t-1} + (1-\rho_v) g_t \quad\quad \times \frac{1}{1-\rho_v^t} \text{ to correct if } v_0 = 0  \\
  r_t = {} & \rho_s r_{t-1} + (1-\rho_s) g_t \odot g_t \quad \quad \times  \frac{1}{1-\rho_s^t} \text{ to correct if } r_0=0 \\
  \theta_t = {} & \theta_{t-1} - \frac{\alpha_t}{\delta_t+\sqrt{r_t}} \odot v_t
  \end{align*}$$

  > The performance of ***Adam*** is often equivalent to that of ***RMSprop*** with ***momentum***. 

> Emtiyaz Khan characterizes these common algorithms as varying degrees of approximation to "Bayesian Learning Rules" in this [presentation](https://slideslive.com/38923183/deep-learning-with-bayesian-principles), starting around slide 60.
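
The ***Adagrad***, ***RMSprop*** (without momentum), and ***Adam*** updates above can be written out directly in `numpy`. This is a minimal sketch using full rather than batch gradients; the badly scaled quadratic, the constant learning rates, the decay constants, and the step count are all assumptions chosen for illustration and are not tuned recommendations.

```python
import numpy as np

def grad(theta):
    # Hypothetical badly scaled quadratic g(theta) = 0.5 * (100 theta_1^2 + theta_2^2).
    return np.array([100.0, 1.0]) * theta

def adagrad(theta, steps, alpha=0.5, delta=1e-8):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        r = r + g * g                                   # accumulate squared gradients
        theta = theta - alpha / (delta + np.sqrt(r)) * g
    return theta

def rmsprop(theta, steps, alpha=0.01, rho=0.9, delta=1e-8):
    r = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        r = rho * r + (1.0 - rho) * g * g               # decaying running average
        theta = theta - alpha / (delta + np.sqrt(r)) * g
    return theta

def adam(theta, steps, alpha=0.01, rho_v=0.9, rho_s=0.999, delta=1e-8):
    v, r = np.zeros_like(theta), np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        v = rho_v * v + (1.0 - rho_v) * g
        r = rho_s * r + (1.0 - rho_s) * g * g
        v_hat = v / (1.0 - rho_v**t)                    # corrections for v_0 = r_0 = 0
        r_hat = r / (1.0 - rho_s**t)
        theta = theta - alpha / (delta + np.sqrt(r_hat)) * v_hat
    return theta

theta_0 = np.array([1.0, 1.0])
theta_adagrad = adagrad(theta_0, steps=1000)
theta_rmsprop = rmsprop(theta_0, steps=1000)
theta_adam = adam(theta_0, steps=1000)
print(theta_adagrad, theta_rmsprop, theta_adam)
```

Despite the factor of 100 between the two curvatures, the per-coordinate scaling lets all three methods treat both coordinates in essentially the same way; with a constant $\alpha_t$, ***RMSprop*** is expected to hover near the minimum rather than settle exactly on it.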

<a name="cell-opt-fx-nelder-mead"></a>

## 8.7 Nelder-Mead ([Return to TOC](#cell-TOC-opt)) 

---

***Newton's method*** is not the only tool in the optimization toolbox.  

For example ***Gauss-Seidel*** was previously encountered and seen to be equivalent to ***coordinate descent***; and, ***non-linear Gauss-Seidel*** was seen to naturally generalize ***Gauss-Seidel*** to nonlinear functions.


So far most of the focus has been on understanding slopes and curvature of functions in order to iteratively "hill climb" towards optima. This has been predominantly done through the use of gradients and Hessians; however, the previously introduced ***bisection*** and ***ternary search*** methods were able to find the roots and optima of nonlinear functions in a gradient-free manner. Another such method is the ***Nelder-Mead algorithm***, which prescribes a gradient-free optimization approach by substituting a (slow yet) robust heuristic algorithm in place of potentially expensive and unstable derivative computations.  

- Gradient-based methods are great when they work since they specifically leverage information about the function in question that is helpful for solving the problem at hand; however, when they fail, the more robust gradient-free methods present a very attractive, albeit more computationally demanding alternative.
- Gradient-free methods are generally more robust than their gradient-based counterparts; thus, when the computation required for gradient-free methods is tractable, they present a more general solution than gradient-based methods.
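
A gradient-free fit with the ***Nelder-Mead algorithm*** can be sketched with `scipy.optimize.minimize`, which needs only function evaluations. The Himmelblau test function, the starting point, and the tolerance options are assumptions chosen for illustration; the function has several minima with value $0$, and which one is reached depends on the starting point.

```python
import numpy as np
from scipy.optimize import minimize

def himmelblau(x):
    # Standard two-dimensional test function; every one of its minima has value 0.
    return (x[0]**2 + x[1] - 11.0)**2 + (x[0] + x[1]**2 - 7.0)**2

result = minimize(himmelblau, x0=np.array([1.0, 1.0]), method="Nelder-Mead",
                  options={"xatol": 1e-8, "fatol": 1e-8})
print(result.x, result.fun, result.nfev)
```

No derivative is supplied or approximated here; `result.nfev` reports how many function evaluations the simplex search used, which is typically the cost to weigh against a gradient-based method.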

Here are some fun visualizations illustrating the ***Nelder-Mead method***. The ***Nelder-Mead algorithm*** is relatively simple (and widely available, e.g., fully detailed in both James E. Gentle's as well as the Givens and Hoeting *Computational Statistics* textbooks); though, certainly it requires a careful and attentive implementation.   

| | |
|-|-|
|![](http://takashiida.floppy.jp/wp-content/uploads/2020/11/05NelderMead-1.gif)| ![](https://userpages.umbc.edu/~rostamia/2020-09-math625/images/nelder-mead.gif) |
| [Takashi Ida's Personal Website](http://takashiida.floppy.jp/en/education-2/gif-nelder-mead/) |  [Rouben Rostamian's (UMBC) Comp Math + C](https://userpages.umbc.edu/~rostamia/2020-09-math625/) |
| ![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Nelder-Mead_Himmelblau.gif/640px-Nelder-Mead_Himmelblau.gif)| ![](https://upload.wikimedia.org/wikipedia/commons/0/08/Nelder_Mead2vectorised.gif) |
| [Wikipedia's Nelder-Mead page](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method) |
| ![](http://fab.cba.mit.edu/classes/864.14/students/Calisch_Sam/10/img/nelder-mead.gif) | ![](https://upload.wikimedia.org/wikipedia/commons/1/14/Direct_search_BROYDEN.gif) | 
| [Sam Calisch's homework submission for MIT MAS 864](http://fab.cba.mit.edu/classes/864.14/students/Calisch_Sam/10/img/nelder-mead.gif) | [Wikipedia's Pattern Search page (a different but similar method)](https://en.wikipedia.org/wiki/Pattern_search_(optimization))|

<!--

Quantile Regression

lasso

SGD
-->


