# Newton methods

The Newton-Raphson method (1669) is an iterative algorithm for finding roots of a function.

On each iteration, the next estimate is the point where the tangent line at the current estimate crosses zero:

$$x_{k+1} = x_{k} - \frac{f(x_k)}{f'(x_k)}$$
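
The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not a robust solver: the function name `newton_root`, the tolerance, and the test function $x^2 - 2$ are assumptions made for the example.

```python
def newton_root(f, df, x0, tol=1e-12, max_iter=50):
    """Find a root of f with the iteration x <- x - f(x) / f'(x)."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            # A horizontal tangent never crosses zero, so the step is undefined.
            raise ZeroDivisionError("f'(x) is zero at the current estimate")
        step = f(x) / slope
        x -= step
        if abs(step) < tol:
            break
    return x


# Illustrative example: the positive root of x^2 - 2 is sqrt(2).
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)
```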

It is equally useful in optimization problems. A necessary condition for an optimum of a differentiable function is that its derivative equals zero. So we apply the root-finding algorithm to the derivative and get the following expression:

$$x_{k+1} = x_{k} - \frac{f'(x_k)}{f''(x_k)}$$



### One-dimensional Newton for optimization
Let's expand the function $f(x)$ in its Taylor series (up to the second derivative):

$$f(x+t) \approx f(x) + f'(x) \cdot t + \frac{1}{2}f''(x) \cdot t^2$$

We have approximated the function $f$ near $x$ with a second-degree polynomial in $t$.

Now let's find the optimum of this polynomial: compute its derivative (with respect to **t**) and set it to zero:

$$\frac{\partial f(x+t)}{\partial t} \approx f'(x) + f''(x) \cdot t = 0 $$

The solution gives us the value of t at the stationary point of the polynomial (a minimum when $f''(x) > 0$, a maximum when $f''(x) < 0$):

$$t = -\frac{f'(x)}{f''(x)} $$

So, the update rule $x_{k+1} := x_k + t$ becomes the following:

$$x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} $$

<img src="newton_scheme.png" width = 700>
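
A minimal sketch of the one-dimensional update rule, assuming the first and second derivatives are available as Python callables. The name `newton_minimize_1d` and the example function $f(x) = x^4/4 - x$ (with minimum at $x = 1$) are illustrative choices, not part of the derivation above.

```python
def newton_minimize_1d(df, d2f, x0, tol=1e-10, max_iter=100):
    """Iterate x <- x - f'(x) / f''(x) until the step becomes tiny."""
    x = x0
    for _ in range(max_iter):
        curvature = d2f(x)
        if curvature == 0:
            raise ZeroDivisionError("f''(x) is zero; the Newton step is undefined")
        t = -df(x) / curvature  # stationary point of the local quadratic model
        x += t
        if abs(t) < tol:
            break
    return x


# Illustrative example: f(x) = x^4 / 4 - x, so f'(x) = x^3 - 1 and f''(x) = 3 x^2.
x_star = newton_minimize_1d(lambda x: x ** 3 - 1, lambda x: 3 * x ** 2, x0=2.0)
print(x_star)
```

Note that the iteration only looks for a point where $f'(x) = 0$; without a check on the sign of $f''(x)$ it can just as well converge to a maximum.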

### Multidimensional Newton for optimization

When $f(\hat{x}) = f(x_1, x_2, \dots, x_n)$ is a multidimensional function:
- The derivative $f'(x)$ becomes the gradient $\nabla f(x)$, the vector of first partial derivatives
- The second derivative $f''(x)$ becomes the Hessian $\nabla^2 f(x)$, the matrix of second partial derivatives

$$x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$$
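
A minimal NumPy sketch of a single multidimensional Newton step. The helper name `newton_step` and the quadratic test problem are assumptions for illustration. The code solves the linear system $\nabla^2 f \cdot \Delta x = -\nabla f$ rather than forming the inverse Hessian explicitly, which is the usual practice.

```python
import numpy as np


def newton_step(grad, hess, x):
    """Return x - H(x)^{-1} g(x), computed by solving H dx = -g."""
    dx = np.linalg.solve(hess(x), -grad(x))
    return x + dx


# Illustrative quadratic: f(x) = 1/2 x^T A x - b^T x,
# with gradient A x - b and constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])

x1 = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(x1)  # for a quadratic, one Newton step lands on the minimizer A^{-1} b
```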

# Quasi-Newton methods

#### Motivation

Quasi-Newton methods are approximate Newton methods:

1. They do not require exact computation of the Hessian matrix on each step. Instead they use $B$, an approximation of the matrix $\nabla^2 f$, which is updated iteratively using information from the current step.


2. Besides, the [Sherman-Morrison](https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula) formula allows efficient computation of the inverse of $B$ => those methods are even more efficient.

**NOTE** [Hessians are symmetric](https://en.wikipedia.org/wiki/Symmetry_of_second_derivatives#Schwarz's_theorem) => B should be symmetric too



Let's repeat the second-order Taylor expansion from Newton's method, with $B$ in place of the Hessian:

$$f(x_k + \Delta x) \approx f(x_k) + \nabla f(x_k)^T \cdot \Delta x + \frac{1}{2} \Delta x^{T} \cdot B \cdot \Delta x$$

The gradient of this approximation will give us

$$\frac{\partial f(x_k + \Delta x)}{\partial \Delta x} \approx \nabla f(x_k) + B \cdot \Delta x$$

Setting this gradient to zero, we get the Newton update step:

$$\Delta x = -B^{-1} \nabla f(x_k)$$

How can we update $B$ on each step without computing the exact Hessian?
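
As background (this is the standard condition used by the quasi-Newton family, not something derived in these notes): the new approximation is usually required to reproduce the observed change in gradient along the step just taken. Here $s_k$ and $y_k$ are the step and the change in gradient:

$$B_{k+1} s_k = y_k, \qquad s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$$

This is called the secant condition. It does not determine $B_{k+1}$ uniquely for $n > 1$, and the methods listed below differ in which symmetric update satisfying it they choose.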

## Overview of popular implementations
- DFP (1959)
- BFGS (1970)
- BHHH (1974)
- L-BFGS (1989) = limited memory variant of BFGS
- SR1 (1991)

## DFP

Davidon–Fletcher–Powell (1959)

## BFGS

Broyden–Fletcher–Goldfarb–Shanno (1970)

Algorithm:
0. Set initial $B_0$

Loop until convergence


1. Compute the search direction using the equation from Newton's method: $B_k \Delta x_k = - \nabla f(x_k)$


2. Choose $\alpha$ the optimal step size in that direction

    $\alpha = \arg\min_{\alpha} f(x_k + \alpha \Delta x_k)$
    
    This is usually done via some basic line search algorithm


3. Make a step

    $x_{k+1} = x_k + \alpha \Delta x_k$


4. Compute the change in gradient

    $y_{k} = \nabla f(x_{k+1}) - \nabla f(x_k)$


5. Update the Hessian approximation using $y_k$ and the step $s_k = x_{k+1} - x_k = \alpha \Delta x_k$

    $B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k^T}{s_k^T B_k s_k}$
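
The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a production solver: $B_0$ is taken to be the identity, the "optimal" step size is replaced by a simple backtracking (Armijo) line search, and the update is skipped when $y^T s$ is not safely positive so that $B$ stays positive definite. The function name `bfgs` and the quadratic test problem are hypothetical.

```python
import numpy as np


def bfgs(f, grad, x0, tol=1e-8, max_iter=200):
    x = np.asarray(x0, dtype=float)
    B = np.eye(x.size)                   # step 0: initial B_0 (identity, an assumption)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:      # converged: gradient is (almost) zero
            break
        dx = np.linalg.solve(B, -g)      # step 1: solve B dx = -grad f(x)
        alpha = 1.0                      # step 2: backtracking line search
        while f(x + alpha * dx) > f(x) + 1e-4 * alpha * (g @ dx) and alpha > 1e-12:
            alpha *= 0.5
        s = alpha * dx                   # step 3: make the step
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                    # step 4: change in gradient
        # step 5: rank-two update, skipped if the curvature y^T s is not positive
        if y @ s > 1e-10 * np.linalg.norm(y) * np.linalg.norm(s):
            Bs = B @ s
            B = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
        x, g = x_new, g_new
    return x


# Illustrative quadratic: f(x) = 1/2 x^T A x - b^T x with minimizer A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_min = bfgs(lambda x: 0.5 * x @ A @ x - b @ x, lambda x: A @ x - b, np.zeros(2))
print(x_min)
```

Because this sketch stores $B$ itself, step 1 still needs a linear solve; updating the inverse of $B$ directly avoids that.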

#### Complexity
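When the inverse approximation $B^{-1}$ is updated directly (via the Sherman-Morrison formula), BFGS does not require a matrix inversion or a linear solve => its per-iteration complexity is $O(n^2)$, compared to $O(n^3)$ for Newton methods.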
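
A small numerical check of that claim, assuming the commonly stated inverse form of the BFGS update, $H_{k+1} = (I - \rho s y^T) H_k (I - \rho y s^T) + \rho s s^T$ with $H = B^{-1}$ and $\rho = 1 / (y^T s)$. The function names and the random test data are illustrative. The inverse update uses only matrix-vector and outer products, hence $O(n^2)$ work.

```python
import numpy as np


def bfgs_update_B(B, s, y):
    """Direct BFGS update of the Hessian approximation B."""
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)


def bfgs_update_H(H, s, y):
    """Update of H = B^{-1}; expands (I - rho s y^T) H (I - rho y s^T) + rho s s^T."""
    rho = 1.0 / (y @ s)
    Hy = H @ y  # O(n^2) matrix-vector product; H is assumed symmetric
    return (H
            - rho * (np.outer(s, Hy) + np.outer(Hy, s))
            + (rho ** 2 * (y @ Hy) + rho) * np.outer(s, s))


# Random symmetric positive definite test data (illustrative only).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
B = M @ M.T + 4.0 * np.eye(4)
C = rng.standard_normal((4, 4))
s = rng.standard_normal(4)
y = (C @ C.T + np.eye(4)) @ s  # guarantees y^T s > 0

H_new = bfgs_update_H(np.linalg.inv(B), s, y)
B_new = bfgs_update_B(B, s, y)
print(np.allclose(H_new, np.linalg.inv(B_new)))
```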



### Newton vs Gradient methods
Newton methods are <u>second order</u> methods - they require computation of second derivatives.
Gradient methods require only first derivatives.

## Comparison of Adam and L-BFGS

(from stackexchange)

These are two of the best methods in two different classes of approaches: gradient descent and quasi-Newton methods.

An L-BFGS solver is a true quasi-Newton method in that it estimates the curvature of the parameter space via an approximation of the Hessian. So if your parameter space has plenty of long, nearly-flat valleys then L-BFGS would likely perform well. It has the downside of additional costs in performing a rank-two update to the (inverse) Hessian approximation at every step. While this is reasonably fast, it does begin to add up, particularly as the input space grows. This may account for the fact that ADAM outperforms L-BFGS for you as you get more data.

ADAM is a first order method that attempts to compensate for the fact that it doesn't estimate the curvature by adapting the step-size in every dimension. In some sense, this is similar to constructing a diagonal Hessian at every step, but they do it cleverly by simply using past gradients. In this way it is still a first order method, though it has the benefit of acting as though it is second order. The estimate is cruder than that of the L-BFGS in that it is only along each dimension and doesn't account for what would be the off-diagonals in the Hessian. If your Hessian is nearly singular then these off-diagonals may play an important role in the curvature and ADAM is likely to underperform relative to L-BFGS.

SciPy implements both BFGS and L-BFGS through `scipy.optimize.minimize` (`method="BFGS"` and `method="L-BFGS-B"`).
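
A minimal sketch of calling SciPy's quasi-Newton solvers, assuming SciPy is installed. The Rosenbrock test function and the starting point are illustrative choices, not taken from the notes above. `rosen` and `rosen_der` are the Rosenbrock function and its gradient shipped with `scipy.optimize`.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([-1.2, 1.0])  # a common starting point for this test function

# Full BFGS keeps a dense n x n approximation of the inverse Hessian.
res_bfgs = minimize(rosen, x0, jac=rosen_der, method="BFGS")

# L-BFGS-B keeps only a few recent (s, y) pairs instead of a full matrix.
res_lbfgs = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")

for name, res in [("BFGS", res_bfgs), ("L-BFGS-B", res_lbfgs)]:
    print(name, res.x, "iterations:", res.nit)
```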