```python
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch import nn
import torch.nn.functional as F
import torchzero as tz
from visualbench import FunctionDescent, test_functions

torch.manual_seed(0)
```

# Quasi-Newton methods

### Introduction
Quasi-Newton methods use approximations to second-order information, i.e. the hessian matrix. They do not require computing the hessian via autograd, which is useful when the hessian is too expensive to compute or not available. Quasi-Newton methods can be applied to all kinds of objectives, including non-convex and even non-smooth ones. However, performance is typically better on smooth objectives, so in a neural network, for example, you may get better results by replacing ReLU with ELU.

There are three main classes of Quasi-Newton methods: full-matrix methods, limited-memory methods, and methods that approximate the hessian with a diagonal matrix or a scalar.

The full-matrix methods store the full hessian approximation and thus require $O(N^2)$ memory. Since PyTorch supports GPU acceleration, full-matrix methods are fast to compute for problems under ~10,000 variables, although that number depends on how good your GPU is. Full-matrix methods usually converge the fastest, and they are also faster to compute than limited-memory methods as long as the CPU/GPU can handle it.

Limited-memory methods do not store the full hessian approximation and instead use a history of past parameter and gradient differences, so they are suitable for large-scale optimization.
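To make the memory difference concrete, here is a rough back-of-the-envelope sketch. It assumes float32 storage (4 bytes per value) and that a limited-memory method keeps exactly one parameter-difference vector and one gradient-difference vector per history entry; real implementations also allocate temporaries, so treat the numbers as lower bounds. The helper names are made up for this illustration.

```python
def full_matrix_bytes(n, bytes_per_value=4):
    """Memory for a dense n x n hessian approximation (float32 assumed)."""
    return n * n * bytes_per_value

def limited_memory_bytes(n, history_size, bytes_per_value=4):
    """Memory for `history_size` stored difference pairs, two vectors of length n each."""
    return 2 * history_size * n * bytes_per_value

n = 10_000
print(f"full matrix:    {full_matrix_bytes(n) / 1e6:.1f} MB")          # 400.0 MB
print(f"history of 10:  {limited_memory_bytes(n, 10) / 1e6:.1f} MB")   # 0.8 MB
print(f"history of 100: {limited_memory_bytes(n, 100) / 1e6:.1f} MB")  # 8.0 MB
```

The full matrix grows quadratically with the number of variables, while the history grows linearly, which is why limited-memory methods remain usable for models with millions of parameters.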

Finally, some methods maintain a diagonal hessian approximation or even a scalar (as in the Barzilai–Borwein method), so they also avoid the $N^2$ memory requirement.
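As an illustration of the scalar case, below is a minimal NumPy sketch of gradient descent with the Barzilai–Borwein "long" step size. With $s$ the difference between consecutive parameters and $y$ the difference between consecutive gradients, the hessian is approximated by the scalar $\frac{s^Ts}{s^Ty}$ inverted, i.e. the step size is $\frac{s^Ts}{s^Ty}$. This is not torchzero's implementation; the function name, `first_lr`, `tol` and `max_steps` are assumptions made for the example, and the test problem is a small convex quadratic.

```python
import numpy as np

def bb_descent(grad, x0, max_steps=500, first_lr=1e-3, tol=1e-8):
    """Gradient descent with the Barzilai-Borwein (long) step size."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    lr = first_lr  # no history yet, so the first step uses a fixed step size
    for _ in range(max_steps):
        if np.linalg.norm(g) < tol:
            break
        x_new = x - lr * g
        g_new = grad(x_new)
        s = x_new - x   # parameter difference
        y = g_new - g   # gradient difference
        sy = s @ y
        if sy > 0:      # only update the step size when curvature is positive
            lr = (s @ s) / sy
        x, g = x_new, g_new
    return x

# Convex quadratic f(x) = 0.5 * x^T A x - b^T x with gradient A x - b
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
x = bb_descent(lambda x: A @ x - b, np.zeros(3))
print("gradient norm at the result:", np.linalg.norm(A @ x - b))
```

Only two extra vectors and one scalar are stored, so the memory cost is linear in the number of variables.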

### Secant equation
Quasi-Newton methods are usually based on the secant equation:
$$
Bs=y
$$
Here $B$ is the hessian approximation, $s=x_t-x_{t-1}$ is the difference between parameters, and $y=\nabla f(x_t)-\nabla f(x_{t-1})$ is the difference between gradients. The differences are usually taken between the current and previous steps, although it is possible to sample them at other points.

The secant equation can also be written for the hessian inverse, which allows maintaining the inverse directly:
$$
Hy=s
$$
Here $H$ is the hessian inverse approximation.
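A quick numerical check can make both forms of the secant equation tangible. The sketch below uses NumPy and a randomly generated quadratic $f(x)=\frac{1}{2}x^TAx-b^Tx$, chosen here purely for illustration because its hessian $A$ is constant, so the true hessian satisfies the secant equation exactly for any pair of points.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)  # symmetric positive definite hessian
b = rng.standard_normal(n)

def grad(x):
    # gradient of f(x) = 0.5 * x^T A x - b^T x
    return A @ x - b

x_prev = rng.standard_normal(n)
x_curr = rng.standard_normal(n)
s = x_curr - x_prev              # parameter difference
y = grad(x_curr) - grad(x_prev)  # gradient difference

print(np.allclose(A @ s, y))                  # True: B = A satisfies Bs = y
print(np.allclose(np.linalg.solve(A, y), s))  # True: H = A^{-1} satisfies Hy = s
```

For a non-quadratic objective the true hessian changes between the two points, so the secant equation only captures the average curvature along $s$.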

Note that, for some devious reason, while $H$ usually denotes the hessian, in Quasi-Newton literature $H$ universally denotes the **hessian inverse** approximation and $B$ denotes the **hessian** approximation.

Quasi-Newton methods differ in how they solve the secant equation. For $N>1$ the equation has many solutions, so it is usually solved in a way that keeps $B_{t+1}$ as close as possible to $B_t$ in some norm.
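One simple instance of this idea is Broyden's "good" update, $B_{t+1}=B_t+\frac{(y-B_ts)s^T}{s^Ts}$, which is the closest matrix to $B_t$ in the Frobenius norm among all matrices satisfying the secant equation. It is used here only as an illustration: unlike BFGS or SR1 it does not keep $B$ symmetric. The NumPy sketch below checks both properties on random data; the function name is made up for this example.

```python
import numpy as np

def broyden_update(B, s, y):
    """Rank-one update: smallest Frobenius-norm change to B such that B_new @ s == y."""
    return B + np.outer(y - B @ s, s) / (s @ s)

rng = np.random.default_rng(0)
n = 4
B = np.eye(n)
s = rng.standard_normal(n)
y = rng.standard_normal(n)

B_new = broyden_update(B, s, y)
print(np.allclose(B_new @ s, y))  # the secant equation holds

# Build another solution of the secant equation by adding a perturbation E
# with E @ s == 0, and compare how far each solution is from the old B.
E = rng.standard_normal((n, n))
E -= np.outer(E @ s, s) / (s @ s)  # project so that E @ s == 0
B_other = B_new + E
print(np.allclose(B_other @ s, y))  # still satisfies the secant equation
print(np.linalg.norm(B_new - B) <= np.linalg.norm(B_other - B))  # Broyden's change is smaller
```

Other Quasi-Newton updates follow the same pattern with additional constraints (such as symmetry) or a different norm.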


### Quick recommendations:

torchzero has a lot of Quasi-Newton methods, but the following are very good baselines:

Problems under ~10,000 parameters:
```python
tz.Modular(
    model.parameters(),
    tz.m.RestartOnStuck(tz.m.BFGS()),
    tz.m.Backtracking(),
)
```
Problems under ~5,000 parameters:
```python
tz.Modular(
    model.parameters(),
    tz.m.LevenbergMarquardt(
        tz.m.RestartOnStuck(tz.m.SR1(inverse=False))
    ),
)
```
Large-scale optimization (`history_size` can be increased to 100 if affordable):
```python
tz.Modular(
    model.parameters(),
    tz.m.LBFGS(history_size=10),
    tz.m.Backtracking(),
)
```


### BFGS
