# `torch.optim`: All the common gradient-based optimizers

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
%matplotlib inline

## Define a function to minimize

This is called a _cost_ function, a _loss_ function, or an _objective_ function, among other names.

In [None]:
def f(x):
    return x**2

In [None]:
xvec = np.linspace(-2, 2, 100)
fvec = f(xvec)
plt.plot(xvec, fvec)
plt.xlabel('x')
plt.ylabel('f(x)');

## Set up a stochastic gradient descent (SGD) optimizer and run it

The input `x` needs to track gradients (`requires_grad=True`), because the optimizer relies on the derivative:
$$
\nabla_x f(x) = \frac{\partial f}{\partial x}(x)
$$

The basic algorithm of a gradient optimizer is to repeat:
$$
x \gets x - \alpha \nabla_x f(x)
$$

The parameter $\alpha$ is called the _learning rate_.
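As a rough illustration of the update rule above, the sketch below applies it by hand to $f(x) = x^2$, without any optimizer object. The names `x_manual`, `alpha`, and `manual_history` are chosen for this example only, and the starting point and learning rate simply mirror the values used elsewhere in this notebook.

```python
import torch

alpha = 0.1  # learning rate (illustrative value)
x_manual = torch.tensor([2.0], requires_grad=True)

manual_history = [x_manual.item()]
for _ in range(10):
    z = x_manual ** 2              # f(x) = x**2
    z.backward()                   # fills x_manual.grad with df/dx = 2*x
    with torch.no_grad():          # update the value without recording it in the graph
        x_manual -= alpha * x_manual.grad   # x <- x - alpha * grad
    x_manual.grad.zero_()          # reset the gradient before the next step
    manual_history.append(x_manual.item())

print(manual_history)
```

For this particular function each step multiplies `x` by $(1 - 2\alpha)$, so the values in `manual_history` should shrink geometrically toward zero.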

In [None]:
x = torch.tensor([2.0], requires_grad=True)

In [None]:
opt = torch.optim.SGD([x], lr=0.1)

As the optimizer runs, observe the gradient steps:

In [None]:
x_history = [x.detach().numpy().copy()]
for i in range(10):
    print('##########')  # separator between steps (no f-string needed)
    print(f'i = {i}')
    print(f'initial x = {x}')
    opt.zero_grad()
    z = f(x)
    print(f'z = {z}')
    z.backward()
    print(f'x.grad = {x.grad}')
    opt.step()
    print(f'updated x = {x}')
    x_history.append(x.detach().numpy().copy())
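The loop above calls `opt.zero_grad()` before every backward pass. A small sketch of why this matters: `backward()` adds to whatever is already stored in `.grad` rather than overwriting it. The tensor `w` and the variable names below are hypothetical and used only for this demonstration.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

(w ** 2).backward()
first = w.grad.clone()         # gradient of w**2 at w = 2 is 2*w = 4

(w ** 2).backward()            # second backward pass without clearing
accumulated = w.grad.clone()   # gradients add up: 4 + 4 = 8

w.grad.zero_()                 # clear the stored gradient by hand
(w ** 2).backward()
after_reset = w.grad.clone()   # back to 4

print(first, accumulated, after_reset)
```

Without the reset, each optimizer step would use the sum of all gradients computed so far instead of the gradient at the current point.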

## Visualizing optimization history

Plot the history of `x` versus step number:

In [None]:
plt.plot(x_history, 'o')
plt.ylim(0, 2.1)
plt.xlabel('optimization step number')
plt.ylabel('x');

Plot the history of `x` values together with the cost function:

In [None]:
xvec = np.linspace(-2, 2, 100)
fvec = f(xvec)
plt.plot(xvec, fvec)
plt.plot(x_history, np.zeros(len(x_history)), 'o')
plt.xlabel('x')
plt.ylabel('f(x)');

## Tasks

Observe that the rate of change of `x` slows down as the optimization proceeds. Why does this happen?

More advanced gradient methods try to speed up convergence by using momentum, Nesterov acceleration, or second-order derivative information (e.g., the [L-BFGS method](https://en.wikipedia.org/wiki/Limited-memory_BFGS)).
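As a starting point for experimenting with momentum, here is a minimal sketch that uses the `momentum` and `nesterov` arguments of `torch.optim.SGD` on the same function. The helper `run_sgd` and the momentum value 0.9 are assumptions made for this example, not settings recommended by this notebook.

```python
import torch

def run_sgd(momentum, nesterov=False, steps=10, lr=0.1):
    """Minimize f(x) = x**2 from x = 2 and return the visited x values."""
    x = torch.tensor([2.0], requires_grad=True)
    opt = torch.optim.SGD([x], lr=lr, momentum=momentum, nesterov=nesterov)
    history = [x.item()]
    for _ in range(steps):
        opt.zero_grad()
        (x ** 2).backward()
        opt.step()
        history.append(x.item())
    return history

plain_hist = run_sgd(momentum=0.0)                    # plain gradient descent
momentum_hist = run_sgd(momentum=0.9)                 # classical momentum
nesterov_hist = run_sgd(momentum=0.9, nesterov=True)  # Nesterov variant

for name, hist in [('plain', plain_hist),
                   ('momentum', momentum_hist),
                   ('nesterov', nesterov_hist)]:
    print(name, [round(v, 3) for v in hist])
```

Plotting the three histories with the same code used for `x_history` makes the differences visible. On this simple quadratic, a momentum value this large can overshoot the minimum and oscillate, so it is worth trying several values.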

1. Read about momentum and related ideas in [An overview of gradient descent optimization algorithms](https://arxiv.org/abs/1609.04747) by Sebastian Ruder.

2. Use the Adam optimizer from PyTorch to optimize the function above. What differences do you observe?