# PyTorch ```optim```

## <a name="overview"></a> Overview

Optimization is at the core of contemporary deep learning algorithms. PyTorch implements various optimization algorithms in the ```optim``` module. This notebook gives a high-level overview of it. The documentation of the module can be found at <a href="https://pytorch.org/docs/stable/optim.html">torch.optim</a>.


## <a name="ekf"></a> PyTorch ```optim```

```torch.optim``` is a package implementing various optimization algorithms. The most commonly used methods are already supported, and the interface is general enough that more sophisticated ones can be easily integrated in the future.

To use ```torch.optim``` you have to construct an optimizer object that holds the current state and updates the parameters based on the computed gradients. To construct an optimizer you give it an iterable containing the parameters to optimize (all should be ```torch.nn.Parameter```s or tensors that require gradients; older PyTorch versions called these ```Variable```s). Then you can specify optimizer-specific options such as the learning rate, weight decay, etc. Below is an example.

```python
from torch import optim

# SomeTorchModule stands in for any torch.nn.Module subclass
model = SomeTorchModule()

# An optimizer is initialized with the parameters to optimize.
# Typically these are the model parameters
optimizer = optim.SGD(model.parameters(), lr=0.01)

# ... but it can also be any iterable of tensors that require gradients
optimizer = optim.Adam([var1, var2], lr=0.0001)
```

All optimizers implement a ```step()``` method that updates the parameters. It can be used in two ways:

- Calling ```optimizer.step()```
- Calling ```optimizer.step(closure)```

The script below shows how to use the ```step``` method in the first case:

```python
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
```
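The loop above relies on names defined elsewhere (```model```, ```dataset```, ```loss_fn```). As a minimal self-contained sketch, the same pattern can be run on a toy problem. The synthetic data (```y = 2x + 1```), the single ```nn.Linear``` layer, the learning rate and the number of iterations are all assumptions chosen for illustration.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical toy data: a noiseless line y = 2x + 1
x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
y = 2.0 * x + 1.0

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()

for epoch in range(200):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # compute gradients of the loss w.r.t. the parameters
    optimizer.step()             # update the parameters using those gradients

final_loss = loss_fn(model(x), y).item()
print(f"loss: {initial_loss:.4f} -> {final_loss:.6f}")
```

Here the whole data set is used as a single batch, so each pass of the loop is one full gradient descent step.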

This is the simplified version, supported by most optimizers. ```step()``` can be called once the gradients have been computed, e.g. with ```backward()```. Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to reevaluate the function multiple times, so you have to pass in a closure that allows them to recompute your model. The closure should clear the gradients, compute the loss, and return it.

```python
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
```
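As a hedged, self-contained sketch of the closure form, the snippet below fits the same kind of toy line with ```optim.LBFGS```. The data, the model, the ```max_iter``` value and the use of the ```strong_wolfe``` line search are assumptions made for illustration; the ```calls``` counter is a hypothetical helper added only to show that the optimizer invokes the closure more than once per ```step```.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical toy data: a noiseless line y = 2x + 1
x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
y = 2.0 * x + 1.0

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = optim.LBFGS(
    model.parameters(), lr=1.0, max_iter=20, line_search_fn="strong_wolfe"
)

calls = {"n": 0}  # counts how often the optimizer re-evaluates the model

def closure():
    calls["n"] += 1
    optimizer.zero_grad()        # clear the gradients
    loss = loss_fn(model(x), y)  # recompute the loss
    loss.backward()              # recompute the gradients
    return loss                  # the optimizer needs the loss value

outer_steps = 3
for _ in range(outer_steps):
    optimizer.step(closure)

lbfgs_loss = loss_fn(model(x), y).item()
print(f"closure calls: {calls['n']}, final loss: {lbfgs_loss:.6f}")
```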

Now that we have a basic understanding of how to use an optimizer in PyTorch, let's see how we
can further improve learning. The usual tricks are [1]:

- Momentum
- Dropout
- Weight initialization
- Learning rate decay

Some of them can be passed as arguments to the optimizer object. For example:

```python
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```
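To make the effect of the ```momentum``` argument concrete, the sketch below compares ```optim.SGD``` against a hand-computed reference on a single scalar parameter. It assumes the default settings of ```optim.SGD``` (no dampening, no Nesterov momentum), under which the update is ```buf = momentum * buf + grad``` followed by ```w = w - lr * buf```. The loss ```0.5 * w**2``` is a hypothetical choice that makes the gradient equal to ```w```.

```python
import torch
from torch import optim

lr, momentum = 0.1, 0.9

# One scalar parameter; loss = 0.5 * w**2, so the gradient is w itself
w = torch.tensor([1.0], requires_grad=True)
optimizer = optim.SGD([w], lr=lr, momentum=momentum)

# Hand-computed reference values for the same update rule
w_ref, buf = 1.0, 0.0
history = []

for _ in range(3):
    optimizer.zero_grad()
    loss = 0.5 * (w ** 2).sum()
    loss.backward()
    optimizer.step()

    buf = momentum * buf + w_ref  # accumulate the gradient into the velocity
    w_ref = w_ref - lr * buf      # move along the velocity
    history.append((w.item(), w_ref))

for torch_value, reference in history:
    print(f"optimizer: {torch_value:.6f}  by hand: {reference:.6f}")
```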

Dropout, on the other hand, is not an optimizer option. It is implemented by the ```nn.Dropout(p)``` module, where ```p``` is the probability of zeroing an element.
Learning rate decay can be performed using the learning rate schedulers in the ```torch.optim.lr_scheduler``` module, which adjust the learning rate over the epochs.
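
The following is a minimal sketch combining these tricks, not a recommended configuration. The network shape, the random data, and the choice of ```StepLR``` with ```step_size=2``` and ```gamma=0.5``` are assumptions for illustration; ```StepLR``` is one of several schedulers in ```torch.optim.lr_scheduler```.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical small network with a dropout layer
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # zeroes each activation with probability 0.5 in training mode
    nn.Linear(8, 1),
)
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Multiply the learning rate by gamma every step_size epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

# Random data, used only to drive the loop
x = torch.randn(16, 4)
y = torch.randn(16, 1)

lrs = []
model.train()  # dropout is active
for epoch in range(6):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # called once per epoch, after optimizer.step()

model.eval()  # dropout is disabled, so the forward pass is deterministic
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)

print(lrs)
```

Note that ```model.train()``` and ```model.eval()``` switch dropout on and off, so they should be called before training and before evaluation respectively.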

## <a name="refs"></a> References

1. Seth Weidman, ```Deep Learning from Scratch: Building with Python from First Principles```, O'Reilly.