In [1]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2

In [2]:
import torch
from torch import Tensor, nn, optim


## Optimisers
Optimisers are <mark>classes responsible for updating parameters according to their effect on the loss.
The general update rule is:</mark>

$$\theta_{t+1} = \theta_t - \gamma\nabla_{\theta_t}\!\mathcal{L}$$

The gradient of the loss with respect to the parameter $\theta$, $\nabla_{\theta_t}\!\mathcal{L}$, can be computed by backpropagating the loss value to the parameter. <mark>The learning rate $\gamma$ controls how much the parameter is changed by the gradient</mark>, i.e. how large a step the optimiser takes. Because the update subtracts the gradient, each step moves the parameter in the direction that decreases the loss, so the loss must be defined such that minimising it results in better performance.
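As a minimal sketch of the update rule, the snippet below applies it by hand to a single parameter. The quadratic loss $\mathcal{L}(\theta) = (\theta - 3)^2$, the starting value, and the learning rate are illustrative assumptions, not part of the notebook.

```python
# Minimise the toy loss L(theta) = (theta - 3)^2 with plain gradient descent.
def grad(theta: float) -> float:
    """Analytic gradient dL/dtheta of the toy loss."""
    return 2.0 * (theta - 3.0)

theta = 0.0  # initial parameter value (arbitrary choice)
gamma = 0.1  # learning rate (arbitrary choice)

for _ in range(100):
    # theta_{t+1} = theta_t - gamma * gradient
    theta = theta - gamma * grad(theta)

print(theta)  # ends up very close to the minimum at theta = 3
```

A larger `gamma` takes bigger steps towards the minimum; for this particular loss, any value above 1 makes the iterates diverge instead.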

The <mark>loss is often evaluated using a batch of data points, rather than all of the data or just one point. The result is a trade-off between speed and precise evaluation of the loss.</mark> The resulting stochasticity in the updates can also reduce the chance that the DNN overfits to the training data.
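To illustrate the trade-off, the sketch below evaluates a loss on the full dataset and on shuffled mini-batches. The dataset, the one-layer `toy_model`, and the batch size of 20 are hypothetical choices made only for this example.

```python
import torch
from torch import nn

torch.manual_seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: 100 points, 3 features, binary targets
data = torch.randn(100, 3)
labels = (data.sum(dim=1, keepdim=True) > 0).float()

toy_model = nn.Sequential(nn.Linear(3, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
batch_size = 20

with torch.no_grad():  # only evaluating the loss, no gradients needed
    full_loss = loss_fn(toy_model(data), labels)

    # Shuffle the indices, then split them into consecutive batches
    perm = torch.randperm(len(data))
    batch_losses = torch.stack([
        loss_fn(toy_model(data[perm[i:i + batch_size]]),
                labels[perm[i:i + batch_size]])
        for i in range(0, len(data), batch_size)
    ])

print(full_loss)     # loss over all 100 points
print(batch_losses)  # five noisy estimates of the same quantity
```

Each batch loss is cheaper to compute than the full loss but only approximates it. Because the batches here have equal size, their mean equals the full-data loss.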

PyTorch implements a variety of optimisers (https://pytorch.org/docs/stable/optim.html). See https://ruder.io/optimizing-gradient-descent/index.html for a good overview.

We'll stick with the <mark>standard SGD for now. When instantiating an `Optimizer`, the parameters to be optimised must be provided, along with the necessary hyper-parameters of the optimisation algorithm.</mark> We'll start with a very simple set of layers:

In [3]:
model = nn.Sequential(nn.Linear(3,10), nn.ReLU(), nn.Linear(10,1), nn.Sigmoid())
model.state_dict()

OrderedDict([('0.weight',
              tensor([[ 0.2790, -0.5656, -0.0903],
                      [ 0.0068, -0.2715, -0.2693],
                      [-0.3594, -0.3630,  0.4677],
                      [ 0.2626,  0.3507,  0.4352],
                      [ 0.1235, -0.5480,  0.3981],
                      [ 0.3639, -0.2865, -0.2395],
                      [-0.1944,  0.1937, -0.4398],
                      [ 0.1796, -0.3070,  0.2573],
                      [ 0.2668,  0.3852, -0.5520],
                      [-0.5128, -0.3175,  0.5120]])),
             ('0.bias',
              tensor([-0.2366, -0.4597, -0.1735, -0.4391, -0.4639,  0.1803, -0.5023,  0.0026,
                       0.0280, -0.3805])),
             ('2.weight',
              tensor([[-0.1051, -0.2527, -0.2221,  0.1301, -0.2620, -0.1702, -0.1226, -0.2376,
                       -0.1948, -0.2048]])),
             ('2.bias', tensor([-0.2743]))])

<mark>To optimise the parameters of the model, we pass its `.parameters()` generator to the optimiser constructor. The optimiser stores references to the parameters, so it can always access them.</mark>

In [4]:
opt = optim.SGD(params=model.parameters(), lr=1e-2)
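As a quick check of what the optimiser holds, the sketch below rebuilds the same model and inspects `opt.param_groups`, the list of dictionaries in which a PyTorch optimiser keeps its parameters and hyper-parameters. The variable names `group` and `same_object` are only for illustration.

```python
from torch import nn, optim

model = nn.Sequential(nn.Linear(3, 10), nn.ReLU(), nn.Linear(10, 1), nn.Sigmoid())
opt = optim.SGD(params=model.parameters(), lr=1e-2)

group = opt.param_groups[0]  # a single group, since we passed one iterable
print(group["lr"])           # the learning rate we supplied
print(len(group["params"]))  # two weight tensors and two bias tensors

# The optimiser stores the parameter objects themselves, not copies
same_object = group["params"][0] is model[0].weight
print(same_object)
```

Because the optimiser refers to the very same tensors as the model, any in-place update it makes is immediately visible through `model`.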

We also need a loss function.

In [5]:
loss_fn = nn.BCELoss()
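For reference, `nn.BCELoss` with its default mean reduction computes the binary cross-entropy $-\frac{1}{N}\sum_i \left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$. The sketch below compares a hand-written version against the built-in one; the three predictions and targets are made-up values.

```python
import torch
from torch import nn

# Made-up predictions (probabilities) and binary targets
preds = torch.tensor([[0.9], [0.2], [0.6]])
targets = torch.tensor([[1.0], [0.0], [1.0]])

# Binary cross-entropy written out by hand, averaged over the data points
manual = -(targets * preds.log() + (1 - targets) * (1 - preds).log()).mean()

builtin = nn.BCELoss()(preds, targets)
print(manual, builtin)  # the two values should agree
```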

Now we pass some data through the network to get a prediction:

In [9]:
inputs = torch.randn(20,3)
inputs[10:] += 0.25
targets = torch.zeros(20,1)
targets[10:] = 1

In [10]:
preds = model(inputs)
preds

tensor([[0.3746],
        [0.4166],
        [0.3722],
        [0.3707],
        [0.4085],
        [0.3947],
        [0.4036],
        [0.3702],
        [0.3399],
        [0.4305],
        [0.4044],
        [0.3465],
        [0.3991],
        [0.4298],
        [0.3400],
        [0.3747],
        [0.3438],
        [0.4319],
        [0.4158],
        [0.4281]], grad_fn=<SigmoidBackward0>)

Now we compute the loss value:

In [11]:
loss = loss_fn(preds, targets)
loss

tensor(0.7172, grad_fn=<BinaryCrossEntropyBackward0>)

At this point we want to <mark>ensure that the parameters do not have any gradient value, e.g. left over from previous updates</mark>. In this case, <mark>we can see that the `.grad` attributes are `None`.</mark>

In [15]:
model[0].weight, model[0].weight.grad

(Parameter containing:
 tensor([[ 0.2790, -0.5656, -0.0903],
         [ 0.0068, -0.2715, -0.2693],
         [-0.3594, -0.3630,  0.4677],
         [ 0.2626,  0.3507,  0.4352],
         [ 0.1235, -0.5480,  0.3981],
         [ 0.3639, -0.2865, -0.2395],
         [-0.1944,  0.1937, -0.4398],
         [ 0.1796, -0.3070,  0.2573],
         [ 0.2668,  0.3852, -0.5520],
         [-0.5128, -0.3175,  0.5120]], requires_grad=True),
 None)

Just in case, though, <mark>we will ensure that they are all zero or `None`.</mark>

In [16]:
opt.zero_grad()

In [17]:
model[0].weight.grad
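Whether `zero_grad` leaves zeros or `None` depends on its `set_to_none` argument, whose default differs between PyTorch versions. The sketch below passes the argument explicitly to show both behaviours; the single `nn.Linear` layer and its inputs are illustrative stand-ins for the model above.

```python
import torch
from torch import nn, optim

layer = nn.Linear(3, 1)
opt = optim.SGD(layer.parameters(), lr=1e-2)

# Populate .grad, then zero it while keeping the gradient tensors
layer(torch.ones(2, 3)).sum().backward()
opt.zero_grad(set_to_none=False)
grad_after_zeroing = layer.weight.grad.clone()
print(grad_after_zeroing)  # a tensor of zeros

# Populate .grad again, then drop the gradient tensors entirely
layer(torch.ones(2, 3)).sum().backward()
opt.zero_grad(set_to_none=True)
grad_after_none = layer.weight.grad
print(grad_after_none)     # None
```

Either state is a clean starting point for the next call to `backward()`.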

Now we can backpropagate the gradient of the loss:

In [18]:
loss.backward()

Now when we check the gradients on the parameters, we'll see that they are non-zero:

In [19]:
model[0].weight.grad

tensor([[ 0.0099,  0.0027,  0.0070],
        [-0.0032,  0.0056,  0.0026],
        [-0.0019, -0.0014,  0.0178],
        [-0.0060,  0.0013, -0.0129],
        [ 0.0092,  0.0040,  0.0179],
        [ 0.0316,  0.0210,  0.0195],
        [ 0.0021, -0.0052,  0.0123],
        [ 0.0039, -0.0026,  0.0119],
        [ 0.0223,  0.0103,  0.0124],
        [-0.0071, -0.0023,  0.0005]])

<mark>The values of the parameters haven't changed yet. We need to perform an update step with the optimiser:</mark>

In [20]:
opt.step()

In [21]:
model[0].weight

Parameter containing:
tensor([[ 0.2789, -0.5656, -0.0904],
        [ 0.0069, -0.2715, -0.2693],
        [-0.3594, -0.3629,  0.4675],
        [ 0.2626,  0.3507,  0.4353],
        [ 0.1234, -0.5480,  0.3979],
        [ 0.3636, -0.2867, -0.2397],
        [-0.1944,  0.1937, -0.4400],
        [ 0.1795, -0.3069,  0.2572],
        [ 0.2665,  0.3851, -0.5521],
        [-0.5127, -0.3174,  0.5120]], requires_grad=True)
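To connect this back to the update rule, the sketch below checks that one step of plain SGD (no momentum or weight decay) equals $\theta - \gamma\nabla_\theta\mathcal{L}$ computed by hand. The small `layer`, the random inputs, and the squared-output loss are assumptions made for the example.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
layer = nn.Linear(3, 1)
lr = 1e-2
opt = optim.SGD(layer.parameters(), lr=lr)

# Any scalar loss will do; here, the mean squared output on random inputs
loss = layer(torch.randn(8, 3)).pow(2).mean()
loss.backward()

# Apply the update rule by hand before letting the optimiser do it
before = layer.weight.detach().clone()
expected = before - lr * layer.weight.grad

opt.step()
matches = torch.allclose(layer.weight.detach(), expected)
print(matches)
```

With a learning rate of `1e-2` and gradients of order `1e-2`, each weight moves by roughly `1e-4`, which matches the size of the changes in the printed parameters above.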

<mark>The parameters have now updated slightly. They still have their gradients, though, and `backward()` adds new gradients to whatever is already stored in `.grad`, which is why it is important that we always zero them before backpropagating the loss.</mark>

In [22]:
model[0].weight.grad

tensor([[ 0.0099,  0.0027,  0.0070],
        [-0.0032,  0.0056,  0.0026],
        [-0.0019, -0.0014,  0.0178],
        [-0.0060,  0.0013, -0.0129],
        [ 0.0092,  0.0040,  0.0179],
        [ 0.0316,  0.0210,  0.0195],
        [ 0.0021, -0.0052,  0.0123],
        [ 0.0039, -0.0026,  0.0119],
        [ 0.0223,  0.0103,  0.0124],
        [-0.0071, -0.0023,  0.0005]])
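The sketch below shows what goes wrong when the gradients are not cleared: a second call to `backward()` adds to the stored gradient rather than replacing it. The `layer`, inputs, and loss are again illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)
layer = nn.Linear(3, 1)
x = torch.randn(8, 3)

# First backward pass: .grad holds the gradient of the loss
layer(x).pow(2).mean().backward()
first = layer.weight.grad.clone()

# Second backward pass on the same loss without zeroing in between:
# the new gradient is added to the old one
layer(x).pow(2).mean().backward()
accumulated = layer.weight.grad.clone()

# Clearing the gradients first gives the correct value again
layer.zero_grad()
layer(x).pow(2).mean().backward()
fresh = layer.weight.grad.clone()

print(torch.allclose(accumulated, 2 * first))
print(torch.allclose(fresh, first))
```

Without the zeroing step, every update would be driven by the sum of all gradients computed so far instead of the gradient of the current loss.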