## Questionnaire

__1. What is the equation for a step of SGD, in math or code (as you prefer)?__


In code: `p.data.add_(p.grad.data, alpha=-lr)`

In pseudocode: `params = params - lr * params.gradient`
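As a rough illustration of the update above, here is a minimal sketch that repeats the step on a toy loss `(w - 3)**2`. It uses plain Python floats instead of tensors, and `grad_fn`, `sgd_step` and the numbers are made up for the example.

```python
# Repeated SGD steps on the toy loss (w - 3)**2, with plain floats.
# `grad_fn` and `sgd_step` are illustrative names, not fastai or PyTorch APIs.

def grad_fn(w):
    """Gradient of (w - 3)**2 with respect to w."""
    return 2 * (w - 3)

def sgd_step(w, grad, lr):
    """Return the parameter after one SGD step."""
    return w - lr * grad

w, lr = 0.0, 0.1
for _ in range(50):
    w = sgd_step(w, grad_fn(w), lr)

print(w)  # ends up close to the minimum at w = 3
```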


__2. What do we pass to `cnn_learner` to use a non-default optimizer?__

You pass an optimizer constructor through the `opt_func` argument, for example:
`opt_func=partial(Optimizer, cbs=[sgd_cb])`

where `def sgd_cb(p, lr, **kwargs): p.data.add_(p.grad.data, alpha=-lr)`

__3. What are optimizer callbacks?__

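They are small functions, such as `sgd_cb`, that the optimizer calls for each parameter during `step` to perform the update. They are passed in through `cbs`, as in `opt_func = partial(Optimizer, cbs=[sgd_cb])`.

They let us build a desired optimiser (e.g. SGD) in a few lines of code.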

__4. What does `zero_grad` do in an optimizer?__

- Loops through the params and sets their gradients to 0 (`p.grad.zero_()`)
- Detaches the gradients to remove their history (`p.grad.detach_()`)

__5. What does `step` do in an optimizer? How is it implemented in the general optimizer?__

It loops through the params and, for each one, through all the callbacks in `cbs`. Each callback receives the param together with the hyperparameters (such as `lr`) and performs the update.
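The two methods can be sketched in a few lines. This is a simplified toy under stated assumptions, not fastai's `Optimizer`: `Param` and `ToyOptimizer` are hypothetical stand-ins, gradients are plain floats, and per-parameter state is left out.

```python
class Param:
    """Hypothetical stand-in for a tensor parameter: a value and its gradient."""
    def __init__(self, data, grad=0.0):
        self.data, self.grad = data, grad

def sgd_cb(p, lr, **kwargs):
    # Optimizer callback: update one parameter in place.
    p.data += -lr * p.grad

class ToyOptimizer:
    def __init__(self, params, cbs, **hypers):
        self.params, self.cbs, self.hypers = list(params), list(cbs), hypers

    def zero_grad(self):
        # Reset every gradient so the next backward pass starts from zero.
        for p in self.params:
            p.grad = 0.0

    def step(self):
        # Every callback is applied to every parameter.
        for p in self.params:
            for cb in self.cbs:
                cb(p, **self.hypers)

params = [Param(1.0, grad=2.0), Param(-1.0, grad=-4.0)]
opt = ToyOptimizer(params, cbs=[sgd_cb], lr=0.1)
opt.step()
opt.zero_grad()
print([p.data for p in params], [p.grad for p in params])
```

Swapping in a different optimizer then only means passing a different list of callbacks.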

__6. Rewrite `sgd_cb` to use the `+=` operator, instead of `add_`.__

```python
def sgd_cb(p, lr, **kwargs): p.data += -lr*p.grad.data
```

__7. What is "momentum"? Write out the equation.__

Momentum replaces the raw gradient in the update with an exponentially weighted moving average of the past gradients. The equations are as follows:

```python
weight_avg = beta * weight_avg + (1-beta) * p.grad.data
p.data = p.data - lr * weight_avg
```

__8. What's a physical analogy for momentum? How does it apply in our model training settings?__

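A ball rolling down a hill: it does not follow every small bump, because its momentum keeps it moving in the overall downhill direction. In training, the moving average of the gradients plays the role of that momentum, so each step follows the main direction of the recent gradients rather than the noisy gradient of a single batch.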

__9. What does a bigger value for momentum do to the gradients?__

A bigger `beta` gives more weight to past gradients, so the value used in the update stays closer to the average gradient and is smoother. It also takes longer for new gradients to change the direction of the trend.
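To see the effect of `beta`, the sketch below feeds the same made-up gradient sequence through the moving average with three values of `beta`. The helper `momentum_avgs` and the gradient values are assumptions for illustration only.

```python
def momentum_avgs(grads, beta):
    """Return the moving average after each gradient, starting from 0."""
    avg, out = 0.0, []
    for g in grads:
        avg = beta * avg + (1 - beta) * g
        out.append(avg)
    return out

# Gradients that flip sign every batch but lean positive on average.
grads = [1.0, -0.5, 1.0, -0.5, 1.0, -0.5]
for beta in (0.0, 0.5, 0.9):
    print(beta, [round(a, 3) for a in momentum_avgs(grads, beta)])
```

With `beta = 0` the average is just the raw gradient. With `beta = 0.9` the values swing far less from batch to batch and build up slowly.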

__10. What are the default values of momentum for 1cycle training?__

`fit_one_cycle` by default starts with a `beta` of 0.95, gradually lowers it to 0.85, and then gradually raises it back to 0.95 by the end of training.

__11. What is RMSProp? Write out the equation.__

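RMSProp is a variant of SGD that gives each parameter its own adaptive learning rate. It keeps a moving average of the squared gradients and divides the current gradient by the square root of that average.

```python
w.square_avg = alpha * w.square_avg + (1-alpha) * (w.grad ** 2)
new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)
```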
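A small numeric sketch of the equations above, for a single scalar weight. The function name `rmsprop_step`, the default hyperparameters and the constant gradients are assumptions chosen for the example.

```python
import math

def rmsprop_step(w, grad, square_avg, lr=0.01, alpha=0.99, eps=1e-8):
    """One RMSProp update for a scalar weight; returns (new_w, new_square_avg)."""
    square_avg = alpha * square_avg + (1 - alpha) * grad ** 2
    new_w = w - lr * grad / math.sqrt(square_avg + eps)
    return new_w, square_avg

# Two weights whose gradients differ by a factor of 1000.
moves = {}
for grad in (0.001, 1.0):
    w, square_avg = 0.0, 0.0
    for _ in range(200):
        w, square_avg = rmsprop_step(w, grad, square_avg)
    moves[grad] = abs(w)
    print(f"grad={grad}: total movement={moves[grad]:.4f}")
```

Plain SGD would move the second weight 1000 times further than the first. Here the two movements come out on the same order of magnitude, because each gradient is divided by its own recent size.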

__12. What do the squared values of the gradients indicate?__

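The moving average of the squared gradients shows how large a parameter's gradients have been recently, regardless of sign. It is large when the gradients are consistently large or swing widely, and small when they stay close to zero. Dividing by its square root gives parameters with small gradients a larger effective learning rate, and parameters with large or volatile gradients a smaller one.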

__13. How does Adam differ from momentum and RMSProp?__

Momentum uses an exponentially weighted moving average of the gradients as the update direction.

RMSProp keeps a moving average of the squared gradients and divides the gradient by its square root.

Adam mixes the two ideas together. It uses the moving average of the gradients as the direction and divides by the square root of the moving average of the squared gradients, which gives an adaptive learning rate for each param.

__14. Write out the equation for Adam.__

```python
beta1,beta2 = 0.9,0.999
w.avg = beta1 * w.avg + (1-beta1) * w.grad
unbias_avg = w.avg / (1 - (beta1**(i+1)))
w.sqr_avg = beta2 * w.sqr_avg + (1-beta2) * (w.grad ** 2)
new_w = w - lr * unbias_avg / sqrt(w.sqr_avg + eps)
```

__15. Calculate the values of `unbias_avg` and `w.avg` for a few batches of dummy values.__

Using scalar dummy gradients for a single weight, where `i` is the batch index:

```python
beta1 = 0.9
grads = [0.5, 0.4, 0.6, 0.5]  # dummy gradients, one per batch
w_avg = 0.

for i, grad in enumerate(grads):
    # Moving average of the gradients (biased toward 0 in the first batches)
    w_avg = beta1 * w_avg + (1-beta1) * grad
    # Bias correction: divide by 1 - beta1**(i+1)
    unbias_avg = w_avg / (1 - (beta1**(i+1)))
    print(f"batch {i}: w_avg={w_avg:.4f}, unbias_avg={unbias_avg:.4f}")
```

__16. What's the impact of having a high `eps` in Adam?__

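Start from the update `new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)`. Adam uses `unbias_avg` in place of `w.grad`, but the denominator has the same form.

As `eps` becomes larger, the denominator grows, so `w.grad / math.sqrt(w.square_avg + eps)` becomes smaller.

Therefore, `lr * w.grad / math.sqrt(w.square_avg + eps)` becomes smaller, and `new_w` stays closer to `w`.

A high `eps` also caps how far the adaptive scaling can raise the effective learning rate: with `eps` equal to 1 the denominator is at least 1, so the step is never larger than `lr` times the gradient term.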
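The reasoning above can be checked numerically. This is a sketch with made-up values: `step_size` is a hypothetical helper, and `square_avg` is set to `grad ** 2` as if the squared-gradient average had already converged.

```python
import math

def step_size(grad, square_avg, lr, eps):
    """Size of the update lr * grad / sqrt(square_avg + eps)."""
    return lr * grad / math.sqrt(square_avg + eps)

grad, lr = 1e-3, 0.01
square_avg = grad ** 2          # assumed converged value for a constant gradient
eps_values = (1e-8, 1e-5, 1e-2, 1.0)
steps = [step_size(grad, square_avg, lr, eps) for eps in eps_values]

for eps, step in zip(eps_values, steps):
    print(f"eps={eps:g}: step={step:.8f}")
```

The step shrinks as `eps` grows, and with `eps = 1` it is almost exactly `lr * grad`, which is what plain SGD would use.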



__17. Read through the optimizer notebook in fastai's repo, and execute it.__

__18. In what situations do dynamic learning rate methods like Adam change the behavior of weight decay?__

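Weight decay (subtracting `lr * wd * w` from the weights) and L2 regularization (adding `wd * w` to the gradients) are equivalent only for plain SGD. Whenever the optimizer transforms the gradient before applying it, as with momentum, RMSProp or Adam, the two give different updates.

In Adam, for example, an L2 term added to `w.grad` passes through both moving averages in the following:

```python
w.avg = beta1 * w.avg + (1-beta1) * w.grad
unbias_avg = w.avg / (1 - (beta1**(i+1)))
w.sqr_avg = beta2 * w.sqr_avg + (1-beta2) * (w.grad ** 2)
new_w = w - lr * unbias_avg / sqrt(w.sqr_avg + eps)
```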
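A sketch of the difference using momentum only, which is the simplest case to follow by hand. The function `run`, the symbol `wd` for the decay factor, and all values are assumptions for illustration. L2 regularization adds `wd * w` to the gradient, while weight decay subtracts `lr * wd * w` from the weight directly.

```python
def run(use_l2, beta, steps=5, lr=0.1, wd=0.1, w=1.0, grad=0.5):
    """Apply momentum updates with either L2 regularization or direct weight decay."""
    avg = 0.0
    for _ in range(steps):
        # L2 regularization: fold wd * w into the gradient.
        g = (grad + wd * w) if use_l2 else grad
        avg = beta * avg + (1 - beta) * g   # beta = 0 reduces to plain SGD
        if not use_l2:
            w = w - lr * wd * w             # weight decay: shrink the weight directly
        w = w - lr * avg
    return w

for beta in (0.0, 0.9):
    l2, decay = run(True, beta), run(False, beta)
    print(f"beta={beta}: L2={l2:.6f} weight_decay={decay:.6f}")
```

With `beta = 0` the two variants agree up to floating-point rounding. With `beta = 0.9` they diverge, because the L2 term is averaged and scaled along with the gradient.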

__19. What are the four steps of a training loop?__

```python
loss = loss_func(model(xb), yb)
loss.backward()
opt.step()
opt.zero_grad()
```

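That is: compute the loss from the model's predictions, compute the gradients, update the parameters, and reset the gradients. This is repeated for every mini-batch (`xb`, `yb`) in the training data.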

__20. Why is using callbacks better than writing a new training loop for each tweak you want to add?__

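Consistent and well defined: every tweak lives in its own callback with a clear interface, instead of being scattered through copies of the training loop.

Modularity: tweaks can be added, removed and combined without touching the training loop, whereas separately modified loops are hard to merge.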



__21. What aspects of the design of fastai's callback system make it as flexible as copying and pasting bits of code?__

A callback can read every piece of information available in the training loop, modify it, and control when a batch, an epoch or the whole training run is interrupted. fastai provides this functionality.
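The idea can be sketched with a toy loop that fires named events and lets callbacks read the loop's state or stop it. This is a simplified illustration under assumptions, not fastai's `Learner`: `ToyLearner`, `StopOnHighLoss` and the list of fake losses are hypothetical.

```python
class CancelFitException(Exception):
    """Raised by a callback to stop training early."""

class Callback:
    """Base class: events that a subclass does not define are skipped."""
    learn = None

    def __call__(self, event_name):
        handler = getattr(self, event_name, None)
        if handler is not None:
            handler()

class ToyLearner:
    """Hypothetical, heavily simplified training loop that fires events."""
    def __init__(self, losses, cbs):
        self.losses, self.cbs, self.loss, self.seen = losses, cbs, None, 0
        for cb in cbs:
            cb.learn = self  # give each callback access to the loop's state

    def _fire(self, event_name):
        for cb in self.cbs:
            cb(event_name)

    def fit(self):
        try:
            self._fire("begin_fit")
            for loss in self.losses:   # stand-in for training on one batch
                self.loss = loss
                self.seen += 1
                self._fire("after_batch")
        except CancelFitException:
            pass                       # a callback asked to stop
        self._fire("after_fit")

class StopOnHighLoss(Callback):
    def after_batch(self):
        if self.learn.loss > 10:
            raise CancelFitException

learner = ToyLearner(losses=[1.0, 0.8, 50.0, 0.5], cbs=[StopOnHighLoss()])
learner.fit()
print(learner.seen)  # 3: training stopped at the third batch
```

The loop itself never changes. New behaviour is added by writing another callback and passing it in `cbs`.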

__22. How can you get the list of events available to you when writing a callback?__

Type `event.` and hit Tab

__23. Write the `ModelResetter` callback (without peeking).__

```python
class ModelResetter(Callback):
    def begin_train(self): self.model.reset()
    def begin_validate(self): self.model.reset()
```

__24. How can you access the necessary attributes of the training loop inside a callback? When can you use or not use the shortcuts that go with them?__

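You define methods in the callback itself, one per event you want to handle. The names of these methods should match the event names printed from the events variable.

Inside them you can access attributes of the learner, such as the model and the data. The full list is given in the chapter. `self.model` is a shortcut for `self.learn.model`. The shortcuts work for reading an attribute but not for writing one: to change an attribute you have to assign to `self.learn.<name>`.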
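One way such a shortcut can behave is sketched below with `__getattr__`. This is an assumption-labelled toy, not fastai's code: `ToyLearner` and `ToyCallback` are hypothetical. It shows why reading through the shortcut works while assigning through it does not reach the learner.

```python
class ToyLearner:
    def __init__(self):
        self.loss = 1.0

class ToyCallback:
    """Hypothetical callback that forwards unknown attribute reads to its learner."""
    def __init__(self, learn):
        self.learn = learn

    def __getattr__(self, name):
        # Only called when normal lookup on the callback itself fails.
        return getattr(self.learn, name)

learn = ToyLearner()
cb = ToyCallback(learn)

print(cb.loss)        # reads learn.loss through the shortcut
cb.loss = 0.0         # creates a new attribute on the callback only
print(learn.loss)     # the learner still holds the old value
cb.learn.loss = 0.0   # writing through `learn` changes the learner
print(learn.loss)
```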

__25. Write the `TerminateOnNaN` callback (without peeking, if possible).__

```python
class TerminateOnNaN(Callback):
    run_before = Recorder
    def after_batch(self):
        if torch.isinf(self.loss) or torch.isnan(self.loss):
            raise CancelFitException
```

__26. How do you make sure your callback runs after or before another callback?__

Set the `run_before` or `run_after` class attribute on your callback to the callback class it should run before or after.
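How such ordering constraints could be resolved is sketched below. This is a toy ordering routine written for illustration, not fastai's implementation: `must_precede`, `sort_callbacks` and the `Logger` callback are hypothetical, and the classes are empty stand-ins.

```python
class Callback:
    run_before = None   # class this callback must run before, if any
    run_after = None    # class this callback must run after, if any

class Recorder(Callback):
    pass

class TerminateOnNaN(Callback):
    run_before = Recorder

class Logger(Callback):
    run_after = Recorder

def must_precede(a, b):
    """True if callback a has to run before callback b."""
    a_first = a.run_before is not None and isinstance(b, a.run_before)
    b_later = b.run_after is not None and isinstance(a, b.run_after)
    return a_first or b_later

def sort_callbacks(cbs):
    """Repeatedly pick a callback that no remaining callback must precede."""
    remaining, ordered = list(cbs), []
    while remaining:
        for cb in remaining:
            others = [o for o in remaining if o is not cb]
            if not any(must_precede(o, cb) for o in others):
                ordered.append(cb)
                remaining.remove(cb)
                break
        else:
            raise ValueError("circular run_before/run_after constraints")
    return ordered

cbs = sort_callbacks([Logger(), Recorder(), TerminateOnNaN()])
print([type(cb).__name__ for cb in cbs])
```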