Sheet 3.1: Gradient descent by hand
===================================

**Author:** Michael Franke



This short notebook optimizes a parameter with gradient descent without using PyTorch's optimizer.
The purpose is to demonstrate how vanilla gradient descent (GD) works under the hood.
We use the previous example of finding the MLE for a Gaussian mean.
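
As a reminder of why gradient descent should end up at the empirical mean, here is a short derivation sketch. It assumes the variance is fixed at 1; the symbols $\mu$ (the mean parameter), $x_i$ (the observations), $n$ (their number) and $\bar{x}$ (their empirical mean) are notation introduced here for illustration.

$$
\begin{aligned}
L(\mu) &= -\sum_{i=1}^{n} \log \mathcal{N}(x_i \mid \mu, 1)
        = \frac{n}{2}\log(2\pi) + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2 \\
\frac{dL}{d\mu} &= -\sum_{i=1}^{n}(x_i - \mu) = n\,(\mu - \bar{x})
\end{aligned}
$$

The gradient vanishes exactly at $\mu = \bar{x}$, so the minimizer of the negative log-likelihood is the empirical mean.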



Student: Jia Sheng (5371477)

## Packages



We will need the usual packages.



In [19]:
import torch
import warnings
warnings.filterwarnings('ignore')

In [20]:
torch.set_default_dtype(torch.float32)

## Training data



The training data are `nObs` samples from a standard normal distribution.



In [21]:
nObs           = 10000
trueLocation   = 0 # mean of a normal
trueDist       = torch.distributions.Normal(loc=trueLocation, scale=1.0)
trainData      = trueDist.sample([nObs])
empirical_mean = torch.mean(trainData)
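
For a normal distribution with unit variance, the gradient of the summed negative log-likelihood with respect to the mean is the number of observations times the difference between the current estimate and the sample mean. The following is a minimal sketch that checks this against autograd; the seed, the sample size of 1000, and the names `data`, `mu` and `closed_form` are illustrative assumptions, not part of the notebook.

```python
import torch

torch.manual_seed(0)  # seed chosen only to make this sketch reproducible

# Hypothetical small sample; any 1-D tensor of observations would do
data = torch.distributions.Normal(loc=0.0, scale=1.0).sample([1000])
mu = torch.tensor(1.0, requires_grad=True)

# Negative log-likelihood of the data under Normal(mu, 1)
nll = -torch.sum(torch.distributions.Normal(loc=mu, scale=1.0).log_prob(data))
nll.backward()

# Closed-form gradient: n * (mu - sample mean)
closed_form = data.numel() * (mu.detach() - data.mean())

# Both numbers should agree up to float32 rounding
print(mu.grad.item(), closed_form.item())
```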

## Training by manual gradient descent



We will actually train two parameters on the same data in parallel.
`location` will be updated by hand; `location2` will be updated with PyTorch's `SGD` optimizer.
We will use the same learning rate for both.



In [22]:
location       = torch.tensor(1.0, requires_grad=True)
location2      = torch.tensor(1.0, requires_grad=True)
learningRate   = 0.00001
nTrainingSteps = 100
opt = torch.optim.SGD([location2], lr = learningRate)
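
With these settings, each update should shrink the distance to the empirical mean by a constant factor. For a unit-variance normal, the gradient of the summed negative log-likelihood is the number of observations times (estimate minus empirical mean), so one step multiplies that distance by 1 - learningRate * nObs = 1 - 0.00001 * 10000 = 0.9. A minimal sketch of the resulting closed form follows; the function `predicted_location` and its argument names are hypothetical, and the empirical mean of 0.0 in the call is a placeholder rather than the notebook's actual value.

```python
def predicted_location(step, start, empirical_mean, learning_rate, n_obs):
    """Closed-form value of the estimate after `step` plain GD updates."""
    # Per-step contraction factor: 0.9 for lr = 1e-5 and n = 10000
    shrink = 1.0 - learning_rate * n_obs
    return empirical_mean + shrink ** step * (start - empirical_mean)

# Illustrative call with a placeholder empirical mean of exactly 0.0
for step in (0, 5, 10):
    print(step, predicted_location(step, 1.0, 0.0, 0.00001, 10000))
```

The same formula shows why the learning rate has to be scaled to the data size here: once it exceeds 2 / nObs, the factor has magnitude above 1 and the updates diverge.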

The training loop first updates `location` by hand and then updates `location2` using the built-in `SGD` optimizer.
Every 5 steps we print the current values of `location` and `location2`, as well as the difference between them.

But, oh no! What's this? There must be a bunch of mistakes in this code! See the exercise below.



In [23]:
print('\n%5s %15s %15s %15s' %
      ("step", "estimate", "estimate2", "difference") )

for i in range(nTrainingSteps):

    # manual computation
    prediction = torch.distributions.Normal(loc=location, scale=1.0)
    loss       = -torch.sum(prediction.log_prob(trainData))
    loss.backward()
    with torch.no_grad():
        # we must embed this update under 'torch.no_grad()' because we
        # do not want the update itself to be tracked by autograd
        location  -= learningRate * location.grad
    # reset the accumulated gradient by hand before the next backward pass
    location.grad = torch.tensor(0.0)

    # using PyTorch optimizer
    prediction2 = torch.distributions.Normal(loc=location2, scale=1.0)
    loss2       = -torch.sum(prediction2.log_prob(trainData))
    loss2.backward()
    opt.step()
    opt.zero_grad()

    # print output
    if (i+1) % 5 == 0:
        print('\n%5s %-2.14f %-2.14f %2.14f' %
              (i + 1, location.item(), location2.item(),
               location.item() - location2.item()) )


 step        estimate       estimate2      difference

    5 0.59388077259064 0.59388077259064 0.00000000000000

   10 0.35407140851021 0.35407140851021 0.00000000000000

   15 0.21246637403965 0.21246637403965 0.00000000000000

   20 0.12885002791882 0.12885002791882 0.00000000000000

   25 0.07947541028261 0.07947541028261 0.00000000000000

   30 0.05032019317150 0.05032019317150 0.00000000000000

   35 0.03310432657599 0.03310432657599 0.00000000000000

   40 0.02293853089213 0.02293853089213 0.00000000000000

   45 0.01693572849035 0.01693573035300 -0.00000000186265

   50 0.01339113619179 0.01339113712311 -0.00000000093132

   55 0.01129808928818 0.01129808928818 0.00000000000000

   60 0.01006216462702 0.01006216462702 0.00000000000000

   65 0.00933236535639 0.00933236535639 0.00000000000000

   70 0.00890142377466 0.00890142377466 0.00000000000000

   75 0.00864695757627 0.00864695757627 0.00000000000000

   80 0.00849669799209 0.00849669799209 0.00000000000000

   85 0.008407

> <strong><span style="color:#D83D2B;">Exercise 3.1.1: Understand vanilla gradient descent</span></strong>
>
> Find and correct all mistakes in this code block.
> When you are done, the parameters should show no difference at any update step, and they should both converge to the empirical mean.



### Answer to Exercise 3.1.1
See the code block above, in which two mistakes in the code have been corrected.
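
The printed differences at steps 45 and 50 (about 1.86e-9 and 9.3e-10) are not zero, but each matches one float32 spacing at the magnitude of the estimate on that line. This suggests a rounding effect rather than a remaining logic error; a plausible cause, stated here as an assumption, is that the manual update and the optimizer round an intermediate product differently. The helper `float32_ulp` below is a hypothetical illustration written in plain Python, not a PyTorch function.

```python
import math

def float32_ulp(x):
    """Spacing between adjacent float32 values near a positive normal x."""
    # math.frexp gives x = m * 2**e with 0.5 <= m < 1, so the
    # power-of-two exponent of x (with 1 <= mantissa < 2) is e - 1
    exponent = math.frexp(x)[1] - 1
    # float32 stores 23 fraction bits
    return 2.0 ** (exponent - 23)

print(float32_ulp(0.01693572849035))  # spacing near the step-45 estimate
print(float32_ulp(0.01339113619179))  # spacing near the step-50 estimate
```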