## PyTorch NN API

PyTorch already includes everything we need to train a linear model in fewer than 10 lines of code!

In [1]:
%matplotlib inline
import random
import torch
from torch.utils import data
from d2l import torch as d2l

In [2]:
import wandb
wandb.init(project='course') # specify the project of the current run

wandb: Currently logged in as: ingambe (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.6 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade


In [3]:
true_w = torch.tensor([2, -3.4, 5, 6])
true_b = 2.4
features, labels = d2l.synthetic_data(true_w, true_b, 2000)
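
For readers without the `d2l` package, here is a minimal sketch of what a helper like `d2l.synthetic_data` is assumed to do: draw standard normal features, apply the linear model, and add a little Gaussian noise. The name `make_synthetic_data` and the noise level `noise_std=0.01` are assumptions made for illustration.

```python
import torch

def make_synthetic_data(w, b, num_examples, noise_std=0.01):
    """Hypothetical stand-in for d2l.synthetic_data: y = Xw + b + noise."""
    X = torch.normal(0.0, 1.0, (num_examples, len(w)))  # standard normal features
    y = X @ w + b                                        # noiseless linear labels
    y += torch.normal(0.0, noise_std, y.shape)           # small Gaussian noise (assumed std)
    return X, y.reshape(-1, 1)                           # labels as a column vector

torch.manual_seed(0)
demo_w = torch.tensor([2, -3.4, 5, 6])
demo_features, demo_labels = make_synthetic_data(demo_w, 2.4, 2000)
print(demo_features.shape, demo_labels.shape)
```

Returning the labels with shape `(num_examples, 1)` matters later: it matches the shape of the model output, so the loss compares like with like.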

In [4]:
def load_array(data_arrays, batch_size, is_train=True):
    dataset = data.TensorDataset(*data_arrays)
    return data.DataLoader(dataset, batch_size, shuffle=is_train)

batch_size = 32
data_iter = load_array((features, labels), batch_size)

We can now iterate over minibatches.

In [5]:
next(iter(data_iter))

[tensor([[-4.7159e-01, -1.0637e-01, -1.2871e+00, -7.2477e-02],
         [-3.2422e-01, -8.2392e-01, -8.8265e-01,  9.7382e-01],
         [ 4.0551e-01, -1.4486e+00,  1.9984e+00,  7.4032e-01],
         [ 3.3081e-01, -1.9210e-01, -2.4347e+00,  1.5945e+00],
         [-4.0866e-01,  7.7159e-01, -3.7675e-01,  1.0669e-01],
         [ 1.1853e+00, -2.9667e-01,  1.9322e-01,  5.6487e-01],
         [-2.4877e-01, -9.2276e-01,  6.9874e-01, -1.8382e-01],
         [-9.1143e-01, -2.4409e-01, -6.3650e-01,  1.6281e-01],
         [-1.4892e+00, -2.5719e-01, -1.9283e+00, -5.5350e-01],
         [ 9.6049e-01,  3.6614e-01, -4.2779e-01,  5.0372e-01],
         [ 6.6394e-01,  9.1312e-02,  4.8933e-01,  3.7107e-01],
         [-5.7254e-01, -7.1862e-01,  5.2669e-01, -3.6386e-01],
         [-1.6643e-01, -1.6302e+00,  3.8755e-01,  6.1777e-01],
         [ 9.3432e-01, -3.3258e-01,  8.3840e-01,  3.2955e-03],
         [ 6.9672e-01,  3.1405e-01, -1.0044e+00, -3.4184e-01],
         [-1.3616e+00,  2.6360e-01, -6.1563e-01, -3.873
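
To see what a full pass over the data looks like, the sketch below rebuilds a loader on random stand-in tensors (the `demo_*` names are illustrative, not part of the notebook) and records the size of every minibatch. With 2000 examples and a batch size of 32, the last batch is smaller than the others.

```python
import torch
from torch.utils import data

torch.manual_seed(0)
demo_features = torch.randn(2000, 4)   # stand-in features
demo_labels = torch.randn(2000, 1)     # stand-in labels
demo_dataset = data.TensorDataset(demo_features, demo_labels)
demo_iter = data.DataLoader(demo_dataset, batch_size=32, shuffle=True)

# One epoch = one full pass over the loader
batch_sizes = [X.shape[0] for X, y in demo_iter]
print(len(batch_sizes), batch_sizes[0], batch_sizes[-1])
```

If a ragged final batch is a problem for your model, `DataLoader` accepts `drop_last=True` to discard it.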

For standard operations, we can **use a framework's predefined layers,**
which allow us to focus on the layers used to construct the model
rather than having to focus on the implementation.

The `Sequential` class defines a container
for several layers that will be chained together.
Given input data, a `Sequential` instance passes it through
the first layer, in turn passing the output
as the second layer's input and so forth.

A `Linear` layer is said to be *fully-connected*
because each of its inputs is connected to each of its outputs
by means of a matrix-vector multiplication.
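
The two ideas above can be checked directly. This small sketch (all `demo_*` names are illustrative) verifies that a `Linear` layer computes `X @ W.T + b`, and that a `Sequential` container simply feeds each layer's output to the next layer.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_X = torch.randn(3, 4)

# A fully-connected layer is a matrix multiplication plus a bias
demo_layer = nn.Linear(4, 1)
manual_out = demo_X @ demo_layer.weight.T + demo_layer.bias
layer_matches = torch.allclose(demo_layer(demo_X), manual_out)

# Sequential chains layers: the output of one is the input of the next
demo_chain = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
step_by_step = demo_chain[2](demo_chain[1](demo_chain[0](demo_X)))
chain_matches = torch.allclose(demo_chain(demo_X), step_by_step)

print(layer_matches, chain_matches)
```

Note that the weight of `nn.Linear(4, 1)` has shape `(1, 4)`, that is `(out_features, in_features)`, hence the transpose.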

In [6]:
# `nn` is an abbreviation for neural networks
from torch import nn

net = nn.Sequential(nn.Linear(4, 1))

We need to initialize the model parameters. By default, PyTorch initializes the weights of a `Linear` layer from a uniform distribution whose range depends on the size of the layer.
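
As a quick check of the default, the sketch below assumes that `nn.Linear` samples both weights and biases from a uniform distribution on `[-1/sqrt(in_features), 1/sqrt(in_features)]`, which is the behavior of recent PyTorch releases; the layer size is arbitrary.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
demo_layer = nn.Linear(100, 50)                 # freshly created, default init
bound = 1 / math.sqrt(demo_layer.in_features)   # assumed bound: 1/sqrt(100) = 0.1

max_weight = demo_layer.weight.abs().max().item()
max_bias = demo_layer.bias.abs().max().item()
print(f'bound={bound:.3f}, max |w|={max_weight:.3f}, max |b|={max_bias:.3f}')
```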

You should **always** initialize your layers explicitly.

<center><img src="images/weights init.jpeg" /></center>

Gradient descent doesn't move you far from the starting point, so the initial weights matter.

The literature is full of different weight initialization techniques.

You can write your own:

In [7]:
net[0].weight.data.normal_(0, 0.01) # net[0] is the first layer
net[0].bias.data.fill_(0)
net[0].weight.data

tensor([[ 0.0198,  0.0025, -0.0127, -0.0071]])

99.9999% of the time you will use one from the literature: [see the PyTorch init documentation](https://pytorch.org/docs/stable/nn.init.html).

I recommend **Xavier normal**. It usually works well.
If you have the time and resources, you can try different initializations and pick the best ;)

In [8]:
def _weights_init(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_normal_(m.weight)
        m.bias.data.zero_()
        
net.apply(_weights_init)
net[0].weight.data

tensor([[0.0697, 0.9464, 0.7670, 0.0187]])

In [9]:
net[0].bias.data

tensor([0.])
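
For intuition about what Xavier normal does: with the default gain of 1, it draws weights from a zero-mean normal distribution with standard deviation `sqrt(2 / (fan_in + fan_out))`. The sketch below checks this empirically on a deliberately wide layer (the sizes are arbitrary, chosen only so the estimate is stable).

```python
import math
import torch
from torch import nn

torch.manual_seed(0)
wide_layer = nn.Linear(400, 600)            # fan_in=400, fan_out=600
nn.init.xavier_normal_(wide_layer.weight)

expected_std = math.sqrt(2.0 / (400 + 600))
empirical_std = wide_layer.weight.std().item()
print(f'expected {expected_std:.4f}, empirical {empirical_std:.4f}')
```

For our tiny `Linear(4, 1)` layer the same formula gives a standard deviation of `sqrt(2 / 5)`, about 0.63, which is why the weights printed above are much larger than with `normal_(0, 0.01)`.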

Then we need to define the loss function we will use.  
The `MSELoss` class computes the mean squared error (also known as the squared $L_2$ norm).  
By default, it returns the average loss over examples.

In [10]:
loss = nn.MSELoss()
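
To make the averaging concrete, here is a small hand-checkable sketch comparing `nn.MSELoss` with a manual computation. The tensors are made-up values; `reduction='sum'` is the standard option that returns the total instead of the average.

```python
import torch
from torch import nn

demo_pred = torch.tensor([[2.5], [0.0], [2.0]])
demo_target = torch.tensor([[3.0], [-0.5], [2.0]])

mean_loss = nn.MSELoss()(demo_pred, demo_target)                # default: average
sum_loss = nn.MSELoss(reduction='sum')(demo_pred, demo_target)  # total instead
manual_mean = ((demo_pred - demo_target) ** 2).mean()           # (0.25 + 0.25 + 0) / 3

print(mean_loss.item(), sum_loss.item(), manual_mean.item())
```

Predictions and targets should have the same shape. Our network outputs shape `(batch, 1)`, so the labels must be a column vector as well; otherwise broadcasting silently produces the wrong loss.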

Weights & Biases can keep an eye on your model, logging its structure and gradients.

In [11]:
wandb.watch(net, log="all", criterion=loss, log_freq=1, log_graph=True)  # log frequency depends on your training

wandb: logging graph, to disable use `wandb.watch(log_graph=False)`


[<wandb.wandb_torch.TorchGraph at 0x7fb63f7cc3a0>]

The last piece of the puzzle is the optimizer.  
When we instantiate an `SGD` instance, we specify the parameters to optimize over (obtainable from our net via `net.parameters()`) together with the hyperparameters required by the optimization algorithm, passed as keyword arguments.  
Minibatch stochastic gradient descent only requires that we set the learning rate `lr`, which is set to 0.03 here.

In [12]:
optim = torch.optim.SGD(net.parameters(), lr=3e-2)

In [13]:
next(net.parameters())

Parameter containing:
tensor([[0.0697, 0.9464, 0.7670, 0.0187]], requires_grad=True)
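
To demystify what the optimizer does, this sketch checks that one plain SGD step (no momentum, no weight decay) is just `w <- w - lr * grad`. It uses a throwaway layer and random data; every `demo_*` name is illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
demo_net = nn.Linear(4, 1)
demo_optim = torch.optim.SGD(demo_net.parameters(), lr=3e-2)

demo_X = torch.randn(8, 4)
demo_y = torch.randn(8, 1)
demo_loss = nn.MSELoss()(demo_net(demo_X), demo_y)

demo_optim.zero_grad()
demo_loss.backward()                              # fills .grad on every parameter

w_before = demo_net.weight.detach().clone()
w_grad = demo_net.weight.grad.clone()
demo_optim.step()                                 # applies the update in place

expected_w = w_before - 3e-2 * w_grad             # manual SGD update
print(torch.allclose(demo_net.weight, expected_w))
```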

Let's put everything together!

The training loop itself is strikingly similar to what we did when implementing everything from scratch.

For each minibatch, we go through the following ritual:

* Generate predictions by calling `net(X)` and calculate the loss `l` (the forward propagation).
* Calculate gradients by running the backpropagation.
* Update the model parameters by invoking our optimizer.

For good measure, we compute the loss after each epoch and print it to monitor progress.

In [14]:
num_epochs = 10
for epoch in range(num_epochs):
    for X, y in data_iter:
        l = loss(net(X), y)
        optim.zero_grad() # please don't forget!
        l.backward() # backpropagate: compute the gradient of the loss w.r.t. the parameters
        optim.step() # do a step in the gradient direction
    with torch.no_grad():
        l = loss(net(features), labels) 
        print(f'epoch {epoch + 1}, loss {l.item()}')
        wandb.log({'loss': l.item()}, step=epoch)

epoch 1, loss 0.029340332373976707
epoch 2, loss 0.00010852736158994958
epoch 3, loss 9.307710570283234e-05
epoch 4, loss 9.297434735344723e-05
epoch 5, loss 9.329492604592815e-05
epoch 6, loss 9.327488805865869e-05
epoch 7, loss 9.331026376457885e-05
epoch 8, loss 9.315063653048128e-05
epoch 9, loss 9.358949318993837e-05
epoch 10, loss 9.331631736131385e-05


# ⚠️ NEVER FORGET TO ZERO_GRAD THE OPTIMIZER ⚠️

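By default, PyTorch accumulates gradients: every call to `backward()` adds to the `.grad` already stored on each parameter!

If you don't reset them to 0, the gradients of previous minibatches are kept and summed with the new ones!

If your model doesn't converge, check this first!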
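
The accumulation is easy to observe on a toy parameter. In this sketch (illustrative names, a made-up two-element weight), the gradient of `sum(w * x)` with respect to `w` is `x`, so a second `backward()` without resetting doubles the stored gradient.

```python
import torch

demo_w = torch.tensor([1.0, 2.0], requires_grad=True)
demo_x = torch.tensor([3.0, 4.0])
demo_opt = torch.optim.SGD([demo_w], lr=0.1)

(demo_w * demo_x).sum().backward()
first_grad = demo_w.grad.clone()      # gradient is x

(demo_w * demo_x).sum().backward()    # no zero_grad: gradients add up
second_grad = demo_w.grad.clone()     # now 2 * x

demo_opt.zero_grad()                  # reset before the next backward pass
(demo_w * demo_x).sum().backward()
third_grad = demo_w.grad.clone()      # back to x

print(first_grad, second_grad, third_grad)
```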

Now let's compare the true parameters and the learned ones:

In [15]:
w = net[0].weight.data
print('error in estimating w:', true_w - w.reshape(true_w.shape))
b = net[0].bias.data
print('error in estimating b:', true_b - b)

error in estimating w: tensor([-8.2254e-05,  1.7762e-04, -2.9421e-04,  5.2738e-04])
error in estimating b: tensor([-0.0003])


We can save and load models to reuse them later.

In [16]:
net.state_dict()

OrderedDict([('0.weight', tensor([[ 2.0001, -3.4002,  5.0003,  5.9995]])),
             ('0.bias', tensor([2.4003]))])

In [17]:
torch.save(net.state_dict(), 'my_model.pt')

This saves the model's parameters as a dictionary, but **doesn't save the structure** of the neural network, so we have to rebuild the same architecture before loading the weights.

In [18]:
new_model = nn.Sequential(nn.Linear(4, 1))
new_model.load_state_dict(torch.load('my_model.pt'))
new_model.state_dict()

OrderedDict([('0.weight', tensor([[ 2.0001, -3.4002,  5.0003,  5.9995]])),
             ('0.bias', tensor([2.4003]))])
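
As a self-contained sketch of the full round trip, the example below saves a throwaway model to a temporary directory, restores it into a fresh instance, and checks that both produce identical predictions. It also shows what happens when the rebuilt architecture does not match. The file name and the `demo`-style variable names are illustrative.

```python
import os
import tempfile
import torch
from torch import nn

torch.manual_seed(0)
original = nn.Sequential(nn.Linear(4, 1))
sample = torch.randn(5, 4)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pt')
    torch.save(original.state_dict(), path)

    # Same architecture, fresh random weights, then overwrite with the saved ones
    restored = nn.Sequential(nn.Linear(4, 1))
    restored.load_state_dict(torch.load(path))

    # A different architecture cannot absorb the saved tensors
    wrong = nn.Sequential(nn.Linear(3, 1))
    try:
        wrong.load_state_dict(torch.load(path))
        mismatch_detected = False
    except RuntimeError:
        mismatch_detected = True

restored.eval()  # switch to inference mode before reusing the model
with torch.no_grad():
    same_predictions = torch.equal(original(sample), restored(sample))

print(same_predictions, mismatch_detected)
```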