A typical training procedure for a neural network is as follows:

-   Define the neural network that has some learnable parameters (or
    weights)
-   Iterate over a dataset of inputs
-   Process input through the network
-   Compute the loss (how far is the output from being correct)
-   Propagate gradients back into the network's parameters
-   Update the weights of the network, typically using a simple update
    rule: `weight = weight - learning_rate * gradient`
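
The steps above can be sketched end to end without any framework. The following is a minimal illustration, not the network built in this tutorial: it assumes a hypothetical one-parameter model `y = weight * x`, a squared-error loss, and a gradient derived by hand, so that the update rule is visible on its own.

```python
# Toy data generated by y = 3 * x (an assumption made for this sketch).
inputs = [1.0, 2.0, 3.0, 4.0]
targets = [3.0 * x for x in inputs]

weight = 0.0          # the single learnable parameter
learning_rate = 0.01

for epoch in range(200):
    for x, target in zip(inputs, targets):          # iterate over the dataset
        output = weight * x                         # process input through the "network"
        loss = (output - target) ** 2               # compute the loss
        gradient = 2 * (output - target) * x        # d(loss)/d(weight), derived by hand
        weight = weight - learning_rate * gradient  # apply the update rule

print(round(weight, 3))
```

After training, `weight` should end up very close to 3, the slope used to generate the toy data.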

## Define the network

Let's define this network:


In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super().__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5*5 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, input):
        # Convolution layer C1: 1 input image channel, 6 output channels,
        # 5x5 square convolution, it uses ReLU activation function, and
        # outputs a Tensor with size (N, 6, 28, 28), where N is the size of the
        # batch
        c1 = F.relu(self.conv1(input))

        # Subsampling layer S2: 2x2 grid, purely functional, this layer
        # does not have any parameter, and outputs a (N, 6, 14, 14) Tensor
        s2 = F.max_pool2d(c1, (2, 2))

        # Convolution layer C3: 6 input channels, 16 output channels,
        # 5x5 square convolution, it uses ReLU activation function, and
        # outputs a (N, 16, 10, 10) Tensor
        c3 = F.relu(self.conv2(s2))

        # Subsampling layer S4: 2x2 grid, purely functional, this layer does
        # not have any parameters, and outputs a (N, 16, 5, 5) Tensor
        s4 = F.max_pool2d(c3, (2, 2))

        # Flatten operation: purely functional, outputs a (N, 400) Tensor
        s4 = torch.flatten(s4, 1)

        # Fully connected layer F5: (N, 400) Tensor input, and outputs
        # a (N, 120) Tensor, it uses ReLU activation function
        f5 = F.relu(self.fc1(s4))

        # Fully connected layer F6: (N, 120) Tensor input, and outputs
        # a (N, 84) Tensor, it uses ReLU activation function
        f6 = F.relu(self.fc2(f5))

        # Output layer: fully connected, takes a (N, 84) Tensor as input
        # and outputs a (N, 10) Tensor
        output = self.fc3(f6)

        return output


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You just have to define the `forward` function, and the `backward` function (where gradients are computed) is automatically defined for you using `autograd`. You can use any of the Tensor operations in the `forward` function.
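
As a small illustration of autograd supplying the backward pass, the sketch below defines only a forward computation on a plain tensor and compares the gradient autograd produces with the derivative worked out by hand. The function `forward` and the input values are assumptions chosen for this example.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)


def forward(t):
    # Any differentiable Tensor operations can be combined here.
    return (t ** 2).sum() + torch.sin(t).sum()


y = forward(x)
y.backward()  # autograd derives the backward pass from the forward code

# Hand-derived gradient: d/dx (x^2 + sin x) = 2x + cos x
expected = 2 * x.detach() + torch.cos(x.detach())
print(torch.allclose(x.grad, expected))
```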

The learnable parameters of a model are returned by `net.parameters()`:


In [2]:
params = list(net.parameters())
print(len(params))
for name, param in net.named_parameters():
    print(f"{name}: {param.size()}")

10
conv1.weight: torch.Size([6, 1, 5, 5])
conv1.bias: torch.Size([6])
conv2.weight: torch.Size([16, 6, 5, 5])
conv2.bias: torch.Size([16])
fc1.weight: torch.Size([120, 400])
fc1.bias: torch.Size([120])
fc2.weight: torch.Size([84, 120])
fc2.bias: torch.Size([84])
fc3.weight: torch.Size([10, 84])
fc3.bias: torch.Size([10])
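
The shapes listed above are enough to work out how many learnable values the network holds. A minimal sketch in plain Python, with the shapes copied by hand from the listing (the `shapes` dictionary is an illustration, not something PyTorch provides):

```python
from math import prod

# Shapes copied from the named_parameters() listing above.
shapes = {
    "conv1.weight": (6, 1, 5, 5),
    "conv1.bias": (6,),
    "conv2.weight": (16, 6, 5, 5),
    "conv2.bias": (16,),
    "fc1.weight": (120, 400),
    "fc1.bias": (120,),
    "fc2.weight": (84, 120),
    "fc2.bias": (84,),
    "fc3.weight": (10, 84),
    "fc3.bias": (10,),
}

# Number of scalar values in each parameter tensor.
counts = {name: prod(shape) for name, shape in shapes.items()}
total = sum(counts.values())
print(total)  # 61706
```

On the live model, the same number can be obtained with `sum(p.numel() for p in net.parameters())`.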


Let's try a random 32x32 input. Note: the expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, resize the images from the dataset to 32x32.


In [3]:
input = torch.rand(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.0076, -0.0047, -0.0241, -0.1185, -0.1154,  0.1330,  0.0323,  0.1535,
          0.0278,  0.1174]], grad_fn=<AddmmBackward0>)
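
The 32x32 requirement follows from the layer arithmetic: `fc1` expects `16 * 5 * 5 = 400` features, and only a 32x32 input shrinks to 5x5 after two 5x5 convolutions and two 2x2 poolings. A minimal sketch of that calculation in plain Python; the helpers `conv_out` and `flat_features_for` are hypothetical names introduced here, and the formula assumes no dilation:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flat_features_for(side):
    """Number of features reaching fc1 for a square input of the given side."""
    side = conv_out(side, kernel=5)            # conv1: 5x5 kernel, stride 1
    side = conv_out(side, kernel=2, stride=2)  # 2x2 max pool
    side = conv_out(side, kernel=5)            # conv2: 5x5 kernel, stride 1
    side = conv_out(side, kernel=2, stride=2)  # 2x2 max pool
    return 16 * side * side                    # 16 channels after conv2


print(flat_features_for(32))  # 400, matches fc1's in_features
print(flat_features_for(28))  # 256, too few for fc1 without resizing
```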


Zero the gradient buffers of all parameters and backprop with random gradients:


In [4]:
net.zero_grad()
out.backward(torch.randn(1, 10))

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p><code>torch.nn</code> only supports mini-batches. The entire <code>torch.nn</code> package only supports inputs that are a mini-batch of samples, and not a single sample. For example, <code>nn.Conv2d</code> will take in a 4D Tensor of <code>nSamples x nChannels x Height x Width</code>. If you have a single sample, just use <code>input.unsqueeze(0)</code> to add a fake batch dimension.</p>

</div>
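
As a short illustration of the note above, the sketch below takes a single image without a batch dimension, adds one with `unsqueeze(0)`, and passes the result through a convolution configured like `conv1`. The variable names `sample` and `batch` are assumptions made for this example.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)        # same configuration as conv1 in the network above
sample = torch.rand(1, 32, 32)   # one image: channels x height x width
batch = sample.unsqueeze(0)      # add a batch dimension of size 1 at position 0
out = conv(batch)

print(batch.shape)  # torch.Size([1, 1, 32, 32])
print(out.shape)    # torch.Size([1, 6, 28, 28])
```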

Before proceeding further, let's recap all the classes you've seen so
far.

**Recap:**

-   `torch.Tensor` - A _multi-dimensional array_ with support for
    autograd operations like `backward()`. Also _holds the gradient_
    w.r.t. the tensor.
-   `nn.Module` - Neural network module. _Convenient way of
    encapsulating parameters_, with helpers for moving them to GPU,
    exporting, loading, etc.
-   `nn.Parameter` - A kind of Tensor, that is _automatically
    registered as a parameter when assigned as an attribute to a_
    `Module`.
-   `autograd.Function` - Implements _forward and backward
    definitions of an autograd operation_. Every `Tensor` operation
    creates at least a single `Function` node that connects to
    functions that created a `Tensor` and _encodes its history_.
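
The automatic registration of `nn.Parameter` can be seen in a small sketch. The module `Scale` and its attributes are hypothetical names used only for this illustration: one attribute is wrapped in `nn.Parameter`, the other is a plain tensor.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(3))  # registered as a parameter
        self.offset = torch.zeros(3)             # plain tensor, not registered

    def forward(self, x):
        return x * self.gain + self.offset


names = [name for name, _ in Scale().named_parameters()]
print(names)  # ['gain']
```

Only `gain` appears in `named_parameters()`, so only `gain` would be seen by an optimizer constructed from the module's parameters.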

**At this point, we covered:**

-   Defining a neural network
-   Processing inputs and calling backward

**Still Left:**

-   Computing the loss
-   Updating the weights of the network

# Loss Function

A loss function takes the (output, target) pair of inputs, and computes
a value that estimates how far away the output is from the target.

There are several different [loss
functions](https://pytorch.org/docs/nn.html#loss-functions) under the `nn`
package. A simple loss is `nn.MSELoss`, which computes the mean-squared
error between the output and the target.

For example:


In [5]:
output = net(input)
target = torch.rand(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.3166, grad_fn=<MseLossBackward0>)
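
To make the mean-squared error concrete, the sketch below computes it by hand and compares it with `nn.MSELoss` using its default settings. The random `output` here is a stand-in for `net(input)`, an assumption made so the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.rand(1, 10)  # stand-in for the network output
target = torch.rand(1, 10)  # dummy target with the same shape

builtin = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()  # mean of the squared differences

print(torch.allclose(builtin, manual))
```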


Now, if you follow `loss` in the backward direction, using its
`.grad_fn` attribute, you will see a graph of computations that looks
like this:

```{.sh}
input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> flatten -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
```

So, when we call `loss.backward()`, the whole graph is differentiated
w.r.t. the neural net parameters, and all Tensors in the graph that have
`requires_grad=True` will have their `.grad` Tensor accumulated with the
gradient.

For illustration, let us follow a few steps backward:


In [7]:
fn = loss.grad_fn  # MSELoss
while fn:
    print(fn)
    if not hasattr(fn, "next_functions"):
        break
    next_fns = [n[0] for n in fn.next_functions if n[0] is not None]
    if not next_fns:
        break
    fn = next_fns[0]


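# Optional: render the autograd graph to graph.png with torchviz.
# This needs the torchviz package and the Graphviz `dot` executable on PATH.
from torchviz import make_dot

make_dot(loss, params=dict(net.named_parameters())).render(
    "graph", format="png"
)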

<MseLossBackward0 object at 0x1329523b0>
<AddmmBackward0 object at 0x132952020>
<AccumulateGrad object at 0x1329523b0>


ExecutableNotFound: failed to execute PosixPath('dot'), make sure the Graphviz executables are on your systems' PATH

# Backprop

To backpropagate the error, all we have to do is call `loss.backward()`.
You need to clear the existing gradients first, though; otherwise the new
gradients will be accumulated into the existing ones.
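
The accumulation behaviour is easy to observe on a tiny tensor. This is a minimal sketch with a hypothetical two-element weight `w`, separate from the network above:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()   # gradient of the first backward pass

(w * 3).sum().backward()
second = w.grad.clone()  # second pass was added on top of the first

w.grad.zero_()           # clear the buffer, as net.zero_grad() does for a model
(w * 3).sum().backward()
third = w.grad.clone()   # back to the gradient of a single pass

print(first, second, third)
```

Here `second` is twice `first` because the buffer was not cleared between the two calls, while `third` matches `first` again.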

Now we shall call `loss.backward()`, and have a look at conv1's bias
gradients before and after the backward.


In [8]:
net.zero_grad()  # zeros the gradient buffers of all parameters

print("conv1.bias.grad before backward")
print(net.conv1.bias.grad)

loss.backward()

print("conv1.bias.grad after backward")
print(net.conv1.bias.grad)

conv1.bias.grad before backward
None
conv1.bias.grad after backward
tensor([6.5913e-03, 1.4019e-02, 5.4263e-03, 7.7456e-03, 1.3549e-05, 2.5218e-03])


Now, we have seen how to use loss functions.

**Read Later:**

> The neural network package contains various modules and loss functions
> that form the building blocks of deep neural networks. A full list
> with documentation is [here](https://pytorch.org/docs/nn).

**The only thing left to learn is:**

> -   Updating the weights of the network

# Update the weights

The simplest update rule used in practice is Stochastic Gradient
Descent (SGD):

```{.python}
weight = weight - learning_rate * gradient
```

We can implement this using simple Python code:

```{.python}
learning_rate = 0.01
with torch.no_grad():  # keep the in-place update out of the autograd graph
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)
```

However, as you use neural networks, you will want to try different
update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable
this, we built a small package, `torch.optim`, that implements all these
methods. Using it is very simple:

```{.python}
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
```
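
Putting the pieces together, the following is a minimal sketch of a complete training loop. To keep it short and self-contained it does not use the `Net` class from this tutorial; it assumes a hypothetical single `nn.Linear` layer, synthetic data, and a learning rate of 0.1 chosen for this toy problem.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy regression problem (an assumption for this sketch):
# the target is the sum of the three input features.
model = nn.Linear(3, 1)
inputs = torch.rand(16, 3)
targets = inputs.sum(dim=1, keepdim=True)

criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(100):
    optimizer.zero_grad()              # clear gradients from the previous step
    output = model(inputs)             # forward pass
    loss = criterion(output, targets)  # compute the loss
    loss.backward()                    # backpropagate
    optimizer.step()                   # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The recorded loss should shrink over the 100 steps, which is a quick sanity check that the zero-grad, forward, backward, step order is wired correctly.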
