In [0]:
#%matplotlib inline


Neural Networks
===============

Use the torch.nn package to build a neural network.

The previous lecture introduced ``autograd``. The ``nn`` package depends on ``autograd`` to define models and differentiate them.
An ``nn.Module`` contains layers and a method ``forward(input)`` that returns the ``output``.

For example, consider this network, which classifies digit images:

![](https://pytorch.org/tutorials/_images/mnist.png)

It is a simple feed-forward network: it takes the input, passes it through several layers one after another, and finally produces the output.

A typical training procedure for a neural network is as follows:

1. Define a neural network with some learnable parameters (or weights)
2. Iterate over a dataset of inputs
3. Process each input through the network
4. Calculate the loss (how far the output is from the correct value)
5. Propagate the gradients back into the network's parameters
6. Update the network's weights, typically using a simple update rule:
``weight = weight - learning_rate * gradient``
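
The six steps above can be sketched on a toy problem. This is only an illustrative sketch, not the network from this lecture: it assumes synthetic data drawn from the line y = 3x + 1 and two hypothetical parameters, ``weight`` and ``bias``, which are updated by hand with the rule above.

```python
import torch

# Hypothetical toy dataset: points on the line y = 3x + 1
xs = torch.linspace(-1, 1, 20).unsqueeze(1)   # shape (20, 1)
ys = 3 * xs + 1

# Step 1: define the learnable parameters
weight = torch.zeros(1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
learning_rate = 0.1

losses = []
for step in range(200):                  # Step 2: iterate (one full batch per step)
    pred = xs * weight + bias            # Step 3: process the input
    loss = ((pred - ys) ** 2).mean()     # Step 4: calculate the loss
    loss.backward()                      # Step 5: propagate the gradients
    with torch.no_grad():                # Step 6: weight = weight - lr * gradient
        weight -= learning_rate * weight.grad
        bias -= learning_rate * bias.grad
    weight.grad.zero_()                  # clear gradients before the next step
    bias.grad.zero_()
    losses.append(loss.item())

print(weight.item(), bias.item(), losses[-1])
```

After the loop, ``weight`` and ``bias`` should be close to 3 and 1, and the last recorded loss should be far below the first.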

Create a network
----------------




In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 10 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.conv2 = nn.Conv2d(10, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 10, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(10, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You only have to define the ``forward`` function; the ``backward`` function (where gradients are computed) is defined for you automatically by ``autograd``. You can use any Tensor operation in the ``forward`` function.
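
To illustrate that only ``forward`` needs to be written, here is a small sketch with a hypothetical module, ``Scale``, which is not part of the lecture's network. It holds one learnable scalar ``alpha`` and uses ordinary tensor operations in ``forward``; ``autograd`` supplies the backward pass.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    """Hypothetical module: scales the input by a learnable scalar,
    squares the result and sums it."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(2.0))

    def forward(self, x):
        # Any tensor operations may be used here
        return (self.alpha * x).pow(2).sum()


scale = Scale()
values = torch.tensor([1.0, 2.0, 3.0])
result = scale(values)   # (2*1)^2 + (2*2)^2 + (2*3)^2 = 56
result.backward()        # no backward method was written by hand

# d/d(alpha) sum((alpha * x)^2) = 2 * alpha * sum(x^2) = 2 * 2 * 14 = 56
print(result.item(), scale.alpha.grad.item())
```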

``net.parameters()`` returns the learnable parameters (weights) of the model.



In [4]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([10, 1, 5, 5])
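
When inspecting parameters, it can help to see their names as well as their shapes. The sketch below uses ``named_parameters()`` on a hypothetical stand-in model, ``tiny``, rather than the ``Net`` defined above, so that it runs on its own.

```python
import torch.nn as nn

# Hypothetical stand-in model, used only to keep the example self-contained
tiny = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2))

# named_parameters() yields (name, parameter) pairs
names = []
for name, param in tiny.named_parameters():
    names.append(name)
    print(name, tuple(param.shape), param.requires_grad)

# Total number of learnable scalars: (4*3 + 3) + (3*2 + 2) = 23
total = sum(p.numel() for p in tiny.parameters())
print(total)
```

Each layer with weights contributes a ``weight`` and a ``bias`` entry, which is why the five-layer ``Net`` above reports 10 parameter tensors.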



Note: The expected input size of this network (LeNet) is 32 × 32. If you use the MNIST dataset to train this network, please resize the image to 32 × 32.
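
The 32 × 32 requirement follows from the layer arithmetic: ``fc1`` expects ``16 * 5 * 5`` features. The sketch below traces the spatial size through the two convolution and pooling stages. The helper names ``conv_out``, ``pool_out`` and ``flat_features`` are hypothetical, and the helpers assume stride-1, unpadded 5x5 convolutions and 2x2 pooling, as in the network above.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Output side length of max pooling whose stride equals the window."""
    return size // window


def flat_features(side, channels=16):
    """Number of features reaching fc1 for a square input of the given side."""
    side = pool_out(conv_out(side, 5), 2)   # conv1 (5x5) then 2x2 pooling
    side = pool_out(conv_out(side, 5), 2)   # conv2 (5x5) then 2x2 pooling
    return channels * side * side


print(flat_features(32))   # 32 -> 28 -> 14 -> 10 -> 5, so 16 * 5 * 5 = 400
print(flat_features(28))   # 28 -> 24 -> 12 -> 8 -> 4, so 16 * 4 * 4 = 256
```

A 28 × 28 MNIST image would therefore deliver 256 features to a layer that expects 400, which is why the images need to be resized.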


In [6]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[-0.0315,  0.0076,  0.0138,  0.0684,  0.0567, -0.0410,  0.0404,  0.0657,
         -0.0451, -0.0593]], grad_fn=<AddmmBackward>)


Zero the gradient buffers of all parameters, then backpropagate with random gradients:



In [10]:
net.zero_grad()
out.backward(torch.randn(1, 10))



## Note
``torch.nn`` only supports mini-batches: the entire package expects inputs that are a mini-batch of samples, not a single sample.

For example, ``nn.Conv2d`` takes a 4-dimensional tensor of shape ``nSamples x nChannels x Height x Width``.

If you have a single sample, use ``input.unsqueeze(0)`` to add a fake batch dimension.
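
As a brief sketch of the ``unsqueeze(0)`` step, the example below assumes a hypothetical single-channel 32 × 32 image stored without a batch dimension and passes it through a convolution shaped like ``conv1``.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 10, 5)                # same shape as conv1 above

single_image = torch.randn(1, 32, 32)     # nChannels x Height x Width
batched = single_image.unsqueeze(0)       # nSamples x nChannels x Height x Width

features = conv(batched)
print(batched.shape)    # torch.Size([1, 1, 32, 32])
print(features.shape)   # torch.Size([1, 10, 28, 28])
```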

Before continuing, let's review the classes used so far.

**Recap:**
  * ``torch.Tensor``: a *multi-dimensional array* with support for autograd operations such as ``backward()``. It also *holds the gradient* w.r.t. the tensor.
  * ``nn.Module``: a neural network module. A convenient way of encapsulating parameters, with helpers for moving them to the GPU, exporting, loading, etc.
  * ``nn.Parameter``: a kind of Tensor that is *automatically registered as a parameter* when assigned as an attribute to a ``Module``.
  * ``autograd.Function``: implements the *forward and backward definitions* of an autograd operation. Every ``Tensor`` operation creates at least one ``Function`` node, which connects to the functions that created the ``Tensor`` and *encodes its history*.

**The key steps are:**
 

*    Create a network
*    Run the forward pass on the input
*    Calculate the loss, then run the backward pass
*    Update the network weights

  



Loss function
-------------
A loss function takes an (output, target) pair as input and computes a value that estimates how far the network output is from the target.

***Translator's Note: output is the output of the network, and target is the actual value***

The nn package provides several different [loss functions](https://pytorch.org/docs/nn.html#loss-functions).
``nn.MSELoss`` is a simple one: it computes the **mean squared error** between the output and the target.
For example:

In [13]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.9499, grad_fn=<MseLossBackward>)
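
To make the definition concrete, the sketch below computes the mean squared error by hand and compares it with ``nn.MSELoss``. The tensors ``pred`` and ``truth`` are hypothetical random stand-ins for the network output and the target.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pred = torch.randn(1, 10)    # stand-in for the network output
truth = torch.randn(1, 10)   # stand-in for the target

# Mean of the squared element-wise differences
by_hand = ((pred - truth) ** 2).mean()

# The same quantity from the built-in loss (default reduction is the mean)
by_module = nn.MSELoss()(pred, truth)

print(torch.allclose(by_hand, by_module))
```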


Now, if you follow ``loss`` in the backward direction using its ``.grad_fn`` attribute, you will see a graph of computations that looks like this:

::

     input-> conv2d-> relu-> maxpool2d-> conv2d-> relu-> maxpool2d
           -> view-> linear-> relu-> linear-> relu-> linear
           -> MSELoss
           -> loss

So, when we call ``loss.backward()``, the whole graph is differentiated with respect to the loss, and every tensor in the graph that has ``requires_grad=True`` will have its ``.grad`` tensor accumulated with the gradient.

To illustrate, let us follow a few steps backward:



In [14]:
print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of the Linear node (AccumulateGrad in the output below)

<MseLossBackward object at 0x7fae1aa25c88>
<AddmmBackward object at 0x7fae1aa25cc0>
<AccumulateGrad object at 0x7fae1aa25c88>
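
Instead of chaining ``next_functions`` by hand, the graph can be walked recursively. The helper ``walk`` below is a hypothetical utility written for this illustration, and it is applied to a tiny expression rather than to the network's loss. The exact node names can vary between PyTorch versions.

```python
import torch


def walk(fn, depth=0, lines=None):
    """Collect the type names of the nodes reachable from fn, indented by depth."""
    if lines is None:
        lines = []
    if fn is None:          # inputs that need no gradient have no node
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, lines)
    return lines


a = torch.ones(2, requires_grad=True)
b = (a * 3).sum()

lines = walk(b.grad_fn)
for line in lines:
    print(line)
```

The walk ends at an ``AccumulateGrad`` node, which is where the gradient is written into the ``.grad`` attribute of the leaf tensor ``a``.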


Back propagation
----------------
To backpropagate the error, all we have to do is call ``loss.backward()``.

You need to clear the existing gradients first, though; otherwise the new gradients are accumulated into the existing ones.
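
The accumulation behaviour is easy to see on a small tensor. The sketch below uses a hypothetical two-element tensor ``w`` in place of the network's parameters: calling ``backward()`` twice without clearing doubles the stored gradient, and zeroing it restores the expected value.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * w).sum().backward()       # gradient of sum(w^2) is 2 * w
first = w.grad.clone()         # tensor([2., 4.])

(w * w).sum().backward()       # no clearing: the new gradient is added
accumulated = w.grad.clone()   # tensor([4., 8.])

w.grad.zero_()                 # clear the buffer, as net.zero_grad() does
(w * w).sum().backward()
fresh = w.grad.clone()         # tensor([2., 4.])

print(first, accumulated, fresh)
```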

Now, we will call ``loss.backward()`` and look at the gradient of the bias term of the conv1 layer before and after back propagation.



In [15]:
net.zero_grad()     # zero the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0064,  0.0044, -0.0034, -0.0062, -0.0086, -0.0061, -0.0030,  0.0063,
        -0.0017,  0.0067])


Now we have seen how to use loss functions.

**Read later:**

   The `nn` package contains various modules and loss functions that form the building blocks of deep neural networks. For complete documentation, see [here](https://pytorch.org/docs/nn).



Update weights
------------------
In practice, the simplest weight update rule is stochastic gradient descent (SGD):

      ``weight = weight - learning_rate * gradient``

We can implement this rule using simple Python code:

```python
learning_rate = 0.01
for f in net.parameters():
    # in-place update: f = f - learning_rate * f.grad
    f.data.sub_(f.grad.data * learning_rate)
```
However, as you use neural networks, you will want to try different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. PyTorch provides a small package, ``torch.optim``, that implements all of these methods.
Using it is very simple:

In [0]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

**Note:** 
    
Observe how the gradient buffers had to be set to zero manually with ``optimizer.zero_grad()``. This is because gradients are accumulated, as explained in the Back propagation section.
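
As a sanity check, a single plain ``optim.SGD`` step (no momentum or weight decay) should match the hand-written update rule. The sketch below compares the two on a hypothetical ``nn.Linear`` model with random data; it is not the ``Net`` from this lecture.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(3, 1)            # hypothetical small model
model_b = copy.deepcopy(model_a)     # identical starting weights
inputs = torch.randn(8, 3)
targets = torch.randn(8, 1)
loss_fn = nn.MSELoss()
lr = 0.01

# Manual update: weight = weight - learning_rate * gradient
loss_fn(model_a(inputs), targets).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= lr * p.grad

# The same step performed by the optimizer
opt = optim.SGD(model_b.parameters(), lr=lr)
opt.zero_grad()
loss_fn(model_b(inputs), targets).backward()
opt.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```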
