# Neural Networks

Neural networks can be constructed using the `torch.nn` package. An `nn.Module` contains layers and a method `forward(input)` that passes the input through those layers and returns the output.

A simple neural network is trained as follows: 

- define the neural network with the learnable parameters/weights
- iterate over the dataset of inputs
- forward pass through the network
- compute the loss (how far the output is from being correct, as measured by a loss function)
- propagate the gradients back into the network's parameters
- update the weights of the network, using a simple rule like `w = w - lr * grad`
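
The steps above can be sketched end to end on a toy problem. This is a minimal sketch, not the network from these notes: the data, the `nn.Linear` model and the learning rate are all made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical toy data: 64 samples with 3 features, targets follow a known linear rule
inputs = torch.randn(64, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]])

model = nn.Linear(3, 1)        # 1. define a network with learnable parameters
criterion = nn.MSELoss()
lr = 0.1                       # assumed learning rate

losses = []
for step in range(50):         # 2. iterate over the data (one full batch per step here)
    prediction = model(inputs)              # 3. forward pass
    loss = criterion(prediction, targets)   # 4. compute the loss
    model.zero_grad()                       # clear gradients left over from the previous step
    loss.backward()                         # 5. propagate gradients back to the parameters
    with torch.no_grad():
        for w in model.parameters():
            w -= lr * w.grad                # 6. w = w - lr * grad
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

The last loss should come out well below the first one, since the targets were generated by a linear rule that the model is able to represent.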

# Define a simple network in the style of this CNN:

![mnist classification cnn](figures/mnist.png)

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
class Net(nn.Module):
    '''
    Note that this net is not exactly the same as the above image
    '''

    def __init__(self, *args, **kwargs):
        super(Net, self).__init__(*args, **kwargs)
        # first, take 1 input channel, 6 output channels, 5x5 square convolution
        # note that these convolution blocks are image size agnostic; we just have to keep the image size in mind when we're designing this network,
        # because fc1 below assumes 32x32 inputs (MNIST images are natively 28x28, so they would need padding or resizing to 32x32)
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)
        # then use linear operations
        self.fc1 = nn.Linear(in_features=16*5*5, out_features=128) # 16 channels x 5x5 spatial size left after two conv + pool stages (32 -> 28 -> 14 -> 10 -> 5)
        self.fc2 = nn.Linear(in_features=128, out_features=64)
        self.fc3 = nn.Linear(in_features=64, out_features=10)

    def forward(self, input):
        # the first convolution layer C1: 1 input image channel, 6 output channels
        # outputs a tensor with size (N, 6, 28, 28), where N is the batch size; the image shrinks to 32 - 5 (kernel size) + 1 = 28
        c1 = F.relu(input=self.conv1(input))
        # subsample this layer with pooling (S2)
        # each 2x2 window of the 28x28 map is replaced by its max value, giving a (N, 6, 14, 14) tensor
        s2 = F.max_pool2d(input=c1, kernel_size=(2, 2))
        # convolution layer again (C3), 6 channels in and 16 out
        # output dimension is 14 - 5 + 1 = 10, so the tensor is (N, 16, 10, 10)
        c3 = F.relu(input=self.conv2(s2))
        # another pooling layer (S4), reduces the 10x10 maps to 5x5, giving (N, 16, 5, 5)
        s4 = F.max_pool2d(input=c3, kernel_size=(2, 2))
        # flatten this output to prep for the linear layers; start_dim=1 keeps the batch dimension and merges all the others
        s4 = torch.flatten(input=s4, start_dim=1) # out goes a (N, 16*5*5) = (N, 400) tensor
        # Dimension indices before flattening:
        # dim 0 = batch (N)
        # dim 1 = channels (16)
        # dim 2 = height (5)
        # dim 3 = width (5)
        # now for the linear layers and their activation functions, this is fairly straightforward
        f5 = F.relu(input=self.fc1(s4)) # out goes a (N, 128) tensor
        f6 = F.relu(input=self.fc2(f5)) # out (N, 64)
        output = self.fc3(f6)
        return output
    
net = Net()
print(net)


Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)
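
The shape arithmetic in the comments of `forward` can be checked by pushing a dummy batch through the same sequence of operations and recording the shape after each step. This is a minimal sketch that rebuilds only the two convolution layers; the batch size of 2 is an arbitrary choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# same layer settings as conv1 and conv2 in Net
conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5)

x = torch.randn(2, 1, 32, 32)           # dummy batch: N=2, 1 channel, 32x32 pixels
c1 = F.relu(conv1(x))                   # 32 - 5 + 1 = 28
s2 = F.max_pool2d(c1, kernel_size=2)    # 28 / 2 = 14
c3 = F.relu(conv2(s2))                  # 14 - 5 + 1 = 10
s4 = F.max_pool2d(c3, kernel_size=2)    # 10 / 2 = 5
flat = torch.flatten(s4, start_dim=1)   # 16 * 5 * 5 = 400 features per sample

shapes = [tuple(t.shape) for t in (x, c1, s2, c3, s4, flat)]
for name, shape in zip(["input", "c1", "s2", "c3", "s4", "flat"], shapes):
    print(f"{name:>5}: {shape}")
```

The last shape, `(2, 400)`, is where the `in_features=16*5*5` of `fc1` comes from.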


Note that only the `forward` function had to be defined; the `backward` function, where the gradients are computed, is defined automatically by `autograd`. You can use any Tensor operation in the `forward` function.

The learnable parameters of the model are returned by `net.parameters()`.

In [3]:
net.parameters() #should return a generator object

<generator object Module.parameters at 0x7c04f97c9b60>

In [4]:
params = list(net.parameters())
len(params) # 10 parameter tensors: a weight and a bias for each of the 5 layers

10
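
`len(params)` counts parameter tensors, not individual weights. To count the individual learnable values, one option is to sum `numel()` over the parameters. The sketch below rebuilds the same five layers in an `nn.ModuleDict` so that it runs on its own; the name `layers` is only for illustration.

```python
import torch.nn as nn

# the same five layers as in Net, collected in a container so the sketch is self-contained
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 128),
    "fc2": nn.Linear(128, 64),
    "fc3": nn.Linear(64, 10),
})

num_tensors = len(list(layers.parameters()))
num_values = sum(p.numel() for p in layers.parameters() if p.requires_grad)

print(num_tensors)  # 10 tensors: one weight and one bias per layer
print(num_values)   # 62806 = 156 + 2416 + 51328 + 8256 + 650
```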

In [5]:
# for instance, this is the randomly initialized 5x5 kernel for the first output channel of conv1.
params[0][0]

tensor([[[-0.1875, -0.0438,  0.0765,  0.1787,  0.1355],
         [ 0.1600,  0.1718,  0.0235, -0.0605,  0.1365],
         [ 0.1336,  0.0359, -0.1687, -0.1266, -0.0010],
         [ 0.0709,  0.1641,  0.1973, -0.1499,  0.1020],
         [ 0.0903,  0.0761,  0.0278, -0.0678, -0.0064]]],
       grad_fn=<SelectBackward0>)

In [6]:
# this is conv2's weight (the 3rd parameter tensor): 5th output channel, 2nd input channel, a 5x5 kernel

params[2][4][1]

tensor([[ 0.0542,  0.0231,  0.0772, -0.0564, -0.0419],
        [ 0.0076, -0.0521, -0.0377,  0.0711,  0.0509],
        [-0.0490, -0.0367, -0.0698, -0.0325,  0.0687],
        [ 0.0718, -0.0469, -0.0032, -0.0028, -0.0448],
        [ 0.0540, -0.0557, -0.0769, -0.0178, -0.0524]],
       grad_fn=<SelectBackward0>)

In [7]:
for layer, parameter in enumerate(params):
    print(f"layer {layer}")
    print(parameter[0].size()) 

layer 0
torch.Size([1, 5, 5])
layer 1
torch.Size([])
layer 2
torch.Size([6, 5, 5])
layer 3
torch.Size([])
layer 4
torch.Size([400])
layer 5
torch.Size([])
layer 6
torch.Size([128])
layer 7
torch.Size([])
layer 8
torch.Size([64])
layer 9
torch.Size([])


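Note that every other entry has an empty size. These are not parameter-free functional parts (functions such as `F.relu` never show up in `net.parameters()` at all). The entries alternate between a weight and a bias, and indexing a 1-D bias with `parameter[0]` returns a scalar, whose size is `torch.Size([])`. Likewise, the other sizes are those of the first slice of each weight, not of the full tensor.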
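
To see which tensor each line of that printout belongs to, one option is `named_parameters()`, which yields a name together with each parameter. The sketch below uses a single convolution layer with the same settings as `conv1` and prints both the full shape and the shape of the first slice.

```python
import torch.nn as nn

conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)

for name, parameter in conv1.named_parameters():
    # full shape of the tensor vs. the shape of parameter[0], its first slice along dim 0
    print(name, tuple(parameter.shape), tuple(parameter[0].shape))

weight_shapes = (tuple(conv1.weight.shape), tuple(conv1.weight[0].shape))
bias_shapes = (tuple(conv1.bias.shape), tuple(conv1.bias[0].shape))
# weight: full shape (6, 1, 5, 5), first slice (1, 5, 5)
# bias:   full shape (6,),         first slice ()  <- a scalar, hence torch.Size([])
```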

In [8]:
# random input

input = torch.randn(size=(1, 1, 32, 32))
input

tensor([[[[-0.7512,  0.5462, -0.2981,  ..., -0.5835,  0.7102, -1.5084],
          [ 0.1149,  0.2066, -0.7144,  ...,  0.2295, -0.2283,  0.2047],
          [-0.1035,  0.1221, -0.0750,  ..., -0.3387,  0.5969,  0.3123],
          ...,
          [ 0.5910,  1.0748, -0.0282,  ..., -1.2341, -0.9134,  0.2557],
          [ 0.2921, -0.9647, -1.2756,  ..., -0.5231, -0.7410, -0.6116],
          [ 0.3731, -0.7414, -0.3270,  ..., -0.2612, -0.7963, -0.2805]]]])

In [9]:
out = net(input)
out

tensor([[-0.1125,  0.0413, -0.0474,  0.0844, -0.0224,  0.0643, -0.0182, -0.0483,
         -0.0563,  0.0161]], grad_fn=<AddmmBackward0>)

PyTorch accumulates gradients across backward passes: each call to `backward()` adds the new gradients to the ones already stored in `.grad`. So before each new training iteration/batch you have to set them to zero; otherwise every update mixes in gradients from earlier batches and training will not behave as intended. Sometimes the accumulation is exactly what you want, for example to simulate a larger batch size on a GPU with little memory by summing the gradients of several small batches before a single update.
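
The accumulation is easy to observe on a single tensor. In this small sketch the tensor `w` and the function `(w * 3).sum()` are made up for illustration; the gradient of that function with respect to each element of `w` is 3.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()     # tensor([3., 3.])

(w * 3).sum().backward()   # second backward pass without clearing
second = w.grad.clone()    # tensor([6., 6.]): the new gradient was added to the old one

w.grad.zero_()             # clear the buffer by hand
(w * 3).sum().backward()
third = w.grad.clone()     # tensor([3., 3.]) again

print(first, second, third)
```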

In [10]:
net.zero_grad()

In [11]:
out.backward(torch.randn(1, 10)) # backprop with a random gradient (the argument is the gradient w.r.t. out, not a label)

> Note that `torch.nn` is built around mini-batches and **not single samples**. This means that if you have a single sample, simply use `input.unsqueeze(0)` to add a batch dimension of size 1.
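
As a small illustration of the note above, here is a single made-up 32x32 grayscale image being turned into a batch of one before it goes through a convolution layer. The variable names are only for this sketch.

```python
import torch
import torch.nn as nn

sample = torch.randn(1, 32, 32)   # one image: (channels, height, width)
batch = sample.unsqueeze(0)       # add a batch dimension in front: (1, 1, 32, 32)

conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
out = conv(batch)

print(tuple(sample.shape))  # (1, 32, 32)
print(tuple(batch.shape))   # (1, 1, 32, 32)
print(tuple(out.shape))     # (1, 6, 28, 28)
```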

Now, just to recap:

- `torch.Tensor`: a multidimensional array with support for `autograd` operations like `backward()`; also holds the gradient with respect to the tensor
- `nn.Module`: neural network module, a convenient way of encapsulating the parameters
    - includes helpers to move them to the GPU, save and load them, etc.
- `nn.Parameter`: a kind of tensor that is automatically registered as a parameter when assigned as an attribute to a `Module`
- `autograd.Function`: implements the forward and backward definitions of an autograd operation. Every `Tensor` operation creates at least a single `Function` node that connects to the functions that created the tensor and encodes its history.
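
The automatic registration of `nn.Parameter` can be seen with a tiny module. `Scale` below is a hypothetical module written only for this sketch: one attribute is an `nn.Parameter`, the other is a plain tensor.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        self.factor = nn.Parameter(torch.ones(1))  # registered as a learnable parameter
        self.offset = torch.zeros(1)               # plain tensor attribute, not registered

    def forward(self, x):
        return x * self.factor + self.offset

module = Scale()
names = [name for name, _ in module.named_parameters()]
print(names)  # ['factor']
```

Only `factor` shows up, so an optimizer built from `module.parameters()` would update `factor` and leave `offset` alone.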

# Loss functions

A loss function takes the (output, target) pair and computes a value that estimates how far the output is from the target. A simple loss is the mean squared error, `nn.MSELoss`, which computes the mean of the squared differences between output and target.
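
With its default settings, `nn.MSELoss` is just the mean of the element-wise squared differences, which can be checked by hand. The tensors in this sketch are random stand-ins, not the network output used in the cells of these notes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)   # stand-in for a network output
target = torch.randn(1, 10)   # stand-in target with the same shape

manual = ((output - target) ** 2).mean()   # mean of the squared differences
builtin = nn.MSELoss()(output, target)     # default reduction is 'mean'

print(manual.item(), builtin.item())
```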

In [12]:
torch.randn(10)

tensor([ 0.7547,  0.1452,  1.1708,  0.6067,  2.8473, -1.2971,  0.4390,  0.9339,
        -0.4208,  0.0997])

In [13]:
output = net(input)
target = torch.randn(10)
output, target

(tensor([[-0.1125,  0.0413, -0.0474,  0.0844, -0.0224,  0.0643, -0.0182, -0.0483,
          -0.0563,  0.0161]], grad_fn=<AddmmBackward0>),
 tensor([-0.1600,  0.9063,  0.4171,  0.5948, -1.0110, -0.2048,  1.0940,  0.1469,
         -0.7494, -0.7266]))

In [14]:
# defining the criterion
criterion = nn.MSELoss()

# give the target the same shape as the output, (1, 10), so MSELoss does not have to broadcast
target = target.view(1, -1)
loss = criterion(output, target) # must be in the order of (prediction, target)
print(loss)

tensor(0.4584, grad_fn=<MseLossBackward0>)


If you want to follow the computation graph backwards from the loss, use its `.grad_fn` attribute:

In [15]:
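print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear (fc3, shown as Addmm)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of Addmm: an AccumulateGrad leaf (most likely fc3's bias), not the ReLU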

<MseLossBackward0 object at 0x7c03b8696ec0>
<AddmmBackward0 object at 0x7c03b4865e10>
<AccumulateGrad object at 0x7c03b4865e10>
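
The same walk can be written as a loop. This sketch uses a tiny made-up computation instead of the network and always follows the first entry of `next_functions`, so it traces only one branch of the graph.

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()

names = []
node = y.grad_fn
while node is not None:
    names.append(type(node).__name__)
    # follow the first input of this node; a leaf (AccumulateGrad) has no next_functions
    node = node.next_functions[0][0] if node.next_functions else None

print(names)  # ['SumBackward0', 'MulBackward0', 'AccumulateGrad']
```

The walk ends at an `AccumulateGrad` node, which is where the gradient for the leaf tensor `x` is accumulated into `x.grad`.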


# Backpropagation

Backpropagation is very straightforward: call `loss.backward()`. The existing gradients need to be cleared first, otherwise the new ones are accumulated onto them.

In [16]:
net.zero_grad()     # clears the gradient buffers of all parameters

print("conv1.bias.grad before the backprop")
net.conv1.bias.grad # displays nothing: in recent PyTorch versions zero_grad() resets .grad to None by default

conv1.bias.grad before the backprop


In [17]:
loss.backward()     # run the backprop

print("conv1.bias.grad after the backprop")
net.conv1.bias.grad

conv1.bias.grad after the backprop


tensor([ 0.0099, -0.0038, -0.0022,  0.0040, -0.0085,  0.0044])
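
To get a feel for what `backward()` computes, the gradient can be compared with a hand-derived one on a problem small enough to differentiate on paper. This sketch uses a made-up one-parameter model `w * x` with an MSE loss, whose derivative with respect to `w` is `mean(2 * (w * x - y) * x)`.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.5, requires_grad=True)   # single learnable weight

loss = ((w * x - y) ** 2).mean()
loss.backward()                             # autograd fills w.grad

# hand-derived gradient: d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
analytic = (2 * (w.detach() * x - y) * x).mean()

print(w.grad.item(), analytic.item())       # both are approximately -14
```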

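# Update the weights of the network

Now we just have to update the weights. The simplest update rule is stochastic gradient descent (SGD): `weight = weight - learning_rate * gradient`.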

In [18]:
# this is a simple way to do this using code:
lr = 0.01
with torch.no_grad():  # the update itself should not be recorded by autograd
    for p in net.parameters():
        p.sub_(p.grad * lr) # the trailing underscore in sub_ marks an in-place operation

In order to fully utilize the power of PyTorch, you should really just use an optimizer from `torch.optim` for this; it implements SGD as well as other update rules.

In [19]:
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=lr)

In [20]:
# then in the training loop:

optimizer.zero_grad() # zero gradients
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # does the update itself
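
For plain SGD (no momentum, no weight decay), `optimizer.step()` should do the same thing as the manual rule `w = w - lr * grad`. The sketch below checks this on two identical copies of a small made-up `nn.Linear` model; the data and sizes are arbitrary.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
manual_model = nn.Linear(4, 2)
optim_model = copy.deepcopy(manual_model)   # identical starting weights

inputs = torch.randn(8, 4)     # made-up batch
targets = torch.randn(8, 2)
criterion = nn.MSELoss()
lr = 0.01

# one manual update step
manual_model.zero_grad()
criterion(manual_model(inputs), targets).backward()
with torch.no_grad():
    for p in manual_model.parameters():
        p.sub_(p.grad * lr)

# the same step through the optimizer
optimizer = optim.SGD(optim_model.parameters(), lr=lr)
optimizer.zero_grad()
criterion(optim_model(inputs), targets).backward()
optimizer.step()

same = all(
    torch.allclose(a, b)
    for a, b in zip(manual_model.parameters(), optim_model.parameters())
)
print(same)  # True
```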