<h1 style="color:#BF66F2 "> Residual Networks in PyTorch 1 </h1>
<div style="margin-top: -30px;">
<h4> 2 examples of ResNets based on the Street View House Numbers (SVHN) dataset. Focus on learning rate schedulers. </h4>
</div>
<div style="margin-top: -18px;">
<span style="display: inline-block;">
    <h3 style="color: lightblue; display: inline;">Keywords:</h3>
    resnet18 + torch.optim + torch.autograd + optim.step() + requires_grad
</span>
</div>

In [5]:
# autocompletion
#!pip install jedi
%config Completer.use_jedi = True

In [None]:
import torch
from torchvision.models import resnet18, ResNet18_Weights

<h3 style="color:#BF66F2"> Recap: ResNets</h3>
<div style="margin-top: -17px;">
ResNet-18 is implemented in the torchvision.models module of torchvision, the computer-vision companion library to PyTorch. Here it is loaded with its default pre-trained weights. <br>
These weights were trained on ImageNet, a large-scale image classification dataset of about 1.2 million training images belonging to 1000 different classes. <br>

The model is typically fed 224x224-pixel images and outputs one raw score (logit) per class; applying softmax to these 1000 scores yields a probability distribution over the classes. <br>
During training, the predictions are compared with the ground-truth labels, and the resulting error is used to adjust the weights so that the model predicts the correct label for a given input image. <br>
Adjusting the weights of a model based on ground-truth labels is known as supervised learning.
</div>
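
The step from raw scores to probabilities can be sketched as follows. This is a minimal illustration that assumes a small hand-written logits tensor (hypothetical values for a 4-class problem, not actual ResNet-18 output).

```python
import torch

# Hypothetical logits for a 4-class problem (illustrative values only)
logits = torch.tensor([[2.0, 1.0, 0.1, -1.0]])

# Softmax over the class dimension turns raw scores into probabilities
probs = torch.softmax(logits, dim=1)

# Index of the most likely class
predicted_class = probs.argmax(dim=1)

print(probs)
print(probs.sum(dim=1))    # the probabilities sum to 1 (up to rounding)
print(predicted_class)     # class 0 has the largest logit
```

The same call should work on the (1, 1000) output of the pre-trained model, where the index returned by `argmax` would correspond to an ImageNet class.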

In [None]:
# ResNet-18: a residual network with 18 weighted layers (convolutional layers plus the final fully connected layer)
model = resnet18(weights=ResNet18_Weights.DEFAULT) 

<h2 style="color:#BF66F2 "> <u> Example 1 </u> </h2>

In [35]:
# Create a random input: a batch of one 3-channel 64x64 image
data_tensor = torch.rand(1, 3, 64, 64)
# Create random stand-in targets, one value per ImageNet class
labels_tensor = torch.rand(1, 1000)

In [36]:
print(f"type model = {type(model)}")

type model = <class 'torchvision.models.resnet.ResNet'>


In [37]:
""" Assign to each class a unique label in the range from 0 to 999.
The model has to predict the correct label for a given input image.
"""
print(type(data_tensor))
print(type(labels_tensor))

<class 'torch.Tensor'>
<class 'torch.Tensor'>


In [38]:
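## Slicing
# labels_tensor[0][0] is a 0-dim tensor holding the first target value (overwritten on the next line)
labels_tensor2 = labels_tensor[0][0]
# labels_tensor[:1] keeps the first row, so the shape stays (1, 1000)
labels_tensor2 = labels_tensor[:1]
# len() returns the size of the first dimension
print(len(labels_tensor))
print(len(labels_tensor2))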

1
1


<h4 style="color:#BF66F2 ">  Step #1 </h4>
Run the input data through each layer of the model to make a prediction. This is the forward pass.

In [39]:
# Perform forward pass
prediction = model(data_tensor) 

In [40]:
prediction

tensor([[-0.4706, -0.2791, -0.5038, -1.2571, -0.4725,  0.0677, -0.0241,  0.4223,
          0.5338, -0.8712, -1.0038, -0.5996, -0.0404, -0.5431, -1.0367, -0.7903,
         -0.6267,  0.0956, -0.3114, -0.5726, -1.6220, -0.6621, -1.4162,  0.2537,
         -0.7347, -1.1150, -1.0951, -1.1698, -0.7301, -0.4161, -0.5347, -0.7960,
         -0.3765, -0.4521, -0.5154, -0.3166,  0.4740, -0.5638, -0.7520, -0.0622,
         -0.6628, -1.0831, -1.1103, -0.3416, -0.7653, -0.5388, -0.6666, -0.6966,
         -1.6033, -1.1438, -0.3556,  0.3514, -0.5565, -0.6628, -0.3979, -1.1646,
         -0.6183, -1.3783, -0.3140, -0.6261,  0.5639, -0.1567, -0.6663, -0.1863,
         -0.7991, -0.2993, -0.6027, -0.2634, -1.0620, -0.9908, -1.3605, -0.1366,
         -1.3244, -0.4294, -1.0877, -1.0231, -0.2424, -0.7421,  0.0159,  0.1607,
         -0.7783, -1.5676, -0.0120, -0.7848, -0.9231, -0.1835, -0.0653,  0.1628,
         -0.0711, -0.7416, -1.0887, -1.2781, -1.7643, -0.2562,  0.3406, -1.9942,
         -0.4248,  0.0100, -

In [41]:
prediction[0][:5]

tensor([-0.4706, -0.2791, -0.5038, -1.2571, -0.4725], grad_fn=<SliceBackward0>)

<h4 style="color:#BF66F2 ">  Step #2 </h4>
<div style="margin-top: -20px;">
<div style="line-height:1.3">

- Calculate the error (loss), i.e. the difference between the model’s predictions and the corresponding labels. <br>
- Backpropagate the error through the network by calling the '.backward()' method on the loss. <br>
Autograd, PyTorch's automatic differentiation engine (not an optimizer), then calculates the gradient for each model parameter and stores it in the parameter’s '.grad' attribute.
</div>
</div>

In [42]:
# Toy scalar loss: sum of the differences between predictions and targets
loss = (prediction - labels_tensor).sum()
# Perform the backward pass
loss.backward() 
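
The summed difference used above is a toy loss. As a hedged sketch of the same forward/backward pattern with a more common classification loss, the block below swaps in cross-entropy and a tiny stand-in model. The names `tiny_model`, `inputs`, `targets` and `ce_loss` are hypothetical and not part of the original notebook.

```python
import torch

torch.manual_seed(0)

# Tiny stand-in classifier (hypothetical): 8 input features, 3 classes
tiny_model = torch.nn.Linear(8, 3)

inputs = torch.rand(4, 8)                # batch of 4 samples
targets = torch.tensor([0, 2, 1, 2])     # integer class indices

# Forward pass, then a scalar cross-entropy loss
logits = tiny_model(inputs)
ce_loss = torch.nn.functional.cross_entropy(logits, targets)

# No gradient has been computed yet
grad_before = tiny_model.weight.grad
print(grad_before)                       # None

# Backward pass: autograd fills .grad for every parameter
ce_loss.backward()

print(ce_loss.dim())                     # 0, the loss is a scalar
print(tiny_model.weight.grad.shape)      # same shape as the weight
```

Because the loss is a scalar, `.backward()` needs no extra argument here.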

<div style="line-height:0.1">
<h4 style="color:#BF66F2 ">  Step #3 </h4>
</div>
<div style="line-height:1.2">
Create an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. <br>
All of the model's parameters are registered in the optimizer.
</div>

In [43]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
optim

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)

<div style="line-height:0.1">
<h4 style="color:#BF66F2 ">  Step #4 </h4>
</div>
<div style="line-height:1.2">
Finally, call the '.step()' method to perform one gradient descent step. <br> 
The optimizer adjusts each parameter using its gradient stored in '.grad'.
</div>

In [44]:
""" Compute the gradient descent.
N.B.
All optimizers implement a step() method that updates the parameters.
step() returns None unless a closure is passed, so displaying 'op' shows nothing.
"""
op = optim.step() 
op
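
What `.step()` does for SGD with momentum can be sketched by hand. The block below is a minimal illustration, assuming the same settings as above (lr=0.01, momentum=0.9, dampening 0, no Nesterov) and a hypothetical two-element parameter `w`. It compares a manual update with `torch.optim.SGD`.

```python
import torch

lr, momentum = 1e-2, 0.9

# Hypothetical parameter updated by the real optimizer
w = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = torch.optim.SGD([w], lr=lr, momentum=momentum)

# Manual copy of the same parameter and its momentum buffer
w_manual = w.detach().clone()
velocity = torch.zeros_like(w_manual)

for _ in range(2):
    # Gradients accumulate across backward calls, so clear them first
    sgd.zero_grad()
    demo_loss = (w ** 2).sum()
    demo_loss.backward()                 # w.grad = 2 * w

    # Manual update: velocity = momentum * velocity + grad,
    # then parameter = parameter - lr * velocity
    grad = w.grad.detach().clone()
    velocity = momentum * velocity + grad
    w_manual = w_manual - lr * velocity

    # Optimizer update
    sgd.step()

print(torch.allclose(w.detach(), w_manual))
```

The call to `zero_grad()` matters in a real training loop: without it, each `.backward()` adds to the gradients left over from the previous iteration.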

<h2 style="color:#BF66F2 "> <u> Example 2 </u> </h2>

In [45]:
# Create two new tensors 
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

In [46]:
# Create tensor Q from a and b.
Q = 3*a**3 - b**2
Q

tensor([-12.,  65.], grad_fn=<SubBackward0>)

In [47]:
""" Find the gradients.
N.B.
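Q is a vector, so backward() needs an explicit 'gradient' argument:
a tensor of the same shape as Q that represents the gradient of Q w.r.t. itself (dQ/dQ = 1).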
"""
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

# Check if collected gradients are correct
print(9*a**2 == a.grad)
print(-2*b == b.grad)

tensor([True, True])
tensor([True, True])
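
The role of the `gradient` argument can be sketched with two variants. This is an illustration only: `a_demo` and `b_demo` are hypothetical copies of the tensors above, and the weighting vector `v` is an arbitrary choice.

```python
import torch

a_demo = torch.tensor([2., 3.], requires_grad=True)
b_demo = torch.tensor([6., 4.], requires_grad=True)

# Variant 1: reduce Q to a scalar, so no gradient argument is needed.
# This matches passing a vector of ones to Q.backward().
Q_demo = 3 * a_demo**3 - b_demo**2
Q_demo.sum().backward()
grad_a_sum = a_demo.grad.clone()

# Reset the accumulated gradients before the second backward pass
a_demo.grad = None
b_demo.grad = None

# Variant 2: weight each element of Q with a vector v.
# Autograd computes the vector-Jacobian product v^T J.
Q_demo = 3 * a_demo**3 - b_demo**2
v = torch.tensor([1.0, 0.5])
Q_demo.backward(gradient=v)

print(grad_a_sum)     # 9 * a**2           = [36, 81]
print(a_demo.grad)    # 9 * a**2 * v       = [36, 40.5]
print(b_demo.grad)    # -2 * b * v         = [-12, -4]
```

Since each element of Q depends only on the matching elements of a and b, the Jacobian is diagonal and the product reduces to an element-wise scaling by `v`.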


**Recap:** <br>
torch.autograd is an engine for computing vector-Jacobian products. <br>
It tracks operations on all tensors which have their requires_grad flag set to True. <br>
The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True. <br>

For tensors that don’t require gradients, setting this attribute to False excludes them from the gradient computation DAG. <br>
In a NN, parameters that don’t compute gradients are usually called frozen parameters. <br>
It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters. <br>

In [48]:
# Set requires_grad
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does 'a' require gradients? : {a.requires_grad}")
b = x + z
print(f"Does 'b' require gradients?: {b.requires_grad}")

Does 'a' require gradients? : False
Does 'b' require gradients?: True


In [49]:
## Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

In [50]:
""" Replace the last linear layer model.fc. (the ResNet classifier) with a new linear layer (unfrozen by default) 
that acts as our classifier, in order to fine-tune the model on a new dataset with 10 labels.
"""
model.fc = torch.nn.Linear(512, 10)
model.fc

Linear(in_features=512, out_features=10, bias=True)

In [53]:
""" Stochastic Gradient Descent """
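# All parameters are registered, but only model.fc still has requires_grad=True,
# so only the new classifier receives gradients in later backward passes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)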
optimizer

SGD (
Parameter Group 0
    dampening: 0
    differentiable: False
    foreach: None
    lr: 0.01
    maximize: False
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)
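
The effect of freezing can be checked on a small scale. The sketch below assumes a hypothetical two-layer stand-in (`backbone` plus `head`) instead of ResNet-18, and passes only the trainable parameters to the optimizer, which is one common alternative to registering all of them.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in for a pre-trained backbone plus a new classifier head
backbone = torch.nn.Linear(16, 8)
head = torch.nn.Linear(8, 10)
net = torch.nn.Sequential(backbone, torch.nn.ReLU(), head)

# Freeze the backbone; the head stays trainable
for param in backbone.parameters():
    param.requires_grad = False

# Register only the parameters that still require gradients
trainable = [p for p in net.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
head_optimizer = torch.optim.SGD(trainable, lr=1e-2, momentum=0.9)

frozen_before = backbone.weight.detach().clone()
head_bias_before = head.bias.detach().clone()

# One training step on random data
step_loss = net(torch.rand(4, 16)).sum()
step_loss.backward()
head_optimizer.step()

print(n_trainable)                                    # 8*10 weights + 10 biases = 90
print(backbone.weight.grad)                           # None: no gradient was computed
print(torch.equal(backbone.weight, frozen_before))    # True: frozen weights unchanged
print(torch.equal(head.bias, head_bias_before))       # False: the head was updated
```

Frozen parameters never get a '.grad' in this fresh example, so the optimizer has nothing to apply to them.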

In [54]:
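# Autograd saves the tensors it needs for the backward pass on grad_fn
x = torch.randn(5, requires_grad=True)
y = x.pow(2)
z = x.exp()
# pow() saves its input: the saved tensor is the same object as x
print(x.equal(y.grad_fn._saved_self))
print(x is y.grad_fn._saved_self)
# exp() saves its result: same values as z, but a different tensor object
print(z.equal(z.grad_fn._saved_result))
print(z is z.grad_fn._saved_result)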

True
True
True
False
