# **ResNet with PyTorch**


In [None]:
import torchvision.models as models
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt


In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Assuming that we are on a CUDA machine, this should print a CUDA device:

print(device)

cuda:0


In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

Files already downloaded and verified
Files already downloaded and verified


In [None]:
# 1)Load
# 2)Transformation
# 3)DataLoader  (Batches)
# 4)Model


# **Residual Block**

For more details, see the original [Residual Net paper](https://arxiv.org/pdf/1512.03385.pdf).

The main idea is to train a very deep network without degradation.

Let's describe the residual block briefly. Below is a simplified picture:

<img src ="https://www.researchgate.net/profile/Kunjin_Chen/publication/325430477/figure/fig4/AS:633970848960514@1528161837288/An-illustration-of-the-modified-deep-residual-network-ResNetPlus-structure-The-blue.png">

**Why does ResNet work?**

**Below is Andrew Ng's explanation:**

Let $a^{[l]}$ denote the activation (output) of layer $l$. In general, each layer first applies a linear operation (determined by the layer's weights and biases) and then a non-linearity $g$ (sigmoid, tanh, ReLU, etc.). In simpler terms, with reference to the figure:

$Z^{[l + 1]} = W^{[l + 1]}\, a^{[l]} + b^{[l+1]};\, a^{[l + 1]} = g\left(Z^{[l + 1]} \right)$ ... (1)

$Z^{[l + 2]} = W^{[l + 2]}\, a^{[l + 1]} + b^{[l+2]};\, a^{[l + 2]} = g\left(Z^{[l + 2]} \right)$ ... (2)

With the skip connection, the activation $a^{[l]}$ is carried (copied) further into the network and added before the non-linearity is applied. Let's see how the second equation changes in ResNet:

$Z^{[l + 2]} = W^{[l + 2]}\, a^{[l + 1]} + b^{[l+2]};\, a^{[l + 2]} = g\left(Z^{[l + 2]} + a^{[l]} \right)$ ...(3). 

The idea is to take many such residual blocks and stack them together. The reason we can increase the depth of the network while the training error keeps going down is the identity (skip) connections. In equation (3), assuming L2 regularization, the weights and biases would shrink (go close to zero) and thus $a^{[l+2]} = g(\sim 0+a^{[l]})$. With the ReLU non-linearity, and since $a^{[l]}$ is itself the output of a ReLU and therefore non-negative, this gives $a^{[l+2]} \approx a^{[l]}$. Going deeper therefore doesn't hurt performance, because learning the identity function is easy, and in the process the extra layers can still learn some useful features.
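
The identity argument can be checked numerically. The sketch below is a toy example with fully connected layers (the size 5 and the variable names are assumptions for illustration): it sets the weights and biases of layers $l+1$ and $l+2$ to exactly zero and evaluates equations (1) and (3).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# a_l: a non-negative activation, as it would be after a previous ReLU
a_l = F.relu(torch.randn(5))

# Weights and biases of layers l+1 and l+2, shrunk all the way to zero
W1, b1 = torch.zeros(5, 5), torch.zeros(5)
W2, b2 = torch.zeros(5, 5), torch.zeros(5)

a_l1 = F.relu(W1 @ a_l + b1)   # equation (1)
z_l2 = W2 @ a_l1 + b2          # linear part of equation (3)
a_l2 = F.relu(z_l2 + a_l)      # equation (3): add the skip, then apply ReLU

print(torch.equal(a_l2, a_l))  # True
```

Without the `+ a_l` term, the same zero weights would give an all-zero output, so the block would destroy the signal instead of passing it through.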

The main building blocks of a ResNet are the residual blocks. The framework follows directly from the ideas discussed above.

Consider $x$ as the input and denote the desired mapping from input to output by $g(x)$ (here $g$ is the target mapping, not the activation function of equations (1) to (3)). Instead of fitting $g$ directly, we stack layers (including the non-linearity coming from the activation function) to fit a different function $f(x) := g(x) - x$. The original mapping is then recast as $f(x) + x$. He et al. hypothesized that it is easier to optimize the residual $f$ than the original mapping $g$. The basic residual block is shown below:
![residual_block](https://miro.medium.com/max/1140/1*D0F3UitQ2l5Q0Ak-tjEdJg.png)

One more important point concerns the dimensions of the input $x$ and the output $f(x)$. The building block is usually defined as
$y = f(x, \{W_i\}) + x$,

where the network learns the residual mapping $f$. In the figure above, the residual block has two weight layers with an activation function in between, so $f = W_2\sigma (W_1x)$. Then $f+x$ is computed by element-wise addition. The identity skip/shortcut connection doesn't use any additional parameters. The dimensions of $f$ and $x$ must be equal; if they are not, one can apply a linear projection $W_s$ on the shortcut connection to match them ($y = f(x, \{W_i\}) + W_sx$). A square matrix $W_s$ can also be used when the dimensions already match, but the authors suggested that the identity connection is sufficient to address the degradation problem.
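
A minimal sketch of such a block in PyTorch is given below. The class name `BasicResidualBlock` and its arguments are hypothetical, and the batch normalization after each convolution follows He et al. rather than the simplified formula $f = W_2\sigma(W_1x)$. The shortcut is the identity when shapes match and a 1x1 convolution (playing the role of $W_s$) otherwise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicResidualBlock(nn.Module):
    """y = f(x, {W_i}) + shortcut(x), where f is two 3x3 convolutions."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        if stride != 1 or in_channels != out_channels:
            # Projection shortcut W_s: 1x1 convolution to match dimensions
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            # Identity shortcut: no extra parameters
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first weight layer + ReLU
        out = self.bn2(self.conv2(out))         # second weight layer: f(x)
        out = out + self.shortcut(x)            # element-wise f(x) + x
        return F.relu(out)                      # non-linearity after the sum


x = torch.randn(4, 64, 32, 32)
same = BasicResidualBlock(64, 64)             # identity shortcut
down = BasicResidualBlock(64, 128, stride=2)  # projection shortcut
print(same(x).shape)  # torch.Size([4, 64, 32, 32])
print(down(x).shape)  # torch.Size([4, 128, 16, 16])
```

The identity variant has no parameters on the shortcut path, which matches the remark above that the skip connection adds no parameters unless a projection is needed.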

With all these in mind, let's implement ResNet. 

## **Build ResNet with PyTorch**


![res_arch](https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/02/Plot-of-Convolutional-Neural-Network-Architecture-with-a-Efficient-Inception-Module.png)  

# **Skip Connections**

<img src ="https://raw.githubusercontent.com/mhuzaifadev/ml_zero_to_hero/master/ddd.png?token=AMSGSQ3WKCWD6M2NEI2RY2DABBC5K">

Reference: Deeplearning.ai Andrew Ng

In [None]:
resnet = models.resnet50(pretrained=False, progress=True)

In [None]:
resnet.to(device)

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (downsample): Sequential(
        (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 

In [None]:
import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(resnet.parameters(), lr=0.001)

In [None]:
for epoch in range(1):  # loop over the dataset (a single epoch here; increase for more)
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data[0].to(device), data[1].to(device)
        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = resnet(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

[1,  2000] loss: 3.067
[1,  4000] loss: 2.525
[1,  6000] loss: 2.241
[1,  8000] loss: 2.311
[1, 10000] loss: 2.360
[1, 12000] loss: 2.215
Finished Training
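
To put the printed losses in context, it helps to know what cross-entropy gives for a model that guesses uniformly over the 10 CIFAR-10 classes. The sketch below uses all-zero logits and arbitrary example labels (both are assumptions for illustration) and recovers $\ln 10 \approx 2.303$.

```python
import math
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

labels = torch.tensor([3, 7, 0, 9])    # arbitrary example labels
uniform_logits = torch.zeros(4, 10)    # equal score for each of 10 classes

loss = ce(uniform_logits, labels)
print(round(loss.item(), 3))    # 2.303
print(round(math.log(10), 3))   # 2.303
```

Losses of roughly 2.2 to 2.4, as in the later lines of the output above, are close to this baseline. That may indicate the model is still near chance level after a single epoch, although the notebook does not measure accuracy, so this is only an interpretation.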


In [None]:
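import numpy as np  # used by np.transpose below

def imshow(img):
    img = img / 2 + 0.5     # unnormalize: undo Normalize(mean=0.5, std=0.5)
    npimg = img.cpu().numpy()
    # matplotlib expects (H, W, C); PyTorch tensors are (C, H, W)
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()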