# Deep Residual Learning for Image Recognition

**Notebook author: Shuang HOU**

At the end of 2015, Microsoft Research Asia released a paper titled ["Deep Residual Learning for Image Recognition"](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), authored by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. The paper achieved state-of-the-art results in image classification and detection, winning the ImageNet and COCO competitions. This notebook implements the Residual Network (**ResNet** for short) in PyTorch, based on this paper.

This notebook is intended for students who have taken the AML course (or a closely related one). It assumes basic knowledge of Deep Learning and Convolutional Neural Networks, which were introduced in previous courses ([DL](https://github.com/SupaeroDataScience/deep-learning/tree/main/deep), [CNN](https://github.com/fchouteau/isae-practical-deep-learning)); refer to them if needed.

**Table of contents:**
0. [Preparation](#sec0)
1. [Problem introduction](#sec1)
2. [Theory of ResNet](#sec2)
    1. [Residual Learning](#sec2-1)
    2. [Identity Mapping by Shortcuts](#sec2-2)
    3. [Network Architectures](#sec2-3)
3. [Implementation of ResNet](#sec3)
    1. [ImageNet Classification](#sec3-1)
    2. [CIFAR-10 and Analysis](#sec3-2)
4. [Conclusion](#sec4)

# <a id="sec0"></a>0. Preparation

In this notebook, we'll be using `torch` and `torchvision`, which we have already used in previous AML courses. Run the following code blocks to install the necessary packages and verify that the setup works by importing them.

Please refer to the [PyTorch](https://pytorch.org/get-started/locally/) website for installation instructions if necessary. We'll also be using packages `sklearn`, `numpy`, and `matplotlib`. 

Note that this notebook is fairly compute-intensive and might be better run in Google Colab.

In [3]:
# !pip install torch torchvision

In [4]:
import torch
import torchvision
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# <a id="sec1"></a>1. Problem introduction

From experience, the depth of the network is crucial to the performance of the model. When the number of network layers is increased, the network can extract more complex feature patterns, so theoretically better results can be achieved when the model is deeper. 

However, experiments show that deep networks suffer from a *degradation* problem: as network depth increases, accuracy saturates (which might be unsurprising) and then degrades rapidly. This phenomenon can be seen directly in Figure 1, which shows the training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus higher test error. This is not caused by overfitting: an overfitted model would have a low training error, whereas the 56-layer network has a higher training error than the 20-layer one.

<img src="img/degradation.JPG" width="50%"></img>

<center><font size=1.5><br>Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks.<br>
    The deeper network has higher training error, and thus test error.</font></center>

<div class="alert alert-warning">

**Think about this question:**<br>
How can we effectively solve the "degradation" problem caused by increasing network depth?
    
</div>

<div class="alert alert-danger"><a href="#answer1" data-toggle="collapse"><b>Ready to see the answer? (click to expand)</b></a><br>
<div id="answer1" class="collapse">

Deeper plain networks are harder to optimize. A classic obstacle is vanishing/exploding gradients: during training, the gradient is not transmitted effectively to the shallow layers. **Batch Normalization** (BN) addresses much of this problem on the forward side by normalizing the outputs of intermediate layers. However, the paper reports that degradation still appears in plain networks trained with BN, so vanishing gradients alone are unlikely to explain it. The residual network (ResNet) directly connects shallow and deep layers by adding **shortcut connections** (identity mappings), which makes the deeper network easier to optimize and lets the gradient reach the shallow layers directly.
    
</div>
</div>

# <a id="sec2"></a>2. Theory of ResNet

### <a id="sec2-1"></a>2.1. Residual Learning

In response to the "degradation" problem, the authors proposed a **deep residual learning** framework, in which a stack of layers fits a residual mapping instead of the desired mapping itself.

Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping:

$$F(x) := H(x) - x$$

The original mapping is recast into:

$$F(x) + x$$

The authors hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.

The formulation $F(x)+x$ can be realized by feedforward neural networks with "**shortcut connections**". Shortcut connections are those that skip one or more layers. In the paper, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers.

Figure 2 shows a building block of residual learning in the deep residual network:

<img src="img/2-layer building block.JPG" width="340px">

<center><font size=1.5><br>Figure 2. Residual learning: a building block.</font></center>

Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe) without modifying the solvers.
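As a minimal sketch of the building block in Figure 2, the PyTorch module below stacks two $3 \times 3$ convolutions and adds the input back before the final ReLU. The class name `BasicResidualBlock`, the channel count and the input size are hypothetical choices for illustration; placing Batch Normalization after each convolution is an assumption that follows common practice rather than something stated in this section.

```python
import torch
import torch.nn as nn


class BasicResidualBlock(nn.Module):
    """Two 3x3 conv layers plus an identity shortcut (hypothetical name)."""

    def __init__(self, channels: int):
        super().__init__()
        # padding=1 keeps the spatial size, so F(x) and x have the same shape.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)  # assumption: BN after each conv
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))  # first weight layer + ReLU
        out = self.bn2(self.conv2(out))           # second weight layer: F(x)
        return self.relu(out + x)                 # F(x) + x, followed by ReLU


block = BasicResidualBlock(channels=64)
x = torch.randn(2, 64, 56, 56)  # (batch, channels, height, width)
y = block(x)
print(y.shape)
```

The shortcut itself is just the `+ x` in `forward`: it holds no weights, which is why the block has the same parameters as two plain convolutional layers.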

<div class="alert alert-success">

**In brief:**<br>
- If identity mappings are added, a deeper network should not perform worse than its shallower counterpart.
- It is difficult to learn identity mappings with a stack of multiple nonlinear layers.
- If an identity mapping is optimal, the weights of $F(x)$ will tend to $0$.
- If the optimal mapping is close to an identity mapping, it is much easier during optimization to find the small $F(x)$ that perturbs the identity (starting from parameters near 0) than to fit a completely new function.
    
</div>
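The third point can be checked numerically. The sketch below builds a small residual branch $F$, sets the weights of its last layer to zero, and verifies that $F(x) + x$ then reproduces $x$ exactly. It also checks that the gradient reaching $x$ is 1 everywhere, since only the shortcut contributes. The layer sizes and variable names are arbitrary illustrative choices, and the final ReLU of the full block is left out so that the identity is exact.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Residual branch F: two 3x3 convolutions with a ReLU in between (biases omitted).
residual_branch = nn.Sequential(
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
    nn.ReLU(),
    nn.Conv2d(8, 8, kernel_size=3, padding=1, bias=False),
)

# Push the last layer's weights to zero, so that F(x) = 0 for every input.
nn.init.zeros_(residual_branch[2].weight)

x = torch.randn(4, 8, 16, 16, requires_grad=True)
y = residual_branch(x) + x  # y = F(x) + x

# With F(x) = 0 the block is an exact identity mapping.
is_identity = torch.equal(y, x)

# Gradient of sum(y) w.r.t. x: the residual branch contributes 0, the shortcut 1.
y.sum().backward()
grad_is_one = torch.allclose(x.grad, torch.ones_like(x))

print(is_identity, grad_is_one)
```

A plain stack of the same two layers would have to learn a specific non-trivial set of weights to approximate the identity, whereas here zero weights are enough.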

### <a id="sec2-2"></a>2.2. Identity Mapping by Shortcuts

Two forms of shortcut connection are proposed.

**Method 1:**

$$y = F(x,\{W_{i}\}) + x$$

Where:

- $x$ represents the input vector of the building block for the layers considered.
- $y$ represents the output vector of the building block for the layers considered.
- $F(x,\{W_{i}\})$ represents the residual mapping to be learned, which is a stack of several convolutional layers with nonlinearities.<br>
  For the two-layer example in Figure 2, $F = W_2 \sigma(W_1 x)$, in which $\sigma$ represents the nonlinear activation function ReLU, and the biases are omitted to simplify the notation.
- $F+x$ is performed by the shortcut connection followed by element-wise addition.

<div class="alert alert-success">

**Note:** This network structure introduces neither extra parameters nor extra computational complexity. This is not only attractive in practice but also important in the comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
    
</div>
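The claim that identity shortcuts add no parameters can be verified by counting them. The sketch below defines a plain block and a residual block that share the same two-layer mapping $F$ and compares their parameter counts. The names `PlainBlock`, `ResidualBlock`, `make_layers` and `count_parameters` are hypothetical helpers, and Batch Normalization is omitted to keep the count easy to read.

```python
import torch
import torch.nn as nn


def make_layers(channels: int) -> nn.Sequential:
    """Two 3x3 convolutions with a ReLU in between, i.e. the mapping F."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        nn.ReLU(),
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
    )


class PlainBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.layers = make_layers(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.layers(x))      # y = relu(F(x))


class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.layers = make_layers(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.layers(x) + x)  # y = relu(F(x) + x)


def count_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


plain_params = count_parameters(PlainBlock(64))
residual_params = count_parameters(ResidualBlock(64))
print(plain_params, residual_params)
```

Both counts equal $2 \times 64 \times 64 \times 3 \times 3 = 73728$; the only difference between the two blocks is one element-wise addition in `forward`.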

**Method 2:**

In method 1, the dimensions of $x$ and $F$ must be equal. If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $W_s$ by the shortcut connections to match the dimensions:

$$y = F(x,\{W_{i}\}) + W_{s}x$$

Where $x$, $y$ and $F(x,\{W_{i}\})$ are defined as in method 1, and:

- $W_s$ represents the linear projection applied by the shortcut connection, chosen so that $W_s x$ has the same dimensions as $F$.
- $F + W_s x$ means that the projected shortcut is added element-wise to the residual mapping.

<div class="alert alert-success">
    
**Note:** The identity mapping is sufficient for addressing the degradation problem and is economical, so $W_s$ is only used when dimensions need to be matched.
    
</div>
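A possible implementation of method 2 is sketched below for a block that doubles the number of channels and halves the feature map size. The projection $W_s$ is realized as a $1 \times 1$ convolution with stride 2. The class name `ProjectionResidualBlock`, the channel counts, and the use of Batch Normalization (including on the shortcut) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut is a linear projection W_s (hypothetical name)."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 2):
        super().__init__()
        # Residual mapping F: the first conv changes channels and spatial size.
        self.residual = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        # W_s: 1x1 convolution that matches both channels and spatial size of F(x).
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),  # assumption: BN on the shortcut as well
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.residual(x) + self.projection(x))  # F(x) + W_s x


block = ProjectionResidualBlock(in_channels=64, out_channels=128, stride=2)
x = torch.randn(2, 64, 56, 56)
y = block(x)
print(y.shape)
```

Unlike the identity shortcut, this projection does carry weights ($64 \times 128$ for the $1 \times 1$ convolution here), which is one reason to reserve it for blocks where the dimensions change.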

The paper also notes that $F(x,\{W_{i}\})$ is not limited to the two-layer form above; it can take other forms, such as the three-layer building block on the right of Figure 3. One such small unit is called a *block*. When building deeper networks, the authors use this three-layer structure, which they call a *bottleneck* building block.

<img src="img/3-layer building block.JPG">

<center><font size=1.5><br>Figure 3. Two different building blocks for residual learning.<br> 
    Left: a building block (on $56 \times 56$ feature maps) as in Figure 4 for ResNet-34.<br>
    Right: a "bottleneck" building block for ResNet-50/101/152.</font></center>
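The bottleneck block can be sketched in the same style. The layer layout used below ($1 \times 1$ convolution to reduce channels, $3 \times 3$ convolution, $1 \times 1$ convolution to restore channels, with 256 and 64 channels) follows the paper's figure rather than anything spelled out in this text, and the class name `BottleneckBlock` and the Batch Normalization layers are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BottleneckBlock(nn.Module):
    """Three-layer residual block: 1x1 reduce, 3x3, 1x1 restore (hypothetical name)."""

    def __init__(self, channels: int, bottleneck_channels: int):
        super().__init__()
        self.residual = nn.Sequential(
            # 1x1 conv reduces the number of channels.
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(),
            # 3x3 conv operates on the cheaper, reduced representation.
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(),
            # 1x1 conv restores the original number of channels.
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.residual(x) + x)  # identity shortcut


block = BottleneckBlock(channels=256, bottleneck_channels=64)
x = torch.randn(2, 256, 14, 14)
y = block(x)

# Number of convolution weights in the three layers.
conv_weights = sum(m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d))
print(y.shape, conv_weights)
```

The three convolutions hold $256 \cdot 64 + 64 \cdot 64 \cdot 9 + 64 \cdot 256 = 69632$ weights, slightly fewer than the 73728 of a two-layer block with two $3 \times 3$ convolutions on 64 channels, even though this block works on 256-channel inputs.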

### <a id="sec2-3"></a>2.3. Network Architectures

The implementation part below mainly compares two network structures, **plain nets** and **residual nets**, so this section focuses on describing them.

<img src="img/plain-res nets.jpg" width="55%">

<center><font size=1.5><br>Figure 4. Example network architectures for ImageNet.<br>
    Left: the VGG-19 model (19.6 billion FLOPs) as a reference.<br> 
    Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).<br>
    Right: a residual network with 34 parameter layers (3.6 billion FLOPs).<br>
    The dotted shortcuts increase dimensions.</font></center>

**Plain Network**

The plain baselines (Figure 4, middle) are mainly inspired by the philosophy of VGG nets (Figure 4, left). The convolutional layers mostly have $3 \times 3$ filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 (Figure 4, middle).
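Design rule (ii) can be checked with a little arithmetic. The multiply-add count of a convolutional layer is roughly $H \cdot W \cdot C_{in} \cdot C_{out} \cdot k^2$, so halving the feature map side (a factor of $1/4$) while doubling both input and output channels (a factor of $4$) leaves the cost unchanged. The sketch below evaluates this for four stages; the (size, channels) pairs are taken from the 34-layer architecture in the paper, and the function name is a hypothetical helper.

```python
def conv_multiply_adds(feature_map_size: int, in_channels: int,
                       out_channels: int, kernel_size: int = 3) -> int:
    """Approximate multiply-adds of one conv layer on a square feature map."""
    return feature_map_size ** 2 * in_channels * out_channels * kernel_size ** 2


# (output feature map side, number of filters) for each stage.
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]

costs = [conv_multiply_adds(size, channels, channels) for size, channels in stages]
for (size, channels), cost in zip(stages, costs):
    print(f"{size}x{size} maps, {channels} filters: {cost} multiply-adds")
```

Every stage comes out at 115605504 multiply-adds per $3 \times 3$ layer, which is what "preserve the time complexity per layer" means in practice.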

**Residual Network**

Based on the above plain network, we insert shortcut connections (Figure 4, right) which turn the network into its counterpart residual version. The identity shortcuts (method 1) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Figure 4). When the dimensions increase (dotted line shortcuts in Figure 4), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in method 2 is used to match dimensions (done by $1 \times 1$ convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
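Option (A) can be sketched without any learnable layer. The function below subsamples the input with a stride of 2 and pads the channel dimension with zeros. The function name `zero_padded_shortcut`, the use of strided slicing for the subsampling, and the choice to append the zero channels at the end are assumptions made for illustration; the text above only specifies that zeros are padded and that the stride is 2.

```python
import torch
import torch.nn.functional as F


def zero_padded_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2) -> torch.Tensor:
    """Parameter-free shortcut: subsample spatially, then zero-pad the channels."""
    x = x[:, :, ::stride, ::stride]        # keep every `stride`-th row and column
    extra_channels = out_channels - x.shape[1]
    # F.pad lists pads from the last dimension backwards:
    # (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra_channels))


x = torch.randn(2, 64, 56, 56)
shortcut = zero_padded_shortcut(x, out_channels=128)
print(shortcut.shape)
```

The first 64 channels of `shortcut` are the subsampled input and the remaining 64 are zeros, so the result can be added to a residual branch with 128 output channels. Option (B) replaces this function with a strided $1 \times 1$ convolution, which adds parameters.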

# <a id="sec3"></a>3. Implementation of ResNet