# Deep Residual Learning for Image Recognition

**Notebook author: Shuang HOU**

At the end of 2015, Microsoft Research Asia released a paper titled ["Deep Residual Learning for Image Recognition"](https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf), authored by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. The paper achieved state-of-the-art results in Image classification and detection, winning the ImageNet and COCO competitions. This notebook is an implementation of the Residual Network (**ResNet** for short) in PyTorch based on this paper.

This notebook is prepared for students who have participated in the AML course (or a fairly close course). It supposes a basic knowledge of Deep Learning and Convolutional Neural Networks, which have been introduced in the previous courses ([DL](https://github.com/SupaeroDataScience/deep-learning/tree/main/deep), [CNN](https://github.com/fchouteau/isae-practical-deep-learning)), you can refer to them if needed.

**Table of contents:**
0. [Preparation](#sec0)
1. [Problem introduction](#sec1)
2. [Dataset: Fashion-MNIST](#sec2)
3. [Construction of ResNet](#sec3)
    1. [Plain network (as a comparaison)](#sec3-1)
    2. [Residual Learning](#sec3-2)
    3. [Identity Mapping by Shortcuts](#sec3-3)
    4. [Network Architectures](#sec3-4)
4. [Experiments](#sec4)
5. [Conclusion](#sec5)

# <a id="sec0"></a>0. Preparation

In this notebook, we'll be using `torch` and `torchvision`, which we have already used in previous AML courses. Run the following code blocks to install the necessary packages and verify that everything is working by importing everything. 

Please refer to the [PyTorch](https://pytorch.org/get-started/locally/) website for installation instructions if necessary. We'll also be using packages `sklearn`, `numpy`, and `matplotlib`. 

Note that this notebook is fairly compute intensive and might be better [run in Google Colab]().

In [1]:
# !pip install torch torchvision

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.data.dataloader as Data
from torch.autograd import Variable
import torchvision
from torchvision import datasets, models, transforms
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

# <a id="sec1"></a>1. Dataset: Fashion-MNIST

[Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist) is a dataset of Zalando's article images—consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28 $\times$ 28 grayscale image, associated with a label from 10 classes. Fashion-MNIST is a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits but is more complex.

<img src="img/fashion-mnist-small.png">

In [2]:
labels_text = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

PyTorch comes with this dataset by default, but we need to download it. We'll then make dataloaders which lazily iterate through the datasets. We'll use a training set and a validation set and greatly reduce their sizes to make this notebook run in a reasonable time.

In [24]:
# Hyperparameter
BATH_SIZE = 512

In [25]:
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,))])

full_trainset = torchvision.datasets.FashionMNIST(root='data', train=True, download=True, transform=transform)
trainset, full_validset = torch.utils.data.random_split(full_trainset, (10000, 50000))
validset, _ = torch.utils.data.random_split(full_validset, (1000, 49000))
testset = torchvision.datasets.FashionMNIST(root='data', train=False, download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=BATH_SIZE, shuffle=True, num_workers=2)
validloader = torch.utils.data.DataLoader(validset, batch_size=BATH_SIZE, shuffle=True, num_workers=2)
testloader = torch.utils.data.DataLoader(testset, batch_size=BATH_SIZE, shuffle=True, num_workers=2)

In [26]:
# Checking the dataset
for images, labels in trainloader:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', labels.shape)
    break

Image batch dimensions: torch.Size([512, 1, 28, 28])
Image label dimensions: torch.Size([512])


# <a id="sec2"></a>2. Problem introduction

From experience, the depth of the network is crucial to the performance of the model. When the number of network layers is increased, the network can extract more complex feature patterns, so theoretically better results can be achieved when the model is deeper. 

But the experiment found that the deep network has a *degradation* problem: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. This phenomenon can be seen directly in Figure 1 which shows the training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus test error. This is not caused by overfitting, because the training error of the 56-layer network is also high.

<img src="img/degradation.JPG" width="50%"></img>

<center><font size=1.5><br>Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks.<br>
    The deeper network has higher training error, and thus test error.</font></center>

<div class="alert alert-warning">

**Think about this question:**<br>
How to effectively solve the "degradation" problem caused by the increase in network depth?
    
</div>

<div class="alert alert-danger"><a href="#answer1" data-toggle="collapse"><b>Ready to see the answer? (click to expand)</b></a><br>
<div id="answer1" class="collapse">

The problem of degradation is mainly due to the increase in network depth. During model training, the gradient cannot be effectively transmitted to the shallow network, resulting in [vanishing/exploding gradients](). **Batch Normalization** (BN) changes the data distribution by normalizing the output data, which is a forward process to solve the vanishing/exploding gradients problem. The residual network (ResNet) directly connects the shallow network and the deep network by adding **shortcut connection** (Identity Map), so that the gradient can be well transmitted to the shallow layer.
    
</div>
</div>

# <a id="sec2"></a>2. Construction of ResNet

### <a id="sec2-1"></a>2.1 Plain network (as a comparaison)

In [11]:
class PlainBlock(nn.Module):
    """
    A basic building block for Plain Network
    
    Parameters：
        - in_channel: Number of input channel
        - out_channel: Number of output channel
        - stride: Number of stride 
        - downsample: "None" for identity downsample, otherwise for a real downsample
    
    """
    
    expansion = 1    # Record whether the number of convolution kernels in each layer has changed
 
    def __init__(self, in_channel, out_channel, stride=1, downsample=None):
        super(PlainBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, padding=1, bias=False)  # 有无bias对bn没多大影响
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU()
 
        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channel)
 
        self.downsample = downsample
 
    def forward(self, x):
        identity = x      # Record the output of the last residual block
        
        if self.downsample is not None:  # Determine if need to downsample for dimension matching
            identity = self.downsample(x)
 
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
 
        out = self.conv2(out)
        out = self.bn2(out)
 
        out = self.relu(out)
 
        return out

### <a id="sec2-2"></a>2.2. Residual Learning

In response to the "degradation" problem, the author Dr. He proposed a **deep residual learning** framework, which uses a multi-layer network to fit a residual mapping.

Formally, denoting the desired underlying mapping as $H(x)$, we let the stacked nonlinear layers fit another mapping:

$$F(x) := H(x)−x$$ 

The original mapping is recast into:

$$F(x)+x$$. 

We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation of $F(x)+x$ can be realized by feedforward neural networks with "**[shortcut connections]()**". Shortcut connections are those skipping one or more layers. For the case in paper, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers. 

Figure 2 shows a building block of residual learning in the deep residual network:

<img src="img/2-layer building block.JPG" width="340px">

<center><font size=1.5><br>Figure 2. Residual learning: a building block.</font></center>

Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., [Caffe]) without modifying the solvers.

<div class="alert alert-success">

**In brief:**<br>
- If identity mappings are added, a deeper network will not perform worse than a shallow network.
- It is difficult to learn identify mapings in a network structure composed of multiple non-linear layers.
- If identity mapings is the optimal link method, then the weight parameters of $F(x)$ will tend to $0$.
- If the optimal mapping is close to identity mappings, it is much easier to find the $F(x)$ corresponding to the identity mappings (initial parameters near 0) during optimization than to approximately fit a completely new function.
    
</div>

In [12]:
def conv3x3(in_planes, out_planes, stride=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, 
                     stride=stride, padding=1, bias=False)

In [13]:
class BasicBlock(nn.Module):
    
    """
    A basic building block for 18/34-layer ResNet.
    
    Args：
        - inplanes: Number of input channel
        - planes: Number of output channel
        - stride: Number of stride 
        - downsample: "None" for identity downsample, otherwise for a real downsample
    
    """
    
    expansion = 1    # Record whether the number of convolution kernels in each layer has changed
    
    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x    # Record the output of the last residual block

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:    # Determine if need to downsample for dimension matching
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out

### <a id="sec2-2"></a>2.2. Identity Mapping by Shortcuts

Two connection methods are proposed for shortcut connection.

**Method 1:**

$$y = F(x,\{W_{i}\}) + x$$

Where:

- $x$ represents the input vector of the building block for the layers considered.
- $y$ represents the output vector of the building block for the layers considered.
- $F(x,\{W_{i}\})$ represents the residual mapping to be learned, which is the superposition of multiple nonlinear convolutional layers.<br>
  For the example in figure above that has two layers, $F = W_2 \sigma(W_1 x)$ in which $\sigma$ represents the nonlinear activation function ReLU, and the biases are omitted for simplifying notations.
- $F+x$ means shortcut connection, which corresponds to the addition of each pixel.

<div class="alert alert-success">

**Note:** This network structure introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in the comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).
    
<div>

**Method 2:**

In method 1, the dimensions of $x$ and $F$ must be equal. If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $W_s$ by the shortcut connections to match the dimensions:

$$y = F(x,\{W_{i}\}) + W_{s}x$$

Where:

- $x$ represents the input vector of the building block for the layers considered.
- $y$ represents the output vector of the building block for the layers considered.
- $F(x,\{W_{i}\})$ represents the residual mapping to be learned, which is the superposition of multiple nonlinear convolutional layers.<br>
  For the example in figure above that has two layers, $F = W_2 \sigma(W_1 x)$ in which $\sigma$ represents the nonlinear activation function ReLU, and the biases are omitted for simplifying notations.
- $F+x$ means shortcut connection, which corresponds to the addition of each pixel.

<div class="alert alert-success">
    
**Note:** The identity mapping is sufficient for addressing the degradation problem and is economical, thus $W_s$ will be only used when matching dimensions.
    
<div>

It is also mentioned in paper that, for $F(x,\{W_{i}\})$, it should not be limited to the two-layer convolution connection mentioned above, it can be more diverse, such as the three-layer building block on the right of Figure 3. One such small unit is called a *block*. When building a deep network structure, the author calls the second structure *bottleneck* building block.

<img src="img/3-layer building block.JPG">

<center><font size=1.5><br>Figure 3. Two different building blocks for residual learning.<br> 
    Left: a building block (on 56 $\times$ 56 feature maps) as in Figure 4 for ResNet34.<br>
    Right: a "bottleneck" building block for ResNet-50/101/152.</font></center>

In [14]:
class Bottleneck(nn.Module):
    
    """
    A "bottleneck" building block for 50/101/152-layer ResNet.
    
    Args：
        - in_channel: Number of input channel
        - out_channel: Number of output channel
        - stride: Number of stride 
        - downsample: "None" for identity downsample, otherwise for a real downsample
    
    """
    
    expansion = 4       # The number of convolution kernels in the third layer (256, 512, 1024, 2048) 
                        # is 4 times the number of convolution kernels in the first or second layer (64, 128, 256, 512)
 
    def __init__(self, in_channel, out_channel, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        
        self.conv1 = nn.Conv2d(in_channels=in_channel, out_channels=out_channel,
                               kernel_size=1, stride=1, bias=False)  # Squeeze channels for dimensionality reduce
        self.bn1 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU(inplace=True)
 
        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, bias=False, padding=1)
        self.bn2 = nn.BatchNorm2d(out_channel)
        self.relu = nn.ReLU(inplace=True)
 
        self.conv3 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel*self.expansion,
                               kernel_size=1, stride=1, bias=False)  # Unsqueeze channels for dimensionality increase
        self.bn3 = nn.BatchNorm2d(out_channel*self.expansion)
 
        self.downsample = downsample
 
    def forward(self, x):
        residual = x
 
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
 
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)
 
        out = self.conv3(out)
        out = self.bn3(out)
        
        if self.downsample is not None:
            residual = self.downsample(x)
 
        out += residual
        out = self.relu(out)
 
        return out

### <a id="sec2-3"></a>2.3. Network Architectures

The subsequent implementation part is mainly to compare the two network structures of **Plain net** and **Residual net**, so this part focuses on the description of these two network structures.

<img src="img/plain-res nets.jpg" width="55%">

<center><font size=1.5><br>Figure 4. Example network architectures for ImageNet.<br>
    Left: the [VGG-19] model(19.6 billion FLOPs) as a reference.<br> 
    Middle: a plain network with 34 parameter layers (3.6 billion FLOPs).<br>
    Right: a residual network with 34 parameter layers (3.6 billionFLOPs).<br>
    The dotted shortcuts increase dimensions.</font></center>

**Plain Network**

The plain baselines (Figure 4, middle) are mainly inspired by the philosophy of [VGG nets] (Figure 4, left). The convolutional layers mostly have $3 \times 3$ filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 (Figure 4, middle).

**Residual Network**

Based on the above plain network, we insert shortcut connections (Figure 4, right) which turn the network into its counterpart residual version. The identity shortcuts (method 1) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Figure 4). When the dimensions increase (dotted line shortcuts in Figure 4), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in method 2 is used to match dimensions (done by 1 $\times$ 1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

In [15]:
class ResNet(nn.Module):
    """
    Implementation of ResNet architecture.
    
    Arg：
        - block: "BasicBlock" for 18/34-layer ResNet, "Bottleneck" for 50/101/152-layer ResNet
        - layers: The number of each residual layer, for example, [3,4,6,3] for the 34-layer ResNet

    """

    def __init__(self, block, layers, num_classes, grayscale):
        self.inplanes = 64
        if grayscale:
            in_dim = 1
        else:
            in_dim = 3
        super(ResNet, self).__init__()
        
        #  part 1: conv1 + maxpooling
        self.conv1 = nn.Conv2d(in_dim, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        
        #  part 2: conv2,3,4,5
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        
        #  part 3: avgpooling + fully connected layer
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        #  Initialization of the convolutional layer
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

    def _make_layer(self, block, planes, blocks, stride=1):
        downsample = None
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplanes, planes * block.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample))
        self.inplanes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)

        return x

Now we can define different functions for construction of a 34-layer ResNet with "BasicBlock" and a 101-layer ResNet with "Bottleneck".

In [18]:
def resnet34(num_classes, grayscale):
    """Constructs a ResNet-34 model."""
    model = ResNet(block = BasicBlock,
                 layers = [3, 4, 6, 3],
                 num_classes = num_classes,
                 grayscale = grayscale)
    return model

# <a id="sec3"></a>3. Experiments

In [21]:
# Set up the device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Training on {}'.format(device))

In [23]:
# Hyperparameters
LEARNING_RATE = 0.001
BATCH_SIZE = 512
NUM_EPOCHS = 30

# Architecture
NUM_FEATURES = 32*32
NUM_CLASSES = 10

# Other
GRAYSCALE = True

In [None]:
model = resnet34(NUM_CLASSES, GRAYSCALE)
model.to(device)

**Train just 1 epoch:**

In [None]:
def train(model):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    train_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    return train_loss

In [None]:
%time train_loss = train(model)
print(train_loss)

In [None]:
def get_valid_predictions(model):
    all_labels = np.array([])
    predictions = np.array([])
    with torch.no_grad():
        for data in validloader:
            images, labels = data
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            all_labels = np.append(all_labels, labels.cpu().numpy())
            predictions = np.append(predictions, predicted.cpu().numpy())
    return all_labels, predictions

In [None]:
y_valid, predictions = get_valid_predictions(model)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print('Accuracy: ', accuracy_score(predictions, y_valid))
print(classification_report(predictions, y_valid, target_names=labels_text))

Accuracy:  0.608
              
              precision    recall  f1-score   support

    T-shirt/top       0.50      0.60      0.55        96
     Trouser       0.85      0.96      0.90        81
    Pullover       0.20      0.34      0.25        56
       Dress       0.66      0.63      0.65        93
        Coat       0.50      0.45      0.48       118
      Sandal       0.61      0.64      0.62        84
       Shirt       0.47      0.25      0.33       164
     Sneaker       0.74      0.72      0.73       108
         Bag       0.67      0.81      0.74        91
     Ankle boot       0.85      0.86      0.86       109

    accuracy                           0.61      1000
    macro avg       0.61      0.63      0.61      1000
    weighted avg       0.61      0.61      0.60      1000


Try with another optimizer Adam, to see whether the performance will be better.

In [25]:
def train_Adam(model):
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters())
    train_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        inputs = inputs.to(device)
        labels = labels.to(device)
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
    return train_loss

In [None]:
%time train_loss = train_Adam(model)
print(train_loss)

In [None]:
y_valid, predictions = get_valid_predictions(model)

from sklearn.metrics import accuracy_score, classification_report

print('Accuracy: ', accuracy_score(predictions, y_valid))
print(classification_report(predictions, y_valid, target_names=labels_text))

Accuracy:  0.809
              
              precision    recall  f1-score   support

    T-shirt/top       0.83      0.80      0.81       119
     Trouser       0.95      0.94      0.94        93
    Pullover       0.60      0.66      0.63        88
       Dress       0.81      0.81      0.81        89
        Coat       0.87      0.64      0.74       142
      Sandal       0.94      0.90      0.92        93
       Shirt       0.38      0.56      0.45        59
     Sneaker       0.88      0.85      0.87       109
         Bag       0.92      0.93      0.92       109
    Ankle boot       0.86      0.96      0.91        99

    accuracy                           0.81      1000
    macro avg       0.80      0.80      0.80      1000
    weighted avg       0.83      0.81      0.81      1000

**Training iteration for NUM_EPOCHS times.**

In [None]:
criterion = nn.CrossEntropyLoss()

def validation(model):
    valid_loss = 0
    with torch.no_grad():
        for data in validloader:
            images, labels = data
            images = images.to(device)
            labels = labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            valid_loss += loss.item()
    return valid_loss

def train(model):
#     optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    optimizer = torch.optim.Adam(model.parameters())
    train_history = []
    valid_history = []
    for epoch in range(NUM_EPOCHS):
        train_loss = 0.0
        for i, data in enumerate(trainloader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        valid_loss = validation(model)
        train_history.append(train_loss)
        valid_history.append(valid_loss)
        print('Epoch %02d: train loss %0.5f, validation loss %0.5f' % (epoch, train_loss, valid_loss))
    return train_history, valid_history

In [None]:
train_history, valid_history = train(model)

In [None]:
def plot_train_val(train, valid):
    fig, ax1 = plt.subplots()
    color = 'tab:red'
    ax1.set_ylabel('Training', color=color)
    ax1.plot(train, color=color)
    ax2 = ax1.twinx()
    color = 'tab:blue'
    ax2.set_ylabel('Validation', color=color)
    ax2.plot(valid, color=color)
    fig.tight_layout()
    
plot_train_val(train_history, valid_history)

The loss of the training set is getting smaller, but the loss of the validation set is getting larger.
Over-fitting occurs, other methods are needed to solve the over-fitting problem.

## Test

In [None]:
def run_test(model):
    with torch.no_grad():
        model.eval()
        pred = []
        all_label = []
        for batch_idx, (data, label) in enumerate(testloader):
            batch_x, batch_y = data.to(device), label.to(device)
            batch_x, batch_y = Variable(batch_x), Variable(batch_y)
            output = model(batch_x)
            
            pred2 = output.max(1, keepdim=True)[1]
            pred2 = pred2.cpu().numpy()  
            for ii in range(len(pred2)):
                pred.append((pred2[ii])[0])
            all_label = np.append(all_label, batch_y.cpu().numpy())

    return pred, all_label

In [None]:
pred, y_test = run_test(model)

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print('Accuracy: ', accuracy_score(pred, y_test))
print(classification_report(pred, y_test, target_names=labels_text))

Accuracy:  0.8685
              
              precision    recall  f1-score   support

    T-shirt/top       0.87      0.77      0.82      1124
     Trouser       0.97      0.99      0.98       974
    Pullover       0.81      0.80      0.81      1020
       Dress       0.91      0.86      0.88      1054
        Coat       0.71      0.85      0.77       833
      Sandal       0.95      0.95      0.95      1002
       Shirt       0.64      0.65      0.65       978
     Sneaker       0.92      0.92      0.92      1002
         Bag       0.96      0.96      0.96      1004
    Ankle boot       0.95      0.94      0.94      1009

    accuracy                           0.87     10000
    macro avg       0.87      0.87      0.87     10000
    weighted avg       0.87      0.87      0.87     10000


# Draft...

In [34]:
def run_model(net, loader, criterion, optimizer, train = True):
    running_loss = 0
    running_accuracy = 0
    running_history = []

    # Set mode
    if train:
        net.train()
    else:
        net.eval()


    for i, data in enumerate(loader):
        per_run_loss = 0.0

        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        
        # Pass to gpu or cpu
        inputs, labels = inputs.to(device), labels.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        with torch.set_grad_enabled(train):
            output = net(inputs)
            _, pred = torch.max(output, 1)
            loss = criterion(output, labels)

        # If on train backpropagate
        if train:
            loss.backward()
            optimizer.step()

        # Calculate stats
        per_run_loss += loss.item()
        running_loss += loss.item()
        running_accuracy += torch.sum(pred == labels.detach())
        running_history.append(running_loss)
        if train:
            print('Epoch %02d: train loss %0.5f' % (epoch, per_run_loss))
        else:
            print('Epoch %02d: validation loss %0.5f' % (epoch, per_run_loss))
        
    return running_loss / len(loader), running_accuracy.double() / len(loader.dataset), running_history

In [1]:
# Train the network
import time
# import utils

patience = 3
best_loss = 1e4
    
for epoch in range(30):
    start = time.time()
    train_loss, train_acc, train_history = run_model(net, trainloader,
                                      criterion, optimizer)
    val_loss, val_acc, valid_history = run_model(net, validloader,
                                  criterion, optimizer, False)
    end = time.time()

    # print stats
    stats = """Epoch: {}\t train loss: {:.3f}, train acc: {:.3f}\t
            val loss: {:.3f}, val acc: {:.3f}\t
            time: {:.1f}s""".format(epoch, train_loss, train_acc, val_loss,
                                        val_acc, end - start)
    print(stats)

#     # early stopping and save best model
#     if val_loss < best_loss:
#         best_loss = val_loss
#         patience = patience
#         utils.save_model({
#             'arch': net,
#             'state_dict': net.state_dict()
#         }, 'saved-models/{}-run-{}.pth.tar'.format(net, run))
#     else:
#         patience -= 1
#         if patience == 0:
#             print('Run out of patience!')
#             break

### MNIST dataset

In [61]:
train_data = torchvision.datasets.MNIST(
    './mnist', train=True, transform=torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    ]), download=True
)

train_data.data = train_data.data[:10000]
train_data.targets = train_data.targets[:10000]

test_data = torchvision.datasets.MNIST(
    './mnist', train=False, transform=torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    ]), download=True
)

print("train_data:", train_data.train_data.size())
print("train_labels:", train_data.train_labels.size())
print("test_data:", test_data.test_data.size())

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ./mnist\MNIST\raw\train-images-idx3-ubyte.gz


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting ./mnist\MNIST\raw\train-images-idx3-ubyte.gz to ./mnist\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ./mnist\MNIST\raw\train-labels-idx1-ubyte.gz



HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting ./mnist\MNIST\raw\train-labels-idx1-ubyte.gz to ./mnist\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ./mnist\MNIST\raw\t10k-images-idx3-ubyte.gz


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting ./mnist\MNIST\raw\t10k-images-idx3-ubyte.gz to ./mnist\MNIST\raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ./mnist\MNIST\raw\t10k-labels-idx1-ubyte.gz


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))

Extracting ./mnist\MNIST\raw\t10k-labels-idx1-ubyte.gz to ./mnist\MNIST\raw
Processing...
Done!
train_data: torch.Size([10000, 28, 28])
train_labels: torch.Size([10000])
test_data: torch.Size([10000, 28, 28])




In [62]:
train_loader = Data.DataLoader(dataset=train_data, batch_size=32, shuffle=True)
test_loader = Data.DataLoader(dataset=test_data, batch_size=32)

model = ResNet18()
if if_use_gpu:
    model = model.cuda()

print(model)

ResNet(
  (conv1): Sequential(
    (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
  )
  (layer1): Sequential(
    (0): ResidualBlock(
      (left): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU()
        (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (4): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (right): Sequential()
    )
    (1): ResidualBlock(
      (left): Sequential(
        (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (2): ReLU()
        (3): Con

In [63]:
optimizer = torch.optim.Adam(model.parameters())
loss_func = torch.nn.CrossEntropyLoss()

for epoch in range(1):
    print('epoch {}'.format(epoch + 1))
    for i, data in enumerate(train_loader, 0):
        # get the inputs
        inputs, labels = data
        batch_x, batch_y = Variable(inputs), Variable(labels)
        if if_use_gpu:
            batch_x = batch_x.cuda()
            batch_y = batch_y.cuda()
        out = model(batch_x)
        batch_y = batch_y.long()
        loss = loss_func(out, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # 返回每行元素最大值
        pred = torch.max(out, 1)[1]
        train_correct = (pred == batch_y).sum()
        train_correct = train_correct.item()
        train_loss = loss.item()
        print('batch:{},Train Loss: {:.6f}, Acc: {:.6f}'.format(i+1,train_loss , train_correct /32))

epoch 1
batch:1,Train Loss: 2.374657, Acc: 0.062500
batch:2,Train Loss: 6.123889, Acc: 0.156250
batch:3,Train Loss: 19.508154, Acc: 0.125000
batch:4,Train Loss: 13.292258, Acc: 0.218750
batch:5,Train Loss: 4.874920, Acc: 0.125000
batch:6,Train Loss: 6.002936, Acc: 0.062500
batch:7,Train Loss: 15.337273, Acc: 0.093750
batch:8,Train Loss: 5.083591, Acc: 0.156250
batch:9,Train Loss: 7.783920, Acc: 0.093750
batch:10,Train Loss: 9.707572, Acc: 0.125000
batch:11,Train Loss: 6.786269, Acc: 0.156250
batch:12,Train Loss: 6.968081, Acc: 0.312500
batch:13,Train Loss: 5.258401, Acc: 0.125000
batch:14,Train Loss: 7.909595, Acc: 0.125000
batch:15,Train Loss: 3.568305, Acc: 0.250000
batch:16,Train Loss: 3.798988, Acc: 0.218750
batch:17,Train Loss: 5.174158, Acc: 0.093750
batch:18,Train Loss: 2.668407, Acc: 0.437500
batch:19,Train Loss: 2.718485, Acc: 0.437500
batch:20,Train Loss: 3.481237, Acc: 0.218750
batch:21,Train Loss: 2.669489, Acc: 0.312500
batch:22,Train Loss: 3.429018, Acc: 0.156250
batch:23

batch:182,Train Loss: 0.412537, Acc: 0.843750
batch:183,Train Loss: 0.203275, Acc: 0.968750
batch:184,Train Loss: 0.417183, Acc: 0.875000
batch:185,Train Loss: 0.030099, Acc: 1.000000
batch:186,Train Loss: 0.200837, Acc: 0.937500
batch:187,Train Loss: 0.127633, Acc: 0.937500
batch:188,Train Loss: 0.048370, Acc: 0.968750
batch:189,Train Loss: 0.322275, Acc: 0.937500
batch:190,Train Loss: 0.079107, Acc: 0.968750
batch:191,Train Loss: 0.252487, Acc: 0.906250
batch:192,Train Loss: 0.180565, Acc: 0.937500
batch:193,Train Loss: 0.293501, Acc: 0.906250
batch:194,Train Loss: 0.387668, Acc: 0.937500
batch:195,Train Loss: 0.668451, Acc: 0.843750
batch:196,Train Loss: 0.509568, Acc: 0.812500
batch:197,Train Loss: 0.283848, Acc: 0.906250
batch:198,Train Loss: 0.103716, Acc: 0.968750
batch:199,Train Loss: 0.071170, Acc: 0.968750
batch:200,Train Loss: 0.123919, Acc: 0.968750
batch:201,Train Loss: 0.198210, Acc: 0.937500
batch:202,Train Loss: 0.197717, Acc: 0.875000
batch:203,Train Loss: 0.278053, Ac

KeyboardInterrupt: 

In [69]:
# Evaluation--------------------------------
model.eval()
eval_loss = 0.
eval_acc = 0.

for batch_x, batch_y in test_loader:
    batch_x, batch_y = Variable(batch_x, requires_grad=False), Variable(batch_y, requires_grad=False)
    if if_use_gpu:
        batch_x = batch_x.cuda()
        batch_y = batch_y.cuda()
    out = model(batch_x)
    loss = loss_func(out, batch_y)
    eval_loss += loss.item()
    pred = torch.max(out, 1)[1]
    num_correct = (pred == batch_y).sum()
    eval_acc += num_correct.item()
    
print('Test Loss: {:.6f}, Acc: {:.6f}'.format(eval_loss / (len(test_data)), eval_acc / (len(test_data))))

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 57802752 bytes. Buy new RAM!
