# 1.0 Introduction

Our previous lesson discussed the GoogLeNet architecture and the Inception module, a micro-architecture that acts as a building block in the overall macro-architecture. We are now going to discuss another network architecture that relies on micro-architectures – **ResNet**.

**ResNet** uses what is called a **residual module** to train Convolutional Neural Networks to depths previously thought impossible. For example, in 2014, the [VGG16 and VGG19](https://arxiv.org/abs/1409.1556) architectures were considered very deep. However, with [ResNet](https://arxiv.org/abs/1512.03385), networks with more than 100 layers have been successfully trained on the challenging ImageNet dataset, and networks with over 1,000 layers on CIFAR-10.

These depths are only made possible by using *smarter* weight initialization algorithms (such as [Xavier/Glorot](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) and [MSRA/He et al.](https://arxiv.org/abs/1502.01852)) along with **identity mapping**, a concept we’ll discuss later in this lesson. Given the depths of ResNet networks, perhaps it comes as no surprise that ResNet took first place in all three ILSVRC 2015 challenges (classification, detection, and localization).

In this lesson, we are going to discuss the ResNet architecture and the residual module, along with updates to the residual module that have made it capable of obtaining higher classification accuracy.
From there, we’ll implement and train variants of ResNet on the CIFAR-10 dataset and the Tiny ImageNet challenge – in each case, our ResNet implementations will outperform every experiment we have executed in this course.

# 2.0 ResNet and the Residual Module

First introduced by He et al. in their 2015 paper, [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385), the **ResNet** architecture has become a seminal work, demonstrating that extremely deep networks can be trained using standard SGD and a reasonable initialization function. In order to train networks at depths greater than 50-100 (and in some cases, 1,000) layers, ResNet relies on a micro-architecture called the **residual module**.

Another interesting component of ResNet is that pooling layers are used extremely sparingly.

Building on the work of [Springenberg](https://arxiv.org/abs/1412.6806) et al., ResNet does not strictly rely on max pooling operations to reduce volume size. Instead, convolutions with strides > 1 are used to learn weights and reduce the output volume spatial dimensions. In fact, there are only two occurrences of pooling being applied in the full implementation of the architecture:

1. The first (and only) occurrence of max pooling happens early in the network to help reduce spatial dimensions.

2. The second pooling operation is an average pooling layer used in place of fully connected layers, like in GoogLeNet.

Strictly speaking, there is only one max pooling layer – convolutional layers handle all other reductions in spatial dimensions.

In this section, we’ll review the original residual module, along with the residual bottleneck module used to train deeper networks. From there, we’ll discuss extensions and updates to the original residual module by He et al. in their 2016 publication, [Identity Mappings in Deep Residual Networks](https://arxiv.org/abs/1603.05027), that allow us to further increase classification accuracy. Later in this lesson, we’ll implement ResNet from scratch using Keras.

## 2.1 Going Deeper: Residual Modules and Bottlenecks

The original residual module introduced by [He et al.](https://arxiv.org/abs/1512.03385) in 2015 relies on **identity mappings**, 

> the process of taking the original input to the module and adding it to the output of a series of operations. 

A graphical depiction of this module can be seen in the figure below (left). Notice how this module only has two branches, unlike the four branches in the Inception module of GoogLeNet. Furthermore, this module is very simple.


<img width="600" src="https://drive.google.com/uc?export=view&id=1Hw_1H_Bm9oPuLPRH12yy3r4eYPJSaXEu"/>

At the top of the module, we accept input to the module (i.e., the previous layer in the network). The right branch is a **linear shortcut** – it connects the input to an addition operation at the bottom of the module. Then, on the left branch of the **residual module**, we apply a series of
convolutions (all of which are 3 x 3), activations, and batch normalizations. This is a fairly standard pattern to follow when constructing Convolutional Neural Networks.

But what makes ResNet interesting is that He et al. suggested adding the original input to the output of the CONV, RELU and BN layers. We call this addition an **identity mapping** since the input (the identity) is added to the output of a series of operations. It is also why the term **residual** is used: because the input is added back at the end, the stacked layers only need to learn the difference (the residual) between the desired output and the input. The connection between the input and the addition node is called the **shortcut**. Note that we are not referring to concatenation along the channel dimension as we have done in previous lessons. Instead, we are performing simple 1 + 1 = 2 addition at the bottom of the module between the two branches.

While traditional neural network layers can be seen as learning a function $y = f(x)$, a residual layer attempts to approximate $y$ via $f(x) + \mathrm{id}(x) = f(x) + x$, where $\mathrm{id}(x)$ is the identity function.

These residual layers start at the identity function and evolve to become more complex as the network learns. This type of residual learning framework allows us to train networks that are substantially
deeper than previously proposed network architectures.
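
As a rough illustration of this idea, the sketch below uses NumPy and a toy linear transform in place of the CONV/BN/RELU branch. The residual_block helper and its weights are hypothetical and are not part of the ResNet implementation in this lesson. With all-zero weights the branch contributes nothing, so the block reduces to the identity function:

```python
import numpy as np

def residual_block(x, weights):
    # f(x): toy stand-in for the CONV/BN/RELU branch (linear map + ReLU)
    fx = np.maximum(weights @ x, 0.0)
    # shortcut: element-wise addition keeps the shape of x unchanged
    return fx + x

x = np.array([1.0, -2.0, 3.0])

# with zero weights f(x) = 0, so the block starts out as the identity
identity_out = residual_block(x, np.zeros((3, 3)))

# with non-zero weights the block outputs x plus a learned correction
scaled_out = residual_block(x, 0.5 * np.eye(3))
```

Note that the output always has the same shape as the input, which is what distinguishes this addition from channel-wise concatenation.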

Furthermore, since the input is included in every residual module, it turns out the network can learn faster and with larger learning rates. It is very common to see the base learning rates for ResNet implementations start at 1e-1 (i.e., 0.1). For most architectures such as AlexNet or VGGNet, a learning rate this high would almost guarantee the network would not converge. But since ResNet relies on residual modules via identity mappings, this higher learning rate is completely possible.

In the same 2015 work, [He et al.](https://arxiv.org/abs/1512.03385) also included an extension to the original residual module called **bottlenecks** (Figure above, right). Here we can see that the same identity mapping is taking
place, only now the CONV layers in the left branch of the residual module have been updated:

1. We are utilizing three CONV layers rather than just two.
2. The first and last CONV layers are 1 x 1 convolutions.
3. The number of filters learned in the first two CONV layers is 1/4 the number of filters learned in the final CONV.

To understand why we call this a **bottleneck**, consider the following figure where two residual modules are stacked on top of each other, with one residual feeding into the next (Figure below).


<img width="400" src="https://drive.google.com/uc?export=view&id=11Ag3EgyXAf5fbgJd-YGO1y6ue3kxGRrj"/>

The first residual module accepts an input volume of size M x N x 64 (the actual width and height are arbitrary for this example). The three CONV layers in the first residual module learn K = 32, 32, and 128 filters, respectively. After applying the first residual module, our output volume size is M x N x 128, which is then fed into the second residual module.

In the second residual module, the number of filters learned by each of the three CONV layers stays the same at K = 32, 32, and 128, respectively. However, notice that 32 < 128, implying that we are actually reducing the depth of the volume during the 1 x 1 and 3 x 3 CONV layers. This has the benefit of leaving the 3 x 3 bottleneck layer with smaller input and output dimensions.

The final 1 x 1 CONV then applies 4x as many filters as the first two CONV layers, thereby increasing dimensionality once again, which is why we call this update to the residual module the **bottleneck** technique. When building our own residual modules, it’s common to supply pseudocode such as residual_module(K=128), which implies that the final CONV layer will learn 128 filters, while the first two will learn 128/4 = 32 filters. This notation is often easier to work with, as it’s understood that the bottleneck CONV layers will learn 1/4 the number of filters of the final CONV layer.
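
To make the K / 4 convention concrete, here is a minimal sketch that derives the three filter counts from K and compares the number of convolution weights in a bottleneck against two plain 3 x 3 CONV layers. All helper names are hypothetical, and the count ignores batch normalization parameters and any shortcut convolution:

```python
def bottleneck_filters(K):
    # first two CONV layers learn K/4 filters, the final CONV learns K
    return (K // 4, K // 4, K)

def conv_weights(kernel_size, in_channels, out_channels):
    # number of weights in a CONV layer without a bias term
    return kernel_size * kernel_size * in_channels * out_channels

def bottleneck_weights(in_channels, K):
    # 1x1 (reduce) => 3x3 (bottleneck) => 1x1 (expand)
    f1, f2, f3 = bottleneck_filters(K)
    return (conv_weights(1, in_channels, f1)
            + conv_weights(3, f1, f2)
            + conv_weights(1, f2, f3))

def plain_weights(in_channels, K):
    # two 3x3 CONV layers that both learn K filters (no bottleneck)
    return conv_weights(3, in_channels, K) + conv_weights(3, K, K)

filters = bottleneck_filters(128)
bottleneck_total = bottleneck_weights(128, 128)
plain_total = plain_weights(128, 128)
```

For a 128-channel input and K = 128, the bottleneck needs far fewer weights than the plain variant, which is part of why it is attractive for very deep networks.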

When it comes to training **ResNet**, we typically use the **bottleneck variant** of the residual module rather than the original version, especially for ResNet implementations with > 50 layers.

## 2.2 Rethinking the Residual Module

In 2016, He et al. published a second paper on the residual module entitled [Identity Mappings in Deep Residual Networks](https://arxiv.org/abs/1603.05027). This publication described a comprehensive study, both theoretical and empirical, of the ordering of convolutional, activation, and batch normalization layers within the residual module itself. Originally, the residual module (with bottleneck) looked like the figure below (left).

<img width="400" src="https://drive.google.com/uc?export=view&id=1Gl7Dzj3rKcSP5WW0RWvB4i3h2MroXljX"/>

The original residual module with bottleneck accepts an input (a ReLU activation map) and then applies a series of (CONV => BN => RELU) * 2 => CONV => BN before adding this output to the original input and applying a final ReLU activation (which is then fed into the next residual module in the network). However, in their 2016 study, He et al. found a better layer ordering capable of obtaining higher accuracy – this method is called **pre-activation**.

In the **pre-activation** version of the residual module, we remove the ReLU at the bottom of the module and re-order the batch normalization and activation such that they come before the
convolution (Figure above, right).

Now, instead of starting with a convolution, we apply a series of (BN => RELU => CONV) * 3 (assuming the bottleneck is being used, of course). The output of the residual module is now the
addition operation which is subsequently fed into the next residual module in the network (since residual modules are stacked on top of each other).

We call this layer ordering pre-activation as our ReLUs and batch normalization are placed before the convolutions, which is in contrast to the typical approach of applying ReLUs and batch
normalizations after the convolutions. In our next section, we’ll implement ResNet from scratch using both **bottlenecks** and **pre-activations**.
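
The two orderings can be written out side by side. The sketch below only builds lists of layer names (the function names are hypothetical), but it makes the difference easy to inspect:

```python
def original_bottleneck(num_convs=3):
    # (CONV => BN => RELU) * 2 => CONV => BN => ADD => RELU
    layers = []
    for i in range(num_convs):
        layers += ["CONV", "BN"]
        if i < num_convs - 1:
            layers.append("RELU")
    return layers + ["ADD", "RELU"]

def preactivation_bottleneck(num_convs=3):
    # (BN => RELU => CONV) * 3 => ADD, with no ReLU after the addition
    return ["BN", "RELU", "CONV"] * num_convs + ["ADD"]

original = original_bottleneck()
preact = preactivation_bottleneck()
```

In the pre-activation list the final entry is the addition itself, so the shortcut path from one module to the next is never passed through a ReLU.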

# 3.0 Implementing ResNet

Now that we have reviewed the ResNet architecture, let’s go ahead and implement it in Keras. For this specific implementation, we’ll be using the most recent incarnation of the residual module, including bottlenecks and pre-activations.

We start off by importing our fairly standard set of classes and functions used when building Convolutional Neural Networks. However, I would like to draw your attention to **Line 12** where we import the add function. Inside the residual module, we’ll need to add together the outputs of two branches, which will be accomplished via this add function. We’ll also import the **l2 function** on **Line 13** so that we can perform **L2 weight decay**. 
> Regularization is extremely important when training ResNet since, due to the network’s depth, it is prone to overfitting.



Next, let’s move on to our residual_module:

```python
class ResNet:
	@staticmethod
	def residual_module(data, K, stride, chanDim, red=False, reg=0.0001, bnEps=2e-5, bnMom=0.9):
```

This specific implementation of ResNet was inspired by both He et al. in their [Caffe distribution](https://github.com/KaimingHe/deep-residual-networks) and the MXNet implementation from [Wei Wu](https://github.com/tornadomeet/ResNet), so we will follow their parameter choices as closely as possible. Looking at the residual_module, we can see that the function accepts more parameters than any of our previous functions – let’s review each of them in detail.

The data parameter is simply the input to the residual module. The **value K** defines the number of filters that will be learned by the final CONV in the bottleneck. The first two CONV layers will
learn K / 4 filters, as per the He et al. paper. 

**The stride** controls the stride of the convolution.
We’ll use this parameter to help us reduce the spatial dimensions of our volume without resorting to max pooling.

We then have the **chanDim parameter**, which defines the axis along which batch normalization is performed – this value is specified later in the build function based on whether we are using “channels last” or “channels first” ordering.

**Not all residual modules will be responsible for reducing the dimensions of our spatial volume** – the red (i.e., “reduce”) boolean will control whether we are reducing spatial dimensions (True) or not (False).

We can then supply a regularization strength to all CONV layers in the residual module **via reg**. The **bnEps parameter** controls the $\epsilon$ responsible for avoiding “division by zero” errors when normalizing inputs. In Keras, $\epsilon$ defaults to 0.001; however, for our particular implementation, we’ll allow this value to be reduced significantly. The **bnMom parameter** controls the momentum for the moving average. This value normally defaults to 0.99 inside Keras, but He et al. as well as Wei Wu recommend decreasing the value to 0.9.
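
To see what these two values do, here is a simplified scalar sketch of the two places they appear in batch normalization. The helper names are hypothetical, and the learned scale and shift parameters of the real layer are left out:

```python
import math

def update_moving_average(moving, batch_value, momentum=0.9):
    # the running statistic keeps `momentum` of its old value and
    # takes (1 - momentum) of the statistic from the current batch
    return momentum * moving + (1.0 - momentum) * batch_value

def normalize(value, mean, variance, eps=2e-5):
    # eps keeps the denominator non-zero even when variance is 0
    return (value - mean) / math.sqrt(variance + eps)

# three batches whose mean is 1.0, starting from a moving mean of 0.0
moving_mean = 0.0
for batch_mean in [1.0, 1.0, 1.0]:
    moving_mean = update_moving_average(moving_mean, batch_mean)

# a zero-variance input would divide by zero without eps
safe = normalize(3.0, 3.0, 0.0)
```

A smaller momentum such as 0.9 lets the moving statistics follow the batch statistics more quickly than the Keras default of 0.99.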



Now that the parameters of **residual_module** are defined, let’s move on to the body of the function:

```python
19:		# the shortcut branch of the ResNet module should be
20:		# initialized as the input (identity) data
21:		shortcut = data
22:
23:		# the first block of the ResNet module are the 1x1 CONVs
24:		bn1 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(data)
25:		act1 = Activation("relu")(bn1)
26:		conv1 = Conv2D(int(K * 0.25), (1, 1), use_bias=False, kernel_regularizer=l2(reg))(act1)
```

On **Line 21** we initialize the **shortcut** in the residual module, which is simply a reference to the **input data**. We will later add the shortcut to the output of our bottleneck + pre-activation branch.

The first pre-activation of the bottleneck branch can be seen in Lines 24-26. Here we apply a batch normalization layer, followed by a ReLU activation, and then a 1x1 convolution using K/4 total filters. **You’ll also notice that we are excluding the bias term from our CONV layers via use_bias=False**. Why might we wish to purposely leave out the bias term? [According to He et al., the biases are in the BN layers that immediately follow the convolutions](https://github.com/KaimingHe/deep-residual-networks/issues/10#issuecomment-194037195), so there is no need to introduce a second bias term.
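
One way to convince yourself that a CONV bias is redundant when batch normalization follows is the small sketch below. It uses a simplified batch_norm helper (hypothetical, without the learned scale and shift) and made-up activation values. Adding a constant bias to every activation shifts the batch mean by the same constant, so the normalized result is unchanged:

```python
import math
from statistics import fmean, pvariance

def batch_norm(values, eps=2e-5):
    # normalize a batch of scalars to zero mean and unit variance
    mu = fmean(values)
    var = pvariance(values, mu)
    return [(v - mu) / math.sqrt(var + eps) for v in values]

activations = [0.5, -1.0, 2.0, 3.5]  # made-up CONV outputs
bias = 10.0                          # hypothetical bias term

without_bias = batch_norm(activations)
with_bias = batch_norm([a + bias for a in activations])

# the bias is removed by the mean subtraction inside batch_norm
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(without_bias, with_bias))
```

In the real layer, the learned shift parameter of batch normalization takes over the role the bias would have played.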

Next, we have our second CONV layer in the bottleneck, this one responsible for learning a total of K/4 3x3 filters (**Lines 28 to 31**).

The final block in the bottleneck learns K filters, each of which is 1x1 (**Lines 33 to 37**).

The next step is to see if we need to reduce spatial dimensions, thereby alleviating the need to
apply max pooling:

```python
39:		# if we are to reduce the spatial size, apply a CONV layer to
40:		# the shortcut
41:		if red:
42:			shortcut = Conv2D(K, (1, 1), strides=stride, use_bias=False, kernel_regularizer=l2(reg))(act1)
```

If we are instructed to reduce spatial dimensions, we’ll do so with a convolutional layer (applied to the shortcut) with a stride > 1. The output of the final conv3 in the bottleneck is then added together with the shortcut, thus serving as the output of the **residual_module**:


```python
44:    # add together the shortcut and the final CONV
45:    x = add([conv3, shortcut])
46:
47:    # return the addition as the output of the ResNet module
48:    return x
```
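
The addition only works if both branches produce volumes of the same shape. The sketch below checks this with plain arithmetic under the settings used in this module: the 3x3 CONV in the bottleneck branch uses padding="same", while the 1x1 CONV on the shortcut uses the Keras default padding="valid". The helper names are hypothetical:

```python
import math

def same_conv_size(size, stride):
    # output size of a CONV with padding="same"
    return math.ceil(size / stride)

def valid_conv_size(size, kernel, stride):
    # output size of a CONV with padding="valid"
    return (size - kernel) // stride + 1

def branch_shapes(size, K, stride):
    # bottleneck branch: only the 3x3 CONV is strided
    main = (same_conv_size(size, stride),) * 2 + (K,)
    # shortcut branch: 1x1 CONV with the same stride and K filters
    shortcut = (valid_conv_size(size, 1, stride),) * 2 + (K,)
    return main, shortcut

main, shortcut = branch_shapes(32, 128, 2)
odd_main, odd_shortcut = branch_shapes(15, 128, 2)
```

Because the shortcut convolution uses the same stride and the same number of filters K as the bottleneck branch, the two shapes agree and the element-wise add is valid.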




The **residual_module** will serve as our building block when creating deep residual networks.
Let’s move on to using this building block inside the build method:

```python
	@staticmethod
	def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
```

Just as our **residual_module** requires more parameters than previous micro-architecture implementations, the same is true for our **build** function. The **width, height, and depth** parameters all control the input dimensions of the images in our dataset. The classes variable dictates how many overall classes our network should learn – you have already seen these variables.

> What is interesting are the **stages** and **filters** parameters, both of which are **lists**. 

When constructing the **ResNet architecture**, we’ll be stacking a number of residual modules on top of each other (using the same number of filters for each stack), followed by reducing the spatial dimensions of the volume – this process is then continued until we are ready to apply our average pooling and softmax classifier.


To make this point clear, let’s suppose that stages=(3, 4, 6) and filters=(64, 128, 256, 512). The first filter value, 64, will be applied to the only CONV layer not part of a residual module (i.e., the first convolutional layer in the network). We’ll then stack three residual modules on top of each other – each of these residual modules will learn K = 128 filters. The spatial dimensions of the volume will be reduced, and then we’ll move on to the second entry in stages, where we’ll stack four residual modules on top of each other, each responsible for learning K = 256 filters. After these four residual modules, we’ll again reduce dimensionality and move on to the final entry in the stages list, instructing us to stack six residual modules on top of each other, where each residual module will learn K = 512 filters.

The benefit of specifying both **stages** and **filters** in a list (rather than hardcoding them) is that we can easily leverage for loops to build very deep network architectures without introducing code bloat – this point will become more clear later in our implementation. For the sake of understanding, Table 1 of [He et al., 2015](https://arxiv.org/pdf/1512.03385.pdf) lists the different stages in ResNet. For the 152-layer model, stages would be (3, 8, 36, 3), whereas filters would be (64 \<conv1\>, 256, 512, 1024, 2048).

<img width="600" src="https://drive.google.com/uc?export=view&id=1D1IzM7a2BnP1TkX0Nq21eNADpWY5HXyX"/>
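
The way stages and filters map onto individual residual modules can be sketched in plain Python. The module_plan helper below is hypothetical and builds no layers; it only lists the (K, stride, red) arguments each module would receive, following the rule used by the build method that the first module of every stage after the first uses a stride of (2, 2):

```python
def module_plan(stages, filters):
    plan = []
    for i, num_modules in enumerate(stages):
        K = filters[i + 1]  # filters[0] belongs to the first CONV layer
        stride = (1, 1) if i == 0 else (2, 2)
        # the first module of each stage applies a CONV to the shortcut
        plan.append((K, stride, True))
        # the remaining modules keep the spatial dimensions unchanged
        plan.extend([(K, (1, 1), False)] * (num_modules - 1))
    return plan

plan = module_plan((3, 4, 6), (64, 128, 256, 512))
```

For stages=(3, 4, 6) this yields 13 residual modules: three with K = 128, four with K = 256, and six with K = 512.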

Finally, we have the dataset parameter, which is assumed to be a string. Depending on the dataset we are building ResNet for, we may want to apply more or fewer convolutions and batch normalizations before we start stacking our residual modules. We’ll see why we might want to vary the number of convolutional layers later, but for the time being, you can safely ignore this parameter.


Unlike previous network architectures we have seen in this course (where the first layer is typically a CONV), we see that ResNet uses a BN as the first layer. 

> The reasoning behind applying batch normalization to your input is an added level of normalization. 

In fact, **performing batch normalization on the input itself can sometimes remove the need to apply mean normalization to the inputs**. In either case, the BN on **Line 68** acts as an added level of normalization.

```python
		# set the input and apply BN
		inputs = Input(shape=inputShape)
		x = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(inputs)
```

From there, we apply a single CONV layer on **Lines 70 to 73**. This CONV layer will learn a total of filters[0], 3x3 filters (keep in mind that filters is a list, so this value is specified via the build method when constructing the architecture).

You’ll also notice that I’ve made a check to see if we are using the CIFAR-10 dataset (**Line 71**). Later in this lesson, we’ll explain the elif statement for Tiny ImageNet.
Since the input dimensions to Tiny ImageNet are larger, we’ll apply a series of convolutions, batch normalizations, and max pooling (the only max pooling in the ResNet architecture) before we start stacking residual modules. However, for the time being, we are only using the CIFAR-10 dataset.


Let’s go ahead and start stacking residual layers on top of each other, the cornerstone of the
ResNet architecture:

```python
84:		# loop over the number of stages
85:		for i in range(0, len(stages)):
86:			# initialize the stride, then apply a residual module
87:			# used to reduce the spatial size of the input volume
88:			stride = (1, 1) if i == 0 else (2, 2)
89:			x = ResNet.residual_module(x, filters[i + 1], stride, chanDim, red=True, reg=reg, bnEps=bnEps, bnMom=bnMom)
90:
91:			# loop over the number of layers in the stage
92:			for j in range(0, stages[i] - 1):
93:				# apply a ResNet module
94:				x = ResNet.residual_module(x, filters[i + 1], (1, 1), chanDim, reg=reg, bnEps=bnEps, bnMom=bnMom)
```

On **Line 85** we start looping over the list of stages. Keep in mind that every entry in the **stages list** is an integer, indicating how many residual modules will be stacked on top of each other. Following the work of [Springenberg et al.](https://www.arxiv-vanity.com/papers/1412.6806/), ResNet tries to reduce the usage of pooling as
much as possible, relying on CONV layers to reduce the spatial dimensions of a volume.

To reduce volume size without pooling layers, we must set the stride of the convolution on **Line 88**. If this is the first stage, we’ll set the stride to (1, 1), indicating that no downsampling should be performed. However, for every subsequent stage we’ll apply a residual module with a stride of (2, 2), which will allow us to decrease the volume size.

From there, we’ll loop over the number of layers in the current stage on **Line 92** (i.e., the number of residual modules that will be stacked on top of each other). The number of filters each residual module will learn is controlled by the corresponding entry in the filters list. The reason we use i + 1 as the index into filters is because the first filter value was used on **Line 73**. The rest of the filter values correspond to the number of filters in each stage. Once we have stacked stages[i] residual modules on top of each other, our for loop brings us back up to **Line 89** where we decrease the spatial dimensions of the volume and repeat the process.

At this point, our volume size has been reduced to **8 x 8 x num_filters** (you can verify this for yourself by computing the input/output volume sizes for each layer). In order to avoid using large fully-connected layers, we’ll instead apply average pooling to reduce the volume size to 1 x 1 x num_filters:

```python
		# apply BN => ACT => POOL
		x = BatchNormalization(axis=chanDim, epsilon=bnEps,momentum=bnMom)(x)
		x = Activation("relu")(x)
		x = AveragePooling2D((8, 8))(x)
```
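
The 8 x 8 figure can be checked with a few lines of arithmetic. This sketch assumes a 32 x 32 CIFAR-10 input, three stages, and the stride rule described above; the spatial_sizes helper is hypothetical:

```python
import math

def spatial_sizes(input_size, num_stages):
    # track the width/height of the volume after each stage
    sizes = [input_size]
    for i in range(num_stages):
        stride = 1 if i == 0 else 2
        sizes.append(math.ceil(sizes[-1] / stride))
    return sizes

cifar_sizes = spatial_sizes(32, 3)

# an 8x8 average pooling window over an 8x8 volume leaves 1x1
pooled = cifar_sizes[-1] // 8
```

With a different input size or number of stages, the final spatial size changes, so the (8, 8) pooling window would need to be adjusted accordingly.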

From there, we create a dense layer for the total number of classes we are going to learn, followed by applying a softmax activation to obtain our final output probabilities:

```python
		# softmax classifier
		x = Flatten()(x)
		x = Dense(classes, kernel_regularizer=l2(reg))(x)
		x = Activation("softmax")(x)

		# create the model
		model = Model(inputs, x, name="resnet")

		# return the constructed network architecture
		return model
```

In [None]:
# import the necessary packages
from tensorflow.keras.layers import BatchNormalization
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import AveragePooling2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import ZeroPadding2D
from tensorflow.keras.layers import Activation
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from tensorflow.keras.layers import add
from tensorflow.keras.regularizers import l2
from tensorflow.keras import backend as K

class ResNet:
	@staticmethod
	def residual_module(data, K, stride, chanDim, red=False, reg=0.0001, bnEps=2e-5, bnMom=0.9):
		# based on the pre-activation residual module
		# the shortcut branch of the ResNet module should be
		# initialized as the input (identity) data
		shortcut = data

		# the first block of the ResNet module are the 1x1 CONVs
		bn1 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(data)
		act1 = Activation("relu")(bn1)
		conv1 = Conv2D(int(K * 0.25), (1, 1), use_bias=False, kernel_regularizer=l2(reg))(act1)

		# the second block of the ResNet module are the 3x3 CONVs
		bn2 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(conv1)
		act2 = Activation("relu")(bn2)
		conv2 = Conv2D(int(K * 0.25), (3, 3), strides=stride, padding="same", use_bias=False, kernel_regularizer=l2(reg),)(act2)

		# the third block of the ResNet module is another set of 1x1
		# CONVs
		bn3 = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(conv2)
		act3 = Activation("relu")(bn3)
		conv3 = Conv2D(K, (1, 1), use_bias=False,kernel_regularizer=l2(reg))(act3)

		# if we are to reduce the spatial size, apply a CONV layer to
		# the shortcut
		if red:
			shortcut = Conv2D(K, (1, 1), strides=stride, use_bias=False, kernel_regularizer=l2(reg))(act1)

		# add together the shortcut and the final CONV
		x = add([conv3, shortcut])

		# return the addition as the output of the ResNet module
		return x

	@staticmethod
	def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
		# initialize the input shape to be "channels last" and the
		# channels dimension itself
		inputShape = (height, width, depth)
		chanDim = -1

		# if we are using "channels first", update the input shape
		# and channels dimension
		if K.image_data_format() == "channels_first":
			inputShape = (depth, height, width)
			chanDim = 1

		# set the input and apply BN
		inputs = Input(shape=inputShape)
		x = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(inputs)

		# check if we are utilizing the CIFAR dataset
		if dataset == "cifar":
			# apply a single CONV layer
			x = Conv2D(filters[0], (3, 3), use_bias=False, padding="same", kernel_regularizer=l2(reg))(x)

		# check to see if we are using the Tiny ImageNet dataset
		elif dataset == "tiny_imagenet":
			# apply CONV => BN => ACT => POOL to reduce spatial size
			x = Conv2D(filters[0], (5, 5), use_bias=False, padding="same", kernel_regularizer=l2(reg))(x)
			x = BatchNormalization(axis=chanDim, epsilon=bnEps, momentum=bnMom)(x)
			x = Activation("relu")(x)
			x = ZeroPadding2D((1, 1))(x)
			x = MaxPooling2D((3, 3), strides=(2, 2))(x)

		# loop over the number of stages
		for i in range(0, len(stages)):
			# initialize the stride, then apply a residual module
			# used to reduce the spatial size of the input volume
			stride = (1, 1) if i == 0 else (2, 2)
			# pass reg through so the modules use the requested weight decay
			x = ResNet.residual_module(x, filters[i + 1], stride, chanDim, red=True, reg=reg, bnEps=bnEps, bnMom=bnMom)

			# loop over the number of layers in the stage
			for j in range(0, stages[i] - 1):
				# apply a ResNet module
				x = ResNet.residual_module(x, filters[i + 1], (1, 1), chanDim, reg=reg, bnEps=bnEps, bnMom=bnMom)

		# apply BN => ACT => POOL
		x = BatchNormalization(axis=chanDim, epsilon=bnEps,momentum=bnMom)(x)
		x = Activation("relu")(x)
		x = AveragePooling2D((8, 8))(x)

		# softmax classifier
		x = Flatten()(x)
		x = Dense(classes, kernel_regularizer=l2(reg))(x)
		x = Activation("softmax")(x)

		# create the model
		model = Model(inputs, x, name="resnet")

		# return the constructed network architecture
		return model

In [None]:
model = ResNet.build(32, 32, 3, 10, (9, 9, 9),(64, 64, 128, 256), reg=0.0005)
model.summary()

# 4.0 ResNet on CIFAR-10

Outside of training smaller variants of ResNet on the full ImageNet dataset, I had never attempted to train ResNet on CIFAR-10 (or Stanford’s Tiny ImageNet challenge, as we’ll see in this section). Because of this fact, I have decided to treat this section and the next as candid case studies where I reveal my personal rules of thumb and best practices I have mentioned in the previous lessons. These best practices allow me to approach a new problem with an initial plan, iterate on it, and eventually arrive at a solution that obtains good accuracy. In the case of CIFAR-10, we’ll be able to replicate the performance of **He et al.** and claim a spot amongst other [state-of-the-art approaches](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130).

## 4.1 Useful functions

In [None]:
# import the necessary packages
from tensorflow.keras.callbacks import BaseLogger
import matplotlib.pyplot as plt
import numpy as np
import json
import os

class TrainingMonitor(BaseLogger):
	def __init__(self, figPath, jsonPath=None, startAt=0):
		# store the output path for the figure, the path to the JSON
		# serialized file, and the starting epoch
		super(TrainingMonitor, self).__init__()
		self.figPath = figPath
		self.jsonPath = jsonPath
		self.startAt = startAt

	def on_train_begin(self, logs={}):
		# initialize the history dictionary
		self.H = {}

		# if the JSON history path exists, load the training history
		if self.jsonPath is not None:
			if os.path.exists(self.jsonPath):
				# read the history and close the file when done
				with open(self.jsonPath) as f:
					self.H = json.load(f)

				# check to see if a starting epoch was supplied
				if self.startAt > 0:
					# loop over the entries in the history log and
					# trim any entries that are past the starting
					# epoch
					for k in self.H.keys():
						self.H[k] = self.H[k][:self.startAt]

	def on_epoch_end(self, epoch, logs={}):
		# loop over the logs and update the loss, accuracy, etc.
		# for the entire training process
		for (k, v) in logs.items():
			l = self.H.get(k, [])
			l.append(float(v))
			self.H[k] = l

		# check to see if the training history should be serialized
		# to file
		if self.jsonPath is not None:
			f = open(self.jsonPath, "w")
			f.write(json.dumps(self.H))
			f.close()

		# ensure at least two epochs have passed before plotting
		# (epoch starts at zero)
		if len(self.H["loss"]) > 1:
			# plot the training loss and accuracy
			N = np.arange(0, len(self.H["loss"]))
			plt.style.use("ggplot")
			plt.figure()
			plt.plot(N, self.H["loss"], label="train_loss")
			plt.plot(N, self.H["val_loss"], label="val_loss")
			plt.plot(N, self.H["accuracy"], label="train_acc")
			plt.plot(N, self.H["val_accuracy"], label="val_acc")
			plt.title("Training Loss and Accuracy [Epoch {}]".format(
				len(self.H["loss"])))
			plt.xlabel("Epoch #")
			plt.ylabel("Loss/Accuracy")
			plt.legend()

			# save the figure
			plt.savefig(self.figPath)
			plt.close()
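
The trimming step in on_train_begin is easy to miss, so here is a standalone sketch of what it does when training is restarted from an earlier epoch. The trim_history helper and the loss values are hypothetical and only illustrate the slicing:

```python
def trim_history(history, start_at):
    # keep only the entries recorded before the restart epoch
    if start_at <= 0:
        return history
    return {key: values[:start_at] for key, values in history.items()}

# made-up history with four completed epochs
history = {
    "loss": [2.0, 1.5, 1.2, 1.1],
    "val_loss": [2.1, 1.7, 1.5, 1.6],
}

# restarting from epoch 3 discards everything logged after it
trimmed = trim_history(history, 3)
```

Without this step, restarting from an older checkpoint would leave stale entries in the log and the plotted curves would not line up with the epochs actually trained.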

In [None]:
# import the necessary packages
from tensorflow.keras.callbacks import Callback
import os

class EpochCheckpoint(Callback):
	def __init__(self, outputPath, every=5, startAt=0):
		# call the parent constructor
		super(EpochCheckpoint, self).__init__()

		# store the base output path for the model, the number of
		# epochs that must pass before the model is serialized to
		# disk and the current epoch value
		self.outputPath = outputPath
		self.every = every
		self.intEpoch = startAt

	def on_epoch_end(self, epoch, logs={}):
		# check to see if the model should be serialized to disk
		if (self.intEpoch + 1) % self.every == 0:
			p = os.path.sep.join([self.outputPath,
				"epoch_{}.hdf5".format(self.intEpoch + 1)])
			self.model.save(p, overwrite=True)

		# increment the internal epoch counter
		self.intEpoch += 1
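
To see which epochs end up on disk, the counter logic of this callback can be replayed without Keras. The checkpoint_epochs helper below is a hypothetical stand-in that mirrors the every and startAt arguments:

```python
def checkpoint_epochs(num_epochs, every=5, start_at=0):
    # replay the internal epoch counter and record each save
    saved = []
    int_epoch = start_at
    for _ in range(num_epochs):
        if (int_epoch + 1) % every == 0:
            saved.append(int_epoch + 1)
        int_epoch += 1
    return saved

# a fresh run of 12 epochs
fresh_run = checkpoint_epochs(12)

# resuming at epoch 10 and training for 10 more epochs
resumed_run = checkpoint_epochs(10, every=5, start_at=10)
```

Passing the epoch you resumed from as startAt keeps the file names consistent with the true epoch count across restarts.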

In [None]:
# create folders to store results and metadata
!mkdir -p output/checkpoints

In [None]:
#
# import libraries, load dataset and pre-processing
#

# set the matplotlib backend so figures can be saved in the background
import matplotlib
matplotlib.use("Agg")

# import the necessary packages
from sklearn.preprocessing import LabelBinarizer
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.models import load_model
from tensorflow.keras.callbacks import LearningRateScheduler
import tensorflow.keras.backend as K
import numpy as np
import sys

# load the training and testing data, converting the images from
# integers to floats
print("[INFO] loading CIFAR-10 data...")
((train_x, train_y), (test_x, test_y)) = cifar10.load_data()
train_x = train_x.astype("float")
test_x = test_x.astype("float")

# apply mean subtraction to the data
mean = np.mean(train_x, axis=0)
train_x -= mean
test_x -= mean

# convert the labels from integers to vectors
lb = LabelBinarizer()
train_y = lb.fit_transform(train_y)
test_y = lb.transform(test_y)

# construct the image generator for data augmentation
aug = ImageDataGenerator(width_shift_range=0.1, 
                         height_shift_range=0.1, 
                         horizontal_flip=True,
                         fill_mode="nearest")

## 4.2 Training ResNet on CIFAR-10 With the ctrl + c Method

Whenever I start a new set of experiments with either a network architecture I am unfamiliar with, a dataset I have never worked with, or both, I always begin with the ctrl + c method of training. 

Using this method:

1. I can start training with an initial learning rate (and associated set of hyperparameters), 
2. monitor training, 
3. and quickly adjust the learning rate based on the results as they come in. 

This method is especially helpful when I am unsure of the approximate number of epochs it will take for a given architecture to obtain reasonable accuracy on a specific dataset.

In the case of CIFAR-10, I have previous experience (as do you, after studying the previous lessons), so I’m quite confident that it will take 60-100 epochs, but I’m not exactly sure since I’ve never trained ResNet on CIFAR-10 before.

Therefore, our first few experiments will rely on the ctrl + c method of training to narrow in on what hyperparameters we should be using. Once we are comfortable with our set of hyperparameters, we’ll switch over to a specific learning rate decay schedule in hopes of milking every last bit of accuracy out of the training process.

**To start**, take a look at the **learning rate** for our SGD optimizer in the code cell below (`opt = SGD(learning_rate=1e-1)`). At $1e-1$, this learning rate is by far the largest we have used in this course (by an order of magnitude). We are able to get away with such a high learning rate because of the identity mappings built into the residual module. Learning rates this high would not (typically) work for networks such as AlexNet, VGG, etc.

We then **instantiate our ResNet model** with the call to `ResNet.build`. Here we can see that the network will accept input images with a width of 32 pixels, height of 32 pixels, and depth of 3 (one for each of the RGB channels in the CIFAR-10 dataset). Since the **CIFAR-10 dataset has ten classes**, we’ll learn ten output labels.

The next parameter we need to supply is (9, 9, 9), which specifies the stages in our architecture. This tuple indicates that we will be learning three stages, with each stage containing nine residual modules stacked on top of each other. In between each stage, we will apply an additional residual module to decrease the volume size.

The next parameter, (64, 64, 128, 256), is the number of filters that the CONV layers will learn. The first CONV layer (before any residual module is applied) will learn K = 64 filters. The remaining entries, 64, 128, and 256, correspond to the number of filters each of the residual module stages will learn. For example, the first nine residual modules will learn K = 64 filters. The second set of nine residual modules will learn K = 128 filters. And finally, the last set of nine residual modules will learn K = 256 filters.

The last argument we’ll supply to ResNet is reg, our L2 regularization strength for weight decay. This value is crucial as it helps us limit overfitting.
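
To make the pairing between the two tuples concrete, here is a minimal sketch in plain Python. The helper `describe_stages` is hypothetical (it is not part of the `ResNet` class used in this lesson); it only lines up each stage with its filter count as described above and does not account for the extra dimension-reducing modules between stages.

```python
# Hypothetical helper for illustration only: pair each stage with the
# number of filters (K) its residual modules learn.
def describe_stages(stages, filters):
    # filters[0] belongs to the first CONV layer, applied before any
    # residual module
    summary = [("initial CONV", 1, filters[0])]

    # the remaining filter counts map one-to-one onto the stages
    for i, (num_modules, k) in enumerate(zip(stages, filters[1:])):
        summary.append(("stage {}".format(i + 1), num_modules, k))

    return summary


for name, count, k in describe_stages((9, 9, 9), (64, 64, 128, 256)):
    print("{}: {} x K={}".format(name, count, k))
```

Note that the `filters` tuple is always one entry longer than `stages`, since its first entry is reserved for the initial CONV layer.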

In [None]:
# In the case that we need to restart training from a particular epoch
start_epoch = 0

# Final epoch index to train up to (passed to fit as `epochs`); when
# resuming, training runs from start_epoch up to this value
number_epochs = 10

# Set to True to resume training from a saved checkpoint
resume = False

if not resume:
  # if there is no specific model checkpoint supplied, then initialize
  # the network (ResNet-XX) and compile the model
  print("[INFO] compiling model...")

  opt = SGD(learning_rate=1e-1)

  # def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
  model = ResNet.build(32, 32, 3, 10, (9, 9, 9),(64, 64, 128, 256), reg=0.0005)
  model.compile(loss="categorical_crossentropy",
                optimizer=opt,
                metrics=["accuracy"])
else:
  # otherwise, load the checkpoint from disk
  print("[INFO] loading {}...".format("output/checkpoints/epoch_10.hdf5"))
  model = load_model("output/checkpoints/epoch_10.hdf5")

  # update the learning rate
  print("[INFO] old learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))
  K.set_value(model.optimizer.learning_rate, 1e-2)
  print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))


# construct the set of callbacks
callbacks = [EpochCheckpoint("output/checkpoints", 
                             every=5,
                             startAt=start_epoch),
             TrainingMonitor("output/resnetXX_cifar10.png",
                             jsonPath="output/resnetXX_cifar10.json",
                             startAt=start_epoch)]

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=128),
          validation_data=(test_x, test_y),
          steps_per_epoch=len(train_x) // 128, 
          epochs=number_epochs,
          callbacks=callbacks,
          verbose=1,
          initial_epoch=start_epoch)

## 4.3 ResNet on CIFAR-10: Experiment 01

In my very first experiment with CIFAR-10, I was worried about the number of filters in the network, especially regarding overfitting. Because of this concern, my initial filter list consisted of (64, 64, 128, 256) along with (9, 9, 9) stages of residual modules. I also applied a very small amount of L2 regularization with reg=0.0001. I knew regularization would be needed, but I wasn’t yet sure of the correct amount. ResNet was trained using SGD with a base learning rate of $1e-1$ and a momentum term of 0.9.

In [None]:
# In the case that we need to restart training from a particular epoch
start_epoch = 0

# Final epoch index to train up to (passed to fit as `epochs`); when
# resuming, training runs from start_epoch up to this value
number_epochs = 50

# Set to True to resume training from a saved checkpoint
resume = False

if not resume:
  # if there is no specific model checkpoint supplied, then initialize
  # the network (ResNet-XX) and compile the model
  print("[INFO] compiling model...")

  # SGD with momentum 0.9 and reg=0.0001, as described for this experiment
  opt = SGD(learning_rate=1e-1, momentum=0.9)

  # def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
  model = ResNet.build(32, 32, 3, 10, (9, 9, 9), (64, 64, 128, 256), reg=0.0001, bnMom=0.9)
  model.compile(loss="categorical_crossentropy",
                optimizer=opt,
                metrics=["accuracy"])
else:
  # otherwise, load the checkpoint from disk
  print("[INFO] loading {}...".format("output/checkpoints/epoch_50.hdf5"))
  model = load_model("output/checkpoints/epoch_50.hdf5")

  # update the learning rate
  print("[INFO] old learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))
  K.set_value(model.optimizer.learning_rate, 1e-2)
  print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))


# construct the set of callbacks
callbacks = [EpochCheckpoint("output/checkpoints", 
                             every=5,
                             startAt=start_epoch),
             TrainingMonitor("output/resnetExp01_cifar10.png",
                             jsonPath="output/resnetExp01_cifar10.json",
                             startAt=start_epoch)]

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=128),
          validation_data=(test_x, test_y),
          steps_per_epoch=len(train_x) // 128, 
          epochs=number_epochs,
          callbacks=callbacks,
          verbose=1,
          initial_epoch=start_epoch)

<font color="red">Past epoch 50</font> I noticed the decrease in training loss starting to slow, as well as some volatility in the validation loss (and a growing gap between the two).

<img width="600" src="https://drive.google.com/uc?export=view&id=1Nx2NldT_rHWM6fn4glg3nGjxVtelNpUJ"/>
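
Resuming only works if a checkpoint exists for the epoch where training was stopped. As a quick sanity check, the sketch below mirrors the rule used in `EpochCheckpoint.on_epoch_end` (a file is written when `(intEpoch + 1) % every == 0`) to list which checkpoint files a run should produce. The helper name `saved_epochs` is hypothetical and is not part of the callback itself.

```python
# Hypothetical helper: list the epochs at which EpochCheckpoint would
# serialize the model, given its startAt / every settings.
def saved_epochs(start_at, final_epoch, every=5):
    # same test as in on_epoch_end: (intEpoch + 1) % every == 0
    return [e + 1 for e in range(start_at, final_epoch)
            if (e + 1) % every == 0]


# first run of this experiment: epochs 0..49 with every=5
names = ["epoch_{}.hdf5".format(e) for e in saved_epochs(0, 50)]
print(names[-1])  # epoch_50.hdf5
```

Because the stopping points used in this experiment are multiples of `every`, the checkpoint needed for each restart is available.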

I stopped training, lowered the learning rate to $1e-2$, and then continued training:

In [None]:
# In the case that we need to restart training from a particular epoch
start_epoch = 50

# Final epoch index to train up to (passed to fit as `epochs`); when
# resuming, training runs from start_epoch up to this value
number_epochs = 75

# Set to True to resume training from a saved checkpoint
resume = True

if not resume:
  # if there is no specific model checkpoint supplied, then initialize
  # the network (ResNet-XX) and compile the model
  print("[INFO] compiling model...")

  opt = SGD(learning_rate=1e-2)

  # def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
  model = ResNet.build(32, 32, 3, 10, (9, 9, 9),(64, 64, 128, 256), reg=0.0005,bnMom=0.9)
  model.compile(loss="categorical_crossentropy",
                optimizer=opt,
                metrics=["accuracy"])
else:
  # otherwise, load the checkpoint from disk
  print("[INFO] loading {}...".format("output/checkpoints/epoch_50.hdf5"))
  model = load_model("output/checkpoints/epoch_50.hdf5")

  # update the learning rate
  print("[INFO] old learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))
  K.set_value(model.optimizer.learning_rate, 1e-2)
  print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))


# construct the set of callbacks
callbacks = [EpochCheckpoint("output/checkpoints", 
                             every=5,
                             startAt=start_epoch),
             TrainingMonitor("output/resnetExp01_cifar10.png",
                             jsonPath="output/resnetExp01_cifar10.json",
                             startAt=start_epoch)]

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=128),
          validation_data=(test_x, test_y),
          steps_per_epoch=len(train_x) // 128, 
          epochs=number_epochs,
          callbacks=callbacks,
          verbose=1,
          initial_epoch=start_epoch)

The drop in learning rate proved very effective, stabilizing validation loss, but overfitting on the training set also started to creep in (an inevitability when working with CIFAR-10) around epoch 75.

<img width="600" src="https://drive.google.com/uc?export=view&id=1_BI0gw4tPuAkQOBRqnR2pCCCTuj5XZex"/>

After epoch 75 I once again stopped training, lowered the learning rate to
$1e-3$, and allowed ResNet to continue training for another 10 epochs:

In [None]:
# In the case that we need to restart training from a particular epoch
start_epoch = 75

# Final epoch index to train up to (passed to fit as `epochs`); when
# resuming, training runs from start_epoch up to this value
number_epochs = 85

# Set to True to resume training from a saved checkpoint
resume = True

if not resume:
  # if there is no specific model checkpoint supplied, then initialize
  # the network (ResNet-XX) and compile the model
  print("[INFO] compiling model...")

  opt = SGD(learning_rate=1e-3)

  # def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
  model = ResNet.build(32, 32, 3, 10, (9, 9, 9),(64, 64, 128, 256), reg=0.0005,bnMom=0.9)
  model.compile(loss="categorical_crossentropy",
                optimizer=opt,
                metrics=["accuracy"])
else:
  # otherwise, load the checkpoint from disk
  print("[INFO] loading {}...".format("output/checkpoints/epoch_75.hdf5"))
  model = load_model("output/checkpoints/epoch_75.hdf5")

  # update the learning rate
  print("[INFO] old learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))
  K.set_value(model.optimizer.learning_rate, 1e-3)
  print("[INFO] new learning rate: {}".format(K.get_value(model.optimizer.learning_rate)))


# construct the set of callbacks
callbacks = [EpochCheckpoint("output/checkpoints", 
                             every=5,
                             startAt=start_epoch),
             TrainingMonitor("output/resnetExp01_cifar10.png",
                             jsonPath="output/resnetExp01_cifar10.json",
                             startAt=start_epoch)]

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=128),
          validation_data=(test_x, test_y),
          steps_per_epoch=len(train_x) // 128, 
          epochs=number_epochs,
          callbacks=callbacks,
          verbose=1,
          initial_epoch=start_epoch)

The final plot is shown below, where we reach 90.40% accuracy on the validation set. For our very first experiment, 90.40% is a good start; however, it’s not as high as the 90.81% achieved by GoogLeNet in the previous lesson. Furthermore, He et al. reported an accuracy of 93% with ResNet on CIFAR-10, so we clearly have some work to do.

<img width="600" src="https://drive.google.com/uc?export=view&id=1rUbzGpeMFTWwMhh9RXC7f2i61HjaXdw2"/>
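
Taken together, the manual drops in this experiment amount to a piecewise-constant learning rate schedule. The sketch below writes that schedule as a function, assuming the drops happen exactly at epochs 50 and 75 as described above. The name `step_schedule` is hypothetical; the experiment itself applied these changes by hand with ctrl + c rather than through a scheduler.

```python
# Hypothetical summary of the ctrl + c drops used in Experiment 01.
def step_schedule(epoch):
    # epoch is zero-indexed, the convention Keras uses when it calls
    # a schedule function
    if epoch < 50:
        return 1e-1
    if epoch < 75:
        return 1e-2
    return 1e-3


# print the learning rate on either side of each drop
for e in (0, 49, 50, 74, 75, 84):
    print("epoch {:>2}: lr = {}".format(e, step_schedule(e)))
```

A function of this shape could be passed to Keras's `LearningRateScheduler` callback to reproduce the same drops in one uninterrupted run, although that variant was not part of this experiment.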


## 4.4 Training ResNet on CIFAR-10 with Learning Rate Decay: Experiment 02

At this point, it seems that we have gotten as far as we can using standard ctrl + c training. We’ve also been able to see that our most successful experiments occur when we can train for longer, in the range of 85-100 epochs. However, there are two major problems we need to overcome:

1. Whenever we drop the learning rate by an order of magnitude and restart training, we obtain a nice bump in accuracy, but then we quickly plateau.
2. We are overfitting.

To solve these problems, and boost accuracy further, a good experiment to try is linearly
decreasing the learning rate over a large number of epochs, typically about the same as your longest ctrl + c experiments (if not slightly longer).
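
Written out, the decay computed by the `poly_decay` function can be expressed as follows. The symbols are my own notation: $\alpha_0$ is the initial learning rate, $e$ the zero-indexed epoch, $E$ the total number of epochs, and $p$ the power of the polynomial.

$$\alpha_e = \alpha_0 \left(1 - \frac{e}{E}\right)^{p}$$

With $\alpha_0 = 0.1$, $E = 100$, and $p = 1$ (a linear decay), this gives $\alpha_0 = 0.1$ at the first epoch, $\alpha_{50} = 0.05$ halfway through, and $\alpha_{99} = 0.001$ for the final epoch.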

In [None]:
# define the total number of epochs to train for along with the
# initial learning rate
NUM_EPOCHS = 100
INIT_LR = 1e-1

def poly_decay(epoch):
	# initialize the maximum number of epochs, base learning rate,
	# and power of the polynomial
	maxEpochs = NUM_EPOCHS
	baseLR = INIT_LR
	power = 1.0

	# compute the new learning rate based on polynomial decay
	alpha = baseLR * (1 - (epoch / float(maxEpochs))) ** power

	# return the new learning rate
	return alpha

callbacks = [TrainingMonitor("output/resnetExp02_cifar10.png", jsonPath="output/resnetExp02_cifar10.json"),
             LearningRateScheduler(poly_decay)]

# initialize the optimizer and model (ResNet-XX)
print("[INFO] compiling model...")
opt = SGD(learning_rate=INIT_LR, momentum=0.9)
# def build(width, height, depth, classes, stages, filters, reg=0.0001, bnEps=2e-5, bnMom=0.9, dataset="cifar"):
model = ResNet.build(32, 32, 3, 10, (9, 9, 9), (64, 64, 128, 256), reg=0.0005)
model.compile(loss="categorical_crossentropy", optimizer=opt,metrics=["accuracy"])

# train the network
print("[INFO] training network...")
model.fit(aug.flow(train_x, train_y, batch_size=128),
          validation_data=(test_x, test_y),
          steps_per_epoch=len(train_x) // 128, epochs=NUM_EPOCHS,
          callbacks=callbacks, verbose=1)

# save the network to disk
print("[INFO] serializing network...")
model.save("epoch_100.hdf5")

After the 100th epoch, ResNet reaches 93.55% accuracy on our testing set. This result is substantially higher than our previous experiment, and more importantly, it has allowed us to replicate the results from He et al. when training ResNet on CIFAR-10.
Taking a look at the [CIFAR-10 leaderboard](http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html#43494641522d3130), we see that He et al. reached 93.57% accuracy, nearly identical to our result.