### ResNet

Here, we build ResNet, the architecture that won the 2015 ImageNet challenge. We recreate Version 2, which reorders certain layers to improve performance. 

Links: 

ResNet, version 1: https://arxiv.org/pdf/1512.03385.pdf

ResNet, version 2: https://arxiv.org/pdf/1603.05027.pdf

ImageNet: https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge

Andrew Ng's explanation of ResNet: https://www.youtube.com/watch?v=ZILIbUvp5lk

#### Obstacles in deep learning

Since deep learning models have a lot of parameters, they take a long time to train. Moreover, because deep models extract more hidden features, they have a higher risk of overfitting. There are, however, tricks for working around these problems. 

ResNet was designed to resolve a different problem, known as "degradation".

#### Degradation

After AlexNet in 2012, researchers found that adding more layers to a CNN improved its performance, and they began to build models with double-digit layer counts. However, past a certain depth, adding more layers ran into a problem called degradation. 

Degradation of deep neural networks refers to the plateau, then the rapid decline, of model accuracy as model depth is increased. Notably, this isn't overfitting - the decrease in accuracy is evident in both the training and test sets.


#### ResNet

ResNet was designed to solve the problem of degradation. Similar to SqueezeNet, which had a main module (the Fire module), ResNet also has a building block. Each building block incorporates "residual learning", which counteracts degradation. 

Using the ResNet architecture, we overcome the degradation concerns that come with adding more layers - the ResNet model that won the 2015 ImageNet challenge had 152 layers. 

#### ImageNet

The dataset used in the ImageNet challenge contains roughly 1.4 million images, with 1.2 million in the training set, 50,000 in the validation set, and 100,000 in the (unreleased) test set. The images are distributed across 1,000 categories. 

#### Image processing

The ImageNet images have varying heights and widths. We resize each image so that its smaller dimension equals some fixed value, while maintaining the image's aspect ratio. 
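As a rough illustration of the resizing rule, the sketch below computes the new height and width for an image. The helper name `resized_dims` is hypothetical, and the rounding of the longer side is an assumption; the real pipeline would pass these dimensions to an image resizing routine.

```python
def resized_dims(height, width, min_dim):
    """Return (new_height, new_width) with the smaller side set to min_dim.

    The longer side is scaled by the same factor (rounded to an integer),
    so the aspect ratio is preserved up to rounding.
    """
    if height <= width:
        return min_dim, round(width * min_dim / height)
    return round(height * min_dim / width), min_dim


# A 500 x 375 image resized so that its smaller side is 256 pixels.
print(resized_dims(500, 375, 256))  # (341, 256)
```

The smaller side is assigned directly rather than computed by multiplication, which avoids any floating-point drift on the dimension that has to match exactly.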

### Overview of ResNet function

(from Andrew Ng's YouTube video)

The key intuition behind ResNet is that it uses past information and adds it to layers deeper in the network.

In a normal CNN, you take the output from one convolutional layer and pass it into another, and it proceeds in that feedforward manner. 

However, this leads to the problem of "degradation" as you increase the number of layers. 

To get around this issue, you can use "residual information", or information from earlier in the network, and incorporate that into your subsequent blocks (known as a "skip connection" in ResNet parlance). This creates what are known as "residual blocks": chunks of layers where the information at the beginning of the block is both fed into the next layer and added back in further downstream, at a deeper layer. 

### Initialization

#### Model overview

The overall model contains four layers of blocks. The number of blocks per layer depends on how large we want the ResNet model to be. 

The 'block_layer_sizes' dictionary gives the number of blocks for each of the four block layers, based on the total number of weight layers in the ResNet model. 

We have six options for how many weight layers (a weight layer is either a convolution layer or the fully-connected layer) are in a ResNet model: 18, 34, 50, 101, 152, or 200. When the number of layers is less than 50, the model uses a regular building block. When the number of layers is 50 or more, the model uses a bottleneck block. 

#### Filters and strides

The number of filters used in the model's convolution layers starts off at 64. In each subsequent block layer, the number of filters is doubled (deeper layers benefit from more filters). 

Most of the blocks in ResNet use a stride size of 1 for their convolution layers. Only the first block of each block layer can use a different stride size (set by self.block_strides).
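The filter doubling and the per-layer strides can be tabulated with a short sketch. The helper `block_layer_config` is a hypothetical name used only for illustration; the starting value of 64 and the strides `[1, 2, 2, 2]` are the ones the model's `__init__` sets.

```python
def block_layer_config(filters_initial, block_strides):
    """Pair each block layer's filter count with its first-block stride."""
    return [(filters_initial * 2 ** i, strides)
            for i, strides in enumerate(block_strides)]


config = block_layer_config(64, [1, 2, 2, 2])
for index, (filters, strides) in enumerate(config):
    print('block layer {}: {} filters, first-block stride {}'.format(
        index, filters, strides))
```

This gives 64, 128, 256, and 512 filters for the four block layers, with only the first block layer keeping a stride of 1 in its first block.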

#### Channel placement

Past models that we used implemented the NHWC format, where channels are specified last. But TensorFlow also supports NCHW, or channels-first placement. The NCHW format is optimized for GPU training (while NHWC works better for CPUs), so we create our model to allow for both formats. 

When we initialize our model, the channel format ('channels_first' or 'channels_last') is specified through the 'data_format' argument.
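To make the two layouts concrete, here is a minimal sketch of how the channel axis and the input shape depend on 'data_format'. The helper names `channel_axis` and `input_shape` are hypothetical, and the batch size of 32 and image size of 224 are assumed values for the example.

```python
def channel_axis(data_format):
    """Index of the channels dimension for a 4-D batch of images."""
    return 1 if data_format == 'channels_first' else 3


def input_shape(batch_size, resize_dim, data_format, channels=3):
    """Shape of an image batch in NCHW or NHWC order."""
    if data_format == 'channels_first':
        return (batch_size, channels, resize_dim, resize_dim)
    return (batch_size, resize_dim, resize_dim, channels)


for data_format in ('channels_last', 'channels_first'):
    shape = input_shape(32, 224, data_format)
    print(data_format, shape, 'channel axis =', channel_axis(data_format))
```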

In [None]:
"""
# creating a ResNetModel class. 
# completing the initialization of the class
# setting up parts specific to a ResNet model: blocks per layer and bottleneck usage

import tensorflow as tf

block_layer_sizes = {
    18: [2, 2, 2, 2],
    34: [3, 4, 6, 3],
    50: [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
    200: [3, 24, 36, 3]
}

class ResNetModel(object):
    # Model Initialization
    def __init__(self, min_aspect_dim, resize_dim, num_layers, output_size,
        data_format='channels_last'):
        self.min_aspect_dim = min_aspect_dim
        self.resize_dim = resize_dim
        self.filters_initial = 64
        self.block_strides = [1, 2, 2, 2]
        self.data_format = data_format
        self.output_size = output_size
        self.block_layer_sizes = block_layer_sizes[num_layers]
        self.bottleneck = num_layers >= 50
"""

### Padding

We can create custom padding layers to maintain consistency in padding.

With TensorFlow's built-in 'same' padding, the amount of padding depends on the kernel size, stride size, and input height/width. Since the images come in as different sizes and each layer has different configurations, this leads to inconsistent padding amounts at different layers of our model. 

When we pad our data for a convolution layer, we want the padding to be solely based on the size of the kernel. 
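The difference can be checked with plain arithmetic. The sketch below compares the total padding that 'same' padding adds along one dimension with padding based only on the kernel size. Both helper names are hypothetical, and `same_padding_total` is a restatement of the documented 'same' padding rule rather than a call into TensorFlow.

```python
import math


def same_padding_total(input_size, kernel_size, strides):
    """Total padding 'same' mode adds along one dimension."""
    output_size = math.ceil(input_size / strides)
    return max((output_size - 1) * strides + kernel_size - input_size, 0)


def kernel_padding_total(kernel_size):
    """Total padding when it depends only on the kernel size."""
    return kernel_size - 1


# 3 x 3 kernel with stride 2, on an even-sized and an odd-sized input.
for input_size in (224, 225):
    print(input_size,
          same_padding_total(input_size, 3, 2),
          kernel_padding_total(3))
```

With 'same' padding the total changes from 1 to 2 as the input goes from 224 to 225, while the kernel-based amount stays at 2 in both cases.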

In [None]:
"""
def custom_padding(self, inputs, kernel_size):
    pad_total = kernel_size - 1
    pad_before = pad_total // 2
    pad_after = pad_total - pad_before
    if self.data_format == 'channels_first':
        padded_inputs = tf.pad(
            inputs,
            [[0, 0], [0, 0], [pad_before, pad_after], [pad_before, pad_after]])
    else:
        padded_inputs = tf.pad(
            inputs,
            [[0, 0], [pad_before, pad_after], [pad_before, pad_after], [0, 0]])
    return padded_inputs
"""

### Pre-activation


Batch normalization and pre-activation are important preprocessing steps in ResNet.

#### Internal covariate shift

When training a model with many weight layers, a problem known as "internal covariate shift" occurs. To understand "internal covariate shift", we first need to understand "covariate shift".

A covariate shift occurs when the input data's distribution changes and the model cannot handle the change properly. For example, if the model was trained to classify between different dog breeds, with a training set of only brown dogs, the model won't perform well on images of yellow dogs. Here, the model's original input distribution was limited to just brown dogs, and changing the input distribution to a different color of dogs introduced covariate shift. 

An internal covariate shift is just a covariate shift that happens between layers of a model. Since the input of one layer is the output of the previous layer, the input distribution for a layer is the same as the output distribution of the previous layer.

Since the output distribution of a layer depends on its weights, and the weights of a model are constantly being updated, each layer's output distribution will constantly change (albeit by incremental amounts). In a model with many layers, the incremental changes eventually add up, which leads to internal covariate shift at deeper layers. 

#### Batch normalization

The solution to internal covariate shift is batch normalization. Since internal covariate shift is caused by the input distribution of each layer changing during training, we can fix it by enforcing a fixed distribution on the inputs of each layer. 

Batch normalization standardizes the inputs (mean = 0, sd = 1) by subtracting the mean from the inputs and dividing by the standard deviation. In doing so, we ensure that for each layer, the input distribution is normalized, regardless of what happens upstream in the network.

Batch normalization is applied across a specific dimension of the data. For CNNs, we apply it across the channels dimension, so we standardize the values for each channel of the input data. For example, we take all the values for a given input channel across the batch, find their mean and variance, and standardize those values using that channel's own mean and variance.

For some layers, we might not want a standardized distribution of the inputs (we may want a different mean or variance). Batch normalization accommodates this by letting us change the mean and variance of the distribution: 

$$BN(\textbf{x}) = \gamma * \textbf{x'} + \beta$$

Where $\textbf{x'}$ is the standardized input data. 

Since $\gamma$ and $\beta$ are trainable, the model will fine-tune their values for each normalization layer. If $\gamma$ is set to the original standard deviation and $\beta$ is set to the original mean of the inputs, batch normalization will produce the original input data values (since it undoes the standardization process).
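The formula can be tried on the values of a single channel. The sketch below is a simplified, pure-Python illustration rather than the TensorFlow layer: the function name `batch_norm` is hypothetical, and `eps` defaults to 0 so that the round trip is exact, whereas real implementations add a small epsilon for numerical stability.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=0.0):
    """Standardize one channel's values, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)
    return [gamma * (v - mean) / std + beta for v in values]


# All values of one channel across a (tiny) batch: mean 5, variance 5.
channel = [2.0, 4.0, 6.0, 8.0]

standardized = batch_norm(channel)
print(standardized)

# Setting gamma to the original sd and beta to the original mean
# undoes the standardization.
restored = batch_norm(channel, gamma=math.sqrt(5.0), beta=5.0)
print(restored)
```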

#### Pre-activation

When we use batch normalization, we apply it right before an activation function. 

Normally, the activation function in CNNs comes after a convolution layer, so this is known as post-activation. 

Version 1 of ResNet used the traditional method of post-activation. Its convolution layers would be followed by batch normalization, then ReLU activation. 

But, Version 2 of ResNet applied batch normalization and activation before the convolution layer, which is known as pre-activation. 



In [None]:
"""
import tensorflow as tf

# block_layer_sizes loaded in backend

class ResNetModel(object):
    # Model Initialization
    def __init__(self, min_aspect_dim, resize_dim, num_layers, output_size,
        data_format='channels_last'):
        self.min_aspect_dim = min_aspect_dim
        self.resize_dim = resize_dim
        self.filters_initial = 64
        self.block_strides = [1, 2, 2, 2]
        self.data_format = data_format
        self.output_size = output_size
        self.block_layer_sizes = block_layer_sizes[num_layers]
        self.bottleneck = num_layers >= 50
    
    # Applies pre-activation to the inputs
    def pre_activation(self, inputs, is_training):
        # does pre-activation step, before convolution step
        if self.data_format == 'channels_first':
            axis = 1
        else:
            axis = 3
        bn_inputs = tf.layers.batch_normalization(inputs, axis = axis, training = is_training)
        pre_activated_inputs = tf.nn.relu(bn_inputs)
        return pre_activated_inputs
"""

### Shortcut

We can use shortcuts (skips) to improve performance in large models.

#### Mapping functions

A ResNet building block takes in an input, x, and produces some output H(x), where H is the block's mapping function. The mapping function, H, is a mathematical representation of the block itself - it takes an input and, using the weights within the block, produces an output. 

A block is mostly just a stack of convolution layers bundled together. 

#### Identity mapping

An identity mapping just means the output for a layer (or a set of layers, e.g., a ResNet block) is the same as the input. Consider a smaller model and a larger model that adds extra layers on top of it. If the larger model used the same weights as the smaller model, followed by identity mappings in the extra layers, its outputs would be identical to the smaller model's. In practice, however, the larger model performs worse, despite being able to match the smaller model's performance in theory. It turns out that training a model's layers to learn identity mappings is difficult (remember, a model's weights change at every update step). 


#### Residual learning

Let's say that for some block B, its optimal mapping function is $H_B$. If the optimal function is close to an identity mapping (i.e., $H_B(x) \approx x$), learning the function directly will be hard. Instead, we can have the block learn a different mapping function, $F_B$, such that $F_B(x) = H_B(x) - x$. 

Note: sometimes, the best "block" or "layer" is just one that returns the inputs. Having this possibility is nice because otherwise, if every layer is forced to be different, then degradation happens as you add more layers. 

We also add the input, x, to the block's output. This is referred to as the shortcut connection, or just "shortcut". So, the modified block's output will be $F_B(x) + x$, which is equivalent to the optimal function, $H_B(x)$. The process of learning $F_B(x)$ is called "residual learning". When the optimal mapping function, $H_B(x)$, is close to the identity mapping, it's easier to learn the residual, $F_B(x)$, than $H_B(x)$ itself.

#### Projection shortcut

When a block uses a stride size > 1 for its convolution layers, the output will have reduced height and width compared to its input (since the filter skips over positions as it slides across the input). Since the default shortcut is the input data itself, it cannot be added to the block output in this case, because element-wise addition requires equal dimension sizes. 

A projection shortcut is the result of applying a convolution layer, with 1 x 1 kernels and the same stride size as the block, to the pre-activated input data. This convolution layer ensures that the shortcut has the same dimensions as the block's output (which lets you add the shortcut to the block's output).

We apply the convolution layer to the pre-activated input data, rather than the raw input data. 
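The shape mismatch, and how the projection resolves it, can be traced with simple shape arithmetic. This sketch assumes convolutions padded by a total of kernel size minus 1, as in the custom padding described earlier; the helper `conv_output_shape` and the example input shape of 56 x 56 x 64 are assumptions made for illustration.

```python
def conv_output_shape(input_shape, filters, strides):
    """Shape (H, W, C) after a convolution padded by kernel_size - 1 in total.

    With that padding, the output size depends only on the stride.
    """
    height, width, _ = input_shape
    return ((height - 1) // strides + 1, (width - 1) // strides + 1, filters)


block_input = (56, 56, 64)

# Regular block with 128 filters whose first convolution uses stride 2.
conv1 = conv_output_shape(block_input, filters=128, strides=2)
block_output = conv_output_shape(conv1, filters=128, strides=1)

# The identity shortcut keeps the input shape, so it cannot be added.
identity_shortcut = block_input

# A 1 x 1 convolution with the block's stride and filter count fixes this.
projection_shortcut = conv_output_shape(block_input, filters=128, strides=2)

print(block_output, identity_shortcut, projection_shortcut)
```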

In [None]:
"""
import tensorflow as tf

# block_layer_sizes loaded in backend

class ResNetModel(object):
    # Model Initialization
    def __init__(self, min_aspect_dim, resize_dim, num_layers, output_size,
        data_format='channels_last'):
        self.min_aspect_dim = min_aspect_dim
        self.resize_dim = resize_dim
        self.filters_initial = 64
        self.block_strides = [1, 2, 2, 2]
        self.data_format = data_format
        self.output_size = output_size
        self.block_layer_sizes = block_layer_sizes[num_layers]
        self.bottleneck = num_layers >= 50
    
    # Applies consistent padding to the inputs
    def custom_padding(self, inputs, kernel_size):
        pad_total = kernel_size - 1
        pad_before = pad_total // 2
        pad_after = pad_total - pad_before
        if self.data_format == 'channels_first':
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [0, 0], [pad_before, pad_after], [pad_before, pad_after]])
        else:
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [pad_before, pad_after], [pad_before, pad_after], [0, 0]])
        return padded_inputs
    
    # Customized convolution layer w/ consistent padding
    def custom_conv2d(self, inputs, filters, kernel_size, strides, name=None):
        if strides > 1:
            padding = 'valid'
            inputs = self.custom_padding(inputs, kernel_size)
        else:
            padding = 'same'
        return tf.layers.conv2d(
            inputs=inputs, filters=filters, kernel_size=kernel_size,
            strides=strides, padding=padding, data_format=self.data_format,
            name=name)

    # Apply pre-activation to input data
    def pre_activation(self, inputs, is_training):
        axis = 1 if self.data_format == 'channels_first' else 3
        bn_inputs = tf.layers.batch_normalization(inputs, axis=axis, training=is_training)
        pre_activated_inputs = tf.nn.relu(bn_inputs)
        return pre_activated_inputs
    
    # Returns pre-activated inputs and the shortcut
    def pre_activation_with_shortcut(self, inputs, is_training, shortcut_params):
        pre_activated_inputs = self.pre_activation(inputs, is_training)
        shortcut = inputs
        shortcut_filters = shortcut_params[0]
        if shortcut_filters is not None:
            strides = shortcut_params[1]
            shortcut = self.custom_conv2d(pre_activated_inputs, shortcut_filters, 1, strides)
        return pre_activated_inputs, shortcut
"""

### ResNet Block

#### Learning the identity

Residual learning is the process of a block learning the "residual function", $F_B(x) = H_B(x) - x$, where $H_B(x)$ is the optimal mapping function. 

We use residual learning because it is much easier for a block to learn a "zero mapping" than an identity mapping. 

A zero mapping simply returns an output of all 0s. To learn this, the block just has to be trained so all its weights are 0. This is pretty easy to do. But, if the block learns a zero mapping, then the modified output, with the shortcut added, is: 

$$F_B(x) + x = 0 + x = x$$

In learning the easy zero mapping, we've created a block that corresponds to an identity mapping, which is much harder to learn directly. 

NOTE: To learn the identity mapping directly, the block would have to learn weights on every connection such that the output exactly reproduces the input, which is hard to do (for example, in a fully connected layer). This is why we learn the zero mapping instead - the weights just need to be driven to zero. 

#### Improving performance

Now, we can add layers to a model and avoid degradation, by having the additional layers represent identity mappings. In fact, residual learning actually increases model performance when adding many layers. 

For example, let's take a ResNet model with 20 blocks, which already has good performance. We can add 5 more blocks to the model and train with residual learning. If the first 20 blocks of the larger model learn the same weights as the smaller model, the 5 additional blocks don't need to modify the output of the first 20 blocks by much. We want the additional 5 blocks to learn a mapping function relatively close to the identity mapping. This is what residual learning allows us to accomplish: the additional blocks can make just the slightest tweaks to improve model performance. 

#### Structure of a block

A block is made up of multiple convolution layers. Specifically, it uses two convolution layers, both with 3 x 3 kernels and pre-activated inputs. The second convolution layer always uses stride size of 1, so the input and output have the same height and width dimensions. The first convolution layer also mainly uses a stride size of 1 (it only uses a stride size of 2 for dimension reduction purposes). 

Since a ResNet block uses residual learning, the shortcut is added to the block's residual output, $F_B(x)$, to produce the overall output. 


![title](resNet_block.png)

In [None]:
"""
import tensorflow as tf

# block_layer_sizes loaded in backend

class ResNetModel(object):
    # Model Initialization
    def __init__(self, min_aspect_dim, resize_dim, num_layers, output_size,
        data_format='channels_last'):
        self.min_aspect_dim = min_aspect_dim
        self.resize_dim = resize_dim
        self.filters_initial = 64
        self.block_strides = [1, 2, 2, 2]
        self.data_format = data_format
        self.output_size = output_size
        self.block_layer_sizes = block_layer_sizes[num_layers]
        self.bottleneck = num_layers >= 50
    
    # Applies consistent padding to the inputs
    def custom_padding(self, inputs, kernel_size):
        pad_total = kernel_size - 1
        pad_before = pad_total // 2
        pad_after = pad_total - pad_before
        if self.data_format == 'channels_first':
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [0, 0], [pad_before, pad_after], [pad_before, pad_after]])
        else:
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [pad_before, pad_after], [pad_before, pad_after], [0, 0]])
        return padded_inputs
    
    # Customized convolution layer w/ consistent padding
    def custom_conv2d(self, inputs, filters, kernel_size, strides, name=None):
        if strides > 1:
            padding = 'valid'
            inputs = self.custom_padding(inputs, kernel_size)
        else:
            padding = 'same'
        return tf.layers.conv2d(
            inputs=inputs, filters=filters, kernel_size=kernel_size,
            strides=strides, padding=padding, data_format=self.data_format,
            name=name)

    # Apply pre-activation to input data
    def pre_activation(self, inputs, is_training):
        axis = 1 if self.data_format == 'channels_first' else 3
        bn_inputs = tf.layers.batch_normalization(inputs, axis=axis, training=is_training)
        pre_activated_inputs = tf.nn.relu(bn_inputs)
        return pre_activated_inputs
    
    # Returns pre-activated inputs and the shortcut
    def pre_activation_with_shortcut(self, inputs, is_training, shortcut_params):
        pre_activated_inputs = self.pre_activation(inputs, is_training)
        shortcut = inputs
        shortcut_filters = shortcut_params[0]
        if shortcut_filters is not None:
            strides = shortcut_params[1]
            shortcut = self.custom_conv2d(pre_activated_inputs, shortcut_filters, 1, strides)
        return pre_activated_inputs, shortcut
    
    # ResNet building block
    def regular_block(self, inputs, filters, strides, is_training, index, shortcut_filters=None):
        with tf.variable_scope('regular_block{}'.format(index)):
            shortcut_params = (shortcut_filters, strides)
            pre_activated1, shortcut = self.pre_activation_with_shortcut(inputs, is_training, shortcut_params)
            conv1 = self.custom_conv2d(pre_activated1, filters, 3, strides)
            # second pre-activation, followed by a 3 x 3 convolution with stride 1
            pre_activated2 = self.pre_activation(conv1, is_training)
            conv2 = self.custom_conv2d(pre_activated2, filters, 3, 1)
            # add the shortcut to the residual output to get the block's output
            return conv2 + shortcut
"""

### Bottleneck

Using bottleneck blocks helps us reduce memory usage for large ResNet models.

#### Size considerations

Once we hit 50 or more layers, we want to take advantage of the large model depth and utilize more filters with each convolution layer. But, adding more filters results in added weight parameters, which can lead to long training times. 

To get around this, ResNet uses the same squeeze/expand concept used by the SqueezeNet Fire module. The ResNet blocks for 50+ layer models use 3 convolution layers rather than 2, where the first convolution layer squeezes the number of channels in the data and the third convolution layer expands the number of channels. These blocks are called "bottleneck blocks".

#### Bottleneck block

The third convolution layer of a bottleneck block uses 4x as many filters as a regular ResNet block (since it's the "expand" layer). Since the output of one block is the input of the next, the input to a bottleneck block therefore also has 4x as many channels. To keep the channel count from growing by another factor of 4 with every chained block, the first convolution layer acts as a squeeze layer, reducing the number of channels back to the original amount. 

The bottleneck block, like the SqueezeNet Fire module, mixes the usage of 1 x 1 and 3 x 3 kernels. The bottleneck block uses 1 x 1 kernels for the first and third convolution layers, while the middle convolution layer still uses 3 x 3 kernels. This helps reduce the number of weight parameters while still maintaining good performance. 

#### Parameter comparison

To calculate the number of weight parameters in a convolution layer, we use the following equation: 

$$P = H_K \times W_K \times F \times C + F$$

where $H_K \times W_K$ is the kernel dimensions, $F$ is the number of filters, and $C$ is the number of input channels. 

For a regular ResNet block with 64 filters that takes in an input with 64 channels, the number of weight parameters, P, used in the block is:

$$P_1 = P_2 = 3 \times 3 \times 64 \times 64 + 64 = 36,928$$

$$P = P_1 + P_2 = 73,856$$

where $P_1$ and $P_2$ represent the number of weight parameters used in the first and second convolution layers, respectively. 

For a bottleneck block with 64 filters, the input will have 256 channels, which the first convolution layer squeezes back down to 64. So, the number of weight parameters, $P_B$, used in the block is: 

$$P_1 = 1 \times 1 \times 64 \times 256 + 64 = 16,448$$

$$P_2 = 3 \times 3 \times 64 \times 64 + 64 = 36,928$$

$$P_3 = 1 \times 1 \times 256 \times 64 + 256 = 16,640$$

$$P_B = P_1 + P_2 + P_3 = 70,016$$

So, despite the input and output having 4x as many channels, the bottleneck block actually uses fewer weight parameters. 
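The comparison can be reproduced with a few lines of arithmetic. The helper name `conv_params` is hypothetical; it simply evaluates the parameter-count equation given earlier, with one bias term per filter.

```python
def conv_params(kernel_h, kernel_w, filters, in_channels):
    """Weights plus biases of one convolution layer."""
    return kernel_h * kernel_w * filters * in_channels + filters


# Regular block: two 3 x 3 convolutions, 64 filters, 64 input channels.
regular = conv_params(3, 3, 64, 64) + conv_params(3, 3, 64, 64)

# Bottleneck block: 1 x 1 squeeze, 3 x 3 middle, 1 x 1 expand.
bottleneck = (conv_params(1, 1, 64, 256)
              + conv_params(3, 3, 64, 64)
              + conv_params(1, 1, 256, 64))

print(regular)     # 73856
print(bottleneck)  # 70016
```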



![title](bottleneck_block.png)

In [None]:
"""
import tensorflow as tf

# block_layer_sizes loaded in backend

class ResNetModel(object):
    # Model Initialization
    def __init__(self, min_aspect_dim, resize_dim, num_layers, output_size,
        data_format='channels_last'):
        self.min_aspect_dim = min_aspect_dim
        self.resize_dim = resize_dim
        self.filters_initial = 64
        self.block_strides = [1, 2, 2, 2]
        self.data_format = data_format
        self.output_size = output_size
        self.block_layer_sizes = block_layer_sizes[num_layers]
        self.bottleneck = num_layers >= 50
    
    # Applies consistent padding to the inputs
    def custom_padding(self, inputs, kernel_size):
        pad_total = kernel_size - 1
        pad_before = pad_total // 2
        pad_after = pad_total - pad_before
        if self.data_format == 'channels_first':
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [0, 0], [pad_before, pad_after], [pad_before, pad_after]])
        else:
            padded_inputs = tf.pad(
                inputs,
                [[0, 0], [pad_before, pad_after], [pad_before, pad_after], [0, 0]])
        return padded_inputs
    
    # Customized convolution layer w/ consistent padding
    def custom_conv2d(self, inputs, filters, kernel_size, strides, name=None):
        if strides > 1:
            padding = 'valid'
            inputs = self.custom_padding(inputs, kernel_size)
        else:
            padding = 'same'
        return tf.layers.conv2d(
            inputs=inputs, filters=filters, kernel_size=kernel_size,
            strides=strides, padding=padding, data_format=self.data_format,
            name=name)

    # Apply pre-activation to input data
    def pre_activation(self, inputs, is_training):
        axis = 1 if self.data_format == 'channels_first' else 3
        bn_inputs = tf.layers.batch_normalization(inputs, axis=axis, training=is_training)
        pre_activated_inputs = tf.nn.relu(bn_inputs)
        return pre_activated_inputs
    
    # Returns pre-activated inputs and the shortcut
    def pre_activation_with_shortcut(self, inputs, is_training, shortcut_params):
        pre_activated_inputs = self.pre_activation(inputs, is_training)
        shortcut = inputs
        shortcut_filters = shortcut_params[0]
        if shortcut_filters is not None:
            strides = shortcut_params[1]
            shortcut = self.custom_conv2d(pre_activated_inputs, shortcut_filters, 1, strides)
        return pre_activated_inputs, shortcut
    
    # ResNet bottleneck block
    def bottleneck_block(self, inputs, filters, strides, is_training, index, shortcut_filters=None):
        with tf.variable_scope('bottleneck_block{}'.format(index)):
            shortcut_params = (shortcut_filters, strides)
            pre_activated1, shortcut = self.pre_activation_with_shortcut(inputs, is_training, shortcut_params)
            conv1 = self.custom_conv2d(pre_activated1, filters, 1, 1)
            pre_activated2 = self.pre_activation(conv1, is_training)
            conv2 = self.custom_conv2d(pre_activated2, filters, 3, strides)
            # third pre-activation, followed by the 1 x 1 'expand' convolution
            pre_activated3 = self.pre_activation(conv2, is_training)
            conv3 = self.custom_conv2d(pre_activated3, filters * 4, 1, 1)
            return conv3 + shortcut
"""

### Block Layer

We can stack multiple blocks together to create a block layer in ResNet.

#### Block layers

A ResNet model has 4 block layers. Each block layer contains a different number of blocks, depending on the total number of weight layers in the ResNet model. For example, an 18 layer ResNet model has 2 blocks in each block layer. 

The blocks within each block layer are connected, so the output of block i, within the block layer, is the input of block i + 1. Furthermore, the four block layers themselves are connected, so the output of block layer j is the input of block layer j + 1.

![title](example_block_layer.png)

In [None]:
# the following function stacks multiple
# building blocks. 

"""
import tensorflow as tf

class ResNetModel(object):
    # __init__ and other functions omitted

    # Creates a layer of blocks
    def block_layer(self, inputs, filters, strides, num_blocks, is_training, index):
        with tf.variable_scope('block_layer{}'.format(index)):
            shortcut_filters = 4 * filters if self.bottleneck else filters
            block_fn = self.bottleneck_block if self.bottleneck else self.regular_block
            block_output = block_fn(inputs, filters, strides, is_training, 0, 
                shortcut_filters=shortcut_filters)
            # stack the blocks in this layer
            for i in range(1, num_blocks):
                block_output = block_fn(block_output, filters, 1, is_training, i)
            return block_output
"""

### Model Layers

#### Initial layer

Prior to the block layers, the ResNet model applies a convolution layer and a max pooling layer to the input data. 

The convolution layer uses larger 7 x 7 kernels because the ImageNet images are larger than those in the previous datasets. The convolution layer uses a stride size of 2 (for dimensionality reduction). The max pooling layer also uses a stride size of 2. After the initial convolution and pooling layers, the input height and width are both reduced by a factor of 4, so the data fed into the first block layer has 16x fewer spatial positions than the original input. 
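To see the reduction in numbers, the sketch below traces the height (or width) of the data through the initial layers and on through the four block layers. The input size of 224 is an assumed value, the helper name `trace_spatial_size` is hypothetical, and the size formulas assume kernel-based padding for the convolutions and 'same' padding for the pooling layer.

```python
import math


def trace_spatial_size(input_size, block_strides=(1, 2, 2, 2)):
    """Return (layer name, height/width) pairs for each stage of the model."""
    sizes = [('input', input_size)]
    # 7 x 7 convolution, stride 2, padded by kernel_size - 1 in total.
    size = (input_size - 1) // 2 + 1
    sizes.append(('conv_initial', size))
    # 3 x 3 max pooling, stride 2, 'same' padding.
    size = math.ceil(size / 2)
    sizes.append(('pool_initial', size))
    # Only the first block of each block layer changes the spatial size.
    for index, strides in enumerate(block_strides):
        size = (size - 1) // strides + 1
        sizes.append(('block_layer{}'.format(index), size))
    return sizes


for name, size in trace_spatial_size(224):
    print(name, size)
```

For a 224 x 224 input, the data entering the first block layer is 56 x 56, which is the factor-of-4 reduction per side described above.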

#### Final layers

The ResNet model ends with global average pooling and a fully-connected layer to obtain the logits. The fully-connected layer is the final weight layer of the model. 

So, the ResNet goes like this:

Initial convolution + max pooling --> block layers --> final fully-connected layer

The number of weight layers in the model gives each variation of ResNet its name (e.g., ResNet-18, ResNet-50, etc.). For example, ResNet-18 has 2 blocks in each of its 4 block layers. The blocks are regular blocks, so they have 2 convolution layers each. So, ResNet-18 has 16 convolution layers across all of its blocks. Adding the initial convolution layer and the final fully-connected layer gives a total of 18 weight layers. 
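The same count can be checked for every model size. The sketch below reuses the 'block_layer_sizes' dictionary from the initialization section; the helper name `count_weight_layers` is hypothetical, and the count follows the naming convention described here, so the 1 x 1 convolutions in projection shortcuts are not included.

```python
block_layer_sizes = {
    18: [2, 2, 2, 2],
    34: [3, 4, 6, 3],
    50: [3, 4, 6, 3],
    101: [3, 4, 23, 3],
    152: [3, 8, 36, 3],
    200: [3, 24, 36, 3],
}


def count_weight_layers(num_layers):
    """Initial convolution + convolutions in all blocks + fully-connected layer."""
    # Bottleneck blocks (50+ layers) have 3 convolutions, regular blocks have 2.
    convs_per_block = 3 if num_layers >= 50 else 2
    block_convs = convs_per_block * sum(block_layer_sizes[num_layers])
    return 1 + block_convs + 1


for num_layers in sorted(block_layer_sizes):
    print(num_layers, count_weight_layers(num_layers))
```

Each computed total matches the number in the model's name.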

### Full Model Architecture

#### Regularization

For deep models, we have to regularize the model so it doesn't overfit. For regular CNNs, we use dropout for regularization. However, batch normalization, along with reducing internal covariate shift, also happens to regularize the model (since each input is normalized using statistics from the entire batch, the model is less likely to overfit on a few outliers). Since we apply batch normalization before nearly every weight layer, it already fulfills the function of regularization.

#### Increased filters

In ResNet, the number of filters is doubled as we go from one block layer to the next. 

We increase the number of filters at the block layer level, rather than at individual blocks, because there are many blocks in ResNet, so doubling the filters at every block would lead to a very large number of parameters. 

#### Logits

For the ImageNet dataset, there are 1000 labels. Using a convolution layer with 1000 filters would require many weight parameters. (In contrast, for SqueezeNet with the CIFAR-10 dataset, we could use a convolution layer to make the number of channels equal the number of image classes, then apply global average pooling to obtain the logits.)

To save weight parameters in the ResNet model, we first apply global average pooling over the height and width of each channel, rather than using a convolution layer. Then, we flatten the data and use a fully-connected layer to obtain logits. 
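As a small illustration of these final steps, the sketch below averages a tiny feature map down to one value per channel and counts the parameters of the fully-connected layer that follows. The helper names are hypothetical, the feature map is a nested list in [height][width][channels] order, and the 512 input features correspond to the last block layer of a model that uses regular blocks.

```python
def global_average_pool(feature_map):
    """Average an [H][W][C] nested list over H and W, giving C values."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                totals[c] += value
    return [total / (height * width) for total in totals]


def dense_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


# A 2 x 2 feature map with 2 channels.
feature_map = [[[1.0, 10.0], [2.0, 20.0]],
               [[3.0, 30.0], [4.0, 40.0]]]
pooled = global_average_pool(feature_map)
print(pooled)  # [2.5, 25.0]

# Fully-connected layer from 512 pooled channels to 1000 logits.
print(dense_params(512, 1000))  # 513000
```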

In [None]:
"""
import tensorflow as tf

class ResNetModel(object):
    # __init__ and other functions omitted

    # Model Layers
    # inputs (channels_last): [batch_size, resize_dim, resize_dim, 3]
    # inputs (channels_first): [batch_size, 3, resize_dim, resize_dim]
    def model_layers(self, inputs, is_training):
        # initial convolution layer
        conv_initial = self.custom_conv2d(
            inputs, self.filters_initial, 7, 2, name='conv_initial')
        # pooling layer
        curr_layer = tf.layers.max_pooling2d(
            conv_initial, 3, 2, padding='same',
            data_format=self.data_format,
            name='pool_initial')
        # stack the block layers
        for i, num_blocks in enumerate(self.block_layer_sizes):
            filters = self.filters_initial * 2**i
            strides = self.block_strides[i]
            # stack this block layer on the previous one
            curr_layer = self.block_layer(
                curr_layer, filters, strides,
                num_blocks, is_training, i)
        # pre-activation
        pre_activated_final = self.pre_activation(curr_layer, is_training)
        filter_size = int(pre_activated_final.shape[2])
        # final pooling layer
        avg_pool = tf.layers.average_pooling2d(
            pre_activated_final,
            filter_size,
            1,
            data_format=self.data_format)
        final_layer = tf.layers.flatten(avg_pool)
        # get logits from final layer
        logits = tf.layers.dense(final_layer, self.output_size, name='logits')
        return logits
"""