### SqueezeNet

SqueezeNet is a memory-efficient CNN that performs similarly to larger nets (e.g., AlexNet, the breakthrough CNN that won the 2012 ImageNet challenge).

After training a model, we store its computational graph and parameters (weights + biases) for future use. A model can have millions of parameters, so storing them takes space. 

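Most high-performance models require hundreds of MB to store their parameters (e.g., AlexNet uses over 200MB to store about 60 million parameters). SqueezeNet, by contrast, can be stored in less than 1MB once model compression is applied (the paper reports about 4.8MB uncompressed and under 0.5MB after compression). 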

The parameters that we need to store are the kernel weights and the biases. Their count depends on the kernel dimensions, the number of input channels, and the number of kernels (filters) in each layer. For example, an RGB image has 3 channels. Let's say there are two convolutional layers, and in each layer there are 3 kernels, each with dimensions 3 x 3. Each kernel spans every input channel and has one bias, so a layer uses (3 x 3) (for the dimensions of a kernel) x 3 (number of input channels) x 3 (number of kernels in a layer) + 3 (number of biases, one per kernel) = 84 parameters. The second layer also receives 3 input channels (one per kernel in the first layer), so the total is 2 (number of layers) x 84 = 168 parameters.
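To make the count easier to check, here is a minimal sketch in plain Python. The helper name `conv2d_param_count` is made up for illustration; it assumes one bias per kernel and that the second layer's input channels equal the first layer's kernel count.

```python
def conv2d_param_count(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter for a single convolution layer."""
    weights = kernel_h * kernel_w * in_channels * filters
    biases = filters
    return weights + biases

# Two 3 x 3 layers with 3 kernels each, applied to an RGB (3-channel) input.
layer1 = conv2d_param_count(3, 3, in_channels=3, filters=3)
# The second layer's input channels equal the first layer's kernel count.
layer2 = conv2d_param_count(3, 3, in_channels=3, filters=3)
total_params = layer1 + layer2
print(layer1, layer2, total_params)  # 84 84 168
```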

Having smaller models is helpful because they're faster to load and easier to store. 

Link to paper: https://arxiv.org/pdf/1602.07360.pdf

### Initialization

In this example, we're using the CIFAR10 dataset. It contains 60,000 color images, with dimensions 32 x 32, evenly split across 10 categories. Since it's in RGB format, there are 3 channels, with each channel containing 1024 integers (32 x 32) between 0 and 255. So, an image is represented by 3072 integers between 0 and 255. 

This time (as opposed to MNIST), the output labels are indices of the classes, rather than one-hot representations. This is known as a "sparse representation" of the labels. 

For a batch of input data, the shape is (batch size, 3 x 32 x 32), and the shape of the labels is (batch size, ) since the labels are represented as a 1-D tensor. 

The images in CIFAR10 have the following labels: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
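As a small illustration of the two label formats, the sketch below converts sparse labels (class indices) into one-hot vectors. The `CLASS_NAMES` list and the `to_one_hot` helper are assumptions made for this example; the index order follows the standard CIFAR-10 class ordering.

```python
# Standard CIFAR-10 class order (index -> name); listed here for illustration.
CLASS_NAMES = ["airplane", "automobile", "bird", "cat", "deer",
               "dog", "frog", "horse", "ship", "truck"]

def to_one_hot(label_index, num_classes):
    """Expand a sparse label (an index) into a one-hot vector."""
    vec = [0] * num_classes
    vec[label_index] = 1
    return vec

sparse_labels = [3, 8, 0]  # shape (batch size, ) with batch size = 3
one_hot_labels = [to_one_hot(i, len(CLASS_NAMES)) for i in sparse_labels]

print([CLASS_NAMES[i] for i in sparse_labels])  # ['cat', 'ship', 'airplane']
print(one_hot_labels[0])  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

The sparse form stores one integer per image, while the one-hot form stores ten.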

In [None]:
"""
batch_size = 32
dataset = dataset.batch(batch_size)
it = dataset.make_one_shot_iterator()
inputs_, labels = it.get_next()
with tf.Session() as sess:
    input_arr, label_arr = sess.run((inputs_, labels))
"""

In [None]:
"""
# beginning to build a SqueezeNet model class

class SqueezeNetModel(object):
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.output_size = output_size
        self.resize_dim = resize_dim
"""

### Augmentation

#### Data Augmentation

For images, we can augment our data through image transformations. We can rotate and crop our images to make new images for our training sample. Using data augmentation, we create variations of each image in a batch of input data. Since we train our model here on a few hundred epochs, data augmentation constructs hundreds of variations for each image in a training set. 

#### Image Transformation

For the CIFAR10 dataset, we'll use horizontal flips of the images. This is because, for the types of images in CIFAR10, a rotated (e.g., upside-down) version is unrealistic. 

First, we crop each image to a randomly chosen submatrix whose height and width equal the resized dimension. Let's use resize_dim = 24 (originally, dim = 32). Then, with a probability of 0.5, we flip the cropped image horizontally. 
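The crop-then-flip idea can be sketched without TensorFlow, using nested lists as a stand-in for an image. This is only an illustration of the logic under that assumption, not the TensorFlow implementation; the function name and the `rng` parameter are made up for the example.

```python
import random

def random_crop_and_flip(image, crop_dim, rng=random):
    """image: list of rows, each row a list of pixels (any pixel type)."""
    height, width = len(image), len(image[0])
    # randint uses inclusive bounds, so every valid offset can be chosen
    top = rng.randint(0, height - crop_dim)
    left = rng.randint(0, width - crop_dim)
    cropped = [row[left:left + crop_dim] for row in image[top:top + crop_dim]]
    # flip horizontally (reverse each row) with probability 0.5
    if rng.random() < 0.5:
        cropped = [row[::-1] for row in cropped]
    return cropped

# A 32 x 32 "image" whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(32)] for r in range(32)]
augmented = random_crop_and_flip(image, crop_dim=24)
print(len(augmented), len(augmented[0]))  # 24 24
```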

We only apply data augmentation during training, so we skip the random transformations when the model is not training. However, because the random crop means the model trains on 24 x 24 inputs, our layers expect inputs with height and width = 24. We therefore also need to bring our test images down to 24 x 24 during evaluation, which the code below does with a central crop (tf.image.resize_image_with_crop_or_pad). 

#### Data Standardization

The CIFAR10 dataset has pixel integers between 0 and 255, and we need some way to standardize them. In the MNIST dataset, we scaled the pixel integers to between 0 and 1. Here, we instead use $\textit{image standardization}$: for a given image, we linearly scale the pixel data so that it has 0 mean and unit variance. 

This improves on the MNIST method because color intensity varies widely between images. For example, imagine an image with strong red, green, and blue hues (values near 255) next to an image with muted colors. Every input passes through the same weights and biases, so weights and biases trained mostly on strongly colored images may misclassify muted ones. Large unscaled inputs may also produce large gradients during backpropagation, which can make training less stable. 

To accomplish image standardization, we use tf.image.per_image_standardization.
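To see what this scaling does, here is a plain-Python sketch of the arithmetic on a flat list of pixel values. It follows the documented behaviour of the TensorFlow function (subtract the mean, divide by the standard deviation, with a floor of 1/sqrt(N) to avoid dividing by zero on uniform images), but it is an illustration rather than the library code, and the sample pixel values are made up.

```python
import math

def per_image_standardization(pixels):
    """pixels: flat list of numbers for one image (all channels together)."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # floor the stddev so that uniform images do not divide by zero
    adjusted_stddev = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(p - mean) / adjusted_stddev for p in pixels]

bright = [255, 200, 255, 180]  # strong hues
muted = [60, 40, 60, 30]       # muted colors

for name, pixels in (("bright", bright), ("muted", muted)):
    scaled = per_image_standardization(pixels)
    print(name, [round(v, 2) for v in scaled])
```

After scaling, both images have 0 mean and unit variance, so the bright and muted images land in a comparable numeric range.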




In [None]:
"""
class SqueezeNetModel(object):
    # Model Initialization
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.resize_dim = resize_dim
        self.output_size = output_size
    
    # Random crop and flip
    def random_crop_and_flip(self, float_image):
        crop_image = tf.random_crop(float_image, [self.resize_dim, self.resize_dim, 3])
        updated_image = tf.image.random_flip_left_right(crop_image)
        return updated_image
        
    # Data Augmentation
    def image_preprocessing(self, data, is_training):
        reshaped_image = tf.reshape(data, [3, self.original_dim, self.original_dim])
        transposed_image = tf.transpose(reshaped_image, [1, 2, 0])
        float_image = tf.cast(transposed_image, tf.float32)
        if is_training:
            updated_image = self.random_crop_and_flip(float_image)
        else:
            updated_image = tf.image.resize_image_with_crop_or_pad(float_image, self.resize_dim, self.resize_dim)
        standardized_image = tf.image.per_image_standardization(updated_image)
        return standardized_image
"""

### The Fire Module

The fire module is the central component of SqueezeNet.

#### Decreasing parameters

In order to make a smaller model, we have to decrease the number of weights per convolutional layer. There are three ways to decrease the number of weights:
1. Decrease the kernel size
2. Decrease the number of filters (kernels) used
3. Decrease the number of input channels

We want to keep the number of filters, since having a wide variety of filters allows us to extract different hidden features from the input. But, there are ways to decrease the kernel size and number of input channels while still maintaining good model performance. 

#### Kernel size (and how to decrease it)

A kernel's size determines how much spatial information it can capture. For example, a 1 x 1 kernel can only capture information from a single pixel, a 2 x 2 kernel can aggregate the information from four adjacent pixels, and a 3 x 3 kernel can aggregate the information from the nine pixels in a 3 x 3 square of data. 

Although larger kernels can capture more information (since they have a view of more pixels), this comes at the cost of additional parameters. A convolutional layer with 3 x 3 kernels uses 9x as many weights as a layer that uses 1 x 1 kernels. A way to balance performance and parameter count is to use a mix of larger and smaller kernels. 

#### Intermediate layer (used to decrease the number of input channels)

An intermediate convolution layer can be used to decrease the number of input channels. Adding an extra layer introduces additional kernel weights, but it can drastically decrease the number of parameters used in a layer. 

For example, look at a convolutional layer with 100 filters and 3 x 3 kernels. If the input has 50 channels, the number of parameters = 3 x 3 x 100 x 50 + 100 = 45,100. 

But, if we first apply an intermediate convolutional layer with 10 filters and 1 x 1 kernels, the intermediate output will have 10 channels. The number of parameters used in the intermediate layer is 1 x 1 x 10 x 50 + 10 = 510. Then, if we pass the intermediate output into our original convolutional layer, the total number of parameters used becomes:
510 + (3 x 3 x 100 x 10 + 100) = 9,610.

So, we'd apply an intermediate convolution layer, which, here, reduces the number of input channels from 50 to 10, leading to a roughly 5-fold reduction in the number of parameters (45,100 vs. 9,610, or about 4.7x). 
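The arithmetic for this example can be checked with a short sketch. The helper `conv2d_param_count` is a made-up name for illustration and assumes square kernels with one bias per filter.

```python
def conv2d_param_count(kernel_size, in_channels, filters):
    """Weights plus one bias per filter, for square kernels."""
    return kernel_size * kernel_size * in_channels * filters + filters

# 3 x 3 layer with 100 filters applied directly to the 50-channel input
direct = conv2d_param_count(3, in_channels=50, filters=100)

# 1 x 1 intermediate layer (10 filters), then the same 3 x 3 layer
intermediate = conv2d_param_count(1, in_channels=50, filters=10)
reduced = intermediate + conv2d_param_count(3, in_channels=10, filters=100)

print(direct, intermediate, reduced)  # 45100 510 9610
print(round(direct / reduced, 2))     # 4.69
```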

#### Fire module

The fire module, the key building block of SqueezeNet, applies concepts from the kernel size and intermediate layer sections. 

It uses an intermediate convolution layer, known as a "squeeze layer", then passes the intermediate output into an "expand layer" with a larger number of filters. 

The "expand layer" contains two convolution layers with an equal number of filters. One layer uses 1 x 1 kernels, while the other uses 3 x 3 kernels. Using 1 x 1 kernels helps decrease the number of parameters used. The outputs of the two layers have the same shape, since both layers use the same number of filters and 'same' padding keeps the height and width unchanged. 

The two outputs are then concatenated along the channel dimension (which doubles the number of channels) to produce the overall output of the fire module. We use tf.concat to concatenate, and it has two required args: values and axis. Since the channel dimension is the last dimension, we use -1 for the axis. 

The ratio of the number of filters in the squeeze layer to the number of filters in the expand layer is known as the "squeeze ratio". A larger squeeze ratio (i.e., increasing the number of filters in the squeeze layer) can improve the model performance up to a certain extent, at the cost of increasing the parameter count. 
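To get a feel for the savings, the sketch below counts the parameters of one fire module and compares them with a plain 3 x 3 convolution that produces the same number of output channels. The helper names and the example sizes (64 input channels, 32 squeeze filters, 64 filters per expand convolution) are assumptions for illustration, and the squeeze ratio is computed against the total number of expand filters.

```python
def conv2d_param_count(kernel_size, in_channels, filters):
    """Weights plus one bias per filter, for square kernels."""
    return kernel_size * kernel_size * in_channels * filters + filters

def fire_module_summary(in_channels, squeeze_depth, expand_depth):
    squeeze = conv2d_param_count(1, in_channels, squeeze_depth)
    # both expand convolutions read the squeezed output
    expand1x1 = conv2d_param_count(1, squeeze_depth, expand_depth)
    expand3x3 = conv2d_param_count(3, squeeze_depth, expand_depth)
    return {
        "params": squeeze + expand1x1 + expand3x3,
        # concatenating the two expand outputs doubles the channel count
        "out_channels": 2 * expand_depth,
        "squeeze_ratio": squeeze_depth / (2 * expand_depth),
    }

summary = fire_module_summary(in_channels=64, squeeze_depth=32, expand_depth=64)
plain_conv = conv2d_param_count(3, in_channels=64, filters=128)

print(summary)     # {'params': 22688, 'out_channels': 128, 'squeeze_ratio': 0.25}
print(plain_conv)  # 73856
```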

Below is code for a fire module:


In [None]:
"""
class SqueezeNetModel(object):
    # __init__ and other functions omitted

    # Convolution layer wrapper
    def custom_conv2d(self, inputs, filters, kernel_size, name):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=kernel_size,
            activation=tf.nn.relu,
            padding='same',
            name=name)

    # SqueezeNet fire module
    def fire_module(self, inputs, squeeze_depth, expand_depth, name):
        with tf.variable_scope(name):
            squeezed_inputs = self.custom_conv2d(
                inputs,
                squeeze_depth,
                [1, 1],
                'squeeze')
            expand1x1 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [1, 1],
                'expand1x1')
            expand3x3 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [3, 3],
                'expand3x3')
            return tf.concat([expand1x1, expand3x3], axis=-1)
"""

### Stacking fire modules

We can stack multiple fire modules together by creating a utility function. 

#### Utility function

The SqueezeNet model, though it has few parameters, still has many layers. There are several fire modules, which makes it useful to write a utility function that can stack multiple layers. 

When we deal with more complex model architectures, there are going to be repetitions of the main building blocks in the model. 

Below is a way to stack the modules together (essentially, by using a for loop):


In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # Model Initialization
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.resize_dim = resize_dim
        self.output_size = output_size
    
    # Convolution layer wrapper
    def custom_conv2d(self, inputs, filters, kernel_size, name):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=kernel_size,
            activation=tf.nn.relu,
            padding='same',
            name=name)

    # SqueezeNet fire module
    def fire_module(self, inputs, squeeze_depth, expand_depth, name):
        with tf.variable_scope(name):
            squeezed_inputs = self.custom_conv2d(
                inputs,
                squeeze_depth,
                [1, 1],
                'squeeze')
            expand1x1 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [1, 1],
                'expand1x1')
            expand3x3 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [3, 3],
                'expand3x3')
            return tf.concat([expand1x1, expand3x3], axis=-1)
    
    # Stacked fire modules (use a for loop to iterate through each param list)
    def multi_fire_module(self, layer, params_list):
        for params in params_list:
            layer = self.fire_module(layer, params[0], params[1], params[2])
        return layer
"""

### Model Layers

Here, we discuss the high-level architecture of SqueezeNet.

#### Overview

Now, we're building a condensed version of the SqueezeNet model. The differences between our model and the original SqueezeNet model are that (1) our model only uses the first 4 fire modules from the original SqueezeNet (rather than all 8), (2) the initial convolution layer for the model uses fewer filters (64 vs. 96) and a smaller kernel size (3x3 vs. 7x7), and (3) the max pooling layers in our model use smaller filters (2x2 vs. 3x3). 

#### Initial layers

Here, we start with a regular convolution layer, rather than a fire module, to apply the initial filters to our image data. We do this so that we don't start off by squeezing our image data and thus risk losing out on important general features.

It's usually a good idea for any CNN to start off with a regular convolution layer. 

Moreover, it's typically a good idea to apply max pooling after the initial convolution layer. Reducing the dimension of the initial data helps our model train faster and pick up on important features. 

Again, we apply this convolution + max pooling before passing the data into any Fire modules.
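As a reminder of what 2 x 2 max pooling with a stride of 2 does to the data, here is a small plain-Python sketch on a single 4 x 4 channel. The function name and the sample values are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """feature_map: 2-D list with even height and width."""
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            # keep the largest value in each non-overlapping 2 x 2 window
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            row.append(max(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

The height and width are halved (4 x 4 becomes 2 x 2), so every later layer has a quarter as many positions to process.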


In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # Model Initialization
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.resize_dim = resize_dim
        self.output_size = output_size
    
    # Convolution layer wrapper
    def custom_conv2d(self, inputs, filters, kernel_size, name):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=kernel_size,
            activation=tf.nn.relu,
            padding='same',
            name=name)

    # Max pooling layer wrapper
    def custom_max_pooling2d(self, inputs, name):
        return tf.layers.max_pooling2d(
            inputs=inputs,
            pool_size=[2, 2],
            strides=2,
            name=name)
    
    # Model Layers
    # inputs: [batch_size, resize_dim, resize_dim, 3]
    def model_layers(self, inputs, is_training):
        conv1 = self.custom_conv2d(inputs, filters = 64, kernel_size = [3,3], name = 'conv1')
        pool1 = self.custom_max_pooling2d(conv1, 'pool1')
"""

### Adding Fire

Here, we discuss adding Fire modules to the model and talk about delayed downsampling. 

#### Delayed downsampling

Reducing the data dimensions via max pooling can help make our model training faster and more efficient, with performance similar to that of a model without max pooling.

However, strategic placement of max pooling layers can improve the SqueezeNet model's accuracy. 

Normally, you apply max pooling after a convolutional layer, in order to reduce the dimensionality of the output.

Here, rather than applying max pooling inside the first Fire module (e.g., after its squeeze layer), we apply it after the first two Fire modules. Placing the max pooling layers later in the model structure is known as $\textit{delayed downsampling}$. We wait until later in the model architecture to downsample the data (i.e., reduce height/width dimensions through pooling), so that the earlier convolution layers receive a larger dimension input (and thus have more data to work with). 

The creators of SqueezeNet showed that keeping larger dimension inputs for more layers in the model can help improve accuracy, so there is a tradeoff between the computational savings of downsampling early and accuracy. 
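The effect of pooling placement on what each Fire module sees can be sketched with a few lines of plain Python. The layer lists below are simplified stand-ins (only names, no weights), and the "early" ordering is a hypothetical alternative included purely for comparison.

```python
def fire_input_sizes(input_dim, layers):
    """Return the height/width that each Fire module receives.

    layers: sequence of 'conv' / 'fire' (size-preserving with 'same'
    padding) or 'pool' (2 x 2 max pooling with stride 2, halves the size).
    """
    dim, sizes = input_dim, []
    for layer in layers:
        if layer == "fire":
            sizes.append(dim)
        elif layer == "pool":
            dim //= 2
    return sizes

early = ["conv", "pool", "pool", "fire", "fire"]    # hypothetical: pool twice up front
delayed = ["conv", "pool", "fire", "fire", "pool"]  # second pooling delayed

print(fire_input_sizes(24, early))    # [6, 6]
print(fire_input_sizes(24, delayed))  # [12, 12]
```

Both orderings end at the same 6 x 6 size, but with delayed downsampling the two Fire modules work on 12 x 12 inputs instead of 6 x 6.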

In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # Model Initialization
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.resize_dim = resize_dim
        self.output_size = output_size
    
    # Convolution layer wrapper
    def custom_conv2d(self, inputs, filters, kernel_size, name):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=kernel_size,
            activation=tf.nn.relu,
            padding='same',
            name=name)

    # Max pooling layer wrapper
    def custom_max_pooling2d(self, inputs, name):
        return tf.layers.max_pooling2d(
            inputs=inputs,
            pool_size=[2, 2],
            strides=2,
            name=name)
    
    # SqueezeNet fire module
    def fire_module(self, inputs, squeeze_depth, expand_depth, name):
        with tf.variable_scope(name):
            squeezed_inputs = self.custom_conv2d(
                inputs,
                squeeze_depth,
                [1, 1],
                'squeeze')
            expand1x1 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [1, 1],
                'expand1x1')
            expand3x3 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [3, 3],
                'expand3x3')
            return tf.concat([expand1x1, expand3x3], axis=-1)

    # Utility function for multiple fire modules
    def multi_fire_module(self, layer, params_list):
        for params in params_list:
            layer = self.fire_module(
                layer,
                params[0],
                params[1],
                params[2])
        return layer
    
    # Model Layers
    # inputs: [batch_size, resize_dim, resize_dim, 3]
    def model_layers(self, inputs, is_training):
        conv1 = self.custom_conv2d(
            inputs,
            64,
            [3, 3],
            'conv1')
        pool1 = self.custom_max_pooling2d(
            conv1,
            'pool1')
        fire_params1 = [
            (32, 64, 'fire1'),
            (32, 64, 'fire2')
        ]
        # apply multi fire module block with list of tuple params
        multi_fire1 = self.multi_fire_module(pool1, fire_params1)
        # perform max pooling on the output of the multi-fire module block
        pool2 = self.custom_max_pooling2d(multi_fire1, name='pool2')

"""

### Depth

We can increase the SqueezeNet model's depth to improve performance.

#### Deeper Fire modules

In order to extract more distinguishing and nuanced features from the data, we can add more Fire modules to the model. The additional Fire modules use twice as many filters in the expand layer (128 instead of 64), while the squeeze layer keeps 32 filters. This follows the same approach as regular CNNs: deeper layers look for more specific, fine-grained features (e.g., a particular line in one corner of the image), and more filters give the model room to represent them.

To avoid overfitting, we apply dropout after the second multi-Fire module block, with a dropout rate of 0.5. We don't use a max pooling layer after this multi-Fire module block, since we use global average pooling afterwards to get the logits. 
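To keep track of how the data shape and the parameter count evolve through these layers, here is a plain-Python trace. It assumes a 24 x 24 x 3 input, 'same' padding in every convolution, and 2 x 2 max pooling with a stride of 2; the helper names are made up for this sketch, and dropout is left out because it changes neither shapes nor parameters.

```python
def conv_params(kernel_size, in_channels, filters):
    return kernel_size * kernel_size * in_channels * filters + filters

def fire_params(in_channels, squeeze_depth, expand_depth):
    return (conv_params(1, in_channels, squeeze_depth)
            + conv_params(1, squeeze_depth, expand_depth)
            + conv_params(3, squeeze_depth, expand_depth))

def trace_model(resize_dim=24, in_channels=3):
    layers = [
        ("conv1", "conv", 64),
        ("pool1", "pool", None),
        ("fire1", "fire", (32, 64)),
        ("fire2", "fire", (32, 64)),
        ("pool2", "pool", None),
        ("fire3", "fire", (32, 128)),
        ("fire4", "fire", (32, 128)),
    ]
    dim, channels, total = resize_dim, in_channels, 0
    trace = []
    for name, kind, cfg in layers:
        if kind == "conv":
            total += conv_params(3, channels, cfg)
            channels = cfg
        elif kind == "fire":
            squeeze_depth, expand_depth = cfg
            total += fire_params(channels, squeeze_depth, expand_depth)
            channels = 2 * expand_depth  # two expand outputs concatenated
        else:
            dim //= 2  # pooling halves height and width
        trace.append((name, (dim, dim, channels)))
    return trace, total

trace, total_params = trace_model()
for name, shape in trace:
    print(name, shape)
print(total_params)  # 144000
```

Under these assumptions, the output of fire4 has shape (6, 6, 256).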


In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # __init__ and other functions omitted

    # Model Layers
    def model_layers(self, inputs, is_training):
        conv1 = self.custom_conv2d(
            inputs,
            64,
            [3, 3],
            'conv1')
        pool1 = self.custom_max_pooling2d(
            conv1,
            'pool1')
        fire_params1 = [
            (32, 64, 'fire1'),
            (32, 64, 'fire2')
        ]
        multi_fire1 = self.multi_fire_module(
            pool1,
            fire_params1)
        ### Additional layers, for depth
        pool2 = self.custom_max_pooling2d(
            multi_fire1,
            'pool2')
        fire_params2 = [
            (32, 128, 'fire3'),
            (32, 128, 'fire4')
        ]
        multi_fire2 = self.multi_fire_module(
            pool2,
            fire_params2)
        dropout1 = tf.layers.dropout(multi_fire2, rate=0.5,
            training=is_training)
"""

### Logits

We can use global average pooling to obtain the model's logits. 

#### Channel-based logits

In the CNN section, we flattened the data (converting each image's height x width x channels output into a single vector) and used a couple of fully-connected layers (with dropout) to get the logits. Here, instead of flattening the data, we use a convolution layer to convert the number of channels in our data to the number of classes. 

Then, we can use an average pooling layer to obtain logits for each channel (i.e., image class).

#### Global average pooling

Here, the pooling filter averages the inputs that it covers. When the average pooling filter has the same height and width as the input data, it produces a single output (the average) per channel. This is known as "global average pooling".

So, global average pooling gives us one metric, the average, for each channel. Each channel corresponds to a unique class, and the channel's averaged value represents the logit for that class. 

For example, one channel holds a grid of scores for how strongly each region of the image looks like a car, and another channel holds the scores for a cat. Applying global average pooling to each channel collapses its grid into a single value, which indicates how likely it is that a given image is, for example, a cat.
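Here is a plain-Python sketch of global average pooling on a tiny input, using nested lists in [height][width][channels] order. The function name and the sample values are made up for illustration; in the model itself this is done with an average pooling layer.

```python
def global_average_pool(feature_maps):
    """feature_maps: [height][width][channels] nested lists for one image."""
    height, width = len(feature_maps), len(feature_maps[0])
    num_channels = len(feature_maps[0][0])
    totals = [0.0] * num_channels
    for row in feature_maps:
        for pixel in row:
            for ch, value in enumerate(pixel):
                totals[ch] += value
    # one average per channel, i.e. one logit per class
    return [t / (height * width) for t in totals]

# 2 x 2 spatial grid with 3 channels (pretend there are 3 classes)
feature_maps = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
logits = global_average_pool(feature_maps)
print(logits)                     # [4.0, 1.0, 2.0]
print(logits.index(max(logits)))  # 0
```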

#### Advantages of using global average pooling (as opposed to fully-connected layers)

Using global average pooling is more native to the CNN structure, since we obtain logits directly from channels rather than converting the data to flattened vectors. Because each channel maps directly to one class, this can help the CNN produce more accurate logits for each image class. 

Moreover, since global average pooling is just a pooling layer, it has no parameters. That means that there's no risk of overfitting the global average pooling layer. In contrast, fully-connected layers use many weight parameters, which risks overfitting (even when dropout diminishes the problem). 
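A rough parameter comparison makes the difference concrete. The sketch below assumes the final feature maps have shape 6 x 6 x 256 (what this notebook's settings would produce for 24 x 24 inputs) and 10 classes, and it compares a single fully-connected layer on the flattened data against a 1 x 1 convolution followed by global average pooling. The single-layer fully-connected head is a hypothetical baseline chosen for simplicity.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully-connected layer."""
    return inputs * units + units

def conv2d_param_count(kernel_size, in_channels, filters):
    return kernel_size * kernel_size * in_channels * filters + filters

height, width, channels, num_classes = 6, 6, 256, 10  # assumed shapes

# Flatten, then one fully-connected layer straight to the logits
fc_head = dense_params(height * width * channels, num_classes)

# 1 x 1 convolution to num_classes channels; the pooling adds no parameters
gap_head = conv2d_param_count(1, channels, num_classes) + 0

print(fc_head, gap_head)  # 92170 2570
```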

In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # Model Initialization
    def __init__(self, original_dim, resize_dim, output_size):
        self.original_dim = original_dim
        self.resize_dim = resize_dim
        self.output_size = output_size
    
    # Convert final convolution layer to logits
    def get_logits(self, conv_layer):
        avg_pool1 = tf.layers.average_pooling2d(conv_layer, [conv_layer.shape[1], conv_layer.shape[2]], 1)
        logits = tf.layers.flatten(avg_pool1, name = 'logits')
        return logits
        
    # Convolution layer wrapper
    def custom_conv2d(self, inputs, filters, kernel_size, name):
        return tf.layers.conv2d(
            inputs=inputs,
            filters=filters,
            kernel_size=kernel_size,
            activation=tf.nn.relu,
            padding='same',
            name=name)

    # Max pooling layer wrapper
    def custom_max_pooling2d(self, inputs, name):
        return tf.layers.max_pooling2d(
            inputs=inputs,
            pool_size=[2, 2],
            strides=2,
            name=name)
    
    # SqueezeNet fire module
    def fire_module(self, inputs, squeeze_depth, expand_depth, name):
        with tf.variable_scope(name):
            squeezed_inputs = self.custom_conv2d(
                inputs,
                squeeze_depth,
                [1, 1],
                'squeeze')
            expand1x1 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [1, 1],
                'expand1x1')
            expand3x3 = self.custom_conv2d(
                squeezed_inputs,
                expand_depth,
                [3, 3],
                'expand3x3')
            return tf.concat([expand1x1, expand3x3], axis=-1)

    # Utility function for multiple fire modules
    def multi_fire_module(self, layer, params_list):
        for params in params_list:
            layer = self.fire_module(
                layer,
                params[0],
                params[1],
                params[2])
        return layer
    
    # Model Layers
    # inputs: [batch_size, resize_dim, resize_dim, 3]
    def model_layers(self, inputs, is_training):
        conv1 = self.custom_conv2d(
            inputs,
            64,
            [3, 3],
            'conv1')
        pool1 = self.custom_max_pooling2d(
            conv1,
            'pool1')
        fire_params1 = [
            (32, 64, 'fire1'),
            (32, 64, 'fire2')
        ]
        multi_fire1 = self.multi_fire_module(
            pool1,
            fire_params1)
        pool2 = self.custom_max_pooling2d(
            multi_fire1,
            'pool2')
        fire_params2 = [
            (32, 128, 'fire3'),
            (32, 128, 'fire4')
        ]
        multi_fire2 = self.multi_fire_module(
            pool2,
            fire_params2)
        dropout1 = tf.layers.dropout(multi_fire2, rate=0.5,
            training=is_training)
        final_conv_layer = self.custom_conv2d(
            dropout1,
            self.output_size,
            [1, 1],
            'final_conv')
        return self.get_logits(final_conv_layer)
"""

### Sparse Labels

#### Sparse representation

We use a sparse representation for the CIFAR-10 labels. Instead of being one-hot vectors, each label is just the index of its corresponding image class. Using this sparse representation saves space compared to one-hot representation.

For training the model, we can use a sparse version of softmax cross entropy to compute the loss directly from the label indices. In TensorFlow, this is provided through the tf.nn.sparse_softmax_cross_entropy_with_logits() function. 
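To show why the sparse version needs no one-hot vector, here is a plain-Python sketch of the arithmetic for a single example. It is an illustration of the math only, not the TensorFlow implementation, and the logits are made-up values.

```python
import math

def softmax(logits):
    # subtract the max for numerical stability
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, label_index):
    """Loss from a class index: only the true class's probability matters."""
    return -math.log(softmax(logits)[label_index])

def one_hot_cross_entropy(logits, one_hot):
    """Same loss from a one-hot label, for comparison."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

logits = [2.0, 0.5, -1.0]
sparse_loss = sparse_cross_entropy(logits, 0)
dense_loss = one_hot_cross_entropy(logits, [1, 0, 0])
print(math.isclose(sparse_loss, dense_loss))  # True
```

Both functions return the same loss; the sparse form simply picks out the true class by index.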

In [None]:
"""
import tensorflow as tf

class SqueezeNetModel(object):
    # __init__ and other functions omitted

    # Set up and run model training
    def run_model_setup(self, inputs, labels, is_training):
        logits = self.model_layers(inputs, is_training)
        self.probs = tf.nn.softmax(logits, name='probs')
        self.predictions = tf.argmax(
            self.probs, axis=-1, name='predictions')
        is_correct = tf.equal(
            tf.cast(self.predictions, tf.int32),
            labels)
        is_correct_float = tf.cast(
            is_correct,
            tf.float32)
        self.accuracy = tf.reduce_mean(
            is_correct_float)
        # calculate cross entropy from the sparse (index) labels
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels,
            logits=logits)
        self.loss = tf.reduce_mean(
            cross_entropy)
        # global step counter, incremented by the optimizer on each update
        self.global_step = tf.train.get_or_create_global_step()
        adam = tf.train.AdamOptimizer()
        self.train_op = adam.minimize(
            self.loss, global_step=self.global_step)
"""