**IMAGE CLASSIFICATION WITH PYTORCH: ADDITIONAL MODEL ARCHITECTURES**

*Patrick Donnelly*

In our "Convolutional Neural Network Architectures for Image Classification" tutorial, we covered some of the most common architectures for classifying images: AlexNet, VGG, ResNet (and ResNeXt), and Inception v3. In this tutorial, we cover the remaining architectures in the PyTorch Model Zoo (`torchvision.models`): SqueezeNet, DenseNet, GoogLeNet, ShuffleNet v2, and MobileNet v2. Source code is available at https://github.com/pytorch/vision/tree/master/torchvision/models. Let's get into it!

First up is **SqueezeNet** (Iandola et al. 2016). The motivation behind SqueezeNet is simple: to attain the accuracy of a canonical deep neural network for image classification (e.g. AlexNet) with far (50x) fewer parameters. You can view the paper here: https://arxiv.org/pdf/1602.07360.pdf

The first innovation of SqueezeNet we'll examine is the *Fire module*. We'll use this module to construct our convolutional neural network (CNN). Iandola et al. (2016, p. 3) identify three "strategies" to build a more "efficient" CNN using the Fire module:

1) "**Replace 3x3 filters with 1x1 filters.**" This is straightforward: a 3x3 filter has nine weights per input channel, while a 1x1 filter has only one. Thus we learn nine times fewer weights per filter.

2) "**Decrease the number of input channels to 3x3 filters.**" Fewer input channels means fewer parameters to learn. The number of parameters in a layer equals the number of input channels multiplied by the number of filters multiplied by the dimensions of each filter (in this case 3x3). The first strategy of SqueezeNet was to reduce the filter size. This strategy involves reducing the number of input channels. Logically speaking, there's one other thing we could do! What about our third strategy?

3) "**Downsample late in the network so that convolution layers have larger activation maps**." Well, they're not suggesting we reduce our number of output channels... At the end of the day, the output channels of our final layer need to equal the number of classes we're predicting. However, there are many, many ways to downsample from our input (determined by our data) to our output (determined by our number of classes). The authors of the paper hypothesize that retaining larger activation maps will lead to higher accuracy, as they put it, "with all else held equal." This seems fairly intuitive. The more interesting question is whether the accuracy gained from larger activation maps can compensate for the accuracy lost by feeding fewer input channels to the 3x3 filters and by replacing 3x3 filters with 1x1 filters. You can import the model from torchvision and find out for yourself!
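The first two strategies come down to simple arithmetic on weight counts, which we can check with `nn.Conv2d`. This is only a rough sketch: the channel counts below are illustrative assumptions (not values from the paper), and `count_weights` is a helper name made up for this example.

```python
import torch.nn as nn


def count_weights(conv):
    # Number of learnable weights in a conv layer, ignoring the bias terms
    return conv.weight.numel()


# Illustrative channel counts (assumptions, not taken from the paper)
in_channels, num_filters = 64, 64

# Baseline: 3x3 filters over all input channels
conv3x3 = nn.Conv2d(in_channels, num_filters, kernel_size=3)
# Strategy 1: swap the 3x3 filters for 1x1 filters
conv1x1 = nn.Conv2d(in_channels, num_filters, kernel_size=1)
# Strategy 2: keep the 3x3 filters but feed them 4x fewer input channels
conv3x3_squeezed = nn.Conv2d(in_channels // 4, num_filters, kernel_size=3)

print(count_weights(conv3x3))           # 64 * 64 * 3 * 3 = 36864
print(count_weights(conv1x1))           # 64 * 64 * 1 * 1 = 4096
print(count_weights(conv3x3_squeezed))  # 16 * 64 * 3 * 3 = 9216
```

Strategy 1 cuts the weight count by a factor of nine, while strategy 2 cuts it in proportion to the reduction in input channels (here, a factor of four).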

We've gone long enough without working in code, so let's go ahead and sketch out our SqueezeNet. The source code is available at https://github.com/pytorch/vision/blob/master/torchvision/models/squeezenet.py. The torchvision source code includes two versions of SqueezeNet: 1.0 and 1.1. We'll use version 1.1. First let's do some imports:

In [5]:
import torch  # we need PyTorch, of course
import torch.nn as nn  # allowing us to use nn as shorthand for torch.nn

We'll define our class and give it a constructor and a `forward` method (similar to what we've seen with some of our other CNNs). Our `SqueezeNet` class will inherit from `nn.Module`.

In [6]:
class SqueezeNet(nn.Module):
    
    def __init__(self, num_classes=1000):  # for ImageNet
        super(SqueezeNet, self).__init__()
        self.num_classes = num_classes
        self.features = nn.Sequential()  # placeholder: we'll add the layers shortly
        
    def forward(self, x):
        pass

We'll be filling out `self.features` with the layers of our neural network, and then applying these features to our input in `forward`. But first let's define a separate Fire module class! This will also take a constructor and a `forward` method:

In [11]:
class Fire(nn.Module):
    
    def __init__(self):
        super(Fire, self).__init__()
        
    def forward(self, x):
        pass

Let's return to the paper (p. 4). What goes into our Fire module?

1) "A *squeeze* convolution layer" of 1x1 filters

2) "An *expand* layer" with both 1x1 and 3x3 filters

Our Fire module also includes three *hyperparameters*: values we choose up front that determine the shape of the module, rather than weights learned during training:

1) $s_{1x1}$: the number of (1x1) filters in the squeeze layer

2) $e_{1x1}$: the number of 1x1 filters in the expand layer

3) (you guessed it) $e_{3x3}$: the number of 3x3 filters in the expand layer

We'll also construct our network such that $s_{1x1} < e_{1x1} + e_{3x3}$, since we want relatively few input channels to our 3x3 filters (strategy 2 in the list above).

Our Fire module thus takes the following parameters:

1) `inplanes`: the number of inputs to the squeeze layer

2) `squeeze_planes`: the number of 1x1 filters to learn in the squeeze layer

3) `expand1x1_planes`: the number of 1x1 filters to learn in the expand layer

4) `expand3x3_planes`: the number of 3x3 filters to learn in the expand layer
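To get a feel for how these four arguments drive the parameter count, here is a back-of-the-envelope sketch in plain Python. The function names and the example values (64 input channels, 16 squeeze filters, 64 + 64 expand filters) are assumptions chosen for illustration, and bias terms are ignored. The comparison assumes the expand layer produces `expand1x1_planes + expand3x3_planes` output channels in total.

```python
def fire_weight_count(inplanes, squeeze_planes,
                      expand1x1_planes, expand3x3_planes):
    # Squeeze layer: 1x1 filters applied to every input channel
    squeeze = inplanes * squeeze_planes * 1 * 1
    # Expand layer: both filter sizes only see the squeezed channels
    expand1x1 = squeeze_planes * expand1x1_planes * 1 * 1
    expand3x3 = squeeze_planes * expand3x3_planes * 3 * 3
    return squeeze + expand1x1 + expand3x3


def plain_conv3x3_weight_count(inplanes, out_planes):
    # An ordinary layer of 3x3 filters, for comparison
    return inplanes * out_planes * 3 * 3


# Illustrative values (assumptions for this example)
fire = fire_weight_count(64, 16, 64, 64)
plain = plain_conv3x3_weight_count(64, 64 + 64)

print(fire)   # 1024 + 1024 + 9216 = 11264
print(plain)  # 64 * 128 * 9 = 73728
```

With these numbers the Fire module needs roughly 6.5 times fewer weights than a plain 3x3 layer with the same number of input and output channels. Most of the saving comes from the squeeze layer shrinking the input to the 3x3 filters.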

Let's start filling out our Fire module!

In [14]:
class Fire(nn.Module):
    
    def __init__(self, inplanes, squeeze_planes, 
                 expand1x1_planes, expand3x3_planes):
        super(Fire, self).__init__()
    
    def forward(self, x):
        pass

We can use our arguments to define several instance attributes:

1) `self.inplanes`