**CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE CLASSIFICATION**

*Patrick Donnelly, Groupware Technology*

This lab introduces the architectures of *convolutional neural networks*, a set of algorithms that have come to dominate computer vision research and applications over the past seven years. We examine some of the most influential architectures in chronological order: LeNet-5, AlexNet, VGG, Inception v3, and ResNet. It will help to have some familiarity with neural network operations and concepts. We cover this in a separate workshop: "An Introduction to Neural Networks." We implement each network in PyTorch.

Suppose we want to classify an image. Let's consider the hypothetically simple case of a 5x5 "image" of pixels that can take the values 0 or 1. Think of the image as a number on a scoreboard. When a pixel equals 1, the light turns on. We can thus define an image in Python using a list of lists.

In [1]:
a_zero = [[0,1,1,1,0],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[0,1,1,1,0]]
a_zero

Now suppose we want to classify this image as a digit. Looks like a zero, no? We define our zero label using a **one-hot encoding**. A one-hot encoding is simply a vector where each index corresponds to a class. The image is labeled by assigning a $1$ to the index of the image's class. Every other entry is assigned a $0$. In the case of digit classification, we can simply label the index corresponding to the digit with a $1$ and leave the others as $0$s.

In [2]:
zero_encoding = [1,0,0,0,0,0,0,0,0,0]
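
If we'd rather not type out every encoding by hand, a small helper can build one for any digit. This is just a sketch: the name `one_hot` and its `num_classes` argument are made up for illustration and aren't part of any library used in this lab.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with a single 1 at index `label`."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    encoding = [0] * num_classes
    encoding[label] = 1
    return encoding

zero_encoding = one_hot(0)    # same list as the hand-written encoding
three_encoding = one_hot(3)   # the 1 moves to index 3
print(zero_encoding)
print(three_encoding)
```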

Now all we need is an algorithm that takes **a_zero** as an input and generates **zero_encoding** as an output with sufficient accuracy. Of course in "real life" we do this with thousands or millions of images, and each image is far larger than our 5x5 array of one-bit pixels. The advantage of keeping our data (or datum!) simple is that we can focus on the logic of the algorithm and wrap our head around the operations transforming our input into (predicted) output.

The simplest transformation (at a certain level of generality) we can apply to our input is linear. If we apply a linear transformation to an input pixel, we multiply that pixel by a **weight** and add a **bias**. We then sum the transformed inputs connecting our input **neuron** (e.g. pixel) with our output neuron (corresponding to a particular class).

If this is getting too abstract (as it was for the author of this workshop when he first encountered neural networks), consider our specific example. Let's represent **a_zero** (a list of lists) as a **flattened** vector (list):

In [3]:
a_zero_vector = [0,1,1,1,0,1,0,0,0,1,1,0,0,0,1,1,0,0,0,1,0,1,1,1,0]
a_zero_vector

[0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

Now let's say we want to "linearly transform" this vector of input pixels into our class vector. We'll first need to **matrix multiply** the input by a **weight matrix**. We'll import a library called **numpy** to perform matrix operations in Python. By convention:

In [4]:
import numpy as np

We use **np.array** to represent our lists as numpy arrays:

In [5]:
a_zero_array = np.array(a_zero)
a_zero_array

array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [0, 1, 1, 1, 0]])

In [6]:
a_zero_vector_array = np.array(a_zero_vector)
a_zero_vector_array

array([0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 0])

In [7]:
zero_encoding_array = np.array(zero_encoding)
zero_encoding_array

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Let's print the dimensions of our arrays:

In [8]:
print(a_zero_array.shape)
print(zero_encoding_array.shape)

(5, 5)
(10,)


We'll need to reshape these arrays as (25,1) and (10,1) vectors so that we can do matrix math:

In [9]:
a_zero_array_reshaped = a_zero_array.reshape(25,1)
print(a_zero_array_reshaped.shape)
a_zero_array_reshaped

(25, 1)


array([[0],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [1],
       [1],
       [0]])

In [10]:
zero_encoding_array_reshaped = zero_encoding_array.reshape(10,1)
print(zero_encoding_array_reshaped.shape)
zero_encoding_array_reshaped

(10, 1)


array([[1],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0]])

Let's construct our weight matrix. If $A$ is an $n \times m$ matrix and $B$ is an $m \times p$ matrix, the product of $A$ and $B$ is an $n \times p$ matrix. See https://en.wikipedia.org/wiki/Matrix_multiplication

For our example, we can let $B$ be our (25,1) input vector. Thus $m=25$ and $p=1$. Our output is our (10,1) output vector, where $n=10$ and (again) $p=1$. Thus the dimensions of $A$ are $n \times m$: (10,25). We can randomly initialize the entries in this matrix using **np.random.randn**, which will sample each value from a normal distribution with mean $0$ and variance $1$. We'll call our weight matrix $W$:

In [11]:
W = np.random.randn(10,25)
W

array([[ 0.08959282,  1.23659129,  0.56369847, -0.8525387 ,  0.559432  ,
         1.47691024, -0.08397301,  1.1315573 , -0.02773398, -0.31005974,
        -0.52285462, -0.11918065,  0.11545375, -1.02465101, -0.31175265,
        -1.15691582,  0.67346433, -0.2259513 , -1.08184905,  1.78045084,
        -0.24103242, -0.40644409, -0.13537547,  0.71744809, -0.79537288],
       [ 0.73186254,  1.37300205, -0.60917055,  0.7650649 ,  0.83751162,
         0.07958813, -0.32937862, -0.01131647,  0.69321975,  0.51753006,
        -0.69461994, -0.6962348 ,  1.60917693, -0.11310379,  0.39129755,
         0.11881143,  0.44642317,  0.66268916, -0.63407709,  1.33305708,
         2.16614866, -0.20464374,  0.12178699,  0.83161651, -0.43977785],
       [ 0.62921154,  0.06338423,  0.37627663,  0.61492115, -0.60100287,
         2.50959962, -0.01203361, -0.73906838, -0.78917858, -0.16397109,
         0.55888729, -0.21641854,  0.088418  ,  3.04592433,  0.47819311,
         1.7837566 , -1.15263144,  0.72841872, -0

We can also initialize a bias vector $b$ to add to the product of our weight matrix $W$ and our input vector:

In [12]:
b = np.random.randn(10,1)
b

array([[ 1.60884757],
       [ 0.69341915],
       [ 2.13290841],
       [-0.18740879],
       [-0.94227445],
       [ 1.17850932],
       [ 0.18730949],
       [ 1.0631529 ],
       [-0.68747027],
       [ 0.06825139]])
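
With a weight matrix and a bias vector in hand, here is a minimal sketch of the linear transformation itself. The seeded generator `rng` and the names `x`, `z`, and `z0_by_hand` are assumptions added for this illustration; the values will differ from the randomly drawn $W$ and $b$ printed above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

a_zero = [[0,1,1,1,0],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[0,1,1,1,0]]
x = np.array(a_zero).reshape(25, 1)   # (25, 1) input vector
W = rng.standard_normal((10, 25))     # (10, 25) weight matrix
b = rng.standard_normal((10, 1))      # (10, 1) bias vector

z = W @ x + b                         # (10, 25) @ (25, 1) + (10, 1) -> (10, 1)
print(z.shape)

# Each output is a weighted sum of all 25 pixels plus one bias
z0_by_hand = sum(W[0, j] * x[j, 0] for j in range(25)) + b[0, 0]
print(z0_by_hand, z[0, 0])
```

The last two lines spell out the first output neuron by hand, which is all the matrix product is doing for each of the ten rows of $W$.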

Training a neural network involves...

1) Multiplying a weight matrix $W$ by our input vector

2) Adding a bias vector to the product of the weight matrix and input vector

...as well as some stuff that we'll cover later, and in more detail in "An Introduction to Neural Networks"

3) Using an **activation function** to transform the output of our matrix multiplication and addition

4) Repeating steps 1-3

5) Applying a **softmax** operation to normalize our activated neurons

6) Computing our loss

7) Updating our weights

In our example, we have connected each input neuron to each output neuron. This is what is known as a **fully-connected** or **dense** neural network. (The lexicon of deep learning is full of synonyms.) Hence we're assuming that the class of an image is optimally learned (in terms of accuracy and computational efficiency) by summing up weighted pixel values and adding a bias to the output.

It turns out there's a better way of doing things. Enter the **convolutional neural network** (**CNN**, **conv net**, what did we just say about these synonyms?) Rather than learning a weight associated with each input-output pair, a convolutional neural network learns a set of **shared weights** (**filters**, **kernels**, **receptive fields**). These shared weights have a **shape**, **volume**, and **stride**. Filters are typically square, and every filter in a **layer** connecting input and output neurons has the same shape. The **volume** defines how many filters to learn for a particular layer.

To compute the output of a convolutional layer, we **stride** each filter across the input matrix (using our example, think of the image as a 5x5 matrix rather than a 25x1 vector) and take the **Hadamard product** (**Schur product**, **entrywise product**): we simply multiply the values of each entry in the filter with the corresponding values in the input matrix. We then sum over all the outputs. Some people call this the dot product, and they have a point: the sum equals the dot product of the flattened filter and the flattened patch of input that it covers.
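
One way to see why the summed entrywise product gets called a dot product is to compute it both ways on a single patch. This is only a sketch: the `kernel` values are hypothetical numbers picked for illustration, and `patch` is the top-left 2x2 corner of our 5x5 "zero" image.

```python
import numpy as np

patch = np.array([[0, 1], [1, 0]])    # top-left 2x2 corner of the 5x5 "zero" image
kernel = np.array([[1, 2], [3, 4]])   # hypothetical filter values, for illustration only

hadamard_sum = np.sum(patch * kernel)                 # entrywise product, then sum
dot_product = np.dot(patch.ravel(), kernel.ravel())   # dot product of the flattened arrays

print(hadamard_sum, dot_product)
```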

We implement convolutions in NumPy in our introductory neural network course. If you're having trouble wrapping your head around the convolutional operation, you might also want to check out the Stanford course on convolutional neural networks: http://cs231n.github.io. Chris Olah also has a great blog that explains convolutions (among other things): https://colah.github.io/posts/2014-07-Understanding-Convolutions/

We have a lot to cover, so let's import PyTorch before we get too lost in the convolutional weeds:

In [13]:
import torch

PyTorch is a **framework** for defining and executing neural network operations. It's developed by Facebook, along with Caffe2, originally built by a Berkeley PhD who sold out.** Other frameworks include TensorFlow (Google), MXNet (Amazon), and Keras (which is actually a higher-level API that simplifies coding neural networks in TensorFlow and other frameworks). PyTorch is already simple enough, so we don't need Keras.

** n.b. Pat has a PhD from UC Berkeley

PyTorch enables us to define neural network layers in one line of code using the **nn** and **nn.functional** modules. Let's import them:

In [14]:
import torch.nn as nn
import torch.nn.functional as F

We can use **nn.Linear** to define a dense layer (conventionally **fc** for fully-connected) connecting 25 inputs with 10 outputs:

In [15]:
fc = nn.Linear(25,10)

To construct a neural network, we define a class **Net** that inherits from **nn.Module**. Within this __Net__, we define our neural network architecture using an **\_\_init\_\_** method (the constructor), along with our __forward propagation__ using a **forward** method. Forward propagation is the sequential application of operations to transform our input (image) into output (class vectors):

In [16]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(25,10)
    def forward(self, x):
        x = self.fc(x)
        return x

PyTorch has some good tutorials with sample neural networks. I'm using this one as a reference: https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html

For some reason the PyTorch documentation is slow to load, and I don't like the design. It's almost enough to make me switch to TensorFlow.

Now we can call an instance **net** of our network **Net()** and print it:

In [17]:
net = Net()
print(net)

Net(
  (fc): Linear(in_features=25, out_features=10, bias=True)
)


We see a simple network comprised of one dense (fully-connected, linear) layer, 25 input nodes (features), and 10 output nodes. **bias=True** indicates that we're adding a bias vector to our matrix multiplication. We can remove our bias vector by setting **bias** to **False**, for instance in **self.fc = nn.Linear(25,10,bias=False)**

What if we want to replace our dense layer with a convolutional layer? We use **nn.Conv2d** to specify that we're doing a **two-dimensional convolution**. Our convolution is two-dimensional because we're applying a 2D filter to a corresponding 2D input. Here's an article that will help you to understand other types of convolutions: https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215

Remember what we need to define a convolution: shape, volume, and stride! In PyTorch, we define volume using the parameter **out_channels**. We also specify our number of input channels with **in_channels**. If we're using our toy example of a 5x5 image of one-bit pixels, we only have one input channel. We define shape using **kernel_size** (the width and height of the kernel, assuming a square kernel). Stride is simply **stride**. Let's say we want to apply a 2x2 kernel with stride 1 and generate 4 output channels from our single input:

In [18]:
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)

Now we can replace our fully-connected layer with our convolutional layer!

In [19]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
    def forward(self, x):
        x = self.conv(x)
        return x
    
net = Net()
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
)
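
To make the "shared weights" point concrete, we can count the learnable parameters in each kind of layer. A minimal sketch, assuming a helper we'll call `count_parameters` (not a PyTorch function) that sums the sizes of a module's parameter tensors:

```python
import torch.nn as nn

def count_parameters(module):
    """Total number of learnable values (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

fc = nn.Linear(25, 10)
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)

fc_params = count_parameters(fc)      # 25*10 weights + 10 biases
conv_params = count_parameters(conv)  # 4 filters * (1*2*2 weights) + 4 biases
print(fc_params, conv_params)
```

The dense layer learns one weight per input-output pair, while the convolutional layer reuses the same four small filters at every position in the image.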


Cool, so we've learned how to implement a convolution in PyTorch! However, convolutional neural networks generally contain at least six operations (and often more):

1) **Convolutions** to extract visual information in the form of shared weights

2) **Pooling** to downsample the output of convolutional operations

3) **Activations** to learn nonlinear features

4) **Flattening** to reshape the three-dimensional output into a one-dimensional vector

5) **Linear transformation** to downsample the flattened output to a vector whose length corresponds to the number of classes

6) **Softmax** to convert the raw output into a probability vector

Let's continue with pooling. If you're unsatisfied by the mathematical complexity of neural networks so far (it's just multiplications and additions??), be prepared to be further disappointed. The most common way to apply pooling is to take the maximum value over a receptive field. With operations like this and words like that, it's surprising that computer vision isn't dominated by English majors. We call this **max pooling**.

In PyTorch, we apply max pooling (over a two-dimensional receptive field) using **nn.MaxPool2d**. Again, we specify __kernel_size__ and **stride** (but not input and output channels). Let's say we want to pool over a 2x2 area with stride of 2, so that the pooled regions don't overlap:

In [20]:
pool = nn.MaxPool2d(kernel_size=2, stride=2)

Let's add this to our neural network:

In [21]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
    def forward(self, x):
        x = self.pool(self.conv(x))
        return x
    
net = Net()
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
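
It can help to check what these two layers do to the shape of a 5x5 input. The sketch below assumes an unpadded layer, for which the output width is `(n - kernel_size) // stride + 1`; the helper name `output_size` is ours, not PyTorch's. Note that PyTorch layers expect leading batch and channel dimensions, so the dummy input has shape (1, 1, 5, 5).

```python
import torch
import torch.nn as nn

def output_size(n, kernel_size, stride):
    """Output width/height of an unpadded convolution or pooling window."""
    return (n - kernel_size) // stride + 1

conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.zeros(1, 1, 5, 5)    # (batch, channels, height, width)
after_conv = conv(x)           # four 4x4 feature maps
after_pool = pool(after_conv)  # four 2x2 feature maps

print(after_conv.shape, output_size(5, kernel_size=2, stride=1))
print(after_pool.shape, output_size(4, kernel_size=2, stride=2))
```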


Now let's move on to our **activation** function. There are lots of ways in which we could introduce a nonlinearity into our network. For more details, see the "Introduction to Neural Networks" tutorial, as well as this overview of activation functions: http://cs231n.github.io/neural-networks-1/#actfun. Actually, you all should just take the Stanford course; it's amazing.

We need our activation function to do a couple of things:

1) It needs to be nonlinear, of course. This is the whole point - we want to learn a nonlinear function between input and output.

2) It needs to enable the network to train effectively. When we update a weight in our network, we take the partial derivative of our loss function with respect to the weight, scale the derivative by our learning rate, and subtract this value from our weight. This requires us to apply the **chain rule**, where we take the derivative of each output with respect to each input all the way through the network. If the derivative is zero (or close to zero) for any layer (set of input/output pairs), we won't be able to update our weights. Long story short, we need the derivative of our activation function to be nonzero in order to learn. It's okay if the derivative is zero for some inputs. As we'll see in just a second, this is a property of one of the most commonly-used activation functions for convolutional neural networks.

Enter the **rectified linear unit (ReLU)**. All ReLU does is set our negative inputs to zero! We could probably write this function ourselves, but PyTorch makes it even simpler. We just pass our input to **F.relu()** (remember we have the functional API __F__ in addition to the **nn** API). Let's add a ReLU to our network after the convolution and pooling operations. Note that this doesn't change the output of **print(net)**, even though it'll change the value of our outputs for sure:

In [22]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
    def forward(self, x):
        x = F.relu(self.pool(self.conv(x)))
        return x
    
net = Net()
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
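
Since **print(net)** doesn't show the activation, here is a quick sketch of what ReLU does to values and to gradients. The input values are arbitrary numbers picked for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 1.0, 3.0], requires_grad=True)
y = F.relu(x)          # negative entries become 0, positive entries pass through
y.sum().backward()     # derivative of ReLU: 0 for negative inputs, 1 for positive inputs

print(y)
print(x.grad)
```

The gradient is zero wherever the input was negative, which is the "zero for some inputs" property mentioned above.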


Let's take a second to think about how we've transformed the dimensions of our input. We can be totally explicit with this since our datum is small.

We started with our input $X$, equal to our "zero" image:

In [23]:
X = a_zero
X

[[0, 1, 1, 1, 0],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [1, 0, 0, 0, 1],
 [0, 1, 1, 1, 0]]

What happens when we stride a $2 \times 2$ filter across this image? Let's assume for the sake of example that our filter is simply a $2 \times 2$ array of ones:

In [24]:
f = [[1,1],[1,1]]
f

[[1, 1], [1, 1]]

Now we can start striding across $X$. Since all of the values of $f$ are 1, the elementwise product is simply the input values at every point. Thus we just need to sum over $2 \times 2$ regions of $X$, then stride one column at a time, then do this for the next two rows, all the way until we have strided across the entire array. Every time we begin to stride across a new pair of rows, we similarly begin to output the values of our convolutions in a new row:

                    Rows (0,1), Columns (0,1): 0+1+1+0 = 2 [Enter in (0,0)]
    Stride 1 right: Rows (0,1), Columns (1,2): 1+1+0+0 = 2 [Enter in (0,1)]
    Stride 1 right: Rows (0,1), Columns (2,3): 1+1+0+0 = 2 [Enter in (0,2)]
    Stride 1 right: Rows (0,1), Columns (3,4): 1+0+0+1 = 2 [Enter in (0,3)]
    Stride 1 down:  Rows (1,2), Columns (0,1): 1+0+1+0 = 2 [Enter in (1,0)]
    Stride 1 right: Rows (1,2), Columns (1,2): 0+0+0+0 = 0 [Enter in (1,1)]
    Stride 1 right: Rows (1,2), Columns (2,3): 0+0+0+0 = 0 [Enter in (1,2)]
    Stride 1 right: Rows (1,2), Columns (3,4): 0+1+0+1 = 2 [Enter in (1,3)]
    Stride 1 down:  Rows (2,3), Columns (0,1): 1+0+1+0 = 2 [Enter in (2,0)]
    Stride 1 right: Rows (2,3), Columns (1,2): 0+0+0+0 = 0 [Enter in (2,1)]
    Stride 1 right: Rows (2,3), Columns (2,3): 0+0+0+0 = 0 [Enter in (2,2)]
    Stride 1 right: Rows (2,3), Columns (3,4): 0+1+0+1 = 2 [Enter in (2,3)]
    Stride 1 down:  Rows (3,4), Columns (0,1): 1+0+0+1 = 2 [Enter in (3,0)]
    Stride 1 right: Rows (3,4), Columns (1,2): 0+0+1+1 = 2 [Enter in (3,1)]
    Stride 1 right: Rows (3,4), Columns (2,3): 0+0+1+1 = 2 [Enter in (3,2)]
    Stride 1 right: Rows (3,4), Columns (3,4): 0+1+1+0 = 2 [Enter in (3,3)]
    
This yields our first output **feature map**. We'll call it $V_0$ (for volume):

In [25]:
V_0 = [[2,2,2,2],[2,0,0,2],[2,0,0,2],[2,2,2,2]]
V_0

[[2, 2, 2, 2], [2, 0, 0, 2], [2, 0, 0, 2], [2, 2, 2, 2]]
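
The striding we just did by hand can be written as two nested loops. This is a minimal sketch rather than how PyTorch does it internally: the function name `conv2d_single` is made up, and it assumes a square kernel, no padding, and a single input channel.

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    """Slide one square 2D kernel over one 2D image (no padding), summing entrywise products."""
    k = kernel.shape[0]
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    out = np.zeros((out_h, out_w), dtype=image.dtype)
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride            # top-left corner of the current patch
            out[i, j] = np.sum(image[r:r + k, c:c + k] * kernel)
    return out

X = np.array([[0,1,1,1,0],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[0,1,1,1,0]])
f = np.ones((2, 2), dtype=int)

V_0 = conv2d_single(X, f, stride=1)
print(V_0)
```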

We then repeat this operation three more times, yielding a total of four feature maps (equal to our number of **out_channels**). Just for example, let's assume our second filter is **initialized** as an array of twos (doubling our output relative to the first feature map, which gives $V_1$), our third filter is an array of threes (tripling our output relative to the first feature map, which gives $V_2$), and our fourth filter is an array of zeroes (yielding $V_3$, an array of zeroes).

In [26]:
V_1 = [[4,4,4,4],[4,0,0,4],[4,0,0,4],[4,4,4,4]]
V_2 = [[6,6,6,6],[6,0,0,6],[6,0,0,6],[6,6,6,6]]
V_3 = [[0,0,0,0],[0,0,0,0],[0,0,0,0],[0,0,0,0]]
print(V_0)
print(V_1)
print(V_2)
print(V_3)

[[2, 2, 2, 2], [2, 0, 0, 2], [2, 0, 0, 2], [2, 2, 2, 2]]
[[4, 4, 4, 4], [4, 0, 0, 4], [4, 0, 0, 4], [4, 4, 4, 4]]
[[6, 6, 6, 6], [6, 0, 0, 6], [6, 0, 0, 6], [6, 6, 6, 6]]
[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]


Now we pool over each $2 \times 2$ region of each feature map $\{V_0, V_1, V_2, V_3\}$, moving two rows or columns at a time (stride 2) so that the regions don't overlap. Consider $V_0$:

        Rows (0,1), Columns (0,1): 2 [Enter in (0,0)]
        Rows (0,1), Columns (2,3): 2 [Enter in (0,1)]
        Rows (2,3), Columns (0,1): 2 [Enter in (1,0)]
        Rows (2,3), Columns (2,3): 2 [Enter in (1,1)]
        
Let's define $P_0$ as the output of pooling over $V_0$:

In [27]:
P_0 = [[2,2],[2,2]]
P_0

[[2, 2], [2, 2]]

We can then do the same for $V_1$, $V_2$, and $V_3$:

In [28]:
P_1 = [[4,4],[4,4]]
P_2 = [[6,6],[6,6]]
P_3 = [[0,0],[0,0]]
print(P_0)
print(P_1)
print(P_2)
print(P_3)

[[2, 2], [2, 2]]
[[4, 4], [4, 4]]
[[6, 6], [6, 6]]
[[0, 0], [0, 0]]
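
We can check the hand-computed feature maps and pooled outputs with PyTorch's functional API. This is a sketch under the same assumptions as the walkthrough: four constant 2x2 filters (ones, twos, threes, zeros), no bias, convolution stride 1, and 2x2 max pooling with stride 2. The names `scales`, `filters`, `V`, and `P` are ours.

```python
import torch
import torch.nn.functional as F

X = torch.tensor([[0,1,1,1,0],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[0,1,1,1,0]]).float()
X = X.view(1, 1, 5, 5)                          # (batch, channels, height, width)

# Four constant 2x2 filters: ones, twos, threes, zeros -> weight shape (4, 1, 2, 2)
scales = torch.tensor([1.0, 2.0, 3.0, 0.0])
filters = scales.view(4, 1, 1, 1) * torch.ones(4, 1, 2, 2)

V = F.conv2d(X, filters, stride=1)              # (1, 4, 4, 4): feature maps V_0..V_3
P = F.max_pool2d(V, kernel_size=2, stride=2)    # (1, 4, 2, 2): pooled maps P_0..P_3

print(V[0, 1])   # second feature map
print(P[0])      # all four pooled maps
```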


To sum up:

1) We applied convolutions to our 5x5 input (25 pixels or nodes) to learn four 4x4 arrays (4x4x4 = 64 nodes)

2) We used pooling to downsample our four 4x4 arrays (64 nodes) to four 2x2 arrays (4x2x2 = 16 nodes)

3) Now we need to downsample further from 16 nodes to 10 nodes (the number of classes in our output vector)

4) However, we first need to **concatenate** or "flatten" our output into a single dimension. PyTorch makes it easy to flatten __tensors__ (multidimensional arrays) using the **view** method. We pass two arguments to __view__: a $-1$ (telling PyTorch to infer that dimension, here the number of images in the batch, from the total number of elements) and a $16$ (the number of nodes per image). Let's add it to our network!

In [29]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
    def forward(self, x):
        x = F.relu(self.pool(self.conv(x)))
        x = x.view(-1,16)
        return x
    
net = Net()
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
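
Here is a small sketch of what **view(-1, 16)** does to tensor shapes. The tensors are dummy data, and the batch of three images is hypothetical, included only to show how the first dimension adapts.

```python
import torch

single = torch.arange(16.0).view(1, 4, 2, 2)   # one image: 4 feature maps of 2x2
batch = torch.zeros(3, 4, 2, 2)                # a hypothetical batch of three images

flat_single = single.view(-1, 16)   # -1 lets PyTorch infer the first dimension
flat_batch = batch.view(-1, 16)

print(flat_single.shape)
print(flat_batch.shape)
```

Each row of the result holds the 16 nodes for one image, so the same line of code works for any batch size.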


Now we're at a vector of 16 nodes. We need to get from 16 to 10 (the number of classes in our output vector). A **linear** layer with 16 input features and 10 output features should do the trick! We'll call it **fc** (again, for fully connected) just like we did before:

In [30]:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(16,10)
    def forward(self, x):
        print(x)
        x = F.relu(self.pool(self.conv(x)))
        x = x.view(-1,16)
        x = self.fc(x)
        return x
    
net = Net()
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc): Linear(in_features=16, out_features=10, bias=True)
)


Now let's see what features our network learns from one forward pass of our 5x5 "image." First we'll need to convert the image to a tensor using the **torch.tensor** function. Let's again call our sample input $X$. __requires_grad=False__ means that we don't want to calculate a **gradient** for $X$. $X$ is an input of constant pixel values, not a tensor of variable weights to update through backpropagation! Finally, __float()__ converts our input into the floating-point data type that our layers expect (32-bit floats may not be the optimal precision for performance, but that's okay for now):

In [31]:
X = torch.tensor(a_zero_array, requires_grad=False).float()
print(X.shape)
X

torch.Size([5, 5])


tensor([[0., 1., 1., 1., 0.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 0., 1.],
        [1., 0., 0., 0., 1.],
        [0., 1., 1., 1., 0.]])

One more thing before we feed our input tensor through the network. We need to add two additional dimensions: **batch_size** and **channels**. Both of these are just __1__: we have only one observation (our sample 5x5 image) and one channel (our 1-bit pixel value). In practice, we might be dealing with RGB images or video, and we might be feeding multiple observations through the network at once (this is one way in which we take advantage of the parallel architecture of GPUs and other accelerators).

We again use **x.view** to reshape our input:

In [32]:
X = X.view(1,1,5,5)
print(X.shape)
X

torch.Size([1, 1, 5, 5])


tensor([[[[0., 1., 1., 1., 0.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [0., 1., 1., 1., 0.]]]])

We're also gonna need to create a tensor for our label. PyTorch makes this easy too. Just define a tensor with \[0\]! Again, we'll want to turn off gradient updates - it's a label, not a parameter...

In [33]:
y = torch.tensor([0], requires_grad=False)

In [34]:
y

tensor([0])

Now we can pass our input through an instance of our neural network. We'll call our predicted output $Y$:

In [35]:
X

tensor([[[[0., 1., 1., 1., 0.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [0., 1., 1., 1., 0.]]]])

In [36]:
print(net)

Net(
  (conv): Conv2d(1, 4, kernel_size=(2, 2), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc): Linear(in_features=16, out_features=10, bias=True)
)


In [37]:
Y = net(X)
Y

tensor([[[[0., 1., 1., 1., 0.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [1., 0., 0., 0., 1.],
          [0., 1., 1., 1., 0.]]]])


tensor([[-0.4813,  0.0130,  0.1218, -0.2017, -0.2990, -0.3195, -0.0297,  0.1796,
         -0.1663,  0.1432]], grad_fn=<AddmmBackward>)

You should see a vector of length 10! We haven't applied softmax yet, so it hasn't been normalized to a probability distribution yet, but you can see that the network thinks our 0 is a 7 (the largest value, 0.1796, sits at index 7). We haven't updated weights and biases yet, so that's completely meaningless...
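
Rather than eyeballing the vector, we can ask PyTorch for the index of the largest entry. This sketch hard-codes the values from the printed output above under the name `Y_printed` (our name), since a freshly initialized network will produce different numbers on every run.

```python
import torch
import torch.nn.functional as F

# Raw outputs copied from the printed tensor above
Y_printed = torch.tensor([[-0.4813, 0.0130, 0.1218, -0.2017, -0.2990,
                           -0.3195, -0.0297, 0.1796, -0.1663, 0.1432]])

predicted_class = torch.argmax(Y_printed, dim=1).item()   # index of the largest raw output
probabilities = F.softmax(Y_printed, dim=1)               # normalize to a probability vector

print(predicted_class)
print(probabilities.sum().item())
```

Softmax preserves the ordering of the entries, so the predicted class is the same before and after normalization.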

All that's left to do is pass the raw output through a softmax function, compute loss, update weights, and repeat until satisfied! Since we still haven't even gotten to LeNet, we'll skimp on the details (and the math). Here's a non-technical description of all three things:

1) The softmax function converts our raw output to a probability vector by exponentiating each entry of the output vector (the input to our softmax function), $e^{i}$ for each value $i$, then dividing each $e^{i}$ by the sum of all exponentiated outputs.

2) For classification, our loss is simply the negative natural logarithm of the predicted probability for the class corresponding to our ground truth. For instance, if we have a "hot dog" / "not hot dog" classifier, and our algorithm (correctly) predicts "hot dog" with probability 0.8, our loss is $-\ln(0.8) \approx 0.22$. Note that our model still has a positive loss since the probability of "hot dog" is 0.8 (and thus "not hot dog" is 0.2), even though the "ground truth" is "hot dog" with probability 1 and "not hot dog" with probability 0. Assuming we are 100% confident in our observation... Make sense?

3) We update weights and biases by taking the derivative of our loss function with respect to each, scaling by our learning rate, and subtracting from our current values.
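
The first two steps above can be sketched in a few lines of plain Python and checked against PyTorch. The `logits` and `target` values are hypothetical, and the `softmax` helper is our own; it skips the numerical-stability tricks a library implementation would use.

```python
import math
import torch
import torch.nn as nn

def softmax(values):
    """Exponentiate each entry, then divide by the sum of the exponentials."""
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical raw outputs for three classes
target = 0                 # hypothetical ground-truth class index

probs = softmax(logits)
loss_by_hand = -math.log(probs[target])   # negative log of the true class's probability

# nn.CrossEntropyLoss takes raw outputs and applies the softmax internally
loss_torch = nn.CrossEntropyLoss()(torch.tensor([logits]), torch.tensor([target])).item()

hot_dog_loss = -math.log(0.8)             # the "hot dog" example from step 2
print(probs, loss_by_hand, loss_torch, hot_dog_loss)
```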

In PyTorch, we compute our softmax and calculate our loss in one go. **SGD** means that we are applying **stochastic gradient descent**. This is the process described in __3)__ above, except that we evaluate the gradient on randomly sampled observations. Of course this doesn't matter since we only have one observation, but best practices yadda yadda. We apply this to our weights and biases (**net.parameters()**) with a __learning rate__ of 0.1. The learning rate is the constant by which we scale our gradient updates.

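**nn.CrossEntropyLoss** is the negative log loss described in __2)__ (with the softmax built in), and the **backward()** method automatically computes the gradient of the loss with respect to each parameter. The optimizer then uses those gradients to update our parameters when we call its **step()** method. It really is magic!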

Let's make sure to import the **optim** package first:

In [38]:
import torch.optim as optim

Now we can calculate our loss:

In [39]:
optimizer = optim.SGD(net.parameters(), lr=0.1)
ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(Y, y)
loss.backward()

Finally, let's update our weights!

In [40]:
optimizer.step()
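
"Repeat until satisfied" usually means wrapping these calls in a loop. Below is a minimal sketch that restates the toy network so it runs on its own. The 20 iterations, the fixed seed, and the `losses` list are choices made for illustration. The call to `optimizer.zero_grad()` is there because PyTorch accumulates gradients across calls to `backward()`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=2, stride=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc = nn.Linear(16, 10)
    def forward(self, x):
        x = F.relu(self.pool(self.conv(x)))
        x = x.view(-1, 16)
        return self.fc(x)

torch.manual_seed(0)   # fixed seed so the sketch is repeatable
a_zero = [[0,1,1,1,0],[1,0,0,0,1],[1,0,0,0,1],[1,0,0,0,1],[0,1,1,1,0]]
X = torch.tensor(a_zero).float().view(1, 1, 5, 5)
y = torch.tensor([0])

net = Net()
optimizer = optim.SGD(net.parameters(), lr=0.1)
ce_loss = nn.CrossEntropyLoss()

losses = []
for step in range(20):
    optimizer.zero_grad()        # clear gradients left over from the previous step
    loss = ce_loss(net(X), y)    # forward pass and loss
    loss.backward()              # compute gradients of the loss for each parameter
    optimizer.step()             # apply the SGD update to weights and biases
    losses.append(loss.item())

print(losses[0], losses[-1])
```

With a single training example, the loss should shrink as the network memorizes that our "image" is a zero.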

Great! So we've built a simple classifier network and "trained" it on our toy example. How about we use PyTorch to implement some "real" convolutional neural networks?

Let's start with LeNet-5 (LeCun et al. 1998). Geoff Hinton is the Godfather of Deep Learning, but Yann LeCun is perhaps the underboss in the world of convolutional nets (Alex Krizhevsky is the consigliere?) In 1989, LeCun and his coauthors developed a convolutional neural network trained with backpropagation to recognize handwritten digits. This network was used to classify zip codes using data provided by the U.S. Postal Service. Nine years later, LeCun, along with Leon Bottou, Yoshua Bengio, and Patrick Haffner developed LeNet-5. Commercial banks used LeNet-5 to recognize handwritten digits on checks.

LeNet-5 has (as you may have guessed) five layers, though I'm not sure if the "5" refers to the number of layers or maybe it was the fifth version of LeNet? There are more layers if you count pooling and activation as layers rather than "things in the forward method that are applied to the layers defined in the constructor."

Let's start with our template. We'll rename it LeNet, which now means we're plagiarizing unless I cite this: https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py

In [41]:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()

        # layers go here
        
    def forward(self, x):
        
        # pooling and activations go here
        
        return x
    
net = LeNet()
print(net)

LeNet()


The architecture of LeNet is simple:

1) A convolutional layer that takes 1 input channel for monochrome images, as is the case with our example, or 3 input channels for color images, and applies 6 5x5 convolutional filters with a stride of 1.

2) Another convolutional layer that takes 6 input channels (the output of the first convolutional layer) and applies 16 5x5 convolutional filters with a stride of 1.

3) A fully-connected layer that takes 400 input features (the 16 feature maps of width 5 and height 5 that remain after the second convolution and pooling, assuming a 32x32 input image) and applies a linear transformation to generate 120 output features.

4) A fully-connected layer that takes 120 input features (the output of the prior fully-connected layer) and applies a linear transformation to generate 84 output features (where did these numbers come from?)

5) A fully-connected layer that takes 84 input features (the output of the prior fully-connected layer) and applies a linear transformation to generate 10 output features (corresponding to the number of classes).
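
The 400 in step 3 follows from a little shape arithmetic. The sketch below assumes a 32x32 input image, no padding, and a 2x2 max pool with stride 2 after each convolution (the pooling step this lab adds shortly); the helper `output_size` is our own name.

```python
def output_size(n, kernel_size, stride):
    """Output width/height of an unpadded convolution or pooling window."""
    return (n - kernel_size) // stride + 1

size = 32                                           # assumed input width and height
size = output_size(size, kernel_size=5, stride=1)   # conv1: 32 -> 28
size = output_size(size, kernel_size=2, stride=2)   # pool:  28 -> 14
size = output_size(size, kernel_size=5, stride=1)   # conv2: 14 -> 10
size = output_size(size, kernel_size=2, stride=2)   # pool:  10 -> 5

flattened_features = 16 * size * size               # 16 feature maps of size x size
print(size, flattened_features)
```

A different input size (say, 28x28) would change the number of features reaching the first fully-connected layer, so **in_features** would need to change with it.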

Let's add this to our network! We don't need to state our argument names. For instance, **self.conv1 = nn.Conv2d(1,6,5)** will do the trick instead of what we have below (stride defaults to 1). However, let's keep it explicit for now so that we understand what our network is doing:

In [42]:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
        self.conv2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5, stride=1)
        self.fc1 = nn.Linear(in_features=16*5*5, out_features=120)
        self.fc2 = nn.Linear(in_features=120, out_features=84)
        self.fc3 = nn.Linear(in_features=84, out_features=10)
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.fc3(x)   
        return x
    
net = LeNet()
print(net)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


Now let's add activation and pooling functions! For LeNet-5, we apply a ReLU activation to every layer except for **fc3**, which we are already going to pass through our **softmax** nonlinearity (to generate a probability vector for the purposes of calculating loss and updating weights and biases).

We also apply max pooling (2x2 region) to the output of our convolutional layers **after** passing that output through our ReLU activation. One thought (or actual!) experiment you might want to try is swapping pooling and ReLU. Does this change the output of our network? Does this generate a more accurate model? What about the time it takes to train the network?
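If you'd like to run the swap as an actual experiment, here is a minimal sketch. It uses a random tensor (the name **feature_maps** is ours, standing in for the output of a convolutional layer) rather than a trained network, so it only speaks to the first question: whether the output changes.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# A random stand-in for a convolutional output: (batch_size, channels, height, width)
feature_maps = torch.randn(1, 6, 28, 28)

relu_then_pool = F.max_pool2d(F.relu(feature_maps), 2)
pool_then_relu = F.relu(F.max_pool2d(feature_maps, 2))

# ReLU never reorders values, so the maximum of a region is the same element either way
same_output = torch.equal(relu_then_pool, pool_then_relu)
print(same_output)
print(feature_maps.numel(), "values see ReLU before pooling;",
      pool_then_relu.numel(), "values see ReLU after pooling")
```

The two orderings agree on this tensor. Pooling first simply means ReLU is evaluated on a quarter as many values.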

Here's LeNet-5, with ReLU and max pooling added. This is all we need to define our network architecture (convolutional, fully-connected, and max pooling layers) and forward propagation (the order in which we apply layers, activations, and pooling). We can drop **self.pool** anywhere we'd like among the layers in our constructor, since the sequence of forward propagation is defined in the subsequent __forward__ method.

We also need to make sure that we flatten the output after applying our second pooling operation, since our linear layers expect one flat vector per image rather than a three-dimensional tensor with width and height (the dimensions of the feature maps produced by our convolutional and pooling operations) and depth (the number of feature maps). Again, we do this using __x.view(-1, 16 * 5 * 5)__, where __16 * 5 * 5__ is the number of values per image (16 5x5 feature maps) and **-1** tells PyTorch to infer the remaining dimension, here the batch size, from the total number of elements.
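To see where 16 * 5 * 5 comes from, we can trace the shape of a 32x32 input through the two convolution, ReLU, and pooling steps. This is a sketch with freshly initialized layers and an all-zero input (the variable names are ours); only the shapes matter here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)

x = torch.zeros(1, 1, 32, 32)          # (batch_size, channels, height, width)
x = F.max_pool2d(F.relu(conv1(x)), 2)  # 5x5 conv: 32 -> 28, then 2x2 pooling: 28 -> 14
shape_after_block1 = tuple(x.shape)
x = F.max_pool2d(F.relu(conv2(x)), 2)  # 5x5 conv: 14 -> 10, then 2x2 pooling: 10 -> 5
shape_after_block2 = tuple(x.shape)
flat = x.view(-1, 16 * 5 * 5)          # -1 lets PyTorch infer the batch dimension
print(shape_after_block1, shape_after_block2, tuple(flat.shape))
```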

Let's also define our convolutional and fully-connected layers without explicitly specifying parameter names. Remember, we don't need to specify our stride if it is equal to our default value of 1. We'll keep our pooling parameters explicit for now:

In [43]:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
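        x = self.pool(F.relu(self.conv1(x)))  # ReLU first, then 2x2 max pooling
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16*5*5)  # flatten the 16 5x5 feature maps for each image
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # no ReLU here: fc3 feeds the softmax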
        return x
    
net = LeNet()
print(net)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


You'll soon find out (if you haven't already) that there are many ways to define the same network. Instead of using **nn.MaxPool2d** and defining a pooling layer inside of our constructor (__init__), we can use **F.max_pool2d** and define pooling in our __forward__ method. The latter approach is used in 
https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#sphx-glr-beginner-blitz-neural-networks-tutorial-py and in https://github.com/kuangliu/pytorch-cifar/blob/master/models/lenet.py. Note that the first of these two examples can be understood as a modified LeNet with 3x3 rather than 5x5 convolutions. It also takes a single input channel rather than three. LeNet-5 has (likely) been used primarily on single-channel inputs, since handwritten digits tend to be monochrome (unless someone is getting creative with their cheques...)

Note also that pooling stride defaults to kernel_size. Thus we can define 2x2 pooling with a stride of 2 simply as **F.max_pool2d(x, 2)**. You can see this in the source code: https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/pooling.py
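A quick way to convince ourselves of the default is to compare the two spellings on a tiny input. The 4x4 tensor below is just an illustration of ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A 4x4 "image" holding 0..15, with batch and channel dimensions added
x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

explicit = nn.MaxPool2d(kernel_size=2, stride=2)(x)  # stride stated
implicit = F.max_pool2d(x, 2)                        # stride defaults to kernel_size
print(implicit)
print(torch.equal(explicit, implicit))
```

Each output value is the maximum of one non-overlapping 2x2 block, so the 4x4 input shrinks to 2x2.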

Here's how LeNet-5 looks when we swap in **F.max_pool2d** for __nn.MaxPool2d__:

In [44]:
class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 16*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)  # no ReLU here: fc3 feeds the softmax
        return x
    
net = LeNet()
print(net)

LeNet(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


Going forward we'll use **F.max_pool2d** for two-dimensional max pooling. I think it's a bit cleaner.

If we try to pass our toy image through LeNet, we'll get a dimensional mismatch. We simply can't apply multiple 5x5 kernels and 2x2 pooling operations to an input image that's only 5x5 to begin with... Instead, LeNet wants a 32x32 one-channel input. We can also edit the network to take three-channel, e.g. RGB (red-green-blue) color images, but let's stick with monochrome.

Let's generate a random 32x32 image. We can stick with one-bit pixels. It doesn't really matter. **torch.randint** will get the job done. One of the cool things about PyTorch is that many of its tensor functions mirror NumPy's (compare __numpy.random.randint__). We pass the arguments **(2, size=(32,32))** to specify that we want to sample integers from $0$ (inclusive) to $2$ (exclusive), i.e. $0$ or $1$, and output the results in a $32 \times 32$ array:

In [45]:
X = torch.randint(2, size=(32,32))
print(X.shape)
print(X)

torch.Size([32, 32])
tensor([[1, 0, 0,  ..., 0, 1, 0],
        [0, 0, 0,  ..., 0, 1, 0],
        [0, 0, 1,  ..., 1, 1, 0],
        ...,
        [0, 1, 1,  ..., 0, 1, 0],
        [0, 0, 1,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 0, 1]])


As we did with our 5x5 image, we need to reshape $X$ to include dimensions for **batch_size** and __channels__:

In [46]:
X = X.view(1,1,32,32)
print(X.shape)
print(X)

torch.Size([1, 1, 32, 32])
tensor([[[[1, 0, 0,  ..., 0, 1, 0],
          [0, 0, 0,  ..., 0, 1, 0],
          [0, 0, 1,  ..., 1, 1, 0],
          ...,
          [0, 1, 1,  ..., 0, 1, 0],
          [0, 0, 1,  ..., 1, 1, 1],
          [1, 0, 0,  ..., 1, 0, 1]]]])


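Finally, we need to cast $X$ to floating point. **torch.randint** returns a tensor of 64-bit integers, while the weights of our layers are 32-bit floats, so we convert the values of $X$ (its shape stays the same) with **X.float()**: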
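The cast only changes how the values are stored. A small sketch (the names **X_int** and **X_float** are ours) makes that visible:

```python
import torch

X_int = torch.randint(2, size=(1, 1, 32, 32))
X_float = X_int.float()

print(X_int.dtype, X_float.dtype)           # integer tensor vs. 32-bit float tensor
print(X_int.shape == X_float.shape)         # the shape is untouched
print(torch.equal(X_int, X_float.long()))   # so are the values
```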

In [47]:
X = X.float()

In [48]:
Y = net(X)
Y

tensor([[-0.0551, -0.0085, -0.0425, -0.0719, -0.1065, -0.0156,  0.0019, -0.0425,
          0.0327,  0.0532]], grad_fn=<AddmmBackward>)

We can then compute loss and gradient, update weights, etc:

In [49]:
optimizer = optim.SGD(net.parameters(), lr=0.1)
ce_loss = nn.CrossEntropyLoss()
loss = ce_loss(Y, y)
loss.backward()
optimizer.step()

Alright! Now let's think about the complexity of our networks. We started with a super-simple linear transformation (with bias) of our 25 input pixels into 10 output classes. So we learned 25x10 + 10 = 250+10 = 260 parameters.

Then we switched to a convolutional neural network. We learned 4 2x2 filters. By default in PyTorch, we also added one bias per convolutional filter. So our convolutions generated 4x2x2 + 4 = 16+4 = 20 parameters. With ReLU and max pooling, we're not actually learning any additional parameters, even though we obviously have to do some computation. So we'll exclude these operations from our parameter count. Finally, we got our linear layer, where we take our 16 inputs (4 2x2 feature maps) and learn 16x10 weights and 10 biases, or 16x10 + 10 = 170 additional parameters. One of the cool things is that we've actually *decreased* the number of parameters (20 + 170 = 190, down from 260) by adding a convolution before our linear layer. Another reason to use convolutions!
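We can check these counts by asking PyTorch directly. The helper **count_parameters** below is our own (it is not part of PyTorch), and the layer sizes assume the toy networks described above: a 25-to-10 linear layer on its own, and 4 2x2 filters followed by a 16-to-10 linear layer.

```python
import torch.nn as nn

def count_parameters(module):
    """Sum the number of elements across all weight and bias tensors."""
    return sum(p.numel() for p in module.parameters())

linear_only = count_parameters(nn.Linear(25, 10))              # 25x10 weights + 10 biases
conv_part = count_parameters(nn.Conv2d(1, 4, kernel_size=2))   # 4x2x2 weights + 4 biases
linear_part = count_parameters(nn.Linear(16, 10))              # 16x10 weights + 10 biases

print(linear_only, conv_part, linear_part, conv_part + linear_part)
```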

Now let's take a look at LeNet-5, layer-by-layer:

1) We learn 6 5x5 kernels (each spanning our single input channel) plus one bias per kernel in our first convolutional layer for 6x1x5x5 + 6 = 150 + 6 = 156 total parameters

2) We learn 16 kernels plus one bias per kernel in our second convolutional layer. Each kernel spans all 6 input channels, so it holds 6x5x5 weights, for 16x6x5x5 + 16 = 2,400 + 16 = 2,416 total parameters

3) We learn 400x120 + 120 = 48,000 + 120 = 48,120 parameters in our first fully-connected layer

4) We learn 120x84 + 84 = 10,080 + 84 = 10,164 parameters in our second fully-connected layer

5) We learn 84x10 + 10 = 840 + 10 = 850 parameters in our third fully-connected layer

In total, LeNet-5 learns 156 + 2,416 + 48,120 + 10,164 + 850 = 61,706 parameters. That's a lot more than our toy networks but we're just getting started...

Breaking down our parameters (percentages are rounded):

1) 95.49% are weights from our fully-connected layers

2) 4.13% are weights from our convolutional kernels

3) 0.35% are biases from our fully-connected layers

4) 0.04% are biases added to our convolutional weights
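One way to double-check a hand count like this is to let PyTorch tally the parameters of each layer. The sketch below rebuilds the five LeNet-5 layers in a plain dictionary (the names **layers** and **per_layer** are ours). Note that the weight tensor of the second convolution has shape (16, 6, 5, 5), since each of its 16 kernels spans all 6 input channels.

```python
import torch.nn as nn

layers = {
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
}

# Weights and biases together, layer by layer
per_layer = {name: sum(p.numel() for p in layer.parameters())
             for name, layer in layers.items()}
total = sum(per_layer.values())

for name, count in per_layer.items():
    print(f"{name}: {count:,} ({100 * count / total:.2f}%)")
print(f"total: {total:,}")
print("conv2 weight shape:", tuple(layers["conv2"].weight.shape))
```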

Judging from parameter count alone, this doesn't look like a convolutional neural network. But those are some powerful convolutions. They get the work done while learning far fewer parameters. We should also be aware that a convolutional layer costs far more computation per parameter than a fully-connected layer: each fully-connected weight is used once per image, while each kernel weight is reused at every position as we stride the kernel across our source image and the subsequent feature maps.

LeNet-5 gets the job done with handwritten digits, as long as we're cool with roughly 1% of the digits on our cheques getting misread (LeNet-5's reported test error on MNIST was just under 1%), which is most of us, no? But what if we're trying to do more complex stuff like build a cat classifier with RGB images? We're gonna need a bigger network.

In 2006, Fei-Fei Li of Princeton and UIUC (later Stanford) came up with the idea for ImageNet (http://image-net.org; see https://en.wikipedia.org/wiki/ImageNet#History_of_the_database for the history). Starting in 2010, the curators of ImageNet began hosting the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The most well-known version of ILSVRC, used since the 2012 challenge, has 1,281,167 training images, each labeled as one of 1000 classes. Each research team uses this training set to generate a model, which is then evaluated on a test set.

When ILSVRC debuted in 2010, the dominant image classification competition was the PASCAL VOC (Visual Object Classes) challenge. ILSVRC 2010 was actually held as a "sideshow" in conjunction with PASCAL VOC 2010. In 2011, Xerox Research Centre Europe (XRCE) won the competition using a combination of Fisher vectors, product quantization, and linear support vector machine classifiers (learned through stochastic gradient descent! - http://image-net.org/challenges/LSVRC/2011/ilsvrc11.pdf). Don't worry about what this stuff means. I'm not up on my Fisher vectors either.

What's important is that XRCE's submission wasn't a neural network. Even eight years ago (writing in 2019), neural networks were simply too costly to train given available compute resources. If you think about it, fully-connected layers are just about the dumbest way to learn features. You take all your inputs and outputs and learn a linear connection between the two of them. Let's say we're doing classification on a single 256x256x3 image. ImageNet images come in all sizes, so it's necessary to resize these images (to a common size) prior to feeding them into a neural network (or classification algorithm more generally). 256x256 is a common shape, and we'll retain our 3 color channels. That's 65,536 pixels per channel, *already more than the total parameters of LeNet-5*, or 65,536x3 = 196,608 total input nodes. Yikes. What if we want to learn a *single* linear transformation (with bias, but that's computationally cheap) between this input and our 1,000-class output? That's 196,608 x 1,000 + 1,000 = 196,609,000 total parameters. **WHOA**.
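The arithmetic above is easy to reproduce in plain Python. This sketch only multiplies the sizes out (it deliberately doesn't allocate the layer); the memory estimate assumes 32-bit floats, which is our assumption rather than something stated above.

```python
height, width, channels = 256, 256, 3
num_classes = 1000

pixels_per_channel = height * width
input_nodes = pixels_per_channel * channels
weights = input_nodes * num_classes   # one weight per (input node, class) pair
biases = num_classes                  # one bias per class
total = weights + biases

# Assuming 4 bytes per parameter (32-bit floats)
float32_gigabytes = total * 4 / 1e9

print(f"{pixels_per_channel:,} pixels per channel, {input_nodes:,} input nodes")
print(f"{total:,} parameters, roughly {float32_gigabytes:.2f} GB as 32-bit floats")
```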

Lessons learned:

1) We're gonna need a bigger network

2) We're gonna need a more computationally efficient algorithm

3) We're gonna need thousands of CPUs, or maybe two (2011-era) GPUs?

3a) If we go the GPU route, we're going to need to implement our network in CUDA or OpenCL

Fortunately, we don't need to worry about 3) or 3a). We're experimenting with a single image, and we'll just do a single forward and backward pass. We can do that on CPU. If we want to do this on GPU, PyTorch of course supports that.
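For reference, here's a minimal sketch of what "PyTorch supports that" looks like in practice. It picks a GPU if one is visible and falls back to CPU otherwise; the tiny linear layer is just a placeholder of ours for a real network.

```python
import torch
import torch.nn as nn

# Use the GPU when PyTorch can see one, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(25, 10).to(device)   # move the parameters to the device
x = torch.rand(1, 25, device=device)   # create the input on the same device
y = layer(x)

print(device, tuple(y.shape))
```

The one rule to remember is that a model and its inputs have to live on the same device.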

But going into 2012 it wasn't so clear that (convolutional) neural networks were the way to go. You'll notice that there's a bit of a temporal gap between LeNet-5 (1998) and AlexNet (2012 - which we'll get to in just a moment).

For the reasons we just examined, neural networks weren't the optimal solution to many (computer vision, artificial intelligence, machine learning) problems, given available data, compute, and algorithmic constraints. However:

1) ImageNet solved the issue of insufficient data, at least for 1000-class image classification
2) GPUs can do arithmetic (especially matrix multiplications) in parallel, a massive speedup over CPU arithmetic
3) NVIDIA developed software for general purpose computing (CUDA) using GPUs, enabling researchers to take advantage of the parallel architecture of GPUs for computing (not just graphics)

In the decades in which neural networks had fallen out of favor, Geoff Hinton and his researchers at the University of Toronto continued to develop novel algorithms. In 2012, his graduate student Alex Krizhevsky, in collaboration with Hinton and Ilya Sutskever, submitted a convolutional neural network (nicknamed "AlexNet") to the ILSVRC competition and *absolutely destroyed* the field. You can see the results here: http://www.image-net.org/challenges/LSVRC/2012/results.html. They called their team "SuperVision," ha.

It's been a while since we've done some code, so let's get the template ready for AlexNet. Or SuperVision. Why don't people call it that?

In [50]:
class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()

        # layers go here
        
    def forward(self, x):
        
        # pooling and activations go here
        
        return x
    
net = AlexNet()
print(net)

AlexNet()


Let's check out AlexNet's architecture. For several reasons, one might expect AlexNet to be more complex than LeNet-5 (61,706 parameters). First off, the *number of input nodes alone* (196,608) is greater than the number of total parameters of LeNet-5. Of course, we don't need to learn connections from each of the input nodes, but if we want to exploit the additional detail of the images, we'll need a "bigger" network (in terms of parameters).

If we're learning 1000 classes rather than 10, we'll also need to learn many more features to differentiate between these classes. To take a concrete example from ILSVRC 2012, think of the complexity of information necessary to distinguish an Angora rabbit from a wood rabbit (see https://gist.github.com/xkumiyu/dd200f3f51986888c9151df4f2a9ef30). Now think about what we need to differentiate a zero (one circle) from an eight (two vertically-stacked circles). We will most likely need to learn more (convolutional) feature( map)s.

AlexNet is available as part of PyTorch's **torchvision** package of "popular datasets, model architectures, and common image transformations for computer vision" (see https://pytorch.org/docs/stable/torchvision/models.html#classification and https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py). When building convolutional neural networks, it's common to import torchvision along with torch:

In [51]:
import torchvision

Recall that LeNet learned 6 5x5 kernels with stride of 1 in our first layer, then learned 16 with the same dimensions and stride in our second layer. If we do this on a 256x256x3 RGB image (rather than a 32x32 grayscale image), we're going to end up with fairly large feature maps. AlexNet's first convolutional layer manages to downsample the source image(s) a bit more, learning 11x11 kernels with stride of 4. It turns out that an 11x11 kernel with stride of 4 doesn't step evenly across the input image (256x256x3), so we also add **padding**: rows and columns of zeroes on each edge of the image. For our first layer, __padding=2__, which adds two rows of zeroes at the top and bottom and two columns of zeroes at the left and right. We have 3 input channels (red, green, and blue) and we learn 64 output feature maps from our 11x11 kernels:
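How big is the output of that first layer? For a square input of side n, PyTorch's **nn.Conv2d** produces a side of floor((n + 2 x padding - kernel_size) / stride) + 1 (ignoring dilation). The helper **conv_output_size** below is our own wrapper around that formula, checked against an actual layer:

```python
import torch
import torch.nn as nn

def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Side length of a convolution's output for a square input of side n."""
    return (n + 2 * padding - kernel_size) // stride + 1

lenet_conv1 = conv_output_size(32, kernel_size=5)                          # 28
alexnet_conv1 = conv_output_size(256, kernel_size=11, stride=4, padding=2)  # 63

# Compare with what the layer itself produces on a 256x256 RGB input
conv1 = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
actual = tuple(conv1(torch.zeros(1, 3, 256, 256)).shape)

print(lenet_conv1, alexnet_conv1, actual)
```

A single layer takes us from 256x256 down to 63x63, where LeNet's first convolution only trims 32x32 to 28x28.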

In [52]:
class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4, padding=2)
        
    def forward(self, x):
        
        # pooling and activations go here
        
        return x
    
net = AlexNet()
print(net)

AlexNet(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
)


We also apply max pooling over a 3x3 region with stride of 2. This leads to greater downsampling relative to LeNet, which pools over a 2x2 region with stride of 2. However, we only pool after our first, second, and fifth convolutional layers. AlexNet has five convolutional layers. We apply ReLU after each convolutional layer (before pooling).

What about our other convolutional layers?

We already saw that **conv1** takes 3 input channels and applies an 11x11 kernel with stride of 4, padding of 2, and 64 output channels

**conv2** takes 64 input channels (output of __conv1__) and applies a 5x5 kernel with stride of 1, padding of 2, and 192 output channels

**conv3** takes 192 input channels (output of __conv2__) and applies a 3x3 kernel with stride of 1, padding of 1, and 384 output channels

**conv4** takes 384 input channels (output of __conv3__) and applies a 3x3 kernel with stride of 1, padding of 1, and 256 output channels

**conv5** takes 256 input channels (output of __conv4__) and applies a 3x3 kernel with stride of 1, padding of 1, and 256 output channels
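Before wiring these into a class, it can help to watch the tensor shape change stage by stage. The sketch below lists the five convolutions described above, with the 3x3, stride-2 pooling after the first, second, and fifth. ReLU is left out because it doesn't change shapes, and the stage names (**pool1** and so on) are labels of ours.

```python
import torch
import torch.nn as nn

stages = [
    ("conv1", nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)),
    ("pool1", nn.MaxPool2d(kernel_size=3, stride=2)),
    ("conv2", nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2)),
    ("pool2", nn.MaxPool2d(kernel_size=3, stride=2)),
    ("conv3", nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1)),
    ("conv4", nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)),
    ("conv5", nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)),
    ("pool5", nn.MaxPool2d(kernel_size=3, stride=2)),
]

x = torch.zeros(1, 3, 256, 256)  # one all-zero 256x256 RGB "image"
shapes = {}
with torch.no_grad():            # we only care about shapes, not gradients
    for name, layer in stages:
        x = layer(x)
        shapes[name] = tuple(x.shape)
        print(name, shapes[name])
```

The 3x3 convolutions with padding of 1 keep width and height unchanged, so all of the downsampling comes from the first convolution and the three pooling steps.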

Let's add these layers to our network, along with our 3x3 pooling layers with stride of 2 and our ReLU activation functions. We can pass **in_channels** and __out_channels__ as our first two arguments without stating them explicitly:

In [53]:
class AlexNet(nn.Module):
    def __init__(self):
        super(AlexNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
        self.conv2 = nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2)
        self.conv3 = nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1)
        self.conv4 = nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1)
        self.conv5 = nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1)
    def forward(self, x):
        # ReLU after every convolution; 3x3 max pooling (stride 2) after conv1, conv2, and conv5
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        x = F.relu(self.conv3(x))
        x = F.relu(self.conv4(x))
        x = F.relu(self.conv5(x))
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        return x
    
net = AlexNet()
print(net)

AlexNet(
  (conv1): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
  (conv2): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv3): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv4): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (conv5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)


Let's introduce another PyTorch API. Just as we can define our layers with **torch.nn** and our pooling and activations with __torch.nn.functional__, we can also "stack" our layers (including pooling and activation functions) using **torch.nn.Sequential**. This allows us to bundle several layers into a single attribute of __self__ that we can then call in one step during **forward()**. We separate each "layer" (convolution, pooling, and activation) with a comma. Note also that we are using the module classes __nn.ReLU__ and **nn.MaxPool2d** rather than the functions __F.relu__ and **F.max_pool2d** (note the difference in casing), since **nn.Sequential** expects modules.
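Here's a small warm-up before the full network: one LeNet-style convolution, ReLU, and pooling step wrapped in **nn.Sequential**. The name **block** is ours.

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)

x = torch.zeros(1, 1, 32, 32)
y = block(x)             # runs the three modules in the order listed
first_layer = block[0]   # the stacked modules can be indexed by position

print(tuple(y.shape))
print(first_layer)
```

The positions used for indexing are the same numbers that show up in parentheses when we print a network built this way.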

__torchvision__ uses **nn.Sequential** to define AlexNet (https://github.com/pytorch/vision/blob/master/torchvision/models/alexnet.py):

In [54]:
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):  # length of our output vector = num_classes
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(  # we pass our input through these layers (features)
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), # no need to specify "self"
            nn.ReLU(inplace=True), # a neat trick to decrease memory usage [0]
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2), # include trailing comma
        )
         
    def forward(self, x):
        x = self.features(x)
        return x
    
net = AlexNet()
print(net)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
)


\[0\] See https://discuss.pytorch.org/t/whats-the-difference-between-nn-relu-and-nn-relu-inplace-true/948

The output of **print(net)** is slightly prettier, I think! Of course we still need to add our fully-connected layers. AlexNet has three of these. But first we do something called __Adaptive Average Pooling (AdaptiveAvgPool2d)__ in PyTorch. This applies average pooling to our input *given a specified output size*. Average pooling sums up over the pooling region and divides by the number of values in the region. Unlike max pooling, which keeps only the largest value, it lets every value in the region contribute to the output.

In other words, rather than applying a specified kernel size and stride to the input, it chooses its pooling regions based on the *target size* of the output (in this case 6x6). See https://discuss.pytorch.org/t/what-is-adaptiveavgpool2d/26897. We define a module **self.avgpool** in our constructor and pass our input through it in __forward__.
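A short sketch shows both halves of the idea: the output size is fixed no matter how large the input is, and each output value is a mean rather than a maximum. The input sizes (7x7 and 13x13) and the 2x2 **patch** are arbitrary choices of ours.

```python
import torch
import torch.nn as nn

avgpool = nn.AdaptiveAvgPool2d((6, 6))

# Different input sizes, same 6x6 output
from_7x7 = tuple(avgpool(torch.zeros(1, 256, 7, 7)).shape)
from_13x13 = tuple(avgpool(torch.zeros(1, 256, 13, 13)).shape)
print(from_7x7, from_13x13)

# Average vs. max over a single 2x2 region
patch = torch.tensor([[[[1.0, 2.0],
                        [3.0, 6.0]]]])
mean_value = nn.AdaptiveAvgPool2d((1, 1))(patch).item()  # (1 + 2 + 3 + 6) / 4
max_value = nn.AdaptiveMaxPool2d((1, 1))(patch).item()   # largest of the four values
print(mean_value, max_value)
```

This is what lets the same fully-connected layers sit on top of the convolutional stack even if we change the size of the input image.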

What about our fully-connected layers? The **torchvision** implementation of AlexNet defines these layers in a separate __classifier__ module. We apply **classifier** to the flattened output of the adaptive average pooling of the output of our convolutional, ReLU, and pooling operations. That's a mouthful, but hopefully it's starting to make sense, and will make more sense once we see it in code.

Recall that our final convolutional layer learns 256 kernels and therefore outputs 256 feature maps. We then apply adaptive average pooling so that each feature map is of dimensions 6x6. These flattened feature maps are thus the input to our first **nn.Linear** (fully-connected) layer:

The first **nn.Linear** takes a 256x6x6 = 9,216 input and generates an output of size 4096

The second **nn.Linear** takes an input of size 4096 and generates an output of equal size

Finally, the third **nn.Linear** takes an input of size 4096 and generates an output of size equal to the number of classes defined in __num_classes__. For ImageNet, **num_classes=1000**. This is the default value in __vision/torchvision/models/alexnet.py__
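As with LeNet-5, it's worth multiplying these sizes out to see how many parameters the three fully-connected layers contribute. This is plain arithmetic on the sizes listed above (weights plus one bias per output); the variable names are ours.

```python
num_classes = 1000

# (inputs, outputs) of the three fully-connected layers
fc_shapes = [(256 * 6 * 6, 4096), (4096, 4096), (4096, num_classes)]

# weights + biases for each layer
fc_params = [n_in * n_out + n_out for n_in, n_out in fc_shapes]
classifier_total = sum(fc_params)

for (n_in, n_out), count in zip(fc_shapes, fc_params):
    print(f"{n_in} -> {n_out}: {count:,} parameters")
print(f"classifier total: {classifier_total:,}")
```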

We also need to make sure that we layer some activation functions between our **nn.Linear**s. Remember it's necessary to apply activations between each pair of layers in order to learn nonlinear features.

One more thing! We add something called **dropout** to our __classifier__ of fully-connected layers. During training, dropout randomly sets some of the activations flowing into a layer to zero to prevent our network from **overfitting**. If we think of our data as sampled from a true distribution with error, overfitting occurs when we learn the error rather than the true distribution from the data. If we apply an "overfit" model to new data sampled from our true distribution, it will perform worse than it did on the training data. See https://en.wikipedia.org/wiki/Overfitting for more on overfitting, and http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf for the paper on dropout (Hinton, Krizhevsky, and Sutskever again! - along with two other collaborators from Toronto).

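We add dropout using **nn.Dropout()** before our first and second linear layers. By default, __nn.Dropout()__ zeroes each element of its input with probability 0.5 during training and scales the surviving elements by 2 (that is, 1/(1 - 0.5)) so that the expected value is unchanged. In evaluation mode (**net.eval()**) it passes its input through untouched.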
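To see dropout in action, we can pass a vector of ones through **nn.Dropout()** in training mode and again in evaluation mode. The input and the seed are arbitrary choices of ours.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout()   # p=0.5 by default
x = torch.ones(1, 1000)

dropout.train()          # training mode: elements are randomly zeroed
y_train = dropout(x)
dropout.eval()           # evaluation mode: dropout does nothing
y_eval = dropout(x)

zero_fraction = (y_train == 0).float().mean().item()
print(sorted(y_train.unique().tolist()))  # surviving ones are scaled by 1 / (1 - p)
print(f"fraction zeroed during training: {zero_fraction:.2f}")
print(torch.equal(y_eval, x))
```

This is why we call **net.eval()** before making predictions with a trained network that contains dropout.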

Let's take a look at our network with the **avgpool** and __classifier__ modules added:

In [55]:
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):  # length of our output vector = num_classes
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(  # we pass our input through these layers (features)
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), # no need to specify "self"
            nn.ReLU(inplace=True), # a neat trick to decrease memory usage [0]
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2), # include trailing comma
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )
         
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)  # flattens our 256 6x6 feature maps
        x = self.classifier(x)
        return x
    
net = AlexNet()
print(net)

AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace)
    (3): Dropout(p

Sweet! Let's generate a fake 256x256x3 image and feed it through AlexNet. We'll make the pixel values binary again. As an exercise, you could generate your own synthetic image with 8-bit pixels instead.

In [56]:
X = torch.randint(2, size=(3,256,256))
print(X.shape)
print(X)

torch.Size([3, 256, 256])
tensor([[[0, 1, 0,  ..., 0, 1, 1],
         [1, 0, 1,  ..., 0, 1, 0],
         [0, 0, 1,  ..., 1, 1, 1],
         ...,
         [0, 0, 1,  ..., 0, 0, 0],
         [1, 1, 0,  ..., 0, 1, 0],
         [1, 0, 0,  ..., 0, 1, 1]],

        [[1, 0, 0,  ..., 0, 0, 0],
         [0, 0, 0,  ..., 1, 1, 1],
         [1, 1, 0,  ..., 0, 1, 1],
         ...,
         [1, 0, 1,  ..., 0, 1, 1],
         [0, 1, 1,  ..., 1, 1, 0],
         [1, 0, 0,  ..., 0, 1, 0]],

        [[0, 1, 1,  ..., 1, 0, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 0,  ..., 1, 0, 1],
         ...,
         [0, 0, 1,  ..., 1, 1, 0],
         [0, 1, 0,  ..., 1, 1, 0],
         [0, 1, 0,  ..., 1, 1, 0]]])


Let's add a fourth dimension for **batch_size**...

In [57]:
X = X.view(1,3,256,256)
print(X.shape)
print(X)

torch.Size([1, 3, 256, 256])
tensor([[[[0, 1, 0,  ..., 0, 1, 1],
          [1, 0, 1,  ..., 0, 1, 0],
          [0, 0, 1,  ..., 1, 1, 1],
          ...,
          [0, 0, 1,  ..., 0, 0, 0],
          [1, 1, 0,  ..., 0, 1, 0],
          [1, 0, 0,  ..., 0, 1, 1]],

         [[1, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 1, 1, 1],
          [1, 1, 0,  ..., 0, 1, 1],
          ...,
          [1, 0, 1,  ..., 0, 1, 1],
          [0, 1, 1,  ..., 1, 1, 0],
          [1, 0, 0,  ..., 0, 1, 0]],

         [[0, 1, 1,  ..., 1, 0, 1],
          [1, 1, 1,  ..., 1, 1, 1],
          [1, 1, 0,  ..., 1, 0, 1],
          ...,
          [0, 0, 1,  ..., 1, 1, 0],
          [0, 1, 0,  ..., 1, 1, 0],
          [0, 1, 0,  ..., 1, 1, 0]]]])


... and cast our synthetic "image" as a float:

In [58]:
X = X.float()

Since we've already set **net** equal to an instance of __AlexNet__, we can pass our image **X** through __net__.

We'll call our output **Y** again:

In [59]:
Y = net(X)
Y

tensor([[-2.6008e-03, -1.1606e-02, -3.6402e-03, -1.2425e-02,  4.6868e-03,
         -2.0302e-02,  1.9750e-02,  1.6332e-03,  2.1267e-02,  7.0487e-03,
         -3.8987e-03,  3.3410e-03, -8.5150e-03,  1.8240e-02,  1.1947e-02,
          7.6247e-03, -9.0806e-04,  1.4194e-03,  5.7656e-03,  3.9215e-03,
         -1.1260e-02,  7.1236e-03, -5.4486e-03,  1.1531e-02, -3.0117e-03,
         -7.3532e-04, -6.9984e-03, -1.0789e-02, -1.7334e-02,  1.6429e-02,
          1.5632e-02,  5.6069e-03,  3.9309e-03,  1.2510e-03,  5.0644e-03,
          8.0094e-03, -1.5284e-02,  9.1914e-04, -1.2278e-03, -9.5449e-03,
          1.0507e-02,  6.3436e-03, -8.1913e-03, -9.7031e-03, -1.1655e-02,
          2.4470e-02,  3.1370e-03, -8.1217e-03,  8.9174e-03, -1.6260e-02,
         -5.5201e-03, -6.8232e-03, -6.6632e-04, -5.6171e-03, -7.9334e-04,
          9.3163e-03, -3.5516e-03,  2.2007e-02, -2.3209e-02, -4.7882e-03,
         -1.3573e-02,  4.2442e-03,  6.9117e-03,  6.2985e-03, -1.0943e-02,
          1.1973e-02, -1.3436e-02,  7.

We can use **torch.argmax** to identify the index with the maximum value. You can check out the corresponding ImageNet class here! https://gist.github.com/yrevar/942d3a0ac09ec9e5eb3a

In [60]:
torch.argmax(Y)

tensor(824)
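The raw outputs are scores rather than probabilities. If we want probabilities, or more than just the single best class, we can apply softmax and **torch.topk**. The sketch below uses random scores as a stand-in for **Y** (the name **logits** and the seed are ours), so the classes it picks are meaningless.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 1000)             # stand-in for Y = net(X)

probabilities = F.softmax(logits, dim=1)  # each row now sums to 1
top_probs, top_classes = torch.topk(probabilities, k=5, dim=1)

print(top_classes)
print(top_probs)
# Softmax preserves the ordering, so the top class matches argmax on the raw scores
print(top_classes[0, 0].item() == torch.argmax(logits).item())
```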

The next architecture we'll examine is **VGG**. VGG is just the abbreviation for the Oxford Visual Geometry Group. Here's an article about VGG from VGG: http://www.robots.ox.ac.uk/~vgg/research/very_deep/

VGG is one of the most straightforward convolutional neural networks. The torchvision code might look intimidating at first, since it's designed for multiple configurations of VGG: https://github.com/pytorch/vision/blob/master/torchvision/models/vgg.py

As we'll see with ResNet, a neural network can come in multiple flavors. In the case of VGG, it's possible to have 11, 13, 16, or 19 layers, each with or without **batch normalization** (we won't worry about what that is - you can check out the paper (https://arxiv.org/pdf/1502.03167.pdf) if you're curious, or just google "batch norm(alization)"). We define which VGG we want to use by setting __features__.

Let's start with our generic VGG template. Once we've defined this, we'll examine specific "flavors" of VGG (e.g. VGG16).

In [61]:
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000): # omitting weight initialization argument
        super(VGG, self).__init__()
        self.features = features  # used for defining specific type of VGG, e.g. VGG19
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7)) # average pooling with (7, 7) output
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),  # 512 kernels from last conv; (7, 7) pool output
            nn.ReLU(True), # inplace=True
            nn.Dropout(), # 0.5 by default
            nn.Linear(4096, 4096), # 4096 output from last nn.Linear
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
         
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)  # flattens the output of average pooling
        x = self.classifier(x)
        return x

Specific VGG networks are then defined using the **make_layers** function. It lives outside the VGG class and builds the __features__ that we pass in:

In [62]:
def make_layers(cfg):  # omitting batch normalization argument here and below
    layers = []  # defining empty list
    in_channels = 3  # for three-channel (e.g. RGB) images
    for v in cfg:
        if v == 'M':  # add pooling layers for each cfg = 'M' (see 'cfgs' below)
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:  # add convolutional layers for each cfg != 'M', where out_channels=v
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v  # the next convolution takes this layer's output channels as input
    return nn.Sequential(*layers)  # return nn.Sequential with added layers

cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # used for vgg11
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # used for vgg13
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],  # used for vgg16
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],  # used for vgg19
}

We then have a separate **\_vgg** function for building the model:

In [63]:
def _vgg(arch, cfg, **kwargs):  # omitting batch_norm, pretrained, and progress arguments
    model = VGG(make_layers(cfgs[cfg]), **kwargs)  # builds features for cfg; kwargs (e.g. num_classes) go to VGG
    return model

Now let's examine our function for building VGG16:

In [64]:
def vgg16(**kwargs):  # again, omitting batch_norm, pretrained, and progress arguments
    return _vgg('vgg16', 'D', **kwargs)

Now let's put it all together:

In [65]:
class VGG(nn.Module):
    def __init__(self, features, num_classes=1000): # omitting weight initialization argument
        super(VGG, self).__init__()
        self.features = features  # used for defining specific type of VGG, e.g. VGG19
        self.avgpool = nn.AdaptiveAvgPool2d((7, 7)) # average pooling with (7, 7) output
        self.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),  # 512 kernels from last conv; (7, 7) pool output
            nn.ReLU(True), # inplace=True
            nn.Dropout(), # 0.5 by default
            nn.Linear(4096, 4096), # 4096 output from last nn.Linear
            nn.ReLU(True),
            nn.Dropout(),
            nn.Linear(4096, num_classes),
        )
         
            
    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), -1)  # flattens the output of average pooling
        x = self.classifier(x)
        return x
    
    
def make_layers(cfg):  # omitting batch normalization argument here and below
    layers = []  # defining empty list
    in_channels = 3  # for three-channel (e.g. RGB) images
    for v in cfg:
        if v == 'M':  # add pooling layers for each cfg = 'M' (see 'cfgs' below)
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:  # add convolutional layers for each cfg != 'M', where out_channels=v
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            layers += [conv2d, nn.ReLU(inplace=True)]
            in_channels = v  # the next convolution takes this layer's output channels as input
    return nn.Sequential(*layers)  # return nn.Sequential with added layers


cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # used for vgg11
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],  # used for vgg13
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],  # used for vgg16
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],  # used for vgg19
}


def _vgg(arch, cfg, **kwargs):  # omitting batch_norm, pretrained, and progress arguments
    model = VGG(make_layers(cfgs[cfg]), **kwargs)  # builds features for cfg; kwargs go to VGG
    return model


def vgg16(**kwargs):  # again, omitting batch_norm, pretrained, and progress arguments
    return _vgg('vgg16', 'D', **kwargs)

We can now define our **net** as a specific version of VGG (e.g. VGG16). As the name suggests, there are 16 weight layers in total: 13 convolutions and 3 (fully-connected) matrix multiplications, separated by ReLU activations:

In [66]:
net = vgg16()  # call the builder so that net is a network, not a function
net

VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, k

Good stuff. You can also experiment with other VGG architectures and see how they differ.
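
One quick way to compare the flavors is to count their weight layers. The sketch below copies the **cfgs** dictionary from above and assumes, as in our VGG class, that every configuration ends in three fully-connected layers. The helper **count_weight_layers** and the **names** mapping are made up for this illustration:

```python
cfgs = {
    'A': [64, 'M', 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'B': [64, 64, 'M', 128, 128, 'M', 256, 256, 'M', 512, 512, 'M', 512, 512, 'M'],
    'D': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M'],
    'E': [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 256, 'M', 512, 512, 512, 512, 'M', 512, 512, 512, 512, 'M'],
}
names = {'A': 'vgg11', 'B': 'vgg13', 'D': 'vgg16', 'E': 'vgg19'}

NUM_FC_LAYERS = 3  # the three nn.Linear layers in VGG.classifier


def count_weight_layers(cfg):
    """Count convolutions (every entry that is not 'M') plus the linear layers."""
    num_conv = sum(1 for v in cfg if v != 'M')
    return num_conv + NUM_FC_LAYERS


depths = {names[key]: count_weight_layers(cfg) for key, cfg in cfgs.items()}
print(depths)  # {'vgg11': 11, 'vgg13': 13, 'vgg16': 16, 'vgg19': 19}
```

Pooling, ReLU, and dropout layers have no weights, which is why they don't count toward the number in the name.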

You might be curious about **dilation** and __ceil_mode__. The "Introduction to Neural Networks" tutorial covers this:

**dilation** refers to spacing between values in a convolutional or pooling kernel. For instance, if we applied our 2x2 kernel with a dilation rate of 2, we would end up applying our operation over a 3x3 field but only using the values in every other row and column (the four corners of the field).

A visualization DEFINITELY helps to explain dilation (and other convolutional operations!) https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d
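
If you'd rather see numbers than pictures, here is a small sketch of the 2x2, dilation-2 case described above. The 5x5 input and the all-ones kernel are choices made for this illustration so the arithmetic is easy to follow by hand:

```python
import torch
import torch.nn as nn

# A 5x5 single-channel "image" holding the values 0..24 row by row
x = torch.arange(25.0).view(1, 1, 5, 5)

# 2x2 kernel with dilation 2: it spans a 3x3 field but touches only its corners
conv = nn.Conv2d(1, 1, kernel_size=2, dilation=2, bias=False)

with torch.no_grad():
    conv.weight.fill_(1.0)  # all-ones kernel, so each output is a plain sum
    y = conv(x)

print(y.shape)       # torch.Size([1, 1, 3, 3])
print(y[0, 0, 0, 0]) # tensor(24.), i.e. x[0,0] + x[0,2] + x[2,0] + x[2,2] = 0 + 2 + 10 + 12
```

The output is 3x3 because the dilated kernel covers 3 rows and 3 columns, and a 3-wide window fits into a 5-wide input in three positions.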

What about **ceil_mode**? Remember when we computed the dimensions of the output of our convolutional layer? We can do the same for the output of our pooling layer. We don't need to go over the details of the operation (see https://pytorch.org/docs/stable/\_modules/torch/nn/modules/pooling.html if you're curious). 

The point is that our dimensions must be (positive) integers, and thus if the operation produces a non-integer output, we'll either need to round down (floor) or up (ceiling). When ceil_mode=False (by default), PyTorch uses the floor of the operation rather than the ceiling to compute the dimensions of the pooling output.
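
Here is a minimal sketch of the difference, using a 5x5 input and a 2x2 max pool with stride 2 (sizes picked for this illustration because they produce a non-integer result):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 5, 5)

floor_pool = nn.MaxPool2d(kernel_size=2, stride=2)                 # ceil_mode=False (default)
ceil_pool = nn.MaxPool2d(kernel_size=2, stride=2, ceil_mode=True)

# Unrounded output size along one dimension: (input - kernel) / stride + 1
raw_size = (5 - 2) / 2 + 1
print(raw_size)             # 2.5

print(floor_pool(x).shape)  # torch.Size([1, 1, 2, 2])
print(ceil_pool(x).shape)   # torch.Size([1, 1, 3, 3])
```

With the floor, the last row and column of the input are simply dropped. With the ceiling, they get a (partial) pooling window of their own.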

Let's keep moving along, since we still have two more architectures to cover. For more details on forward and backward propagation, consult our "Introduction to Neural Networks" tutorial. We'll focus on architectures from here on.

Next up is Inception v3! You can use this for reference: https://github.com/pytorch/vision/blob/master/torchvision/models/inception.py

In [67]:
class Inception3(nn.Module):
    def __init__(self):
        super(Inception3, self).__init__()

        # layers go here
        
    def forward(self, x):
        
        # defines how layers are applied to inputs
        
        return x
    
net = Inception3()
print(net)

Inception3()


The first set of layers we'll add uses **BasicConv2d**. This is defined as a separate class, which inherits from __nn.Module__!

In [68]:
class BasicConv2d(nn.Module):
    
    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)
        
    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)

Like the networks we've seen so far - LeNet, AlexNet, and VGG - **BasicConv2d** has two methods: __\_\_init\_\___ and **forward**. __\_\_init\_\___ defines the architecture, in this case a single two-dimensional convolution with a specified number of input and output channels and no bias, plus a **batch normalization** layer. We encountered batch normalization in our examination of VGG, though we didn't end up constructing a network with "batch norm." Batch normalization takes a layer's inputs, subtracts the mean input over the batch from each input, and divides by the square root of the batch variance plus a small epsilon value. Without getting into the details of "batch norm" (the theory is still kinda murky), just know that it rescales our inputs to enable our network to learn faster.
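
To make the normalization step concrete, here is a minimal sketch that reproduces what **nn.BatchNorm2d** does in training mode by hand. The random input and its shape are arbitrary choices for this illustration. The layer is freshly created, so its learnable scale is 1 and its shift is 0:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4)  # batch of 8, 3 channels, 4x4 feature maps

bn = nn.BatchNorm2d(3, eps=0.001)
bn.train()  # training mode: normalize with the statistics of this batch

# One mean and one variance per channel, taken over batch, height, and width
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (x - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(bn(x), manual, atol=1e-5))  # True
```

Note that the statistics are computed per channel, which is why the layer only needs to know the number of channels.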

Now let's return to our **Inception3** class. We first use our __BasicConv2d__ class to define five convolutional layers:

**Conv2d_1a_3x3** takes our three color channels as input and applies 32 3x3 convolutions with stride of 2. Our __BasicConv2d__ class also applies batch normalization. Its parameters appear in the printed output once you execute the subsequent block of code:

In [69]:
class Inception3(nn.Module):
    def __init__(self, num_classes=1000):  # 1000 classes for ImageNet LSVRC (by default)
        super(Inception3, self).__init__()  # omitting aux_logits and transform_input
        
        self.Conv2d_1a_3x3 = BasicConv2d(3, 32, kernel_size=3, stride=2)  # 3x3 convolution
        self.Conv2d_2a_3x3 = BasicConv2d(32, 32, kernel_size=3)  # (stride=1)
        self.Conv2d_2b_3x3 = BasicConv2d(32, 64, kernel_size=3, padding=1)  # with padding
        self.Conv2d_3b_1x1 = BasicConv2d(64, 80, kernel_size=1)  # 1x1 convolution
        self.Conv2d_4a_3x3 = BasicConv2d(80, 192, kernel_size=3)
        
    def forward(self, x):
        
        # defines how layers are applied to inputs
        
        return x
    
net = Inception3()
print(net)

Inception3(
  (Conv2d_1a_3x3): BasicConv2d(
    (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), bias=False)
    (bn): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_2a_3x3): BasicConv2d(
    (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_2b_3x3): BasicConv2d(
    (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_3b_1x1): BasicConv2d(
    (conv): Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(80, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_4a_3x3): BasicConv2d(
    (conv): Conv2d(80, 192, kernel_size=(3, 3), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(192, eps=0.001, momentum=0.1, affine=True, t

Returning to our batch normalization parameters in our (bn) layers:

1) **num_features**=__out_channels__ from **BasicConv2d**. For instance, in **self.Conv2d_1a_3x3**, the number of output channels is 32. We want to apply batch normalization to these features before passing the data forward to the next layer.

2) **eps** is the small epsilon value that we add to our variance (before taking the square root and dividing the difference between each input and the batch mean by the result).

3) We can also set **momentum**. With momentum, we keep a "running mean" and "running variance" that we update each time we pass a batch through the layer in training mode. We multiply our momentum term by the batch mean, multiply one minus the momentum term by our running mean, and add the two to get the new running mean. We then do the same for our variance.

4) If **affine=True**, we pass our normalized inputs through a learnable linear transformation. This corresponds to __scale and shift__ in Ioffe and Szegedy 2015 (https://arxiv.org/pdf/1502.03167.pdf).

5) **track_running_stats** needs to be set to __True__ in order to record our running mean and variance (e.g. for updating these terms with momentum)
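
The running-statistics update is easy to check by hand. The sketch below assumes a freshly created layer, whose running mean starts at 0 and running variance at 1, and passes a single arbitrary batch through it. One PyTorch detail worth flagging: the running variance is updated with the *unbiased* batch variance.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 4, 4) * 2 + 5  # arbitrary batch with mean near 5

momentum = 0.1
bn = nn.BatchNorm2d(3, momentum=momentum)  # track_running_stats=True by default
bn.train()
bn(x)  # one forward pass in training mode updates the running statistics

batch_mean = x.mean(dim=(0, 2, 3))
batch_var = x.var(dim=(0, 2, 3), unbiased=True)

# new running value = (1 - momentum) * old running value + momentum * batch value
expected_mean = (1 - momentum) * torch.zeros(3) + momentum * batch_mean
expected_var = (1 - momentum) * torch.ones(3) + momentum * batch_var

print(torch.allclose(bn.running_mean, expected_mean, atol=1e-4))  # True
print(torch.allclose(bn.running_var, expected_var, atol=1e-4))    # True
```

In evaluation mode (**bn.eval()**), the layer normalizes with these running values instead of the statistics of the current batch.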

**torch.nn.modules.batchnorm** code is available at https://pytorch.org/docs/stable/\_modules/torch/nn/modules/batchnorm.html

We now define a set of six **inception modules**. Each of these is a series of convolutional layers with batch normalization. We thus use our __BasicConv2d__ class to define these layers. There's a lot of code here, but it should look familiar by now.

Let's start with **InceptionA**, which consists of seven convolutional layers. Again, we'll use our standard template, inheriting our class from __nn.Module__ and defining constructor and forward methods:

In [70]:
class InceptionA(nn.Module):
    
    def __init__ (self, in_channels, pool_features):  # pool_features=out_channels for branch_pool
        super(InceptionA, self).__init__()
        
        # convolutional layers go here
        
    def forward(self, x):
        
        # convolution and pooling applied to input here
        
        return torch.cat(outputs, 1)

**torch.cat** will concatenate our outputs (to be defined in __forward__). **1** is short for __dim=1__, which means to concatenate the outputs along our second dimension (the channels). To concatenate along the first dimension, use __dim=0__ (since PyTorch, like all things Python, is zero-indexed).
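
Here is a small sketch of both cases with made-up tensor shapes. Every dimension other than the one we concatenate along has to match:

```python
import torch

a = torch.zeros(2, 3, 4, 4)  # batch of 2, 3 channels, 4x4
b = torch.ones(2, 5, 4, 4)   # same batch and spatial size, 5 channels
c = torch.ones(1, 3, 4, 4)   # a single extra example with 3 channels

channels = torch.cat([a, b], 1)    # stack along channels: 3 + 5 = 8
batch = torch.cat([a, c], dim=0)   # stack along the batch: 2 + 1 = 3

print(channels.shape)  # torch.Size([2, 8, 4, 4])
print(batch.shape)     # torch.Size([3, 3, 4, 4])
```

For feature maps shaped N x C x H x W, __dim=1__ is the channel dimension, so the concatenated output simply has more channels.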

Now let's add our convolutional layers. Again, these are all defined with batch normalization using **BasicConv2d**:

As the name implies, **branch1x1** is a 1x1 convolution applied to our __in_channels__. We learn 64 convolutional kernels.

**branch5x5_1** is a 1x1 convolution, also applied to our __in_channels__ (not the output of **branch1x1**! Hmm, what's going on here? And why is it called __branch5x5_1__, not **branch1x1_1** or something?). We learn 48 of these kernels.

**branch5x5_2** is a 5x5 convolution, applied to 48 input channels (presumably the output of __branch5x5_1__?) with zero padding of depth 2. We learn 64 of these.

**branch3x3dbl_1** learns 1x1 kernels applied to __in_channels__ with 64 outputs.

**branch3x3dbl_2** learns 96 3x3 kernels from 64 inputs with __padding=1__.

**branch3x3dbl_3** learns 96 3x3 kernels from 96 inputs with __padding=1__.

**branch_pool** learns 1x1 kernels applied to __in_channels__ with output equal to **pool_features**.

All of these convolutions use a stride of 1 by default. Let's add them to our network:

In [71]:
class InceptionA(nn.Module):
    
    def __init__ (self, in_channels, pool_features):  # pool_features=out_channels for branch_pool
        super(InceptionA, self).__init__()
        
        self.branch1x1 = BasicConv2d(in_channels, 64, kernel_size=1)
        
        self.branch5x5_1 = BasicConv2d(in_channels, 48, kernel_size=1)
        self.branch5x5_2 = BasicConv2d(48, 64, kernel_size=5, padding=2)
        
        self.branch3x3dbl_1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(64, 96, kernel_size=3, padding=1)
        self.branch3x3dbl_3 = BasicConv2d(96, 96, kernel_size=3, padding=1)
        
        self.branch_pool = BasicConv2d(in_channels, pool_features, kernel_size=1)
        
    def forward(self, x):
        
        # convolution and pooling applied to input here
        
        return torch.cat(outputs, 1)

Now we need to pass our **x** through these layers. We also define a __branch_pool__, which performs 3x3 average pooling over our input with stride and padding of 1. We then apply the *convolution* defined in **self.branch_pool** to our pooled output!

What's different about this "inception module" in contrast to the other networks we've seen is that we're not passing __x__ through all of the layers in **init** sequentially. Instead, we're passing the data through *each* of the branches (1x1, 5x5, 3x3dbl, and pool), and then concatenating the outputs. Weird, huh?
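
The idea is easier to see in a stripped-down module. The class **TwoBranch** below is a toy written for this illustration (it is not part of torchvision, and its channel counts are arbitrary). It sends the same input through two parallel branches and concatenates the results along the channel dimension:

```python
import torch
import torch.nn as nn


class TwoBranch(nn.Module):
    """Toy module: two parallel convolutions applied to the same input."""

    def __init__(self, in_channels):
        super(TwoBranch, self).__init__()
        self.branch1x1 = nn.Conv2d(in_channels, 4, kernel_size=1)
        self.branch3x3 = nn.Conv2d(in_channels, 6, kernel_size=3, padding=1)

    def forward(self, x):
        out1x1 = self.branch1x1(x)  # both branches read x itself,
        out3x3 = self.branch3x3(x)  # not each other's output
        return torch.cat([out1x1, out3x3], 1)


x = torch.randn(1, 3, 8, 8)
y = TwoBranch(3)(x)
print(y.shape)  # torch.Size([1, 10, 8, 8])
```

The padding on the 3x3 branch keeps its output at 8x8, so the two outputs share a spatial size and can be concatenated. The module's output then has 4 + 6 = 10 channels.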

Let's see how the module looks with our **forward** method:

In [72]:
class InceptionA(nn.Module):
    
    def __init__ (self, in_channels, pool_features):  # pool_features=out_channels for branch_pool
        super(InceptionA, self).__init__()
        
        self.branch1x1 = BasicConv2d(in_channels, 64, kernel_size=1)
        
        self.branch5x5_1 = BasicConv2d(in_channels, 48, kernel_size=1)
        self.branch5x5_2 = BasicConv2d(48, 64, kernel_size=5, padding=2)
        
        self.branch3x3dbl_1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(64, 96, kernel_size=3, padding=1)
        self.branch3x3dbl_3 = BasicConv2d(96, 96, kernel_size=3, padding=1)
        
        self.branch_pool = BasicConv2d(in_channels, pool_features, kernel_size=1)
        
    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch5x5 = self.branch5x5_1(x)
        branch5x5 = self.branch5x5_2(branch5x5)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)
        
        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)
        
        outputs = [branch1x1, branch5x5, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)

We've got five more of these. In **InceptionB**, we pass the input through a 3x3 kernel with stride of 2, and, in a separate branch, pass the same input sequentially through three kernels:

1) 1x1 convolution with stride of 1 and 64 outputs
2) 3x3 convolution with stride of 1, padding of 1, and 96 outputs
3) 3x3 convolution with stride of 2 and 96 outputs

We'll also do some max pooling on our input, but this time we won't pass that output through a convolutional layer. As with **InceptionA**, we return our concatenated outputs. We still need to figure out how to combine these modules!
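
Since the pooled input is concatenated as-is, it has to end up the same height and width as the strided convolution outputs. Here is a rough check using a plain **nn.Conv2d** in place of __BasicConv2d__ (batch normalization and ReLU don't change shapes). The 288 x 35 x 35 input is the size that Inception v3 feeds to this module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 288, 35, 35)

conv_stride2 = nn.Conv2d(288, 384, kernel_size=3, stride=2)

with torch.no_grad():
    conved = conv_stride2(x)                         # stands in for branch3x3
    pooled = F.max_pool2d(x, kernel_size=3, stride=2)  # branch_pool

print(conved.shape)  # torch.Size([1, 384, 17, 17])
print(pooled.shape)  # torch.Size([1, 288, 17, 17])

# Channels after concatenation: branch3x3 + branch3x3dbl + pooled input
total_channels = 384 + 96 + 288
print(total_channels)  # 768
```

Both operations use a 3x3 window with stride 2 and no padding, so both map 35 to floor((35 - 3) / 2) + 1 = 17.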

Let's take a look at **InceptionB**:

In [73]:
class InceptionB(nn.Module):

    def __init__(self, in_channels):
        super(InceptionB, self).__init__()
        self.branch3x3 = BasicConv2d(in_channels, 384, kernel_size=3, stride=2)

        self.branch3x3dbl_1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(64, 96, kernel_size=3, padding=1)
        self.branch3x3dbl_3 = BasicConv2d(96, 96, kernel_size=3, stride=2)

    def forward(self, x):
        branch3x3 = self.branch3x3(x)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)

        branch_pool = F.max_pool2d(x, kernel_size=3, stride=2)

        outputs = [branch3x3, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)

Now it's about to get weird. **InceptionC** takes an additional parameter, __channels_7x7__, and uses this parameter to define the number of output channels for the first layer in two of our sets of convolutional layers. Ok, that's not so weird, but check out the kernel sizes!

In [74]:
class InceptionC(nn.Module):

    def __init__(self, in_channels, channels_7x7):
        super(InceptionC, self).__init__()
        self.branch1x1 = BasicConv2d(in_channels, 192, kernel_size=1)

        c7 = channels_7x7
        self.branch7x7_1 = BasicConv2d(in_channels, c7, kernel_size=1)
        self.branch7x7_2 = BasicConv2d(c7, c7, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7_3 = BasicConv2d(c7, 192, kernel_size=(7, 1), padding=(3, 0))

        self.branch7x7dbl_1 = BasicConv2d(in_channels, c7, kernel_size=1)
        self.branch7x7dbl_2 = BasicConv2d(c7, c7, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7dbl_3 = BasicConv2d(c7, c7, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7dbl_4 = BasicConv2d(c7, c7, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7dbl_5 = BasicConv2d(c7, 192, kernel_size=(1, 7), padding=(0, 3))

        self.branch_pool = BasicConv2d(in_channels, 192, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch7x7 = self.branch7x7_1(x)
        branch7x7 = self.branch7x7_2(branch7x7)
        branch7x7 = self.branch7x7_3(branch7x7)

        branch7x7dbl = self.branch7x7dbl_1(x)
        branch7x7dbl = self.branch7x7dbl_2(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_3(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_4(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_5(branch7x7dbl)

        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)

        outputs = [branch1x1, branch7x7, branch7x7dbl, branch_pool]
        return torch.cat(outputs, 1)

So far we've been dealing with square kernels. Now we have alternating (7, 1) and (1, 7) rectangular kernels, with asymmetric padding!?
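
It's less strange than it looks. The padding is applied only along the dimension the kernel actually spans, so the feature map keeps its size. A (1, 7) kernel followed by a (7, 1) kernel covers a 7x7 field with far fewer weights than a full 7x7 kernel. The sketch below compares the two, with a channel count of 16 chosen arbitrarily for this illustration:

```python
import torch
import torch.nn as nn

c = 16  # hypothetical channel count, for illustration only

row = nn.Conv2d(c, c, kernel_size=(1, 7), padding=(0, 3), bias=False)  # spans 7 columns
col = nn.Conv2d(c, c, kernel_size=(7, 1), padding=(3, 0), bias=False)  # spans 7 rows
full = nn.Conv2d(c, c, kernel_size=7, padding=3, bias=False)           # full 7x7 kernel

x = torch.randn(1, c, 17, 17)
with torch.no_grad():
    y = col(row(x))

factorized_params = row.weight.numel() + col.weight.numel()
full_params = full.weight.numel()

print(y.shape)            # torch.Size([1, 16, 17, 17])
print(factorized_params)  # 3584  (2 * 16 * 16 * 7)
print(full_params)        # 12544 (16 * 16 * 49)
```

The two stacked kernels are not equivalent to an arbitrary 7x7 kernel, but they see the same 7x7 neighborhood at a fraction of the cost.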

Let's keep rolling with **InceptionD**...

In [75]:
class InceptionD(nn.Module):

    def __init__(self, in_channels):
        super(InceptionD, self).__init__()
        self.branch3x3_1 = BasicConv2d(in_channels, 192, kernel_size=1)
        self.branch3x3_2 = BasicConv2d(192, 320, kernel_size=3, stride=2)

        self.branch7x7x3_1 = BasicConv2d(in_channels, 192, kernel_size=1)
        self.branch7x7x3_2 = BasicConv2d(192, 192, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7x3_3 = BasicConv2d(192, 192, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7x3_4 = BasicConv2d(192, 192, kernel_size=3, stride=2)

    def forward(self, x):
        branch3x3 = self.branch3x3_1(x)
        branch3x3 = self.branch3x3_2(branch3x3)

        branch7x7x3 = self.branch7x7x3_1(x)
        branch7x7x3 = self.branch7x7x3_2(branch7x7x3)
        branch7x7x3 = self.branch7x7x3_3(branch7x7x3)
        branch7x7x3 = self.branch7x7x3_4(branch7x7x3)

        branch_pool = F.max_pool2d(x, kernel_size=3, stride=2)
        outputs = [branch3x3, branch7x7x3, branch_pool]
        return torch.cat(outputs, 1)

... and **InceptionE**...

In [76]:
class InceptionE(nn.Module):

    def __init__(self, in_channels):
        super(InceptionE, self).__init__()
        self.branch1x1 = BasicConv2d(in_channels, 320, kernel_size=1)

        self.branch3x3_1 = BasicConv2d(in_channels, 384, kernel_size=1)
        self.branch3x3_2a = BasicConv2d(384, 384, kernel_size=(1, 3), padding=(0, 1))
        self.branch3x3_2b = BasicConv2d(384, 384, kernel_size=(3, 1), padding=(1, 0))

        self.branch3x3dbl_1 = BasicConv2d(in_channels, 448, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(448, 384, kernel_size=3, padding=1)
        self.branch3x3dbl_3a = BasicConv2d(384, 384, kernel_size=(1, 3), padding=(0, 1))
        self.branch3x3dbl_3b = BasicConv2d(384, 384, kernel_size=(3, 1), padding=(1, 0))

        self.branch_pool = BasicConv2d(in_channels, 192, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch3x3 = self.branch3x3_1(x)
        branch3x3 = [
            self.branch3x3_2a(branch3x3),
            self.branch3x3_2b(branch3x3),
        ]
        branch3x3 = torch.cat(branch3x3, 1)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = [
            self.branch3x3dbl_3a(branch3x3dbl),
            self.branch3x3dbl_3b(branch3x3dbl),
        ]
        branch3x3dbl = torch.cat(branch3x3dbl, 1)

        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)

        outputs = [branch1x1, branch3x3, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)

We'll omit **InceptionAux**, since we only use this when __aux_logits=True__ in **Inception3**. Don't worry about this.
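
Before moving on, it helps to tally how many channels each module outputs, since that number is what the next module has to accept as __in_channels__. The helper functions below are made up for this illustration. They just add up the output channels of the branches that each module concatenates:

```python
def inception_a_out(pool_features):
    # branch1x1 + branch5x5 + branch3x3dbl + branch_pool
    return 64 + 64 + 96 + pool_features


def inception_b_out(in_channels):
    # branch3x3 + branch3x3dbl + max-pooled input (channels unchanged)
    return 384 + 96 + in_channels


def inception_c_out():
    # branch1x1, branch7x7, branch7x7dbl, and branch_pool each output 192
    return 4 * 192


def inception_d_out(in_channels):
    # branch3x3 + branch7x7x3 + max-pooled input (channels unchanged)
    return 320 + 192 + in_channels


def inception_e_out():
    # branch3x3 and branch3x3dbl each concatenate two 384-channel outputs
    return 320 + 2 * 384 + 2 * 384 + 192


print(inception_a_out(32), inception_a_out(64))  # 256 288
print(inception_b_out(288))                      # 768
print(inception_c_out())                         # 768
print(inception_d_out(768))                      # 1280
print(inception_e_out())                         # 2048
```

The output of InceptionA, InceptionC, and InceptionE doesn't depend on __in_channels__, while InceptionB and InceptionD pass the pooled input straight through and so add its channels to the total.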

Ok, let's return to our **Inception3** class. Now we're going to start adding the outputs of *passing our input through each inception module*! Recall that these outputs are concatenated outputs of multiple sets of convolutions, along with a pooling operation. Again, we're taking this from https://github.com/pytorch/vision/blob/master/torchvision/models/inception.py and omitting **transform_input** and __aux_logits__:

In [77]:
class Inception3(nn.Module):
    def __init__(self, num_classes=1000):
        super(Inception3, self).__init__()
        self.Conv2d_1a_3x3 = BasicConv2d(3, 32, kernel_size=3, stride=2)
        self.Conv2d_2a_3x3 = BasicConv2d(32, 32, kernel_size=3)
        self.Conv2d_2b_3x3 = BasicConv2d(32, 64, kernel_size=3, padding=1)
        self.Conv2d_3b_1x1 = BasicConv2d(64, 80, kernel_size=1)
        self.Conv2d_4a_3x3 = BasicConv2d(80, 192, kernel_size=3)
        self.Mixed_5b = InceptionA(192, pool_features=32)
        self.Mixed_5c = InceptionA(256, pool_features=64)
        self.Mixed_5d = InceptionA(288, pool_features=64)
        self.Mixed_6a = InceptionB(288)
        self.Mixed_6b = InceptionC(768, channels_7x7=128)
        self.Mixed_6c = InceptionC(768, channels_7x7=160)
        self.Mixed_6d = InceptionC(768, channels_7x7=160)
        self.Mixed_6e = InceptionC(768, channels_7x7=192)
        self.Mixed_7a = InceptionD(768)
        self.Mixed_7b = InceptionE(1280)
        self.Mixed_7c = InceptionE(2048)
        self.fc = nn.Linear(2048, num_classes)
        
    def forward(self, x):
        
        # defines how layers are applied to inputs
        
        return x
    
net = Inception3()
print(net)

Inception3(
  (Conv2d_1a_3x3): BasicConv2d(
    (conv): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), bias=False)
    (bn): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_2a_3x3): BasicConv2d(
    (conv): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(32, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_2b_3x3): BasicConv2d(
    (conv): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_3b_1x1): BasicConv2d(
    (conv): Conv2d(64, 80, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(80, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (Conv2d_4a_3x3): BasicConv2d(
    (conv): Conv2d(80, 192, kernel_size=(3, 3), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(192, eps=0.001, momentum=0.1, affine=True, t

Again, let's keep the focus on architecture, but for completeness (the **torchvision** version of) Inception v3 includes a loop for initializing parameters:

In [78]:
class Inception3(nn.Module):

    def __init__(self, num_classes=1000, aux_logits=True, transform_input=False):
        super(Inception3, self).__init__()
        
        # layers (of layers of layers?) go here
        
        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                import scipy.stats as stats
                stddev = m.stddev if hasattr(m, 'stddev') else 0.1
                X = stats.truncnorm(-2, 2, scale=stddev)
                values = torch.as_tensor(X.rvs(m.weight.numel()), dtype=m.weight.dtype)
                values = values.view(m.weight.size())
                with torch.no_grad():
                    m.weight.copy_(values)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

Now let's see how the data propagates through the network!

In [79]:
class Inception3(nn.Module):

    def forward(self, x):
        # N x 3 x 299 x 299
        x = self.Conv2d_1a_3x3(x)
        # N x 32 x 149 x 149
        x = self.Conv2d_2a_3x3(x)
        # N x 32 x 147 x 147
        x = self.Conv2d_2b_3x3(x)
        # N x 64 x 147 x 147
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        # N x 64 x 73 x 73
        x = self.Conv2d_3b_1x1(x)
        # N x 80 x 73 x 73
        x = self.Conv2d_4a_3x3(x)
        # N x 192 x 71 x 71
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        # N x 192 x 35 x 35
        x = self.Mixed_5b(x)
        # N x 256 x 35 x 35
        x = self.Mixed_5c(x)
        # N x 288 x 35 x 35
        x = self.Mixed_5d(x)
        # N x 288 x 35 x 35
        x = self.Mixed_6a(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6b(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6c(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6d(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6e(x)
        # N x 768 x 17 x 17
        x = self.Mixed_7a(x)
        # N x 1280 x 8 x 8
        x = self.Mixed_7b(x)
        # N x 2048 x 8 x 8
        x = self.Mixed_7c(x)
        # N x 2048 x 8 x 8
        # Adaptive average pooling
        x = F.adaptive_avg_pool2d(x, (1, 1))
        # N x 2048 x 1 x 1
        x = F.dropout(x, training=self.training)
        # N x 2048 x 1 x 1
        x = x.view(x.size(0), -1)
        # N x 2048
        x = self.fc(x)
        # N x 1000 (num_classes)
        return x
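
The spatial sizes in the comments above can be reproduced with the usual output-size formula, floor((n + 2 * padding - kernel) / stride) + 1. Here is a short sketch for the layers that come before the inception modules. The helper **out_size** and the **stem** list are written for this illustration, with kernel, stride, and padding values copied from the layer definitions:

```python
def out_size(n, kernel_size, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (floor mode)."""
    return (n + 2 * padding - kernel_size) // stride + 1


# (name, kernel_size, stride, padding) for each layer before Mixed_5b
stem = [
    ("Conv2d_1a_3x3", 3, 2, 0),
    ("Conv2d_2a_3x3", 3, 1, 0),
    ("Conv2d_2b_3x3", 3, 1, 1),
    ("max_pool2d", 3, 2, 0),
    ("Conv2d_3b_1x1", 1, 1, 0),
    ("Conv2d_4a_3x3", 3, 1, 0),
    ("max_pool2d", 3, 2, 0),
]

n = 299  # input height and width
sizes = [n]
for name, kernel_size, stride, padding in stem:
    n = out_size(n, kernel_size, stride, padding)
    sizes.append(n)
    print(name, n)

print(sizes)  # [299, 149, 147, 147, 73, 73, 71, 35]
```

The same formula explains the two later reductions: the stride-2 branches in **Mixed_6a** take 35 to 17, and those in **Mixed_7a** take 17 to 8.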

We can then put it all together inside a function **inception_v3**. We'll ignore the __pretrained__ option. In torchvision, **progress=True** just displays a progress bar while the pretrained weights download:

In [80]:
def inception_v3(progress=True, **kwargs):
    # progress is accepted for compatibility with torchvision but is unused here
    return Inception3(**kwargs)


# The classes are defined at module level. If they were indented under the
# function above, Inception3 would become an unassigned local name there.
class Inception3(nn.Module):

    def __init__(self, num_classes=1000):
        super(Inception3, self).__init__()
        self.Conv2d_1a_3x3 = BasicConv2d(3, 32, kernel_size=3, stride=2)
        self.Conv2d_2a_3x3 = BasicConv2d(32, 32, kernel_size=3)
        self.Conv2d_2b_3x3 = BasicConv2d(32, 64, kernel_size=3, padding=1)
        self.Conv2d_3b_1x1 = BasicConv2d(64, 80, kernel_size=1)
        self.Conv2d_4a_3x3 = BasicConv2d(80, 192, kernel_size=3)
        self.Mixed_5b = InceptionA(192, pool_features=32)
        self.Mixed_5c = InceptionA(256, pool_features=64)
        self.Mixed_5d = InceptionA(288, pool_features=64)
        self.Mixed_6a = InceptionB(288)
        self.Mixed_6b = InceptionC(768, channels_7x7=128)
        self.Mixed_6c = InceptionC(768, channels_7x7=160)
        self.Mixed_6d = InceptionC(768, channels_7x7=160)
        self.Mixed_6e = InceptionC(768, channels_7x7=192)
        self.Mixed_7a = InceptionD(768)
        self.Mixed_7b = InceptionE(1280)
        self.Mixed_7c = InceptionE(2048)
        self.fc = nn.Linear(2048, num_classes)

        # Initialize conv and linear weights from a truncated normal distribution
        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                import scipy.stats as stats
                stddev = m.stddev if hasattr(m, 'stddev') else 0.1
                X = stats.truncnorm(-2, 2, scale=stddev)
                values = torch.as_tensor(X.rvs(m.weight.numel()), dtype=m.weight.dtype)
                values = values.view(m.weight.size())
                with torch.no_grad():
                    m.weight.copy_(values)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        # N x 3 x 299 x 299
        x = self.Conv2d_1a_3x3(x)
        # N x 32 x 149 x 149
        x = self.Conv2d_2a_3x3(x)
        # N x 32 x 147 x 147
        x = self.Conv2d_2b_3x3(x)
        # N x 64 x 147 x 147
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        # N x 64 x 73 x 73
        x = self.Conv2d_3b_1x1(x)
        # N x 80 x 73 x 73
        x = self.Conv2d_4a_3x3(x)
        # N x 192 x 71 x 71
        x = F.max_pool2d(x, kernel_size=3, stride=2)
        # N x 192 x 35 x 35
        x = self.Mixed_5b(x)
        # N x 256 x 35 x 35
        x = self.Mixed_5c(x)
        # N x 288 x 35 x 35
        x = self.Mixed_5d(x)
        # N x 288 x 35 x 35
        x = self.Mixed_6a(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6b(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6c(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6d(x)
        # N x 768 x 17 x 17
        x = self.Mixed_6e(x)
        # N x 768 x 17 x 17
        x = self.Mixed_7a(x)
        # N x 1280 x 8 x 8
        x = self.Mixed_7b(x)
        # N x 2048 x 8 x 8
        x = self.Mixed_7c(x)
        # N x 2048 x 8 x 8
        # Adaptive average pooling
        x = F.adaptive_avg_pool2d(x, (1, 1))
        # N x 2048 x 1 x 1
        x = F.dropout(x, training=self.training)
        # N x 2048 x 1 x 1
        x = x.view(x.size(0), -1)
        # N x 2048
        x = self.fc(x)
        # N x 1000 (num_classes)
        return x


class InceptionA(nn.Module):

    def __init__(self, in_channels, pool_features):
        super(InceptionA, self).__init__()
        self.branch1x1 = BasicConv2d(in_channels, 64, kernel_size=1)

        self.branch5x5_1 = BasicConv2d(in_channels, 48, kernel_size=1)
        self.branch5x5_2 = BasicConv2d(48, 64, kernel_size=5, padding=2)

        self.branch3x3dbl_1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(64, 96, kernel_size=3, padding=1)
        self.branch3x3dbl_3 = BasicConv2d(96, 96, kernel_size=3, padding=1)

        self.branch_pool = BasicConv2d(in_channels, pool_features, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch5x5 = self.branch5x5_1(x)
        branch5x5 = self.branch5x5_2(branch5x5)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)

        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)

        # Concatenate the branches along the channel dimension
        outputs = [branch1x1, branch5x5, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)


class InceptionB(nn.Module):

    def __init__(self, in_channels):
        super(InceptionB, self).__init__()
        self.branch3x3 = BasicConv2d(in_channels, 384, kernel_size=3, stride=2)

        self.branch3x3dbl_1 = BasicConv2d(in_channels, 64, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(64, 96, kernel_size=3, padding=1)
        self.branch3x3dbl_3 = BasicConv2d(96, 96, kernel_size=3, stride=2)

    def forward(self, x):
        branch3x3 = self.branch3x3(x)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = self.branch3x3dbl_3(branch3x3dbl)

        branch_pool = F.max_pool2d(x, kernel_size=3, stride=2)

        outputs = [branch3x3, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)


class InceptionC(nn.Module):

    def __init__(self, in_channels, channels_7x7):
        super(InceptionC, self).__init__()
        self.branch1x1 = BasicConv2d(in_channels, 192, kernel_size=1)

        c7 = channels_7x7
        self.branch7x7_1 = BasicConv2d(in_channels, c7, kernel_size=1)
        self.branch7x7_2 = BasicConv2d(c7, c7, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7_3 = BasicConv2d(c7, 192, kernel_size=(7, 1), padding=(3, 0))

        self.branch7x7dbl_1 = BasicConv2d(in_channels, c7, kernel_size=1)
        self.branch7x7dbl_2 = BasicConv2d(c7, c7, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7dbl_3 = BasicConv2d(c7, c7, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7dbl_4 = BasicConv2d(c7, c7, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7dbl_5 = BasicConv2d(c7, 192, kernel_size=(1, 7), padding=(0, 3))

        self.branch_pool = BasicConv2d(in_channels, 192, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch7x7 = self.branch7x7_1(x)
        branch7x7 = self.branch7x7_2(branch7x7)
        branch7x7 = self.branch7x7_3(branch7x7)

        branch7x7dbl = self.branch7x7dbl_1(x)
        branch7x7dbl = self.branch7x7dbl_2(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_3(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_4(branch7x7dbl)
        branch7x7dbl = self.branch7x7dbl_5(branch7x7dbl)

        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)

        outputs = [branch1x1, branch7x7, branch7x7dbl, branch_pool]
        return torch.cat(outputs, 1)


class InceptionD(nn.Module):

    def __init__(self, in_channels):
        super(InceptionD, self).__init__()
        self.branch3x3_1 = BasicConv2d(in_channels, 192, kernel_size=1)
        self.branch3x3_2 = BasicConv2d(192, 320, kernel_size=3, stride=2)

        self.branch7x7x3_1 = BasicConv2d(in_channels, 192, kernel_size=1)
        self.branch7x7x3_2 = BasicConv2d(192, 192, kernel_size=(1, 7), padding=(0, 3))
        self.branch7x7x3_3 = BasicConv2d(192, 192, kernel_size=(7, 1), padding=(3, 0))
        self.branch7x7x3_4 = BasicConv2d(192, 192, kernel_size=3, stride=2)

    def forward(self, x):
        branch3x3 = self.branch3x3_1(x)
        branch3x3 = self.branch3x3_2(branch3x3)

        branch7x7x3 = self.branch7x7x3_1(x)
        branch7x7x3 = self.branch7x7x3_2(branch7x7x3)
        branch7x7x3 = self.branch7x7x3_3(branch7x7x3)
        branch7x7x3 = self.branch7x7x3_4(branch7x7x3)

        branch_pool = F.max_pool2d(x, kernel_size=3, stride=2)
        outputs = [branch3x3, branch7x7x3, branch_pool]
        return torch.cat(outputs, 1)


class InceptionE(nn.Module):

    def __init__(self, in_channels):
        super(InceptionE, self).__init__()
        self.branch1x1 = BasicConv2d(in_channels, 320, kernel_size=1)

        self.branch3x3_1 = BasicConv2d(in_channels, 384, kernel_size=1)
        self.branch3x3_2a = BasicConv2d(384, 384, kernel_size=(1, 3), padding=(0, 1))
        self.branch3x3_2b = BasicConv2d(384, 384, kernel_size=(3, 1), padding=(1, 0))

        self.branch3x3dbl_1 = BasicConv2d(in_channels, 448, kernel_size=1)
        self.branch3x3dbl_2 = BasicConv2d(448, 384, kernel_size=3, padding=1)
        self.branch3x3dbl_3a = BasicConv2d(384, 384, kernel_size=(1, 3), padding=(0, 1))
        self.branch3x3dbl_3b = BasicConv2d(384, 384, kernel_size=(3, 1), padding=(1, 0))

        self.branch_pool = BasicConv2d(in_channels, 192, kernel_size=1)

    def forward(self, x):
        branch1x1 = self.branch1x1(x)

        branch3x3 = self.branch3x3_1(x)
        branch3x3 = [
            self.branch3x3_2a(branch3x3),
            self.branch3x3_2b(branch3x3),
        ]
        branch3x3 = torch.cat(branch3x3, 1)

        branch3x3dbl = self.branch3x3dbl_1(x)
        branch3x3dbl = self.branch3x3dbl_2(branch3x3dbl)
        branch3x3dbl = [
            self.branch3x3dbl_3a(branch3x3dbl),
            self.branch3x3dbl_3b(branch3x3dbl),
        ]
        branch3x3dbl = torch.cat(branch3x3dbl, 1)

        branch_pool = F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)
        branch_pool = self.branch_pool(branch_pool)

        outputs = [branch1x1, branch3x3, branch3x3dbl, branch_pool]
        return torch.cat(outputs, 1)


class BasicConv2d(nn.Module):

    def __init__(self, in_channels, out_channels, **kwargs):
        super(BasicConv2d, self).__init__()
        # Convolution without bias, since batch normalization adds its own shift
        self.conv = nn.Conv2d(in_channels, out_channels, bias=False, **kwargs)
        self.bn = nn.BatchNorm2d(out_channels, eps=0.001)

    def forward(self, x):
        x = self.conv(x)
        x = self.bn(x)
        return F.relu(x, inplace=True)

Wow, that's a lot of code. Here's the article if you want to learn more about the logic behind Inception v3: https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf

The basic idea of Inception v3 is to reduce the computational cost of large (e.g. 5x5 and 7x7) convolutions. Remember AlexNet and its 11x11 convolutions? Inception v3 avoids kernels like that by factorizing a large convolution into a sequence of smaller ones (for example, the 1x7 and 7x1 pairs in **InceptionC**), running several such branches in parallel, concatenating their outputs along the channel dimension, and feeding the result to the next module.
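
To see where the savings come from, here is a small back-of-the-envelope sketch that counts weights only. The helper `conv_weights` and the channel count of 64 are assumptions chosen for illustration, not values from the paper. Two stacked 3x3 convolutions cover the same 5x5 receptive field as one 5x5 convolution, and a 1x7 followed by a 7x1 covers the same 7x7 receptive field as one 7x7 convolution:

```python
def conv_weights(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights in a bias-free convolution."""
    return kernel_h * kernel_w * in_channels * out_channels


channels = 64  # illustrative channel count, used for both input and output

# One 5x5 convolution versus two stacked 3x3 convolutions
single_5x5 = conv_weights(5, 5, channels, channels)
two_3x3 = 2 * conv_weights(3, 3, channels, channels)

# One 7x7 convolution versus a 1x7 convolution followed by a 7x1 convolution
single_7x7 = conv_weights(7, 7, channels, channels)
pair_1x7_7x1 = (conv_weights(1, 7, channels, channels)
                + conv_weights(7, 1, channels, channels))

print(single_5x5, two_3x3)       # 102400 73728
print(single_7x7, pair_1x7_7x1)  # 200704 57344
```

The factorized versions need 18/25 and 14/49 of the weights, respectively, whatever the channel count.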

One last thing: if you want to experiment with passing data through Inception v3, it expects tensors of size N x 3 x 299 x 299 (a batch of N three-channel 299x299 images).
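
The shape comments in the **forward** method can be reproduced with a little arithmetic. The sketch below assumes the usual output-size formula for an unpadded or padded convolution or pooling layer, floor((size + 2 * padding - kernel) / stride) + 1. The helper `conv_out` and the table of layers are written here for illustration, with kernel, stride, and padding values copied from the code above:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) for every layer that changes height and width
layers = [
    ("Conv2d_1a_3x3", 3, 2, 0),
    ("Conv2d_2a_3x3", 3, 1, 0),
    ("Conv2d_2b_3x3", 3, 1, 1),
    ("max_pool2d", 3, 2, 0),
    ("Conv2d_3b_1x1", 1, 1, 0),
    ("Conv2d_4a_3x3", 3, 1, 0),
    ("max_pool2d", 3, 2, 0),
    ("Mixed_6a", 3, 2, 0),  # stride-2 branches of InceptionB
    ("Mixed_7a", 3, 2, 0),  # stride-2 branches of InceptionD
]

size = 299
sizes = []
for name, kernel, stride, padding in layers:
    size = conv_out(size, kernel, stride, padding)
    sizes.append(size)
    print(f"{name:15s} -> {size} x {size}")

# Channel counts after concatenating the branches of each module type
mixed_5b = 64 + 64 + 96 + 32           # InceptionA with pool_features=32
mixed_6a = 384 + 96 + 288              # InceptionB: two conv branches plus the pooled 288-channel input
mixed_7a = 320 + 192 + 768             # InceptionD: two conv branches plus the pooled 768-channel input
mixed_7c = 320 + 2 * 384 + 2 * 384 + 192  # InceptionE
print(mixed_5b, mixed_6a, mixed_7a, mixed_7c)
```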

Last but certainly not least we have ResNet. It's not as verbose as Inception v3. Like VGG, there are multiple flavors of ResNet: resnet18, resnet34, resnet50, resnet101, resnet152, resnext50_32x4d, and resnext101_32x8d.

Here's the ResNet paper: https://arxiv.org/pdf/1512.03385

One of the problems with training deeper neural networks is that, past a certain depth, adding layers stops helping: error on both the training and test sets flattens out or even increases. Because training error rises too, this is not simply overfitting. ResNet tackles this issue by adding *shortcut connections*: the input to a block of layers is carried around the block by an identity mapping and added to the block's output, so the layers only have to learn a correction (a *residual*) to their input. It is somewhat surprising that this helps as much as it does. Maybe it will make more sense in code?
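
Before looking at the real implementation, here is a minimal sketch of a shortcut connection. The class name `TinyResidual` is hypothetical and the block is far simpler than anything in the paper: a single convolution plus the identity. It illustrates one consequence of the design: if the learned part outputs zero, the block passes its input through untouched, so adding such a block cannot make the network's starting point worse.

```python
import torch
import torch.nn as nn


class TinyResidual(nn.Module):
    """Toy block (illustrative only): output = f(x) + x."""

    def __init__(self, channels):
        super().__init__()
        # padding=1 keeps height and width unchanged so the addition is valid
        self.f = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        return self.f(x) + x  # shortcut connection


torch.manual_seed(0)
block = TinyResidual(4)
x = torch.randn(2, 4, 8, 8)

# With ordinary random weights, the block changes its input
changed = not torch.allclose(block(x), x)

# With the weights set to zero, the block is exactly the identity mapping
nn.init.zeros_(block.f.weight)
is_identity = torch.allclose(block(x), x)

print(changed, is_identity)
```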

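The convolutional operations are simple: just 1x1 and 3x3 kernels with a default stride of 1. Note that the 3x3 kernels set padding equal to dilation, which keeps the output the same height and width as the input when the stride is 1: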

In [81]:
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)

def conv1x1(in_planes, out_planes, stride=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)
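
Why set the padding equal to the dilation? A short check, assuming the output-size formula given in the **nn.Conv2d** documentation (the helper name `conv_out` is ours): for a 3x3 kernel with stride 1, the padding then exactly cancels the shrinkage caused by the dilated kernel.

```python
def conv_out(size, kernel=3, stride=1, padding=0, dilation=1):
    """Output size along one spatial dimension of a convolution."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# padding == dilation preserves the spatial size at stride 1
preserved = [conv_out(56, padding=d, dilation=d) for d in (1, 2, 4)]

# With stride 2 (and dilation 1) the size is halved instead
halved = conv_out(56, stride=2, padding=1)

print(preserved, halved)  # [56, 56, 56] 28
```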

Now let's define our **BasicBlock** class. This defines the architecture of a *residual block* of layers with a skip connection. Recall that data doesn't only pass through the layers sequentially: a copy of the block's input skips past them and is added back at the end:

In [82]:
class BasicBlock(nn.Module):
    expansion = 1  # used later to define layers 

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

Let's take a look at what we have here:

1) A 3x3 convolution from **inplanes** input channels to **planes** output channels, both passed in by the caller, using the block's **stride** (1 by default)

2) A normalization layer with **planes** features, matching the output of the first convolution. It is **nn.BatchNorm2d** by default; taking it as the **norm_layer** argument lets a caller substitute a different normalization layer

3) A ReLU activation

4) A second 3x3 convolution, always with stride 1, whose input and output channels both equal **planes**

5) Another normalization layer, configured like the first one but with its own parameters

6/7) **downsample** and **stride** are also passed in by the caller. **downsample** is an optional module that reshapes the block's input so that it can be added to the block's output

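Our forward pass proceeds with **conv1**, **bn1**, **relu**, **conv2**, and **bn2**. If **downsample** is set, it is applied to the original input (not to the output of **bn2**) so that the identity has the same shape as that output. We then add the identity and pass the sum through a ReLU activation: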

In [83]:
def forward(self, x):
    identity = x

    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)

    if self.downsample is not None:
        identity = self.downsample(x)

    out += identity
    out = self.relu(out)

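    return out


# Attach the method to the BasicBlock class defined in the previous cell
BasicBlock.forward = forward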
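
To see what **downsample** is for, here is a condensed, self-contained sketch. `TinyBasicBlock` is a hypothetical stand-in with the same layer order as **BasicBlock** but without the argument checks, and the input size is chosen for illustration. When the block changes the number of channels or the stride, the untouched input no longer has the same shape as the output, so it has to be projected before the addition:

```python
import torch
import torch.nn as nn


class TinyBasicBlock(nn.Module):
    """Condensed stand-in for BasicBlock (illustrative, not the torchvision class)."""

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super().__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)


x = torch.randn(2, 64, 56, 56)

# Same channels, stride 1: the input can be added back unchanged
same_shape = TinyBasicBlock(64, 64)(x).shape

# Channels double and stride is 2: project the identity with a strided 1x1 convolution
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
halved_shape = TinyBasicBlock(64, 128, stride=2, downsample=projection)(x).shape

# Without the projection the addition fails because the shapes differ
try:
    TinyBasicBlock(64, 128, stride=2)(x)
    mismatch_raised = False
except RuntimeError:
    mismatch_raised = True

print(same_shape, halved_shape, mismatch_raised)
```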

We also define a **Bottleneck** class. As the name suggests, the input passes through a 1x1 convolutional layer that typically reduces the number of channels, followed by a 3x3 layer, followed by another 1x1 layer that expands the channels to 4 times **planes** (since **expansion** is 4). **BasicBlock** is used for resnet18 and resnet34; **Bottleneck** is used for resnet50, resnet101, resnet152, resnext50_32x4d, and resnext101_32x8d. The rest of the code should look similar to **BasicBlock**:

In [84]:
class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out
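
The **width** line is the only place where the ResNeXt variants differ from the plain ResNets. As a quick sketch, the helper below (the name `bottleneck_width` is ours) repeats that one expression for the first stage, where **planes** is 64, using the **groups** and **width_per_group** values that the ResNeXt constructors set further down:

```python
def bottleneck_width(planes, base_width=64, groups=1):
    """Channels used by the 3x3 convolution inside a Bottleneck block."""
    return int(planes * (base_width / 64.)) * groups


planes = 64  # first stage of the network

resnet50_width = bottleneck_width(planes)                              # defaults
resnext50_width = bottleneck_width(planes, base_width=4, groups=32)    # resnext50_32x4d
resnext101_width = bottleneck_width(planes, base_width=8, groups=32)   # resnext101_32x8d

# The block's output channels do not depend on the width
out_channels = planes * 4  # planes * Bottleneck.expansion

print(resnet50_width, resnext50_width, resnext101_width, out_channels)  # 64 128 256 256
```

So "32x4d" reads as 32 groups of 4 channels each in the first stage: the 3x3 convolution is wider than in resnet50, but each group only sees its own slice of the channels.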

Once we define our generic ResNet architecture (we'll get to that soon!) we can build a specific ResNet by calling its constructor function (e.g. **resnet50**), which in turn calls a function **\_resnet**. To simplify things, let's assume we're not using a pretrained model (i.e. we're not initializing our weights with those of a ResNet previously trained on ImageNet). Let's start with **\_resnet**:

In [85]:
def _resnet(arch, block, layers, pretrained=False, progress=True, **kwargs):
    # pretrained and progress are unused here; the defaults let callers omit them
    model = ResNet(block, layers, **kwargs)
    return model

**\_resnet** takes the following arguments:

**arch** is the specific architecture of our model (e.g. __resnet50__)

**block** is either __BasicBlock__ (for __resnet18__ and **resnet34**) or __Bottleneck__ (for **resnet50** and larger networks)

**layers** is a list of four integers giving the number of blocks in each of the network's four stages. The **ResNet** class passes each one to its **\_make_layer** method (we'll get to this next)

The signature also includes **pretrained** and **progress**, as in the source code (https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py). We won't use these, so don't worry about them. We've already covered these arguments in our discussion of Inception v3.

Now let's take a look at our different ResNet architectures:

In [86]:
def resnet18(**kwargs):
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2],
                   **kwargs)


def resnet34(**kwargs):
    return _resnet('resnet34', BasicBlock, [3, 4, 6, 3],
                   **kwargs)


def resnet50(**kwargs):
    return _resnet('resnet50', Bottleneck, [3, 4, 6, 3],
                   **kwargs)


def resnet101(**kwargs):
    return _resnet('resnet101', Bottleneck, [3, 4, 23, 3],
                   **kwargs)


def resnet152(**kwargs):
    return _resnet('resnet152', Bottleneck, [3, 8, 36, 3],
                   **kwargs)


def resnext50_32x4d(**kwargs):
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 4
    return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3],
                   **kwargs)


def resnext101_32x8d(**kwargs):
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 8
    return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3],
                   **kwargs)
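
The numbers in the names follow directly from these lists. As a rough sketch, using the usual convention of counting only the initial convolution, the convolutions on the main path of each block, and the final fully connected layer (the 1x1 convolutions in **downsample** are not counted), the depth of each plain ResNet can be recovered like this. The dictionary and helper below are written for illustration:

```python
# name -> (convolutions per block, blocks per stage), copied from the constructors above
configs = {
    "resnet18": (2, [2, 2, 2, 2]),    # BasicBlock: two 3x3 convolutions
    "resnet34": (2, [3, 4, 6, 3]),
    "resnet50": (3, [3, 4, 6, 3]),    # Bottleneck: 1x1, 3x3, 1x1
    "resnet101": (3, [3, 4, 23, 3]),
    "resnet152": (3, [3, 8, 36, 3]),
}


def weighted_depth(convs_per_block, layers):
    # +2 for the initial 7x7 convolution and the final fully connected layer
    return convs_per_block * sum(layers) + 2


depths = {name: weighted_depth(convs, layers) for name, (convs, layers) in configs.items()}
print(depths)
```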

Now let's check out our generic ResNet architecture. This class has three methods: a constructor that defines the layers; a method **\_make_layer** that builds one stage of the network as a sequence of blocks, based on the arguments passed to it; and a **forward** method, which sequentially applies a 7x7 convolution with stride of 2 and padding of 3, batch normalization, ReLU, and 3x3 max pooling with stride of 2 and padding of 1, then layer1, layer2, layer3, and layer4 created by **\_make_layer**, and finally adaptive average pooling and a fully connected layer:

In [87]:
class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.reshape(x.size(0), -1)
        x = self.fc(x)

        return x
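
To follow what **\_make_layer** does across the four stages, here is a plain-Python sketch with no PyTorch required. The function `trace_stages` is hypothetical. It assumes a 224x224 input and the default **replace_stride_with_dilation** (no dilation), and it repeats the downsample condition and the channel bookkeeping from the method above:

```python
def trace_stages(expansion, size=224):
    """Return (output channels, spatial size, needs_downsample) for layer1..layer4."""
    size = (size + 2 * 3 - 7) // 2 + 1  # conv1: 7x7, stride 2, padding 3
    size = (size + 2 * 1 - 3) // 2 + 1  # maxpool: 3x3, stride 2, padding 1
    inplanes = 64
    rows = []
    for planes, stride in zip((64, 128, 256, 512), (1, 2, 2, 2)):
        # Same condition as in _make_layer
        needs_downsample = stride != 1 or inplanes != planes * expansion
        # The first block of the stage carries the stride in a 3x3 convolution with padding 1
        size = (size + 2 * 1 - 3) // stride + 1
        inplanes = planes * expansion
        rows.append((inplanes, size, needs_downsample))
    return rows


basic = trace_stages(expansion=1)       # resnet18 and resnet34 (BasicBlock)
bottleneck = trace_stages(expansion=4)  # resnet50 and larger (Bottleneck)

for row in basic:
    print("BasicBlock", row)
for row in bottleneck:
    print("Bottleneck", row)
```

Note the difference in the first stage: with **BasicBlock** the input already has 64 channels, so no projection is needed, whereas **Bottleneck** expands to 256 channels and therefore needs a **downsample** module even at stride 1. The final channel count (512 or 2048) is the 512 * block.expansion that feeds the fully connected layer.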

Let's put everything together in one block of code!

In [88]:
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)


def conv1x1(in_planes, out_planes, stride=1):
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        if groups != 1 or base_width != 64:
            raise ValueError('BasicBlock only supports groups=1 and base_width=64')
        if dilation > 1:
            raise NotImplementedError("Dilation > 1 not supported in BasicBlock")
        # Both self.conv1 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1, downsample=None, groups=1,
                 base_width=64, dilation=1, norm_layer=None):
        super(Bottleneck, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Module):

    def __init__(self, block, layers, num_classes=1000, zero_init_residual=False,
                 groups=1, width_per_group=64, replace_stride_with_dilation=None,
                 norm_layer=None):
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer

        self.inplanes = 64
        self.dilation = 1
        if replace_stride_with_dilation is None:
            # each element in the tuple indicates if we should replace
            # the 2x2 stride with a dilated convolution instead
            replace_stride_with_dilation = [False, False, False]
        if len(replace_stride_with_dilation) != 3:
            raise ValueError("replace_stride_with_dilation should be None "
                             "or a 3-element tuple, got {}".format(replace_stride_with_dilation))
        self.groups = groups
        self.base_width = width_per_group
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2,
                                       dilate=replace_stride_with_dilation[0])
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2,
                                       dilate=replace_stride_with_dilation[1])
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2,
                                       dilate=replace_stride_with_dilation[2])
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * block.expansion, num_classes)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )

        layers = []
        layers.append(block(self.inplanes, planes, stride, downsample, self.groups,
                            self.base_width, previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups,
                                base_width=self.base_width, dilation=self.dilation,
                                norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = x.reshape(x.size(0), -1)
        x = self.fc(x)

        return x


def _resnet(arch, block, layers, **kwargs):
    model = ResNet(block, layers, **kwargs)
    return model


def resnet18(**kwargs):
    return _resnet('resnet18', BasicBlock, [2, 2, 2, 2],
                   **kwargs)


def resnet34(**kwargs):
    return _resnet('resnet34', BasicBlock, [3, 4, 6, 3],
                   **kwargs)


def resnet50(**kwargs):
    return _resnet('resnet50', Bottleneck, [3, 4, 6, 3],
                   **kwargs)


def resnet101(**kwargs):
    return _resnet('resnet101', Bottleneck, [3, 4, 23, 3],
                   **kwargs)


def resnet152(**kwargs):
    return _resnet('resnet152', Bottleneck, [3, 8, 36, 3],
                   **kwargs)


def resnext50_32x4d(**kwargs):
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 4
    return _resnet('resnext50_32x4d', Bottleneck, [3, 4, 6, 3],
                   **kwargs)


def resnext101_32x8d(**kwargs):
    kwargs['groups'] = 32
    kwargs['width_per_group'] = 8
    return _resnet('resnext101_32x8d', Bottleneck, [3, 4, 23, 3],
                   **kwargs)

Great work. We've now covered five of the most well-known convolutional neural networks. LeNet-5 blazed a trail for future research and applications by demonstrating that a convolutional neural network trained with backpropagation could accurately classify (grayscale) images, including handwritten digits on checks. AlexNet was the first neural network to win the ImageNet Large Scale Visual Recognition Challenge, showing the dominance of convolutional networks over computer vision algorithms with "hand-crafted" features. VGG remains a powerful option for classification, despite its deceptively simple architecture of stacked 3x3 convolutions. Inception v3 is an efficient alternative to networks consisting of layered convolutions, replacing large convolutions with "modules," or sets of smaller convolutions. Finally, ResNet uses "skipped connections" between layers, enabling very deep neural networks (up to 152 layers) to learn features.

There's a lot we haven't covered in this tutorial. Although we managed to explore the architectures of the most common neural networks, we didn't spend much time on how networks compute loss or update parameters. We go into greater detail on the fundamentals of training neural networks in our "Introduction to Neural Networks" tutorial, but there is still much to cover.

If you have any questions, comments, or critiques, please email pdonnelly@groupwaretech.com!