*Major components of these lecture notes are based on the [Training a Classifier tutorial](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html) in the PyTorch docs. It is recommended to run the code in these notes on Google Colab with a GPU enabled.*

<a href="https://colab.research.google.com/github/middlebury-csci-0451/CSCI-0451/blob/main/lecture-notes/convnets.ipynb" target="_parent">Open these notes in Google Colab</a>

<a href="https://colab.research.google.com/github/middlebury-csci-0451/CSCI-0451/blob/main/lecture-notes/convnets-live.ipynb" target="_parent">Open the live version in Google Colab</a>

## Introduction

The image classification problem is the problem of assigning a label to an image. For example, we might want to assign the label "duck" to pictures of ducks, the label "frog" to pictures of frogs, and so on. 

In more practical terms, we might be interested in classifications like: 

1. "**cancerous**" vs. "**noncancerous**" X-ray images. 
2. "**contains pedestrian**" vs. "**no pedestrian**" images from optical sensors in driver-assist vehicles. 
3. A recent problem that has come up: "**generated by human artist**" vs. "**generated by model**" images.

### Ethics of Image Classification

Image classification, and application of machine learning tools to visual data more generally, is an area fraught with ethical challenges. Just a few highlights of what can go wrong: 

1. The famous [GenderShades study](http://gendershades.org/) by Joy Buolamwini and Timnit Gebru found significant racial disparities in the accuracy of face-based gender classifiers. 
2. Researchers at Stanford [trained a model](https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph) on images from a dating site in order to make face-based predictions about an individual's sexual orientation, with higher-than-human accuracy. They argued that their paper supported the idea that sexual orientation is physiologically influenced and therefore not a choice, but rather something that people are born with. Critics, on the other hand, raised concerns about the consequences of such a tool being used by oppressive governments under which queer identities and sexuality are prohibited and punished. 
3. A controversial study from 2016 claimed to be able to predict, from facial features, whether or not an individual would go on to commit a crime. In fact, the researchers allowed themselves to be fooled by an artifact of their data collection process: their training data set of non-criminals was from online profile pictures (in which smiling is common), while their training data set of convicted criminals was mugshots, in which smiling is prohibited. So, they essentially built a smile-detector, which is of some interest but not for the intended purpose. Here is [one writeup of this episode](https://www.newscientist.com/article/2114900-concerns-as-face-recognition-tech-used-to-identify-criminals/). The idea that it is possible to make inferences about a person's character, abilities, or future decisions from their physical appearance is called [*physiognomy*](https://en.wikipedia.org/wiki/Physiognomy), and was one of the main "scientific" "justifications" of the scientific racism movement of the late 1800s and early 1900s. 



![](https://theintercept.imgix.net/wp-uploads/sites/1/2016/11/Screen-Shot-2016-11-18-at-12.09.50-PM.png?auto=compress%2Cformat&q=90&w=1000&h=857)

*Excerpted figure from the paper claiming to predict criminality from faces.*

## The CIFAR10 Data

For this lecture, we'll use the popular [CIFAR10 data set](https://www.cs.toronto.edu/~kriz/cifar.html). CIFAR10 was a common benchmark for simple image recognition tasks, although it's since been superseded by larger and more complex data sets.

To start, we'll load our packages, access the CIFAR10 data set, and set the device type. 

In [None]:
# if you don't have torchviz, install it by uncommenting and running the following line
# !pip install torchviz

In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
from torchviz import make_dot, make_dot_from_trace
from torchsummary import summary

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 4

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

If your computer has a CUDA-capable GPU, or if you are working on Google Colab with a GPU enabled, then you can run your computations on a GPU (CUDA) device. This *can* be very helpful, often resulting in speedups of roughly 10x. However, the benefit depends strongly on the exact model architecture. Generally speaking, larger models see greater benefits from GPU usage. 

### Visualizing The Data

The CIFAR10 training data set contains 50,000 images with 32x32 pixels and 3 color channels. Each of these images is labeled with one of 10 labels: 

In [None]:
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

The testing data set contains 10,000 more images with the same labels. 

Let's begin by visualizing a few elements of the training data set: 

In [None]:
from matplotlib import pyplot as plt
import numpy as np

n_rows = 3

fig, axarr = plt.subplots(n_rows, batch_size, figsize = (10, 7))

tl = iter(trainloader)

for i in range(n_rows):

    # returns batch_size images with their labels
    X, y = next(tl) 

    # populate a row with the images in the batch
    for j in range(batch_size):
        img = np.moveaxis(X[j].numpy(), 0, 2) 
        axarr[i, j].imshow((img + 1)/2)
        axarr[i, j].axis("off")
        axarr[i, j].set(title = classes[int(y[j])])

The 10 classes are evenly represented in the training and test data sets. So, the base rate for this problem (corresponding to random guessing) is 1/10 = 10%. 

## First Model: Logistic Regression

We're now ready to write some models to attempt to do better than the base rate. Before we construct any models, we'll set our `device` to be equal to the GPU if we have one available (e.g. if your computer has one or if you are working on Google Colab). 

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

Now let's construct a logistic regression model. All we need for this model is a `Linear` layer that accepts the number of floating point numbers used to store a single image and returns 10 numbers. A single image is a tensor of size $3\times32\times 32$, which means that it has 3072 total numbers stored. 
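A minimal sketch of such a model is below. The class name `LogisticRegression` and the use of `nn.Flatten` are illustrative assumptions rather than something fixed by these notes; the only real requirement is a single `Linear` layer mapping 3072 inputs to 10 outputs.

```python
import torch
import torch.nn as nn

class LogisticRegression(nn.Module):
    """Multiclass logistic regression: flatten the image, then one linear layer."""

    def __init__(self, n_features=3 * 32 * 32, n_classes=10):
        super().__init__()
        self.flatten = nn.Flatten()  # (batch, 3, 32, 32) -> (batch, 3072)
        self.linear = nn.Linear(n_features, n_classes)

    def forward(self, x):
        # returns raw scores (logits); nn.CrossEntropyLoss applies the softmax itself
        return self.linear(self.flatten(x))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = LogisticRegression().to(device)

# one weight per (input number, class) pair, plus one bias per class
n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 3072 * 10 + 10 = 30730
```

Moving the model to `device` matters because the model and the data it is called on need to live on the same device.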

We can get a nice summary of the structure of our model using the `summary` function from the `torchsummary` package: 

As a side note, after computing the loss on a bit of data, it's possible to actually visualize the computational graph. Fun! 

Let's go ahead and implement a training loop: 

In [None]:
import torch.nn as nn
import torch.optim as optim

def train(model, k_epochs = 1, print_every = 2000):

    # loss function is cross-entropy (multiclass logistic)
    loss_fn = nn.CrossEntropyLoss() 

    # optimizer is Adam, which does fancier stuff with the gradients
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(k_epochs): 

        running_loss = 0.0
        for i, data in enumerate(trainloader, 0):

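            # unpack the batch and move it to the device 
            # (the model is assumed to already be on the same device)
            X, y = data
            X, y = X.to(device), y.to(device)

            # clear gradients left over from the previous step
            optimizer.zero_grad()

            # forward pass, loss, backward pass, parameter update
            y_hat = model(X)
            loss = loss_fn(y_hat, y)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()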

            # print the epoch, number of batches processed, and running loss 
            # in regular intervals
            if i % print_every == print_every - 1:    
                print(f'[epoch: {epoch + 1}, batches: {i + 1:5d}], loss: {running_loss / print_every:.3f}')
                running_loss = 0.0

    print('Finished Training')

# train the model
train(model)


Let's also define a testing function that will evaluate the accuracy of our model against the test set. 

In [None]:
def test(model):
    correct = 0
    total = 0
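    # torch.no_grad creates an environment in which we do NOT store the 
    # computational graph. We don't need the graph here because gradients 
    # are only needed during training
    with torch.no_grad():
        for data in testloader:
            
            # move the batch to the device 
            # (the model is assumed to already be on the same device)
            X, y = data
            X, y = X.to(device), y.to(device)

            # the predicted class is the one with the highest score
            y_hat = model(X)
            predicted = torch.argmax(y_hat, dim = 1)

            total += y.size(0)
            correct += (predicted == y).sum().item()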

    print(f'Test accuracy: {100 * correct // total} %')

test(model)

## Second Model: 1 Hidden Layer

There's nothing at all wrong with logistic regression, but it's not very Deep. If all we wanted to do was logistic regression, we could have stuck with `scikit-learn` or even our own implementations. Here's an implementation of a linear-ReLU stack. This model has "complete" or "full" linkage because every unit in one layer is an input to every unit in the next layer. 

![](https://upload.wikimedia.org/wikipedia/commons/thumb/4/46/Colored_neural_network.svg/500px-Colored_neural_network.svg.png)

*Example of a fully-connected neural network: every unit in every layer is an input into the next layer.*
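One way such a linear-ReLU stack might look is sketched below. The class name `OneHiddenLayer` and the hidden width of 512 are assumptions made for illustration; the notes do not fix them.

```python
import torch
import torch.nn as nn

class OneHiddenLayer(nn.Module):
    """Fully connected network: Linear -> ReLU -> Linear."""

    def __init__(self, n_features=3 * 32 * 32, n_hidden=512, n_classes=10):
        super().__init__()
        self.flatten = nn.Flatten()  # (batch, 3, 32, 32) -> (batch, 3072)
        self.stack = nn.Sequential(
            nn.Linear(n_features, n_hidden),  # every input feeds every hidden unit
            nn.ReLU(),
            nn.Linear(n_hidden, n_classes),   # every hidden unit feeds every output
        )

    def forward(self, x):
        return self.stack(self.flatten(x))

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = OneHiddenLayer().to(device)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # (3072 * 512 + 512) + (512 * 10 + 10) = 1578506
```

With this assumed hidden width, almost all of the parameters sit in the first `Linear` layer, which connects each of the 3072 input numbers to each of the 512 hidden units.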

This time the number of model parameters is much larger: 

Let's go ahead and train...

...and test! 

## Third Model: Convolutional Neural Net

Fully connected networks with hidden layers are *generalists*: they do their best to fit to the data using minimal assumptions about how the data is structured. For this reason, they are often decent at many tasks, but are often outperformed by networks with more specialized layers that are adapted to the specifics of the task at hand. Image data, for example, is *spatial*. In order to address the spatial nature of images, it is useful to incorporate layers that explicitly account for spatial structure. 

One of the most common types of layers is a *convolutional* layer. The idea of an image convolution is pretty simple. We define a square *kernel matrix* containing some numbers, and we "slide it over" the input data. At each location, we multiply the data values by the kernel matrix values, and add them together. Here's an illustrative diagram:

![](https://d2l.ai/_images/correlation.svg)

*Image from [Dive Into Deep Learning](https://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html)*

In this example, the value of 19 is computed as $0\times 0 + 1\times 1 + 3\times 2 + 4\times 3 = 19$. 
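The same computation can be sketched in a few lines of NumPy. The helper name `slide_kernel` is hypothetical, and the input and kernel values are assumed to be the ones in the diagram (they are consistent with the calculation of 19 above).

```python
import numpy as np

def slide_kernel(data, kernel):
    """Slide `kernel` over `data` (no padding), multiplying and summing at each location."""
    k_rows, k_cols = kernel.shape
    out_rows = data.shape[0] - k_rows + 1
    out_cols = data.shape[1] - k_cols + 1
    out = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            # the patch of data currently underneath the kernel
            window = data[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(window * kernel)
    return out

data = np.arange(9).reshape(3, 3)    # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
kernel = np.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]]

result = slide_kernel(data, kernel)
print(result)  # top-left entry is 0*0 + 1*1 + 3*2 + 4*3 = 19
```

Note that this sketch does not flip the kernel before sliding it, whereas `scipy.signal.convolve2d` does. For a kernel that looks the same after flipping, the two agree.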

Historically, kernel matrices were designed by hand for specific purposes and applied to images. For example, here's a greyscale image alongside the result of convolving it with an *edge-detection kernel*. You can see that the resulting convolved image is darker (larger values in each pixel) in the places where different patches of color meet. 


In [None]:
from PIL import Image
import urllib.request
from scipy.signal import convolve2d

# get the image
def read_image(url):
    return np.array(Image.open(urllib.request.urlopen(url)))

url = "https://i.pinimg.com/originals/0e/d0/23/0ed023847cad0d652d6371c3e53d1482.png"

img = read_image(url)

# convert it to greyscale
def to_greyscale(im):
    return 1 - np.dot(im[...,:3], [0.2989, 0.5870, 0.1140])

img = to_greyscale(img)

# perform kernel convolution
kernel = np.array([[-1, -1, -1], 
                   [-1,  8, -1], 
                   [-1, -1, -1]])

convd = convolve2d(img, kernel)

# visualize
fig, axarr = plt.subplots(1, 2, figsize = (5, 10))

axarr[0].imshow(img, cmap = "Greys")
axarr[0].axis("off")

axarr[1].imshow(convd, cmap = "Greys", vmin = 0, vmax = 8)
viz = axarr[1].axis("off")

However, in the modern approach to learning from images, we don't bother designing our own kernels. Instead, we *learn them* from the data! The reason this is possible is that the kernel convolution operation still corresponds to 

1. Multiplying some pairs of numbers together and
2. Adding up the products. 
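As a small sketch of why these two steps amount to a matrix product, the example below rewrites a 2×2 kernel sliding over a 3×3 input as a single matrix-vector multiplication. The helper name `kernel_to_matrix` is hypothetical, and the input and kernel values are assumed to match the worked example above.

```python
import numpy as np

def kernel_to_matrix(kernel, in_shape):
    """Build the matrix M such that M @ data.ravel() equals the sliding-kernel output."""
    k_rows, k_cols = kernel.shape
    n_rows, n_cols = in_shape
    out_rows, out_cols = n_rows - k_rows + 1, n_cols - k_cols + 1
    M = np.zeros((out_rows * out_cols, n_rows * n_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            for a in range(k_rows):
                for b in range(k_cols):
                    # output (i, j) uses input pixel (i + a, j + b) with weight kernel[a, b]
                    M[i * out_cols + j, (i + a) * n_cols + (j + b)] = kernel[a, b]
    return M

data = np.arange(9).reshape(3, 3)    # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
kernel = np.arange(4).reshape(2, 2)  # [[0, 1], [2, 3]]

M = kernel_to_matrix(kernel, data.shape)
flat_output = M @ data.ravel()
print(flat_output.reshape(2, 2))  # first entry is 19, as in the worked example
```

Most entries of `M` are zero, and each row reuses the same four kernel values. That reuse is why a convolutional layer needs so few parameters.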

Although the notation gets a little complicated, this is still just matrix multiplication!! So, we can represent convolutional layers within our framework just by carefully engineering these matrices. We won't worry about the details here, but instead will go straight into using convolutional layers in a model: 
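A sketch of one possible convolutional model is below, loosely following the architecture in the PyTorch tutorial that these notes are based on. The layer names `conv1` and `conv2` match the names used later in these notes; the class name `ConvNet`, the channel counts (16 and 32), and the sizes of the fully connected layers are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, 5)   # 3 color channels in, 16 learned 5x5 kernels
        self.conv2 = nn.Conv2d(16, 32, 5)  # 16 channels in, 32 learned 5x5 kernels
        self.pool = nn.MaxPool2d(2, 2)     # 2x2 max pooling, halves height and width
        self.fc1 = nn.Linear(32 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))  # (3, 32, 32) -> (16, 28, 28) -> (16, 14, 14)
        x = self.pool(F.relu(self.conv2(x)))  # -> (32, 10, 10) -> (32, 5, 5)
        x = torch.flatten(x, 1)               # flatten everything except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)                    # raw scores (logits) for the 10 classes

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model = ConvNet().to(device)

n_params = sum(p.numel() for p in model.parameters())
print(n_params)  # 121182
```

Each convolutional layer here is followed by a ReLU and a max pooling step.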

What does max pooling do? You can think of it as a kind of "summarization" step in which we intentionally make the current output somewhat "blockier." Technically, it involves sliding a window over the current batch of data and picking only the largest element within that window. Here's an example of how this looks:

![](https://computersciencewiki.org/images/8/8a/MaxpoolSample2.png)

*Image credit: Computer Science Wiki*
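Max pooling is also easy to sketch by hand. The example below applies 2×2 max pooling with non-overlapping windows to a small array; the helper name `max_pool` and the data values are hypothetical and are not taken from the figure.

```python
import numpy as np

def max_pool(data, size=2):
    """Non-overlapping max pooling of a 2D array whose sides are divisible by `size`."""
    n_rows, n_cols = data.shape
    # split the array into (size x size) blocks, then take the max within each block
    blocks = data.reshape(n_rows // size, size, n_cols // size, size)
    return blocks.max(axis=(1, 3))

data = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]])

pooled = max_pool(data)
print(pooled)  # [[6 5]
               #  [7 9]]
```

The output has half the height and half the width of the input, and each entry keeps only the largest value from its 2×2 block.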



Although convolutional neural networks seem more complicated, one of their key advantages is that using a small set of kernels in each layer actually reduces the number of parameters compared to a fully connected model. For example, our model here has more layers, but has fewer parameters than our previous model with a single hidden layer: 
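To make the comparison concrete at the level of a single layer, the sketch below counts the parameters of one convolutional layer and one fully connected layer that both act on a 3×32×32 image. The layer sizes (16 kernels of size 5×5, 512 hidden units) and the helper name `count_parameters` are assumptions chosen for illustration.

```python
import torch.nn as nn

def count_parameters(layer):
    """Total number of learnable numbers (weights and biases) in a layer."""
    return sum(p.numel() for p in layer.parameters())

conv_layer = nn.Conv2d(3, 16, 5)           # 16 kernels, each of shape 3 x 5 x 5
dense_layer = nn.Linear(3 * 32 * 32, 512)  # every input connected to every hidden unit

conv_params = count_parameters(conv_layer)    # 16 * 3 * 5 * 5 + 16 = 1216
dense_params = count_parameters(dense_layer)  # 3072 * 512 + 512 = 1573376

print(conv_params, dense_params)
```

The convolutional layer reuses the same kernels at every location in the image, so its parameter count does not depend on the image size at all.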

Furthermore, this model can obtain substantially better performance, because the parameters that it does have are designed for the spatial nature of the image classification task: 

Let's go ahead and take a look at some of the model predictions. We'll show the image, the true label, and the model's prediction. 

In [None]:
n_rows = 3

fig, axarr = plt.subplots(n_rows, batch_size, figsize = (10, 7))

tl = iter(testloader)

for i in range(n_rows):

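    # extract model predictions: the predicted class is the one with the highest score
    X, y = next(tl)
    X, y = X.to(device), y.to(device)
    with torch.no_grad():
        pred_y = torch.argmax(model(X), dim = 1)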

    # populate a row with the images in the batch
    for j in range(batch_size):
        img = np.moveaxis(X[j].to("cpu").numpy(), 0, 2) 
        axarr[i, j].imshow((img + 1)/2)
        axarr[i, j].axis("off")
        axarr[i, j].set(title = f"{classes[int(y[j])]} (pred = {classes[int(pred_y[j])]})")

As we observe, the model is right much more frequently than we would expect by chance, but still makes plenty of errors. 

## Inspecting Learned Features

It's possible to extract the outputs from intermediate layers of the model. Doing this can sometimes help us get some understanding of what features the model has learned from the data that help it to perform the classification task. It's important not to overinterpret these. 

The function below helps extract these hidden layer outputs. I retrieved this function from [this forum post](https://discuss.pytorch.org/t/how-can-i-extract-intermediate-layer-output-from-loaded-cnn-model/77301/2). 

In [None]:
activation = {}
def get_activation(name):
    def hook(model, input, output):
        activation[name] = output.detach()
    return hook

In [None]:
model.conv1.register_forward_hook(get_activation('conv'))
model.conv2.register_forward_hook(get_activation('conv2'))

Let's pick an image and call our model on it to obtain an output. 

In [None]:
X, y = next(tl) 
X, y = X.to(device), y.to(device)

X = X[:1, :, :, :]
output = model(X)

Now let's visualize our image alongside activations from each of the two layers. 

In [None]:
fig, axarr = plt.subplots(3, 8, figsize = (10, 4))

axarr[0, 0].imshow((np.moveaxis(X[0].to("cpu").numpy(), 0, 2) + 1) / 2)
axarr[0, 0].set_title("Original Image")

for ax in axarr.ravel():
    ax.axis("off")

for i in range(2):
    for j in range(8):
        
        if i == 0: 
            show = "conv"
            title = "First-Layer Activations"
        else:
            show = "conv2"
            title = "Second-Layer Activations"
        
        im_num = i*4 + j

        axarr[i+1, j].imshow(activation[show].to("cpu").numpy()[0,im_num], cmap = "Greys")
        if j == 0:
            axarr[i+1, j].set_title(title)

        

Again, it's important not to read too much into these activations, but it can be fun to get a little peek into how the model "looks" at the data. 