In [2]:
import sys
sys.path.insert(0, "../..")

import torch
import torch.nn as nn
from src.data import make_dataset
from pathlib import Path

Let's get some data.

In [2]:
datadir = Path("../../data/raw/")
train_dataloader, test_dataloader = make_dataset.get_MNIST(datadir, batch_size=64) 

In [3]:
len(train_dataloader), len(test_dataloader)

(938, 157)

We can obtain an item:

In [4]:
x, y = next(iter(train_dataloader))
x.shape, y.shape

(torch.Size([64, 1, 28, 28]), torch.Size([64]))

The images follow the channels-first convention: (channel, height, width). With the batch dimension in front, `x` has shape (batch, channel, height, width). Each label is an integer.

Let's pull this through a Conv2d layer:

In [5]:
conv = nn.Conv2d(
    in_channels=1, 
    out_channels=32,
    kernel_size=3,
    padding=(1,1))
out = conv(x)
out.shape

torch.Size([64, 32, 28, 28])

What is happening here? Can you explain all the parameters and relate them to the output shape?

Let's see what happens if we change the padding:

In [6]:
conv = nn.Conv2d(
    in_channels=1, 
    out_channels=32,
    kernel_size=3,
    padding=(0,0))
out = conv(x)
out.shape

torch.Size([64, 32, 26, 26])

And if we change the stride from the default 1 to 2:

In [7]:
conv = nn.Conv2d(
    in_channels=1, 
    out_channels=32,
    kernel_size=3,
    padding=(1,1),
    stride=2)
out = conv(x)
out.shape

torch.Size([64, 32, 14, 14])
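
The three output shapes above follow from the usual output-size relation for a convolution along one spatial dimension, out = floor((in + 2 * padding - kernel_size) / stride) + 1, assuming a dilation of 1. A minimal sketch in plain Python; `conv_out_size` is a hypothetical helper name, not part of PyTorch:

```python
def conv_out_size(size: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Output size of a convolution along one spatial dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# The three configurations tried above, applied to a 28x28 image
print(conv_out_size(28, kernel_size=3, padding=1))            # 28
print(conv_out_size(28, kernel_size=3, padding=0))            # 26
print(conv_out_size(28, kernel_size=3, padding=1, stride=2))  # 14
```

The number of output channels does not appear in the formula: `out_channels` only sets how many filters (and thus activation maps) the layer produces.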

As you can see, you need to think about what is going in and out of the convolution. We can stitch multiple layers together like this:

In [8]:
convolutions = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
    nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=0),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),
)
out = convolutions(x)
out.shape

torch.Size([64, 32, 2, 2])

As you can see, the feature maps have become really small. You need to take this into account: if we had started with a smaller image, we could get errors...

In [9]:
x_too_small = torch.rand((32, 1, 12, 12))

try:
    convolutions(x_too_small)
except RuntimeError as err:
    print("ERROR:", err)

ERROR: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
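
This error can be predicted without running the model, by tracing one spatial dimension through the stack. The sketch below hard-codes the three Conv2d/MaxPool2d pairs from `convolutions` and assumes MaxPool2d(kernel_size=2) halves the size with floor division (its default stride equals the kernel size). `trace_sizes` is a hypothetical helper name:

```python
def trace_sizes(size: int) -> list:
    """Trace one spatial dimension through the three conv/pool pairs used above."""
    # (kernel_size, padding) of the three Conv2d layers; stride is 1 for all of them
    convs = [(3, 1), (3, 0), (3, 0)]
    sizes = [size]
    for kernel_size, padding in convs:
        padded = size + 2 * padding
        if padded < kernel_size:
            raise ValueError(
                f"padded input size {padded} is smaller than kernel size {kernel_size}"
            )
        size = padded - kernel_size + 1  # Conv2d with stride 1
        sizes.append(size)
        size = size // 2                 # MaxPool2d(kernel_size=2)
        sizes.append(size)
    return sizes


print(trace_sizes(28))  # [28, 28, 14, 12, 6, 4, 2]

try:
    trace_sizes(12)
except ValueError as err:
    print("ERROR:", err)  # the third convolution receives a 2x2 map
```

For a 12x12 image the sizes run 12, 12, 6, 4, 2, so the third 3x3 convolution has nothing left to slide over.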


At this point our `out` has 32 activation maps, each of size 2x2.

If we want to pass the activation maps through a dense layer, we need to flatten them first (do you understand what happens if you don't?).

In [10]:
input_nn = nn.Flatten()(out)
input_nn.shape

torch.Size([64, 128])

Note that there are potential problems when connecting the image layers to the linear layers:
- Conv2d and MaxPool2d both expect 4-dimensional data (batch, channels/activation maps, height, width)
- Linear layers expect 2-dimensional data (batch, features)
- Linear layers won't crash if you feed them data with more dimensions! However, they only operate on the last dimension, and that's probably not what you want.
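
The last point is easy to check. A small sketch with random data standing in for `out`; the layer sizes are chosen only for illustration:

```python
import torch
import torch.nn as nn

activation_maps = torch.rand(64, 32, 2, 2)  # (batch, channels, height, width)

# Without flattening: Linear only transforms the last dimension (size 2)
linear_unflattened = nn.Linear(in_features=2, out_features=10)
unflattened_out = linear_unflattened(activation_maps)
print(unflattened_out.shape)  # torch.Size([64, 32, 2, 10])

# With flattening: every image becomes one vector of 32 * 2 * 2 = 128 features
linear_flattened = nn.Linear(in_features=128, out_features=10)
flattened_out = linear_flattened(nn.Flatten()(activation_maps))
print(flattened_out.shape)  # torch.Size([64, 10])
```

In the first case no error is raised, but the layer mixes only the two values along the width of each map, so the output is not one prediction vector per image.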

This means we need to somehow transform the 4D data into 2D. There are some options here:
- Some sort of aggregation: the activation maps are typically small (e.g. 2x2) and indicate whether the filter has detected a feature. There are many ways to aggregate them: mean, max, min, sum, etc.
- Flatten: a flatten layer simply transforms (batch, C, H, W) into (batch, C * H * W). Say you have (32, 32, 2, 2); after a flatten you end up with (32, 128). The problem is that when you use a different number of Conv2d layers, or a different stride or padding, you end up with a different activation map size, e.g. (32, 32, 3, 3), which gives 32 * 3 * 3 = 288 features.
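
To make the difference between the two options concrete, here is a small sketch on random tensors with the two map sizes mentioned in the list. The variable names are only illustrative:

```python
import torch

maps_2x2 = torch.rand(32, 32, 2, 2)
maps_3x3 = torch.rand(32, 32, 3, 3)

shapes = []
for maps in (maps_2x2, maps_3x3):
    flat = maps.flatten(start_dim=1)  # feature count depends on the map size
    mean = maps.mean(dim=(2, 3))      # one average per activation map
    peak = maps.amax(dim=(2, 3))      # one maximum per activation map
    shapes.append((tuple(flat.shape), tuple(mean.shape), tuple(peak.shape)))
    print(flat.shape, mean.shape, peak.shape)
```

Flattening yields 128 features for the 2x2 maps and 288 for the 3x3 maps, while both aggregations yield 32 features regardless of the map size.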

I have solved this problem by calculating the size of the activation map with the `_conv_test` method. Once I know the size of the map (e.g. (2, 2)), I can create an AvgPool2d layer that takes the average over the whole (2, 2) map. This way you always end up with (batch, filters, 1, 1), and after the flatten this becomes filters * 1 * 1 features, which is exactly the number of filters.

In [12]:
avgpool = nn.AvgPool2d((2,2))
pooled = avgpool(out)
pooled.shape

torch.Size([64, 32, 1, 1])

If we flatten this, we obtain 32 * 1 * 1 = 32 numbers, exactly the number of filters. That makes designing your model a bit easier (and you might also argue that taking the average is a good approach in terms of model logic).

Let's combine it all and add a `_conv_test` method to determine the right kernel size for the AvgPool2d layer.
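
As an aside, PyTorch also ships `nn.AdaptiveAvgPool2d`, where you specify the output size instead of the kernel size. This is an alternative, not the approach used in this notebook; the sketch below only shows that it gives a 1x1 map for several input sizes:

```python
import torch
import torch.nn as nn

# The output size is fixed at 1x1; the pooling window follows from the input size
adaptive_agg = nn.AdaptiveAvgPool2d(output_size=1)

pooled_shapes = []
for size in (2, 3, 7):
    maps = torch.rand(64, 32, size, size)
    pooled = adaptive_agg(maps)
    pooled_shapes.append(tuple(pooled.shape))
    # Same values as averaging over the two spatial dimensions
    assert torch.allclose(pooled.flatten(start_dim=1), maps.mean(dim=(2, 3)), atol=1e-6)

print(pooled_shapes)  # three times (64, 32, 1, 1)
```

Calculating the size explicitly has the advantage that you see (and can log) how small the activation maps have become.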

In [13]:
import torch
from torch import nn
from loguru import logger

# Get cpu or gpu device for training.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

# Define model
class CNN(nn.Module):
    def __init__(self, filters, units1, units2, input_size=(32, 1, 28, 28)):
        super().__init__()

        self.convolutions = nn.Sequential(
            nn.Conv2d(1, filters, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=0),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),
        )

        activation_map_size = self._conv_test(input_size)
        logger.info(f"Aggregating activationmap with size {activation_map_size}")
        self.agg = nn.AvgPool2d(activation_map_size)

        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(filters, units1),
            nn.ReLU(),
            nn.Linear(units1, units2),
            nn.ReLU(),
            nn.Linear(units2, 10)
        )

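    def _conv_test(self, input_size=(32, 1, 28, 28)):
        """Pass a dummy batch through the convolutions and return the
        (height, width) of the resulting activation maps."""
        x = torch.ones(input_size)
        with torch.no_grad():  # shape probe only, no gradients needed
            x = self.convolutions(x)
        return x.shape[-2:]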

    def forward(self, x):
        x = self.convolutions(x)
        x = self.agg(x)
        logits = self.dense(x)
        return logits

model = CNN(filters=16, units1=128, units2=64).to(device)
from torchsummary import summary
summary(model, input_size=(1, 28, 28))

2023-05-09 10:27:37.533 | INFO     | __main__:__init__:27 - Aggregating activationmap with size torch.Size([2, 2])


Using cpu device
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1           [-1, 16, 28, 28]             160
              ReLU-2           [-1, 16, 28, 28]               0
         MaxPool2d-3           [-1, 16, 14, 14]               0
            Conv2d-4           [-1, 16, 12, 12]           2,320
              ReLU-5           [-1, 16, 12, 12]               0
         MaxPool2d-6             [-1, 16, 6, 6]               0
            Conv2d-7             [-1, 16, 4, 4]           2,320
              ReLU-8             [-1, 16, 4, 4]               0
         MaxPool2d-9             [-1, 16, 2, 2]               0
        AvgPool2d-10             [-1, 16, 1, 1]               0
          Flatten-11                   [-1, 16]               0
           Linear-12                  [-1, 128]           2,176
             ReLU-13                  [-1, 128]               0
           Linear-14  

We have just under 16k parameters. You will always need to judge that relative to your input data:

- How many observations do you have?
- Maybe even more important: how many features do you have? Images sized 28x28 need much less complexity than images sized 224x224 (note that the first has 784 features, the second more than 50,000!)
- Do you think the model needs a lot of complexity, or not so much? E.g. classifying whether or not there is a stamp on a piece of paper is much easier than estimating the age of a face.
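
The parameter count can be reproduced by hand from the layer definitions. A sketch in plain Python for the model created above (filters=16, units1=128, units2=64); `conv2d_params` and `linear_params` are hypothetical helper names:

```python
def conv2d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    # One kernel_size x kernel_size kernel per (input, output) channel pair,
    # plus one bias per output channel
    return in_channels * out_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    # One weight per (input, output) pair, plus one bias per output
    return in_features * out_features + out_features


filters, units1, units2 = 16, 128, 64
counts = {
    "conv1": conv2d_params(1, filters, 3),
    "conv2": conv2d_params(filters, filters, 3),
    "conv3": conv2d_params(filters, filters, 3),
    "linear1": linear_params(filters, units1),
    "linear2": linear_params(units1, units2),
    "linear3": linear_params(units2, 10),
}
total_params = sum(counts.values())
print(counts)
print(total_params)  # 15882
```

The first four values (160, 2,320, 2,320 and 2,176) match the summary above. On a real model you can cross-check the total with `sum(p.numel() for p in model.parameters())`.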

Also think about the trade-off between adding complexity and reducing it.

Try to weigh this trade-off in terms of:

- speed
- generalization
- accuracy

E.g. 512 filters might add 0.1% accuracy but double the training time. Is that worth it? Often not...

We need to tell the model how well it is performing. To do that, we need to pick a loss function $\mathcal{L}$. We will discuss this in more depth later, but for now, just take my word for it that CrossEntropyLoss is a good pick.
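
To get a feel for what this loss measures: for a single observation, cross-entropy on logits is the negative log of the softmax probability assigned to the correct class. The sketch below is a plain-Python illustration of that definition, not PyTorch's implementation; `cross_entropy` is a hypothetical helper name:

```python
import math


def cross_entropy(logits: list, target: int) -> float:
    """Negative log-softmax probability of the target class for one observation."""
    largest = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    return log_sum_exp - logits[target]


# A model that is undecided between 10 classes scores ln(10), about 2.3026
undecided_loss = cross_entropy([0.0] * 10, target=3)
print(undecided_loss)

# Confident and correct gives a small loss, confident and wrong a large one
confident = [0.0] * 10
confident[3] = 5.0
correct_loss = cross_entropy(confident, target=3)
wrong_loss = cross_entropy(confident, target=7)
print(correct_loss, wrong_loss)
```

A loss close to ln(10) at the start of training on 10 classes therefore suggests the model is still guessing.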

In [17]:
import torch.optim as optim
from src.models import metrics, train_model
optimizer = optim.Adam
loss_fn = torch.nn.CrossEntropyLoss()
accuracy = metrics.Accuracy()

In [18]:
yhat = model(x)
accuracy(y, yhat)

tensor(0.1250)
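
An accuracy of 0.125 is what you would expect from an untrained model: with 10 classes, guessing gives roughly 0.1. The implementation of `metrics.Accuracy` is not shown here; the sketch below assumes it compares the class with the highest logit to the label. `accuracy_from_logits` is a hypothetical name, and the argument order (y, yhat) mirrors the call above:

```python
import torch


def accuracy_from_logits(y: torch.Tensor, yhat: torch.Tensor) -> torch.Tensor:
    """Fraction of observations whose highest logit matches the label."""
    predicted = yhat.argmax(dim=1)
    return (predicted == y).float().mean()


y_demo = torch.tensor([0, 1, 2, 3])
yhat_demo = torch.tensor([
    [2.0, 0.1, 0.1, 0.1],  # predicts 0, correct
    [0.1, 2.0, 0.1, 0.1],  # predicts 1, correct
    [2.0, 0.1, 0.1, 0.1],  # predicts 0, wrong
    [0.1, 0.1, 0.1, 2.0],  # predicts 3, correct
])
demo_accuracy = accuracy_from_logits(y_demo, yhat_demo)
print(demo_accuracy)  # tensor(0.7500)
```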

We now have everything we need to train the model.

In [None]:
model = train_model.trainloop(
    epochs=10,
    model=model,
    optimizer=optimizer,
    learning_rate=1e-3,
    loss_fn=loss_fn,
    metrics=[accuracy],
    train_dataloader=train_dataloader,
    test_dataloader=test_dataloader,
    log_dir="../../models/test/",
    tunewriter=["tensorboard"],
    train_steps=len(train_dataloader),
    eval_steps=len(test_dataloader),
)
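
Note that `optimizer` is passed as a class (`optim.Adam`), not as an instance, together with a separate `learning_rate`. The internals of `train_model.trainloop` are not shown here; the sketch below is a generic PyTorch training loop under the assumption that the optimizer is instantiated inside the loop function. The tiny model and random batches are stand-ins so the example runs on its own:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-ins for the real model and dataloader (illustration only)
demo_model = nn.Linear(4, 3)
demo_batches = [(torch.rand(8, 4), torch.randint(0, 3, (8,))) for _ in range(5)]

optimizer_class = optim.Adam  # a class, as in the cell above
demo_loss_fn = nn.CrossEntropyLoss()
demo_optimizer = optimizer_class(demo_model.parameters(), lr=1e-3)  # instantiated here

epoch_losses = []
for epoch in range(2):
    demo_model.train()
    running_loss = 0.0
    for x_batch, y_batch in demo_batches:
        demo_optimizer.zero_grad()  # reset gradients from the previous step
        loss = demo_loss_fn(demo_model(x_batch), y_batch)
        loss.backward()             # backpropagate the loss
        demo_optimizer.step()       # update the weights
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(demo_batches))
    print(f"epoch {epoch}: mean train loss {epoch_losses[-1]:.4f}")
```

A full loop would also evaluate on the test set each epoch and log the metrics, which is presumably what the `metrics`, `test_dataloader` and `log_dir` arguments are for.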