## Convolutional Neural Networks

Training a CNN is not much different from training an MLP; you just have to change:

- the way data are loaded;
- the structure of the model.

Let's train a CNN on MNIST.

In [None]:
import torch
from matplotlib import pyplot as plt
from torchvision import datasets
import torchvision.transforms.functional as TF

In [None]:
# Load MNIST dataset
train_dataset = datasets.MNIST(root="..", download=True, train=True)
test_dataset = datasets.MNIST(root="..", download=True, train=False)

In [None]:
# Dataset interface: len
print(f"Num. training samples: {len(train_dataset)}")
print(f"Num. test samples:     {len(test_dataset)}")

In [None]:
# Compute dataset sizes
num_train = len(train_dataset)
num_test = len(test_dataset)

Let's split our data into training, validation and test sets.

In [None]:
# List of indexes on the training set
train_idx = list(range(num_train))

In [None]:
# List of indexes on the test set
test_idx = list(range(num_test))

In [None]:
# Import
import random

In [None]:
# Shuffle training set
random.shuffle(train_idx)

In [None]:
# Validation fraction
val_frac = 0.1
# Compute number of samples
num_val = int(num_train*val_frac)
num_train = num_train - num_val
# Split training set
val_idx = train_idx[num_train:]
train_idx = train_idx[:num_train]
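The shuffle above gives a different split on every run. If a reproducible split is preferred, one possible sketch is to wrap the same steps in a function that uses a seeded random generator. The helper name `split_indices` and the `seed` parameter are assumptions made for this illustration, not part of the original notebook.

```python
import random

def split_indices(num_samples, val_frac, seed=0):
    """Return (train_idx, val_idx) as disjoint, shuffled lists of indexes."""
    # A private generator keeps the split reproducible without touching the global seed
    rng = random.Random(seed)
    idx = list(range(num_samples))
    rng.shuffle(idx)
    # Same rounding rule as above: the validation set gets int(N * val_frac) samples
    num_val = int(num_samples * val_frac)
    return idx[num_val:], idx[:num_val]

train_idx_demo, val_idx_demo = split_indices(60000, 0.1)
print(len(train_idx_demo), len(val_idx_demo))
```

Calling the function twice with the same seed returns the same two lists, which makes experiments easier to compare.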

### `DataLoader`

The `DataLoader` class in `torch.utils.data` includes several useful functionalities for loading data.

The constructor receives several arguments; the most important ones are:

- `dataset`: input dataset;
- `batch_size`: `DataLoader` automatically groups samples into batches (exactly as we did in the exercise);
- `shuffle`: boolean for shuffling the dataset at each epoch. This is usually a good idea in training, while it doesn't matter for validation/test;
- `num_workers`: number of background worker processes used for data loading (`0` means that data is loaded in the main process). This is useful when your computation time is large: while the model is busy, the workers prepare the next batches so that they are ready when the CPU/GPU is ready;
- `drop_last`: when the number of samples is not divisible by the batch size, the last batch will have fewer than `batch_size` elements. This parameter specifies whether that incomplete batch is dropped. Note that if `drop_last` is true and `shuffle` is false, the last samples in the dataset will never be used.
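To make the effect of `batch_size` and `drop_last` concrete, here is a small sketch on a toy dataset. The `TensorDataset` with 10 random samples is an assumption used only for illustration; it stands in for a real dataset such as MNIST.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 10 samples with 3 features each, plus integer labels
toy_dataset = TensorDataset(torch.randn(10, 3), torch.arange(10))

# Same data and batch size, with and without the incomplete last batch
keep_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=False)
drop_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False, drop_last=True)

keep_sizes = [x.size(0) for x, _ in keep_loader]
drop_sizes = [x.size(0) for x, _ in drop_loader]
print(keep_sizes)  # [4, 4, 2]
print(drop_sizes)  # [4, 4]
```

With `drop_last=True` and no shuffling, the samples at indexes 8 and 9 are never returned.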

The problem with using `DataLoader` is that it requires a `Dataset` object. However, we only have two dataset instances, one for training and one for test, but three logical splits, because `train_dataset` is actually used -- with different indexes -- as both training set and validation set.

To solve this problem, we can use the `torch.utils.data.Subset` class, which takes as input a `Dataset` and returns a new `Dataset` containing only the samples at the specified indexes.
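As a quick illustration of how `Subset` remaps indexes, consider the following sketch. The toy `TensorDataset` and the index list `[7, 2, 5]` are assumptions chosen for the example.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Toy dataset where the label of sample i is simply i
toy_dataset = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10))

# Keep only three samples, in the given order
subset = Subset(toy_dataset, [7, 2, 5])

print(len(subset))  # 3
x, y = subset[0]    # position 0 of the subset maps to toy_dataset[7]
print(y.item())     # 7
```

The subset does not copy any data: it stores the index list and forwards each access to the original dataset.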

First of all, let's start by defining the transforms. This time, we will use a -1/1 representation (without standardization): each pixel is set to 1 if its value is greater than 0.5, and to -1 otherwise.

In [None]:
# Import module
import torchvision.transforms as T

In [None]:
# Define single transforms

# Note: transforms can also be regular functions
def normalize_(x):
    # Set values
    x[x > 0.5] = 1
    x[x <= 0.5] = -1
    # Return
    return x

to_tensor = T.ToTensor()
normalize = normalize_

In [None]:
# Compose transforms
transform = T.Compose([to_tensor, normalize])

In [None]:
# Load MNIST dataset with transforms
train_dataset = datasets.MNIST(root="..", download=True, train=True, transform=transform)
test_dataset = datasets.MNIST(root="..", download=True, train=False, transform=transform)

Now, let's split our training set into train and validation.

In [None]:
# Import
from torch.utils.data import DataLoader, Subset

In [None]:
# Split train_dataset into training and validation
val_dataset = Subset(train_dataset, val_idx)
train_dataset = Subset(train_dataset, train_idx)

Finally, let's create the data loaders for each split.

In [None]:
# Define loaders
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=0, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=8, num_workers=0, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=8, num_workers=0, shuffle=False)

In [None]:
# Define dictionary of loaders
loaders = {"train": train_loader,
           "val": val_loader,
           "test": test_loader}

### CNN models

The minimal layers we need for defining a CNN are:

#### Convolutional layer

`nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation)`

#### Non-linear activation

- `nn.ReLU()` (module, e.g. to put in `nn.Sequential`)
- `nn.functional.relu(x)` (function, e.g. to call in `forward()`)

#### Max pooling

- `nn.MaxPool2d(kernel_size, stride)` (module, e.g. to put in `nn.Sequential()`)
- `nn.functional.max_pool2d(x, kernel_size, stride)` (function, e.g. to call in `forward()`)

In [None]:
# Import
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

The main difficulty in choosing CNN parameters is computing the input size of the first fully-connected layer. It depends on the number of feature maps and on the size of each feature map at the last convolutional layer, which in turn depend on the kernel sizes, padding, strides and dilations of all the preceding convolutional and pooling layers.
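This dependency can also be worked out by hand. For both `nn.Conv2d` and `nn.MaxPool2d`, the output size along one spatial dimension is `floor((size + 2*padding - dilation*(kernel_size - 1) - 1) / stride) + 1`. The sketch below applies this formula to the layer configuration used by the model in this section (3x3 convolutions without padding, 2x2 max pooling with stride 2), starting from 28x28 MNIST images. The helper name `conv_out_size` is hypothetical.

```python
def conv_out_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output size along one spatial dimension for a convolution or pooling layer."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

size = 28                                 # MNIST images are 28x28
size = conv_out_size(size, 3)             # conv 3x3      -> 26
size = conv_out_size(size, 3)             # conv 3x3      -> 24
size = conv_out_size(size, 3)             # conv 3x3      -> 22
size = conv_out_size(size, 2, stride=2)   # max pool 2x2  -> 11
size = conv_out_size(size, 3)             # conv 3x3      -> 9
size = conv_out_size(size, 2, stride=2)   # max pool 2x2  -> 4

num_features = 256 * size * size          # 256 feature maps of size 4x4
print(size, num_features)                 # 4 4096
```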

The simplest thing to do is to add the convolutional layers first, see the output size, and then add the fully-connected layers.

In [None]:
# Define class
class CNN_tmp(nn.Module):
    
    # Constructor
    def __init__(self):
        # Call parent constructor
        super().__init__()
        # Create convolutional layers
        self.conv_layers = nn.Sequential(
            # Layer 1
            nn.Conv2d(1, 64, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            # Layer 2
            nn.Conv2d(64, 128, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            # Layer 3
            nn.Conv2d(128, 128, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layer 4
            nn.Conv2d(128, 256, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )

    # Forward
    def forward(self, x):
        return self.conv_layers(x)
    
# Create the model
model = CNN_tmp()
# Get input
test_x = train_dataset[0][0].unsqueeze(0)
# Try forward
out_size = model(test_x).size()
print(f"Out feature maps: {out_size} => out features: {out_size[1]*out_size[2]*out_size[3]}")

Now that we know the size of the encoded CNN features, let's add the fully-connected layers.

In [None]:
# Define class
class CNN(nn.Module):
    
    # Constructor
    def __init__(self):
        # Call parent constructor
        super().__init__()
        # Create convolutional layers
        self.conv_layers = nn.Sequential(
            # Layer 1
            nn.Conv2d(1, 64, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            # Layer 2
            nn.Conv2d(64, 128, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            # Layer 3
            nn.Conv2d(128, 128, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # Layer 4
            nn.Conv2d(128, 256, kernel_size=3, padding=0, stride=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        # Create fully-connected layers
        self.fc_layers = nn.Sequential(
            # FC layer
            nn.Linear(4096, 1024),
            nn.ReLU(),
            # Classification layer
            nn.Linear(1024, 10)
        )

    # Forward
    def forward(self, x):
        x = self.conv_layers(x) # Bx256x4x4 (then, we will want -> Bx4096)
        x = x.view(x.size(0), -1) # Bx4096
        x = self.fc_layers(x) # Bx4096 -> Bx1024 -> Bx10
        return x
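The reshape step in `forward()` can be checked in isolation. The sketch below uses a dummy tensor with the same shape as the convolutional output (the batch size of 8 is an arbitrary choice) and compares `view` with `nn.Flatten`, a module that performs the same flattening and could be placed inside an `nn.Sequential`.

```python
import torch
import torch.nn as nn

# Dummy convolutional features: batch of 8, 256 feature maps of size 4x4
features = torch.zeros(8, 256, 4, 4)

# Keep the batch dimension, merge all the others
flat_view = features.view(features.size(0), -1)

# nn.Flatten flattens every dimension except the first by default
flat_module = nn.Flatten()(features)

print(flat_view.shape)    # torch.Size([8, 4096])
print(flat_module.shape)  # torch.Size([8, 4096])
```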

### Model training

Training function (we now use `DataLoader`)

In [None]:
def train(epochs, lr=0.001):
    try:
        # Create model
        model = CNN()
        print(model)
        # Optimizer
        optimizer = optim.SGD(model.parameters(), lr=lr)
        # Initialize history
        history_loss = {"train": [], "val": [], "test": []}
        history_accuracy = {"train": [], "val": [], "test": []}
        # Process each epoch
        for epoch in range(epochs):
            # Initialize epoch variables
            sum_loss = {"train": 0, "val": 0, "test": 0}
            sum_accuracy = {"train": 0, "val": 0, "test": 0}
            # Process each split
            for split in ["train", "val", "test"]:
                # Process each batch
                for (input, labels) in loaders[split]:
                    # Reset gradients
                    optimizer.zero_grad()
                    # Compute output
                    pred = model(input)
                    loss = F.cross_entropy(pred, labels)
                    # Update loss
                    sum_loss[split] += loss.item()
                    # Check parameter update
                    if split == "train":
                        # Compute gradients
                        loss.backward()
                        # Optimize
                        optimizer.step()
                    # Compute accuracy
                    pred_labels = pred.argmax(1)
                    batch_accuracy = (pred_labels == labels).sum().item()/input.size(0)
                    # Update accuracy
                    sum_accuracy[split] += batch_accuracy
            # Compute epoch loss/accuracy
            epoch_loss = {split: sum_loss[split]/len(loaders[split]) for split in ["train", "val", "test"]}
            epoch_accuracy = {split: sum_accuracy[split]/len(loaders[split]) for split in ["train", "val", "test"]}
            # Update history
            for split in ["train", "val", "test"]:
                history_loss[split].append(epoch_loss[split])
                history_accuracy[split].append(epoch_accuracy[split])
            # Print info
            print(f"Epoch {epoch+1}:",
                  f"TrL={epoch_loss['train']:.4f},",
                  f"TrA={epoch_accuracy['train']:.4f},",
                  f"VL={epoch_loss['val']:.4f},",
                  f"VA={epoch_accuracy['val']:.4f},",
                  f"TeL={epoch_loss['test']:.4f},",
                  f"TeA={epoch_accuracy['test']:.4f},")
    except KeyboardInterrupt:
        print("Interrupted")
    finally:
        # Plot loss
        plt.title("Loss")
        for split in ["train", "val", "test"]:
            plt.plot(history_loss[split], label=split)
        plt.legend()
        plt.show()
        # Plot accuracy
        plt.title("Accuracy")
        for split in ["train", "val", "test"]:
            plt.plot(history_accuracy[split], label=split)
        plt.legend()
        plt.show()

In [None]:
# Train model
train(100)
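The training function above also builds the computation graph for validation and test batches, although no update happens on those splits. A possible alternative, sketched here under assumptions, is to evaluate inside `torch.no_grad()` with the model in evaluation mode. The helper name `evaluate`, the toy linear model and the random data are all hypothetical and only serve to make the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader):
    """Return (mean loss, accuracy) over a loader, without tracking gradients."""
    model.eval()  # matters for layers such as dropout or batch norm
    total_loss, total_correct, total_samples = 0.0, 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            pred = model(inputs)
            # Sum per-sample losses so that the final mean is exact
            total_loss += F.cross_entropy(pred, labels, reduction="sum").item()
            total_correct += (pred.argmax(1) == labels).sum().item()
            total_samples += inputs.size(0)
    return total_loss / total_samples, total_correct / total_samples

# Toy setup: 10 random samples, 4 features, 3 classes
toy_model = nn.Linear(4, 3)
toy_data = TensorDataset(torch.randn(10, 4), torch.randint(0, 3, (10,)))
toy_loader = DataLoader(toy_data, batch_size=4)

toy_loss, toy_accuracy = evaluate(toy_model, toy_loader)
print(f"loss={toy_loss:.4f}, accuracy={toy_accuracy:.4f}")
```

Unlike the function above, which averages per-batch values, this sketch averages over samples, so an incomplete last batch does not get extra weight.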

Training is a bit slow... Why?

In [None]:
# Number of training batches
print(f"Num. training batches: {len(train_loader)}")

In [None]:
# Let's get a batch
batch,labels = next(iter(train_loader))

In [None]:
# Print batch size (to check)
print(batch.size())

In [None]:
# How much time does it take to process a batch?
import time
# Let's create a model
model = CNN()

In [None]:
# Compute forward time
start_time = time.time()
out = model(batch)
loss = F.cross_entropy(out,labels)
end_time = time.time()
forward_time = end_time - start_time
print(f"Forward time: {forward_time:.4f} seconds")
# Compute backward time
start_time = time.time()
loss.backward()
end_time = time.time()
backward_time = end_time - start_time
print(f"Backward time: {backward_time:.5f} seconds")

In [None]:
# Recap
print(f"To process an epoch ({len(train_loader)} batches), it takes {len(train_loader)*(forward_time + backward_time)/60:.1f} minutes")
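A single measurement can be noisy, since the first call often pays one-off setup costs. One way to get a steadier estimate, sketched here as an assumption rather than as part of the original notebook, is to discard a warm-up call and average several repetitions with `time.perf_counter`. The helper name `average_time` is hypothetical, and the lambda is a placeholder workload standing in for a forward/backward pass.

```python
import time

def average_time(fn, repeats=5, warmup=1):
    """Average wall-clock time of fn() in seconds, after some warm-up calls."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

# Placeholder workload; in the notebook this could wrap the forward and backward steps
avg = average_time(lambda: sum(i * i for i in range(100_000)))
print(f"Average time: {avg:.5f} seconds")
```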

How to speed up computation? **CUDA**!

PyTorch supports moving tensors and models between CPU and GPU; the training code only needs to move the model and each batch to the chosen device.

In general, the best thing to do is:

`dev = ("cuda" if torch.cuda.is_available() else "cpu")`

In [None]:
# Select device
print(f"CUDA is available? {torch.cuda.is_available()}")
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(dev)
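As a minimal sketch of what moving to a device looks like, the example below sends a small layer and a random batch to `dev`. The layer and tensor sizes are arbitrary choices for illustration; the code falls back to the CPU when CUDA is not available.

```python
import torch
import torch.nn as nn

dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Module.to moves the parameters of the module and returns the module itself
layer = nn.Linear(4, 2).to(dev)

# Tensor.to returns a tensor on the target device, so the result must be kept
x = torch.randn(3, 4).to(dev)

# Model and input must live on the same device for this call to work
out = layer(x)
print(out.device)
```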

Then, we need to change our training function to move the model and the data to the target device.

In [None]:
def train(epochs, dev, lr=0.001):
    try:
        # Create model
        model = CNN()
        model = model.to(dev)
        print(model)
        # Optimizer
        optimizer = optim.SGD(model.parameters(), lr=lr)
        # Initialize history
        history_loss = {"train": [], "val": [], "test": []}
        history_accuracy = {"train": [], "val": [], "test": []}
        # Process each epoch
        for epoch in range(epochs):
            # Initialize epoch variables
            sum_loss = {"train": 0, "val": 0, "test": 0}
            sum_accuracy = {"train": 0, "val": 0, "test": 0}
            # Process each split
            for split in ["train", "val", "test"]:
                # Process each batch
                for (input, labels) in loaders[split]:
                    # Move to target device (GPU if available, CPU otherwise)
                    input = input.to(dev)
                    labels = labels.to(dev)
                    # Reset gradients
                    optimizer.zero_grad()
                    # Compute output
                    pred = model(input)
                    loss = F.cross_entropy(pred, labels)
                    # Update loss
                    sum_loss[split] += loss.item()
                    # Check parameter update
                    if split == "train":
                        # Compute gradients
                        loss.backward()
                        # Optimize
                        optimizer.step()
                    # Compute accuracy
                    _,pred_labels = pred.max(1)
                    batch_accuracy = (pred_labels == labels).sum().item()/input.size(0)
                    # Update accuracy
                    sum_accuracy[split] += batch_accuracy
            # Compute epoch loss/accuracy
            epoch_loss = {split: sum_loss[split]/len(loaders[split]) for split in ["train", "val", "test"]}
            epoch_accuracy = {split: sum_accuracy[split]/len(loaders[split]) for split in ["train", "val", "test"]}
            # Update history
            for split in ["train", "val", "test"]:
                history_loss[split].append(epoch_loss[split])
                history_accuracy[split].append(epoch_accuracy[split])
            # Print info
            print(f"Epoch {epoch+1}:",
                  f"TrL={epoch_loss['train']:.4f},",
                  f"TrA={epoch_accuracy['train']:.4f},",
                  f"VL={epoch_loss['val']:.4f},",
                  f"VA={epoch_accuracy['val']:.4f},",
                  f"TeL={epoch_loss['test']:.4f},",
                  f"TeA={epoch_accuracy['test']:.4f},")
    except KeyboardInterrupt:
        print("Interrupted")
    finally:
        # Plot loss
        plt.title("Loss")
        for split in ["train", "val", "test"]:
            plt.plot(history_loss[split], label=split)
        plt.legend()
        plt.show()
        # Plot accuracy
        plt.title("Accuracy")
        for split in ["train", "val", "test"]:
            plt.plot(history_accuracy[split], label=split)
        plt.legend()
        plt.show()

How do we test a model with CUDA? If you have a CUDA device, that's good!

Otherwise, you can use [Google Colab](https://colab.research.google.com).

It's a Jupyter-like environment that allows you to train models with GPU support. Just remember to set `Runtime` -> `Change runtime type` -> `Hardware accelerator` -> `GPU`.

Let's check our training times on CUDA, also trying different batch sizes.

In [None]:
# Redefine train loader, to test batch sizes
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=4, shuffle=True)
# Let's get a batch
batch,labels = next(iter(train_loader))
batch = batch.to(dev)
labels = labels.to(dev)
# Let's create a model
model = CNN()
model = model.to(dev)

In [None]:
# CUDA kernels run asynchronously: wait for them before reading the clock
def sync():
    if dev.type == "cuda":
        torch.cuda.synchronize()

# Compute forward time
sync()
start_time = time.time()
out = model(batch)
loss = F.cross_entropy(out,labels)
sync()
end_time = time.time()
forward_time = end_time - start_time
print(f"Forward time: {forward_time:.4f} seconds")
# Compute backward time
sync()
start_time = time.time()
loss.backward()
sync()
end_time = time.time()
backward_time = end_time - start_time
print(f"Backward time: {backward_time:.5f} seconds")

In [None]:
# Recap
print(f"To process an epoch ({len(train_loader)} batches), it takes {len(train_loader)*(forward_time + backward_time)/60:.1f} minutes")

In [None]:
# Recreate the loaders
train_loader = DataLoader(train_dataset, batch_size=64, num_workers=4, shuffle=True)
val_loader   = DataLoader(val_dataset,   batch_size=64, num_workers=4, shuffle=False)
test_loader  = DataLoader(test_dataset,  batch_size=64, num_workers=4, shuffle=False)
# Define dictionary of loaders
loaders = {"train": train_loader,
           "val": val_loader,
           "test": test_loader}

In [None]:
# Train model
train(100, dev, lr=0.01)
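Part of the speed-up comes from the larger batch size, which reduces the number of iterations per epoch. The arithmetic can be sketched as follows, assuming the 54000 training samples left after the 10% validation split and `drop_last=False`, in which case a loader yields `ceil(N / batch_size)` batches.

```python
import math

# 60000 MNIST training images minus the 10% validation split
num_train_samples = 54000

# Number of batches per epoch for the two batch sizes used in this section
batches_per_epoch = {bs: math.ceil(num_train_samples / bs) for bs in (8, 64)}
print(batches_per_epoch)  # {8: 6750, 64: 844}
```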