In [None]:
# SETUP CELL make sure to run this first.
!pip install -qq numpy==1.25.2
!pip install -qq matplotlib==3.7.3
!pip install -qq torch==2.0.1
!pip install -qq torchvision==0.15.2

!wget https://raw.githubusercontent.com/jc639/comp_vis_workshop/main/dataloading/__init__.py?_sm_au_=iVV23Qr5WPjQ728QpGsWvKttvN1NG -O dataloading.py

## Add the dataset first before starting this notebook!

Download the dataset zip folder and unzip with the following code cell:

In [None]:
!gdown https://drive.google.com/uc?id=1n637opG73CsPsAtC2VPbJM_GX-NVOVL0
!unzip -qq ./data.zip

# Running and Evaluating a Model

For this workshop we will train a model on some of the labelled data and look at the results. First, we will go through some other ways computer vision methods could be applied to the data and what each would require.

## What type of model to use?

Depending on what you want to achieve, there are different families of computer vision models designed for different tasks. The most common supervised machine learning tasks in computer vision are classification, object detection and segmentation. We will look in a bit more detail at what those tasks involve and when you would use them.

***

### Classification

Classification is the type of task we will complete with the data we have labelled. Typically, for a given image you want to produce a single label or multiple labels from a set of predetermined labels that fit your task. You are not concerned with where your classes are located in the image; you just want to categorise the whole image.

![classification.png](./images/classification.png)

Some use cases where you might want classification could include:
- Enhancing metadata for unsorted and unlabelled images by allowing you to determine what is in the images from a predetermined set of classes.
- Deciding whether images of objects from a manufacturing line contain defects or not.


Some models you might use for classification are:
- ResNet / ResNeXt
- EfficientNet
- ConvNeXt
- Vision Transformer (ViT)
- Swin Transformer

To see the available pretrained models from torchvision see here:

https://pytorch.org/vision/stable/models.html#classification

***

### Object Detection

For object detection we have a list of classes but we would like to be able to determine where they are located in images. Most commonly object detection is done by producing bounding boxes - the boxes have a classification assigned to them, and describe where that object is in the image. Given our dataset of animals with some relabelling we could frame the problem as an object detection task by drawing bounding boxes around the cats/dogs/birds - this may be useful if we need to know exactly where the animals are.

![object_detection.png](./images/object_detection.png)

Some other use cases of object detection:
- Automatic cropping of images to objects of interest. For example, in an optical character recognition (OCR) project we might want to first find the object with the text we want to read before we send it to an OCR model.
- Safety or crowdedness detection. For example, we could use a person detector to determine how crowded a location is.
- Automatic measurement in images. For example, we may want to measure the size of objects in images obtained from microscopes at a given magnification.

Some common models you can use are:
- YOLO
- SSD
- Faster R-CNN
- RetinaNet
- DETR

To see the available pretrained models from torchvision see here:

https://pytorch.org/vision/stable/models.html#object-detection

***

### Segmentation

Segmentation takes object detection a step further: the goal is to assign individual pixels to a class from a predetermined list of classes. This is useful when we want precise detections and measurements from our images. For example, take the pill dataset we saw in LabelStudio and imagine that pills were only discarded if more than 5% of a pill was scratched. To determine what percentage of the pill is defective we would need a good segmentation mask of the pill and the defect.
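To make the pill example concrete, the sketch below computes the defective fraction of a pill from a segmentation mask. The label convention (0 = background, 1 = intact pill, 2 = defect), the tiny 6x6 mask and the helper name `defect_fraction` are assumptions made for illustration; they are not part of the workshop data.

```python
import numpy as np

# Hypothetical label convention: 0 = background, 1 = intact pill surface, 2 = defect
BACKGROUND, PILL, DEFECT = 0, 1, 2

def defect_fraction(mask):
    """Fraction of pill pixels (intact + defective) that are labelled as defect."""
    pill_pixels = np.count_nonzero(mask != BACKGROUND)
    if pill_pixels == 0:
        return 0.0
    return np.count_nonzero(mask == DEFECT) / pill_pixels

# Hypothetical mask: a 4x4 pill with a single defective pixel
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 2, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
])

fraction = defect_fraction(mask)
print(f"{fraction:.1%} of the pill is defective")
print("discard" if fraction > 0.05 else "keep")
```

A bounding box could not give this number, because it would also count background pixels inside the box.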

![segmentation.png](./images/segmentation.png)

It's important to note there are two flavours of segmentation:
- Semantic -  cannot distinguish between different instances in the same category, i.e. all chairs are marked blue as in the example below.
- Instance - can distinguish between different instances of the same categories, i.e. different chairs are distinguished by different colours.

![instance_vs_semantic.png](./images/instance_vs_semantic.png)


Some common models for segmentation are:
- U-Net (semantic)
- FCN (semantic)
- Mask R-CNN (instance)

See here for some pretrained models from torchvision:

https://pytorch.org/vision/stable/models.html#semantic-segmentation

https://pytorch.org/vision/stable/models.html#instance-segmentation

## Training a model

For this workshop we will be training a classification model using a cat, dog and bird dataset as seen in the workshop on dataloaders. 

Here we have the classes:
- 'cat'
- 'dog'
- 'bird'

The goal of our model is, when given an image, to return which animal the image contains. First let's get started with the model. Here we will be using a resnet18, a type of neural network architecture designed for images. For the moment we don't need to understand the actual operations inside the model, only that for a single image it takes in an image tensor and transforms it into a vector of numbers that represents the prediction of the output class:

![model_throughput.png](./images/model_throughput.png)

1. We have an input image, which is really just a matrix of numbers that are the pixel values with **shape=(number of channels, height, width)**
2. This input goes through the model. We can really just think of the model as a function, but the key to training is that we update the function based on its error compared to the correct labels. 
3. The model produces an array of unnormalised scores (also known as logits). In this use case we have **three** classes so there are three scores, one per class (drawn as a column of **shape=(3, 1)** in the figure; in code the model returns them with shape `(batch size, 3)`). The positions in this array correspond to the label outputs: the first is the Bird node, the second the Dog node and the third the Cat node.
4. As we want to make a single classification (one label per image) we put the scores through the softmax function, which in code is `softmax(x) = np.exp(x) / np.sum(np.exp(x))` where `x` is the array of unnormalised scores. This function bounds the numbers in the array to between 0 and 1 and makes the array sum to 1. In this case the Cat node has the highest value and that is the model's prediction for this input.
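The softmax step can be sketched in a few lines of NumPy. The logit values below are made up for illustration, and subtracting the maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(x):
    """Turn a 1-D array of unnormalised scores (logits) into probabilities."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Hypothetical logits in the order described above: [bird, dog, cat]
logits = np.array([0.5, 1.2, 3.1])
probs = softmax(logits)

print(probs)           # three values between 0 and 1
print(probs.sum())     # sums to 1 (up to floating point error)
print(probs.argmax())  # index of the highest score, here 2 (the Cat node)
```

Softmax preserves the ordering of the scores, so the class with the largest logit is also the class with the largest probability.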

#### The model

Let's write some code to get us a model to start with. Here we are finetuning a pretrained model, which means we are taking a model that has been trained on some other dataset, modifying the number of output nodes and then doing some training on our own dataset. When a suitable pretrained model is available this is often a good way to start, as you need less data and training to get some results to work with.

In [None]:
import torch
from torch import nn
from torchvision.models import resnet18, resnet

class CVModel(nn.Module):

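    def __init__(self, n_classes):
        super().__init__()
        # pretrained resnet18 with its final (1000 class) layer removed
        self.backbone = nn.Sequential(*list(resnet18(weights=resnet.ResNet18_Weights.DEFAULT).children())[:-1])
        # backbone output is (batch, 512, 1, 1); flatten it to (batch, 512)
        self.flatten = nn.Flatten(start_dim=1)
        # new output layer with one node per class in our dataset
        self.classifier = nn.Linear(512, n_classes)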

    def forward(self, x):
        x = self.backbone(x)
        x = self.flatten(x)
        return self.classifier(x)
        

model = CVModel(n_classes=3)

To explain the code a little further: we have used a known architecture for the 'backbone' portion of the model, and we have instructed it to load the weights that were obtained by training on ImageNet, a common classification dataset. However, we don't want the final output layer from ImageNet as it has 1000 classes, so we have created a new final 'classifier' layer that outputs our desired number of classes. 

The `__init__` method defines the layers in our model and the `forward` method defines how they are applied to a batch of images. By defining a `forward` method we can then use our class like a function and apply `model(..)` to tensors representing a batch of images. 

### What is a convolutional neural network?

The backbone portion of our network is interchangeable. It simply has to take an image-shaped tensor and transform it through a series of mathematical operations (otherwise known as neural network layers) until we have a vector that can represent our class prediction. In this case we have chosen to use a model that comes from a group of architectures known as convolutional neural networks.

It is not really necessary to understand the model in detail, only that it processes an image and that it can be changed by training, but we want to briefly go over convolutions. 

A convolution is essentially a filter: we have a grid of values (for example a 3x3 array) known as a kernel that we slide over the pixel array. At each position we multiply the kernel values by the pixel values underneath and sum the results, and these summed values go into a new array, as shown below. This example shows a grayscale image with shape *64x64x1* as it is a bit easier to visualise with a single channel image:

![](./images/convlayer_detailedview_demo.gif)

<br>
<br>

For an RGB image with 3 channels it looks something like this. Here we actually apply a 3-dimensional kernel that is *3x3x3*:

![](./images/convlayer_overview_demo.gif)

The key is that the values of the convolutional kernels are updated in the training process, which allows them to get better at extracting meaningful features for the given task. To be useful we don't just train a single filter at a given layer, we want to use multiple kernels. The outputs of these kernels are then stacked and become the input for the next layer, which can be more convolutions!
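As a rough illustration of the sliding-kernel idea, here is a minimal NumPy sketch of a single-channel convolution with no padding and a stride of 1. Each output value is the sum of the element-wise product of the kernel and the patch under it (the operation deep learning frameworks such as PyTorch call convolution). The 5x5 image, the hand-written edge kernel and the function name `conv2d_single_channel` are assumptions for illustration; in a real network the kernel values are learned.

```python
import numpy as np

def conv2d_single_channel(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output array."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k_h, j:j + k_w]
            # multiply the patch by the kernel element-wise, then sum to one number
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical 5x5 grayscale image: dark on the left, bright on the right
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-written kernel that responds to dark-to-bright vertical edges
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

features = conv2d_single_channel(image, kernel)
print(features.shape)  # (3, 3): the output shrinks because there is no padding
print(features)        # large values where the kernel overlaps the edge, 0 in the flat region
```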

***

Let's run a 'test image', which has the shape the model expects, through the model. 

In [None]:
# why is the shape (1, 3, 128, 128) rather than n_channels (3), height (128), width (128)?
# the model expects batches and this is just a batch size of 1
# what happens to what is returned by the following code cell if you change 1 to 5?
test_img = torch.rand((1, 3, 128, 128))
print(test_img.shape)

In [None]:
# set model to evaluation mode - this affects some layers in the model
# and we should call this before we use the model outside of training
model.eval()

# what is this no_grad()?
# we are telling pytorch not to track gradients
# as it requires less memory and gradients are
# only needed during training
with torch.no_grad():
    logits = model(test_img)

print(logits.softmax(axis=1))
logits.softmax(axis=1).argmax(axis=1)

## Dataset and Dataloaders

As we saw in the previous lesson, we need to efficiently load and batch the data. We will mostly be copying what we did there but we will add a few more transforms. For the training dataset we will randomly flip the image on the horizontal axis with a 50% probability each time we load it, and then for both training and validation we need to convert the images to tensors. Additionally, as we are using a pretrained model, we need to scale the data according to the statistics that the model was originally trained with. 

First let's implement a function to get a Dataset and Dataloader from a folder with the structure we saw in the previous exercise. Here we are importing the `CustomDataset` class and `SquareImage` transform from the previous exercise.

In [None]:
import os
from dataloading import CustomDataset, SquareImage
from torch.utils.data import DataLoader

def get_dataloader(root_folder, batchsize=16, class_to_idx=None, transforms=None):
    """Return a (dataset, dataloader) pair for a folder with one sub-folder per class"""
    items = []
    if class_to_idx is None:
        class_to_idx = {}

    # sort the folder names so the class -> index mapping is the same on every run
    class_folders = sorted(f for f in os.listdir(root_folder) if 'DS_Store' not in f)
    for i, class_ in enumerate(class_folders):
        if class_ not in class_to_idx:
            class_to_idx[class_] = i
        folder_path = os.path.join(root_folder, class_)
        # pair every .jpg file path with its class name
        imgs = [(os.path.join(folder_path, f), class_) for f in os.listdir(folder_path) if f.endswith('.jpg')]
        items.extend(imgs)

    ds = CustomDataset(items, class_to_idx, transforms=transforms)
    dl = DataLoader(ds, shuffle=True, batch_size=batchsize)
    return ds, dl

When we load the images we need to apply our transforms. We should have two distinct transform pipelines for our training and validation datasets as our training transform will have the random flipping. 

For the **training** transforms we have:
1. Our custom transform `SquareImage` which pads our image in either the height or width direction to make it square.
2. `Resize` which resizes to the desired size. If a single number is given both the height and width are resized to that value.
3. `RandomHorizontalFlip` which flips the image horizontally (left to right) with a 50% probability by default.
4. `ToImageTensor` and `ConvertImageDtype` deal with converting a PIL Image to a PyTorch tensor with the correct type; the second step scales the data to between 0 and 1.
5. `Normalize` normalises the input data with  `output[channel] = (input[channel] - mean[channel]) / std[channel]` where channel represents the index for R, G or B. The values we have given are the per-channel means and standard deviations calculated on ImageNet, and our model has been pretrained on ImageNet **so it is expecting input in this distribution range**.

The **validation** pipeline is almost the same but we have removed the `RandomHorizontalFlip`. Why might this be? Well, when we load images for validation we want it to be exactly the same each time so that any time we calculate the validation metrics it is a fair comparison between runs.
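To see what the `Normalize` formula above does to the pixel values, here is a small NumPy sketch that applies it by hand. The random 4x4 image is a stand-in for a real image that has already been scaled to between 0 and 1; only the mean and std values come from the pipelines in this notebook.

```python
import numpy as np

# ImageNet channel statistics used in the transform pipelines (R, G, B)
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])

# Hypothetical image already scaled to 0-1, shape (channels, height, width)
rng = np.random.default_rng(0)
img = rng.random((3, 4, 4))

# Reshape the statistics to (3, 1, 1) so each channel uses its own mean and std
normalised = (img - mean[:, None, None]) / std[:, None, None]

print(img.min(), img.max())                # within 0-1
print(normalised.min(), normalised.max())  # no longer bounded by 0-1
```

After normalisation a pixel equal to the channel mean becomes 0, so the values are centred around zero rather than sitting between 0 and 1.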

In [None]:
from torchvision.transforms import v2

train_transforms = v2.Compose(
    [
        SquareImage(),
        v2.Resize(224),
        v2.RandomHorizontalFlip(),
        v2.ToImageTensor(),
        v2.ConvertImageDtype(),
        v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]
)

val_transforms = v2.Compose(
    [
        SquareImage(),
        v2.Resize(224),
        v2.ToImageTensor(),
        v2.ConvertImageDtype(),
        v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]
)

train_ds, train_dl = get_dataloader('data/train', transforms=train_transforms)
# we need to pass the train_ds.class_to_idx to the validation dataset to 
# make sure the classes have the same mapping for train and validation datasets
val_ds, val_dl = get_dataloader('data/val/', transforms=val_transforms, class_to_idx=train_ds.class_to_idx)

The Dataloaders are iterables (we can use `for ... in` to run through each batch), so let's load the first input (`x`) and the labels (`y`).

In [None]:
for x,y in train_dl:
    print(x.shape)
    print(y.shape)
    print(y)
    break

We have a batch of 16 images (each image is 3 channels of height=224 _x_ width=224) and an associated 16 class labels which are a mix of the different classes we have in our dataset.

***

## Model Training

PyTorch doesn't provide any utility functions for training models, as it leaves the creation of a training loop up to the user. This makes it highly customisable, but you may find you are writing similar code for a variety of tasks.

We don't intend to do a full overview of model training as that can be found elsewhere in much more detail (see this [YouTube playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi) for a good overview of each step) but we will briefly go over the following code now.

The code trains a neural network by showing it a bunch of examples (the training data) and telling it what the correct output is for each example. The neural network uses this information to learn how to predict the correct output for new examples. The code does this by repeatedly showing the neural network batches of training examples and adjusting the weights of the neural network slightly after each batch. The gradients that tell us how to adjust the weights are calculated by a process called backpropagation. Over time, the neural network learns to predict the correct output for new examples more accurately.

#### Code Line by Line

The code works by first defining the learning rate, optimizer, and loss function. 
- The learning rate (`LEARNING_RATE`) controls how much the weights of the neural network are updated each iteration.
- The optimizer (`Adam`) is an algorithm that updates the weights in a way that minimizes the loss function.
- The loss function (`CrossEntropyLoss`) measures the error, i.e. how far the predictions of the neural network are from the correct labels on the training data.
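For intuition about the loss function, the sketch below computes cross-entropy for a single example by hand in NumPy: it is the negative log of the softmax probability assigned to the correct class. `nn.CrossEntropyLoss` takes the unnormalised scores in the same way and, by default, averages this value over the batch. The logit values and the helper name `cross_entropy` are assumptions for illustration.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy loss for one example: -log(softmax(logits)[label])."""
    shifted = logits - np.max(logits)  # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

logits = np.array([0.5, 1.2, 3.1])  # hypothetical scores for three classes

print(cross_entropy(logits, label=2))  # low loss: class 2 has the highest score
print(cross_entropy(logits, label=0))  # higher loss: class 0 has the lowest score
```

The loss is small when the model puts most of its probability on the correct class and grows as that probability shrinks.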

Next, the code enters a training loop. For each epoch (one full pass over the training data), the code does the following:

1. Sets the model to training mode.
2. Iterates over the training data, batch by batch.
3. For each batch, the code
    - Zeroes out the gradients of the model.
    - Forwards the batch of images through the model to get the predictions.
    - Calculates the loss between the predictions and the ground truth labels.
    - Backpropagates the loss through the model to calculate the gradients of the weights.
    - Updates the weights of the model using the optimizer.
4. Prints the average training loss for the epoch.
5. Evaluates the model on the validation data (if `eval=True`).

#### Backpropagation
Backpropagation is the algorithm used to calculate how the weights of the neural network should change. It works by propagating the error from the output layer of the network back to the input layer. At each layer, the error is used to calculate the gradients of the weights. The gradients tell us how to change the weights to reduce the error/loss of the model, and the optimizer uses them to update the weights.

The backpropagation algorithm is very efficient and is able to update the weights of even very large neural networks. It is one of the key reasons why neural networks are able to learn complex tasks.
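The sketch below works through one gradient calculation and one weight update by hand for a model with a single linear layer, so the chain rule only has one step. In a deep network `loss.backward()` repeats the same idea layer by layer. Note the assumptions: the feature vector, label and learning rate are made up, and the update is plain gradient descent rather than the Adam optimizer used in this notebook.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss_and_grad(W, x, label):
    """Cross-entropy loss of a one-layer linear classifier and its gradient w.r.t. W."""
    probs = softmax(W @ x)
    loss = -np.log(probs[label])
    # chain rule: dLoss/dlogits = probs - one_hot(label), and logits = W @ x
    d_logits = probs.copy()
    d_logits[label] -= 1.0
    grad_W = np.outer(d_logits, x)
    return loss, grad_W

x = np.array([1.0, 2.0, -1.0, 0.5])  # hypothetical feature vector for one image
label = 2                            # hypothetical correct class
W = np.zeros((3, 4))                 # weights: 3 classes x 4 features
learning_rate = 0.1

loss_before, grad_W = loss_and_grad(W, x, label)
W = W - learning_rate * grad_W       # plain gradient descent update
loss_after, _ = loss_and_grad(W, x, label)

print(loss_before, loss_after)       # the loss is lower after the update
```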

#### Example
Suppose we are training a neural network to classify images of cats and dogs. The neural network has two outputs, one for cats and one for dogs. We feed the neural network a batch of images and get the predictions. We then calculate the loss between the predictions and the ground truth labels.

![model_training.png](./images/model_training.png)

The backpropagation algorithm then propagates the error from the output layer of the network back to the input layer. At each layer, the error is used to calculate the gradients of the weights. 

After a complete backpropagation pass and an optimizer step, all the weights of the neural network have been updated slightly. We can then repeat the process with another batch of images. Over time, the neural network will learn to classify images of cats and dogs more accurately.

In [None]:
from torch.optim import Adam
from tqdm.notebook import tqdm

LEARNING_RATE = 0.0001
opt = Adam(model.parameters(), lr=LEARNING_RATE)
loss_func = nn.CrossEntropyLoss()

def train(model, train_dl, val_dl, opt, loss_f, n_epochs, eval=True):
    """Training model for number of epochs specified"""
    for epoch in tqdm(range(n_epochs)):
        print(f'Epoch {epoch + 1}:')
        model.train()
        running_loss = 0
        for imgs, labels in tqdm(train_dl):
            opt.zero_grad()
            output = model(imgs)
            loss = loss_f(output, labels)
            loss.backward()
            running_loss += loss.item()
            opt.step()
        print(f'Loss:\t {running_loss / len(train_dl):.2f}')
            
        if eval:
            evaluate(model=model, val_dl=val_dl, loss_f=loss_f)

def evaluate(model, val_dl, loss_f):
    """Evaluate the model with a validation dataloader"""
    model.eval()
    with torch.no_grad():
        loss = 0
        n_correct = 0
        for imgs, labels in tqdm(val_dl):
            logits = model(imgs)
            proba = logits.softmax(axis=1)
            preds = proba.argmax(axis=1)
            # count the correct predictions in this batch
            n_correct += (preds == labels).sum().item()
            loss += loss_f(logits, labels).item()
        print(f'Validation loss:\t {loss / len(val_dl):.2f}')
        # use the dataset attached to the dataloader rather than the global val_ds
        print(f'Validation Accuracy:\t {(n_correct / len(val_dl.dataset))*100:.2f}%')

Let's train for 5 epochs and see what happens to the loss and accuracy:

In [None]:
train(model=model, train_dl=train_dl, val_dl=val_dl, 
      opt=opt, loss_f=loss_func, n_epochs=5)

<br>
<br>
<br>

Let's check out some predictions:

In [None]:
from PIL import Image

def predict(model, x):
    """Predict given a tensor"""
    model.eval()
    with torch.no_grad():
        logits = model(x)
        softmax = logits.softmax(axis=1)
        arg_max = softmax.argmax(axis=1)
    return logits, softmax, arg_max


def predict_filepath(model, img_path, transforms, idx_to_class):
    """Predict given an image filepath"""
    img = Image.open(img_path)
    tensor = transforms(img)
    # next step adds a batch of 1 so image shape
    # goes from 3, 224, 224 to 1, 3, 224, 224
    tensor = tensor[None, :, :, :]
    logits, softmax, arg_max = predict(model, tensor)
    return img, idx_to_class[arg_max.item()], softmax[:,arg_max].item()

Try running this cell with different images selected from different folders and see what results you get.

In [None]:
sample_img = 'data/val/bird/1018.jpg'
idx_to_class = {v: k for k, v in train_ds.class_to_idx.items()}
img, prediction, conf = predict_filepath(model, sample_img, val_transforms, idx_to_class)
print(prediction, conf)

img

### Exercises

These exercises are intended to help you understand this notebook further. You don't need to do them all or in sequence, just pick the ones that interest you the most, or explore how changing the notebook affects the results. **If you want to try training the model from scratch again, it is a good idea to reset the notebook and run all cells again.**
***
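1. Can we use a different backbone? A very simple change would be to try a larger ResNet. The number 18 in the one we have used refers to the number of layers, but there are versions with 34, 50, 101 and 152 layers. Note that the versions with 50 or more layers output 2048 features rather than 512, so the input size of the `classifier` layer needs to change too. Have a look at the cell defining the model, specifically the lines: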
- `from torchvision.models import resnet18, resnet`
- `self.backbone = nn.Sequential(*list(resnet18(weights=resnet.ResNet18_Weights.DEFAULT).children())[:-1])`

***
2. We have evaluated the model on the validation set and returned an overall accuracy score, but is this the best way to validate the performance of the model? Are there any other metrics we could calculate on this dataset?

Use this code to obtain all predictions and labels for the validation set and think about what else you could calculate:
```python
preds = []
labels = []
for x, y in val_dl:
    logits, softmax, argmax = predict(model, x)
    preds.extend(argmax.tolist())
    labels.extend(y.tolist())
```
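As one possible starting point, the sketch below builds a confusion matrix and per-class recall from two lists shaped like the `preds` and `labels` lists above. The example values and the names `example_preds` and `example_labels` are made up for illustration; swap in the real lists to use it.

```python
# Hypothetical predictions and labels, standing in for the lists built above
example_preds = [0, 0, 1, 2, 2, 2, 1, 0]
example_labels = [0, 1, 1, 2, 2, 0, 1, 0]
n_classes = 3

# confusion[true][predicted] counts how often each true class got each prediction
confusion = [[0] * n_classes for _ in range(n_classes)]
for true, pred in zip(example_labels, example_preds):
    confusion[true][pred] += 1

for row in confusion:
    print(row)

# Recall per class: of the images that truly belong to a class, how many were found
for cls in range(n_classes):
    total = sum(confusion[cls])
    recall = confusion[cls][cls] / total if total else float('nan')
    print(f'class {cls}: recall = {recall:.2f}')
```

The off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.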

***
3. What happens if the normalization steps are removed from the transform pipeline? How does this affect the values of `x` in the batches from the training dataloader? How does this affect the model training?

***
4. Are there any other transforms that could be added to the training transform pipeline - have a look [here](https://pytorch.org/vision/stable/transforms.html#v2-api-reference-recommended) and try a few!


***
5. When we use the pretrained model we are 'cheating' a little bit. It has been trained on ImageNet, and the ImageNet dataset includes many animals including dogs and cats, so the model already knows how to extract useful features. What happens if we don't use a pretrained model? Take a look at this line in the model definition and modify it so we start with a completely fresh model:
- `self.backbone = nn.Sequential(*list(resnet18(weights=resnet.ResNet18_Weights.DEFAULT).children())[:-1])`

How does this change the accuracy achieved in 5 epochs?