# Computer Vision Workshop 1

## 5.1 Training a model, using a model for inference and evaluating predictions.

Now that we have a dataset and dataloader to work with, we need to do something with them! You will most likely want to train a computer vision model to fit your custom use case.

In this lesson we will outline the steps and principles to create custom models. Using our dataset we will:
- **train** a custom **classification model** by using our dataset of **image** and **label** pairs to guide the model as it learns the association between inputs and their labels.
- **evaluate** the model by using a set of **images and labels** that the model **doesn't** use during training. By knowing the correct answer for a given image we can check whether the model output matches it and produce **evaluation metrics** to see how well our model performs across a number of examples.
- use the trained model for **inference**, which takes an image (in tensor form) as **input** and returns a **prediction** of whether the image is a cat, dog or bird. In this case we will most likely not have a **label** to compare to - we want to apply the model to its use case and make predictions as accurately as possible.

Image classification is a computer vision task where we care about what the image shows, but not necessarily where things are located in it. For that we would need object detection or segmentation models, which we will examine a bit later. The general principles outlined here still apply: training, evaluation and prediction.

Let's quickly outline the structure of this task. Many of the general principles in this process are not specific to computer vision and will be familiar to those who have some knowledge of machine learning. The only thing specific to computer vision is that the model can take in an image tensor and produce an output consistent with our problem definition.

***

#### Dataset Split
Firstly we need to split our dataset into training and evaluation sets. There are several ways to achieve this, but it is common to take a percentage split (80% train; 20% evaluation, for example). One thing we should always bear in mind is that both the training and evaluation sets should reflect the data you expect to encounter when you deploy your trained model to its use case.

![dataset_split.png](./images/dataset_split.png)
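A percentage split can be sketched in a few lines of plain Python. The helper name `split_indices` and the fixed seed are illustrative assumptions rather than part of the workshop code; the two index lists it returns could then be used to select examples from the dataset, for instance with `torch.utils.data.Subset`.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle the indices 0..n_items-1 and split them into train/eval lists."""
    indices = list(range(n_items))
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


# Hypothetical dataset of 100 images: 80 go to training, 20 to evaluation.
train_idx, eval_idx = split_indices(100)
print(len(train_idx), len(eval_idx))  # 80 20
```

Shuffling before splitting matters when the dataset is stored in class order; otherwise one class could end up entirely in the evaluation set.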

***

#### Model Training

With our training set we pass images into the model and compare its output to the known label. Using a loss function we can update the weights of the model based on the amount of error being made. This is done iteratively over the whole training dataset a number of times to improve the model during the training phase.

![model_training.png](./images/model_training.png)
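The loop described above can be sketched in PyTorch as follows. Everything here is a stand-in chosen for illustration: the random tensors replace a real dataset, the single linear layer replaces a real vision model, and the learning rate, batch size and epoch count are arbitrary assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in data: 64 random "images" (3 channels, 8x8 pixels) with labels 0, 1 or 2.
images = torch.randn(64, 3, 8, 8)
labels = torch.randint(0, 3, (64,))

# Deliberately tiny stand-in model: flatten the image, then one linear layer.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3))
loss_fn = nn.CrossEntropyLoss()  # compares logits with integer class labels
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

with torch.no_grad():
    initial_loss = loss_fn(model(images), labels).item()

batch_size = 16
for epoch in range(20):  # several passes over the whole training set
    for start in range(0, len(images), batch_size):
        x = images[start:start + batch_size]
        y = labels[start:start + batch_size]
        loss = loss_fn(model(x), y)  # how wrong is the model on this batch?
        optimizer.zero_grad()        # clear gradients from the previous step
        loss.backward()              # compute gradients of the loss
        optimizer.step()             # update the weights to reduce the loss

with torch.no_grad():
    final_loss = loss_fn(model(images), labels).item()

print(f"loss before: {initial_loss:.3f}, after: {final_loss:.3f}")
```

In practice the manual slicing would usually be replaced by iterating over a dataloader, which handles batching and shuffling.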

***

#### Model Evaluation

With our evaluation set we check how the model performs, making sure to use examples that have not been used in training. We compare the model output to the known label, but this time we do not update the model. By comparing the label and prediction we can compute whichever metrics suit our use case.

It is important to choose the evaluation metric before we begin and to make sure it makes sense for our use case. Consider whether all model errors are equally costly (missing cancer in a CT scan vs predicting cancer where there is none?) and the balance of classes in your dataset or expected in the real world. Accuracy is not a good measure when you have rare classes. For example, if only 1% of your CT images are expected to show cancer, a model that never predicts cancer will be 99% accurate but have no predictive power.

For classification we may look at accuracy (overall and broken down by classes/unique labels), precision/recall or F1 score.  

![model_evaluation.png](./images/model_evaluation.png)
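The rare-class problem is easy to reproduce with made-up numbers. The sketch below assumes 1,000 hypothetical scans of which 10 contain cancer, and a model that always predicts "no cancer". The helper `binary_metrics` is illustrative, and reporting precision, recall and F1 as 0.0 when they are undefined is a convention chosen here, not the only option.

```python
def binary_metrics(labels, predictions, positive=1):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)

    accuracy = correct / len(pairs)
    # Assumed convention: report 0.0 when a ratio would divide by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# 1% of scans are positive (label 1); the model never predicts the positive class.
labels = [1] * 10 + [0] * 990
predictions = [0] * 1000

metrics = binary_metrics(labels, predictions)
print(metrics["accuracy"], metrics["recall"])  # 0.99 0.0
```

The high accuracy hides the fact that every positive case is missed, which recall exposes immediately.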

***

#### Inference/Prediction

After training a model we apply it to our use case. In this situation we are not likely to have labels - we want the model to create them as accurately as possible. How we use the output of our trained model depends on the downstream tasks we are trying to perform.

![inference.png](./images/inference.png)
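A minimal sketch of inference in PyTorch is shown below. The untrained linear layer stands in for a trained model, the random tensor stands in for a real image, and the `predict` helper and the order of `class_names` are assumptions made for illustration.

```python
import torch
from torch import nn

class_names = ["bird", "dog", "cat"]  # assumed order of the output nodes

# Stand-in for a trained model (here untrained, so predictions are meaningless).
torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, len(class_names)))


def predict(model, image):
    """Return the predicted class name and its probability for one image."""
    model.eval()               # switch layers such as batch norm to inference mode
    with torch.no_grad():      # no gradients are needed when only predicting
        logits = model(image.unsqueeze(0))       # add a batch dimension
        probs = torch.softmax(logits, dim=1)[0]  # probabilities for this image
    index = int(probs.argmax())
    return class_names[index], float(probs[index])


image = torch.rand(3, 8, 8)  # hypothetical image tensor (channels, height, width)
label, confidence = predict(model, image)
print(label, round(confidence, 3))
```

There is no label to compare against here; the returned class name and probability are what a downstream task would consume.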

***

Let's get started. For the model we are going to finetune a resnet18 - a small convolutional neural network. For the moment we don't need to understand what the model is actually doing, simply that for a single image it takes in an image tensor and transforms it into a vector of numbers, like so:

![model_throughput.png](./images/model_throughput.png)

1. We have an input image, which is really just an array of numbers (the pixel values) with **shape=(number of channels, height, width)**.
2. This input goes through the model. We can really just think of the model as a function, but the key to training is that we update the function based on its error compared to the correct labels.
3. The model produces an array of unnormalised scores (also known as logits). In this use case we have **three** classes so the unnormalised scores have a **shape=(3, 1)**. The positions in this array correspond to the label outputs: the first is the Bird node, the second the Dog node and the third the Cat node.
4. As we want to make a single classification (one label per image) we put the scores through the softmax function, which in code is `softmax(x) = np.exp(x)/sum(np.exp(x))` where `x` is the unnormalised scores. This function bounds the numbers in the array to between 0 and 1, and the array sums to 1. In this case the Cat node has the highest value and that is the model's prediction for this input.
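As a small worked example of step 4, the sketch below applies softmax to three made-up logits. The values are hypothetical and chosen so that the third (Cat) node wins; subtracting the maximum before exponentiating is a common numerical-stability trick that leaves the result unchanged.

```python
import numpy as np


def softmax(x):
    """Turn unnormalised scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # avoids overflow in exp without changing the result
    exps = np.exp(shifted)
    return exps / exps.sum()


logits = np.array([0.5, 1.0, 3.0])  # hypothetical scores for bird, dog, cat
probs = softmax(logits)

print(probs.round(3))
print(probs.argmax())  # 2, i.e. the Cat node
```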

#### The model

Let's write some code to get us a model to start with. Here we are finetuning a pretrained model, which means we are taking a model that has been trained on some other dataset, modifying the number of output nodes and then doing some training on our own dataset. If you can, this is often a good way to start as you need less data and training to get some results to work with.

In [5]:
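from torch import nn
from torchvision.models import resnet18


class CNNModel(nn.Module):
    """ResNet-18 backbone with a new linear classification head."""

    def __init__(self, n_classes):
        super().__init__()
        # Keep every pretrained layer except the final fully connected layer.
        # Note: newer torchvision releases deprecate `pretrained=True` in favour of the `weights` argument.
        self.backbone = nn.Sequential(*list(resnet18(pretrained=True).children())[:-1])
        # The backbone outputs (batch, 512, 1, 1); flatten it to (batch, 512).
        self.flatten = nn.Flatten(start_dim=1)
        # New output layer with one node per class.
        self.classifier = nn.Linear(512, n_classes)

    def forward(self, x):
        x = self.backbone(x)
        x = self.flatten(x)
        return self.classifier(x)  # unnormalised scores (logits)


# Three output classes: bird, dog and cat.
model = CNNModel(n_classes=3)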

In [6]:
model

CNNModel(
  (backbone): Sequential(
    (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU(inplace=True)
    (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
    (4): Sequential(
      (0): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        (relu): ReLU(inplace=True)
        (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
      (1): BasicBlock(
        (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
        (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=Tru