# ENN583 Week 8: When Computer Vision Fails For Robotics

In the lecture, we discussed how models developed for a dataset (computer vision models) can fall short when tested in an environment onboard a robot (robotic vision application).

In particular, we talked about:
* measuring performance with different metrics -- accuracy, precision, recall
* confidence calibration
* feature shift
* class shift

In this notebook, you'll explore each of these ideas for an image classification task. We're going to use a ResNet50 classification model, **pretrained on the computer vision dataset ImageNet**, evaluating its performance and testing its limits.

Read through the notebook below and complete any missing sections. If you're attending the practical, ask questions if you get stuck, and I will pick some points to guide you through the answers. If you could not attend the practical, you can always email d24.miller@qut.edu.au if you get stuck on any section.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

#import torch which has many of the functions to build deep learning models and to train them
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

#import torchvision, which has lots of functions for loading and working with image data
import torchvision
import torchvision.transforms as transforms

#this is a nice progress bar representation that will be good to measure progress during testing
import tqdm

## 1. ImageNet Validation Data

We're going to be exploring a model pretrained on ImageNet -- ImageNet has 1000 different classes and internet-sourced images. We will test on a _subset_ of those classes in this practical, specifically:
* backpack
* ballpoint pen
* cellphone
* computer mouse
* laptop computer
* wallet
* water bottle

You can view the entire set of classes in ImageNet by opening the _imagenet\_labels.py_ file.

## Loading the Data

This step has 2 key parts:
1. Create default transformations to apply to the data. The three transformations below are standard for image classification, and they are the ones we will use here:
    1. [transforms.ToTensor()](https://pytorch.org/vision/stable/generated/torchvision.transforms.ToTensor.html) -- this converts a PIL image or numpy array to a tensor while scaling the pixel values to the range [0, 1].
    2. [transforms.Resize()](https://pytorch.org/vision/stable/generated/torchvision.transforms.Resize.html) -- this resizes an input image to the specified size (height, width).
    Resize is important as it ensures the dimensions remain compatible throughout the network, allowing proper operations at each layer and maintaining the required dimensions for the final fully connected layers in the network.
    3. [transforms.Normalize()](https://pytorch.org/vision/stable/generated/torchvision.transforms.Normalize.html) -- this standardizes the pixel values of a tensor image by subtracting the mean and dividing by the standard deviation along the input channels.
    
    You can then use [transforms.Compose](https://pytorch.org/vision/main/generated/torchvision.transforms.Compose.html#torchvision.transforms.Compose) to sequentially chain multiple transforms together.

2. Load the datasets in with [torchvision.datasets.ImageFolder](https://pytorch.org/vision/stable/generated/torchvision.datasets.ImageFolder.html) -- this loads image datasets from folders, assigning labels automatically based on subdirectories, making it convenient for tasks like image classification.
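
To make the "labels from subdirectories" idea concrete, here is a minimal sketch of the rule ImageFolder follows: class folders are sorted alphabetically and numbered from 0. The folder names below are hypothetical and only for illustration; the real imagenet_subset folders may be named differently.

```python
# Hypothetical folder names, listed in no particular order.
folder_names = ["water_bottle", "backpack", "wallet", "laptop"]

# ImageFolder sorts the class folders alphabetically and numbers them from 0.
classes = sorted(folder_names)
class_to_idx = {name: idx for idx, name in enumerate(classes)}

print(classes)
print(class_to_idx)
```

The label you get for an image therefore depends on how its folder name sorts, not on the order in which you think of the classes. You can check the real mapping with `test_data.class_to_idx` once the dataset is loaded.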

**Why Resize to 224x224?**
Many popular pre-trained models, such as AlexNet, VGG, and ResNet, were trained on ImageNet images that were cropped or resized to 224x224 pixels (the original images vary in size). We will use a ResNet architecture pre-trained on ImageNet, so we will use this value.

**How do we pick the normalization values?**
We can use the actual underlying statistics in our training data, or we can use the values from the ImageNet dataset (millions of images). Since our model was pretrained on ImageNet, we use the ImageNet values here.
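
As a rough illustration of what normalization does, the sketch below applies the per-channel ImageNet means and standard deviations to a made-up image using NumPy, then reverses the operation. The random image is an assumption for illustration only; in the notebook, transforms.Normalize() does this step on real tensors.

```python
import numpy as np

# Per-channel ImageNet statistics, shaped to broadcast over (channels, height, width).
means = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
stds = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

# A made-up 3-channel "image" with pixel values in [0, 1].
rng = np.random.default_rng(0)
image = rng.random((3, 4, 4))

# Normalize: subtract the channel mean, then divide by the channel standard deviation.
normalised = (image - means) / stds

# Reverse the normalization: multiply by the standard deviation, then add the mean.
restored = normalised * stds + means

print(np.allclose(restored, image))
```

The reverse step matters whenever you want to display a normalized image, because the normalized values no longer sit in the [0, 1] range that plotting functions expect.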

Below, we use ImageFolder to load our imagenet_subset dataset with this transform.

In [None]:
imagenet_means = (0.485, 0.456, 0.406)
imagenet_stds = (0.229, 0.224, 0.225)

transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Resize((224, 224), antialias = True), 
     transforms.Normalize(imagenet_means, imagenet_stds)])

test_data = torchvision.datasets.ImageFolder('imagenet_subset', transform = transform)

## Visualising the data

It's usually also a good idea to look at some of the data that we're testing. Let's do this below with matplotlib.pyplot for visualisation, showing an image for each class.

In [None]:
def imshow(img, lbl = None):
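    # undo the normalisation so the image displays with its original colours
    means = torch.tensor(imagenet_means).view(3, 1, 1)
    stds = torch.tensor(imagenet_stds).view(3, 1, 1)
    img = img * stds + means
    npimg = img.numpy()
    # reorder (C, H, W) -> (H, W, C) for matplotlib and clip small overshoots outside [0, 1]
    plt.imshow(np.clip(np.transpose(npimg, (1, 2, 0)), 0, 1))
    if lbl is not None: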
        plt.title(lbl)
    plt.show()

class_lbl_to_idx = test_data.class_to_idx
class_idx_to_lbl = {v: k for k, v in class_lbl_to_idx.items()}

for cls in range(7):
    for im, cls_idx in test_data:
        if cls_idx != cls:
            continue
            
        cls_lbl = class_idx_to_lbl[cls_idx]
        imshow(im, cls_lbl)
        break

### Food for thought: 
* Can you already see some differences that could be present between this data and data collected from a camera on a robot navigating around your house or office?
* Do we trust that the performance of our model on this data reflects its performance on a robot?

### Final data preparation

ImageNet has 1000 classes, whereas our dataset here has only 7. The list below gives, for each of our 7 classes, the index of the corresponding ImageNet class.

In [None]:
from imagenet_labels import imagenet_classes

imagenet_classes = np.array(imagenet_classes)

# ImageNet class index for each of our 7 classes, listed in the same order as test_data.classes
imagenet_idxes = [414, 673, 620, 487, 418, 893, 898]

## 2. Pretrained ResNet50 Model

## Initialise the model
We will use a ResNet50 that has already been trained on ImageNet. This is very easy to do in PyTorch -- ```torchvision.models.resnet50``` loads the architecture, and using ```weights=ResNet50_Weights.DEFAULT``` loads the parameters the model learned when it was trained on ImageNet.

You can see all the models built into torchvision [here](https://pytorch.org/vision/stable/models.html#classification).

In [None]:
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)

#this line checks if we have a GPU available
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') 
model = model.to(device)

model.eval() # VERY important step, as some layers behave differently during training and testing (e.g. batch norm)

### Food for thought:
* Look at the model architecture, in particular, take note of the last layer called 'fc'. It has out_features = 1000 -- what does this mean?

## Performance Metrics

The code below is:
* Loading images and their ground-truth labels from the test_data we have created
* Putting these on the GPU in preparation for going through the model
* Passing these through the model to produce 1000 class scores (ImageNet has 1000 classes)
* Grabbing the 7 class scores relevant to our dataset out of the 1000

### Your turn!
Complete the code below to find the performance of the ResNet50 model on our data. In particular:
* find the predicted class label from outputs_subset
* calculate the accuracy (correct/total)
* use the imshow() function to understand why our model is sometimes making mistakes
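
If the first two steps feel abstract, the sketch below walks through them on made-up scores with NumPy. The scores and labels are invented for illustration and are not model outputs; on a PyTorch tensor such as outputs_subset, `torch.argmax` plays the same role as `argmax` here.

```python
import numpy as np

# Made-up class scores for 4 images over 7 classes (one row per image).
scores = np.array([
    [2.0, 0.1, 0.3, 0.0, 0.5, 0.2, 0.1],  # highest score at index 0
    [0.2, 0.1, 3.1, 0.0, 0.5, 0.2, 0.1],  # highest score at index 2
    [0.2, 0.1, 0.3, 0.0, 0.5, 0.2, 1.9],  # highest score at index 6
    [0.2, 1.5, 0.3, 0.0, 0.5, 0.2, 0.1],  # highest score at index 1
])
labels = np.array([0, 2, 5, 1])  # made-up ground-truth labels

# The predicted class is the index of the highest score in each row.
predicted = scores.argmax(axis=1)

# Accuracy is the fraction of predictions that match the ground truth.
correct = int((predicted == labels).sum())
total = len(labels)
demo_accuracy = correct / total

print(predicted, demo_accuracy)
```

In the notebook you process one image at a time, so you would keep running counts of correct and total inside the loop instead of comparing whole arrays.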

In [None]:
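for data in tqdm.tqdm(test_data):
    inputs, label = data

    # add a batch dimension and move the image to the same device as the model
    inputs = inputs.to(device).unsqueeze(0)

    # gradients are not needed for evaluation
    with torch.no_grad():
        outputs = model(inputs)

    # keep only the 7 class scores relevant to our dataset
    outputs_subset = outputs[0][imagenet_idxes]

    # TODO: find the predicted class label from outputs_subset and compare it with label

# TODO: calculate accuracy (correct/total) before printing
print(f'Model accuracy is {100.*accuracy:.2f}%')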

#### What types of mistakes does the model make?

## Building Confusion Matrices

Using the code above, collect all the ground-truth labels and predictions into lists named 'gt' and 'pred' respectively. You can use these to build a confusion matrix that visualises the types of errors the model is making.

We can use [ConfusionMatrixDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html) from sklearn.metrics to create a confusion matrix.

To use the function, we need to pass in: 
- the GT label for each sample
- the predicted label for each sample
- (optional) display labels for each label (i.e. a list of class strings)
- (optional) normalize over 'true' or 'pred' labels to account for class imbalance
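
To connect the confusion matrix back to the precision and recall metrics from the lecture, here is a minimal sketch that builds a small confusion matrix by hand with NumPy. The labels are made up for illustration, and the names gt_demo and pred_demo are hypothetical so they do not clash with your own 'gt' and 'pred' lists. It follows the scikit-learn convention of rows for true labels and columns for predicted labels.

```python
import numpy as np

# Made-up ground-truth and predicted labels for 8 samples over 3 classes.
gt_demo = np.array([0, 0, 0, 1, 1, 2, 2, 2])
pred_demo = np.array([0, 0, 1, 1, 1, 2, 0, 2])
n_classes = 3

# Rows are true classes, columns are predicted classes.
cm = np.zeros((n_classes, n_classes), dtype=int)
for true_label, pred_label in zip(gt_demo, pred_demo):
    cm[true_label, pred_label] += 1

# Recall: of the samples that truly belong to a class, the fraction predicted as that class.
recall = cm.diagonal() / cm.sum(axis=1)

# Precision: of the samples predicted as a class, the fraction that truly belong to it.
precision = cm.diagonal() / cm.sum(axis=0)

print(cm)
print(recall, precision)
```

This is also what the normalize option changes in the plot: normalizing over 'true' divides each row by its sum, so the diagonal shows per-class recall, while normalizing over 'pred' divides each column by its sum, so the diagonal shows per-class precision.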

Below, let's first test our model over the val dataset and collect the GT label and predicted label for each sample.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

#run once without normalize and then again with normalize = 'true' (or 'pred')
ConfusionMatrixDisplay.from_predictions(gt, pred, display_labels = test_data.classes, xticks_rotation = 'vertical')


## Confidence Calibration Curves

As discussed in the lecture, calibration curves show how well the confidence scores predicted by a classification model align with the model's actual accuracy. This is critical in applications such as robotics, where the uncertainty of a prediction matters as well as the label. For a well-calibrated model, the predicted confidence is close to the true probability that the prediction is correct: for example, about 80% of the predictions made with a confidence of 0.8 should be correct.

To do this, we need to collect all our GT labels, predictions, and the confidence associated with each prediction.

**Note: This relies on class scores being converted to pseudo-probabilities using the [torch.nn.functional.softmax() function](https://pytorch.org/docs/stable/generated/torch.nn.functional.softmax.html).**

Complete the code below to calculate:
* all_gt -- a list of all the ground truth labels
* all_pred -- a list of all the predicted labels
* all_confidences -- a list of the softmax score of the predicted class for each sample
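
As a hedged illustration of where a confidence value comes from, the sketch below implements softmax with NumPy on made-up scores and reads off the confidence of the predicted class. The scores and the names demo_scores, pred_label and confidence are assumptions for illustration. In the notebook you would more likely call `F.softmax(outputs_subset, dim=0)` on the tensor directly.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into values that are positive and sum to 1."""
    shifted = scores - np.max(scores)  # subtracting the max avoids overflow in exp
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Made-up class scores for one image over 7 classes.
demo_scores = np.array([4.0, 1.0, 0.5, 0.2, 0.1, 0.0, -1.0])

probs = softmax(demo_scores)
pred_label = int(np.argmax(probs))      # predicted class
confidence = float(probs[pred_label])   # softmax score of the predicted class

print(pred_label, confidence)
```

Softmax does not change which class has the highest score, so the predicted label is the same whether you take the argmax before or after applying it.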


Once you have completed the above, you can run the next cell to see the confidence calibration curves.

#### Consider: Is the model: well-calibrated, over-confident, or under-confident? Is it prone to giving a certain confidence more than others?

How would this inform the advice you give someone who wants to use the model?

In [None]:
#create a variable that holds the confidence intervals we will check on a confidence calibration curve
conf_ranges = [[0, 10], [10, 20], [20, 30], [30, 40], [40, 50], [50, 60], [60, 70], [70, 80], [80, 90], [90, 100]] 

#convert our previously collected lists into numpy arrays so that we can easily manipulate them
all_pred_conf = np.array(all_confidences)
all_pred_class = np.array(all_pred)
all_gt_class = np.array(all_gt)

actual_accuracy = []
conf_level = []
conf_counts = []
for conf_int in conf_ranges:
    lower = conf_int[0]/100 #convert between 0-1
    upper = conf_int[1]/100 #convert between 0-1

    #create a mask that will collect predictions in the confidence interval: at or above the lower thresh AND below the upper thresh
    #(the final bin also includes a confidence of exactly 1.0, which would otherwise be dropped)
    mask = (all_pred_conf >= lower) & ((all_pred_conf < upper) | (upper == 1.0))
    
    #collect all predictions and GT data within the range using the mask
    #(named bin_* so the earlier gt list is not overwritten)
    bin_preds = all_pred_class[mask]
    bin_gt = all_gt_class[mask]
    
    #find the accuracy of this bin by checking how many correct/total (0 for an empty bin, to avoid dividing by zero)
    correct = np.sum(bin_preds == bin_gt)
    total = len(bin_preds)
    bin_accuracy = correct/total if total > 0 else 0.0
    actual_accuracy += [bin_accuracy] #save the accuracy for this bin to plot later
    conf_level += [(upper + lower)/2] #this is the midpoint of the confidence interval (not necessarily the average confidence of the predictions in the bin), we will use this for plotting later

    #how many samples in this bin?
    conf_counts += [total]


#Create a figure 
fig, ax = plt.subplots(2, 1, figsize = (5, 7))
ax[0].bar(conf_level, actual_accuracy, width = 0.09)
ax[0].plot([0, 1], [0, 1], 'r--') #our well-calibrated line
ax[0].set_xlabel('Confidence')
ax[0].set_ylabel('Accuracy')
ax[0].set_title('Confidence Calibration Curve')

ax[1].bar(conf_level, conf_counts, width = 0.09)
ax[1].set_xlabel('Confidence')
ax[1].set_ylabel('Count')

plt.savefig('Confidence_curve.png')
plt.show()

## Exploring Feature Shift

We're going to explore how the performance (accuracy, confusion matrix, and confidence calibration) changes when we test on a different dataset -- objectnet_subset. Follow the same process as above, but load in the objectnet_subset folder as the dataset.

**In the below cell, load in the objectnet_subset data.**

**In the below cell, use imshow() to visualise some of the objectnet_subset data.**
#### Consider: How does this data look different to the ImageNet data?

**In the below cell, calculate the accuracy on the new objectnet_subset data**
#### Consider: How does accuracy change for this dataset?

**In the below cells, build a confusion matrix for the objectnet_subset data**
#### Consider: How do the types of errors change for this dataset?

**In the below cell, build the confidence calibration for the objectnet_subset data.**
#### Consider: Has the confidence calibration changed for this dataset?                                              

## Investigating Class Shift

For the last part of this practical, you're going to investigate what happens when we pass images of classes **not** in the ImageNet class list into the model. You can run the cell below to see the list of ImageNet classes.

In [None]:
print(imagenet_classes)

Download images from the internet, or take them on your phone and upload them. You can click the 'upload' symbol under the toolbar and choose an image to upload to JupyterLab. 
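
Before trying real images, it may help to see why class shift is a problem for a softmax classifier. The sketch below uses made-up scores (an assumption, not real model output) for an image whose class is not in the label set. Softmax still produces values that sum to 1 across the known classes, and it depends only on the differences between scores, so the model has no way to say "none of these".

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into values that are positive and sum to 1."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Made-up scores for an unknown object: every score is low, but one is less low than the rest.
unknown_scores = np.array([-3.0, -6.0, -5.5, -7.0, -6.5, -8.0, -6.0])
probs = softmax(unknown_scores)

# Shifting every score by the same amount leaves the softmax output unchanged.
shifted_probs = softmax(unknown_scores + 10.0)

print(probs.sum(), probs.argmax(), probs.max())
print(np.allclose(probs, shifted_probs))
```

Even though every raw score is low, one known class still receives most of the probability mass, so a high confidence on an out-of-class image does not mean the model recognised the object.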

Adapt the code below to visualise the image, the predicted class label, and the associated confidence. Currently it only shows the image and predicted class label.

In [None]:
from PIL import Image

im_name = 'hamster.jpg'
im = Image.open(im_name).convert('RGB') #make sure the image has 3 colour channels (e.g. some PNGs have 4)

inputs = transform(im).unsqueeze(0).to(device)
with torch.no_grad(): #gradients are not needed for inference
    outputs = model(inputs)
outputs_subset = outputs[0][imagenet_idxes]

predicted = torch.argmax(outputs_subset).cpu().item()
class_name = class_idx_to_lbl[predicted]

imshow(inputs[0].cpu(), f'{class_name}')

#### Time permitting: Try taking photos of the classes in the dataset with your phone, introducing your own version of feature shift (e.g. cluttered images, images at a distance, objects stacked on top of each other), and see how the model responds.