# Lab Instructions

In each lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. We provide code structured so that you can fill in the blanks using the knowledge you acquired in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide you with partial code to ensure the right variables are populated and any code that follows runs as expected.

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## 12.8 Lab 5A: Fine-Tuning Object Detection Models

In this lab, you'll build a dataset, including data augmentation, and fine-tune a custom object detection model by replacing its standard backbone with a different computer vision model. In the end, you'll evaluate the model using metrics from the COCO challenge.

### 12.8.1 Oxford-IIIT Pet Dataset

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

You'll build a dataset using the images and annotations from the [Oxford-IIIT Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/):

"_We have created a 37 category pet dataset with roughly 200 images for each class. The images have a large variations in scale, pose and lighting. All images have an associated ground truth annotation of breed, head ROI, and pixel level trimap segmentation._"

You will load the data using [PyTorch's built-in class](https://pytorch.org/vision/stable/generated/torchvision.datasets.OxfordIIITPet.html), but you're tasked with preprocessing the annotations and building a dataset that is compatible with V2 transforms for data augmentation (without wrapping the built-in dataset, that is).

First, load the data into a folder of your choice (e.g. `./pets`), making sure to retrieve the `trainval` split (which has annotations), and choose both target types, `category` and `segmentation`, since you'll be fine-tuning a model to detect pets in images.

In [None]:
from torchvision.datasets import OxfordIIITPet

root_folder = './pets'
# write the arguments to create an instance of the dataset
pets = OxfordIIITPet(...)

### 12.8.2 Annotations

The annotations follow the Pascal VOC challenge format, and are stored as individual XML files, one for each annotated image, inside the `oxford-iiit-pet/annotations/xmls` subfolder. Use the `xml_to_csv()` helper function to convert all these files into a Pandas dataframe and inspect its contents.

In [None]:
import glob
import pandas as pd
import xml.etree.ElementTree as ET

def xml_to_csv(path):
    """Iterates through all .xml files (generated by labelImg) in a given directory and combines
    them in a single Pandas dataframe.

    Parameters:
    ----------
    path : str
        The path containing the .xml files
    Returns
    -------
    Pandas DataFrame
        The produced dataframe
    """

    xml_list = []
    for xml_file in glob.glob(path + '/*.xml'):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        filename = root.find('filename').text
        width = int(root.find('size').find('width').text)
        height = int(root.find('size').find('height').text)
        for member in root.findall('object'):
            bndbox = member.find('bndbox')
            value = (filename,
                     width,
                     height,
                     member.find('name').text,
                     int(bndbox.find('xmin').text),
                     int(bndbox.find('ymin').text),
                     int(bndbox.find('xmax').text),
                     int(bndbox.find('ymax').text),
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height',
                   'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df

In [None]:
# write your code here
xml_df = ...
xml_df

The annotations contain the box coordinates in the Pascal VOC system (`[xmin, ymin, xmax, ymax]`), but they only have two main classes, cats and dogs, instead of the expected 37 classes found in the description. As it turns out, there are more files in the `annotations` folder, namely, `list.txt`, `trainval.txt`, and `test.txt`.
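
To make the structure of these files concrete, the sketch below parses a single hand-written annotation in the Pascal VOC layout, using the same tags that `xml_to_csv()` reads. The XML string, its filename, and its coordinates are made up for illustration; they are not taken from the dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, hand-written in the Pascal VOC layout
voc_xml = """
<annotation>
  <filename>example_1.jpg</filename>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>cat</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>250</ymax></bndbox>
  </object>
  <object>
    <name>dog</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>90</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>
""".strip()

root = ET.fromstring(voc_xml)
filename = root.find('filename').text

# One row per <object> element: each box becomes its own annotation
rows = []
for obj in root.findall('object'):
    bndbox = obj.find('bndbox')
    coords = [int(bndbox.find(tag).text) for tag in ('xmin', 'ymin', 'xmax', 'ymax')]
    rows.append((filename, obj.find('name').text, *coords))

for row in rows:
    print(row)
```

Notice that a single file can yield several rows, one per box, which is why the number of annotations does not have to match the number of images.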

If you're in Google Colab, the command below will list the files inside the `annotations` folder:

In [None]:
# if you chose a different root folder, change it accordingly
!ls -l ./pets/oxford-iiit-pet/annotations

Let's take a look at the `list.txt` file.

If you're in Google Colab, the command below will show you the first few lines of the `list.txt` file:

In [None]:
!head ./pets/oxford-iiit-pet/annotations/list.txt

It contains a list of all images in the dataset, organized in four columns separated by spaces: Image, CLASS-ID, SPECIES, BREED ID. As it turns out, the "class" from the XML file is actually the species. We're interested in the true class ids, from 1 to 37, as stated in the description.

Now, let's take a look at the file corresponding to the data you loaded, the `trainval` split.

If you're in Google Colab, the command below will show you the first few lines of the `trainval.txt` file:

In [None]:
!head ./pets/oxford-iiit-pet/annotations/trainval.txt

It clearly follows the same structure as the previous file, but it does not contain any headers, and it lists only the images that belong to the original train and validation split.

We can load it in Pandas for easier visualization (just run the code below as is to visualize the dataframe with the information from the `trainval.txt` file):

In [None]:
import pandas as pd

trainval_df = pd.read_csv('./pets/oxford-iiit-pet/annotations/trainval.txt', sep=' ', header=None, names=['filename', 'class_id', 'species', 'breed_id'])
trainval_df

Each filename has its own corresponding class index (`class_id`), but the label itself, that is, the descriptive name of the category, is only available as part of the filename. We can easily extract it, though. Just run the code below as is to create a new column (`category`) in the dataframe:

In [None]:
trainval_df['category'] = trainval_df['filename'].apply(lambda v: ' '.join([w.capitalize()
                                                                            for w in v.split('_')[:-1]]))
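
If the lambda function above looks cryptic, the sketch below spells out the same logic as a named function and applies it to two illustrative filenames. The function name `filename_to_category` and the example filenames are ours, chosen to resemble the dataset's naming style.

```python
def filename_to_category(stem):
    """Drops the trailing image number and capitalizes the remaining words."""
    # 'american_pit_bull_terrier_12' -> ['american', 'pit', 'bull', 'terrier']
    words = stem.split('_')[:-1]
    return ' '.join(w.capitalize() for w in words)

# Illustrative filenames (without extension)
examples = ['Abyssinian_100', 'american_pit_bull_terrier_12']
categories = [filename_to_category(s) for s in examples]
print(categories)
```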

Moreover, there are 3,680 rows, one for each image, but there are 3,687 annotations retrieved from the XML files. Why? It is important to highlight that:
- some images may have more than one annotation/box - you saw that already in the Penn-Fudan dataset
- some images probably have no annotations/boxes (you'll see that soon)

We'll use the same custom dataset class, `ObjDetectionDataset`, once again, since it is prepared to take a CSV file or Pandas dataframe containing the annotations (`filename`, `label`, `xmin`, `ymin`, `xmax`, and `ymax` columns), but keep in mind that it only considers the filenames listed in that file/dataframe.

Therefore, we need to build an annotations file/dataframe that includes filenames that have no annotations as well. It is better to keep images without annotations as negative cases, so we merge both dataframes and make sure that:
- every filename is kept, so there are still 3,680 unique filenames after merging
- the resulting dataframe has, at least, the following columns: `filename`, `label`, `category`, `xmin`, `ymin`, `xmax`, and `ymax`
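
To see why a left merge satisfies the first requirement, consider the toy sketch below. The two dataframes and their filenames are made up: `b.jpg` has no boxes, `a.jpg` has two, and `d.jpg` is annotated but missing from the list of files.

```python
import pandas as pd

# Toy list of files (plays the role of the trainval list)
files_df = pd.DataFrame({'filename': ['a.jpg', 'b.jpg', 'c.jpg'],
                         'class_id': [1, 2, 3]})

# Toy annotations (play the role of the boxes parsed from XML files)
boxes_df = pd.DataFrame({
    'filename': ['a.jpg', 'a.jpg', 'c.jpg', 'd.jpg'],
    'xmin': [10, 60, 5, 0],
    'ymin': [10, 20, 5, 0],
    'xmax': [50, 90, 40, 30],
    'ymax': [40, 80, 45, 30],
})

# A left merge keeps every row of files_df, even if there's no matching box
merged = files_df.merge(boxes_df, how='left', on='filename')
# Rows without a matching box have missing (NaN) coordinates
n_missing = int(merged['xmin'].isna().sum())
print(merged)
```

Every filename in the list survives the merge, images without boxes show up as rows with missing coordinates, and annotations for files outside the list are dropped.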

Run the code below as is to build the corresponding dataframe of annotations:

In [None]:
trainval_df['filename'] = trainval_df['filename'].apply(lambda v: f'{v}.jpg')
annotations_df = trainval_df.merge(xml_df, how='left', on='filename')

colnames = ['filename', 'label', 'category', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax']
annotations_df = annotations_df.rename(columns={'class_id': 'label'})[colnames]
annotations_df

We'll also use the resulting dataframe to build an `id2label` dictionary that maps each class id to its corresponding category. Run the code below as is to build the dictionary:

In [None]:
id2label = dict(annotations_df[['label', 'category']].drop_duplicates().values)
id2label

Let's run some assert commands to ensure everything is as expected. Run the code below as is. It shouldn't raise any errors or produce any output. If an error is raised, you should double-check the code for loading the dataset and its annotations.

In [None]:
assert len(annotations_df['filename'].unique()) == 3680
assert len(id2label.values()) == 37
assert len(annotations_df) == 3681

Shouldn't it be 3,687? Perhaps even more, since it should also include images without any annotations? It actually should, but some of the annotated images were excluded from the `trainval.txt` list of files for some unknown reason. In case you're curious, you can list these images. Run the code below as is to visualize the extra annotations:

In [None]:
extra_annotations = set(xml_df['filename'].unique()).difference(set(annotations_df['filename'].unique()))
extra_annotations

The whole point of this apparent detour from our main job here - fine-tuning an object detection model - is to illustrate the fact that every dataset has its issues, and you should always take your time to investigate how it's organized, if there are quality issues, and ensure it's in the right shape to be loaded into an instance of your dataset class.

By the way, PyTorch's built-in dataset class for the Oxford-IIIT Pet Dataset handles this preprocessing (splitting filenames, building the id2label dictionary, etc.) in its [constructor method](https://pytorch.org/vision/main/_modules/torchvision/datasets/oxford_iiit_pet.html), in case you'd like to check it out.

### 12.8.3 Train-Validation Split

The original list of files does not give any indication regarding the split between training and validation sets, so you'll have to do it yourself.

Our suggestion is to shuffle the filenames, and take a large part of them (e.g. 3,000) as training set, and the remaining files as validation set.

Split the annotations dataframe in two, as the filenames in each dataframe determine which files are going to be part of each dataset (assuming you're using our `ObjDetectionDataset`):

In [None]:
import numpy as np

np.random.seed(11)

# Get all (unique) file names from the annotations dataframe
# write your code here
fnames = ...
np.random.shuffle(fnames)

# Create a boolean pandas series to determine if a given annotation belongs
# to the training set
# Tip: don't forget that images may have multiple annotations - make sure
# two annotations of the same image don't end up in different sets
# write your code here
is_train = ...

annotations = {}
# Use the boolean series to slice the annotations dataframe
# write your code here
annotations['train'] = ...
annotations['val'] = ...
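
The tip about multiple annotations is worth illustrating. The sketch below shows one possible way of splitting by filename rather than by row, using a made-up dataframe; the names `toy_df`, `in_train`, and the split size are ours, and you may well prefer a different approach in your own solution.

```python
import numpy as np
import pandas as pd

# Toy annotations: 'a.jpg' and 'd.jpg' have two boxes each
toy_df = pd.DataFrame({
    'filename': ['a.jpg', 'a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'd.jpg'],
    'xmin': [1, 2, 3, 4, 5, 6],
})

rng = np.random.default_rng(11)
# Shuffles the unique filenames, not the rows
shuffled = rng.permutation(toy_df['filename'].unique())

# The first three filenames go to the training set
train_fnames = shuffled[:3]

# Every row of a given image gets the same True/False value
in_train = toy_df['filename'].isin(train_fnames)
toy_train = toy_df[in_train]
toy_val = toy_df[~in_train]
```

Since the boolean series is computed from the filename, all boxes of a given image necessarily land in the same set.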

### 12.8.4 Loading Model's Weights

You're using a new backbone for your Faster R-CNN model, so you need to pick one that's different from ResNet50. You could, for example, choose a smaller model from the ResNet family, but it's likely more fun to choose a completely different model instead. We suggest you use MobileNet V2 as the new backbone.

Once you choose the model, load its pretrained weights and the prescribed transformations that come with it. 

In [None]:
from torchvision.models import get_weight

# write your code here
weights = ...
transforms_fn = ...
transforms_fn

This is the `forward()` method of the MobileNet V2 transform. Take a good look at the sequence of transformations it performs because, as you've probably already guessed, this function is not compatible with V2 transforms, so you'll have to include them yourself, if needed, in your data augmentation pipeline (the next section).

```python
def forward(self, img: Tensor) -> Tensor:
    img = F.resize(img, self.resize_size, interpolation=self.interpolation, antialias=self.antialias)
    img = F.center_crop(img, self.crop_size)
    if not isinstance(img, Tensor):
        img = F.pil_to_tensor(img)
    img = F.convert_image_dtype(img, torch.float)
    img = F.normalize(img, mean=self.mean, std=self.std)
    return img
```
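
The last two steps, converting to float and normalizing, boil down to simple arithmetic on each pixel. The sketch below reproduces it in plain Python for a single RGB pixel. It assumes the commonly used ImageNet channel statistics; confirm them against the `mean` and `std` shown by your own `transforms_fn`.

```python
# Commonly used ImageNet channel statistics (assumed here; confirm them
# against the values displayed by your own transforms_fn)
mean = (0.485, 0.456, 0.406)
std = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    """Scales a uint8 RGB pixel to [0, 1] and standardizes each channel."""
    # converting uint8 to float rescales values from [0, 255] to [0, 1]
    scaled = [v / 255 for v in rgb_uint8]
    # normalizing subtracts the mean and divides by the standard deviation
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]

black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print(black, white)
```

After normalization, pixel values are no longer confined to the [0, 1] range: they can be negative or greater than one.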

### 12.8.5 Data Augmentation

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

It is time to write your own `get_transform()` function that takes one argument, namely, whether it is performing transformations on the training or the validation set:
- if it is the validation set, it should stick to the basics (hint: check the prescribed transformations to assess these points)
  - make sure the image is in the right size/shape for the backbone of your choice
  - convert, if needed, PIL images to tensors
  - normalize the values
- if it is the training set, it may perform data augmentation as well:
  - choose one or more data augmenting transformations
  - sanitize bounding boxes, just in case

Pay special attention to the order in which transformations will happen, to make sure the transformed image at the end of the pipeline does indeed match the requirements of the backbone model.
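
Keep in mind that geometric transformations must be applied to the boxes as well as to the image. V2 transforms take care of that for you, but it helps to see what is involved. The sketch below is a conceptual illustration only, not torchvision's implementation: the hypothetical function `hflip_box()` mirrors a single box in the `[xmin, ymin, xmax, ymax]` format across the vertical axis of an image.

```python
def hflip_box(box, image_width):
    """Mirrors an XYXY box horizontally inside an image of the given width."""
    xmin, ymin, xmax, ymax = box
    # the right edge becomes the new left edge, and vice versa
    return (image_width - xmax, ymin, image_width - xmin, ymax)

# A box near the left edge of a 100-pixel-wide image...
flipped = hflip_box((10, 20, 50, 80), image_width=100)
# ...ends up near the right edge; vertical coordinates are unchanged
print(flipped)
```

Flipping twice brings the box back to where it started, which is a handy sanity check for any coordinate transformation.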

In [None]:
import torch
from collections import defaultdict
from torchvision.transforms import v2 as transforms

augmenting = [
    # Choose one (or more) augmentation transform(s), such as RandomHorizontalFlip, for example
    # write your code here
    ...
]

basic = [
    # Include required transformations here, such as transforming PIL images into tensors
    # and normalizing pixel values
    # write your code here
    ...
]

def get_transform(train):
    ops = [
        # Include resizing transformations here, to make images the right size for the chosen model
        # write your code here
        ...
    ]
    # Only does augmenting in training mode
    if train:
        ops.extend(augmenting)
    # Basic transforms: to tensor, sanitizing, and normalizing
    ops.extend(basic)
    return transforms.Compose(ops)

### 12.8.6 Datasets and DataLoaders

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

In Chapter 10, we built a dataset class that handles the nitty-gritty details of wrapping images, boxes, and masks, and applying transformations to both images and targets. Let's use the same class once again. Just run the code below as is:

In [None]:
import os
import pandas as pd
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.tv_tensors import Image, BoundingBoxes, BoundingBoxFormat, Mask
from torchvision.ops import masks_to_boxes, box_area
from torchvision.datasets import VisionDataset

class ObjDetectionDataset(VisionDataset):
    def __init__(self, image_folder, annotations=None, mask_folder=None, transforms=None):
        super().__init__(image_folder, transforms, None, None)
        # folder where images are stored
        self.image_folder = image_folder
        # path to a CSV file or pandas dataframe with annotations
        self.annotations = annotations
        # folder where masks, if any, are stored
        self.mask_folder = mask_folder
        # transforms/augmentations to be applied to images
        self.transforms = transforms

        # gets the list of all images sorted by name
        self.images = list(sorted(os.listdir(image_folder)))

        self.df_boxes = None
        assert (annotations is not None) or (mask_folder is not None), "At least one, annotations or masks, must be supplied"

        # if a CSV or dataframe was provided
        if annotations is not None:
            if isinstance(annotations, str):
                self.df_boxes = pd.read_csv(annotations)
            else:
                self.df_boxes = annotations
            # makes sure the annotations are in the XYXY format
            assert len(set(self.df_boxes.columns).intersection({'filename', 'xmin', 'ymin', 'xmax', 'ymax'})) == 5, "Missing columns in CSV"
            # only annotated images are considered - it overwrites the images attribute
            self.images = self.df_boxes['filename'].unique().tolist()

        self.masks = None
        # if there are masks, makes sure each image has its own mask
        if mask_folder is not None:
            self.masks = list(sorted(os.listdir(mask_folder)))
            assert len(self.masks) == len(self.images), "Every image must have one, and only one, mask"

    def __getitem__(self, idx):
        image_filename = os.path.join(self.image_folder, self.images[idx])
        image_tensor = read_image(image_filename, mode=ImageReadMode.RGB)
        # gets the last two dimensions, height and width
        image_hw = image_tensor.shape[-2:]

        labels = None
        # If there are masks, we work with them
        if self.masks is not None:
            mask_filename = os.path.join(self.mask_folder, self.masks[idx])
            merged_mask = read_image(mask_filename)
            # checks how many instances are present in the mask
            # assumes the first one, zero, is background only
            instances = merged_mask.unique()[1:]

            # splits the merged mask, so there's one mask per instance
            masks = (merged_mask == instances.view(-1, 1, 1))
            # converts masks into boxes
            boxes = masks_to_boxes(masks)
            # uses the tv_tensors namespace to wrap the masks
            wrapped_masks = Mask(masks)
        # No masks, so we fall back to a DF of annotated boxes
        else:
            # retrieves the annotations for the corresponding image
            annots = self.df_boxes.query(f'filename == "{self.images[idx]}"')
            # keeps only the coordinates
            boxes = torch.as_tensor(annots.dropna()[['xmin', 'ymin', 'xmax', 'ymax']].values)
            # if there are labels available as well, retrieves them
            if 'label' in annots.columns:
                labels = torch.as_tensor(annots.dropna()['label'].values)
            wrapped_masks = None

        # uses the tv_tensors namespace to wrap the boxes
        wrapped_boxes = BoundingBoxes(boxes, format=BoundingBoxFormat.XYXY, canvas_size=image_hw)
        num_objs = len(boxes)

        if len(boxes):
            if labels is None:
                # if there are no labels, we assume every instance is of
                # the same, and only, class
                labels = torch.ones((num_objs,), dtype=torch.int64)
            area = box_area(wrapped_boxes)
        else:
            # Only background, no boxes
            labels = torch.zeros((0,), dtype=torch.int64)
            area = torch.tensor([0.], dtype=torch.float32)

        # creates a target dictionary with all elements
        target = {
            'boxes': wrapped_boxes,
            'area': area,
            'labels': labels,
            'image_id': torch.tensor([idx+1]),
            'iscrowd': torch.zeros((num_objs,), dtype=torch.int64)
        }
        # if there are masks, includes them
        if wrapped_masks is not None:
            target['masks'] = wrapped_masks

        # uses the tv_tensors namespace to wrap the image
        image = Image(image_tensor)

        # if there are transformations/augmentations
        # apply them to the image and target
        if self.transforms is not None:
            image, target = self.transforms(image, target)

        return image, target

    def __len__(self):
        return len(self.images)

Create two datasets, one for training, and one for validation, and assign the corresponding transformations to each one of them:

In [None]:
datasets = {}

# write your code here
datasets['train'] = ...
datasets['val'] = ...

len(datasets['train']), len(datasets['val'])

Next, create two data loaders, one for each dataset. You should shuffle the training set, but not the validation one. Also, keep the batch size small (e.g. two) to avoid out-of-memory issues on the GPU.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}

# write your code here
dataloaders['train'] = ...
dataloaders['val'] = ...

Try fetching a mini-batch from your training set. Just run the code below as is:

In [None]:
next(iter(dataloaders['train']))

Did you get an error? No? Consider yourself lucky! At some point, it will raise an error, whenever an image with either zero or more than one annotation is included in the mini-batch.

The collate function is the function used by the data loader to patch together multiple data points into a mini-batch. If your dataset is nothing but tensors, that's trivial: it only has to stack them up. Stacking them up, though, assumes every data point has exactly the same shape for its features.

In object detection models, though, this is not guaranteed to be the case: one image may have no boxes, another one may have three boxes, and yet another one may have only one. Those cannot be stacked together.

The solution, fortunately, is pretty easy, and it looks like this:

```python
lambda batch: tuple(zip(*batch))
```
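
To see what this function does, the sketch below applies it to a toy mini-batch. Plain strings and lists stand in for image tensors and boxes (an assumption made to keep the example dependency-free); the same logic applies to the `(image, target)` tuples returned by the dataset.

```python
def collate_fn(batch):
    """Regroups a list of (image, target) pairs into (images, targets)."""
    return tuple(zip(*batch))

# Toy mini-batch: three data points with zero, one, and three boxes
batch = [
    ('image_0', {'boxes': [], 'labels': []}),
    ('image_1', {'boxes': [[10, 20, 50, 80]], 'labels': [3]}),
    ('image_2', {'boxes': [[0, 0, 5, 5], [1, 1, 9, 9], [2, 2, 8, 8]],
                 'labels': [1, 1, 2]}),
]

images, targets = collate_fn(batch)
print(images)
print([len(t['boxes']) for t in targets])
```

Nothing is stacked: images and targets are simply regrouped into two tuples, so each target keeps its own number of boxes.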

Pass the lambda function above as the `collate_fn` argument of your data loaders, and try again:

In [None]:
dataloaders = {}

# write your code here
dataloaders['train'] = ...
dataloaders['val'] = ...

next(iter(dataloaders['train']))