# Lab Instructions

In each lab, you're given a task, such as building a dataset, training a model, or writing a training loop. We provide the code structured so that you can fill in the blanks using what you learned in the chapters that precede the lab. You should be able to find suitable snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide partial code to ensure the right variables are populated and the code that follows runs as expected:

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blank is shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

### Installation Notes

The `xml_to_csv()` function can be imported from a set of helper functions we're making available for your convenience. You can download the file from the following link:

```
https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/helper_functions.py
```

In Google Colab, you can run the following command to download the file:

In [None]:
!wget https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/helper_functions.py

Once the file is downloaded, you only need to import the required helper function:

In [3]:
from helper_functions import xml_to_csv

## 13.4 Lab 5B: Fine-Tuning Object Detection Models

In this lab, you'll build a dataset, including data augmentation, and fine-tune a custom object detection model by replacing its standard backbone with a different computer vision model. In the end, you'll evaluate the model using metrics from the COCO challenge.

### 13.4.1 Recap

Let's recap what we did in the last lab to load and preprocess our dataset, so we can use it to fine-tune an object detection model in PyTorch. You may run all the cells in this section as they are.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

First, we loaded the dataset:

In [None]:
from torchvision.datasets import OxfordIIITPet

root_folder = './pets'
pets = OxfordIIITPet(root=root_folder, split='trainval', target_types=['category', 'segmentation'], download=True)

Then, we loaded its annotations into a dataframe, and built a dictionary of categories:

In [None]:
import pandas as pd

xml_df = xml_to_csv(f'{root_folder}/oxford-iiit-pet/annotations/xmls')

trainval_df = pd.read_csv('./pets/oxford-iiit-pet/annotations/trainval.txt', sep=' ', header=None, names=['filename', 'class_id', 'species', 'breed_id'])
trainval_df['category'] = trainval_df['filename'].apply(lambda v: ' '.join([w.capitalize()
                                                                            for w in v.split('_')[:-1]]))
trainval_df['filename'] = trainval_df['filename'].apply(lambda v: f'{v}.jpg')
annotations_df = trainval_df.merge(xml_df, how='left', on='filename')

colnames = ['filename', 'label', 'category', 'width', 'height', 'xmin', 'ymin', 'xmax', 'ymax']
annotations_df = annotations_df.rename(columns={'class_id': 'label'})[colnames]

id2label = dict(annotations_df[['label', 'category']].drop_duplicates().values)
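
To make the `category` and `id2label` logic concrete, here is a minimal plain-Python sketch of the same string manipulation, without pandas. The filename stems and label values are illustrative stand-ins chosen for this sketch, and `filename_to_category()` is a hypothetical helper, not part of the lab's helper functions.

```python
def filename_to_category(stem):
    """Drops the trailing image index and capitalizes the remaining words."""
    return ' '.join(w.capitalize() for w in stem.split('_')[:-1])

# Illustrative (stem, label) pairs in the style of trainval.txt
rows = [('Abyssinian_100', 1), ('Abyssinian_101', 1), ('american_bulldog_10', 2)]

# Deduplicated label -> category mapping, similar in spirit to id2label
id2label_demo = {label: filename_to_category(stem) for stem, label in rows}
print(id2label_demo)
```

Repeated labels collapse into a single entry, which is the same effect `drop_duplicates()` has in the cell above.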

Next, we used the annotations to split the dataset into training and validation sets using the filenames:

In [None]:
import numpy as np

np.random.seed(11)

fnames = sorted(annotations_df['filename'].unique())
np.random.shuffle(fnames)

is_train = annotations_df['filename'].isin(fnames[:3000])

annotations = {}
annotations['train'] = annotations_df[is_train]
annotations['val'] = annotations_df[~is_train]
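
Splitting by filename, rather than by annotation row, keeps every box of a given image in the same set. The sketch below illustrates the idea on made-up rows; it uses the standard library's `random` module instead of NumPy purely to stay self-contained, so the resulting order will not match the lab's split.

```python
import random

# Hypothetical annotation rows: (filename, box index); an image may have several boxes
rows = [('a.jpg', 0), ('a.jpg', 1), ('b.jpg', 0),
        ('c.jpg', 0), ('d.jpg', 0), ('d.jpg', 1)]

rng = random.Random(11)
fnames = sorted({fname for fname, _ in rows})
rng.shuffle(fnames)

# The first three shuffled filenames go to training, the rest to validation
train_fnames = set(fnames[:3])
train_rows = [r for r in rows if r[0] in train_fnames]
val_rows = [r for r in rows if r[0] not in train_fnames]

# No image should appear in both sets
overlap = {f for f, _ in train_rows} & {f for f, _ in val_rows}
print(len(train_rows), len(val_rows), overlap)
```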

Preprocessing images and performing data augmentation is a big part of training a model, so we created a function that applies the transformations to an image, depending on which dataset it belongs to:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

In [None]:
import torch
from collections import defaultdict
from torchvision.transforms import v2 as transforms

augmenting = [
    transforms.RandomHorizontalFlip(),
]

basic = [
    # transforms.ToTensor() is deprecated in v2, so we
    # replace it with the two transforms below
    transforms.ToImage(),
    transforms.ToDtype(torch.float32, scale=True),
    # it is a no-op in this flow, but it may
    # be necessary if we use different augmentations
    transforms.SanitizeBoundingBoxes(),
    # last op from transforms_fn
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
]

def get_transform(train):
    # Starts by applying transformations to
    # get images to the right size
    ops = [
        # from transforms_fn
        transforms.Resize(232, antialias=True),
        transforms.CenterCrop(224)
    ]
    # Only does augmenting in training mode
    if train:
        ops.extend(augmenting)
    # Basic transforms: to tensor, sanitizing, and normalizing
    ops.extend(basic)
    return transforms.Compose(ops)
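
Geometric augmentations must change the bounding boxes along with the pixels, which is why the dataset passes both the image and the target through the transforms. As a rough illustration of the arithmetic behind a horizontal flip in the XYXY format, consider the sketch below; `hflip_box_xyxy()` is a hypothetical helper written for this example, not a torchvision function.

```python
def hflip_box_xyxy(box, image_width):
    """Mirrors an (xmin, ymin, xmax, ymax) box around the vertical center line."""
    xmin, ymin, xmax, ymax = box
    # the old right edge becomes the new left edge, and vice versa
    return (image_width - xmax, ymin, image_width - xmin, ymax)

box = (30, 40, 100, 160)
flipped = hflip_box_xyxy(box, image_width=224)
print(flipped)
```

Flipping twice returns the original box, which is a quick sanity check for this kind of transformation.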

In the previous lab, we built a custom dataset that can handle the nitty-gritty details of organizing the images and their corresponding annotations, as well as applying transformations to images and targets:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

In [None]:
import os
import pandas as pd
import torch
from torchvision.io import read_image, ImageReadMode
from torchvision.tv_tensors import Image, BoundingBoxes, BoundingBoxFormat, Mask
from torchvision.ops import masks_to_boxes, box_area
from torchvision.datasets import VisionDataset

class ObjDetectionDataset(VisionDataset):
    def __init__(self, image_folder, annotations=None, mask_folder=None, transforms=None):
        super().__init__(image_folder, transforms, None, None)
        # folder where images are stored
        self.image_folder = image_folder
        # path to a CSV file or pandas dataframe with annotations
        self.annotations = annotations
        # folder where masks, if any, are stored
        self.mask_folder = mask_folder
        # transforms/augmentations to be applied to images
        self.transforms = transforms

        # gets the list of all images sorted by name
        self.images = list(sorted(os.listdir(image_folder)))

        self.df_boxes = None
        assert (annotations is not None) or (mask_folder is not None), "At least one, annotations or masks, must be supplied"

        # if a CSV or dataframe was provided
        if annotations is not None:
            if isinstance(annotations, str):
                self.df_boxes = pd.read_csv(annotations)
            else:
                self.df_boxes = annotations
            # makes sure the annotations are in the XYXY format
            assert len(set(self.df_boxes.columns).intersection({'filename', 'xmin', 'ymin', 'xmax', 'ymax'})) == 5, "Missing columns in CSV"
            # only annotated images are considered - it overwrites the images attribute
            self.images = self.df_boxes['filename'].unique().tolist()

        self.masks = None
        # if there are masks, makes sure each image has its own mask
        if mask_folder is not None:
            self.masks = list(sorted(os.listdir(mask_folder)))
            assert len(self.masks) == len(self.images), "Every image must have one, and only one, mask"

    def __getitem__(self, idx):
        image_filename = os.path.join(self.image_folder, self.images[idx])
        image_tensor = read_image(image_filename, mode=ImageReadMode.RGB)
        # gets the last two dimensions, height and width
        image_hw = image_tensor.shape[-2:]

        labels = None
        # If there are masks, we work with them
        if self.masks is not None:
            mask_filename = os.path.join(self.mask_folder, self.masks[idx])
            merged_mask = read_image(mask_filename)
            # checks how many instances are present in the mask
            # assumes the first one, zero, is background only
            instances = merged_mask.unique()[1:]

            # splits the merged mask, so there's one mask per instance
            masks = (merged_mask == instances.view(-1, 1, 1))
            # converts masks into boxes
            boxes = masks_to_boxes(masks)
            # uses the tv_tensors namespace to wrap the masks
            wrapped_masks = Mask(masks)
        # No masks, so we fall back to a DF of annotated boxes
        else:
            # retrieves the annotations for the corresponding image
            annots = self.df_boxes.query(f'filename == "{self.images[idx]}"')
            # keeps only the coordinates
            boxes = torch.as_tensor(annots.dropna()[['xmin', 'ymin', 'xmax', 'ymax']].values)
            # if there are labels available as well, retrieves them
            if 'label' in annots.columns:
                labels = torch.as_tensor(annots.dropna()['label'].values)
            wrapped_masks = None

        # uses the tv_tensors namespace to wrap the boxes
        wrapped_boxes = BoundingBoxes(boxes, format=BoundingBoxFormat.XYXY, canvas_size=image_hw)
        num_objs = len(boxes)

        if len(boxes):
            if labels is None:
                # if there are no labels, we assume every instance is of
                # the same, and only, class
                labels = torch.ones((num_objs,), dtype=torch.int64)
            area = box_area(wrapped_boxes)
        else:
            # Only background, no boxes
            labels = torch.zeros((0,), dtype=torch.int64)
            area = torch.tensor([0.], dtype=torch.float32)

        # creates a target dictionary with all elements
        target = {
            'boxes': wrapped_boxes,
            'area': area,
            'labels': labels,
            'image_id': torch.tensor([idx+1]),
            'iscrowd': torch.zeros((num_objs,), dtype=torch.int64)
        }
        # if there are masks, includes them
        if wrapped_masks is not None:
            target['masks'] = wrapped_masks

        # uses the tv_tensors namespace to wrap the image
        image = Image(image_tensor)

        # if there are transformations/augmentations
        # apply them to the image and target
        if self.transforms is not None:
            image, target = self.transforms(image, target)

        return image, target

    def __len__(self):
        return len(self.images)

In [None]:
datasets = {}
datasets['train'] = ObjDetectionDataset(image_folder='./pets/oxford-iiit-pet/images',
                                        annotations=annotations['train'],
                                        transforms=get_transform(True))
datasets['val'] = ObjDetectionDataset(image_folder='./pets/oxford-iiit-pet/images',
                                      annotations=annotations['val'],
                                      transforms=get_transform(False))
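
For orientation, the sketch below mimics the structure of one target dictionary using plain lists in place of tensors. The boxes are made up, and `box_area_xyxy()` is a hypothetical stand-in for the area computation on XYXY boxes.

```python
def box_area_xyxy(box):
    """Area of an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return (xmax - xmin) * (ymax - ymin)

# Two made-up boxes belonging to the image at index 0
boxes = [(30, 40, 100, 160), (10, 10, 60, 35)]
idx = 0

target_demo = {
    'boxes': boxes,
    'area': [box_area_xyxy(b) for b in boxes],
    'labels': [1] * len(boxes),   # single-class fallback used when no labels exist
    'image_id': idx + 1,
    'iscrowd': [0] * len(boxes),
}
print(target_demo['area'])
```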

Once the datasets were ready, we created data loaders so we can load mini-batches of data, one at a time:

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
dataloaders['train'] = DataLoader(datasets['train'], batch_size=2, shuffle=True, collate_fn=lambda batch: tuple(zip(*batch)))
dataloaders['val'] = DataLoader(datasets['val'], batch_size=2, shuffle=False, collate_fn=lambda batch: tuple(zip(*batch)))
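
The custom `collate_fn` is needed because images can have different numbers of boxes, so their targets cannot simply be stacked into one tensor. Instead, it regroups a list of `(image, target)` pairs into a tuple of images and a tuple of targets. A minimal sketch with placeholder strings standing in for image tensors:

```python
def collate_fn(batch):
    """Turns [(img0, tgt0), (img1, tgt1)] into ((img0, img1), (tgt0, tgt1))."""
    return tuple(zip(*batch))

# Placeholder "images" and made-up targets with different numbers of boxes
batch = [
    ('image_0', {'boxes': [(30, 40, 100, 160)]}),
    ('image_1', {'boxes': [(10, 10, 60, 35), (5, 5, 20, 20)]}),
]

images, targets = collate_fn(batch)
print(images)
print([len(t['boxes']) for t in targets])
```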

### 13.4.2 Model

Now, the fun part begins: replacing the backbone of a pretrained object detection model!

You will have to create a brand new instance of the `FasterRCNN` class using the required arguments to make your model work:
   - `backbone`: your feature extractor
   - `rpn_anchor_generator`: the new anchor generator
   - `box_roi_pool`: the new ROI pooler
   - `num_classes`: the number of classes for your task

You already know the number of classes, but don't forget to add one more for the negative case, that is, whenever there's no object in the image. This class (the background, if you will) is usually assigned index zero, which is why the class indices from the dataset start at one.

You also have the weights for the backbone model, but you need to create a model that returns its features only (the "headless" model, as seen in Chapter 2). The model must return either a feature map dictionary (if you're extracting features from multiple layers of your backbone) or a single tensor (if you're extracting a single set of features). Also, keep in mind that:
   - some models (like [MobileNet V2](https://pytorch.org/hub/pytorch_vision_mobilenet_v2/), our suggested choice of new backbone) expose their feature extractor through a single attribute (`features` in the case of MobileNet)
   - for more complex models, you can use `create_feature_extractor()` or `IntermediateLayerGetter` to build your backbone

Use the pretrained weights to create an instance of your backbone model, and use one of the alternatives above so that it returns only its features:

In [None]:
from torchvision.models.detection import FasterRCNN
from torchvision.models import mobilenet_v2, get_weight

num_classes = len(id2label) + 1

weights = get_weight('MobileNet_V2_Weights.DEFAULT')
mobilenet = mobilenet_v2(weights=weights)
new_backbone = mobilenet.features

Double-check that your model returns what you expect by feeding it a random tensor in the shape of a mini-batch (make sure the height and width of your random images match the expected input of your model):

In [None]:
dummy_x = torch.randn(2, 3, 224, 224)
dummy_output = new_backbone(dummy_x)

You shouldn't get any errors, and your dummy output must be either a single tensor, or a feature map dictionary. Check the shape of each returned tensor (one or more), and make sure they all have the same number of output channels. This is required by the Faster R-CNN architecture.

In [None]:
out_channels = dummy_output.shape[1]
out_channels
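
If you want to know what shape to expect before running the cell, the arithmetic can be sketched by hand. The helper below is hypothetical and assumes torchvision's standard MobileNet V2 (width multiplier 1.0), whose `features` block contains five stride-2 convolutions (3x3 kernels, padding 1) and ends with 1280 channels.

```python
import math

def mobilenet_v2_feature_shape(batch_size, height, width):
    """Expected output shape of the `features` block, under the stated assumptions."""
    for _ in range(5):  # each stride-2 stage roughly halves the spatial size
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return (batch_size, 1280, height, width)

expected_shape = mobilenet_v2_feature_shape(2, 224, 224)
print(expected_shape)
```

If the shape of `dummy_output` differs from this estimate, it's worth checking which part of the model you're actually calling.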

Assign the number of output channels to the instance of your backbone as an `out_channels` attribute:

In [None]:
# write your code here
new_backbone.out_channels = 1280

Create an instance of the `AnchorGenerator` class, and make sure each argument (`sizes` and `aspect_ratios`) is a tuple containing as many elements as the number of feature maps returned by your backbone.

Each element is a tuple itself, and may have as many elements as you wish. For more details, refer to the "Region Proposal Network" subsection.

In [None]:
from torchvision.models.detection.rpn import AnchorGenerator

sizes = ((32, 64, 128, 256, 512),)
aspect_ratios = ((0.5, 1.0, 2.0),)

# write your code here
anchor_generator = AnchorGenerator(sizes=sizes, aspect_ratios=aspect_ratios)
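
To get a feel for what these arguments produce, the sketch below enumerates the anchor shapes for a single feature map: every combination of size and aspect ratio yields one anchor per location. It assumes the aspect ratio is interpreted as height over width, with the anchor's area kept at roughly `size ** 2`; `anchor_shapes()` is a hypothetical helper, not the library's implementation.

```python
import math

sizes = ((32, 64, 128, 256, 512),)
aspect_ratios = ((0.5, 1.0, 2.0),)

def anchor_shapes(sizes_per_map, ratios_per_map):
    """Returns (width, height) for each size/aspect-ratio combination."""
    shapes = []
    for ratio in ratios_per_map:
        for size in sizes_per_map:
            h = size * math.sqrt(ratio)
            w = size / math.sqrt(ratio)
            shapes.append((round(w, 1), round(h, 1)))
    return shapes

shapes = anchor_shapes(sizes[0], aspect_ratios[0])
anchors_per_location = len(shapes)
print(anchors_per_location)
```

With five sizes and three aspect ratios, there are fifteen anchors at every location of the feature map.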

Create an instance of the `MultiScaleRoIAlign` class, and make sure it points to at least one valid feature map as returned by our backbone model. For more details, refer to the "Regions of Interest" subsection.

In [None]:
from torchvision.ops import MultiScaleRoIAlign

output_size = 7
sampling_ratio = 2

# Tip: a backbone that returns a single tensor has no dictionary of its own, but
# Faster R-CNN wraps that tensor in a dictionary under the key "0"
# write your code here
roi_pooler = MultiScaleRoIAlign(featmap_names=['0'], output_size=output_size, sampling_ratio=sampling_ratio)
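
The point of the ROI pooler is that every proposal, whatever its size, is resampled to the same fixed grid. As a back-of-the-envelope sketch, assuming the 1280-channel MobileNet V2 backbone suggested above:

```python
out_channels = 1280   # channels of the backbone's feature map (assumed MobileNet V2)
output_size = 7       # side of the pooled grid

# Shape of the pooled features for a single proposal
pooled_shape = (out_channels, output_size, output_size)

# Number of values per proposal once the pooled features are flattened
flattened_features = out_channels * output_size * output_size
print(pooled_shape, flattened_features)
```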

Now, put everything together as your own Faster R-CNN model:

In [None]:
# write your code here
model = FasterRCNN(new_backbone,
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler,
                   num_classes=num_classes)

There you go!

#### 13.4.2.1 Double-Checking the Model

To make sure your configuration is working fine, you can feed your new Faster R-CNN model a random tensor representing a dummy mini-batch once again. If you don't get any errors back, you're likely good to go!

Don't forget to send each tensor in your mini-batch, individually, to the device. You cannot simply send them all at once as you used to do before.

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

model.to(device)
model.train()

images, targets = next(iter(dataloaders['train']))

# Send images and targets to device
# write your code here
images = list(image.to(device) for image in images)
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

# Make predictions using your model - you should get a dict of losses back
# write your code here
output = model(images, targets)
output
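
In training mode, the output should be a dictionary of losses rather than predictions. The sketch below uses the four keys torchvision's Faster R-CNN is expected to return, with made-up values standing in for the scalar tensors, to show how they combine into a single number.

```python
# Hypothetical loss values; in the lab, each value is a scalar tensor
loss_dict_demo = {
    'loss_classifier': 0.9,    # classification of each proposal
    'loss_box_reg': 0.3,       # refinement of the predicted boxes
    'loss_objectness': 0.7,    # RPN: object vs. background
    'loss_rpn_box_reg': 0.1,   # RPN: refinement of the anchors
}

# A single value is needed to compute gradients
total_loss = sum(loss for loss in loss_dict_demo.values())
print(round(total_loss, 4))
```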

### 13.4.3 Training Loop

It is time to write a real training loop now! You can use the dummy loop as a template and build on top of it, once you're happy with your schedulers.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step3.png)

Don't forget to send every tensor, individually, to the same device as the model. Also, keep in mind that the model returns a dictionary with many separate losses. It is your job to sum them all up to compute gradients based on the total.

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR, LinearLR

optimizer = optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005)

# Decreases the learning rate by 10x every 3 epochs
# write your code here
lr_scheduler = StepLR(optimizer, step_size=3, gamma=0.1)

# Warms up the learning rate from 0.005 / 1000 to 0.005 over the
# first iterations (1,000 at most) of the first epoch
warmup_factor = 1.0 / 1000
warmup_iters = min(1000, len(dataloaders['train']) - 1)

# write your code here
lr_scheduler2 = LinearLR(optimizer, start_factor=warmup_factor, total_iters=warmup_iters)
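
To preview the learning rates these schedulers should produce, the closed-form sketch below approximates each one in isolation. It assumes the training loader has more than 1,000 batches, so `warmup_iters` is 1,000; `warmup_lr()` and `step_lr()` are hypothetical helpers, and since both schedulers are attached to the same optimizer in the lab, the effective learning rate is the combination of the two.

```python
base_lr = 0.005
warmup_factor = 1.0 / 1000
warmup_iters = 1000  # assumption: more than 1,000 batches per epoch

def warmup_lr(iteration):
    """Learning rate after `iteration` warm-up steps (linear ramp)."""
    progress = min(iteration, warmup_iters) / warmup_iters
    return base_lr * (warmup_factor + (1.0 - warmup_factor) * progress)

def step_lr(epoch, step_size=3, gamma=0.1):
    """Learning rate during `epoch` under a step decay."""
    return base_lr * gamma ** (epoch // step_size)

warmup_preview = [warmup_lr(i) for i in (0, 500, 1000)]
epoch_preview = [step_lr(e) for e in range(5)]
print(warmup_preview)
print(epoch_preview)
```

Under these assumptions, the warm-up ends well within the first epoch, and the tenfold decay kicks in at the fourth epoch (index 3).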

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

In [None]:
num_epochs = 5

model.to(device)

for epoch in range(num_epochs):
    for i, (images, targets) in enumerate(dataloaders['train']):
        # Send images and targets to device
        # write your code here
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        # Set your model's mode
        # write your code here
        model.train()

        # Call the model to get a loss dict back
        # write your code here
        loss_dict = model(images, targets)
        
        if not (i % 50):
            print([(k, v.item()) for k, v in loss_dict.items()])

        # You have many losses in the dict, but you can only
        # call backward on a single value, so you must
        # add them up
        # write your code here
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        if epoch == 0:
            lr_scheduler2.step()

    lr_scheduler.step()

Training this model takes quite a while...

Once it's finished training, you can save it to disk for later use:

In [None]:
torch.save(model.state_dict(), 'mobilenet_v2_pets.pth')