# Project 6: Object Recognition
---

## Assignments

Complete your report by working through the following assignments.

**Introduction.** A short summary of the goals of the project and an outline of the sections of the report.

**Section 1. Data loading and visualization**
* 1a. Download the data and visualize an image from the dataset;
* 1b. Write a custom dataset for Penn-Fudan;
* 1c. Visualize images with ground truth.

**Section 2. Data preparation**
* 2a. Define a transform function;
* 2b. Define the training set and test set.

**Section 3. Preparation of the training**
* 3a. The NN model: download and adjust;
* 3b. Define hyperparameters.

**Section 4. Training execution**
* 4a. Training;
* 4b. Retrieve the loss data during training.

**Section 5. Test**
* 5a. Plot training losses;
* 5b. Visualize the predicted bounding boxes.

**Section 6. Results, observations and conclusions**

**Full code**



## Introduction

TODO

## 1. Data loading and visualization

Install the `pycocotools` library, which computes the COCO evaluation metrics (average precision and recall at different intersection-over-union thresholds). Then download the torchvision detection helper scripts and import the libraries.
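
Intersection over union (IoU) measures how well a predicted box overlaps a ground truth box: the area shared by the two boxes divided by the area they cover together. The COCO metrics are computed by `pycocotools` itself; the function below is only a minimal, dependency-free sketch of the idea, and `iou_xyxy` is a hypothetical helper name used for illustration.

```python
def iou_xyxy(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # corners of the overlapping rectangle
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
predicted = (5, 5, 15, 15)
print(iou_xyxy(ground_truth, predicted))  # 25 / 175, about 0.143
```

An IoU of 1 means the two boxes coincide, while 0 means they do not overlap at all.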

In [None]:
!pip install cython
!pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

In [None]:
%%shell

# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.9.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../

In [None]:
# imports
from PIL import Image, ImageDraw, ImageFont
import matplotlib.pyplot as plt
import random
from torchvision.datasets.utils import download_and_extract_archive
import os
import numpy as np
import torch
import torch.utils.data
from engine import train_one_epoch, evaluate
import utils
import transforms
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
import pickle

### 1a. Download the data and visualize an image from the dataset
In this project we will use the [Penn-Fudan Database for Pedestrian Detection and Segmentation](https://www.cis.upenn.edu/~jshi/ped_html/).



#### Download and extract the data

In [None]:
## TODO: download the dataset and extract it in the current directory.
#        The archive is located here:
#             https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
# hints: * use 'download_and_extract_archive(url, root)' (see Project 4);
#        * to specify the current directory use 'root = "."'.


#### Visualize an image and its corresponding segmentation mask

Using PIL, open an image from the dataset.

In [None]:
## TODO: open the PennFudanPed/PNGImages/ folder, look at the name
#        of the images and change the 'image_path' string to visualize 
#        other images from the dataset.

image_path = 'PennFudanPed/PNGImages/FudanPed00007.png'
Image.open(image_path)

### 1b. Write a custom dataset for Penn-Fudan

Write a custom class that inherits from `torch.utils.data.Dataset` for defining the Penn-Fudan dataset.
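
The class in the next cell derives one bounding box per pedestrian from an instance mask, in which every pedestrian is painted with its own integer id. The following sketch reproduces that step on a tiny hand-made mask so that the broadcasting trick is easier to follow. The mask values are invented for illustration and only NumPy is assumed.

```python
import numpy as np

# toy instance mask: 0 is background, 1 and 2 are two separate "pedestrians"
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 2, 0],
])

obj_ids = np.unique(mask)[1:]  # drop the background id 0
# broadcasting (N, 1, 1) against (H, W) gives one boolean mask per instance
masks = mask == obj_ids[:, None, None]

boxes = []
for instance_mask in masks:
    rows, cols = np.where(instance_mask)
    # boxes are stored as [xmin, ymin, xmax, ymax]: columns are x, rows are y
    boxes.append([int(cols.min()), int(rows.min()), int(cols.max()), int(rows.max())])

print(masks.shape)  # (2, 5, 6)
print(boxes)        # [[1, 1, 2, 2], [4, 2, 4, 4]]
```

Note that with this convention an instance that is one pixel wide, like the second one here, produces a box of zero width.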

In [None]:
class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms=None):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)

        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]

        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]

        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])

        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)

        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

Let's see how the outputs are structured for this dataset by instantiating a `PennFudanDataset` object and printing the first element of the dataset.

In [None]:
## TODO: instantiate a PennFudanDataset object, call it 'dataset'.
#        Then, print the length of the dataset and an element from the dataset.
# hints: * The PennFudanDataset class accepts a path as argument, in this 
#          case we have to give it the path to the extracted folder, which
#          is 'PennFudanPed/';
#        * To print an element, index the 'dataset' object with an integer
#          (from 0 to the length of the dataset minus 1)


### 1c. Visualize images with ground truth

Define three functions: 
* `show_boxed_image` draws bounding boxes around pedestrians in an image;
* `show_area_image` draws filled rectangles over pedestrians and prints the area of each one;
* `show_mask_image` constructs a tensor with all the segmentation masks of pedestrians in the image.

In [None]:
def show_boxed_image(img):
    '''
    args: img -> one element of a PennFudanDataset, i.e. an (image, target) pair
    returns an ImageDraw image with bounding boxes around pedestrians.
    '''
    img_show = img[0] #get the PIL image
    img_coordinates = img[1].get("boxes") #get the bounding boxes coordinates

    # draw a rectangle (bounding box) following the coordinates
    for coords in img_coordinates:
        ## TODO: get the coordinates
        xmin = coords[0]
        ymin = #...
        xmax = #...
        ymax = #...
        draw = ImageDraw.Draw(img_show)
        # draw the rectangle
        # (xmin, ymin) -> upper left corner
        # (xmax, ymax) -> lower right corner
        draw.rectangle([(xmin, ymin), (xmax, ymax)], 
                       outline ="lightgreen", 
                       width=5)

    return img_show


def show_area_image(img):
    '''
    args: img -> one element of a PennFudanDataset, i.e. an (image, target) pair
    returns an ImageDraw image with area rectangles over pedestrians.
    '''
    ## TODO:
    # get the PIL image

    # get the bounding boxes coordinates

    # get the area values
    img_areas = #...

    fnt = ImageFont.truetype("~/usr/share/fonts/truetype/LiberationMono-Bold.ttf", 20)

    # draw a filled rectangle following the coordinates, and 
    # print the value of the area over each rectangle
    for i in range(len(img_areas)):
        ## TODO: get the coordinates
        # ATTENTION: you have to retrieve them differently from the previous
        #            function (look at the for loop...)
        xmin = #...
        #...

        ## TODO: draw a filled rectangle using the coordinates.
        #...

        # Draw the corresponding area over each rectangle
        draw.text(((xmax+xmin)/2-40, (ymin+ymax)/2), 
                  str(img_areas[i].item()), 
                  font = fnt,
                  fill="blue")

    return img_show


def show_mask_image(img):
    '''
    args: img -> one element of a PennFudanDataset, i.e. an (image, target) pair
    returns a tensor corresponding to the segmentation masks of pedestrians 
    in the image.
    '''
    ## TODO: get the masks tensors
    masks = #...

    # add each tensor (mask) onto the others. 
    # The result is a single tensor containing all the segmentation masks
    result = torch.zeros(masks.size()[1], masks.size()[2])
    for i in range(len(masks)):
        result += masks[i]

    return result

Create a list of four random integers.

In [None]:
## TODO: create a list containing 4 random integers


Visualize 16 images:
* 1st row: 4 random images from the dataset (use the list of 4 random numbers);
* 2nd row: the same 4 images, with bounding boxes;
* 3rd row: the same 4 images, with areas;
* 4th row: the segmentation masks of the same 4 images.

In [None]:
## TODO: visualize 4 original images.


## TODO: visualize 4 images with bounding boxes.


## TODO: visualize 4 images with areas.


## TODO: visualize 4 images with segmentation masks.



## 2. Data preparation

### 2a. Define a transform function

Write a function that converts the images to tensors and, for the training set only, augments the data by randomly flipping the images horizontally.
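
When an image is flipped horizontally, its ground truth boxes have to be mirrored too, otherwise they no longer surround the pedestrians. The `RandomHorizontalFlip` class from the copied `transforms.py` takes care of this for the target. The sketch below only illustrates the coordinate arithmetic on a single box; `hflip_box` is a hypothetical helper and not part of torchvision.

```python
def hflip_box(box, image_width):
    """Mirror an (xmin, ymin, xmax, ymax) box around the vertical image axis."""
    xmin, ymin, xmax, ymax = box
    # the old right edge becomes the new left edge, and vice versa
    return (image_width - xmax, ymin, image_width - xmin, ymax)


box = (10, 20, 40, 80)
flipped = hflip_box(box, image_width=100)
print(flipped)                              # (60, 20, 90, 80)
print(hflip_box(flipped, image_width=100))  # (10, 20, 40, 80)
```

The y coordinates are untouched, and flipping twice returns the original box.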

In [None]:
def get_transform(train):
    transforms_to_apply = []
    ## TODO: convert the PIL image into a PyTorch Tensor
    transforms_to_apply.append(
        transforms.#...
        )
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms_to_apply.append(transforms.RandomHorizontalFlip(0.5))
    return transforms.Compose(transforms_to_apply)

### 2b. Define the training set and test set

Instantiate the training and test datasets, split them into disjoint subsets, and wrap each subset in a `DataLoader`.
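
The next cell splits a shuffled list of indices into a training part and a test part, and batches the samples with `utils.collate_fn`. The sketch below imitates both steps with plain Python objects, under the assumption that the collate helper simply regroups a batch into a tuple of images and a tuple of targets, which is needed because detection images can differ in size and cannot be stacked into one tensor. The sample list and its size are placeholders.

```python
import random

# stand-in for a detection dataset: (image, target) pairs of made-up content
samples = [(f"image_{i}", {"boxes": [[0, 0, i + 1, i + 1]]}) for i in range(150)]

# shuffle the indices once with a fixed seed so the split is reproducible
indices = list(range(len(samples)))
random.Random(1).shuffle(indices)

train_indices = indices[:100]    # positions 0..99
test_indices = indices[100:120]  # positions 100..119 (the end of a slice is exclusive)


def collate_fn(batch):
    # turn [(img, target), (img, target)] into ((img, img), (target, target))
    return tuple(zip(*batch))


batch = [samples[i] for i in train_indices[:2]]
images, targets = collate_fn(batch)
print(len(train_indices), len(test_indices))  # 100 20
print(len(images), len(targets))              # 2 2
```

Because both parts are slices of the same permutation, no image can end up in the training set and the test set at once.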

In [None]:
# use our dataset and defined transformations
train_set = PennFudanDataset('PennFudanPed', get_transform(train=True))
test_set = PennFudanDataset('PennFudanPed', get_transform(train=False))

# split the dataset in train and test set
torch.manual_seed(1)
indices = torch.randperm(len(train_set)).tolist()
## TODO: using torch.utils.data.Subset, define the train_set and test_set.
# hints: * train_set will be a new subset of train_set defined above. The
#          same applies to the test_set;
#        * get the indices from the 'indices' variable defined above. The
#          train_set must contain 100 elements (e.g. the first 100 indices),
#          and the test_set must contain 20 elements (e.g. the indices from
#          100 to 119). Pay attention to how the indices are selected in 
#          list slicing.


## TODO: define training and test data loaders. Use 'DataLoader'.
#        Some arguments:
#        train_loader: batch size=2, shuffle=True, num_workers=2, 
#                      collate_fn=utils.collate_fn
#        test_loader: batch size=1, shuffle=False, num_workers=2, 
#                     collate_fn=utils.collate_fn


## TODO: check the size of the training set and the test set by printing it


## 3. Preparation of the training

### 3a. The NN model: download and adjust

Define a `get_model` function that downloads a pre-trained model. Then adjust the model's input transform to resize the images, and its last layer to output the number of classes in our dataset.

We will use a Faster R-CNN model with a ResNet-50 backbone.


In [None]:
def get_model(num_classes):
    ## TODO: load an object detection model pre-trained on COCO.
    #        Construct a Faster R-CNN model with a ResNet-50-FPN backbone by
    #        using torchvision.models.detection.fasterrcnn_resnet50_fpn().
    #        We want the model to be pre-trained.
    #        https://pytorch.org/vision/stable/models.html#torchvision.models.detection.fasterrcnn_resnet50_fpn
    model = #...

    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
   
    return model

Instantiate the model and configure it to resize the input images. The resolution of the input images must be 224x224 pixels.

In [None]:
## TODO: if available use GPU, else use CPU.


## TODO: define NUM_CLASSES (we only have 2: background and person).


## TODO: get the model using the 'get_model' function.
model = #...

# Perform input / target transformation before feeding the data to the model
grcnn = torchvision.models.detection.transform.GeneralizedRCNNTransform(min_size=224, 
                                                                        max_size=224, 
                                                                        image_mean=[0.485, 0.456, 0.406], 
                                                                        image_std=[0.229, 0.224, 0.225]
                                                                        )
model.transform = grcnn #apply the transforms

### 3b. Define hyperparameters

Implement a Stochastic Gradient Descent optimizer with a learning rate of 0.0005, a momentum factor of 0.9, and a weight decay of 0.0005. Then, construct a learning rate scheduler which decreases the learning rate by a factor of 10 every 3 epochs. Set the number of epochs to 10.
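
To see what this schedule means in practice, the learning rate in force at each epoch can be worked out by hand. The sketch below is plain arithmetic and does not use PyTorch; `step_lr` is a hypothetical helper that follows the step decay rule (multiply by `gamma` once every `step_size` epochs).

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate in force during `epoch` (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in range(10):
    print(epoch, f"{step_lr(0.0005, epoch):.1e}")
```

Epochs 0 to 2 run at 5e-4, epochs 3 to 5 at 5e-5, epochs 6 to 8 at 5e-6, and the last epoch at 5e-7.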

In [None]:
# construct the optimizer
params = [p for p in model.parameters() if p.requires_grad]
## TODO: complete the arguments.
#        https://pytorch.org/docs/stable/optim.html#torch.optim.SGD
#        learning rate of 0.0005, momentum factor of 0.9, 
#        weight decay of 0.0005.
optimizer = torch.optim.SGD(params, 
                           #...
                           )

# construct the learning rate scheduler
## TODO: complete the arguments.
#        https://pytorch.org/docs/stable/optim.html#torch.optim.lr_scheduler.StepLR
#        step size of 3, gamma of 0.1.
# parameters: step_size: period of learning rate decay;
#             gamma: multiplicative factor of learning rate decay.
# The StepLR decays the learning rate of each parameter group by gamma every step_size epochs.
lr_scheduler = torch.optim.lr_scheduler.StepLR(
                                              #...    
                                              )

## TODO: define the number of epochs (10)


## 4. Training execution

At the beginning of the notebook, we copied some helper functions from `references/detection/` that simplify training and evaluating models.

### 4a. Training

Train the model for 10 epochs, evaluating at the end of every epoch.

In [None]:
## TODO: move model to the right device


metric_collector = []

for epoch in range(num_epochs):
    ## TODO: using the 'train_one_epoch' function from 'engine', train for
    #        one epoch, printing every 50 iterations (print_freq=50).
    # hints: you can find the source code here 
    #        https://github.com/pytorch/vision/blob/master/references/detection/engine.py
    #        these are the arguments:
    #        train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq)
    metric_logger = train_one_epoch(
                                   #...
                                   )
    metric_collector.append(metric_logger)
    # update the learning rate
    lr_scheduler.step()

### 4b. Retrieve the loss data during training

Object detection training involves several loss terms. The following values are logged during training:
* `loss`: the sum of all losses;
* `loss_classifier`: measures how well the detection head classifies the objects inside the proposed boxes;
* `loss_box_reg`: measures how well the detection head regresses the coordinates of the ground truth bounding boxes;
* `loss_objectness`: measures how well the region proposal network separates regions that contain an object from background;
* `loss_rpn_box_reg`: measures how well the region proposal network regresses the coordinates of the region proposals.

Retrieve the five loss metrics from the `metric_collector` and save them in a `metrics` list.
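
The cell below extracts each value by converting a meter to a string and keeping the first number. The following sketch shows that parsing step in isolation. The strings are invented and only assume the "value (value)" shape mentioned in the comments of the cell; `first_number` is a hypothetical helper.

```python
def first_number(meter_text):
    """Return the leading number of a string such as '0.8763 (0.9120)'."""
    # split once on the first space and keep the part before it
    return float(meter_text.split(" ", 1)[0])


# hypothetical strings for one epoch, in the shape printed by the logger
epoch_meters = {
    "loss_classifier": "0.1200 (0.1500)",
    "loss_box_reg": "0.2000 (0.2300)",
    "loss_objectness": "0.0100 (0.0200)",
    "loss_rpn_box_reg": "0.0050 (0.0070)",
}

parts = {name: first_number(text) for name, text in epoch_meters.items()}
print(parts)
```

The same function can be applied to every meter, so one loop over the metric names is enough to fill all five lists.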

In [None]:
loss_metric = []
loss_classifier_metric = []
loss_box_reg_metric = []
loss_objectness_metric = []
loss_rpn_box_reg_metric = []

# str.split(sep, maxsplit) splits a string at most 'maxsplit' times;
# the default maxsplit=-1 splits at every occurrence of 'sep'.
# A meter prints as e.g. '0.8763 (0.8763)', so a single split on ' '
# is enough to isolate the first number: .split(' ', 1)[0].

for i in range(len(metric_collector)):
    # get the loss value
    loss = float(str(metric_collector[i].__getattr__("loss")).split(' ', 1)[0])
    # add the loss value to the 'loss_metric' list
    loss_metric.append(loss)

    ## TODO: get the loss_classifier, loss_box_reg, loss_objectness and
    #        loss_rpn_box_reg, and add them to their lists.
    

# list of lists of all losses
metrics = [loss_metric, 
           loss_classifier_metric, 
           loss_box_reg_metric, 
           loss_objectness_metric, 
           loss_rpn_box_reg_metric]

Save the metrics and the model.

In [None]:
# save the metrics (list) to a pickle file
with open('metrics.pickle', 'wb') as f:
    pickle.dump(metrics, f)

## TODO: save the model using 'state_dict()'.


## 5. Test

Load the metrics and the model.

In [None]:
# load the list of metrics
with open('metrics.pickle', 'rb') as f:
    metrics = pickle.load(f)

# load the model
loaded_model = get_model(num_classes = 2)
## TODO: * load the model using 'load_state_dict()';
#        * put the model in evaluation mode.
loaded_model.#...


### 5a. Plot training losses

Plot the data of the losses during training.

In [None]:
## TODO: plot the 5 curves of the losses (loss, loss_classifier, loss_box_reg, 
#        loss_objectness and loss_rpn_box_reg) in a 2x3 grid.



### 5b. Visualize the predicted bounding boxes

Give 8 images to the model and visualize the predicted bounding boxes.
* The first 4 images (first row) must show the ground truth bounding boxes and ALL the predicted bounding boxes;
* The last 4 images (second row) must show the ground truth bounding boxes and the predicted bounding boxes with a score greater than 0.8.

The following code selects the first image from the test set and shows it with the ground truth bounding boxes and the predicted bounding boxes with a score greater than 0.8. Complete it and then modify it following the assignment.
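
In evaluation mode the model returns one dictionary per image with the keys `boxes`, `labels` and `scores`. The sketch below shows the thresholding step on its own, with plain lists standing in for tensors. The numbers are invented and `keep_confident` is a hypothetical helper.

```python
def keep_confident(prediction, threshold=0.8):
    """Return the (box, score) pairs whose score is above the threshold."""
    return [
        (box, score)
        for box, score in zip(prediction["boxes"], prediction["scores"])
        if score > threshold
    ]


# hypothetical output for one image
toy_prediction = {
    "boxes": [[12.0, 30.0, 95.0, 210.0], [100.0, 28.0, 160.0, 205.0], [5.0, 5.0, 20.0, 40.0]],
    "labels": [1, 1, 1],
    "scores": [0.98, 0.81, 0.35],
}

for box, score in keep_confident(toy_prediction):
    print(box, score)
```

Setting `threshold=0.0` keeps every box, which corresponds to the row that shows all predictions.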

In [None]:
# get an image and its bounding box from the test set
idx = 0 #index of the first image of the test set
img, _ = test_set[idx]
bounding_boxes = test_set[idx][1]["boxes"] #ground truth

# get predictions
prediction = loaded_model([img]) #predicted bounding boxes

In [None]:
## TODO: print 'bounding_boxes' and 'prediction' to analyze them and see
#        what they look like.


In [None]:
# transform the image from tensor to PIL
img_PIL = torchvision.transforms.ToPILImage()(img.squeeze(0))

# draw image
draw = ImageDraw.Draw(img_PIL)

# draw ground truth
for box in range(len(bounding_boxes)): #for each bounding box in the ground truth
    # draw it on the image
    draw.rectangle([
                   ## TODO: coordinates of the bounding box 
                   (bounding_boxes[box][0], #...
                   ], 
                   ## TODO: color of the outline (green) and width
                   )

# draw prediction
predicted_boxes = prediction[0]["boxes"]
predicted_scores = prediction[0]["scores"]

for box in range(len(predicted_boxes)):
    # get the predicted bounding boxes
    boxes = predicted_boxes[box].detach().numpy()
    # get the scores (percentage) of the predicted bounding boxes
    score = np.round(predicted_scores[box].detach().numpy(), decimals= 4)
    if score > 0.8: #draw only high scoring boxes
        ## TODO: draw a red rectangle for the predicted bounding box
        #...
        # write the score of each predicted bounding box in the top left corner
        draw.text((boxes[0], boxes[1]), text = str(score))

img_PIL #show the image with the bounding boxes

In [None]:
## TODO: After completing the previous code, modify it following the previous
#        instructions (show 8 images in 2 rows of 4 images each).

## 6. Results, observations and conclusions

TODO

## Full code

TODO

---

## Bibliography
* [Torchvision Object Detection Finetuning Tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html)
* [Building your own object detector — PyTorch vs TensorFlow and how to even get started?](https://towardsdatascience.com/building-your-own-object-detector-pytorch-vs-tensorflow-and-how-to-even-get-started-1d314691d4ae)