# Detecting poles

Reference: the [TorchVision object detection finetuning tutorial](https://pytorch.org/tutorials/intermediate/torchvision_tutorial.html), originally written for [Mask R-CNN](https://arxiv.org/abs/1703.06870).

First, we need to install `pycocotools`. This library will be used for computing the evaluation metrics following the COCO metric for intersection over union. 

(Mount your drive to access the data easily; otherwise, upload your own data.)

Before using this notebook, run the following in a shell:
``` 
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
python setup.py build_ext install
```

Use the `pytorch_kernel` kernel so that the notebook has access to the conda environment and its packages.

The kernel can be created with these lines:

``` $ conda install ipykernel ``` 

``` $ ipython kernel install --user --name=pytorch_kernel ``` 

``` $ conda deactivate ``` 

## Defining the Dataset

The [torchvision reference scripts for training object detection, instance segmentation and person keypoint detection](https://github.com/pytorch/vision/tree/v0.3.0/references/detection) make it easy to add new custom datasets.
The dataset should inherit from the standard `torch.utils.data.Dataset` class, and implement `__len__` and `__getitem__`.

The only requirement is that the dataset's `__getitem__` returns:

* image: a PIL Image of size (H, W)
* target: a dict containing the following fields
    * `boxes` (`FloatTensor[N, 4]`): the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
    * `labels` (`Int64Tensor[N]`): the label for each bounding box
    * `image_id` (`Int64Tensor[1]`): an image identifier. It should be unique between all the images in the dataset, and is used during evaluation
    * `area` (`Tensor[N]`): The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes. --> Not used here
    * `iscrowd` (`UInt8Tensor[N]`): instances with `iscrowd=True` will be ignored during evaluation. We set them all to 0. 
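
To make the expected structure concrete, here is a minimal sketch that builds such a target from a list of boxes. It is only an illustration: plain Python lists stand in for the tensors, `make_target` is a hypothetical helper that is not part of this notebook, and the 4104 x 3000 image size is an assumed example value.

```python
def make_target(boxes, image_id, width, height):
    """Build a target dict; plain lists stand in for the tensors (hypothetical helper)."""
    for x0, y0, x1, y1 in boxes:
        # each box must be [x0, y0, x1, y1] with x0 < x1, y0 < y1, inside the image
        if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
            raise ValueError(f"invalid box {[x0, y0, x1, y1]} for a {width}x{height} image")
    return {
        "boxes": [[float(v) for v in box] for box in boxes],
        "labels": [1] * len(boxes),      # a single foreground class
        "image_id": [image_id],
        "area": [float((x1 - x0) * (y1 - y0)) for x0, y0, x1, y1 in boxes],
        "iscrowd": [0] * len(boxes),     # no crowd instances
    }


target = make_target([[120, 40, 150, 400]], image_id=0, width=4104, height=3000)
print(target["area"])
```

In the actual dataset class, each of these lists would be a tensor with the dtype given in the list above.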

### Writing a custom dataset for the Airport dataset

Let's write a dataset for the Airport dataset. Make sure the data is contained in the `dataset` folder.

Let's have a look at the dataset and how it is laid out.

The data is structured as follows:
```
dataset_aeroport/
  annotations/
    Image0000.xml
    Image0001.xml
    Image0002.xml
    Image0003.xml
    ...
  images/
    Image0000.jpg
    Image0001.jpg
    Image0002.jpg
    Image0003.jpg
```
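
The dataset class further down pairs images and annotations purely by sorted position, so a missing file would silently shift the pairing. As a hedged sketch (the helper `pair_images_with_annotations` is hypothetical and not part of this notebook), the layout above can be checked by matching files on their stem:

```python
import tempfile
from pathlib import Path


def pair_images_with_annotations(root_dir):
    """Return (image, annotation) path pairs matched by file stem, plus unmatched stems."""
    root = Path(root_dir)
    images = {p.stem: p for p in (root / "images").glob("*.jpg")}
    annotations = {p.stem: p for p in (root / "annotations").glob("*.xml")}
    matched = sorted(images.keys() & annotations.keys())
    unmatched = sorted(images.keys() ^ annotations.keys())
    return [(images[s], annotations[s]) for s in matched], unmatched


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    for sub, names in [("images", ["Image0000.jpg", "Image0001.jpg"]),
                       ("annotations", ["Image0000.xml"])]:
        (Path(tmp) / sub).mkdir()
        for name in names:
            (Path(tmp) / sub / name).touch()
    pairs, unmatched = pair_images_with_annotations(tmp)
    pair_names = [(img.name, ann.name) for img, ann in pairs]

print(pair_names)  # matched image/annotation file names
print(unmatched)   # stems that lack either the image or the annotation
```

Running such a check once before training makes it easy to spot an image without annotation (or the reverse).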

#### Some useful functions for bounding boxes representation and xml extraction
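
The helper functions below read the children of each `<bndbox>` element by position, which suggests a Pascal VOC-style annotation with `xmin`, `ymin`, `xmax`, `ymax` in that order. The sample file content here is an assumption made for illustration (including the class name `pole`), and `parse_boxes` is a hypothetical variant that reads the coordinates by tag name instead of by position:

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, assuming a Pascal VOC-style layout.
SAMPLE_ANNOTATION = """
<annotation>
  <filename>Image0000.jpg</filename>
  <object>
    <name>pole</name>
    <bndbox>
      <xmin>120</xmin>
      <ymin>40</ymin>
      <xmax>150</xmax>
      <ymax>400</ymax>
    </bndbox>
  </object>
</annotation>
"""


def parse_boxes(xml_text):
    """Return a list of [xmin, ymin, xmax, ymax] boxes read by tag name."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for bb in root.iter("bndbox"):
        boxes.append([int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")])
    return boxes


print(parse_boxes(SAMPLE_ANNOTATION))
```

Reading by tag name does not depend on the order of the child elements, at the cost of failing if a tag is missing.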

In [None]:
import cv2 as cv
import numpy as np
import torch
import torchvision
import xml.etree.ElementTree as ET
import matplotlib.pyplot as plt
from PIL import Image, ImageDraw
from imutils.object_detection import non_max_suppression


def extract_coordinates(file, min_width=20, img_width=4104):
    """Extract [xmin, ymin, xmax, ymax] boxes, each at least min_width pixels wide and tall."""
    tree = ET.parse(file)
    root = tree.getroot()
    boxes = []
    for bb in root.iter('bndbox'):
        # children of <bndbox> are read by position: xmin, ymin, xmax, ymax
        x0 = int(bb[0].text)  # xmin
        y0 = int(bb[1].text)  # ymin
        x1 = int(bb[2].text)  # xmax
        y1 = int(bb[3].text)  # ymax

        # reorder the corners and widen boxes that are too thin, staying inside the image
        xmin = min(x0, x1)
        xmax = max(x0, x1)
        if xmax - xmin < min_width:
            if xmin + min_width < img_width:
                xmax = xmin + min_width
            else:
                xmin = xmax - min_width
        # same for the height (no check against the image height here)
        ymin = min(y0, y1)
        ymax = max(y0, y1)
        if ymax - ymin < min_width:
            ymax = ymin + min_width
        boxes.append([xmin, ymin, xmax, ymax])
    return boxes


def export_xml(file):
    """Export an xml file as two lists of opposite corners, to draw bounding boxes in OpenCV."""
    tree = ET.parse(file)
    root = tree.getroot()
    corners_up_left = []
    corners_bottom_right = []
    for bb in root.iter('bndbox'):
        xmin = int(bb[0].text)
        ymax = int(bb[3].text)
        corners_up_left.append((xmin, ymax))
        xmax = int(bb[2].text)
        ymin = int(bb[1].text)
        corners_bottom_right.append((xmax, ymin))
    return corners_up_left, corners_bottom_right

def draw_bounding_box(image, corners_up_left, corners_bottom_right):
    """Draw bounding boxes given the corners (xmin, ymax) and (xmax, ymin)."""
    for pt1, pt2 in zip(corners_up_left, corners_bottom_right):
        # pt1 and pt2 are two opposite corners; OpenCV expects integer pixel coordinates
        pt1 = (int(pt1[0]), int(pt1[1]))
        pt2 = (int(pt2[0]), int(pt2[1]))
        cv.rectangle(image, pt1, pt2, (255, 255, 255), thickness=10)
    plt.figure()
    plt.imshow(image, cmap="gray")
    plt.show()

def display_image_with_bounding_boxes(image, target, to_npy=True):
    """ Display image with bounding boxes given image and target from dataset """
    if to_npy:
        image = image.numpy()
    image = np.sum(image, 0)

    # bounding boxes: accept either a torch tensor or a numpy array
    boxes = target['boxes']
    if isinstance(boxes, torch.Tensor):
        boxes = boxes.cpu().numpy()
    num_boxes = boxes.shape[0]
    corners_up_left = [(boxes[i][0], boxes[i][3]) for i in range(num_boxes)]
    corners_bottom_right = [(boxes[i][2], boxes[i][1]) for i in range(num_boxes)]
    draw_bounding_box(image, corners_up_left, corners_bottom_right)

def PIL_disp_img_bb(image, target, color='red'):
    to_pil = torchvision.transforms.ToPILImage(mode='RGB')
    base = to_pil(image)
    base = base.convert('RGBA')
    box_container = Image.new(base.mode, base.size, (255,255,255, 0))
    d = ImageDraw.Draw(box_container)
    # accept either a torch tensor or a numpy array
    boxes = target['boxes']
    if isinstance(boxes, torch.Tensor):
        boxes = boxes.cpu().numpy()
    for box in boxes:
        # PIL expects a plain [x0, y0, x1, y1] sequence
        d.rectangle(box.tolist(), outline=color)
    base.show()
    box_container.show()
    # return base, box_container
    out = Image.alpha_composite(base, box_container)
    return out

SMOOTH = 1e-6

def iou_pytorch(img, annotated_bb, gt_bb):
    """Thresholded IoU between the union of predicted boxes and the union of ground-truth boxes."""
    img_size = img.size()  # [C, H, W]
    # binary masks indexed as [x, y]
    outputs = torch.zeros((img_size[2], img_size[1]), dtype=torch.int64)
    labels = torch.zeros((img_size[2], img_size[1]), dtype=torch.int64)
    for bb in annotated_bb.tolist():
        outputs[int(bb[0]):int(bb[2]), int(bb[1]):int(bb[3])] = 1
    for bb in gt_bb.tolist():
        labels[int(bb[0]):int(bb[2]), int(bb[1]):int(bb[3])] = 1

    intersection = (outputs & labels).float().sum()  # zero if either mask is empty
    union = (outputs | labels).float().sum()         # zero if both masks are empty

    iou = (intersection + SMOOTH) / (union + SMOOTH)  # smooth the division to avoid 0/0

    # map the IoU to a score in {0, 0.1, ..., 1}: 0 up to an IoU of 0.5, then one step per 0.05
    thresholded = torch.clamp(20 * (iou - 0.5), 0, 10).ceil() / 10

    return thresholded
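
`iou_pytorch` above rasterizes all boxes into masks before comparing them. For intuition, here is a dependency-free sketch of the same two ingredients on a single pair of boxes: the intersection over union, and the mapping of an IoU to a stepped score. Both function names are hypothetical and the boxes are made-up example values; this is not the COCO evaluation used during training.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as [x0, y0, x1, y1]."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def thresholded_score(iou):
    """0 up to an IoU of 0.5, then one tenth per further 0.05 of IoU, capped at 1."""
    return math.ceil(min(max(20 * (iou - 0.5), 0), 10)) / 10


predicted = [0, 0, 10, 10]
ground_truth = [5, 0, 15, 10]
iou = box_iou(predicted, ground_truth)  # half of each box overlaps
print(iou, thresholded_score(iou))
```

With this mapping, a prediction whose IoU does not exceed 0.5 contributes nothing to the mean, which explains why the reported "accuracy" can be low even when boxes roughly overlap.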


So each image has a corresponding set of bounding boxes. Let's write a `torch.utils.data.Dataset` class for this dataset.

In [None]:
import os
import numpy as np
import torch
import torch.utils.data
from PIL import Image


class AirportDataSet(torch.utils.data.Dataset):
    def __init__(self, root_dir, transforms):
        self.root = root_dir
        self.transforms = transforms
        # load all image files, sorting them to ensure that they are aligned
        self.images = list(sorted(os.listdir(os.path.join(root_dir, "images"))))
        self.bb = list(sorted(os.listdir(os.path.join(root_dir, "annotations"))))

    def __getitem__(self, index):
        # load images and bounding boxes
        image_path = os.path.join(self.root, "images", self.images[index])
        bb_path = os.path.join(self.root, "annotations", self.bb[index])
        image = Image.open(image_path).convert("RGB")
        width, height = image.size

        boxes = extract_coordinates(bb_path, img_width=width)

        # convert everything into a tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        labels = torch.ones((len(boxes), ), dtype=torch.int64)

        # reshape so that images without annotations yield an empty [0, 4] tensor
        boxes = boxes.reshape(-1, 4)
        # areas of all boxes, computed in one vectorized step
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])

        target = {}
        target['boxes'] = boxes
        target['labels'] = labels
        target['image_id'] = torch.tensor([index], dtype=torch.int64)
        target['area'] = area
        target['iscrowd'] = torch.zeros((len(boxes), ), dtype=torch.int64)

        if self.transforms is not None:
            image, target = self.transforms(image, target)
            # transform target
            # boxes[:, 0] = width - boxes[:, 0]  # x_min
            # boxes[:, 2] = width - boxes[:, 2]  # x_max
            # target['boxes'] = boxes

        return image, target

    def __len__(self):
        return len(self.images)

## Defining your model

In this tutorial, we will be using [Faster R-CNN](https://arxiv.org/abs/1506.01497). Faster R-CNN is a model that predicts both bounding boxes and class scores for potential objects in the image.

We use a model pretrained on COCO from the `torchvision.models` library, only modifying the number of possible classes to 2 (pole or not pole).

Another solution (especially for faster predictions) would be to replace the backbone with MobileNetV2 from the same library.



### An object detection model for the Airport Dataset


In [None]:
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.rpn import AnchorGenerator
      
def get_instance_segmentation_model(num_classes, model='mobilenet'):
    """Return a Faster R-CNN detector: ResNet-50 FPN if model == 'faster-rcnn', else a MobileNetV2 backbone."""
    if model == 'faster-rcnn':
        # load a Faster R-CNN model pre-trained on COCO
        model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True, progress=False)

        # get the number of input features for the classifier
        in_features = model.roi_heads.box_predictor.cls_score.in_features
        # replace the pre-trained head with a new one
        model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    else:
        # MobileNetV2 feature extractor; FasterRCNN needs to know its number of output channels
        backbone = torchvision.models.mobilenet_v2(pretrained=True, progress=False).features
        backbone.out_channels = 1280
        # 5 anchor sizes x 3 aspect ratios at each location of the single feature map
        anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),), aspect_ratios=((0.5, 1.0, 2.0),))
        roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'], output_size=7, sampling_ratio=2)
        model = FasterRCNN(backbone, num_classes=num_classes, rpn_anchor_generator=anchor_generator, box_roi_pool=roi_pooler)
    return model

## Training and evaluation functions

In `references/detection/` from the PyTorch vision repository, there are a number of helper functions to simplify training and evaluating detection models.
Here, we will use `references/detection/engine.py`, `references/detection/utils.py` and `references/detection/transforms.py`.

Let's copy those files (and their dependencies) here so that they are available in the notebook. Run these lines in the terminal:

```
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../
```



Let's write some helper functions for data augmentation / transformation, which leverage the functions in `references/detection` that we have just copied:


In [None]:
from engine import train_one_epoch, evaluate
import utils
import transforms as T


def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)

#### Note that we do not need to add mean/std normalization or image rescaling to the data transforms, as those are handled internally by the Faster R-CNN model.

## Training Faster RCNN

We now have the dataset class, the models and the data transforms. Let's instantiate them.

In [None]:
# use our dataset and defined transformations
dataset = AirportDataSet('dataset/dataset_aeroport', get_transform(train=True))
dataset_test = AirportDataSet('dataset/dataset_aeroport', get_transform(train=False))

# split the dataset in train and test set
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

print("Size of the dataset: \t", len(data_loader))
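
The split above shuffles the indices once and holds out the last 50 for testing. The sketch below reproduces that logic with the standard library only, so it does not yield the same permutation as `torch.randperm`; `split_indices` is a hypothetical helper and the dataset size of 200 is an assumed example value.

```python
import random


def split_indices(n_items, n_test=50, seed=1):
    """Shuffle range(n_items) and hold out the last n_test indices for testing."""
    if n_items <= n_test:
        # with slicing alone, indices[:-n_test] would silently be empty here
        raise ValueError("the dataset must contain more than n_test items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:-n_test], indices[-n_test:]


train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

The explicit size check is a design choice of this sketch: it turns an empty training set into an immediate error instead of a confusing failure during training.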

Now let's instantiate the model and the optimizer.

In [None]:
from ipywidgets import IntProgress

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and pole
num_classes = 2

# get the model using our helper function
model = get_instance_segmentation_model(num_classes, 'faster-rcnn')
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)
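
As the comment above says, the learning rate is divided by 10 every 3 epochs. A minimal sketch of the resulting schedule, using the closed form `base_lr * gamma ** (epoch // step_size)` and ignoring any warm-up that the training helper may apply (`step_lr` is a hypothetical function name):

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate used during a given epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


schedule = [step_lr(0.005, epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Over 10 epochs the rate therefore drops three times, so the last epoch runs at a thousandth of the initial learning rate.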

And now let's train the model for 10 epochs, evaluating at the end of every epoch.

In [None]:
# let's train it for 10 epochs
num_epochs = 10

for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)

Now that training has finished, let's have a look at what it actually predicts on a test image.

In [None]:
# pick one image from the set
index = 1
# plt_img = plt.imread('/content/experiment1/images/Image0000.jpg')
img, _ = dataset_test[index]
# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

Printing the prediction shows that we have a list of dictionaries. Each element of the list corresponds to a different image. As we have a single image, there is a single dictionary in the list.
The dictionary contains the predictions for the image we passed. In this case, we can see that it contains `boxes`, `labels` and `scores` as fields.

In [None]:
prediction

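Before inspecting the predictions, let's save the trained model: first only its weights (`state_dict`), then the whole pickled model, which is the file loaded in the "Testing on new images" section.

Note that the image tensor has been rescaled to 0-1 and is in `[C, H, W]` format, so it has to be converted back before it can be displayed.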

In [None]:
torch.save(model.state_dict(), 'CNN_trainings/faster-rcnn_weights.pth')
torch.save(model, 'CNN_trainings/faster-rcnn.pth')

In [None]:
# Mean accuracy (thresholded IoU compared to the hand-annotated ground truth)
dataset = AirportDataSet('dataset/dataset_aeroport', get_transform(train=False))
model.eval()
acc = np.zeros(len(dataset))
for index in range(len(dataset)):
    # the dataset transform already returns the image as a tensor
    img, target = dataset[index]
    if index % 10 == 0:
        print("Annotating frame \t", index)
    with torch.no_grad():
        prediction = model([img.to(device)])
    # out = PIL_disp_img_bb(img, prediction[0])
    # out.save('results/aeroport1/annotated_'+str(index)+'.png')
    acc[index] = iou_pytorch(img, prediction[0]['boxes'], target['boxes']).item()
print("Mean accuracy of Faster-RCNN on Airport dataset: \n", np.mean(acc))

And let's now visualize the predicted bounding boxes. The boxes are predicted as an `[N, 4]` tensor in `[x0, y0, x1, y1]` format, where `N` is the number of predictions, and each box comes with a confidence score between 0 and 1.

In [None]:
out = PIL_disp_img_bb(img, prediction[0])
out

## Training MobileNetV2

In [None]:
# use our dataset and defined transformations
dataset = AirportDataSet('dataset/dataset_aeroport', get_transform(train=True))
dataset_test = AirportDataSet('dataset/dataset_aeroport', get_transform(train=False))
print("Dataset size: ", len(dataset.images)-50)

# split the dataset in train and test set
torch.manual_seed(1)
indices = torch.randperm(len(dataset)).tolist()
dataset = torch.utils.data.Subset(dataset, indices[:-50])  
dataset_test = torch.utils.data.Subset(dataset_test, indices[-50:])  

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=1, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# our dataset has two classes only - background and pole
num_classes = 2

# get the model using our helper function
model = get_instance_segmentation_model(num_classes)
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 3 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=3,
                                               gamma=0.1)

In [None]:
# let's train it for 10 epochs
num_epochs = 10

model.train()
for epoch in range(num_epochs):
    # train for one epoch, printing every 10 iterations
    train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    # update the learning rate
    lr_scheduler.step()
    # evaluate on the test dataset
    evaluate(model, data_loader_test, device=device)

In [None]:
# pick one image from the set
index = 1
# plt_img = plt.imread('/content/experiment1/images/Image0000.jpg')
img, _ = dataset_test[index]
# put the model in evaluation mode
model.eval()
with torch.no_grad():
    prediction = model([img.to(device)])

In [None]:
# Annotate every frame (the accuracy against the hand-annotated ground truth is commented out below)
pil2tensor = torchvision.transforms.ToTensor()
dataset = AirportDataSet('dataset/dataset_aeroport', get_transform(train=False))
model.eval()
acc = np.zeros(len(dataset))
for index in range(len(dataset)):
    img, target = dataset[index]
    # img = pil2tensor(img)
    if index % 10 == 0:
        print("Annotating frame \t", index)
    with torch.no_grad():
        prediction = model([img.to(device)])
    out = PIL_disp_img_bb(img, prediction[0])
    out.save('results/aeroport1_ssd/annotated_'+str(index)+'.png')
    # acc[index] = iou_pytorch(img, prediction[0]['boxes'], target['boxes'])
# print("Mean accuracy of MobileNet on Airport dataset: \n", np.mean(acc))

In [None]:
# saving the model/the weights
torch.save(model.state_dict(), 'CNN_trainings/mobile_detection_weights.pth')
torch.save(model, 'CNN_trainings/mobile_detection.pth')

In [None]:
# visualizing a result
out = PIL_disp_img_bb(img, prediction[0])
out 

## Testing on new images

In [None]:
# only execute this cell if no training was run before (warning: only the Faster R-CNN model is available); upload your test archive first
model = torch.load('CNN_trainings/faster-rcnn.pth')
# model = torch.load('CNN_trainings/mobile_detection.pth')

In [None]:
# test on another image
# CEA dataset
test_set = AirportDataSet('dataset/cea_dataset', None)
# airport dataset (1)
# test_set = AirportDataSet('dataset/dataset_aeroport', None)
# airport dataset (2)
# test_set = AirportDataSet('dataset/aeroport2', None)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

index = 2
test_img, target = test_set[index]
pil2tensor = torchvision.transforms.ToTensor()
test_img = pil2tensor(test_img)
model.eval()
with torch.no_grad():
    pred = model([test_img.to(device)])
display_image_with_bounding_boxes(test_img, pred[0])
print(pred[0]['scores'])
print('# bounding boxes found for img ', index, ': ', len(pred[0]['boxes']))

In [None]:
def apply_NMS(pred, filter=True, proba_confidence=0.2):
    """Keep only tall, sufficiently wide and confident boxes (the NMS step itself is disabled)."""
    bounding_boxes = pred[0]['boxes'].cpu().numpy()
    scores = pred[0]['scores'].cpu().numpy()
    if filter:
        widths = bounding_boxes[:, 2] - bounding_boxes[:, 0]
        heights = bounding_boxes[:, 3] - bounding_boxes[:, 1]
        # a pole is wider than 10 px, more than twice as tall as wide, and confident enough
        filter_to_apply = (widths > 10) & (heights > 2 * widths) & (scores >= proba_confidence)
        bounding_boxes = bounding_boxes[filter_to_apply]
        scores = scores[filter_to_apply]

    # bounding_boxes = non_max_suppression(bounding_boxes, overlapThresh=0.1)
    # the scores stay aligned with the boxes only while the NMS line above is disabled
    pred[0]['boxes'] = torch.Tensor(bounding_boxes)
    pred[0]['scores'] = torch.Tensor(scores)
    return pred

In [None]:
# testing NMS (Non-Maximum Suppression)
pred_ = apply_NMS(pred, proba_confidence=0.15)
display_image_with_bounding_boxes(test_img, pred_[0])
print('Scores: ', pred_[0]['scores'])
print('# bounding boxes kept after NMS: ', len(pred_[0]['boxes']))
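
In `apply_NMS` above, the suppression call itself is commented out, so only the shape and score filter runs. For reference, here is a dependency-free sketch of greedy non-maximum suppression. It illustrates the general technique rather than the `imutils` implementation; the function names, the 0.5 threshold and the example boxes are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x0, y0, x1, y1]."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap too much with a better-scored kept box
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # the second box overlaps the first one too much and is dropped
```

Because the function returns indices, the same selection can be applied to the boxes, the scores and the labels, which keeps the three fields aligned.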

In [None]:
pil2tensor = torchvision.transforms.ToTensor()
acc = np.zeros(len(test_set))
model.eval()
for index in range(len(test_set)):
    test_img, target = test_set[index]
    if index % 10 == 0:
        print("Annotating frame \t", index)
    test_img = pil2tensor(test_img)
    with torch.no_grad():
        pred = model([test_img.to(device)])
        # pred = apply_NMS(pred)
    out = PIL_disp_img_bb(test_img, pred[0])
    out.save('results/cea/annotated_'+str(index)+'.png')
    # acc[index] = iou_pytorch(test_img, pred[0]['boxes'], target['boxes'])
# print("Mean accuracy: \n", np.mean(acc))

All results can be found in the `results` folder, following the path above.

In [None]:
out = PIL_disp_img_bb(test_img, pred[0])
out

In [None]:
display_image_with_bounding_boxes(test_img, target)

In [None]:
true = PIL_disp_img_bb(test_img, target, color=(0, 255, 0))
# true