## Problem 1


<img src="https://dl.fbaipublicfiles.com/detectron2/Detectron2-Logo-Horz.png" width="500">

In this homework assignment, we will use Detectron2 (Facebook) for object detection and segmentation.

Detectron2 is Facebook AI Research's software system that implements state-of-the-art object detection algorithms. Here, we will go through some basic usage of Detectron2 and then complete Problem 1 and Problem 2.


### Getting Started

In [None]:
# First, install detectron2's dependencies:
%pip uninstall torch torchvision -y
%pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
# quote the version specifier so the shell does not treat ">" as a redirect
%pip install pyyaml==5.1 "pycocotools>=2.0.1"


import torch, torchvision
print(torch.__version__, torch.cuda.is_available())
!gcc --version


In [3]:
import torch, torchvision

# install detectron2: (Colab has CUDA 11.1 + torch 1.10)
# See https://detectron2.readthedocs.io/tutorials/install.html for instructions
assert torch.__version__.startswith("1.10")
!pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu111/torch1.10/index.html
# It may ask you to restart the runtime


In [None]:
# Some basic setup:
# Set up the detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np 
import os, json, cv2, random
from google.colab.patches import cv2_imshow

# import some common detectron2 utils
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
import cv2

### Run a pretrained Detectron2 model

We first download some images from the given URLs:

In [None]:
!wget http://images.cocodataset.org/val2017/000000007574.jpg -q -O input.jpg
im_input = cv2.imread("./input.jpg")
cv2_imshow(im_input)

In [None]:
!wget http://images.cocodataset.org/val2017/000000013923.jpg -q -O test1.jpg
im_test1 = cv2.imread("./test1.jpg")
cv2_imshow(im_test1)

In [None]:
!wget http://images.cocodataset.org/val2017/000000018380.jpg -q -O test2.jpg
im_test2 = cv2.imread("./test2.jpg")
cv2_imshow(im_test2)

We can see there are multiple objects in these images: bottles, tables, chairs, people, etc. Let us see if we can detect them all by using a pre-trained model given by Detectron2.


In [None]:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST= 0.5  # set threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(im_input)

Let's take a look at the model output. 

In inference mode, the builtin model outputs a `list[dict]`, one dict for each image. `DefaultPredictor` takes a single image and returns that image's dict directly, which is why `outputs["instances"]` can be indexed without a list index. For the object detection task, the dict contains the following fields:

*   "instances": Instances object with the following fields:
    * "pred_boxes": a Boxes object storing N boxes, one for each detected instance.
    * "scores": a vector of N scores.
    * "pred_classes": a vector of N labels in the range [0, num_categories).

For the full specification, see https://detectron2.readthedocs.io/tutorials/models.html#model-output-format.
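To make the field layout concrete, here is a minimal sketch that summarizes a set of detections by class. Plain Python lists stand in for the `pred_classes` and `scores` tensors, and the class names and values are made up for illustration; they are not the output of the model above.

```python
from collections import Counter

# Hypothetical stand-ins for outputs["instances"].pred_classes and .scores
pred_classes = [0, 0, 2, 1, 2, 2]
scores = [0.98, 0.91, 0.88, 0.75, 0.64, 0.52]
class_names = ["person", "bicycle", "car"]  # illustrative subset, not the full COCO list


def summarize(pred_classes, scores, class_names):
    """Return {class name: (number of detections, highest score)}."""
    counts = Counter(pred_classes)
    summary = {}
    for cls_id, count in counts.items():
        best = max(s for c, s in zip(pred_classes, scores) if c == cls_id)
        summary[class_names[cls_id]] = (count, best)
    return summary


summary = summarize(pred_classes, scores, class_names)
print(summary)
```

With real predictions, the same idea applies after converting the tensors to Python lists (for example with `.tolist()` on a CPU tensor).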



In [None]:
print(outputs)

In [None]:
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)

In [None]:
outputs_q1q2 = {'q1': [], 'q2': []}
outputs_q1q2['q1'].append(outputs["instances"])

In [None]:
# We can use "Visualizer" to draw the predictions on the image
v = Visualizer(im_input[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(out.get_image()[:, :, ::-1])

Awesome! Great progress so far! The model detects the sink, microwave, bottle and even the refrigerator. At this point, we have used the pre-trained model to run inference on the given image, and 17 objects are detected in total. The image is taken from the [MS-COCO](https://cocodataset.org/#home) dataset, whose detection task covers 80 object classes (81 if the background class is counted), including person, bicycle, car, etc. You may find the id-category mapping [here](https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda).

The model we just used is `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml`. Detectron2 provides much more than that: you can find a large number of models for different tasks in the [MODEL_ZOO](https://github.com/facebookresearch/detectron2/tree/master/configs). Why not try a different model and see what its output looks like?


* Q1 (5%): Object Detection. Use the same configuration `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml`, with a score threshold of 0.5 (`SCORE_THRESH_TEST=0.5`), to run inference on the remaining two images (test1.jpg & test2.jpg) and view the outputs with bounding boxes.

* Q2: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml`, which has a ResNet-101 backbone, with a score threshold of 0.5 (`SCORE_THRESH_TEST=0.5`), and view the outputs of all three images with bounding boxes. Looking at the outputs, what differences do you find compared with the `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` model used in Q1? (e.g., number of objects, confidence scores, ...)

* Q3: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml` with a score threshold of 0.9 (`SCORE_THRESH_TEST=0.9`) and view the outputs of all three images with bounding boxes.

* Q4 (5%): Instance Segmentation. The models we have tried in Q1-Q3 are Faster R-CNN models for object detection. Here, let’s try a Mask R-CNN model, `COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml`, with a score threshold of 0.5, to perform instance segmentation and view the outputs of all three images with segmentation masks. Compare the outputs of the object detection model with those of the instance segmentation model.
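Note that `SCORE_THRESH_TEST` acts on each detection's confidence score, not on box overlap: detections scoring below the threshold are dropped before they are returned. The sketch below illustrates the effect on a made-up list of detections; the labels and scores are hypothetical, and the strict `>` comparison is an assumption about how the cut-off is applied.

```python
def filter_by_score(detections, score_thresh):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d["score"] > score_thresh]


# Hypothetical detections, for illustration only
detections = [
    {"label": "bottle", "score": 0.97},
    {"label": "sink", "score": 0.93},
    {"label": "chair", "score": 0.71},
    {"label": "person", "score": 0.55},
    {"label": "cup", "score": 0.32},
]

kept_at_05 = filter_by_score(detections, 0.5)
kept_at_09 = filter_by_score(detections, 0.9)
print(len(kept_at_05), "detections kept at 0.5")
print(len(kept_at_09), "detections kept at 0.9")
```

Raising the threshold trades recall for precision: fewer boxes are drawn, but the ones that remain are those the model is most confident about.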






In [None]:
# todo: Q1


In [None]:
# todo: Q2


In [None]:
# todo: Q3


In [None]:
# todo: Q4


### Train Faster R-CNN on a traffic sign dataset

You have already used a model pre-trained on the MS COCO dataset. Why not try to train your own model? Here, let’s train an existing Detectron2 Faster R-CNN model on a custom dataset in a new format.

We use the [traffic sign dataset](https://www.dropbox.com/s/d8y6uc06027fpqo/traffic_sign_data.zip?dl=1). We’ll train a traffic sign detection model starting from an existing model pre-trained on the COCO dataset, available in detectron2’s model zoo. Note that the MS COCO dataset does not have the "traffic sign" category, but we'll be able to recognize this new class in a few minutes.


#### Prepare the dataset

In [None]:
# download, decompress the data
!wget https://www.dropbox.com/s/d8y6uc06027fpqo/traffic_sign_data.zip?dl=1 -O traffic_sign_data.zip
!unzip -q traffic_sign_data.zip > /dev/null

The traffic sign dataset comes in its own custom format, so we write a function to parse it and convert it into detectron2's standard format. See the `get_traffic_sign_dicts` function for more details.
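The core of the conversion is turning one label row into an absolute-pixel box. The sketch below assumes, as the parser in the next cell does, that each row has the form `class x_center y_center width height` with coordinates normalized to the image size. The helper name and the sample row are hypothetical.

```python
def label_row_to_xyxy(row, img_width, img_height):
    """Convert a normalized (class, xc, yc, w, h) row to (class_id, [xmin, ymin, xmax, ymax])."""
    class_id = int(row[0])
    # scale the normalized values to pixels
    xcentre = int(float(row[1]) * img_width)
    ycentre = int(float(row[2]) * img_height)
    bwidth = int(float(row[3]) * img_width)
    bheight = int(float(row[4]) * img_height)
    # move from centre/size to corner coordinates
    xmin = int(xcentre - bwidth / 2)
    ymin = int(ycentre - bheight / 2)
    return class_id, [xmin, ymin, xmin + bwidth, ymin + bheight]


# Hypothetical label row for an 800x600 image
sample_row = "0 0.5 0.5 0.25 0.5".split(" ")
class_id, box = label_row_to_xyxy(sample_row, img_width=800, img_height=600)
print(class_id, box)
```

A box in this corner format corresponds to `BoxMode.XYXY_ABS`; keeping `[xmin, ymin, bwidth, bheight]` instead would correspond to `BoxMode.XYWH_ABS`.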

In [None]:
from detectron2.structures import BoxMode

def get_traffic_sign_dicts(data_root, txt_file):
    dataset_dicts = []
    filenames = []
    csv_path = os.path.join(data_root, txt_file)
    with open(csv_path, "r") as f:
        for line in f:
            filenames.append(line.rstrip())
    
    for idx, filename in enumerate(filenames):
        record = {}

        image_path = os.path.join(data_root, filename)

        height, width = cv2.imread(image_path).shape[:2]

        record['file_name'] = image_path
        record['image_id'] = idx
        record['height'] = height
        record['width'] = width

        image_filename = os.path.basename(filename)
        image_name = os.path.splitext(image_filename)[0]
        annotation_path = os.path.join(data_root, 'labels', '{}.txt'.format(image_name))
        annotation_rows = []

        with open(annotation_path, "r") as f:
            for line in f:
                temp = line.rstrip().split(" ")
                annotation_rows.append(temp)

        objs = []
        for row in annotation_rows:
            xcentre = int(float(row[1])*width)
            ycentre = int(float(row[2])*height)
            bwidth = int(float(row[3])*width)
            bheight = int(float(row[4])*height)

            xmin = int(xcentre - bwidth/2)
            ymin = int(ycentre - bheight/2)
            xmax = xmin  + bwidth
            ymax = ymin + bheight

            obj= {
                'bbox': [xmin, ymin, xmax, ymax],
                'bbox_mode': BoxMode.XYXY_ABS,
                # alternatively, we can use bbox_mode = BoxMode.XYWH_ABS
                # 'bbox': [xmin, ymin, bwidth, bheight],
                # 'bbox_mode': BoxMode.XYWH_ABS,
                'category_id': int(row[0]),
                'iscrowd': 0
            }

            objs.append(obj)
        record['annotations'] = objs
        dataset_dicts.append(record)
    return dataset_dicts

In [None]:
# Metadata configurations
data_root = "traffic_sign_data"
train_txt = "traffic_sign_train.txt"
test_txt = "traffic_sign_test.txt"

train_data_name = "traffic_sign_train"
test_data_name = "traffic_sign_test"

thing_classes = ["traffic-sign"]

output_dir = "./outputs"

def count_lines(fname):
    # count the lines in a file; returns 0 for an empty file
    with open(fname) as f:
        return sum(1 for _ in f)

train_img_count = count_lines(os.path.join(data_root, train_txt))
print("There are {} samples in training data".format(train_img_count))

In [None]:
# Register the traffic_sign_train datasets
DatasetCatalog.register(name=train_data_name, 
                        func=lambda: get_traffic_sign_dicts(data_root, train_txt))
train_metadata = MetadataCatalog.get(train_data_name).set(thing_classes=thing_classes)

# Register the traffic_sign_test datasets
DatasetCatalog.register(name=test_data_name, 
                        func=lambda: get_traffic_sign_dicts(data_root, test_txt))
test_metadata = MetadataCatalog.get(test_data_name).set(thing_classes=thing_classes)

To verify the data loading is correct, let's visualize the annotations of randomly selected samples in the training set:

In [None]:
train_data_dict = get_traffic_sign_dicts(data_root, train_txt)

for d in random.sample(train_data_dict, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=train_metadata, scale=0.5)
    out = visualizer.draw_dataset_dict(d)
    cv2_imshow(out.get_image()[:, :, ::-1])

#### Train!

Now, let's fine-tune a COCO-pretrained R50-FPN Faster R-CNN model `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` on the traffic sign dataset. 

In [None]:
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.DATASETS.TRAIN = (train_data_name,)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")  # initialize training from the model zoo weights
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.0001  # pick a good LR
cfg.SOLVER.MAX_ITER = 300    # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)  # only has one class (traffic-sign)
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset (default: 512)
cfg.OUTPUT_DIR = output_dir

In [None]:
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

In [None]:
# Look at training curves in tensorboard:
%load_ext tensorboard
%tensorboard --logdir outputs/

### Inference & evaluation using the trained model


Now let's run inference with the trained model on the traffic sign test set. First, let's create a predictor using the model we just trained:

In [None]:
# cfg already contains everything we've set previously. We change it slightly for inference:
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")  # path to the model we just trained
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

Then, we randomly select several samples to visualize the prediction results. 

In [None]:
from detectron2.utils.visualizer import ColorMode

test_data_dict = get_traffic_sign_dicts(data_root, test_txt)

for d in random.sample(test_data_dict, 3):
    im = cv2.imread(d["file_name"])
    outputs = predictor(im) 
    # print(outputs)
    v = Visualizer(im[:, :, ::-1],
                   metadata=test_metadata,
                   scale=0.5,
                   )
    out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(out.get_image()[:, :, ::-1])

We can also evaluate its performance using the AP metric implemented in the COCO API. For more details about AP, please refer to this [blog post](https://medium.com/@jonathan_hui/map-mean-average-precision-for-object-detection-45c121a31173).
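AP is built on intersection over union (IoU): a detection counts as correct only if its IoU with a ground-truth box exceeds a threshold, and the COCO metric averages AP over IoU thresholds from 0.50 to 0.95. Below is a minimal sketch of IoU for two boxes in `[x0, y0, x1, y1]` format. It treats coordinates as continuous (no `+1` pixel convention), and the boxes are made up for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two [x0, y0, x1, y1] boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # width and height of the overlap region, clamped at zero when disjoint
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


iou_partial = box_iou([0, 0, 10, 10], [5, 5, 15, 15])
iou_same = box_iou([0, 0, 10, 10], [0, 0, 10, 10])
iou_disjoint = box_iou([0, 0, 10, 10], [20, 20, 30, 30])
print(iou_partial, iou_same, iou_disjoint)
```

In the first case the overlap is 25 and the union is 175, so the IoU is 1/7; such a detection would not count as a match at the 0.5 threshold.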

In [None]:
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader

# create evaluator instance with coco evaluator 
evaluator = COCOEvaluator(test_data_name, cfg, False, output_dir="./outputs/")

# create validation data loader
val_loader = build_detection_test_loader(cfg, test_data_name)

# start validation
print(inference_on_dataset(trainer.model, val_loader, evaluator))

The AP is ~30%. You can also see the detailed metrics for small, medium and large objects. Not bad! Here are some things I want you to try yourself:

* Q5: Change the initial learning rate (`BASE_LR`) from `0.0001` to `0.00025` and show the 4 training curves from TensorBoard. Keeping the rest of the configuration fixed, does it improve the AP or not? Explain why.


In [None]:
# todo: Q5

* Q6: Change the number of iterations (`MAX_ITER`) from `300` to `500` and show the 4 training curves from TensorBoard. Keeping the rest of the configuration fixed, does it improve the AP or not? What about `1000`? Explain why.
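When comparing iteration counts, it can help to translate them into passes over the training set, since each iteration processes `IMS_PER_BATCH` images. The sketch below does that arithmetic. The training set size of 200 is a placeholder, not the actual size of the traffic sign dataset; substitute `train_img_count` from the metadata cell.

```python
def epochs_seen(max_iter, ims_per_batch, num_train_images):
    """Approximate number of passes over the training set."""
    return max_iter * ims_per_batch / num_train_images


ims_per_batch = 2        # matches cfg.SOLVER.IMS_PER_BATCH in the training cell
num_train_images = 200   # placeholder value; use train_img_count in the notebook

schedule = {
    max_iter: epochs_seen(max_iter, ims_per_batch, num_train_images)
    for max_iter in (300, 500, 1000)
}
for max_iter, epochs in schedule.items():
    print(f"{max_iter} iterations ~ {epochs:.1f} epochs")
```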

In [None]:
# todo: Q6

## Problem 2: Tracktor for Pedestrian Multi-Object Tracking

In [None]:
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import cv2
import math
import os
import importlib
import json

In [None]:
# Prepare the code
!git clone https://github.com/phil-bergmann/tracking_wo_bnw

# install packages
!pip install -r tracking_wo_bnw/requirements.txt

# install tracktor
!pip install -e tracking_wo_bnw/.

In [None]:
%cd tracking_wo_bnw/

### Download MOT 17 Dataset

In [None]:
%%shell
# Prepare the MOT17Det dataset (the cell magic must be the first line of the cell)

# download the MOT17 detection challenge data
wget https://motchallenge.net/data/MOT17Det.zip
# extract it in the current folder
unzip -q MOT17Det.zip

In [None]:
%%shell
# Prepare the MOT17 label data

wget https://motchallenge.net/data/MOT17Labels.zip
unzip -q -d data/MOT17Labels MOT17Labels.zip
unzip -q -d data/MOT17Det MOT17Det.zip

In [None]:
%%shell
# Obtain the ground truth and pre-trained model files

wget https://vision.in.tum.de/webshare/u/meinhard/tracking_wo_bnw-output_v3.zip
unzip -q tracking_wo_bnw-output_v3.zip

Let's have a look at the dataset and how it is laid out.

The data is structured as follows:

In [None]:
%%shell

ls
ls train
ls test

ls train/MOT17-02/
ls train/MOT17-02/img1/

Visualize some images in the MOT-17 dataset. 

In [None]:
# todo


### Download torchvision and coco

First, we need to install `pycocotools`. This library will be used for computing the evaluation metrics following the COCO metric for intersection over union.

In torchvision's `references/detection/` directory, there are a number of helper functions to simplify training and evaluating detection models.
Here, we will use `references/detection/engine.py`, `references/detection/utils.py` and `references/detection/transforms.py`.

Let's copy those files (and their dependencies) in here so that they are available in the notebook

In [None]:
from google.colab import drive

drive.mount('/content/gdrive')



In [None]:
%%shell

# Install pycocotools
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
python setup.py build_ext install

In [None]:
%%shell

# Download TorchVision repo to use some files from
# references/detection
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.3.0

cp references/detection/utils.py ../
cp references/detection/transforms.py ../
cp references/detection/coco_eval.py ../
cp references/detection/engine.py ../
cp references/detection/coco_utils.py ../

### Defining the Dataset

The [torchvision reference scripts for training object detection, instance segmentation and person keypoint detection](https://github.com/pytorch/vision/tree/v0.3.0/references/detection) make it easy to add new custom datasets.
The dataset should inherit from the standard `torch.utils.data.Dataset` class, and implement `__len__` and `__getitem__`.

The only specificity that we require is that the dataset `__getitem__` should return:

* image: a PIL Image of size (H, W)
* target: a dict containing the following fields
    * `boxes` (`FloatTensor[N, 4]`): the coordinates of the `N` bounding boxes in `[x0, y0, x1, y1]` format, ranging from `0` to `W` and `0` to `H`
    * `labels` (`Int64Tensor[N]`): the label for each bounding box
    * `image_id` (`Int64Tensor[1]`): an image identifier. It should be unique between all the images in the dataset, and is used during evaluation
    * `area` (`Tensor[N]`): The area of the bounding box. This is used during evaluation with the COCO metric, to separate the metric scores between small, medium and large boxes.
    * `iscrowd` (`UInt8Tensor[N]`): instances with `iscrowd=True` will be ignored during evaluation.
    * (optionally) `masks` (`UInt8Tensor[N, H, W]`): The segmentation masks for each one of the objects
    * (optionally) `keypoints` (`FloatTensor[N, K, 3]`): For each one of the `N` objects, it contains the `K` keypoints in `[x, y, visibility]` format, defining the object. `visibility=0` means that the keypoint is not visible. Note that for data augmentation, the notion of flipping a keypoint is dependent on the data representation, and you should probably adapt `references/detection/transforms.py` for your new keypoint representation

If your dataset returns the above fields, it will work for both training and evaluation, and it can use the evaluation scripts from pycocotools.

Additionally, if you want to use aspect ratio grouping during training (so that each batch only contains images with similar aspect ratio), then it is recommended to also implement a `get_height_and_width` method, which returns the height and the width of the image. If this method is not provided, we query all elements of the dataset via `__getitem__` , which loads the image in memory and is slower than if a custom method is provided.
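For MOT17, the image size does not have to come from the image file at all: each sequence folder ships a `seqinfo.ini` that the dataset class below already reads. The following sketch shows how a height/width lookup could be served from that file. The helper name `read_height_and_width` and the sample file contents are illustrative assumptions, not part of torchvision or of the dataset class.

```python
import configparser

# Illustrative seqinfo.ini contents; real files live in each sequence folder
SEQINFO_TEXT = """\
[Sequence]
name=MOT17-02
imDir=img1
frameRate=30
seqLength=600
imWidth=1920
imHeight=1080
imExt=.jpg
"""


def read_height_and_width(seqinfo_text):
    """Return (height, width) from seqinfo.ini text without opening any image."""
    config = configparser.ConfigParser()
    config.read_string(seqinfo_text)
    seq = config["Sequence"]
    return int(seq["imHeight"]), int(seq["imWidth"])


height, width = read_height_and_width(SEQINFO_TEXT)
print(height, width)
```

A dataset could cache these values per sequence in `__init__` and return them from `get_height_and_width(idx)`, which avoids decoding every frame just to learn its size.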


In MOT17, each sequence folder contains its frames in an image directory and its bounding-box annotations in `gt/gt.txt`. Let's write a `torch.utils.data.Dataset` class for this dataset.
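Before reading the full class, it may help to see the annotation parsing in isolation. The sketch below mirrors the filtering done in `_get_annotation`: column 0 is the frame number, columns 2 to 5 are left, top, width and height, column 6 is a flag marking entries to consider, column 7 is the class (1 for pedestrian) and column 8 is the visibility. The sample rows are made up for illustration.

```python
import csv
import io

# Made-up rows in the gt.txt column layout described above
GT_SAMPLE = """\
1,1,912,484,97,109,0,7,1
1,2,1338,418,167,379,1,1,1
1,3,586,447,85,263,1,1,0.1
2,2,1342,417,168,380,1,1,0.9
"""


def boxes_for_frame(gt_text, frame, vis_threshold=0.25):
    """Return [x1, y1, x2, y2] boxes of sufficiently visible pedestrians in one frame."""
    boxes = []
    for row in csv.reader(io.StringIO(gt_text)):
        keep = (int(row[0]) == frame and int(row[6]) == 1
                and int(row[7]) == 1 and float(row[8]) >= vis_threshold)
        if not keep:
            continue
        # same 0-based, inclusive-corner convention as the dataset class
        x1 = int(row[2]) - 1
        y1 = int(row[3]) - 1
        x2 = x1 + int(row[4]) - 1
        y2 = y1 + int(row[5]) - 1
        boxes.append([x1, y1, x2, y2])
    return boxes


frame1_boxes = boxes_for_frame(GT_SAMPLE, frame=1)
frame2_boxes = boxes_for_frame(GT_SAMPLE, frame=2)
print(frame1_boxes, frame2_boxes)
```

In frame 1, the first row is skipped because its flag is 0 and its class is not pedestrian, and the third row is skipped because its visibility is below the threshold.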

In [None]:
import configparser
import csv
import os
import os.path as osp
import pickle

from PIL import Image
import numpy as np
import scipy
import torch


class MOT17ObjDetect(torch.utils.data.Dataset):
    """ Data class for the Multiple Object Tracking Dataset
    """

    def __init__(self, root, transforms=None, vis_threshold=0.25):
        self.root = root
        self.transforms = transforms
        self._vis_threshold = vis_threshold
        self._classes = ('background', 'pedestrian')
        self._img_paths = []

        for f in os.listdir(root):
            path = os.path.join(root, f)
            config_file = os.path.join(path, 'seqinfo.ini')

            assert os.path.exists(config_file), \
                'Path does not exist: {}'.format(config_file)

            config = configparser.ConfigParser()
            config.read(config_file)
            seq_len = int(config['Sequence']['seqLength'])
            im_width = int(config['Sequence']['imWidth'])
            im_height = int(config['Sequence']['imHeight'])
            im_ext = config['Sequence']['imExt']
            im_dir = config['Sequence']['imDir']

            _imDir = os.path.join(path, im_dir)

            for i in range(1, seq_len + 1):
                img_path = os.path.join(_imDir, f"{i:06d}{im_ext}")
                assert os.path.exists(img_path), \
                    f'Path does not exist: {img_path}'
                self._img_paths.append(img_path)

    @property
    def num_classes(self):
        return len(self._classes)

    def _get_annotation(self, idx):
        """
        """

        if 'test' in self.root:
          
            num_objs = 0
            boxes = torch.zeros((num_objs, 4), dtype=torch.float32)

            return {'boxes': boxes,
                'labels': torch.ones((num_objs,), dtype=torch.int64),
                'image_id': torch.tensor([idx]),
                'area': (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]),
                'iscrowd': torch.zeros((num_objs,), dtype=torch.int64),
                'visibilities': torch.zeros((num_objs), dtype=torch.float32)}
                
        img_path = self._img_paths[idx]
        file_index = int(os.path.basename(img_path).split('.')[0])

        gt_file = os.path.join(os.path.dirname(
            os.path.dirname(img_path)), 'gt', 'gt.txt')

        assert os.path.exists(gt_file), \
            'GT file does not exist: {}'.format(gt_file)

        bounding_boxes = []

        with open(gt_file, "r") as inf:
            reader = csv.reader(inf, delimiter=',')
            for row in reader:
                visibility = float(row[8])
                if int(row[0]) == file_index and int(row[6]) == 1 and int(row[7]) == 1 and visibility >= self._vis_threshold:
                    bb = {}
                    bb['bb_left'] = int(row[2])
                    bb['bb_top'] = int(row[3])
                    bb['bb_width'] = int(row[4])
                    bb['bb_height'] = int(row[5])
                    bb['visibility'] = float(row[8])

                    bounding_boxes.append(bb)

        num_objs = len(bounding_boxes)

        boxes = torch.zeros((num_objs, 4), dtype=torch.float32)
        visibilities = torch.zeros((num_objs), dtype=torch.float32)
        
        for i, bb in enumerate(bounding_boxes):
            # Make pixel indexes 0-based, should already be 0-based (or not)
            x1 = bb['bb_left'] - 1
            y1 = bb['bb_top'] - 1
            # This -1 accounts for the width (width of 1 x1=x2)
            x2 = x1 + bb['bb_width'] - 1
            y2 = y1 + bb['bb_height'] - 1

            boxes[i, 0] = x1
            boxes[i, 1] = y1
            boxes[i, 2] = x2
            boxes[i, 3] = y2
            visibilities[i] = bb['visibility']
            
        return {'boxes': boxes,
                'labels': torch.ones((num_objs,), dtype=torch.int64),
                'image_id': torch.tensor([idx]),
                'area': (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0]),
                'iscrowd': torch.zeros((num_objs,), dtype=torch.int64),
                'visibilities': visibilities,}

    def __getitem__(self, idx):
        # load the image
        img_path = self._img_paths[idx]
        img = Image.open(img_path).convert("RGB")

        target = self._get_annotation(idx)

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self._img_paths)
    
    def write_results_files(self, results, output_dir):
        """Write the detections in the format for MOT17Det submission

        all_boxes[image] = N x 5 array of detections in (x1, y1, x2, y2, score)

        Each file contains these lines:
        <frame>, <id>, <bb_left>, <bb_top>, <bb_width>, <bb_height>, <conf>, <x>, <y>, <z>

        Files to submit:
        ./MOT17-01.txt
        ./MOT17-02.txt
        ./MOT17-03.txt
        ./MOT17-04.txt
        ./MOT17-05.txt
        ./MOT17-06.txt
        ./MOT17-07.txt
        ./MOT17-08.txt
        ./MOT17-09.txt
        ./MOT17-10.txt
        ./MOT17-11.txt
        ./MOT17-12.txt
        ./MOT17-13.txt
        ./MOT17-14.txt
        """

        #format_str = "{}, -1, {}, {}, {}, {}, {}, -1, -1, -1"

        files = {}
        for image_id, res in results.items():
            path = self._img_paths[image_id]
            img1, name = osp.split(path)
            # get image number out of name
            frame = int(name.split('.')[0])
            # smth like /train/MOT17-09-FRCNN or /train/MOT17-09
            tmp = osp.dirname(img1)
            # get the folder name of the sequence and split it
            tmp = osp.basename(tmp).split('-')
            # Now get the output name of the file
            out = tmp[0]+'-'+tmp[1]+'.txt'
            outfile = osp.join(output_dir, out)

            # check if out in keys and create empty list if not
            if outfile not in files.keys():
                files[outfile] = []

            for box, score in zip(res['boxes'], res['scores']):
                x1 = box[0].item()
                y1 = box[1].item()
                x2 = box[2].item()
                y2 = box[3].item()
                files[outfile].append(
                    [frame, -1, x1, y1, x2 - x1, y2 - y1, score.item(), -1, -1, -1])

        for k, v in files.items():
            with open(k, "w") as of:
                writer = csv.writer(of, delimiter=',')
                for d in v:
                    writer.writerow(d)

    def print_eval(self, results, ovthresh=0.5):
        """Evaluates the detections (not official!!)

        all_boxes[cls][image] = N x 5 array of detections in (x1, y1, x2, y2, score)
        """

        if 'test' in self.root:
            print('No GT data available for evaluation.')
            return
            
        # Lists for tp and fp in the format tp[cls][image]
        tp = [[] for _ in range(len(self._img_paths))]
        fp = [[] for _ in range(len(self._img_paths))]

        npos = 0
        gt = []
        gt_found = []

        for idx in range(len(self._img_paths)):
            annotation = self._get_annotation(idx)
            bbox = annotation['boxes'][annotation['visibilities'].gt(self._vis_threshold)]
            found = np.zeros(bbox.shape[0])
            gt.append(bbox.cpu().numpy())
            gt_found.append(found)

            npos += found.shape[0]

        # Loop through all images
        # for res in results:
        for im_index, (im_gt, found) in enumerate(zip(gt, gt_found)):
            # Loop through dets and mark TPs and FPs
            im_det = results[im_index]['boxes'].cpu().numpy()
            im_tp = np.zeros(len(im_det))
            im_fp = np.zeros(len(im_det))
            for i, d in enumerate(im_det):
                ovmax = -np.inf
                if im_gt.size > 0:
                    # compute overlaps
                    # intersection
                    ixmin = np.maximum(im_gt[:, 0], d[0])
                    iymin = np.maximum(im_gt[:, 1], d[1])
                    ixmax = np.minimum(im_gt[:, 2], d[2])
                    iymax = np.minimum(im_gt[:, 3], d[3])
                    iw = np.maximum(ixmax - ixmin + 1., 0.)
                    ih = np.maximum(iymax - iymin + 1., 0.)
                    inters = iw * ih

                    # union
                    uni = ((d[2] - d[0] + 1.) * (d[3] - d[1] + 1.) +
                            (im_gt[:, 2] - im_gt[:, 0] + 1.) *
                            (im_gt[:, 3] - im_gt[:, 1] + 1.) - inters)
                    overlaps = inters / uni
                    ovmax = np.max(overlaps)
                    jmax = np.argmax(overlaps)

                if ovmax > ovthresh:
                    if found[jmax] == 0:
                        im_tp[i] = 1.
                        found[jmax] = 1.
                    else:
                        im_fp[i] = 1.
                else:
                    im_fp[i] = 1.

            tp[im_index] = im_tp
            fp[im_index] = im_fp

        # Flatten out tp and fp into a numpy array
        i = 0
        for im in tp:
            if type(im) != type([]):
                i += im.shape[0]

        tp_flat = np.zeros(i)
        fp_flat = np.zeros(i)

        i = 0
        for tp_im, fp_im in zip(tp, fp):
            if type(tp_im) != type([]):
                s = tp_im.shape[0]
                tp_flat[i:s+i] = tp_im
                fp_flat[i:s+i] = fp_im
                i += s

        tp = np.cumsum(tp_flat)
        fp = np.cumsum(fp_flat)
        rec = tp / float(npos)
        # avoid divide by zero in case the first detection matches a difficult
        # ground truth (probably not needed in my code but doesn't harm if left)
        prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
        tmp = np.maximum(tp + fp, np.finfo(np.float64).eps)

        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

        # to calculate area under PR curve, look for points
        # where X axis (recall) changes value
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])

        tp, fp, prec, rec, ap = np.max(tp), np.max(fp), prec[-1], np.max(rec), ap
        
        print(f"AP: {ap} Prec: {prec} Rec: {rec} TP: {tp} FP: {fp}")


In [None]:
import matplotlib.pyplot as plt
import transforms as T

dataset = MOT17ObjDetect('train')
img, target = dataset[0]

def plot(img, boxes):
  fig, ax = plt.subplots(1, dpi=96)

  img = img.mul(255).permute(1, 2, 0).byte().numpy()
  # after permute the array is (H, W, C), so height comes first
  height, width, _ = img.shape
    
  ax.imshow(img, cmap='gray')
  fig.set_size_inches(width / 80, height / 80)

  for box in boxes:
      rect = plt.Rectangle(
        (box[0], box[1]),
        box[2] - box[0],
        box[3] - box[1],
        fill=False,
        linewidth=1.0)
      ax.add_patch(rect)

  plt.axis('off')
  plt.show()

img, target = T.ToTensor()(img, target)
plot(img, target['boxes'])

That's all for the dataset. Let's see how the outputs are structured for this dataset

In [None]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
      
def get_detection_model(num_classes):
    # load a Faster R-CNN object detection model pre-trained on COCO
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    # get the number of input features for the classifier
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # replace the pre-trained head with a new one
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    model.roi_heads.nms_thresh = 0.3
    
    return model

DATASETS

In [None]:
from engine import train_one_epoch, evaluate
import utils


def get_transform(train):
    transforms = []
    # converts the image, a PIL image, into a PyTorch Tensor
    transforms.append(T.ToTensor())
    if train:
        # during training, randomly flip the training images
        # and ground-truth for data augmentation
        transforms.append(T.RandomHorizontalFlip(0.5))
    return T.Compose(transforms)
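
A horizontal flip has to mirror the ground-truth boxes together with the image, otherwise the labels no longer match the pixels. The sketch below shows the box arithmetic only, assuming boxes in `(x1, y1, x2, y2)` pixel coordinates; `hflip_boxes` is a hypothetical helper and not part of the provided `transforms` module.

```python
def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes around the vertical image axis."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        # The old right edge becomes the new left edge and vice versa,
        # so x1 <= x2 still holds after the flip.
        flipped.append((image_width - x2, y1, image_width - x1, y2))
    return flipped


print(hflip_boxes([(10, 20, 30, 40)], image_width=100))  # [(70, 20, 90, 40)]
```

Flipping twice returns the original boxes, which is a quick sanity check for any flip implementation.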

In [None]:
# use our dataset and defined transformations
dataset = MOT17ObjDetect('train', get_transform(train=True))
dataset_no_random = MOT17ObjDetect('train', get_transform(train=False))
dataset_test = MOT17ObjDetect('test', get_transform(train=False))

# fix the random seed for reproducibility
torch.manual_seed(1)

# define training and validation data loaders
data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=2, shuffle=True, num_workers=4,
    collate_fn=utils.collate_fn)
data_loader_no_random = torch.utils.data.DataLoader(
    dataset_no_random, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

data_loader_test = torch.utils.data.DataLoader(
    dataset_test, batch_size=1, shuffle=False, num_workers=4,
    collate_fn=utils.collate_fn)

INIT MODEL AND OPTIM

In [None]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

# get the model using our helper function
model = get_detection_model(dataset.num_classes)
# move model to the right device
model.to(device)

# construct an optimizer
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.00001,
                            momentum=0.9, weight_decay=0.0005)

# and a learning rate scheduler which decreases the learning rate by
# 10x every 10 epochs
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                               step_size=10,
                                               gamma=0.1)
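
To see what this schedule does over a full run, the step decay can be written in closed form: the learning rate is the base rate multiplied by `gamma` once for every completed block of `step_size` epochs. The sketch below is plain Python and does not call PyTorch; `step_lr` is a hypothetical helper that assumes the scheduler is stepped once per epoch.

```python
import math


def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate at a given (0-based) epoch under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# With the settings above, the rate drops after epochs 9 and 19.
for epoch in (0, 9, 10, 19, 20, 26):
    print(epoch, step_lr(1e-5, epoch))

assert math.isclose(step_lr(1e-5, 26), 1e-7)
```

For a 27-epoch run this gives three plateaus, at roughly 1e-5, 1e-6, and 1e-7.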

TRAINING

In [None]:
def evaluate_and_write_result_files(model, data_loader):
  model.eval()
  results = {}
  for imgs, targets in data_loader:
    imgs = [img.to(device) for img in imgs]

    with torch.no_grad():
        preds = model(imgs)
    
    for pred, target in zip(preds, targets):
        results[target['image_id'].item()] = {'boxes': pred['boxes'].cpu(),
                                              'scores': pred['scores'].cpu()}

  data_loader.dataset.print_eval(results)
  data_loader.dataset.write_results_files(results, '/content/gdrive/MyDrive/faster_rcnn_fpn_training_mot_17/resnet50/')
  
# evaluate_and_write_result_files(model, data_loader_test)
# evaluate_and_write_result_files(model, data_loader_no_random)

Starting from the provided pretrained Faster R-CNN, train the model for a further 27 epochs on the MOT-17 dataset. Use the sample code to evaluate the model and report the Average Precision (AP) on both the train and test sets.


In [None]:
# todo


Randomly select some images and visualize their detection results.  


In [None]:
# todo


Tracktor can be configured by editing the corresponding `experiments/cfgs/tracktor.yaml` config file. The default configuration runs Tracktor with the FPN object detector and is almost the same as described in the paper, except that the re-identification model is turned off (**do_reid=False, load_results=True**).


Run inference with `experiments/scripts/test_tracktor.py` on the MOT-17 train set. The tracking results are written to the corresponding `output/` folder. Open one of the generated result files and explain what the first six values in each line are (e.g., frame_id, bounding box (xywh or xyxy?), confidence, track_id). Plot the values on the corresponding images and show the video results.

In [None]:
%%shell

# Work around some bugs caused by the previous installation
pip install motmetrics
pip install sacred==0.8.0
pip install PyYAML==5.1.2
pip install lap==0.4.0

In [None]:
# install the deep-person-reid dependency
%cd /content
!git clone https://github.com/KaiyangZhou/deep-person-reid.git

%cd deep-person-reid
!pip install -r requirements.txt
!python setup.py develop

# download the MOT17 data and the pretrained Tracktor outputs
%cd /content/tracking_wo_bnw
!wget https://motchallenge.net/data/MOT17.zip
!unzip -q -d data MOT17.zip

!wget https://vision.in.tum.de/webshare/u/meinhard/tracking_wo_bnw-output_v5.zip
!unzip -q tracking_wo_bnw-output_v5.zip

do_align: True, do_reID: False (Tracktor+CMC)

In [None]:
!python experiments/scripts/test_tracktor.py

The results are logged in the corresponding `output` directory.

Question: What are the first six values generated in each line?
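
When inspecting a result file, a small parser makes the columns easier to reason about. The sketch below assumes the standard MOTChallenge text layout (`frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z`); check this against the files you actually generated before relying on it. `parse_mot_line` is a hypothetical helper.

```python
def parse_mot_line(line):
    """Split one comma-separated result line into named fields."""
    fields = line.strip().split(",")
    frame = int(float(fields[0]))
    track_id = int(float(fields[1]))
    left, top, width, height = (float(v) for v in fields[2:6])
    return {
        "frame": frame,
        "track_id": track_id,
        # Assumed layout: top-left corner plus width and height.
        "xywh": (left, top, width, height),
        # Corner form, convenient for drawing rectangles.
        "xyxy": (left, top, left + width, top + height),
    }


row = parse_mot_line("1,3,100.0,200.0,50.0,80.0,-1,-1,-1,-1")
print(row["frame"], row["track_id"], row["xyxy"])
```

Grouping the parsed rows by `frame` gives the boxes to draw on each image when building the video.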


In [None]:
# todo: plot the values on the corresponding images and show the video results


Evaluate the performance using test_tracktor.py and report the following metrics: MOTA, MOTP, IDF1, FP. (**Hint: these metrics have already been logged in the Colab outputs of the previous problems, so you can simply copy them here.**)
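
For reference when reading the logged numbers, MOTA is commonly defined as one minus the ratio of all errors (false negatives, false positives, and identity switches) to the number of ground-truth boxes. The sketch below only illustrates that formula; the values reported for this assignment should come from the evaluation output, and the function name `mota` is hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """Multiple Object Tracking Accuracy from summed error counts."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


# Made-up counts, for illustration only.
print(round(mota(10, 5, 5, 100), 3))  # 0.8
```

Because the errors are summed without clipping, MOTA becomes negative when the tracker makes more errors than there are ground-truth boxes.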

Rerun inference with test_tracktor.py after changing the following settings in experiments/cfgs/tracktor.yaml, and evaluate the performance:

* tracktor/tracker:
    * detection_person_thresh (FRCNN score threshold for detections): 0.5
    * detection_nms_thresh (NMS threshold for detection): 0.3
    * number_of_iterations (maximal number of iterations): 100
    * max_features_num (how many of the most recent appearance features to keep): 10
    * motion_model (motion model settings, mentioned in 2.3): disabled
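
To make the first two settings concrete, the sketch below applies a score threshold followed by greedy non-maximum suppression to a handful of boxes. It is a plain-Python illustration of the general technique, not Tracktor's own implementation; `iou` and `filter_detections` are hypothetical helpers and boxes are assumed to be `(x1, y1, x2, y2)`.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_thresh=0.5, nms_thresh=0.3):
    """Return indices kept after score filtering and greedy NMS."""
    # Drop low-confidence boxes, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s > score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a stronger kept box.
        if all(iou(boxes[i], boxes[k]) <= nms_thresh for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
scores = [0.9, 0.8, 0.7, 0.4]
print(filter_detections(boxes, scores))  # [0, 2]
```

Raising the score threshold tends to trade false positives for misses, while a lower NMS threshold suppresses more overlapping boxes, which matters in crowded scenes. Both effects show up in MOTA through the FP and FN counts.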

Feel free to change at least three hyperparameters (can be from detection or tracking). Discuss how these changes may affect the tracking performance based on MOTA. 


**Discussion**

How do these changes affect the tracking performance based on MOTA?



## Problem 3: Train Mask R-CNN on a balloon dataset

In Problem 1, we used Faster R-CNN to train on the traffic sign dataset for object detection. With a few lines of modification, we can train an instance segmentation model as well. Note that the traffic sign dataset contains only bounding box labels and no segmentation masks, which is not enough to train a Mask R-CNN model. For this reason, we switch to another dataset, the [balloon segmentation dataset](https://github.com/matterport/Mask_RCNN/tree/master/samples/balloon), which has only one class: balloon.


In [None]:
# download the balloon dataset and decompress it
!wget https://github.com/matterport/Mask_RCNN/releases/download/v2.1/balloon_dataset.zip
!unzip balloon_dataset.zip > /dev/null

Write code to load and visualize the balloon dataset in a similar manner. Look carefully at the label files and construct your `get_balloon_dicts` function so that it also loads the polygon mask information. If you load the dataset correctly, you will see training samples like the following.

In [None]:
import json
from detectron2.structures import BoxMode

def get_balloon_dicts(img_dir):
    """
    Write your code to load the balloon dataset (including polygon masks)
    """
    # todo
    return
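
The part of `get_balloon_dicts` that differs from a box-only dataset is the polygon handling. The sketch below shows one way to turn a single region into a bounding box plus a flattened polygon, assuming each region in the label file carries `all_points_x` and `all_points_y` lists under `shape_attributes` (inspect the JSON to confirm). `region_to_annotation` is a hypothetical helper; in the full function the box would additionally be tagged with `BoxMode.XYXY_ABS`.

```python
def region_to_annotation(shape_attributes, category_id=0):
    """Convert one polygon region to a box and a flat polygon list."""
    px = shape_attributes["all_points_x"]
    py = shape_attributes["all_points_y"]

    # Shift by half a pixel so coordinates refer to pixel centers,
    # then flatten to [x0, y0, x1, y1, ...].
    polygon = []
    for x, y in zip(px, py):
        polygon.extend([x + 0.5, y + 0.5])

    return {
        # Tight box around the polygon, in (x1, y1, x2, y2) order.
        "bbox": [min(px), min(py), max(px), max(py)],
        "segmentation": [polygon],
        "category_id": category_id,
    }


region = {"all_points_x": [1, 4, 4], "all_points_y": [2, 2, 6]}
print(region_to_annotation(region))
```

Each image's record would then collect one such dictionary per region in its list of annotations.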

In [None]:
# todo

Fine-tune the pre-trained model `COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml` on the balloon dataset with the following configuration, and show the TensorBoard visualization.

* `SOLVER.IMS_PER_BATCH = 2`
* `SOLVER.BASE_LR = 0.00025`
* `SOLVER.MAX_ITER = 300`
* `MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128`
* `MODEL.ROI_HEADS.NUM_CLASSES = 1`



In [None]:
# todo

Use your trained model to run inference on the test dataset and plot at least 3 prediction results. Then use the COCO API to report your test Average Precision (AP). If your model is trained correctly, you will see prediction results like the following figures.

In [None]:
# todo

COCO API Evaluation

In [None]:
# todo

## Problem 4: 2D Human Pose Estimation

In [None]:
import h5py
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
import cv2
import math
import os
import importlib

### (a) Prepare code and dataset

#### Download source code

In [None]:
!git clone https://github.com/princeton-vl/pytorch_stacked_hourglass.git

#### Download MPII dataset

In [None]:
!wget https://datasets.d2.mpi-inf.mpg.de/andriluka14cvpr/mpii_human_pose_v1.tar.gz

In [None]:
!tar -xvzf mpii_human_pose_v1.tar.gz -C pytorch_stacked_hourglass/data/MPII/

### (b) Visualize some images in MPII dataset

In [None]:
# todo: visualize some images in MPII dataset


### (c) Train the network

Train with 2 stacks (2HG). Please keep your terminal output in order to get full credit.

In [None]:
%cd pytorch_stacked_hourglass/
!python train.py -e test_run_001 --max_iters 4

Plot your 2HG training loss from the log file.

In [None]:
# todo


Evaluate your trained model on the MPII validation set.

In [None]:
!python test.py -c test_run_001

### (d) Inference and Visualization

#### Infer HPE using the pretrained models

Download the 2HG and 8HG pretrained models.

In [None]:
!wget http://www-personal.umich.edu/~cnris/original_8hg/checkpoint.pt -P ./exp/test_run_003/

In [None]:
!wget http://www-personal.umich.edu/~cnris/original_2hg/checkpoint.pt -P ./exp/test_run_002/

Infer and evaluate the pretrained 2HG model. Please keep your terminal output in order to get full credit.

In [None]:
!python test.py -c test_run_002

Infer and evaluate the pretrained 8HG model. Please keep your terminal output in order to get full credit.

In [None]:
# todo

Visualize some human pose estimation results.

2HG results

In [None]:
# todo


8HG results

In [None]:
# todo


#### Try customized images

Download some images from the internet, then run inference and visualize the estimated human poses.

In [None]:
# todo