# Detectron2 Beginner's Tutorial

<img src="https://dl.fbaipublicfiles.com/detectron2/Detectron2-Logo-Horz.png" width="500">

Welcome to detectron2! This is the official Colab tutorial of detectron2. Here, we will go through some basic usage of detectron2, including the following:
* Run inference on images or videos with an existing detectron2 model.
* Train a detectron2 model on a new dataset.


In [None]:
!python -m pip install pyyaml==5.1
import sys, os, distutils.core
# Note: This is a faster way to install detectron2 in Colab, but it does not include all functionalities (e.g. compiled operators).
# See https://detectron2.readthedocs.io/tutorials/install.html for full installation instructions
!git clone 'https://github.com/facebookresearch/detectron2'
dist = distutils.core.run_setup("./detectron2/setup.py")
!python -m pip install {' '.join([f"'{x}'" for x in dist.install_requires])}
sys.path.insert(0, os.path.abspath('./detectron2'))

In [None]:
import torch, detectron2
!nvcc --version
TORCH_VERSION = ".".join(torch.__version__.split(".")[:2])
CUDA_VERSION = torch.__version__.split("+")[-1]
print("torch: ", TORCH_VERSION, "; cuda: ", CUDA_VERSION)
print("detectron2:", detectron2.__version__)

In [None]:
# Some basic setup:
# Setup detectron2 logger
import detectron2
from detectron2.utils.logger import setup_logger
setup_logger()

# import some common libraries
import numpy as np
import os, json, cv2, random
from google.colab.patches import cv2_imshow

# import some common detectron2 utilities
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog

### Run a pretrained Detectron2 model

We first download some images from the given URLs:

In [None]:
!wget http://images.cocodataset.org/val2017/000000007574.jpg -q -O input.jpg
im_input = cv2.imread("./input.jpg")
cv2_imshow(im_input)

In [None]:
!wget http://images.cocodataset.org/val2017/000000013923.jpg -q -O test1.jpg
im_test1 = cv2.imread("./test1.jpg")
cv2_imshow(im_test1)

In [None]:
!wget http://images.cocodataset.org/val2017/000000018380.jpg -q -O test2.jpg
im_test2 = cv2.imread("./test2.jpg")
cv2_imshow(im_test2)

We can see there are multiple objects in these images: bottles, tables, chairs, people, etc. Let's see whether we can detect them all using a pre-trained model provided by Detectron2.


Let's take a look at the model output.

In inference mode, the builtin model outputs a `list[dict]`, one dict for each image (`DefaultPredictor` takes a single image and returns just that image's dict). For the object detection task, the dict contains the following fields:

*   "instances": an `Instances` object with the following fields:
    * "pred_boxes": a `Boxes` object storing N boxes, one for each detected instance.
    * "scores": a vector of N confidence scores.
    * "pred_classes": a vector of N labels in range [0, num_categories).

For more details, please see the specification at https://detectron2.readthedocs.io/tutorials/models.html#model-output-format
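To see what a score threshold such as `SCORE_THRESH_TEST` does to these fields, here is a minimal NumPy sketch on made-up arrays shaped like `pred_boxes`, `scores` and `pred_classes`. The helper `filter_by_score` is hypothetical, not a detectron2 API; detectron2 applies a comparable score filter internally before returning `instances`, so this only illustrates the idea.

```python
import numpy as np

def filter_by_score(boxes, scores, classes, score_thresh=0.5):
    """Keep only detections whose confidence score is above score_thresh.

    boxes:   (N, 4) array of XYXY boxes
    scores:  (N,) array of confidence scores
    classes: (N,) array of class indices in [0, num_categories)
    """
    keep = scores > score_thresh  # boolean mask, one entry per detection
    return boxes[keep], scores[keep], classes[keep]

# Made-up detections for illustration only.
boxes = np.array([[10, 10, 50, 80], [30, 40, 90, 120], [5, 5, 15, 15]], dtype=float)
scores = np.array([0.92, 0.55, 0.31])
classes = np.array([0, 39, 41])

kept_boxes, kept_scores, kept_classes = filter_by_score(boxes, scores, classes, 0.5)
print(kept_scores)  # the detection scored 0.31 is dropped
```

Raising the threshold trades recall for precision: fewer, more confident boxes survive.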



In [None]:
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set the confidence (score) threshold for this model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")
predictor = DefaultPredictor(cfg)
outputs = predictor(im_input)

In [None]:
print(outputs)

In [None]:
print(outputs["instances"].pred_classes)
print(outputs["instances"].pred_boxes)

In [None]:
outputs_q1q2 = {'q1': [], 'q2': []}
outputs_q1q2['q1'].append(outputs["instances"])

In [None]:
# We can use "Visualizer" to draw the predictions on the image
v = Visualizer(im_input[:, :, ::-1], MetadataCatalog.get(cfg.DATASETS.TRAIN[0]), scale=1.2)
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(out.get_image()[:, :, ::-1])

AWESOME!!! Great progress so far! We are able to detect classes such as sink, microwave, bottle and even refrigerator! At this point, we have used the pre-trained model to run inference on the given image, and a total of 17 objects are detected. The image is taken from the [MS-COCO](https://cocodataset.org/#home) dataset, whose detection task has 80 object categories (81 if the background class is counted), including person, bicycle, car, etc. You can find the id-category mapping [here](https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda).
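When comparing models (as Q2 asks), it helps to tally detections per class and look at the average confidence. The sketch below is a hypothetical helper that works on plain Python lists; the class names and scores in the example call are made up.

```python
from collections import Counter

def summarize_detections(pred_classes, scores, class_names):
    """Count detections per class name and report the mean confidence score.

    pred_classes: iterable of integer class indices
    scores:       iterable of confidence scores, aligned with pred_classes
    class_names:  list mapping class index -> readable name
    """
    pred_classes = list(pred_classes)
    scores = list(scores)
    counts = Counter(class_names[c] for c in pred_classes)
    mean_score = sum(scores) / len(scores) if scores else 0.0
    return counts, mean_score

# Illustrative call with made-up predictions and a tiny class-name list.
names = ["person", "bottle", "sink"]
counts, mean_score = summarize_detections([0, 1, 1, 2], [0.9, 0.8, 0.6, 0.7], names)
print(counts.most_common())
print(f"mean score: {mean_score:.3f}")
```

With real model outputs you could pass `outputs["instances"].pred_classes.tolist()`, `outputs["instances"].scores.tolist()` and `MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).thing_classes`.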

The model we just used is `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml`. Detectron2 actually provides many more: you can find a large number of models for different tasks in the [MODEL_ZOO](https://github.com/facebookresearch/detectron2/tree/master/configs). Why don't we try a different model and see what its output looks like?


* Q1 (5%): Object Detection. Use the same configuration `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` with a score threshold of 0.5 (`SCORE_THRESH_TEST=0.5`) to also run inference on the remaining two images (test1.jpg & test2.jpg), and view the outputs with bounding boxes.

* Q2: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml`, which has a ResNet-101 backbone, with a score threshold of 0.5, and view the outputs of all three images with bounding boxes. Looking at the outputs, what differences do you find compared with the `COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml` model used in Q1 (e.g., number of objects, confidence scores, ...)?

* Q3: Object Detection. Use `COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml` with a score threshold of 0.9 (`SCORE_THRESH_TEST=0.9`) and view the outputs of all three images with bounding boxes.

* Q4 (5%): Instance Segmentation. The models we have tried in Q1-Q3 are Faster R-CNN models for object detection. Here, let’s try a Mask R-CNN model, `COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml`, with a score threshold of 0.5, to perform instance segmentation and view the outputs of all three images with segmentation masks. Compare the outputs of the object detection model with those of the instance segmentation model.
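Note that `SCORE_THRESH_TEST` filters detections by confidence score, whereas the IoU threshold that matters at test time belongs to non-maximum suppression (NMS), which detectron2 configures separately via `MODEL.ROI_HEADS.NMS_THRESH_TEST`. The class-agnostic greedy NMS below is an illustrative NumPy sketch, not detectron2's own implementation (which runs NMS per class); the function names and example boxes are hypothetical.

```python
import numpy as np

def box_iou(a, b):
    """IoU between one XYXY box a and an (M, 4) array of XYXY boxes b (continuous coordinates)."""
    x1 = np.maximum(a[0], b[:, 0])
    y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2])
    y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def score_filter_then_nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Drop low-confidence boxes, then greedily suppress boxes overlapping a higher-scoring one."""
    idx = np.where(scores > score_thresh)[0]   # step 1: confidence (score) threshold
    idx = idx[np.argsort(-scores[idx])]        # highest score first
    keep = []
    while idx.size > 0:
        best = idx[0]
        keep.append(int(best))
        overlaps = box_iou(boxes[best], boxes[idx[1:]])
        idx = idx[1:][overlaps <= iou_thresh]  # step 2: IoU threshold (NMS)
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.4])
print(score_filter_then_nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5))
```

The first two boxes overlap with IoU of about 0.68: at `iou_thresh=0.5` the weaker one is suppressed, while at 0.7 both survive. The third box is removed by the score threshold alone.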

In [None]:
# todo: Q1


In [None]:
# todo: Q2


In [None]:
# todo: Q3


In [None]:
# todo: Q4


# Now let's train on the provided SportsMOT dataset
Please upload the provided SportsMOT dataset (`sportsmot.zip`) first.

In [None]:
!unzip sportsmot.zip

In [None]:
from detectron2.structures import BoxMode

def get_dataset_dicts(data_root, txt_file):
    dataset_dicts = []
    filenames = []
    list_path = os.path.join(data_root, txt_file)  # text file listing one image path per line
    with open(list_path, "r") as f:
        for line in f:
            filenames.append(line.rstrip())

    for idx, filename in enumerate(filenames):
        record = {}

        image_path = os.path.join(data_root, filename)

        im = cv2.imread(image_path)
        if im is None:
            continue
        height, width = im.shape[:2]

        record['file_name'] = image_path
        record['image_id'] = idx
        record['height'] = height
        record['width'] = width

        image_filename = os.path.basename(filename)
        image_name = os.path.splitext(image_filename)[0]
        annotation_path = os.path.join(data_root, 'labels', '{}.txt'.format(image_name))
        annotation_rows = []

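        # Each label line is parsed as: class_id x_centre y_centre width height,
        # with coordinates normalized by the image width/height.
        with open(annotation_path, "r") as f:
            for line in f:
                temp = line.split()
                if len(temp) >= 5:  # skip blank or malformed lines
                    annotation_rows.append(temp)

        objs = []
        for row in annotation_rows:
            # Convert the normalized centre/size into absolute pixel values
            xcentre = int(float(row[1])*width)
            ycentre = int(float(row[2])*height)
            bwidth = int(float(row[3])*width)
            bheight = int(float(row[4])*height)

            # Convert centre/size to top-left and bottom-right corners (XYXY)
            xmin = int(xcentre - bwidth/2)
            ymin = int(ycentre - bheight/2)
            xmax = xmin + bwidth
            ymax = ymin + bheight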

            obj= {
                'bbox': [xmin, ymin, xmax, ymax],
                'bbox_mode': BoxMode.XYXY_ABS,
                # alternatively, we can use bbox_mode = BoxMode.XYWH_ABS
                # 'bbox': [xmin, ymin, bwidth, bheight],
                # 'bbox_mode': BoxMode.XYWH_ABS,
                'category_id': int(row[0]),
                'iscrowd': 0
            }

            objs.append(obj)
        record['annotations'] = objs
        dataset_dicts.append(record)
    return dataset_dicts
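Judging from how `get_dataset_dicts` parses each line, the label files use a YOLO-style layout, `class_id x_center y_center width height`, with coordinates normalized by the image size. The hypothetical helper below isolates that conversion (including the same integer truncation) so it can be checked on a single made-up label line for a 1280x720 frame.

```python
def yolo_to_xyxy(row, width, height):
    """Convert one normalized label row [class_id, x_center, y_center, w, h] to absolute XYXY pixels."""
    class_id = int(row[0])
    xcentre = int(float(row[1]) * width)
    ycentre = int(float(row[2]) * height)
    bwidth = int(float(row[3]) * width)
    bheight = int(float(row[4]) * height)
    # Shift from the centre to the top-left corner, then add the size for the bottom-right corner.
    xmin = int(xcentre - bwidth / 2)
    ymin = int(ycentre - bheight / 2)
    return class_id, [xmin, ymin, xmin + bwidth, ymin + bheight]

# A made-up label line: a box centred in a 1280x720 frame, 10% wide and 20% tall.
print(yolo_to_xyxy("0 0.5 0.5 0.1 0.2".split(), 1280, 720))
```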

In [None]:
import os
import os.path as osp
# Metadata configurations
data_root = "sportsmot"
train_txt = "sportsmot_train.txt"
test_txt = "sportsmot_test.txt"

train_data_name = "train"
test_data_name = "test"

thing_classes = ["person"]

output_dir = "./outputs"

def count_lines(fname):
    # Count the lines in a text file (returns 0 for an empty file)
    with open(fname) as f:
        return sum(1 for _ in f)

train_img_count = count_lines(os.path.join(data_root, train_txt))
print("There are {} samples in training data".format(train_img_count))

In [None]:
# Register the SportsMOT training split
DatasetCatalog.register(name=train_data_name,
                        func=lambda: get_dataset_dicts(data_root, train_txt))
train_metadata = MetadataCatalog.get(train_data_name).set(thing_classes=thing_classes)

# Register the SportsMOT test split
DatasetCatalog.register(name=test_data_name,
                        func=lambda: get_dataset_dicts(data_root, test_txt))
test_metadata = MetadataCatalog.get(test_data_name).set(thing_classes=thing_classes)

In [None]:
train_data_dict = get_dataset_dicts(data_root, train_txt)

for d in random.sample(train_data_dict, 3):
    img = cv2.imread(d["file_name"])
    visualizer = Visualizer(img[:, :, ::-1], metadata=train_metadata, scale=0.5)
    out = visualizer.draw_dataset_dict(d)
    cv2_imshow(out.get_image()[:, :, ::-1])

In [None]:
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml"))
cfg.DATASETS.TRAIN = (train_data_name,)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_1x.yaml")  # initialize training from the model zoo weights
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.0001  # pick a good LR
cfg.SOLVER.MAX_ITER = 300    # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = len(thing_classes)  # only one class (person)
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset (default: 512)
cfg.OUTPUT_DIR = output_dir
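Detectron2 measures training length in iterations (`SOLVER.MAX_ITER`), each processing `SOLVER.IMS_PER_BATCH` images, so it can be useful to translate the schedule into approximate epochs. A back-of-the-envelope sketch; the helper name is hypothetical and the dataset size of 1200 is a placeholder, so substitute `train_img_count` from above.

```python
def approx_epochs(max_iter, ims_per_batch, num_train_images):
    """Approximate passes over the training set: total images seen / dataset size."""
    return max_iter * ims_per_batch / num_train_images

# MAX_ITER = 300 and IMS_PER_BATCH = 2 means 600 images are seen in total.
# 1200 is a placeholder dataset size, not the real SportsMOT split size.
print(approx_epochs(300, 2, 1200))
```

If this comes out well below 1, the model has not even seen every training image once, which is worth keeping in mind when judging the results.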

In [None]:
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

In [None]:
# cfg already contains everything we've set previously. Now we change it a little bit for inference:
cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")  # path to the model we just trained
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
predictor = DefaultPredictor(cfg)

In [None]:
from detectron2.utils.visualizer import ColorMode

test_data_dict = get_dataset_dicts(data_root, test_txt)

for d in random.sample(test_data_dict, 3):
    im = cv2.imread(d["file_name"])
    outputs = predictor(im)
    v = Visualizer(im[:, :, ::-1],
                   metadata=test_metadata,
                   scale=0.5,
                   )
    out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    cv2_imshow(out.get_image()[:, :, ::-1])

## Problem 2
Multi-Object Tracking  

Now that the detector is trained, we want to run tracking on the test video.

In [None]:
# Let's start with a detector class
class detector:
    def __init__(self, predictor):
        self.model = predictor  # e.g. a DefaultPredictor built from the trained cfg

    def predict(self, img):
        pred = self.model(img)
        # Return one [x1, y1, x2, y2] list per detected instance
        pred = pred['instances'].pred_boxes.tensor.cpu().tolist()
        return pred


# TODO
# Instantiate a detector and run inference on the first test image ('sportsmot/JPEGImages/test_000001.jpg'), then print the bounding box predictions.
# The output format should be x1,y1,x2,y2

Now you will implement your own tracker!  

Let's start with the IoU function and tracklet class.

In [None]:
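# calculate the intersection over union (IoU) of two XYXY bounding boxes;
# the +1 terms follow the inclusive-pixel convention (a box from x1 to x2 spans x2 - x1 + 1 pixels)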
def calculate_iou(bbox1, bbox2):

    x1_1, y1_1, x2_1, y2_1 = bbox1
    x1_2, y1_2, x2_2, y2_2 = bbox2
    x_left = max(x1_1, x1_2)
    y_top = max(y1_1, y1_2)
    x_right = min(x2_1, x2_2)
    y_bottom = min(y2_1, y2_2)

    if x_right < x_left or y_bottom < y_top:
        return 0.0

    area_bbox1 = (x2_1 - x1_1 + 1) * (y2_1 - y1_1 + 1)
    area_bbox2 = (x2_2 - x1_2 + 1) * (y2_2 - y1_2 + 1)
    intersection_area = (x_right - x_left + 1) * (y_bottom - y_top + 1)

    iou = intersection_area / float(area_bbox1 + area_bbox2 - intersection_area)

    return iou
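Building the tracker's cost matrix with a double loop over `calculate_iou` works, but it can be vectorized. The hypothetical `iou_matrix` below is a NumPy sketch that keeps the same +1 inclusive-pixel convention and the same zero-overlap check, so it should agree with `calculate_iou` element by element.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two lists of XYXY boxes, with the +1 pixel convention of calculate_iou."""
    a = np.asarray(boxes_a, dtype=float).reshape(-1, 4)
    b = np.asarray(boxes_b, dtype=float).reshape(-1, 4)
    # Broadcast to shape (len(a), len(b)) for the intersection rectangle.
    x_left = np.maximum(a[:, None, 0], b[None, :, 0])
    y_top = np.maximum(a[:, None, 1], b[None, :, 1])
    x_right = np.minimum(a[:, None, 2], b[None, :, 2])
    y_bottom = np.minimum(a[:, None, 3], b[None, :, 3])
    # Same rule as calculate_iou: no overlap when the corners cross.
    overlap = (x_right >= x_left) & (y_bottom >= y_top)
    inter = np.where(overlap, (x_right - x_left + 1) * (y_bottom - y_top + 1), 0.0)
    area_a = (a[:, 2] - a[:, 0] + 1) * (a[:, 3] - a[:, 1] + 1)
    area_b = (b[:, 2] - b[:, 0] + 1) * (b[:, 3] - b[:, 1] + 1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

m = iou_matrix([[0, 0, 9, 9]], [[0, 0, 9, 9], [20, 20, 29, 29], [5, 0, 14, 9]])
print(m.round(3))
```

Inside `IoU_Tracker.update`, the nested loops could then become something like `cost_matrix = 1 - iou_matrix([t.cur_box for t in self.cur_tracklets], detection)`.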

In [None]:
# base class for tracklet, you can of course add more features and try to improve the performance!
class tracklet:
    def __init__(self,tracking_ID,box):
        self.ID = tracking_ID
        self.cur_box = box
        self.alive = True

    def update(self,box):
        self.cur_box = box

    def close(self):
        self.alive = False
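With the provided scaffold, an unmatched tracklet is closed as soon as it misses a single frame. One way to approach the second improvement trick listed at the end of this notebook is to give each tracklet an age counter. The `AgingTracklet` class and its `max_age`, `time_since_update` and `mark_missed` members are hypothetical additions, not part of the provided code.

```python
class AgingTracklet:
    """Hypothetical tracklet variant that tolerates short gaps (missed detections).

    The tracklet is only closed after more than max_age consecutive unmatched frames.
    """

    def __init__(self, tracking_ID, box, max_age=5):
        self.ID = tracking_ID
        self.cur_box = box
        self.alive = True
        self.max_age = max_age
        self.time_since_update = 0  # consecutive frames without a matched detection

    def update(self, box):
        self.cur_box = box
        self.time_since_update = 0

    def mark_missed(self):
        """Call this instead of close() when the tracklet is unmatched in the current frame."""
        self.time_since_update += 1
        if self.time_since_update > self.max_age:
            self.close()

    def close(self):
        self.alive = False

t = AgingTracklet(1, [0, 0, 10, 10], max_age=2)
t.mark_missed()
t.mark_missed()
print(t.alive)  # still alive after 2 consecutive misses
t.mark_missed()
print(t.alive)  # closed after the 3rd consecutive miss
```

In the tracker you would call `mark_missed()` on unmatched tracklets instead of `close()`. Since a coasting tracklet keeps its last box, you may also want to write results only for tracklets matched in the current frame (`time_since_update == 0`).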

In [None]:
from scipy.optimize import linear_sum_assignment

class IoU_Tracker:
    def __init__(self):
        self.all_tracklets = [] # this saves all the tracklets so that we can know how many tracklets we have
        self.cur_tracklets = [] # this saves tracklets from the last frame for current frame's association
        self.online_tracklets = [] # this saves the tracklets after association, so we can pass the tracking result to output

    def update(self,frame_id,detection):

        if frame_id%100 == 0:
            print(f'Running tracking || current frame {frame_id}')

        if len(self.cur_tracklets) == 0:
            for det in detection:
                new_tracklet = tracklet(len(self.all_tracklets)+1,det)
                self.cur_tracklets.append(new_tracklet)
                self.all_tracklets.append(new_tracklet)
        else:
            cost_matrix = np.zeros((len(self.cur_tracklets),len(detection)))

            # build up cost matrix, each element in cost matrix should be 1-IoU between tracklet and detection
            for row in range(len(self.cur_tracklets)):
                for col in range(len(detection)):
                    cost_matrix[row][col] = 1 - calculate_iou(self.cur_tracklets[row].cur_box,detection[col])

            row_inds,col_inds = linear_sum_assignment(cost_matrix)

            matches = min(len(row_inds),len(col_inds))

            for idx,trk in enumerate(self.cur_tracklets):
                if idx not in row_inds: # if it is not matched in the above Hungarian algorithm stage
                    # TODO
                    # use tracklet's close function to kill those unmatched tracklets
                    pass

            for idx,det in enumerate(detection):
                if idx not in col_inds: # if it is not matched in the above Hungarian algorithm stage
                    # TODO
                    # initiate unmatched detections as new tracklets
                    pass


            for idx in range(matches):
                row,col = row_inds[idx],col_inds[idx]
                if cost_matrix[row][col] == 1:
                    # TODO 1. Kill the tracklet using tracklet's close function
                    # TODO 2. Initiate a new tracklet for the new detection
                    # TODO 3. Append new tracklet to the current tracklets and all tracklets
                    pass
                else:
                    self.cur_tracklets[row].update(detection[col])

        self.cur_tracklets = [trk for trk in self.cur_tracklets if trk.alive]

        return self.cur_tracklets
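The scaffold rejects a Hungarian match only when its cost is exactly 1 (zero overlap). A common refinement, and an assumption here rather than part of the provided code, is to gate matches with a maximum cost, i.e. a minimum IoU. The hypothetical `split_matches` below splits an assignment into accepted matches, unmatched tracklets and unmatched detections; the threshold 0.7 (minimum IoU 0.3) is arbitrary, and the indices in the example are those `linear_sum_assignment` returns for this small matrix.

```python
import numpy as np

def split_matches(cost_matrix, row_inds, col_inds, max_cost=0.7):
    """Split a Hungarian assignment into matches, unmatched tracklets and unmatched detections.

    A pair (row, col) is rejected when cost = 1 - IoU exceeds max_cost.
    """
    num_tracks, num_dets = cost_matrix.shape
    matches = []
    unmatched_tracks = set(range(num_tracks))
    unmatched_dets = set(range(num_dets))
    for row, col in zip(row_inds, col_inds):
        if cost_matrix[row, col] <= max_cost:
            matches.append((int(row), int(col)))
            unmatched_tracks.discard(int(row))
            unmatched_dets.discard(int(col))
    return matches, sorted(unmatched_tracks), sorted(unmatched_dets)

cost = np.array([[0.1, 0.9, 1.0],
                 [0.95, 0.2, 1.0]])
# Row/column indices as linear_sum_assignment(cost) returns them for this matrix.
matches, lost_tracks, new_dets = split_matches(cost, np.array([0, 1]), np.array([0, 1]), max_cost=0.7)
print(matches, lost_tracks, new_dets)
```

Unmatched tracklets would then be closed (or aged), and unmatched detections would start new tracklets.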

Now it's time to run tracking!

In [None]:
import glob

# TODO
# initialize your detector (the one you just trained)

# TODO
# initialize tracker

# TODO run tracking
images = np.loadtxt('sportsmot/sportsmot_test.txt',dtype=str)
results = []

print(f'length of sequence is {len(images)}')

for frame_id, img_path in enumerate(images,1):
    img = cv2.imread('sportsmot/'+img_path)
    # TODO get detection with your model
    detection = 

    # TODO update tracker with detection
    result = 

    for track in result:
        x1,y1,x2,y2 = track.cur_box
        track_id = track.ID
        results.append(f'{int(frame_id)},{int(track_id)},{int(x1)},{int(y1)},{int(x2-x1)},{int(y2-y1)},{1},{1},{1}') # format: frame_id, track_id, x, y, w, h, score, x_coord, y_coord

with open('results.txt', 'w') as f:
    for line in results:
        f.write(line + '\n')

Some packages are required for evaluation and visualization:

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install motmetrics
!pip install pytrec_eval

Now you can evaluate the tracking results using the standard CLEAR MOT metrics.
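The headline CLEAR MOT metric is MOTA (multiple object tracking accuracy): MOTA = 1 - (FN + FP + IDSW) / GT, where FN counts missed ground-truth boxes, FP false positives, IDSW identity switches and GT the total number of ground-truth boxes, all summed over frames. The `motmetrics` package installed above computes it for you; the hypothetical helper below, with made-up totals, only shows how the terms combine and why MOTA can be negative.

```python
def mota(num_misses, num_false_positives, num_switches, num_gt_objects):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over frames. Can be negative."""
    if num_gt_objects == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (num_misses + num_false_positives + num_switches) / num_gt_objects

# Made-up totals, for illustration only.
print(round(mota(num_misses=120, num_false_positives=80, num_switches=15, num_gt_objects=1000), 3))
```

Because FP is unbounded, a detector that fires on many background regions can push MOTA below zero.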

In [None]:
# TODO Evaluate your tracking performance
!python eval.py gt.txt results.txt

In [None]:
# TODO To get the visualization result
!python vis.py

Can you improve the performance with different tracking tricks? Here are several potential tricks for increasing tracking performance:
1. Improve detection by tuning the detector's confidence threshold: lowering it keeps low-score detections that would otherwise be filtered out, and some of them are true objects that would become false negatives! (Or switch to a higher threshold if you have too many false positives.)
2. Extend each tracklet's age so that tracklets can stay alive for more than one frame without a matched detection.
3. Incorporate a different association method into the tracking process, e.g. different types of IoU or even bounding box distance.
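For trick 3, one option is Generalized IoU (GIoU), which subtracts from IoU the fraction of the smallest enclosing box C not covered by the union: GIoU = IoU - (area(C) - area(A ∪ B)) / area(C). Unlike plain IoU, it still separates near from far boxes when they do not overlap. A sketch with continuous coordinates (no +1 convention), written for illustration:

```python
def giou(box1, box2):
    """Generalized IoU for two XYXY boxes (continuous coordinates); ranges from -1 to 1."""
    x1_1, y1_1, x2_1, y2_1 = box1
    x1_2, y1_2, x2_2, y2_2 = box2
    area1 = (x2_1 - x1_1) * (y2_1 - y1_1)
    area2 = (x2_2 - x1_2) * (y2_2 - y1_2)
    inter_w = max(0.0, min(x2_1, x2_2) - max(x1_1, x1_2))
    inter_h = max(0.0, min(y2_1, y2_2) - max(y1_1, y1_2))
    inter = inter_w * inter_h
    union = area1 + area2 - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(x2_1, x2_2) - min(x1_1, x1_2)) * (max(y2_1, y2_2) - min(y1_1, y1_2))
    return inter / union - (enclose - union) / enclose

print(giou([0, 0, 10, 10], [0, 0, 10, 10]))   # identical boxes
print(giou([0, 0, 10, 10], [20, 0, 30, 10]))  # disjoint but close
print(giou([0, 0, 10, 10], [40, 0, 50, 10]))  # disjoint and farther away
```

A cost of `1 - giou(...)` lies in [0, 2], so the scaffold's `cost_matrix[row][col] == 1` rejection rule would need to be replaced by a threshold that fits the new range.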

Please include your experiment results in your report. Grading is based not only on performance but also on your effort and findings.