# Comparing Object Detection Models for Video

In this tutorial we'll demonstrate how to use Pixeltable to do frame-by-frame object detection, made simple through Pixeltable's video-related functionality:
* automatic frame extraction
* running complex functions against frames (in this case, an object detection model)
* reassembling frames back into videos

We'll be working with a single video file (from Pixeltable's test data directory). Let's download that now:

In [None]:
import urllib.request

download_url = 'https://raw.github.com/pixeltable/pixeltable/master/docs/source/data/bangkok.mp4'
filename, _ = urllib.request.urlretrieve(download_url)

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Creating a tutorial directory and table

In Pixeltable, all data resides in tables, which are in turn located inside directories.

Let's start by creating a client and a `model_comparison` directory:

In [None]:
import pixeltable as pxt

cl = pxt.Client()
cl.create_dir('model_comparison', ignore_errors=True)

We create a table for our videos, with a single column:

In [None]:
video_path = 'model_comparison.videos'
frame_path = 'model_comparison.frames'
cl.drop_table(frame_path, ignore_errors=True)
cl.drop_table(video_path, ignore_errors=True)
v = cl.create_table(video_path, {'video': pxt.VideoType()})

In order to interact with the frames, we take advantage of Pixeltable's component view concept: we create a "view" of our video table that contains one row for each frame. Pixeltable provides the built-in `FrameIterator` class for this.

In [None]:
from pixeltable.iterators import FrameIterator
args = {'video': v.video, 'fps': 0}
f = cl.create_view(frame_path, v, iterator_class=FrameIterator, iterator_args=args)

The `fps` parameter determines the frame rate, with `0` indicating the native frame rate.

Running this creates a view with six columns:
- `frame_idx`, `pos_msec`, `pos_frame` and `frame` are created by the `FrameIterator` class.
- `pos` is a system column in every component view
- `video` is the column from our base table (all base table columns are visible in the view, to facilitate querying)

Note that you could create additional views on the `videos` table, each with its own frame rate.

In [None]:
f

We now insert a single row containing the name of the video file we just downloaded, which is expanded into 462 frames/rows in the `frames` view.

In general, `insert()` takes as its first argument a list of rows, each of which is a dictionary mapping column names to column values.

In [None]:
v.insert([{'video': filename}])
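
Because `insert()` accepts a list of row dictionaries, loading several videos comes down to building that list first. A minimal sketch, assuming hypothetical local file names (none of them are part of this tutorial's data):

```python
# Hypothetical file paths; replace with real video files.
video_files = ['clip_a.mp4', 'clip_b.mp4', 'clip_c.mp4']

# One dictionary per row, keyed by column name ('video' is the table's only column).
rows = [{'video': path} for path in video_files]

print(rows[0])
```

The resulting list has the same shape as the argument in the `v.insert()` call above, so it could be passed in its place.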

We loaded a video that shows a busy intersection in Bangkok. Let's look at one of its frames, the one at position 200:

In [None]:
f.where(f.pos == 200).select(f.frame, f.frame.width, f.frame.height).show(1)

When we create the `frames` view, Pixeltable does not physically store the frames. Instead, Pixeltable re-extracts the frames on retrieval using the `pos` column value, which can be done very efficiently and avoids any storage overhead (which would be very substantial for video frames).

## Object detection with Pixeltable

Pixeltable comes pre-packaged with a number of object detection models. We're going to explore one from the YoloX family.

In [None]:
from pixeltable.functions.nos.object_detection_2d import yolox_tiny as model1

We can then use `model1()` in a Pixeltable `select()` call using standard Python function call syntax:

In [None]:
f.where(f.frame_idx == 0).select(f.frame, model1(f.frame)).show(1)

This works as expected, and we now add the detections as a computed column `detections_1` to the table (there'll be a `detections_2` later).

Running model inference is generally an expensive operation; adding it as a computed column makes sure it only runs once, at the time the row is inserted. After that, the result is available as part of the stored table data.

Note that for computed columns of any type other than `image`, the computed values are **always** stored (i.e., `stored=True`).

In [None]:
f.add_column(detections_1=model1(f.frame))

The column is now part of `f`'s schema:

In [None]:
f
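
The "compute once, then read from storage" behavior of a computed column can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Pixeltable's implementation; `ToyTable` and `expensive_model` are hypothetical names invented for the example.

```python
class ToyTable:
    """Toy stand-in for a table with stored computed columns."""

    def __init__(self):
        self.rows = []
        self.computed = {}  # column name -> function of the row

    def add_computed_column(self, name, fn):
        # Fill in the column for existing rows once, then remember fn for future inserts.
        self.computed[name] = fn
        for row in self.rows:
            row[name] = fn(row)

    def insert(self, row):
        row = dict(row)
        for name, fn in self.computed.items():
            row[name] = fn(row)  # evaluated once, at insert time
        self.rows.append(row)


def expensive_model(row):
    # Hypothetical placeholder for model inference; counts its invocations.
    expensive_model.calls += 1
    return {'num_chars': len(row['frame'])}

expensive_model.calls = 0

t = ToyTable()
t.insert({'frame': 'frame-0'})
t.add_computed_column('detections', expensive_model)
t.insert({'frame': 'frame-1'})

# Reading the stored values repeatedly does not invoke the model again.
for _ in range(3):
    stored = [row['detections'] for row in t.rows]

print(expensive_model.calls)
```

The model runs once per row regardless of how often the column is read, which is the property that makes computed columns attractive for inference results.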

We can create a simple user-defined function `draw_boxes()` to visualize detections:

In [None]:
import PIL.ImageDraw

@pxt.udf(return_type=pxt.ImageType(), param_types=[pxt.ImageType(), pxt.JsonType()])
def draw_boxes(img, boxes):
    result = img.copy()
    d = PIL.ImageDraw.Draw(result)
    for box in boxes:
        d.rectangle(box, width=3)
    return result

This function takes two arguments:
- `img` has type `image` and receives an instance of `PIL.Image.Image`
- `boxes` has type `json` and receives a JSON-serializable structure, in this case a list of 4-element lists of floats

When we "call" this function, we need to pass in the frame and the bounding boxes identified in that frame. The latter can be selected with the JSON path expression `f.detections_1.bboxes`:

In [None]:
f.where(f.pos == 0).select(f.frame, draw_boxes(f.frame, f.detections_1.bboxes)).show(1)
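
Low-confidence detections can clutter the visualization, so it may help to filter them before drawing. The sketch below works on a plain dictionary and assumes the detection output has parallel `bboxes`, `labels` and `scores` lists (the three fields this tutorial accesses); the sample values and the `filter_detections` helper are made up for illustration.

```python
def filter_detections(detections, min_score=0.5):
    """Keep only detections whose score is at least min_score."""
    keep = [i for i, s in enumerate(detections['scores']) if s >= min_score]
    return {
        'bboxes': [detections['bboxes'][i] for i in keep],
        'labels': [detections['labels'][i] for i in keep],
        'scores': [detections['scores'][i] for i in keep],
    }

# Made-up detections for a single frame.
sample = {
    'bboxes': [[10.0, 20.0, 110.0, 220.0], [5.0, 5.0, 50.0, 60.0]],
    'labels': ['car', 'person'],
    'scores': [0.91, 0.32],
}

filtered = filter_detections(sample, min_score=0.5)
print(filtered['bboxes'])
```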

Looking at individual frames gives us some idea of how well our detection algorithm works, but it would be more instructive to turn the visualization output back into a video.

We do that with the built-in function `make_video()`, which is an aggregation function that takes a frame index (actually: any expression that can be used to order the frames; a timestamp would also work) and an image, and then assembles the sequence of images into a video:

In [None]:
f.select(pxt.make_video(f.pos, draw_boxes(f.frame, f.detections_1.bboxes))).group_by(v).show()
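
Conceptually, this query groups frames by their source video and orders each group before assembling it. A plain-Python sketch of that grouping and ordering step, using made-up `(video, position, frame)` tuples and a hypothetical `assemble` helper (the actual video encoding is left out):

```python
from collections import defaultdict

# Made-up (video, position, frame) rows, deliberately out of order.
rows = [
    ('a.mp4', 2, 'a2'), ('b.mp4', 0, 'b0'), ('a.mp4', 0, 'a0'),
    ('a.mp4', 1, 'a1'), ('b.mp4', 1, 'b1'),
]

def assemble(rows):
    """Group rows by video, then order each group's frames by position."""
    groups = defaultdict(list)
    for video, pos, frame in rows:
        groups[video].append((pos, frame))
    return {
        video: [frame for _, frame in sorted(items, key=lambda item: item[0])]
        for video, items in groups.items()
    }

videos = assemble(rows)
print(videos)
```

Any sortable value can play the role of `pos` here, which mirrors the point that a timestamp would work just as well as a frame index.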

## Comparing multiple detection models

The output of YoloX-tiny seems reasonable, but we're curious how much better a slightly larger model, such as YoloX-medium, would be for our particular use case. Instead of creating another table and reloading the data, etc., we can simply add another column to our existing table:

In [None]:
from pixeltable.functions.nos.object_detection_2d import yolox_medium as model2

We're using the alternative form of adding table columns:

In [None]:
f['detections_2'] = model2(f.frame)

We don't have ground truth data yet, but visualizing the output as a video gives us some indication of how much using the smaller model affects the result:

In [None]:
f.select(
    pxt.make_video(f.frame_idx, draw_boxes(f.frame, f.detections_1.bboxes)),
    pxt.make_video(f.frame_idx, draw_boxes(f.frame, f.detections_2.bboxes)),
).group_by(v).show(1)
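
Alongside the visual comparison, a rough numeric check is to compare how many objects of each class the two models report for the same frame. The sketch below is plain Python over made-up detection dictionaries; it assumes each has a `labels` list, and `label_count_diff` is a hypothetical helper.

```python
from collections import Counter

def label_count_diff(det_a, det_b):
    """Per-label count differences (b minus a), omitting labels with equal counts."""
    counts_a = Counter(det_a['labels'])
    counts_b = Counter(det_b['labels'])
    labels = sorted(set(counts_a) | set(counts_b))
    return {label: counts_b[label] - counts_a[label]
            for label in labels if counts_b[label] != counts_a[label]}

# Made-up outputs of a smaller and a larger model for one frame.
small = {'labels': ['car', 'car', 'person']}
large = {'labels': ['car', 'car', 'car', 'person', 'bus']}

diff = label_count_diff(small, large)
print(diff)
```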

## Evaluating the models against ground truth

In order to have something to base the evaluation on, let's generate some 'ground truth' data by running the largest YoloX model available.

In [None]:
from pixeltable.functions.nos.object_detection_2d import yolox_xlarge

In [None]:
f['gt'] = yolox_xlarge(f.frame)

We now have two columns with detections, `detections_1` and `detections_2`, and one column `gt` with synthetic ground-truth data, which we're going to use as the basis for evaluation:

In [None]:
f

We're going to be evaluating the generated detections with the commonly-used [mean average precision metric](https://learnopencv.com/mean-average-precision-map-object-detection-model-evaluation-metric/) (mAP).

The mAP metric is based on per-frame metrics, such as true and false positives per detected class, which are then aggregated into a single (per-class) number. In Pixeltable, this functionality is available via the `eval_detections()` and `mean_ap()` built-in functions:

In [None]:
from pixeltable.functions.eval import eval_detections, mean_ap
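
The per-frame bookkeeping behind mAP can be sketched in plain Python: predictions are matched to ground-truth boxes of the same class by intersection over union (IoU), and each prediction ends up as a true or false positive. This is an illustrative sketch under common conventions (greedy matching, IoU threshold of 0.5, boxes as `[x1, y1, x2, y2]`), not the implementation of `eval_detections()`; `iou` and `match_frame` are hypothetical helpers.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_frame(pred_boxes, pred_labels, pred_scores, gt_boxes, gt_labels, min_iou=0.5):
    """Greedily match predictions (highest score first) to unused ground-truth
    boxes of the same label; returns a list of (score, is_true_positive)."""
    used = set()
    results = []
    order = sorted(range(len(pred_boxes)), key=lambda i: -pred_scores[i])
    for i in order:
        best_j, best_iou = None, min_iou
        for j, gt_box in enumerate(gt_boxes):
            if j in used or gt_labels[j] != pred_labels[i]:
                continue
            overlap = iou(pred_boxes[i], gt_box)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            used.add(best_j)
        results.append((pred_scores[i], best_j is not None))
    return results


# Made-up predictions and ground truth for one frame.
matches = match_frame(
    pred_boxes=[[0, 0, 10, 10], [20, 20, 30, 30], [0, 0, 10, 10]],
    pred_labels=['car', 'car', 'person'],
    pred_scores=[0.9, 0.8, 0.7],
    gt_boxes=[[1, 1, 10, 10], [50, 50, 60, 60]],
    gt_labels=['car', 'car'],
)
print(matches)
```

In this example only the first prediction overlaps a ground-truth box of its own class strongly enough to count as a true positive; the other two are false positives.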

The `eval_detections()` function computes the required per-frame metrics, and we're going to add those as computed columns in order to cache the output (and avoid having to re-type the call to `eval_detections()` repeatedly later).

In [None]:
f['eval_1'] = eval_detections(
    f.detections_1.bboxes, f.detections_1.labels, f.detections_1.scores, f.gt.bboxes, f.gt.labels)

In [None]:
f['eval_2'] = eval_detections(
    f.detections_2.bboxes, f.detections_2.labels, f.detections_2.scores, f.gt.bboxes, f.gt.labels)

Let's take a look at the output:

In [None]:
f.select(f.eval_1, f.eval_2).show(1)

The computation of the mAP metric is now simply a query over the evaluation output, aggregated with the `mean_ap()` function:

In [None]:
f.select(mean_ap(f.eval_1), mean_ap(f.eval_2)).show(1)

This two-step process allows you to compute mAP at any granularity: over your entire dataset, only for specific videos, only for videos that pass a certain filter, etc. Moreover, you can compute this metric at any time, not just during training, and use it to guide your understanding of your dataset and how it affects the quality of your models.
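
To make the aggregation step concrete, here is a sketch of how per-frame results could be pooled into an average precision (AP) per class and then averaged into mAP. It uses all-point interpolation of the precision-recall curve, which is one common convention; whether `mean_ap()` uses the same convention is not stated here, so treat this as an assumption. The inputs are made-up `(score, is_true_positive)` pairs.

```python
def average_precision(scored_matches, num_gt):
    """AP for one class from (score, is_true_positive) pairs pooled over frames."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_matches, key=lambda m: -m[0])
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: each precision becomes the best precision at this or any later rank.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Made-up pooled results for two classes.
ap_car = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
ap_person = average_precision([(0.6, True)], num_gt=1)
mean_average_precision = (ap_car + ap_person) / 2
print(ap_car, ap_person, mean_average_precision)
```

Because the function only needs pooled per-detection results, the same code applies whether those results come from one video or from the whole dataset.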

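## Exporting detection data as a COCO dataset

The following UDF converts a detections structure into a list of COCO-style annotations, turning each `[x1, y1, x2, y2]` bounding box into a rounded `[x, y, width, height]` box with its label: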

In [None]:
@pxt.udf(return_type=pxt.JsonType(nullable=False), param_types=[pxt.JsonType(nullable=False)])
def yolo_to_coco(detections):
    bboxes, labels = detections['bboxes'], detections['labels']
    num_annotations = len(bboxes)
    assert num_annotations == len(labels)
    result = []
    for i in range(num_annotations):
        bbox = bboxes[i]
        ann = {
            # convert [x1, y1, x2, y2] to [x, y, width, height], rounded to integers
            'bbox': [round(bbox[0]), round(bbox[1]), round(bbox[2] - bbox[0]), round(bbox[3] - bbox[1])],
            'category': labels[i],
        }
        result.append(ann)
    return result
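
To sanity-check the box arithmetic outside of Pixeltable, the same conversion can be run as a plain function on made-up data. `xyxy_to_coco` is a hypothetical helper that mirrors the rounding used above; the sample detection is invented for the example.

```python
def xyxy_to_coco(bbox):
    """Convert [x1, y1, x2, y2] to rounded [x, y, width, height]."""
    x1, y1, x2, y2 = bbox
    return [round(x1), round(y1), round(x2 - x1), round(y2 - y1)]

# Made-up detection output for a single frame.
sample = {'bboxes': [[10.2, 20.7, 110.6, 220.1]], 'labels': ['car']}

annotations = [
    {'bbox': xyxy_to_coco(bbox), 'category': label}
    for bbox, label in zip(sample['bboxes'], sample['labels'])
]
print(annotations)
```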