In [None]:
import sys, glob, os
import pandas as pd
import numpy as np
import PIL.ImageDraw  # binds PIL and loads the ImageDraw submodule used by draw_boxes() below
import json
import urllib.request
import tempfile
import tqdm

import torch, torchvision
from torchvision import transforms
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

In [None]:
import pixeltable as pt
import pixeltable.functions
%load_ext autoreload
%autoreload 2

# Creating a tutorial database and table

In Pixeltable, all data resides in tables, which in turn are assigned to databases.

Let's start by creating a client and a `tutorial` database:

In [None]:
cl = pt.Client()
cl.drop_db('tutorial', ignore_errors=True, force=True)
db = cl.create_db('tutorial')

In this tutorial we're going to be working with a single video file (from Pixeltable's test data directory). Let's download that now:

In [None]:
download_url = 'https://gitlab.com/pixeltable/python-sdk/-/raw/master/pixeltable/tests/data/videos/bangkok.mp4'
filename, _ = urllib.request.urlretrieve(download_url)

To begin with, the table will contain three columns: the original video, the extracted frame, and a frame index:

In [None]:
cols = [
    pt.Column('video', pt.VideoType(), nullable=False),
    pt.Column('frame', pt.ImageType(), nullable=False),
    pt.Column('frame_idx', pt.IntType(), nullable=False),
]

When creating the table, we supply parameters needed for automatic frame extraction during `insert_rows()`/`insert_pandas()` calls:
- The `extract_frames_from` argument is the name of the column of type `video` from which to extract frames.
- During an `insert_rows()` call, each input row, corresponding to one video, is expanded into one row per frame (subject to the frame rate requested in the `extracted_fps` keyword argument; `0` indicates the full frame rate).
- Each frame is extracted to a JPEG file that is stored in a location under the Pixeltable home directory.
- The columns `frame` and `frame_idx` receive the frame file path and frame sequence number, respectively.

In [None]:
t = db.create_table(
    'video_data', cols,
    extract_frames_from='video', extracted_frame_col='frame', extracted_frame_idx_col='frame_idx',
    extracted_fps=0)
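
As a rough illustration of what a frame rate request means, the sketch below picks which source frames would be kept when a video is sampled at a lower rate. It is plain Python and not Pixeltable's implementation: the function name `sampled_frame_indices`, the `native_fps` parameter, and the selection rule itself are assumptions made for this example.

```python
def sampled_frame_indices(total_frames, native_fps, extracted_fps):
    """Return the indices of the source frames to keep (illustrative only)."""
    if extracted_fps <= 0 or extracted_fps >= native_fps:
        # 0 requests the full frame rate: keep every frame
        return list(range(total_frames))
    step = native_fps / extracted_fps  # source frames per extracted frame
    indices = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# a 10-frame clip recorded at 30 fps, sampled at 10 fps
print(sampled_frame_indices(10, 30.0, 10.0))
# the same clip at the full frame rate
print(sampled_frame_indices(10, 30.0, 0))
```

With `extracted_fps=0`, as in the table definition above, every frame of the video becomes one row.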

We now insert a single row containing the name of the video file we just downloaded, which is expanded into 462 frames/rows in the `video_data` table.

In general, `insert_rows()` takes as its first argument a list of rows, each of which is a list of column values (and in this case, we only need to supply data for the `video` column).

In [None]:
t.insert_rows([[filename]], columns=['video'])
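
To make the list-of-lists shape concrete, here is a small plain-Python sketch of how positional row values line up with the names given in `columns`. The helper `rows_to_records` is hypothetical and only mirrors the calling convention described above; it is not part of Pixeltable, and the file path is a placeholder.

```python
def rows_to_records(rows, columns):
    """Pair each row's values with the column names, position by position."""
    records = []
    for row in rows:
        if len(row) != len(columns):
            raise ValueError(f'expected {len(columns)} values, got {len(row)}')
        records.append(dict(zip(columns, row)))
    return records


# one row with one value, destined for the `video` column (placeholder path)
print(rows_to_records([['/tmp/bangkok.mp4']], ['video']))
```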

We loaded a video that shows a busy intersection in Bangkok. Let's look at the first frame:

In [None]:
t[t.frame_idx == 0][t.frame, t.frame.width, t.frame.height].show(1)

Running this command takes a bit of time because Pixeltable re-extracts the frames during a query. By default, computed columns of type `image` are not stored directly (with video data, that can quickly lead to an explosion of required storage). They are cached instead, so that repeated accesses to the same column values are fast.

Let's try this again:

In [None]:
t[t.frame_idx == 1][t.frame, t.frame.width, t.frame.height].show(1)

Whether a computed image column is stored or cached is controlled by the `stored` keyword argument of the `Column` constructor:
- the default is `None`, which means that the value is not stored explicitly but is cached
- when set to `True`, the value is stored explicitly
- when set to `False`, the value is always recomputed during a query (and never stored or cached)

Let's take another look at the definition of the `frame` column:
```python
pt.Column('frame', pt.ImageType(), nullable=False)
```
In this case, we didn't specify `stored`, and so the default applies.
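
The difference between the three settings can be pictured with a toy model that counts how often an expensive computation runs. This is a simplified sketch in plain Python, not how Pixeltable implements storage or caching: the class `ComputedValue` is hypothetical, and a real cache, unlike this one, may evict entries.

```python
class ComputedValue:
    """Toy model of the three `stored` settings for a single computed value."""

    def __init__(self, compute, stored=None):
        self._compute = compute
        self.stored = stored
        self.calls = 0  # how often the expensive function has run
        self._value = None
        self._have_value = False
        if stored is True:
            # stored explicitly: materialize once, up front
            self._materialize()

    def _materialize(self):
        self.calls += 1
        self._value = self._compute()
        self._have_value = True

    def get(self):
        if self.stored is False:
            # never stored or cached: recompute on every access
            self.calls += 1
            return self._compute()
        if not self._have_value:
            # stored=None: compute on first access, then reuse the cached value
            self._materialize()
        return self._value


for setting in (None, True, False):
    value = ComputedValue(lambda: 'decoded frame', stored=setting)
    value.get()
    value.get()
    print(setting, value.calls)
```

Only `stored=False` pays the computation cost on every access; the other two settings differ in when the value is first produced and in whether it is kept permanently.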


# Object Detection as a User-Defined Function

User-defined functions let you customize Pixeltable's functionality for your own data.

In this example, we're going to use a `torchvision` object detection model (Faster R-CNN):

In [None]:
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(weights="DEFAULT")
model.eval()  # switch to inference mode

Our function converts the image to PyTorch format and obtains a prediction from the model, which is a list of dictionaries with fields `boxes`, `labels`, and `scores` (one per input image). The fields themselves are PyTorch tensors, and we convert them to standard Python lists (so they become JSON-serializable data):

In [None]:
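@pt.function(return_type=pt.JsonType(), param_types=[pt.ImageType()])
def detect(img):
    # PIL image -> float tensor (named img_tensor to avoid shadowing the table `t`)
    img_tensor = transforms.ToTensor()(img)
    img_tensor = transforms.ConvertImageDtype(torch.float)(img_tensor)
    with torch.no_grad():  # inference only, so skip gradient tracking
        result = model([img_tensor])[0]
    # tensors -> plain Python lists, so the result is JSON-serializable
    return {
        'boxes': result['boxes'].tolist(),
        'labels': result['labels'].tolist(),
        'scores': result['scores'].tolist(),
    }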
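
The dictionary returned by `detect()` holds three index-aligned lists, which makes it easy to post-process with ordinary Python. The sketch below uses made-up sample values (not real model output) to check that such a dictionary survives a JSON round trip, and adds a hypothetical helper, `filter_detections`, that keeps only detections above a score threshold. Neither the helper nor the threshold is part of Pixeltable or the original tutorial.

```python
import json

# made-up values in the same shape as the dictionary returned by detect()
sample = {
    'boxes': [[10.0, 20.0, 110.0, 220.0], [50.5, 60.5, 90.0, 120.0]],
    'labels': [3, 1],
    'scores': [0.92, 0.41],
}

# plain lists of numbers survive a JSON round trip unchanged
round_tripped = json.loads(json.dumps(sample))


def filter_detections(detections, min_score=0.5):
    """Keep only the detections whose score is at least min_score."""
    keep = [i for i, score in enumerate(detections['scores']) if score >= min_score]
    return {
        key: [detections[key][i] for i in keep]
        for key in ('boxes', 'labels', 'scores')
    }


print(filter_detections(sample, min_score=0.5))
```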

We can then use `detect()` in the Pixeltable index operator using standard Python function call syntax:

In [None]:
t[t.frame_idx == 0][t.frame, detect(t.frame)].show(1)

This works as expected, and we now add the detections as a computed column `detections` to the table.

Running model inference is generally an expensive operation; adding it as a computed column makes sure it only runs once, at the time the row is inserted. After that, the result is available as part of the stored table data.

Note that for computed columns of any type other than `image`, the computed values are **always** stored (i.e., `stored=True`).

In [None]:
t.add_column(pt.Column('detections', computed_with=detect(t.frame)))

We can create a simple function `draw_boxes()` to visualize detections:

In [None]:
@pt.function(return_type=pt.ImageType(), param_types=[pt.ImageType(), pt.JsonType()])
def draw_boxes(img, boxes):
    result = img.copy()
    d = PIL.ImageDraw.Draw(result)
    for box in boxes:
        d.rectangle(box, width=3)
    return result

This function takes two arguments:
- `img` has type `image` and receives an instance of `PIL.Image.Image`
- `boxes` has type `json` and receives a JSON-serializable structure, in this case a list of 4-element lists of floats

When we "call" this function, we need to pass in the frame and the bounding boxes identified in that frame. The latter can be selected with the JSON path expression `t.detections.boxes`:

In [None]:
t[t.frame_idx == 0][t.frame, draw_boxes(t.frame, t.detections.boxes)].show(1)

Looking at individual frames gives us some idea of how well our detection algorithm works, but it would be more instructive to turn the visualization output back into a video.

We can accomplish that with the built-in function `make_video()`, which is an aggregation function that takes a frame index (actually: any expression that can be used to order the frames; a timestamp would also work) and an image, and then assembles the sequence of images into a video:

In [None]:
t[pt.make_video(t.frame_idx, draw_boxes(t.frame, t.detections.boxes))].group_by(t.video).show(1)
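
Conceptually, this query groups the rows by video and, within each group, hands the frames to the aggregate in order. The plain-Python sketch below mimics only that grouping and ordering step. The function `assemble_by_video` and the string placeholders standing in for images are assumptions for illustration; the actual video encoding done by `make_video()` is not shown.

```python
from collections import defaultdict


def assemble_by_video(rows):
    """Group rows by video and return each group's frames in frame_idx order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row['video']].append((row['frame_idx'], row['frame']))
    return {
        video: [frame for _, frame in sorted(pairs, key=lambda pair: pair[0])]
        for video, pairs in groups.items()
    }


# rows arrive in arbitrary order; strings stand in for the annotated frames
rows = [
    {'video': 'a.mp4', 'frame_idx': 1, 'frame': 'a1'},
    {'video': 'b.mp4', 'frame_idx': 0, 'frame': 'b0'},
    {'video': 'a.mp4', 'frame_idx': 0, 'frame': 'a0'},
]
print(assemble_by_video(rows))
```

Because the ordering key is only compared, any sortable expression, such as a timestamp, could take the place of the frame index.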