# Object Detection in Videos

In this tutorial we'll demonstrate how to use Pixeltable to do frame-by-frame object detection, made simple through Pixeltable's video-related functionality:
* automatic frame extraction
* running complex functions against frames (in this case, an object detection model)
* reassembling frames back into videos

We'll be working with a single video file (from Pixeltable's test data directory). Let's download that now:

In [1]:
import urllib.request

download_url = 'https://raw.github.com/mkornacker/pixeltable/master/docs/source/data/bangkok.mp4'
filename, _ = urllib.request.urlretrieve(download_url)

## Creating a tutorial database and table

In Pixeltable, all data resides in tables, which in turn are assigned to databases.

Let's start by creating a client and a `tutorial` database:

In [2]:
import pixeltable as pt

cl = pt.Client()
cl.drop_db('tutorial', ignore_errors=True, force=True)
db = cl.create_db('tutorial')



2023-06-08 09:46:14,655 INFO env env.py:169: found store container
2023-06-08 09:46:14,675 INFO env env.py:156: found database postgresql://postgres:*****@localhost:6543/pixeltable


The table we're going to create to hold our videos will have three columns to begin with: the original video, the frame and a frame index:

In [3]:
cols = [
    pt.Column('video', pt.VideoType(), nullable=False),
    pt.Column('frame', pt.ImageType(), nullable=False),
    pt.Column('frame_idx', pt.IntType(), nullable=False),
]

When creating the table, we supply parameters needed for automatic frame extraction during `insert_rows()`/`insert_pandas()` calls:
- The `extract_frames_from` argument is the name of the column of type `video` from which to extract frames.
- During an `insert_rows()` call, each input row, corresponding to one video, is expanded into one row per frame (subject to the frame rate requested in the `extracted_fps` keyword argument; `0` indicates the full frame rate).
- Each frame is extracted to a JPEG file that is stored in a location under the Pixeltable home directory.
- The columns `frame` and `frame_idx` are populated with the frame file path and frame sequence number, respectively.

In [4]:
t = db.create_table(
    'video_data', cols,
    extract_frames_from='video', extracted_frame_col='frame', extracted_frame_idx_col='frame_idx',
    extracted_fps=0)
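
To make the row expansion concrete, here is a minimal plain-Python sketch of how one video row could turn into one row per frame. It is a conceptual model only, not Pixeltable's implementation: `expand_video_row`, the `source_frame` field, the evenly spaced sampling rule, and the native frame rate of 30 are all assumptions made for illustration.

```python
def expand_video_row(video_path, num_frames, native_fps, extracted_fps=0):
    """Return one dict per extracted frame (hypothetical helper, illustration only).

    extracted_fps == 0 means "keep every frame", mirroring the text above.
    The sampling rule for lower rates is an assumption: frames are picked at
    evenly spaced positions in the source video.
    """
    keep_all = extracted_fps == 0 or extracted_fps >= native_fps
    rows = []
    frame_idx = 0
    while True:
        if keep_all:
            source_frame = frame_idx
        else:
            source_frame = int(frame_idx * native_fps // extracted_fps)
        if source_frame >= num_frames:
            break
        rows.append({
            'video': video_path,           # repeated on every frame row
            'source_frame': source_frame,  # position in the original video
            'frame_idx': frame_idx,        # sequence number of the extracted frame
        })
        frame_idx += 1
    return rows


# Full frame rate: every frame becomes a row.
all_rows = expand_video_row('bangkok.mp4', num_frames=462, native_fps=30)
print(len(all_rows))

# Reduced rate (assumed native rate of 30 fps): every third frame is kept.
sampled_rows = expand_video_row('bangkok.mp4', num_frames=462, native_fps=30, extracted_fps=10)
print(len(sampled_rows))
```

The point of the sketch is that the `video` value is repeated on every frame row, which is what later makes it possible to group frames by video.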

We now insert a single row containing the name of the video file we just downloaded, which is expanded into 462 frames/rows in the `video_data` table.

In general, `insert_rows()` takes as its first argument a list of rows, each of which is a list of column values (and in this case, we only need to supply data for the `video` column).

In [5]:
t.insert_rows([[filename]], columns=['video'])



Inserting rows into table:   0%|          | 0/462 [00:00<?, ?rows/s]

'inserted 462 rows with 0 errors '

We loaded a video that shows a busy intersection in Bangkok. Let's look at the first frame:

In [6]:
t[t.frame_idx == 0][t.frame, t.frame.width, t.frame.height].show(1)

frame,width,height
,1280,720


Whether a computed image column is stored is controlled by the `stored` keyword argument of the `Column` constructor:
- when set to `True`, the value is stored explicitly
- when set to `False`, the value is always recomputed during a query (and never stored)
- the default is `None`, which means that Pixeltable decides (currently that means that the image won't be stored, but in the future it could take resource consumption into account)

Let's take another look at the definition of the `frame` column:
```python
pt.Column('frame', pt.ImageType(), nullable=False)
```
In this case, we didn't specify `stored`, and so the default applies.
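
The three-way `stored` setting for a computed image column can be summarized in a few lines of Python. This is a sketch of the rule as described above, not Pixeltable code; `resolve_stored` is a hypothetical name.

```python
from typing import Optional


def resolve_stored(stored: Optional[bool]) -> bool:
    """Decide whether a computed image column's values are stored (hypothetical helper).

    True  -> store the value explicitly
    False -> recompute on every query
    None  -> let the system decide; per the text, that currently means "not stored"
    """
    if stored is None:
        return False
    return stored


for setting in (True, False, None):
    print(setting, '->', resolve_stored(setting))
```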


## Object Detection as a user-defined function

User-defined functions let you customize Pixeltable's functionality for your own data.

In this example, we're going to use a `torchvision` object detection model (Faster R-CNN):

In [7]:
import torch, torchvision
from torchvision import transforms

# load a pretrained Faster R-CNN detector with a MobileNetV3 backbone
model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(weights="DEFAULT")
_ = model.eval()  # switch to inference mode

Our function converts the image to PyTorch format and obtains a prediction from the model, which is a list of dictionaries with fields `boxes`, `labels`, and `scores` (one per input image). The fields themselves are PyTorch tensors, and we convert them to standard Python lists (so they become JSON-serializable data):

In [8]:
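@pt.function(return_type=pt.JsonType(), param_types=[pt.ImageType()])
def detect(img):
    # convert the PIL image to a float tensor (named x to avoid confusion with the table t)
    x = transforms.ToTensor()(img)
    x = transforms.ConvertImageDtype(torch.float)(x)
    result = model([x])[0]  # the model takes a list of images; we pass exactly one
    # convert tensors to plain Python lists so the result is JSON-serializable
    return {
        'boxes': result['boxes'].tolist(),
        'labels': result['labels'].tolist(),
        'scores': result['scores'].tolist(),
    }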
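
The three lists in the returned dictionary are parallel: entry `i` of `boxes`, `labels`, and `scores` describes the same detection. As a minimal sketch of working with that structure, the helper below keeps only the detections above a confidence threshold. `filter_detections` and `min_score` are hypothetical names, not part of Pixeltable or of this tutorial's pipeline, and the sample values are illustrative.

```python
def filter_detections(detections, min_score=0.5):
    """Keep only detections whose score is at least min_score (hypothetical helper)."""
    keep = [i for i, score in enumerate(detections['scores']) if score >= min_score]
    return {
        'boxes': [detections['boxes'][i] for i in keep],
        'labels': [detections['labels'][i] for i in keep],
        'scores': [detections['scores'][i] for i in keep],
    }


# Illustrative values in the same shape that detect() returns.
sample = {
    'boxes': [[338.0, 332.3, 429.1, 402.5], [0.0, 563.3, 96.4, 675.6], [481.8, 625.2, 530.3, 698.9]],
    'labels': [3, 3, 1],
    'scores': [0.98, 0.89, 0.36],
}
confident = filter_detections(sample, min_score=0.5)
print(confident['labels'], confident['scores'])
```

Because the input and output are plain dictionaries of lists, a helper like this operates on exactly the JSON structure that `detect()` produces.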

We can then use `detect()` in the Pixeltable index operator using standard Python function call syntax:

In [9]:
t[t.frame_idx == 0][t.frame, detect(t.frame)].show(1)

frame,col_1
,"{'boxes': [[338.01751708984375, 332.3070068359375, 429.05963134765625, 402.5361328125], [325.520263671875, 494.70501708984375, 564.3624877929688, 642.5904541015625], [877.457763671875, 332.5373840332031, 996.2501220703125, 418.9269104003906], [0.0, 563.2776489257812, 96.44212341308594, 675.6179809570312], [58.37788391113281, 415.9442138671875, 261.9402770996094, 513.2822875976562], [581.962646484375, 412.8414306640625, 680.565185546875, 520.5406494140625], [352.85223388671875, 317.1103515625, 432.13494873046875, 368.0482482910156], [477.98175048828125, 277.4363098144531, 538.7948608398438, 333.2647705078125], [260.54571533203125, 613.0361938476562, 384.6195068359375, 716.1728515625], [817.8084716796875, 267.61859130859375, 877.5631103515625, 320.37615966796875], [542.4912719726562, 267.6480712890625, 615.8351440429688, 314.6510925292969], [462.1524353027344, 595.5888061523438, 600.7080688476562, 708.8529052734375], [832.55419921875, 292.0086364746094, 910.7310791015625, 349.83056640625], [481.7981262207031, 625.1809692382812, 530.2610473632812, 698.927978515625], [832.3120727539062, 275.2059020996094, 889.4700317382812, 327.62255859375], [370.40655517578125, 303.6800231933594, 445.78363037109375, 353.4532470703125], [462.2282409667969, 588.984619140625, 600.685791015625, 704.4879760742188], [565.6113891601562, 227.9351043701172, 600.3529052734375, 253.93939208984375], [800.5158081054688, 271.3426513671875, 834.0835571289062, 304.6402282714844], [781.1487426757812, 204.70030212402344, 809.4356079101562, 228.89373779296875], [261.4763488769531, 600.704833984375, 390.4063415527344, 706.0755004882812], [514.0210571289062, 606.6480102539062, 582.5101318359375, 681.480712890625], [787.840087890625, 200.6818084716797, 821.304443359375, 227.33694458007812], [312.4492492675781, 494.48504638671875, 568.381591796875, 666.237060546875], [841.86572265625, 261.2329406738281, 887.4227905273438, 307.7902526855469], [416.1756591796875, 280.36444091796875, 452.0413818359375, 
320.4415588378906], [637.2534790039062, 212.73162841796875, 661.20361328125, 236.27114868164062], [543.6298217773438, 556.7749633789062, 592.8336791992188, 609.948486328125], [1051.6065673828125, 370.049072265625, 1108.352783203125, 410.6287841796875], [571.7589721679688, 221.79039001464844, 608.386962890625, 249.79983520507812], [553.1995849609375, 566.97119140625, 601.721923828125, 633.8588256835938], [497.1732482910156, 237.0930938720703, 522.0634155273438, 260.05474853515625], [783.3855590820312, 172.9274139404297, 803.7145385742188, 188.106201171875], [480.0483093261719, 626.2273559570312, 531.7450561523438, 699.1278076171875], [364.04345703125, 282.718017578125, 397.2662658691406, 321.26129150390625], [587.1902465820312, 208.87158203125, 615.29931640625, 234.74034118652344], [582.338134765625, 406.96710205078125, 679.001708984375, 521.2232666015625], [783.3406982421875, 257.3706359863281, 812.037353515625, 292.5958251953125], [481.3854675292969, 267.3361511230469, 544.10888671875, 323.33917236328125], [543.6251831054688, 597.1929931640625, 597.8976440429688, 675.5040893554688], [56.222251892089844, 414.9085998535156, 262.7411804199219, 513.601318359375], [787.81494140625, 262.3848876953125, 823.2059936523438, 300.82415771484375], [43.36042404174805, 445.6544189453125, 109.14202117919922, 505.5900573730469], [644.1192016601562, 156.57199096679688, 659.0614013671875, 174.76467895507812], [4.256011486053467, 565.1294555664062, 93.84330749511719, 669.02001953125], [565.5870971679688, 227.5789337158203, 600.1444702148438, 254.0669403076172], [645.0855712890625, 137.42782592773438, 659.7767944335938, 154.7159423828125], [781.210693359375, 204.54339599609375, 809.3055419921875, 228.88003540039062], [497.069091796875, 236.71139526367188, 521.9779052734375, 260.26373291015625], [1051.447998046875, 370.76947021484375, 1107.5220947265625, 409.7809753417969], [545.8430786132812, 595.2821655273438, 597.3834228515625, 674.5313720703125], [375.08563232421875, 
275.32391357421875, 404.319091796875, 309.786865234375], [210.41990661621094, 305.87103271484375, 237.21241760253906, 349.9939880371094], [375.1878967285156, 275.2143249511719, 404.1911315917969, 309.8097229003906], [787.8712768554688, 200.51026916503906, 821.1495971679688, 227.33047485351562], [543.375732421875, 556.5907592773438, 593.2089233398438, 609.5621337890625], [364.1552734375, 282.5019226074219, 397.0613098144531, 321.30194091796875], [552.5045166015625, 264.04388427734375, 619.2614135742188, 314.5805358886719], [464.40216064453125, 595.2394409179688, 575.1683349609375, 709.7114868164062], [1217.96826171875, 259.7019348144531, 1269.4993896484375, 315.411376953125], [698.3718872070312, 195.6782684326172, 726.0489501953125, 226.09625244140625], [627.2576904296875, 210.558349609375, 653.7562255859375, 236.50987243652344], [707.27783203125, 194.48370361328125, 729.6455078125, 219.4148406982422], [459.40093994140625, 584.2823486328125, 603.3065185546875, 704.3050537109375], [480.5785217285156, 625.1347045898438, 531.4030151367188, 698.1009521484375], [353.0716857910156, 289.2540588378906, 385.91119384765625, 326.8209228515625], [571.6904907226562, 221.46231079101562, 608.2069091796875, 249.9178009033203], [553.1578369140625, 567.3275146484375, 601.8734130859375, 633.24072265625], [507.7210388183594, 607.8045654296875, 585.5687866210938, 682.43310546875], [353.0852355957031, 288.9317321777344, 385.6903076171875, 326.8770751953125], [552.5414428710938, 568.832763671875, 601.738525390625, 634.8807983398438], [783.569091796875, 168.0931396484375, 804.8558349609375, 185.38092041015625], [828.9866333007812, 205.89918518066406, 847.806884765625, 228.9514923095703], [873.7218627929688, 330.4420166015625, 995.7734375, 419.1415710449219], [783.3650512695312, 172.9142608642578, 803.64892578125, 188.12713623046875]], 'labels': [3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 1, 3, 3, 3, 3, 3, 3, 3, 1, 3, 8, 3, 3, 3, 1, 3, 3, 1, 3, 3, 4, 3, 3, 8, 3, 8, 4, 8, 3, 3, 10, 8, 8, 10, 8, 
8, 4, 1, 3, 1, 1, 8, 3, 1, 8, 1, 1, 3, 3, 3, 8, 3, 3, 8, 3, 4, 1, 4, 10, 3, 8, 8], 'scores': [0.9825785160064697, 0.9797981977462769, 0.9286067485809326, 0.8872905373573303, 0.864543080329895, 0.8518528342247009, 0.7539899945259094, 0.7321217656135559, 0.659579873085022, 0.5377905368804932, 0.490359365940094, 0.40867525339126587, 0.4038485884666443, 0.36029037833213806, 0.3227679431438446, 0.2956130802631378, 0.28263407945632935, 0.26782605051994324, 0.26622849702835083, 0.2356300950050354, 0.21421648561954498, 0.21407334506511688, 0.20922325551509857, 0.2043711543083191, 0.1951369345188141, 0.19202542304992676, 0.1870029866695404, 0.18682721257209778, 0.17190495133399963, 0.1614004224538803, 0.1611216813325882, 0.15472961962223053, 0.1483038067817688, 0.14572757482528687, 0.13231202960014343, 0.13104328513145447, 0.12564314901828766, 0.1249874010682106, 0.12427417933940887, 0.11434751749038696, 0.11423990875482559, 0.11169517040252686, 0.10880621522665024, 0.10775408148765564, 0.10670699924230576, 0.10328519344329834, 0.09854497760534286, 0.09559924900531769, 0.09531787782907486, 0.09462060779333115, 0.09369228035211563, 0.09344764798879623, 0.09314802289009094, 0.08740834891796112, 0.0869700014591217, 0.0842432752251625, 0.08313621580600739, 0.0776110365986824, 0.07742376625537872, 0.07416937500238419, 0.0738762617111206, 0.07367471605539322, 0.0723211020231247, 0.0687892884016037, 0.06723164767026901, 0.06691057235002518, 0.06409768760204315, 0.06371377408504486, 0.06175068020820618, 0.06163391098380089, 0.060353487730026245, 0.05640379711985588, 0.05513430014252663, 0.05204585939645767, 0.05057574063539505]}"


This works as expected, and we now add the detections as a computed column `detections` to the table.

Running model inference is generally an expensive operation; adding it as a computed column makes sure it only runs once, at the time the row is inserted. After that, the result is available as part of the stored table data.

Note that for computed columns of any type other than `image`, the computed values are **always** stored (i.e., `stored=True`).

In [10]:
t.add_column(pt.Column('detections', computed_with=detect(t.frame)))

  0%|          | 0/462 [00:00<?, ?it/s]

'added 462 column values with 0 errors'

We can create a simple function `draw_boxes()` to visualize detections:

In [11]:
import PIL.ImageDraw  # a bare `import PIL` does not load the ImageDraw submodule

@pt.function(return_type=pt.ImageType(), param_types=[pt.ImageType(), pt.JsonType()])
def draw_boxes(img, boxes):
    result = img.copy()
    d = PIL.ImageDraw.Draw(result)
    for box in boxes:
        d.rectangle(box, width=3)
    return result

This function takes two arguments:
- `img` has type `image` and receives an instance of `PIL.Image.Image`
- `boxes` has type `json` and receives a JSON-serializable structure, in this case a list of 4-element lists of floats
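
Each box is a list `[x1, y1, x2, y2]` holding the top-left and bottom-right corners in pixel coordinates. This is the format torchvision's detection models return, and it is also what `rectangle()` accepts. The sketch below shows how to read that structure; `box_sizes` is a hypothetical helper used only for illustration.

```python
def box_sizes(boxes):
    """Return (width, height) for each [x1, y1, x2, y2] box (hypothetical helper)."""
    sizes = []
    for box in boxes:
        if len(box) != 4:
            raise ValueError(f'expected 4 coordinates, got {len(box)}')
        x1, y1, x2, y2 = box
        sizes.append((x2 - x1, y2 - y1))
    return sizes


# One made-up box: 100 pixels wide, 50 pixels tall.
print(box_sizes([[10.0, 20.0, 110.0, 70.0]]))
```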

When we "call" this function, we need to pass in the frame and the bounding boxes identified in that frame. The latter can be selected with the JSON path expression `t.detections.boxes`:

In [12]:
t[t.frame_idx == 0][t.frame, draw_boxes(t.frame, t.detections.boxes)].show(1)

frame,col_1
,


Looking at individual frames gives us some idea of how well our detection algorithm works, but it would be more instructive to turn the visualization output back into a video.

We do that with the built-in function `make_video()`, which is an aggregation function that takes a frame index (actually: any expression that can be used to order the frames; a timestamp would also work) and an image, and then assembles the sequence of images into a video:

In [13]:
t[pt.make_video(t.frame_idx, draw_boxes(t.frame, t.detections.boxes))].group_by(t.video).show(1)

OpenCV: FFMPEG: tag 0x34363258/'X264' is not supported with codec id 27 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x31637661/'avc1'
[ERROR:0@51.081] global cap_ffmpeg_impl.hpp:2991 open Could not find encoder for codec_id=27, error: Encoder not found
[ERROR:0@51.081] global cap_ffmpeg_impl.hpp:3066 open VIDEOIO/FFMPEG: Failed to initialize VideoWriter


col_0
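
The query above combines a `group_by()` on the video with an aggregation that orders the frames. That grouping and ordering can be sketched in plain Python as follows. This models only the semantics described above, not how `make_video()` is implemented; `assemble_sequences` is a hypothetical name, and strings stand in for images.

```python
from collections import defaultdict


def assemble_sequences(rows):
    """Group rows by video and return each group's frames in frame_idx order.

    Hypothetical helper: a real aggregation would encode the ordered frames
    into a video file instead of returning a list.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row['video']].append((row['frame_idx'], row['frame']))
    return {
        video: [frame for _, frame in sorted(items, key=lambda item: item[0])]
        for video, items in groups.items()
    }


# Rows may arrive in any order; the ordering expression restores the sequence.
rows = [
    {'video': 'a.mp4', 'frame_idx': 2, 'frame': 'a2'},
    {'video': 'b.mp4', 'frame_idx': 0, 'frame': 'b0'},
    {'video': 'a.mp4', 'frame_idx': 0, 'frame': 'a0'},
    {'video': 'a.mp4', 'frame_idx': 1, 'frame': 'a1'},
]
sequences = assemble_sequences(rows)
print(sequences)
```

Any value that sorts the frames correctly, such as a timestamp, could replace `frame_idx` as the ordering key.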
