# Object Detection in Videos

In this tutorial we'll demonstrate how to use Pixeltable to do frame-by-frame object detection, made simple through Pixeltable's video-related functionality:
* automatic frame extraction
* running complex functions against frames (in this case, an object detection model)
* reassembling frames back into videos

We'll be working with a single video file (from Pixeltable's test data directory). Let's download that now:

In [1]:
import urllib.request

download_url = 'https://raw.github.com/mkornacker/pixeltable/master/docs/source/data/bangkok.mp4'
filename, _ = urllib.request.urlretrieve(download_url)

Let's also switch to using the full window width, which will make looking at videos later easier.

In [2]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

## Creating a tutorial directory and table

In Pixeltable, all data resides in tables, which themselves can be organized into a directory structure.

Let's start by creating a client and a `video_tutorial` directory:

In [3]:
import pixeltable as pxt

cl = pxt.Client()
cl.create_dir('video_tutorial', ignore_errors=True)

2024-01-10 11:24:30,976 INFO env env.py:172: found database postgresql://postgres:@/pixeltable?host=/Users/orm/Library/Caches/TemporaryItems/python_PostgresServer/dc4677b93f
2024-01-10 11:24:30,981 INFO env env.py:183: connecting to NOS
2024-01-10 11:24:31.014 | INFO | nos.server:init:131 - Inference server already running (name=nos-inference-service-cpu, image=<Image: 'autonomi/nos:0.0.9-cpu'>, id=b4c529d9de2a).
2024-01-10 11:24:31,015 INFO env env.py:186: waiting for NOS
2024-01-10 11:24:31,050 INFO env env.py:207: connecting to OpenAI


We create a table for our videos, with a single column:

In [4]:
video_path = 'video_tutorial.videos'
frame_path = 'video_tutorial.frames'
cl.drop_table(frame_path, ignore_errors=True)
cl.drop_table(video_path, ignore_errors=True)
v = cl.create_table(video_path, {'video': pxt.VideoType()})

In order to interact with the frames, we take advantage of Pixeltable's component view concept: we create a "view" of our video table that contains one row for each frame. Pixeltable provides the built-in `FrameIterator` class for this.

In [5]:
from pixeltable.iterators import FrameIterator
args = {'video': v.video, 'fps': 0}
f = cl.create_view(frame_path, v, iterator_class=FrameIterator, iterator_args=args)

created view frames with 0 rows, 0 exceptions


The `fps` parameter determines the frame rate, with `0` indicating the native frame rate.

Running this creates a view with six columns:
- `frame_idx`, `pos_msec`, `pos_frame` and `frame` are created by the `FrameIterator` class.
- `pos` is a system column in every component view
- `video` is the column from our base table (all base table columns are visible in the view, to facilitate querying)

Note that you could create additional views on the `videos` table, each with its own frame rate.
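To make the effect of `fps` concrete, here is a minimal sketch of how sampling at a lower frame rate could select frames. The helper `sampled_frame_indices` is hypothetical, the native rate of 25 frames per second is an assumption (this tutorial does not state the video's frame rate), and Pixeltable's actual selection rule may differ.

```python
def sampled_frame_indices(num_frames: int, native_fps: float, fps: float) -> list[int]:
    """Return the indices of native frames kept when sampling at `fps`.

    Hypothetical helper for illustration only; fps == 0 means "keep every frame".
    """
    if fps == 0 or fps >= native_fps:
        return list(range(num_frames))
    step = native_fps / fps  # number of native frames per sampled frame
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# With an assumed native rate of 25 fps, sampling at 5 fps keeps every 5th frame.
print(len(sampled_frame_indices(462, 25.0, 0)))    # all 462 frames
print(len(sampled_frame_indices(462, 25.0, 5.0)))  # 93 frames
```

Under these assumptions, a second view at 5 fps would contain roughly a fifth of the rows of the native-rate view.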

In [6]:
f

Column Name,Type,Computed With
pos,int,
frame_idx,int,
pos_msec,float,
pos_frame,float,
frame,image,
video,video,


We now insert a single row containing the name of the video file we just downloaded into the `videos` table. Pixeltable expands it into 462 frames/rows in the `frames` view.

In general, `insert()` takes as its first argument a list of rows, each of which is a dictionary mapping column names to column values (and in this case, we only need to supply data for the `video` column).

In [7]:
v.insert([{'video': filename}])

Inserting rows into table: 1rows [00:00, 308.25rows/s]
Inserting rows into table: 462rows [00:00, 20167.65rows/s]
inserted 463 rows with 0 errors

UpdateStatus(num_rows=463, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])

We loaded a video that shows a busy intersection in Bangkok. Let's look at a single frame, the one at position 100:

In [8]:
f.where(f.pos == 100).select(f.frame, f.frame.width, f.frame.height).show(1)

frame,width,height
,1280,720


When we created the `frames` view with automatic frame extraction, Pixeltable did not physically store the frames. Instead, it re-extracts them on retrieval using the frame index. This can be done very efficiently and avoids the storage overhead of materialized frames, which would be substantial for video.
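A quick back-of-the-envelope calculation shows how large that overhead could be. The sketch below assumes frames would be stored as uncompressed 8-bit RGB (3 bytes per pixel). That is an assumption for illustration, not a statement about how Pixeltable would store them. The frame size and count come from the outputs above.

```python
def raw_frame_bytes(width: int, height: int, channels: int = 3) -> int:
    """Size of one uncompressed frame with one byte per channel (assumed RGB)."""
    return width * height * channels


num_frames = 462                      # rows in the frames view
per_frame = raw_frame_bytes(1280, 720)
total = per_frame * num_frames

print(f"{per_frame:,} bytes per frame")   # 2,764,800 bytes per frame
print(f"{total / 1e9:.2f} GB in total")   # 1.28 GB in total
```

Compressed image formats would reduce this figure, but the stored frames would typically still be much larger than the source video.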

## Object Detection as a user-defined function

User-defined functions let you customize Pixeltable's functionality for your own data.

In this example, we're going to use a `torchvision` object detection model (Faster R-CNN):

In [9]:
import torch, torchvision
from torchvision import transforms

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn(weights="DEFAULT")
_ = model.eval()  # switch to inference mode

Our function converts the image to PyTorch format and obtains a prediction from the model, which is a list of dictionaries with fields `boxes`, `labels`, and `scores` (one per input image). The fields themselves are PyTorch tensors, and we convert them to standard Python lists (so they become JSON-serializable data):

In [10]:
@pxt.udf(return_type=pxt.JsonType(), param_types=[pxt.ImageType()])
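def detect(img):
    # PIL image -> float tensor of shape (C, H, W) with values in [0, 1]
    t = transforms.ToTensor()(img)
    t = transforms.ConvertImageDtype(torch.float)(t)
    # inference only, so skip gradient tracking
    with torch.no_grad():
        result = model([t])[0]
    # convert tensors to plain lists so the result is JSON-serializable
    return {
        'boxes': result['boxes'].tolist(),
        'labels': result['labels'].tolist(),
        'scores': result['scores'].tolist(),
    }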
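Because `detect()` returns plain lists, its result can be post-processed with ordinary Python. The following sketch filters detections by confidence and maps label ids to names. The helper `filter_detections`, the 0.5 threshold, and the sample values are illustrative assumptions. The label names are a small subset of the COCO category ids that torchvision's pretrained detection models emit.

```python
# Subset of COCO category ids used by torchvision's pretrained detection models.
COCO_NAMES = {1: 'person', 3: 'car', 4: 'motorcycle', 8: 'truck', 10: 'traffic light'}


def filter_detections(detections: dict, min_score: float = 0.5) -> dict:
    """Keep only the detections whose score is at least `min_score`."""
    keep = [i for i, s in enumerate(detections['scores']) if s >= min_score]
    return {key: [detections[key][i] for i in keep] for key in ('boxes', 'labels', 'scores')}


# Hand-written sample in the same shape as the dictionary returned by detect().
sample = {
    'boxes': [[337.9, 332.5, 429.2, 402.6], [260.4, 612.7, 384.7, 716.4], [481.8, 624.9, 530.4, 698.7]],
    'labels': [3, 4, 1],
    'scores': [0.98, 0.65, 0.33],
}

confident = filter_detections(sample, min_score=0.5)
names = [COCO_NAMES.get(label, str(label)) for label in confident['labels']]
print(names)  # ['car', 'motorcycle']
```

A threshold like this is often useful because the model also reports many low-confidence boxes.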

We can then use `detect()` in a Pixeltable `select()` expression with standard Python function call syntax:

In [11]:
f.where(f.pos == 0).select(f.frame, detect(f.frame)).show(1)

frame,col_1
,"{'boxes': [[337.8647155761719, 332.46612548828125, 429.1667175292969, 402.5970764160156], [325.2051696777344, 494.3109436035156, 564.2702026367188, 642.1953735351562], [877.8134155273438, 333.18817138671875, 996.2906494140625, 418.9712219238281], [0.0, 563.2088623046875, 96.39903259277344, 675.6091918945312], [57.864742279052734, 416.06024169921875, 262.3301696777344, 513.0235595703125], [581.123046875, 411.90191650390625, 680.7025756835938, 520.515625], [352.4833068847656, 316.9505615234375, 432.12933349609375, 367.7530517578125], [260.4427490234375, 612.7318725585938, 384.6585693359375, 716.4093017578125], [477.9952392578125, 271.1726989746094, 544.2704467773438, 328.515625], [818.2565307617188, 267.601806640625, 876.8981323242188, 319.71856689453125], [542.8866577148438, 267.740234375, 616.1195068359375, 314.6986389160156], [462.3495788574219, 595.7562255859375, 600.5916748046875, 708.95361328125], [833.157958984375, 292.3353271484375, 910.7640380859375, 349.6744384765625], [481.7925109863281, 624.8700561523438, 530.3555908203125, 698.6751098632812], [369.8426208496094, 303.4560852050781, 445.5954895019531, 353.34796142578125], [832.91357421875, 274.48211669921875, 889.0774536132812, 326.4960021972656], [462.46722412109375, 589.193115234375, 600.4926147460938, 704.57568359375], [800.8341674804688, 271.388916015625, 833.6454467773438, 304.52362060546875], [565.8760375976562, 227.74388122558594, 600.130126953125, 253.74705505371094], [781.2398681640625, 204.24252319335938, 808.84765625, 228.20692443847656], [261.3731384277344, 600.3648071289062, 390.50372314453125, 706.3046875], [312.2277526855469, 494.092041015625, 568.6148071289062, 666.3599853515625], [514.38134765625, 606.7596435546875, 582.85791015625, 681.532470703125], [787.5870971679688, 200.4111785888672, 820.6024780273438, 227.10919189453125], [842.218017578125, 260.71630859375, 886.997802734375, 306.9303283691406], [637.4602661132812, 212.8022918701172, 661.2779541015625, 236.25576782226562], 
[543.4312744140625, 556.9361572265625, 592.2392578125, 609.7962036132812], [1050.582763671875, 370.1393127441406, 1108.7381591796875, 410.6513977050781], [571.872802734375, 221.76162719726562, 608.0042724609375, 249.77294921875], [416.2694396972656, 280.3238220214844, 451.97601318359375, 320.3995361328125], [497.1452941894531, 236.92086791992188, 521.7581787109375, 259.5404968261719], [553.1375122070312, 567.242919921875, 601.6338500976562, 633.9266357421875], [783.4125366210938, 172.77194213867188, 803.6380004882812, 188.11459350585938], [480.03485107421875, 625.904296875, 531.82861328125, 698.9019775390625], [363.76220703125, 282.57269287109375, 397.25872802734375, 320.6749267578125], [586.8895874023438, 208.94427490234375, 614.739501953125, 234.72715759277344], [569.2001953125, 400.0728454589844, 676.6893310546875, 520.8723754882812], [477.9044494628906, 270.02557373046875, 543.797119140625, 328.434326171875], [783.5492553710938, 257.28387451171875, 811.5955810546875, 292.35491943359375], [543.5305786132812, 597.376708984375, 597.9984741210938, 675.7633666992188], [43.38615798950195, 446.0279846191406, 109.14473724365234, 506.2408447265625], [55.71500015258789, 414.77703857421875, 263.2668762207031, 513.2964477539062], [565.8515014648438, 227.38748168945312, 599.9375, 253.88290405273438], [788.1870727539062, 262.3468933105469, 822.7427978515625, 300.6381530761719], [4.384808540344238, 565.1299438476562, 93.74496459960938, 669.048095703125], [374.89703369140625, 275.2002868652344, 404.3221740722656, 309.2579040527344], [497.0546569824219, 236.55413818359375, 521.6633911132812, 259.74957275390625], [646.0738525390625, 136.94869995117188, 661.1941528320312, 154.5495147705078], [644.50048828125, 155.02911376953125, 659.689697265625, 173.54000854492188], [374.9891662597656, 275.096435546875, 404.2046203613281, 309.2680969238281], [781.2963256835938, 204.08888244628906, 808.7265625, 228.20252990722656], [545.7551879882812, 595.4611206054688, 597.4846801757812, 
674.7218627929688], [210.20579528808594, 307.05511474609375, 237.02890014648438, 350.6547546386719], [1050.4140625, 370.9371032714844, 1107.8831787109375, 409.7916259765625], [363.8630676269531, 282.3737487792969, 397.0633239746094, 320.69781494140625], [787.6190185546875, 200.24154663085938, 820.4561157226562, 227.1120147705078], [1217.6226806640625, 259.60467529296875, 1269.4833984375, 315.6789245605469], [543.1591186523438, 556.7106323242188, 592.6257934570312, 609.41162109375], [553.03271484375, 264.1148986816406, 619.9242553710938, 314.56103515625], [627.4351806640625, 210.60122680664062, 653.5518798828125, 236.43719482421875], [571.8056640625, 221.43743896484375, 607.8345336914062, 249.90699768066406], [464.59326171875, 595.7692260742188, 575.0718383789062, 709.7975463867188], [459.60546875, 584.369873046875, 603.13720703125, 704.4130859375], [699.329345703125, 195.35543823242188, 726.8754272460938, 225.8157196044922], [707.5513916015625, 194.1472625732422, 729.8897705078125, 219.15513610839844], [480.5521240234375, 624.8335571289062, 531.5151977539062, 697.8838500976562], [508.006103515625, 607.8663330078125, 585.8038940429688, 682.5639038085938], [352.92291259765625, 289.0428771972656, 385.89373779296875, 326.16046142578125], [552.4376220703125, 568.9983520507812, 601.6992797851562, 634.9937744140625], [553.0545043945312, 567.5036010742188, 601.8277587890625, 633.3245849609375], [352.9222412109375, 288.72247314453125, 385.6815185546875, 326.2076416015625], [783.43505859375, 168.10635375976562, 804.713134765625, 185.4266815185547], [828.9800415039062, 203.45030212402344, 850.667724609375, 228.11941528320312], [783.3914794921875, 172.7569580078125, 803.5745239257812, 188.13949584960938]], 'labels': [3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 1, 3, 3, 3, 3, 3, 3, 3, 8, 1, 3, 3, 3, 1, 3, 3, 3, 3, 1, 3, 4, 3, 3, 8, 8, 3, 4, 3, 8, 8, 3, 8, 3, 8, 10, 10, 1, 8, 1, 1, 4, 1, 8, 1, 3, 8, 3, 8, 1, 8, 3, 3, 3, 4, 3, 4, 3, 1, 10, 3, 8], 'scores': [0.9817476868629456, 
0.9800548553466797, 0.9326778650283813, 0.8906172513961792, 0.8641411662101746, 0.8546723127365112, 0.7596473097801208, 0.6463725566864014, 0.6008047461509705, 0.5348609685897827, 0.4706256687641144, 0.4125201404094696, 0.38593801856040955, 0.3349098861217499, 0.2949298322200775, 0.29465413093566895, 0.2860276401042938, 0.2688336670398712, 0.26642683148384094, 0.22474899888038635, 0.22022758424282074, 0.2099137306213379, 0.20767010748386383, 0.201735720038414, 0.18779604136943817, 0.18276192247867584, 0.18137137591838837, 0.17148073017597198, 0.16914157569408417, 0.16422635316848755, 0.16117553412914276, 0.15970425307750702, 0.14917612075805664, 0.14409610629081726, 0.13898077607154846, 0.13192562758922577, 0.12419281154870987, 0.12141920626163483, 0.12026935070753098, 0.11862155795097351, 0.11491520702838898, 0.11342544108629227, 0.10917515307664871, 0.10670676827430725, 0.10433623939752579, 0.10169120877981186, 0.10004133731126785, 0.09740971028804779, 0.09660665690898895, 0.09428671002388, 0.0931597352027893, 0.0913764014840126, 0.08887746185064316, 0.08724220097064972, 0.08696912229061127, 0.08619822561740875, 0.0856681689620018, 0.08099907636642456, 0.0797283947467804, 0.07110270857810974, 0.07036152482032776, 0.07011409103870392, 0.0687038004398346, 0.0674753412604332, 0.06733217090368271, 0.0672658160328865, 0.0657624825835228, 0.06435704976320267, 0.06287618726491928, 0.06263156980276108, 0.05841396749019623, 0.055843256413936615, 0.05322226136922836, 0.05044027417898178]}"


This works as expected, so we now add the detections to the `frames` view as a computed column named `detections`.

Running model inference is generally an expensive operation; adding it as a computed column makes sure it only runs once, at the time the row is inserted. After that, the result is available as part of the stored table data.

Note that for computed columns of any type other than `image`, the computed values are **always** stored (i.e., `stored=True`).
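The "compute once at insert time, then read from storage" idea can be illustrated with a toy model in plain Python. `ToyTable` is a hypothetical class written for this explanation. It is not Pixeltable's implementation or API.

```python
class ToyTable:
    """Toy model of a table with one stored computed column (illustration only)."""

    def __init__(self, compute):
        self.compute = compute  # function used to fill the computed column
        self.rows = []
        self.calls = 0          # how often the (expensive) function has run

    def insert(self, value):
        # The computed value is produced once, when the row is inserted ...
        self.calls += 1
        self.rows.append({'value': value, 'computed': self.compute(value)})

    def select_computed(self):
        # ... and later queries only read the stored result.
        return [row['computed'] for row in self.rows]


table = ToyTable(lambda x: x * 2)
for value in (1, 2, 3):
    table.insert(value)

table.select_computed()
table.select_computed()
print(table.calls)  # 3: one call per inserted row, none per query
```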

In [12]:
f.add_column(detections=detect(f.frame))

Computing cells: 100%|██████████| 462/462 [02:16<00:00,  3.38cells/s]

added 462 column values with 0 errors





UpdateStatus(num_rows=462, num_computed_values=462, num_excs=0, updated_cols=[], cols_with_excs=[])

We can create a simple function `draw_boxes()` to visualize detections:

In [13]:
import PIL.ImageDraw

@pxt.udf(return_type=pxt.ImageType(), param_types=[pxt.ImageType(), pxt.JsonType()])
def draw_boxes(img, boxes):
    result = img.copy()
    d = PIL.ImageDraw.Draw(result)
    for box in boxes:
        d.rectangle(box, width=3)
    return result

This function takes two arguments:
- `img` has type `image` and receives an instance of `PIL.Image.Image`
- `boxes` has type `json` and receives a JSON-serializable structure, in this case a list of 4-element lists of floats
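Since `boxes` arrives as plain nested lists, it can be checked or adjusted before drawing. As a minimal sketch, the hypothetical helper `clamp_boxes` below clips each box to the image bounds. It assumes the `(x1, y1, x2, y2)` pixel-coordinate layout that torchvision detection models return.

```python
def clamp_boxes(boxes, width, height):
    """Clip (x1, y1, x2, y2) boxes so that they lie inside a width x height image."""
    clamped = []
    for x1, y1, x2, y2 in boxes:
        clamped.append([
            min(max(x1, 0.0), float(width)),
            min(max(y1, 0.0), float(height)),
            min(max(x2, 0.0), float(width)),
            min(max(y2, 0.0), float(height)),
        ])
    return clamped


# One box that sticks out of a 1280x720 frame, one that is already inside.
boxes = [[-5.0, 10.0, 1300.0, 700.0], [100.0, 200.0, 300.0, 400.0]]
print(clamp_boxes(boxes, 1280, 720))
# [[0.0, 10.0, 1280.0, 700.0], [100.0, 200.0, 300.0, 400.0]]
```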

When we "call" this function, we need to pass in the frame and the bounding boxes identified in that frame. The latter can be selected with the JSON path expression `f.detections.boxes`:

In [14]:
f.where(f.frame_idx == 0).select(f.frame, draw_boxes(f.frame, f.detections.boxes)).show(1)

frame,col_1
,


Looking at individual frames gives us some idea of how well our detection algorithm works, but it would be more instructive to turn the visualization output back into a video.

We do that with the built-in function `make_video()`, which is an aggregation function that takes a frame index (actually: any expression that can be used to order the frames; a timestamp would also work) and an image, and then assembles the sequence of images into a video:

In [15]:
f.select(pxt.make_video(f.pos, draw_boxes(f.frame, f.detections.boxes))).group_by(v).show(1)

col_0
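Conceptually, the aggregation orders the frames by the given expression and assembles one sequence per group. The sketch below mimics that ordering and grouping in plain Python, with strings standing in for images. The function `assemble` and the sample rows are hypothetical, and this is not how `make_video()` is implemented.

```python
from itertools import groupby
from operator import itemgetter


def assemble(rows):
    """Group rows by video and order each group's frames by position."""
    ordered = sorted(rows, key=itemgetter('video', 'pos'))
    return {
        video: [row['frame'] for row in group]
        for video, group in groupby(ordered, key=itemgetter('video'))
    }


# Rows arrive in arbitrary order; strings stand in for frame images.
rows = [
    {'video': 'a.mp4', 'pos': 2, 'frame': 'a2'},
    {'video': 'b.mp4', 'pos': 0, 'frame': 'b0'},
    {'video': 'a.mp4', 'pos': 0, 'frame': 'a0'},
    {'video': 'a.mp4', 'pos': 1, 'frame': 'a1'},
]
print(assemble(rows))  # {'a.mp4': ['a0', 'a1', 'a2'], 'b.mp4': ['b0']}
```

Any sortable key could replace `pos` here, which mirrors the point that a timestamp would work equally well for ordering.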
