# External Files

In Pixeltable, all media data (videos, images, audio) resides in external files, and Pixeltable stores references to those. The files can be local or remote (e.g., in S3). For the latter, Pixeltable automatically caches the files locally on access.

When interacting with media data via Pixeltable, either through queries or UDFs, the user sees the following Python types:

- `ImageType`: `PIL.Image.Image`
- `VideoType`: `str` (local path)
- `AudioType`: `str` (local path)

Let's create a table and load some data to see what that looks like:

In [None]:
%pip install -qU pixeltable boto3

In [1]:
import tempfile
import random
import shutil
import pixeltable as pxt

# First drop the `external_data` directory if it exists, to ensure
# a clean environment for the demo
pxt.drop_dir('external_data', force=True)
pxt.create_dir('external_data')

Connected to Pixeltable database at: postgresql+psycopg://postgres:@/pixeltable?host=/Users/asiegel/.pixeltable/pgdata
Created directory `external_data`.


<pixeltable.catalog.dir.Dir at 0x176646bb0>

In [2]:
v = pxt.create_table('external_data.videos', {'video': pxt.Video})

prefix = 's3://multimedia-commons/'
paths = [
    'data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4',
    'data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4',
    'data/videos/mp4/ffe/f73/ffef7384d698b5f70d411c696247169.mp4'
]
v.insert({'video': prefix + p} for p in paths)

Created table `videos`.
Computing cells:   0%|                                                    | 0/6 [00:00<?, ? cells/s]
Inserting rows into `videos`: 3 rows [00:00, 1004.62 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 79.14 cells/s]
Inserted 3 rows with 0 errors.


UpdateStatus(num_rows=3, num_computed_values=6, num_excs=0, updated_cols=[], cols_with_excs=[])

UpdateStatus(num_rows=3, num_computed_values=0, num_excs=0, updated_cols=[], cols_with_excs=[])

We just inserted 3 rows with video files residing in S3. When we now query these, we are presented with their locally cached counterparts.

(Note: we don't simply display the output of `collect()` here, because that is formatted as an HTML table with a media player and so would obscure the file path.)

In [3]:
rows = list(v.select(v.video).collect())
rows[0]

{'video': '/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4'}

Let's make a local copy of the first file and insert that separately. First, the copy:

In [4]:
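# tempfile.mktemp() is deprecated and race-prone; create the file up front instead
with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as tmp:
    local_path = tmp.name
shutil.copyfile(rows[0]['video'], local_path)
local_path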

'/var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4'

Now the insert:

In [5]:
v.insert([{'video': local_path}])

Computing cells:   0%|                                                    | 0/2 [00:00<?, ? cells/s]
Inserting rows into `videos`: 1 rows [00:00, 725.78 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 2/2 [00:00<00:00, 53.23 cells/s]
Inserted 1 row with 0 errors.


UpdateStatus(num_rows=1, num_computed_values=2, num_excs=0, updated_cols=[], cols_with_excs=[])

When we query this again, we see that local paths are preserved:

In [6]:
rows = list(v.select(v.video).collect())
rows

[{'video': '/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4'},
 {'video': '/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_fc11428b32768ae782193a57ebcbad706f45bbd9fa13354471e0bcd798fee3ea.mp4'},
 {'video': '/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_b9fb0d9411bc9cd183a36866911baa7a8834f22f665bce47608566b38485c16a.mp4'},
 {'video': '/var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4'}]

UDFs also see local paths:

In [7]:
@pxt.udf
def f(v: pxt.Video) -> int:
    print(f'{type(v)}: {v}')
    return 1

In [8]:
v.select(f(v.video)).show()

<class 'str'>: /Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4
<class 'str'>: /Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_fc11428b32768ae782193a57ebcbad706f45bbd9fa13354471e0bcd798fee3ea.mp4
<class 'str'>: /Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_b9fb0d9411bc9cd183a36866911baa7a8834f22f665bce47608566b38485c16a.mp4
<class 'str'>: /var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4


f
1
1
1
1
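
Because the UDF argument is an ordinary `str` path, the standard library's file APIs can be applied to it directly. The following is a minimal sketch of such a function body, written as plain Python so it can be tried outside of Pixeltable; the name `video_size_bytes` is hypothetical, and wrapping it with `@pxt.udf` and a `pxt.Video` parameter annotation (as done for `f` above) is assumed rather than shown.

```python
import os


def video_size_bytes(video_path: str) -> int:
    """Return the size in bytes of the media file at a local path."""
    # The path is a regular filesystem path, so os.path works on it as usual.
    return os.path.getsize(video_path)
```

Called on one of the paths printed above, this would return the size of the cached or local `.mp4` file.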


## Dealing with errors

When interacting with media data in Pixeltable, the user can assume that the underlying files exist, are local, and are valid for their respective data type. In other words, code that consumes media values doesn't need to handle missing or corrupted files.

To that end, Pixeltable validates media data on ingest. The default behavior is to reject invalid media files:

In [9]:
v.insert([{'video': prefix + 'bad_path.mp4'}])

Computing cells:   0%|                                                    | 0/2 [00:01<?, ? cells/s]


Error: Failed to download s3://multimedia-commons/bad_path.mp4: An error occurred (404) when calling the HeadObject operation: Not Found

The same happens for corrupted files:

In [10]:
# create invalid .mp4
with tempfile.NamedTemporaryFile(mode='wb', suffix='.mp4', delete=False) as temp_file:
    temp_file.write(random.randbytes(1024))
    corrupted_path = temp_file.name

v.insert([{'video': corrupted_path}])

Computing cells: 100%|██████████████████████████████████████████| 2/2 [00:00<00:00, 1084.64 cells/s]


Error: Not a valid video: /var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp3djgfyjp.mp4
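
If you want to filter out obviously broken files before handing them to `insert()`, a cheap client-side check is possible. The sketch below is not how Pixeltable validates videos; it is only a heuristic, assuming that the file is an MP4 whose first box is `ftyp` (true of most, but not all, MP4 files). The function name `looks_like_mp4` is hypothetical.

```python
def looks_like_mp4(path: str) -> bool:
    """Heuristic: does the file start with an ISO base media `ftyp` box?"""
    with open(path, 'rb') as f:
        header = f.read(8)
    # A box starts with a 4-byte size followed by a 4-byte type code.
    return header[4:8] == b'ftyp'
```

A file of random bytes, like `corrupted_path` above, will almost certainly fail this check, whereas passing it says nothing about whether the rest of the file decodes correctly.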

Alternatively, Pixeltable can record error conditions and proceed with the ingest when `on_error='ignore'` is passed (the default is `'abort'`):

In [11]:
v.insert([{'video': prefix + 'bad_path.mp4'}, {'video': corrupted_path}], on_error='ignore')

Computing cells: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 20.98 cells/s]
Inserting rows into `videos`: 2 rows [00:00, 671.63 rows/s]
Computing cells: 100%|████████████████████████████████████████████| 4/4 [00:00<00:00, 20.13 cells/s]
Inserted 2 rows with 4 errors across 2 columns (videos.video, videos.None).


UpdateStatus(num_rows=2, num_computed_values=4, num_excs=4, updated_cols=[], cols_with_excs=['videos.video', 'videos.None'])

Every media column has the properties `errortype` and `errormsg` (both containing `str` data), which indicate whether the column value is valid. Invalid values show up as `None` and have non-null `errortype`/`errormsg`:

In [12]:
v.select(v.video == None, v.video.errortype, v.video.errormsg).collect()

col_0,video_errortype,video_errormsg
False,,
False,,
False,,
False,,
True,Error,Failed to download s3://multimedia-commons/bad_path.mp4: An error occurred (404) when calling the HeadObject operation: Not Found
True,Error,Not a valid video: /var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp3djgfyjp.mp4


Errors can now be inspected (and corrected) after the ingest:

In [13]:
v.where(v.video.errortype != None).select(v.video.errormsg).collect()

video_errormsg
Failed to download s3://multimedia-commons/bad_path.mp4: An error occurred (404) when calling the HeadObject operation: Not Found
Not a valid video: /var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp3djgfyjp.mp4
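
When many rows fail, it can help to bucket the messages before deciding how to correct them. The sketch below groups messages by their prefix. It assumes the rows are available as plain dicts keyed by the column name `video_errormsg` (as in the output above), and that the two message prefixes seen in this notebook are representative; the function `group_errors` and its category names are hypothetical.

```python
from collections import defaultdict


def group_errors(rows: list[dict]) -> dict[str, list[str]]:
    """Bucket error messages into coarse categories based on their prefix."""
    groups = defaultdict(list)
    for row in rows:
        msg = row.get('video_errormsg')
        if msg is None:
            continue  # valid row, nothing to report
        if msg.startswith('Failed to download'):
            groups['download'].append(msg)
        elif msg.startswith('Not a valid video'):
            groups['invalid'].append(msg)
        else:
            groups['other'].append(msg)
    return dict(groups)
```

Download failures typically point to a wrong URL or missing permissions, while invalid-media errors point to the file contents, so the two groups usually call for different fixes.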


## Accessing the original file paths

Sometimes you need the file path itself rather than the decoded value (say, a `PIL.Image.Image`). Pixeltable provides the column properties `fileurl` and `localpath` for that purpose:

In [14]:
v.select(v.video.fileurl, v.video.localpath).collect()

video_fileurl,video_localpath
s3://multimedia-commons/data/videos/mp4/ffe/ffb/ffeffbef41bbc269810b2a1a888de.mp4,/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_1fcfcb221263cff76a2853250fbbb2e90375dd495454c0007bc6ff4430c9a4a7.mp4
s3://multimedia-commons/data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4,/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_fc11428b32768ae782193a57ebcbad706f45bbd9fa13354471e0bcd798fee3ea.mp4
s3://multimedia-commons/data/videos/mp4/ffe/f73/ffef7384d698b5f70d411c696247169.mp4,/Users/asiegel/.pixeltable/file_cache/682f022a704d4459adb2f29f7fe9577c_0_b9fb0d9411bc9cd183a36866911baa7a8834f22f665bce47608566b38485c16a.mp4
file:///var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4,/var/folders/hb/qd0dztsj43j_mdb6hbl1gzyc0000gn/T/tmp1jo4a7ca.mp4
,
,


Note that for local media files, the `fileurl` property still returns a parsable URL (with a `file://` scheme).
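
Since both kinds of `fileurl` values are well-formed URLs, the standard library can take them apart uniformly. The following is a minimal sketch using `urllib.parse`; the helper name `describe_media_url` and the shape of its return value are hypothetical, and only the `file` and `s3` schemes seen in this notebook are handled specially.

```python
from urllib.parse import urlparse, unquote
from urllib.request import url2pathname


def describe_media_url(fileurl: str) -> dict:
    """Split a media URL into scheme, bucket (for S3), and path or key."""
    parsed = urlparse(fileurl)
    if parsed.scheme == 'file':
        # url2pathname decodes percent-escapes and applies platform path rules.
        return {'scheme': 'file', 'bucket': None, 'path': url2pathname(parsed.path)}
    if parsed.scheme == 's3':
        # For s3://bucket/key, the bucket is the netloc and the key is the path.
        key = unquote(parsed.path.lstrip('/'))
        return {'scheme': 's3', 'bucket': parsed.netloc, 'path': key}
    # Any other scheme is passed through without interpretation.
    return {'scheme': parsed.scheme, 'bucket': None, 'path': parsed.path}
```

For example, `describe_media_url('s3://multimedia-commons/data/videos/mp4/ffe/feb/ffefebb41485539f964760e6115fbc44.mp4')` yields the bucket `multimedia-commons` and the object key without a leading slash.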