# Working with Blob Storage in Pixeltable

Multimodal AI workflows generate a lot of files — extracted video frames, generated images, processed audio segments. By default, Pixeltable stores these files locally, which can quickly fill up your disk and make it hard to work with large datasets.

Cloud blob storage solves the capacity problem, but managing blobs directly requires tracking which files came from which inputs, storing metadata, maintaining relationships, and keeping everything in sync as you add data.

Pixeltable handles this for you. When you configure blob storage, Pixeltable automatically uploads files, tracks lineage and metadata, handles incremental updates, and lets you query across structured data and media files.

In this notebook, you'll learn how to configure Pixeltable to use blob storage and control where files are stored.

## What you'll learn

- Where Pixeltable stores files by default
- How to specify destinations for individual columns
- How to configure global destinations for all columns
- How destination precedence works

## How it works

Pixeltable decides where to store files using this priority:

1. **Column destination** (highest priority) — `destination=` in `add_computed_column()`
2. **Global configuration** — `input_media_dest` / `output_media_dest` in [config file](https://docs.pixeltable.com/platform/configuration)
3. **Local storage** (default) — Used if nothing else is configured
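
The priority order above can be sketched as a small lookup. This is only an illustration of the rule, not Pixeltable's implementation: `resolve_destination` and `LOCAL_DEFAULT` are hypothetical names introduced here.

```python
from typing import Optional

# Hypothetical placeholder for Pixeltable's local media location; the real
# path is chosen by Pixeltable and is not reproduced here.
LOCAL_DEFAULT = "<pixeltable local media store>"


def resolve_destination(
    column_destination: Optional[str] = None,
    global_destination: Optional[str] = None,
) -> str:
    """Return the destination that wins under the priority order above."""
    if column_destination:  # 1. per-column `destination=`
        return column_destination
    if global_destination:  # 2. input_media_dest / output_media_dest
        return global_destination
    return LOCAL_DEFAULT  # 3. local storage


print(resolve_destination("s3://my-bucket/rotated/", "s3://my-bucket/output/"))
print(resolve_destination(None, "s3://my-bucket/output/"))
print(resolve_destination())
```

The first call returns the column destination, the second falls back to the global setting, and the third falls back to local storage.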

## Prerequisites

For this notebook, you'll need:
- `pixeltable` and `boto3` installed
- (Optional) Cloud storage credentials if you want to use a real provider

In [None]:
%pip install -qU pixeltable boto3

## Setup

Let's set up our demo environment. We'll use local paths in this notebook, but you can substitute cloud storage URIs anywhere you see a destination path.

In [None]:
import pixeltable as pxt
from pathlib import Path

In [None]:
# Clean slate for this demo
pxt.drop_dir('blob_storage_demo', force=True)
pxt.create_dir('blob_storage_demo')

In [None]:
# Create local destination directories
# For S3: dest_rotated = "s3://my-bucket/rotated/"
# For GCS: dest_rotated = "gs://my-bucket/rotated/"
base_path = Path.home() / 'Desktop' / 'pixeltable_outputs'
base_path.mkdir(parents=True, exist_ok=True)

dest_rotated = str(base_path / 'rotated')
dest_flipped = str(base_path / 'flipped')

# Create directories (only needed for local paths)
Path(dest_rotated).mkdir(exist_ok=True)
Path(dest_flipped).mkdir(exist_ok=True)

In [None]:
# Create table and insert sample image
t = pxt.create_table('blob_storage_demo.media', {'source_image': pxt.Image})
sample_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg'
t.insert(source_image=sample_image)

## Default destinations

By default, Pixeltable keeps all media files on the local machine:

- **Input files** (files you insert): local paths are referenced in place, and files behind URLs are downloaded to a local cache when accessed
- **Output files** (files Pixeltable generates): stored locally

This works out of the box with no configuration. Later sections show how to change these defaults.

In [None]:
# Let's see where the source_image is stored by default
t.select(t.source_image, t.source_image.fileurl).collect()

## Per-column destinations

You can specify exactly where to store files for individual columns using the `destination=` parameter of `add_computed_column()`. This lets you route one column's output to a dedicated directory or bucket prefix while other columns keep the default.

Add two computed columns: one **with** an explicit destination and one **without** (which uses the default).

In [None]:
# Column WITH explicit destination
t.add_computed_column(
    rotated=t.source_image.rotate(90),
    destination=dest_rotated
)

# Column WITHOUT destination - uses default (local storage)
# transpose(0) is PIL's FLIP_LEFT_RIGHT, i.e. a horizontal mirror
t.add_computed_column(
    flipped=t.source_image.transpose(0)
)

View the results. Notice both files are stored, but `rotated` uses our explicit destination while `flipped` uses the default local storage.

In [None]:
t.select(t.source_image, t.rotated, t.flipped, t.rotated.fileurl, t.flipped.fileurl).collect()

## Changing global destinations

Instead of setting `destination=` on every column, you can change the global default for ALL columns.

### Output and input destinations

You can configure two types of global destinations:

- **`output_media_dest`** — Changes the default for files Pixeltable generates (computed columns)
- **`input_media_dest`** — Changes the default for files you insert into tables

You can set them to the same bucket or different buckets depending on your needs.

### How to configure

You have two options:

**Option 1: Configuration file** (`~/.pixeltable/config.toml`)

```toml
[pixeltable]
# Where files Pixeltable generates are stored
output_media_dest = "s3://my-bucket/output/"

# Where files you insert are stored
input_media_dest = "s3://my-bucket/input/"
```

**Option 2: Environment variables**

```bash
export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/output/"
export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/input/"
```
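
In a notebook, a shell `export` does not reach a kernel that is already running, so the same variables can be set from Python instead. The sketch below assumes that Pixeltable reads these variables when it initializes, which means they should be set before the first `import pixeltable` in the process; treat that timing as an assumption to verify against the configuration docs.

```python
import os

# Assumption: set these before Pixeltable reads its configuration,
# i.e. before the first `import pixeltable` in this process.
os.environ["PIXELTABLE_OUTPUT_MEDIA_DEST"] = "s3://my-bucket/output/"
os.environ["PIXELTABLE_INPUT_MEDIA_DEST"] = "s3://my-bucket/input/"

# import pixeltable as pxt  # import only after the variables are in place

# Confirm what the current process will see
for name in ("PIXELTABLE_OUTPUT_MEDIA_DEST", "PIXELTABLE_INPUT_MEDIA_DEST"):
    print(f"{name}={os.environ[name]}")
```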

### Supported providers and URI formats

| Provider | URI Format | Example |
|----------|-----------|---------|
| **Amazon S3** | `s3://bucket-name/prefix/` | `s3://my-bucket/media/` |
| **Google Cloud Storage** | `gs://bucket-name/prefix/` | `gs://my-gcs-bucket/media/` |
| **Azure Blob Storage** | `wasbs://container@account.blob.core.windows.net/prefix/` | `wasbs://media@myaccount.blob.core.windows.net/files/` |
| **Cloudflare R2** | `https://account-id.r2.cloudflarestorage.com/bucket-name/prefix/` | `https://abc123.r2.cloudflarestorage.com/media/files/` |
| **Backblaze B2** | `https://s3.region.backblazeb2.com/bucket-name/prefix/` | `https://s3.us-west-004.backblazeb2.com/media/files/` |
| **Tigris** | `https://fly.storage.tigris.dev/bucket-name/prefix/` | `https://fly.storage.tigris.dev/media/files/` |

For complete authentication and setup details, see the [Cloud Storage documentation](https://docs.pixeltable.com/integrations/cloud-storage).
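
To sanity-check a destination string before putting it in your config, the URI formats in the table can be matched with the standard library. This is a rough sketch based only on the patterns listed above; `guess_provider` is a hypothetical helper, not a Pixeltable function, and it says nothing about how Pixeltable itself detects providers.

```python
from urllib.parse import urlparse


def guess_provider(uri: str) -> str:
    """Map a destination URI to a provider name from the table above."""
    parsed = urlparse(uri)
    host = parsed.netloc.lower()
    if parsed.scheme == "s3":
        return "Amazon S3"
    if parsed.scheme == "gs":
        return "Google Cloud Storage"
    if parsed.scheme == "wasbs":
        return "Azure Blob Storage"
    if parsed.scheme == "https":
        # S3-compatible providers are distinguished by their endpoint host
        if host.endswith(".r2.cloudflarestorage.com"):
            return "Cloudflare R2"
        if host.endswith(".backblazeb2.com"):
            return "Backblaze B2"
        if host.endswith(".tigris.dev"):
            return "Tigris"
    return "unknown"


for uri in (
    "s3://my-bucket/media/",
    "wasbs://media@myaccount.blob.core.windows.net/files/",
    "https://fly.storage.tigris.dev/media/files/",
):
    print(uri, "->", guess_provider(uri))
```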

## Precedence in action

Even if you configure global destinations, you can still override them for specific columns using the `destination=` parameter in `add_computed_column()`.

Let's add another column with an explicit destination to see the override in action.

In [None]:
# Add column with explicit destination (overrides any global default)
t.add_computed_column(
    thumbnail=t.source_image.resize((128, 128)),
    destination=dest_flipped  # The dest_flipped directory from setup, unused until now
)

View the thumbnail and its file URL. The explicit `destination=` parameter always wins, regardless of global configuration.

In [None]:
t.select(t.thumbnail, t.thumbnail.fileurl).collect()

## Getting URLs for your files

When your files are in blob storage, you can get URLs that point directly to them and use those URLs in HTML pages, API responses, or any other application that serves media.

The `.fileurl` property gives you direct URLs you can use anywhere.
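
As one example of using such a URL in HTML, the sketch below builds an `<img>` element from a URL string. `img_tag` and `example_url` are hypothetical names for illustration; in practice the URL would come from a `.fileurl` query result. Whether a browser can actually load it depends on how the bucket is exposed, and a URL pointing at local storage is only reachable on the same machine.

```python
from html import escape


def img_tag(url: str, alt: str = "") -> str:
    """Build an <img> element for a media URL, escaping attribute values."""
    return f'<img src="{escape(url, quote=True)}" alt="{escape(alt, quote=True)}">'


# Hypothetical URL standing in for a value returned by `.fileurl`
example_url = "https://example.com/media/rotated.jpg"
print(img_tag(example_url, alt="rotated image"))
```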

In [None]:
t.select(
    t.source_image.fileurl,
    t.rotated.fileurl,
    t.flipped.fileurl
).collect()

## Common patterns

Here are a few real-world patterns you might use:

### Pattern 1: All media in one place

If you want everything in the same bucket, configure both input and output destinations in `~/.pixeltable/config.toml`:

```toml
[pixeltable]
input_media_dest = "s3://my-bucket/media/"
output_media_dest = "s3://my-bucket/media/"
```

Or set environment variables:

```bash
export PIXELTABLE_INPUT_MEDIA_DEST="s3://my-bucket/media/"
export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://my-bucket/media/"
```

### Pattern 2: Separate input and output

Keep source files separate from processed files in `~/.pixeltable/config.toml`:

```toml
[pixeltable]
input_media_dest = "s3://my-bucket/uploads/"
output_media_dest = "s3://my-bucket/processed/"
```

### Pattern 3: Override for specific columns

Use a global default, but send some columns elsewhere. First, set a global default in your config:

```toml
[pixeltable]
output_media_dest = "s3://my-bucket/processed/"
```

Then in your code, most columns use the global default, but you can override specific ones:

```python
# Illustrative only: assumes a table `t` with an image column named `image`

# Uses global default (s3://my-bucket/processed/)
t.add_computed_column(
    thumbnail=t.image.resize((128, 128))
)

# Overrides global default - goes to different location
t.add_computed_column(
    large_thumbnail=t.image.resize((512, 512)),
    destination='s3://my-bucket/thumbnails/'
)
```

## Where do my files go?

What Pixeltable does with an inserted file depends on two things: whether the input is a local path or a URL, and how `input_media_dest` is configured.

| Configuration | Local File (`/path/to/image.jpg`) | URL (`https://example.com/image.jpg`) |
|---------------|-----------------------------------|---------------------------------------|
| **No `input_media_dest`** | Stores path reference. File stays in place (no copying, no caching needed). | Stores URL reference. Downloads to cache when accessed (lazy). |
| **Local `input_media_dest`** | Copies file to destination. | Stores URL reference. Downloads to cache when accessed (lazy). |
| **Cloud `input_media_dest`** | Uploads to destination, caches locally. | Downloads immediately on insert, uploads to destination, caches locally. |

When you configure a cloud destination, Pixeltable populates both the destination and the local cache during `insert()`. For URLs, this means downloading once and using that download for both the upload and the cache, which avoids a redundant upload-then-download round trip.
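
The table can also be read as a lookup keyed on the destination kind and the input kind. The sketch below encodes exactly those six cells; `input_handling` and `CLOUD_SCHEMES` are hypothetical names, and classifying a destination as "cloud" by its URI scheme is an assumption made for this illustration.

```python
from typing import Optional
from urllib.parse import urlparse

# Assumption: a destination counts as "cloud" if it uses one of these schemes
CLOUD_SCHEMES = {"s3", "gs", "wasbs", "https"}


def input_handling(source: str, input_media_dest: Optional[str]) -> str:
    """Look up the table above for one inserted value."""
    source_is_url = urlparse(source).scheme in {"http", "https"}
    if input_media_dest is None:
        dest_kind = "none"
    elif urlparse(input_media_dest).scheme in CLOUD_SCHEMES:
        dest_kind = "cloud"
    else:
        dest_kind = "local"

    table = {
        ("none", False): "store path reference; file stays in place",
        ("none", True): "store URL reference; download to cache on access",
        ("local", False): "copy file to destination",
        ("local", True): "store URL reference; download to cache on access",
        ("cloud", False): "upload to destination; cache locally",
        ("cloud", True): "download on insert; upload to destination; cache locally",
    }
    return table[(dest_kind, source_is_url)]


print(input_handling("/path/to/image.jpg", None))
print(input_handling("https://example.com/image.jpg", "s3://my-bucket/input/"))
```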

## What you learned

- Pixeltable uses local storage by default for all media files
- You can override the default for specific columns with `destination=`
- You can change the global default with `input_media_dest` and `output_media_dest`
- Precedence: column destination > global config > local storage
- Use `.fileurl` to get URLs for your stored files
- With a cloud `input_media_dest`, Pixeltable fills both the destination and the local cache from a single download during `insert()`

## Next steps

- See the [Cloud Storage documentation](https://docs.pixeltable.com/integrations/cloud-storage) for complete provider setup and authentication details
- Check out [Pixeltable Configuration](https://docs.pixeltable.com/platform/configuration) for all config options
- Join our [Discord community](https://pixeltable.com/discord) if you have questions