# Working with Blob Storage in Pixeltable

Pixeltable manages metadata, transformations, and workflows for your multimodal AI applications, while your actual media files (images, videos, audio) are stored in cloud blob storage services like S3, R2, B2, Azure Blob, GCS, or Tigris.

This guide shows you how to configure Pixeltable to automatically store your media files in blob storage and how to retrieve URLs that you can serve to browsers and applications.

## What you'll learn

- How Pixeltable resolves where to store media files (destination precedence).
- How to store media files you insert into tables.
- How to store media files that Pixeltable generates (computed columns).
- How to get servable URLs for all your media files.

## Why use blob storage

Pixeltable generates a lot of media files. When you create computed columns that process images, videos, or audio, each transformation creates new files. This can quickly consume your local storage.

- **Storage constraints.** You may not have enough local disk space to handle all the media files Pixeltable generates.
- **Integration needs.** You need to serve these files to downstream applications, browsers, or APIs — blob storage provides direct, queryable URLs.
- **Scalability.** As your data grows, local storage becomes impractical.

Example scenario: "I don't have a lot of space on my machine. Anything I do in Pixeltable, please store in blob storage so I can access it from anywhere and serve it to my applications."

By configuring blob storage destinations, you can:

- **Offload storage.** Store all generated media in cloud blob storage instead of local disk.
- **Get servable URLs.** Query URLs that point directly to your files for serving to browsers and applications.
- **Integrate seamlessly.** Works with services like [Backblaze B2](https://www.backblaze.com/blog/building-multimodal-ai-data-infrastructure-with-pixeltable/), [Tigris](https://www.tigrisdata.com/docs/quickstarts/pixeltable/), and Google Cloud Storage.

---

## How destinations are resolved

Before configuring anything, it helps to understand how Pixeltable decides where to store media files. It uses a **precedence hierarchy**: the first rule that applies determines the destination.

**Destination precedence (highest priority first):**

1. **Explicit column destination** — Trumps everything else (computed columns only).
   - When you specify the `destination=` parameter in `add_computed_column()`, that destination is **always used**.
   - Example: `t.add_computed_column(thumbnail=t.image.resize((100, 100)), destination='s3://bucket/thumbnails/')`.

2. **Global default** — Used if no explicit destination is specified.
   - For **input columns** (media you insert): Uses `input_media_dest` from config if set.
   - For **computed columns** (media Pixeltable generates): Uses `output_media_dest` from config if set.
   - Only applies when no explicit column destination is specified.

3. **Local storage** — Fallback if nothing else is configured.
   - If no explicit destination and no global default is configured, files are stored in Pixeltable's local storage.
   - This is the default behavior when no blob storage destinations are configured.

**Trumping rules:**

- Explicit column destinations **always trump** global defaults.
- Global defaults only apply when no explicit destination is specified.
- Input columns can only use global `input_media_dest` (they don't support per-column destinations).
- Computed columns can use either an explicit `destination=` parameter or the global `output_media_dest`.
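
The rules above can be sketched as a small function. This is only an illustration of the precedence order described in this guide, not Pixeltable's actual implementation; the function name `resolve_destination` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_destination(
    is_computed: bool,
    explicit_destination: Optional[str] = None,
    input_media_dest: Optional[str] = None,
    output_media_dest: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical sketch: return the destination URI, or None for local storage."""
    # 1. An explicit per-column destination wins, but only computed columns accept one.
    if explicit_destination is not None:
        if not is_computed:
            raise ValueError('input columns do not support per-column destinations')
        return explicit_destination
    # 2. Otherwise, use the global default that matches the column kind.
    global_default = output_media_dest if is_computed else input_media_dest
    if global_default:
        return global_default
    # 3. Nothing is configured: fall back to Pixeltable's local storage.
    return None


# A computed column with an explicit destination ignores the global default.
print(resolve_destination(True, 's3://bucket/thumbnails/', output_media_dest='s3://bucket/computed/'))
# An input column can only pick up the global input default.
print(resolve_destination(False, input_media_dest='s3://bucket/input/'))
```

Returning `None` for the local fallback is a choice made for this sketch so that the three outcomes are easy to tell apart.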

---

## Prerequisites and setup

### Required packages

For cloud storage (S3 / R2 / B2), you'll need `boto3` installed. Pixeltable will show a helpful error message with installation instructions if it's missing.

Pixeltable also supports local filesystem paths for development / testing, but for production multimodal AI workflows at scale, you'll want to use cloud blob storage.

### Cloud storage credentials

Before using cloud storage destinations, you need to configure credentials.

**Credential requirements:**

- **S3-compatible** (S3, R2, B2, Tigris): AWS credentials configured (see below).
- **Azure Blob Storage**: Azure storage account credentials in `~/.pixeltable/config.toml` or environment variables.
- **Google Cloud Storage**: Google Cloud credentials via `GOOGLE_APPLICATION_CREDENTIALS` environment variable.

**Configuring AWS credentials (for S3, R2, B2, Tigris):**

1. **Create AWS credentials file** (`~/.aws/credentials`):
   ```ini
   [my-profile]
   aws_access_key_id = YOUR_ACCESS_KEY
   aws_secret_access_key = YOUR_SECRET_KEY
   ```

2. **(Optional) Set AWS region** (`~/.aws/config`):
   ```ini
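   [profile my-profile]
   # Replace with your bucket's region. Keep comments on their own line:
   # a comment placed after a value may be read as part of the value.
   region = us-west-2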
   ```

3. **Tell Pixeltable which AWS profile to use** (choose one):
   - **Environment variable** (temporary): `export PIXELTABLE_S3_PROFILE="my-profile"`.
   - **Config file** (persistent): Add to `~/.pixeltable/config.toml`:
     ```toml
     [pixeltable]
     s3_profile = "my-profile"
     ```
   - For R2 and B2, use `PIXELTABLE_R2_PROFILE` / `r2_profile` or `PIXELTABLE_B2_PROFILE` / `b2_profile` instead.

4. **Verify permissions**: Your IAM user needs `s3:ListBucket` (the permission that also authorizes the `HeadBucket` call) and `s3:PutObject` on your bucket.
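
Before pointing Pixeltable at a profile, it can help to confirm that the profile is actually defined. The following is a minimal sketch that uses only the Python standard library; the helper name `has_profile` is hypothetical, and the check only looks for the two keys shown in step 1 (it does not validate the credentials against AWS).

```python
import configparser


def has_profile(credentials_text: str, profile: str) -> bool:
    """Return True if the profile section defines both credential keys."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return 'aws_access_key_id' in section and 'aws_secret_access_key' in section


sample = """
[my-profile]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
"""

print(has_profile(sample, 'my-profile'))     # True
print(has_profile(sample, 'other-profile'))  # False
```

To check your real file, pass in `(Path.home() / '.aws' / 'credentials').read_text()` instead of the sample string.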

**Destination URI formats:**

| **Storage Service** | **Destination URI Format** |
|---------------------|---------------------------|
| Amazon S3 | `s3://bucket-name/path/prefix` |
| Cloudflare R2 | `r2://bucket-name/path/prefix` |
| Backblaze B2 | `b2://bucket-name/path/prefix` |
| Google Cloud Storage | `gs://bucket-name/path/prefix` |
| Azure Blob Storage | `https://account.blob.core.windows.net/container/path` |
| Tigris | `https://t3.storage.dev/bucket-name/path` |
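
As a quick sanity check, you can classify a destination string against the formats in this table before using it. This is a hedged sketch: `classify_destination` is a hypothetical helper, not part of Pixeltable, and it only recognizes the patterns listed above plus plain POSIX-style local paths.

```python
from urllib.parse import urlparse

# Schemes taken from the table above.
SCHEME_TO_SERVICE = {
    's3': 'Amazon S3',
    'r2': 'Cloudflare R2',
    'b2': 'Backblaze B2',
    'gs': 'Google Cloud Storage',
}


def classify_destination(uri: str) -> str:
    """Map a destination string to the storage service it appears to target."""
    parsed = urlparse(uri)
    if parsed.scheme in SCHEME_TO_SERVICE:
        return SCHEME_TO_SERVICE[parsed.scheme]
    if parsed.scheme == 'https':
        host = parsed.netloc.lower()
        if host.endswith('.blob.core.windows.net'):
            return 'Azure Blob Storage'
        if host == 't3.storage.dev':
            return 'Tigris'
    if parsed.scheme in ('', 'file'):
        # No scheme: treat as a local path (Windows drive letters are not handled here).
        return 'Local filesystem'
    raise ValueError(f'unrecognized destination: {uri}')


print(classify_destination('b2://bucket-name/path/prefix'))
print(classify_destination('https://account.blob.core.windows.net/container/path'))
```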

**Important notes:**

- Media metadata is always stored locally in Pixeltable's database.
- Cloud-stored media files still count toward storage quotas (metadata tracking).
- Always verify bucket permissions before using cloud destinations in production.

---

## Setup for this demo

Let's set up our demo environment. For this demo, we'll use local paths so you can run the example without cloud credentials. In production, replace these with cloud storage URIs from the table above.


```python
%pip install -qU pixeltable boto3
```

```python
import pixeltable as pxt
from pathlib import Path

# Remove the 'blob_storage_demo' directory if it exists
pxt.drop_dir('blob_storage_demo', force=True)
pxt.create_dir('blob_storage_demo')

# Create a local directory for demo outputs
local_dest = Path.home() / 'Desktop' / 'pixeltable_outputs'
local_dest.mkdir(parents=True, exist_ok=True)

# Create a table for our demo
t = pxt.create_table('blob_storage_demo.media', {'source_image': pxt.Image})

# Insert a sample image
sample_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg'
t.insert(source_image=sample_image)
```


---

## Storing media you insert

When you insert media files into a table (via `t.insert()`), you can configure Pixeltable to automatically upload those files to blob storage. This is how Pixeltable integrates with blob storage services like Backblaze B2, Tigris, and Google Cloud Storage.

### How it works

When you insert media (images, videos, audio) into a table:

1. Pixeltable automatically uploads those files to your configured blob storage (if `input_media_dest` is set).
2. Pixeltable maintains references to the files in the table — the actual files live in blob storage.
3. This happens automatically during insert operations — no manual upload code needed.

**Precedence:** Input columns use global `input_media_dest` if configured. They don't support per-column destinations.

### Configuration

Add this to your `~/.pixeltable/config.toml`:

```toml
[pixeltable]
input_media_dest = "s3://your-bucket-name/input-media/"
# For Tigris:
# input_media_dest = "https://t3.storage.dev/your-bucket/input"
# For Backblaze B2:
# input_media_dest = "b2://your-bucket-name/input-media/"
```

Or set via environment variable:
```bash
export PIXELTABLE_INPUT_MEDIA_DEST="s3://your-bucket-name/input-media/"
```
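
In a notebook, where you may not control the shell environment, the same variable can be set from Python. This sketch assumes that Pixeltable reads the variable when it initializes, so the assignment is placed before `import pixeltable`; that ordering is an assumption, not something this guide confirms.

```python
import os

# Assumption: set this before Pixeltable is imported so that it is visible at initialization.
os.environ['PIXELTABLE_INPUT_MEDIA_DEST'] = 's3://your-bucket-name/input-media/'

# Confirm the value that the current process will see.
print(os.environ['PIXELTABLE_INPUT_MEDIA_DEST'])
```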

### Example

```python
# With global input_media_dest configured:

# Create a table
videos = pxt.create_table('content', {
    'video': pxt.Video,
    'title': pxt.String
})

# When you insert, the video is automatically uploaded to blob storage
# (rows are passed as a list of dicts, one dict per row)
videos.insert([{
    'video': './my-video.mp4',
    'title': 'My Video'
}])

# Pixeltable maintains a reference to the file in blob storage
# The actual file is stored in your configured blob storage service
```

### When to use this

Use a global input destination when:

- You want all media files inserted into Pixeltable to automatically upload to your blob storage service.
- You're building a multimodal AI pipeline where source media should live in blob storage.
- You're using Pixeltable with services like Backblaze B2, Tigris, or Google Cloud Storage and want automatic uploads.
- You want Pixeltable to manage the entire data lifecycle, from ingestion to storage.


---

## Storing media Pixeltable generates

When Pixeltable computes or generates media (via computed columns), you can store those results in blob storage. You have two options:

1. **Per-column destinations**: Specify a destination for each computed column (trumps global defaults).
2. **Global output destination**: Set a default destination for all computed columns (used if no explicit destination).

### Option 1: Per-column destinations

Specify where each computed column stores its media files by adding a `destination` parameter. This gives you fine-grained control over where each column's media files are stored.

**Precedence:** Explicit column destinations **trump** global defaults (they always win).

The `destination` parameter accepts blob storage URIs or local paths. See the URI formats table in the Prerequisites section.

#### Example: Per-column destinations

```python
# Each computed column can have its own destination
t.add_computed_column(
    rotated=t.source_image.rotate(90),
    destination='s3://my-bucket/rotated/'  # Priority 1: explicit destination
)

t.add_computed_column(
    flipped=t.source_image.flip('horizontal'),
    destination='s3://my-bucket/flipped/'  # Priority 1: explicit destination
)
```

### Option 2: Global output destination

Instead of specifying a destination for each computed column, you can set a global default for all generated / computed media. This is configured in `~/.pixeltable/config.toml`.

**Precedence:** Global output destination is used when no explicit column destination is specified (it's trumped by explicit destinations).

#### Configuration

Add this to your `~/.pixeltable/config.toml`:

```toml
[pixeltable]
output_media_dest = "s3://your-bucket-name/computed-media/"
```

Or set via environment variable:
```bash
export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://your-bucket-name/computed-media/"
```

#### How it works

Once configured:

- Any computed column **without** an explicit `destination` parameter will use this global destination.
- Computed columns **with** an explicit `destination` parameter will use that specific destination instead (explicit destinations trump global).

#### Example: Global output destination

```python
# With global output_media_dest configured to S3:

# This computed column will go to S3 (uses global default)
t.add_computed_column(thumbnail=t.source_image.resize((100, 100)))

# This computed column will go to the specified local path (explicit destination trumps global)
t.add_computed_column(
    preview=t.source_image.resize((300, 300)),
    destination='/path/to/local/folder'  # Explicit destination trumps global
)
```

### When to use each option

**Use per-column destinations when:**

- Different computed columns need different storage locations.
- You want fine-grained control over each column's storage.
- You need to override the global default for specific columns.

**Use global output destination when:**

- You want all computed media to go to the same place.
- You're managing costs by storing computed results in cheaper cloud storage.
- You want a "set and forget" configuration.

**Use both when:**

- You want a default location for most computed columns, but need specific columns to go elsewhere.

```python
# Example: Per-column destinations (explicit destinations trump global defaults)
# For this demo, using local paths (replace with cloud URIs in production)

t.add_computed_column(
    rotated=t.source_image.rotate(90),
    destination=str(local_dest / 'rotated'),  # In production: destination='s3://my-bucket/rotated/'
)

t.add_computed_column(
    flipped=t.source_image.flip('horizontal'),
    destination=str(local_dest / 'flipped'),  # In production: destination='s3://my-bucket/flipped/'
)

# View the results
t.select(t.source_image, t.rotated, t.flipped).show()
```


```python
# Check current global output destination (if any)
from pixeltable.env import Env

env = Env.get()
if env.default_output_media_dest:
    print(f"Global output destination is set to: {env.default_output_media_dest}")
    print("\nComputed columns without explicit destinations will use this location.")
else:
    print("No global output destination configured (using default local storage)")
    print("\nTo set a global output destination, edit ~/.pixeltable/config.toml:")
    print("# [pixeltable]")
    print("# output_media_dest = \"s3://your-bucket-name/computed-media/\"")
```


```python
# Check current global input destination (if any)
from pixeltable.env import Env

env = Env.get()
if env.default_input_media_dest:
    print(f"Global input destination is set to: {env.default_input_media_dest}")
    print("\nWith this configured, all inserted media will be copied to this location.")
else:
    print("No global input destination configured")
    print("Inserted media will be stored in Pixeltable's default local storage.")
    print("\nTo set a global input destination, edit ~/.pixeltable/config.toml:")
    print("# [pixeltable]")
    print("# input_media_dest = \"s3://your-bucket-name/input-media/\"")
```


---

## Getting URLs for serving media

One of the key benefits of storing media in blob storage is that you can serve files directly to browsers and applications. Pixeltable provides queryable URLs through the `.fileurl` property that point to your files in blob storage.

### How to get URLs

Query URLs using `.fileurl` in your select statements:

```python
# Query media with their blob storage URLs
t.select(
    t.source_image,
    t.source_image.fileurl,
    t.rotated,
    t.rotated.fileurl
).collect()

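# Filter and get URLs for specific rows
# (`some_condition` and `processed_image` are placeholders; substitute a
# filter expression and a media column from your own table)
t.where(t.some_condition).select(
    t.source_image.fileurl,
    t.processed_image.fileurl
).collect()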
```

### Benefits

- **Direct serving.** URLs point directly to your blob storage, no proxy needed.
- **Scalable.** Blob storage services handle the traffic and bandwidth.
- **CDN-ready.** Many blob storage services offer CDN integration for fast global delivery.
- **Queryable.** Get URLs through Pixeltable queries, filter by metadata, then serve the results.

This makes it easy to build applications that serve media directly from blob storage while using Pixeltable to manage and query your data.
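
Once you have a list of URLs, turning them into markup is straightforward. The sketch below is illustrative only: `image_gallery` is a hypothetical helper, the example URLs are made up, and it assumes the URLs are HTTP(S) URLs that the browser is permitted to fetch (bucket access policies are outside the scope of this guide).

```python
from html import escape


def image_gallery(urls: list[str]) -> str:
    """Render one <img> tag per URL, escaping each URL for use in an HTML attribute."""
    tags = [f'<img src="{escape(url, quote=True)}" loading="lazy">' for url in urls]
    return '\n'.join(tags)


# Placeholder URLs; in practice these would come from a `.fileurl` query.
urls = [
    'https://example.com/my-bucket/rotated/img-1.jpg',
    'https://example.com/my-bucket/flipped/img-1.jpg?version=2&size=small',
]
print(image_gallery(urls))
```

Escaping matters because URLs with query strings contain `&`, which must be written as `&amp;` inside an HTML attribute.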


```python
# Get queryable URLs for serving media from blob storage
# Query with .fileurl to get URLs that can be used directly in HTML, APIs, or applications
t.select(
    t.source_image,
    t.source_image.fileurl,
    t.rotated,
    t.rotated.fileurl,
    t.flipped,
    t.flipped.fileurl
).collect()
```

---

## Summary: Choosing the right configuration

### Destination precedence

Destinations are resolved using these precedence rules:

1. **Explicit column destination** — Trumps everything (computed columns only).
2. **Global default** — Used if no explicit destination (`input_media_dest` for input columns, `output_media_dest` for computed columns).
3. **Local storage** — Fallback if nothing else is configured.

### Decision guide

**Storing media you insert:**

| **Scenario** | **Approach** |
|-------------|-------------|
| You want all inserted media to automatically upload to blob storage | Configure `input_media_dest` in config |
| You're building a system where source media should live in blob storage | Configure `input_media_dest` in config |

**Storing media Pixeltable generates:**

| **Scenario** | **Approach** |
|-------------|-------------|
| Different computed columns need different storage locations | Use per-column `destination=` parameter (explicit destinations trump global) |
| All computed columns should go to the same blob storage service | Configure `output_media_dest` in config |
| You want fine-grained control over each column's storage | Use per-column `destination=` parameter (explicit destinations trump global) |
| You want a default location but need some columns to go elsewhere | Configure `output_media_dest` + use `destination=` for exceptions (explicit trumps global) |

**Using both input and output destinations:**

| **Scenario** | **Approach** |
|-------------|-------------|
| You want all media (input and computed) in the same blob storage service | Configure both `input_media_dest` and `output_media_dest` to the same location |

### Key takeaways

1. **Destination precedence.** Explicit column destinations trump global defaults, which trump local storage.

2. **Input vs output.** Input columns (what you insert) can only use the global `input_media_dest`. Computed columns (what Pixeltable generates) can use either an explicit `destination=` parameter or the global `output_media_dest`.

3. **Pixeltable manages metadata.** Pixeltable manages metadata and workflows; blob storage holds the actual media files. This is an integrated system, not just storage.

4. **Servable URLs.** Whichever destination your media ends up in, you can query its URL with `.fileurl`; when a blob storage destination is configured, that URL points directly to the file in blob storage.

---

## Learn more

- [Pixeltable Configuration](https://docs.pixeltable.com/overview/configuration)
- [Working with External Files](https://docs.pixeltable.com/notebooks/feature-guides/working-with-external-files)
- [API Reference: add_computed_column](https://docs.pixeltable.com/api/pixeltable/table#pixeltable.Table.add_computed_column)

If you have questions about blob storage destinations, reach out on our [Discord community](https://pixeltable.com/discord).
