# Working with Blob Storage in Pixeltable

Pixeltable is declarative data infrastructure for multimodal AI applications. It manages metadata, transformations, and workflows, while your actual media files (images, videos, audio) are stored in cloud blob storage services.

Blob storage (S3, R2, B2, Azure Blob, GCS, Tigris) holds the media files themselves, and Pixeltable keeps references to them alongside the structured metadata. Pixeltable supports three common workflows for integrating with blob storage:

1. 
2. When Pixeltable generates or processes media, those results are stored in blob storage too. Pixeltable maintains references to the files, so you can query everything together.
3. When you insert media into Pixeltable, it can automatically upload files to your blob storage. 

This guide shows you three ways to store media in blob storage (per-column destinations, global output destinations, global input destinations) and how to get servable URLs for all your media files.

See how [Backblaze B2](https://www.backblaze.com/blog/building-multimodal-ai-data-infrastructure-with-pixeltable/), [Tigris](https://www.tigrisdata.com/docs/quickstarts/pixeltable/), and Google Cloud Storage integrate with Pixeltable for scalable multimodal AI workflows.

Key use cases for blob storage with Pixeltable:
- Storage: Scalable, durable storage for large volumes of media files
- Serving media: Directly serving images, videos, or documents to browsers and applications via queryable URLs
- Integration: Seamless integration with services like Backblaze B2, Tigris, and Google Cloud Storage for multimodal AI pipelines

## Overview: Storing Media in Blob Storage

Pixeltable can store your media files in cloud blob storage in two scenarios:

1. **Storing media you insert** - When you insert media files into a table, a global input destination can automatically upload them to blob storage
2. **Storing media Pixeltable generates** - When Pixeltable creates media (computed columns, image generation, video processing), per-column destinations or a global output destination can store the results in blob storage

This guide covers both scenarios.

Getting servable URLs: Regardless of how your media is stored, you can query URLs that point directly to your files in blob storage for serving to browsers and applications.

---

## Prerequisites

For cloud storage (S3/R2/B2):
- AWS credentials with write permissions to your target bucket
- `boto3` package installed (Pixeltable will show a helpful error message with installation instructions if it's missing)

Note: Pixeltable also supports local filesystem paths as destinations. For multimodal AI workflows at scale, cloud blob storage is the better fit: the files live in your storage service and Pixeltable keeps references to them.

Important Notes:
- Media metadata is always stored locally in Pixeltable's database
- Cloud-stored media files still count toward storage quotas (metadata tracking)
- Always verify bucket permissions before using cloud destinations in production

Let's set up our demo environment and install the required packages.


## Setup

Install Pixeltable and boto3 (required for cloud storage):

In [None]:
%pip install -qU pixeltable boto3

Now let's create a Pixeltable directory to keep the tables for this demo separate from anything else you're working on.


In [None]:
import pixeltable as pxt

# Remove the 'blob_storage_demo' directory if it exists
pxt.drop_dir('blob_storage_demo', force=True)
pxt.create_dir('blob_storage_demo')

Next, create a table with a single image column to hold the source media:


In [None]:
t = pxt.create_table('blob_storage_demo.media', {'source_image': pxt.Image})

---

## Configuring Credentials for Cloud Storage

Before using cloud storage destinations, you need to configure credentials:

Credential requirements:
- S3-compatible (S3, R2, B2, Tigris): AWS credentials configured (see below)
- Azure Blob Storage: Azure storage account credentials in `~/.pixeltable/config.toml` or environment variables
- Google Cloud Storage: Google Cloud credentials via `GOOGLE_APPLICATION_CREDENTIALS` environment variable

Configuring AWS credentials (for S3, R2, B2, Tigris)

Step 1: Create AWS credentials file

Create `~/.aws/credentials` with your access keys:

```ini
[my-profile]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
```

Replace `my-profile` with any name you choose, and use your real IAM keys.

Step 2: (Optional) Set AWS region

If your bucket is outside `us-east-1`, create `~/.aws/config`:

```ini
[profile my-profile]
# Replace with your bucket's region
region = us-west-2
```
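
Before pointing Pixeltable at a profile, it can help to confirm that the profile name is spelled the same way in both files. The sketch below is a hypothetical helper (`find_profile` is not part of Pixeltable or boto3) that uses only the standard library. It relies on the AWS convention that the credentials file uses a bare section name such as `[my-profile]`, while the config file prefixes named profiles with `profile `.

```python
import configparser
from pathlib import Path


def find_profile(profile: str, aws_dir: Path) -> dict[str, bool]:
    """Report whether `profile` has a section in the AWS credentials and config files.

    Hypothetical helper for illustration; it only inspects section names and
    never reads or prints the keys themselves.
    """
    found = {}
    # credentials: [my-profile]      config: [profile my-profile]
    # (the special "default" profile is not prefixed in either file)
    sections = (
        ('credentials', profile),
        ('config', profile if profile == 'default' else f'profile {profile}'),
    )
    for filename, section in sections:
        parser = configparser.RawConfigParser()
        parser.read(aws_dir / filename)  # missing files are silently skipped
        found[filename] = parser.has_section(section)
    return found
```

For example, `find_profile('my-profile', Path.home() / '.aws')` returns a dict with one boolean per file. A `False` for `config` is fine if you skipped the optional region step.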

Step 3: Tell Pixeltable which AWS profile to use

Choose one:

Option A: Environment variable (temporary, per shell session)
```bash
export PIXELTABLE_S3_PROFILE="my-profile"
```

Option B: Config file (persistent)  
Add to `~/.pixeltable/config.toml`:
```toml
[pixeltable]
s3_profile = "my-profile"
```

For R2 and B2, use `PIXELTABLE_R2_PROFILE` / `r2_profile` or `PIXELTABLE_B2_PROFILE` / `b2_profile` instead.

Step 4: Verify permissions

Your IAM user needs:
- `s3:ListBucket` on `arn:aws:s3:::your-bucket-name` (this permission also authorizes `HeadBucket` requests; IAM has no separate `s3:HeadBucket` action)
- `s3:PutObject` on `arn:aws:s3:::your-bucket-name/your-prefix/*`

Test access:
```bash
aws s3 ls s3://your-bucket-name/your-prefix/ --profile my-profile
```
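
The list and write permissions above can be expressed as an IAM policy document. The following is a sketch rather than a complete production policy: `build_bucket_policy` is a hypothetical helper, and you may need further actions (for example read access) depending on how you use the bucket.

```python
import json


def build_bucket_policy(bucket: str, prefix: str) -> dict:
    """Build a minimal IAM policy for listing a bucket and writing under a prefix."""
    prefix = prefix.strip('/')
    object_arn = (
        f'arn:aws:s3:::{bucket}/{prefix}/*' if prefix else f'arn:aws:s3:::{bucket}/*'
    )
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                # Bucket-level permission; HeadBucket requests are also
                # authorized by s3:ListBucket.
                'Effect': 'Allow',
                'Action': ['s3:ListBucket'],
                'Resource': f'arn:aws:s3:::{bucket}',
            },
            {
                # Object-level permission, limited to the destination prefix.
                'Effect': 'Allow',
                'Action': ['s3:PutObject'],
                'Resource': object_arn,
            },
        ],
    }


print(json.dumps(build_bucket_policy('your-bucket-name', 'your-prefix'), indent=2))
```

Scoping `s3:PutObject` to the prefix keeps the credentials from writing elsewhere in the bucket.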

Destination URI formats for cloud storage:

| **Storage Service** | **Destination URI Format** |
|---------------------|---------------------------|
| Amazon S3 | `s3://bucket-name/path/prefix` |
| Cloudflare R2 | `r2://bucket-name/path/prefix` |
| Backblaze B2 | `b2://bucket-name/path/prefix` |
| Google Cloud Storage | `gs://bucket-name/path/prefix` |
| Azure Blob Storage | `https://account.blob.core.windows.net/container/path` |
| Tigris | `https://t3.storage.dev/bucket-name/path` |

Example: `destination='s3://my-images/thumbnails/'` saves to the `thumbnails/` prefix in the `my-images` S3 bucket.
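
As a quick sanity check on a destination string, the sketch below maps a URI to the service it would address according to the table above. `classify_destination` is a hypothetical helper, not a Pixeltable function, and treating a string without a scheme as a local path is an assumption made for illustration.

```python
from urllib.parse import urlparse

# Scheme-based formats from the table above, plus file:// for local paths.
_SCHEMES = {
    's3': 'Amazon S3',
    'r2': 'Cloudflare R2',
    'b2': 'Backblaze B2',
    'gs': 'Google Cloud Storage',
    'file': 'Local filesystem',
}


def classify_destination(uri: str) -> str:
    """Return the storage service a destination URI refers to, or 'Unknown'."""
    parsed = urlparse(uri)
    if parsed.scheme in _SCHEMES:
        return _SCHEMES[parsed.scheme]
    if parsed.scheme == 'https':
        # Azure and Tigris are identified by host name rather than scheme.
        host = parsed.hostname or ''
        if host.endswith('.blob.core.windows.net'):
            return 'Azure Blob Storage'
        if host == 't3.storage.dev':
            return 'Tigris'
        return 'Unknown'
    if parsed.scheme == '':
        return 'Local filesystem'  # assumption: no scheme means a plain path
    return 'Unknown'
```

A check like this catches typos such as `s3:/bucket` or a missing scheme before any files are written.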


---

## Per-Column Destinations

Specify where each computed column stores its media files by adding a `destination` parameter. This gives you fine-grained control over where each column's media files are stored.

The `destination` parameter accepts blob storage URIs or local paths. See the table above for cloud storage URI formats. For local filesystem paths, use `/path/to/directory` or `file:///path/to/directory` (mainly for development/testing).

Data flow with per-column destinations:

```mermaid
flowchart TD
    A[Computed Column 1<br/>destination='s3://bucket/path1/'] --> B[File stored in<br/>s3://bucket/path1/]
    C[Computed Column 2<br/>destination='s3://bucket/path2/'] --> D[File stored in<br/>s3://bucket/path2/]
    
    B --> E[Pixeltable Database<br/>Metadata + References]
    D --> E
    
    E --> F[Query returns<br/>.fileurl pointing to blob storage]
    
    style B fill:#e1f5ff
    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#e1f5ff
```

For this demo: We'll use a local directory so you can run the example without cloud credentials. In production, replace these with cloud storage URIs from the table above.

Create a local directory for this demo:

In [None]:
from pathlib import Path

local_dest = Path.home() / 'Desktop' / 'pixeltable_outputs'
local_dest.mkdir(parents=True, exist_ok=True)
local_dest


Insert a sample image:


In [None]:
# Insert a sample image from the repo
sample_image = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images/000000000036.jpg'
t.insert(source_image=sample_image)


In [None]:
# Add a computed column with a destination parameter
# Format: destination='s3://bucket-name/path' for cloud storage
# For this demo, using a local path (replace with cloud URI in production)
t.add_computed_column(
    rotated_local=t.source_image.rotate(90),
    destination=str(local_dest),  # In production: destination='s3://my-bucket/rotated/'
)

# Check the results
t.select(t.source_image, t.rotated_local).show()



In [None]:
# Add another computed column with a different destination
# Each column can have its own destination
t.add_computed_column(
    flipped=t.source_image.flip('horizontal'),
    destination=str(local_dest / 'flipped'),  # In production: destination='s3://my-bucket/flipped/'
)

# View the results
t.select(t.source_image, t.rotated_local, t.flipped).show()


In [None]:
# Verify where our files are stored
t.select(
    t.source_image.fileurl,
    t.rotated_local.fileurl,
    t.flipped.fileurl,
).collect()
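
Because this demo writes to a local directory, you can confirm that the files really exist on disk. The sketch below assumes the URLs for locally stored files use the `file://` scheme. `local_path_from_url` is a hypothetical helper that takes URL strings you have already extracted from the query result.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import url2pathname


def local_path_from_url(url: str) -> Optional[Path]:
    """Convert a file:// URL to a local Path; return None for any other scheme."""
    parsed = urlparse(url)
    if parsed.scheme != 'file':
        return None
    # url2pathname undoes percent-encoding (for example %20 becomes a space).
    return Path(url2pathname(parsed.path))
```

For each URL, `local_path_from_url(url)` gives a path you can test with `.exists()`. Cloud URIs return `None`, since they have no local path to check.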

---

## Global Output Destination

Instead of specifying a destination for each computed column, you can set a global default for all generated/computed media. This is configured in `~/.pixeltable/config.toml`.

Data flow with global output destination:

```mermaid
flowchart TD
    A[Computed Column 1<br/>no destination specified] --> B[All computed files]
    C[Computed Column 2<br/>no destination specified] --> B
    D[Computed Column 3<br/>no destination specified] --> B
    
    B --> E[Stored in<br/>Global Output Destination<br/>s3://bucket/computed-media/]
    
    E --> F[Pixeltable Database<br/>Metadata + References]
    
    F --> G[Query returns<br/>.fileurl pointing to blob storage]
    
    style E fill:#e1f5ff
    style F fill:#fff4e1
    style G fill:#e1f5ff
```

When to use this

Use a global output destination when:
- You want all computed media (rotated images, generated videos, etc.) to go to the same place
- You're managing costs by storing computed results in cheaper cloud storage
- You want a "set and forget" configuration

Configuration

Add this to your `~/.pixeltable/config.toml`:

```toml
[pixeltable]
output_media_dest = "s3://your-bucket-name/computed-media/"
```

Or set via environment variable:
```bash
export PIXELTABLE_OUTPUT_MEDIA_DEST="s3://your-bucket-name/computed-media/"
```

How it works

Once configured:
- Any computed column without an explicit `destination` parameter will use this global destination
- Computed columns with an explicit `destination` parameter will use that specific destination instead
- Input media (what you insert) is unaffected and follows normal storage rules

Example

```python
# With global output_media_dest configured to S3:

# This computed column will go to S3 (uses global default)
t.add_computed_column(thumbnail=t.source_image.resize((100, 100)))

# This computed column will go to the specified local path (overrides global default)
t.add_computed_column(
    preview=t.source_image.resize((300, 300)),
    destination='/path/to/local/folder'
)
```


In [None]:
# Check current global output destination (if any)
from pixeltable.env import Env

env = Env.get()
if env.default_output_media_dest:
    print(f"Global output destination is set to: {env.default_output_media_dest}")
else:
    print("No global output destination configured (using default local storage)")

# To set a global output destination, edit ~/.pixeltable/config.toml:
# [pixeltable]
# output_media_dest = "s3://your-bucket-name/computed-media/"
#
# Then restart Pixeltable or run: pxt.init()
#
# After that, any computed column without an explicit destination will use this default

---

## Global Input Destination

When you insert media files into a table (via `t.insert()`), you can configure them to automatically upload to blob storage. This is how Pixeltable integrates with blob storage services like Backblaze B2, Tigris, and Google Cloud Storage: when you insert a local file or URL, it's automatically uploaded to blob storage and a reference is stored in the table.

Check whether a global input destination is already configured:

In [None]:
# Check current global input destination (if any)
env = Env.get()
if env.default_input_media_dest:
    print(f"Global input destination is set to: {env.default_input_media_dest}")
    print("\nWith this configured, all inserted media will be copied to this location.")
else:
    print("No global input destination configured")
    print("Inserted media will be stored in Pixeltable's default local storage.")

# To set a global input destination, edit ~/.pixeltable/config.toml:
# [pixeltable]
# input_media_dest = "s3://your-bucket-name/input-media/"
#
# Then restart Pixeltable or run: pxt.init()
#
# After that, any media you insert will be automatically copied to this destination

Data flow with global input destination:

```mermaid
flowchart TD
    A[Insert: Local file<br/>./my-video.mp4] --> B[Automatic upload]
    C[Insert: URL<br/>https://example.com/image.jpg] --> B
    
    B --> D[File stored in<br/>Global Input Destination<br/>s3://bucket/input-media/]
    
    D --> E[Pixeltable Database<br/>Metadata + References]
    
    E --> F[Query returns<br/>.fileurl pointing to blob storage]
    
    style D fill:#e1f5ff
    style E fill:#fff4e1
    style F fill:#e1f5ff
```

When to use this

Use a global input destination when:
- You want all media files inserted into Pixeltable to automatically upload to your blob storage service
- You're building a multimodal AI pipeline where source media should live in blob storage
- You're using Pixeltable with services like Backblaze B2, Tigris, or Google Cloud Storage and want automatic uploads
- You want Pixeltable to manage the entire data lifecycle, from ingestion to storage

Configuration

Add this to your `~/.pixeltable/config.toml`:

```toml
[pixeltable]
input_media_dest = "s3://your-bucket-name/input-media/"
# For Tigris:
# input_media_dest = "https://t3.storage.dev/your-bucket/input"
# For Backblaze B2:
# input_media_dest = "b2://your-bucket-name/input-media/"
```

Or set via environment variable:
```bash
export PIXELTABLE_INPUT_MEDIA_DEST="s3://your-bucket-name/input-media/"
```

How it works

Once configured:
- When you insert media (images, videos, audio) into a table, Pixeltable automatically uploads those files to your configured blob storage
- Pixeltable maintains references to the files in the table—the actual files live in blob storage
- This happens automatically during insert operations—no manual upload code needed
- Computed columns still follow their own destination rules (either per-column or global output destination)
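
The rules for where a file ends up can be summarized in a few lines. This is an illustrative model of the behavior described in this guide, not Pixeltable's internal code: `resolve_destination` and its parameters are hypothetical names. A return value of `None` stands for Pixeltable's default local storage.

```python
from typing import Optional


def resolve_destination(
    is_computed: bool,
    column_destination: Optional[str] = None,
    output_media_dest: Optional[str] = None,
    input_media_dest: Optional[str] = None,
) -> Optional[str]:
    """Pick the destination for a media file, following the rules in this guide."""
    if is_computed:
        # A per-column destination overrides the global output destination.
        return column_destination or output_media_dest
    # Inserted media only follows the global input destination.
    return input_media_dest
```

For instance, a computed column with `destination='s3://my-bucket/rotated/'` resolves to that URI even when a global output destination is set. The global input destination has no effect on computed columns.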

Example

```python
# With global input_media_dest configured to Tigris, Backblaze B2, or Google Cloud Storage:

# Create a table
videos = pxt.create_table('content', {
    'video': pxt.Video,
    'title': pxt.String
})

# When you insert, the video is automatically uploaded to blob storage
# (rows are passed as a list of dicts)
videos.insert([{
    'video': './my-video.mp4',
    'title': 'My Video'
}])

# Pixeltable maintains a reference to the file in blob storage
# The actual file is stored in your configured blob storage service
```

Important notes

- Uploads happen as part of the insert, so no additional upload code is needed
- Pixeltable manages references, so you can query and work with the data as if it were local
- This works with services like [Backblaze B2](https://www.backblaze.com/blog/building-multimodal-ai-data-infrastructure-with-pixeltable/), [Tigris](https://www.tigrisdata.com/docs/quickstarts/pixeltable/), and Google Cloud Storage

---

## Getting URLs for Serving Media

One of the key benefits of storing media in blob storage is that you can serve files directly to browsers and applications. Pixeltable provides queryable URLs through the `.fileurl` property that point to your files in blob storage.

After configuring destinations and computing columns, query the URLs where your files are stored by adding `.fileurl` to your select statements:

```python
# Query media with their blob storage URLs
t.select(
    t.source_image,
    t.source_image.fileurl,
    t.rotated_local,
    t.rotated_local.fileurl
).collect()

# Filter and get URLs for specific rows
# (`some_condition` and `processed_image` are placeholders for your own columns)
t.where(t.some_condition).select(
    t.source_image.fileurl,
    t.processed_image.fileurl
).collect()
```

Benefits:
- Direct serving: URLs point directly to your blob storage, no proxy needed
- Scalable: Blob storage services handle the traffic and bandwidth
- CDN-ready: Many blob storage services offer CDN integration for fast global delivery
- Queryable: Get URLs through Pixeltable queries, filter by metadata, then serve the results

This makes it easy to build applications that serve media directly from blob storage while using Pixeltable to manage and query your data.


In [None]:
# Get queryable URLs for serving media from blob storage
# Query with .fileurl to get URLs that can be used directly in HTML, APIs, or applications
t.select(
    t.source_image,
    t.source_image.fileurl,
    t.rotated_local,
    t.rotated_local.fileurl,
    t.flipped,
    t.flipped.fileurl
).collect()
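
Once you have the URLs, turning them into markup is straightforward. The sketch below is a hypothetical helper (`img_tags` is not part of Pixeltable) that takes a list of URL strings extracted from a query result and emits `<img>` tags. It only emits tags for `http` and `https` URLs, because whether a given URL loads directly in a browser depends on the storage service and on the bucket's access settings.

```python
from html import escape
from urllib.parse import urlparse


def img_tags(urls):
    """Return one <img> tag per http(s) URL, skipping other schemes."""
    tags = []
    for url in urls:
        if urlparse(url).scheme in ('http', 'https'):
            # Escape the URL so characters like & are valid inside the attribute.
            tags.append(f'<img src="{escape(url, quote=True)}" loading="lazy">')
    return tags
```

URLs with other schemes, such as `file://` in this local demo, are skipped rather than rendered as broken images.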


---

## Summary: Choosing the Right Configuration

Here's a quick decision guide organized by what you're trying to store:

Storing Media That Pixeltable Generates

| **Scenario** | **Recommended Approach** |
|-------------|----------------------|
| Different computed columns need different storage locations | Per-column destinations |
| All computed columns should go to the same blob storage service | Global output destination |
| You want fine-grained control over each column's storage | Per-column destinations |

Storing Media That You Insert

| **Scenario** | **Recommended Approach** |
|-------------|----------------------|
| You want all inserted media to automatically upload to blob storage | Global input destination |
| You're building a system where source media should live in blob storage | Global input destination |

Using Both

| **Scenario** | **Recommended Approach** |
|-------------|----------------------|
| You want all media (input and computed) in the same blob storage service | Global input + Global output destinations |

Key Takeaways

1. Pixeltable manages metadata and workflows, and blob storage holds the actual media files. Together they form an integrated system, not just storage
2. Per-column destinations (`destination=` parameter) give you fine-grained control over where computed results are stored
3. Global output destination (`output_media_dest` in config) automatically stores all generated/computed media in your blob storage service
4. Global input destination (`input_media_dest` in config) automatically uploads inserted media to blob storage—this is how Pixeltable integrates with services like Backblaze B2, Tigris, and Google Cloud Storage
5. Per-column destinations always override global settings when specified

---

## Learn More

- [Pixeltable Configuration](https://docs.pixeltable.com/overview/configuration)
- [Working with External Files](https://docs.pixeltable.com/notebooks/feature-guides/working-with-external-files)
- [API Reference: add_computed_column](https://docs.pixeltable.com/api/pixeltable/table#pixeltable.Table.add_computed_column)

If you have questions about blob storage destinations, reach out on our [Discord community](https://pixeltable.com/discord)!
