# Sharing and Replicating Datasets

Learn how to publish datasets to Pixeltable Cloud and replicate datasets from the cloud to your local environment.

## Overview

Pixeltable Cloud enables you to:
- **Publish** your datasets for sharing with teams or the public
- **Replicate** datasets from the cloud to your local environment
- Share multimodal AI datasets (images, videos, audio, documents) without managing infrastructure

This guide demonstrates both publishing and replicating datasets.

## Replicating Datasets

You can replicate any public dataset from Pixeltable Cloud to your local environment without needing an account or API key.

### Setup

In [1]:
%pip install -qU pixeltable

Note: you may need to restart the kernel to use updated packages.


In [3]:
import pixeltable as pxt

### Replicate a Public Dataset

Let's replicate a mini-version of the COCO-2017 dataset from Pixeltable Cloud. You can find this dataset at [pixeltable.com/t/pixeltable:fiftyone/coco_mini_2017](https://www.pixeltable.com/t/pixeltable:fiftyone/coco_mini_2017).

**Note:** You can replicate a specific version by adding `:version` to the URI (e.g., `'pxt://org/table:5'` for version 5). Table paths in Pixeltable must use underscores, not dashes (e.g., `'my_table'` not `'my-table'`).

See the [replicate() SDK reference](https://docs.pixeltable.com/sdk/latest/pixeltable#func-replicate) for full documentation.
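
The two rules in the note above (version suffixes and underscore-only table names) can be sketched with plain string handling. Both `to_table_name` and `with_version` are hypothetical helpers written for this guide, not part of the Pixeltable API, and they assume nothing beyond what the note states.

```python
def to_table_name(name: str) -> str:
    """Hypothetical helper: replace dashes with underscores in a table name."""
    return name.replace('-', '_')


def with_version(uri: str, version: int) -> str:
    """Hypothetical helper: pin a URI to one version by appending ':version'."""
    return f'{uri}:{version}'


print(to_table_name('my-table'))
print(with_version('pxt://org/table', 5))
```

The resulting strings can then be passed as the `local_path` and `remote_uri` arguments shown in the next cell.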

In [None]:
# Replicate a public dataset to your local environment
coco_copy = pxt.replicate(
    remote_uri='pxt://pixeltable:fiftyone/coco_mini_2017',
    local_path='coco_copy'
)

You can check that the replica exists at the local path with `list_tables()`.

In [None]:
pxt.list_tables()

### Working with Replicas

Replicated datasets are read-only locally, but you can still query them, run similarity searches, reopen them in later sessions, and copy them into new tables:

**1. Query and explore the data**

In [6]:
# View the replicated data
coco_copy.limit(3).collect()

image,coco_id,num_detections,width,height,caption
,41,5,640,427,"A person wearing a helmet and protective gear rides a skateboard down a residential street, with houses and parked cars visible in the background."
,47,9,426,640,"A young man in a red shirt and black cap is cooking at a grill in a diner-like setting, while various condiments and kitchen utensils are visible on the counter and walls."
,44,1,640,427,"A brown eagle with a white head is flying low over a body of water, its wings spread wide against the dark, rippling surface."


**2. Perform similarity searches**

Replicas include embedding indexes, so you can immediately perform similarity searches:

In [7]:
# Get a sample image to search with
sample_img = coco_copy.select(coco_copy.image).limit(1).collect()[0]['image']

# Perform similarity search using the replica's embedding index
sim = coco_copy.image.similarity(sample_img)
results = (
    coco_copy
    .order_by(sim, asc=False)  # Order by descending similarity
    .limit(5)  # Get top 5 similar images
    .select(coco_copy.image, sim)
    .collect()
)
results


image,similarity
,1.0
,0.708
,0.669
,0.607
,0.606


**3. Access replicas in new sessions**

In a new Pixeltable session, use `list_tables()` and `get_table()` to access your replicas:

In [None]:
# List all tables to see your replica
pxt.list_tables()

In [None]:
# Assign a handle to the replica
coco_copy = pxt.get_table('coco_copy')

**4. Create an independent copy**

Because a replica is read-only, create an independent table with the replica as the source when you need a copy of your own:

In [None]:
# Create a fresh table with values only
my_coco = pxt.create_table('my_coco_table', source=coco_copy)

The new table copies the values from the source but not its computational definitions, and it does not receive updates when the source table changes.

### Updating Replicas with Pull

If the upstream table changes, you can update your local replica using `pull()`:

In [None]:
# Update your local replica with changes from the cloud
coco_copy.pull()

This synchronizes your local replica with any updates made to the source dataset.

## Publishing Datasets

**Requirements:**
- A Pixeltable Cloud account (Community Edition includes 1 TB of storage; see [pricing](https://www.pixeltable.com/pricing))
- Your API key from the [account dashboard](https://pixeltable.com/dashboard)

Publishing allows you to share your datasets with your team or make them publicly available.

### Configure Your API Key

Pixeltable looks for your API key in the `PIXELTABLE_API_KEY` environment variable. Choose one of these methods:

**Option 1: In your notebook (secure and convenient)**

Run this cell to securely enter your API key (get it from [pixeltable.com/dashboard](https://pixeltable.com/dashboard)):

In [None]:
from getpass import getpass
import os

os.environ['PIXELTABLE_API_KEY'] = getpass('Pixeltable API Key:')

**Option 2: Environment variable**

Add to your `~/.zshrc` or `~/.bashrc`:
```bash
export PIXELTABLE_API_KEY='your-api-key-here'
```

**Option 3: Config file**

Add to `~/.pixeltable/config.toml`:
```toml
[pixeltable]
api_key = 'your-api-key-here'
```

See the [Configuration Guide](https://docs.pixeltable.com/docs/overview/configuration) for details.
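
Before publishing, it can help to confirm that a key is visible to the current process. The following is a minimal sketch that checks only the `PIXELTABLE_API_KEY` environment variable (Options 1 and 2). It does not read `config.toml`, and `api_key_is_set` is a hypothetical helper, not a Pixeltable function.

```python
import os
from typing import Mapping, Optional


def api_key_is_set(env: Optional[Mapping[str, str]] = None) -> bool:
    """Hypothetical helper: True if PIXELTABLE_API_KEY holds a non-empty value."""
    env = os.environ if env is None else env
    return bool(env.get('PIXELTABLE_API_KEY', '').strip())


if not api_key_is_set():
    # Only the environment is inspected; a key stored in config.toml is not detected here.
    print('PIXELTABLE_API_KEY is not set in this process.')
```

Passing a mapping explicitly, as in `api_key_is_set({'PIXELTABLE_API_KEY': 'abc'})`, makes the check easy to exercise without touching the real environment.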

### Create a Sample Dataset

Let's create a table to publish, using images from the Pixeltable GitHub repository. The `comment` parameter provides a description that will be visible on Pixeltable Cloud:

In [None]:
# Create a fresh directory
pxt.drop_dir('sample_images', force=True)
pxt.create_dir('sample_images')

In [None]:
t = pxt.create_table(
    'sample_images.photos',
    schema={'image': pxt.Image, 'description': pxt.String},
    comment='Sample image dataset for demonstrating Pixeltable Cloud publishing'
)

base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
t.insert([
    {'image': f'{base_url}/000000000009.jpg', 'description': 'Kitchen scene'},
    {'image': f'{base_url}/000000000025.jpg', 'description': 'Street view'},
    {'image': f'{base_url}/000000000042.jpg', 'description': 'Indoor setting'},
])

### Publish Your Dataset

Publish your table to Pixeltable Cloud with a destination URI in the format `pxt://orgname/dataset`. You can use either the table path or the table handle.

See the [publish() SDK reference](https://docs.pixeltable.com/sdk/latest/pixeltable#func-publish) for full documentation.

In [None]:
# Option 1: Publish using table path (string)
pxt.publish(
    source='sample_images.photos',  # Table path from list_tables()
    destination_uri='pxt://your-orgname/sample-images',
    access='private'  # or 'public' for public access
)

# Option 2: Publish using table handle
# pxt.publish(
#     source=t,  # Table handle you assigned
#     destination_uri='pxt://your-orgname/sample-images',
#     access='private'
# )

### Updating Published Datasets with Push

After you've published a dataset, you can update the cloud replica with local changes using `push()`:

In [None]:
# Make some changes to your local table
t.insert([{'image': f'{base_url}/000000000049.jpg', 'description': 'Outdoor scene'}])

# Push the changes to your published dataset
t.push()

This updates the published dataset on Pixeltable Cloud with your local changes.

Others can then replicate the published dataset with:

```python
import pixeltable as pxt
sample_images = pxt.replicate('pxt://your-orgname/sample-images', 'sample_images_copy')
```

## Access Control

When publishing datasets, you can control access:

- `access='public'`: Anyone can replicate your dataset
- `access='private'`: Only your team members can access it

Manage team access and permissions in your [Pixeltable dashboard](https://pixeltable.com/dashboard).

## Databases and Storage

**Understanding the URI format:**

The full URI format for Pixeltable Cloud is:
```
pxt://org:database/path
```

- **org**: Your organization name
- **database**: Database name (optional - defaults to `main`)
- **path**: Directory and table path

**Examples:**
- `pxt://orgname/my-dataset` → uses the `main` database (default)
- `pxt://orgname:main/my-dataset` → explicitly specifies `main` database
- `pxt://orgname:analytics/my-dataset` → uses the `analytics` database

**Key concepts:**
- Every Pixeltable Cloud account has a `main` database by default
- Each database has its own default storage bucket
- If you omit `:database` from the URI, `main` is used
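
To make the defaulting rule concrete, here is an illustrative parser for the `pxt://org:database/path` format described above. It is a sketch written for this guide, not Pixeltable's own parsing logic: `PxtUri` and `parse_pxt_uri` are hypothetical names, and the optional `:version` suffix is deliberately not handled.

```python
from typing import NamedTuple


class PxtUri(NamedTuple):
    org: str
    database: str
    path: str


def parse_pxt_uri(uri: str) -> PxtUri:
    """Illustrative parser for 'pxt://org[:database]/path' (hypothetical helper)."""
    prefix = 'pxt://'
    if not uri.startswith(prefix):
        raise ValueError(f'expected a URI starting with {prefix!r}: {uri!r}')
    # Split 'org[:database]' from the directory and table path.
    location, sep, path = uri[len(prefix):].partition('/')
    if not sep or not path:
        raise ValueError(f'missing table path in {uri!r}')
    org, _, database = location.partition(':')
    if not org:
        raise ValueError(f'missing organization name in {uri!r}')
    # An omitted database falls back to 'main', as described above.
    return PxtUri(org=org, database=database or 'main', path=path)


print(parse_pxt_uri('pxt://orgname/my-dataset'))
print(parse_pxt_uri('pxt://orgname:analytics/my-dataset'))
```

The first call reports `main` as the database because none was given; the second reports `analytics`.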

## See Also

- [Pixeltable Cloud Documentation](https://docs.pixeltable.com/cloud/)
- [SDK Reference](https://docs.pixeltable.com/sdk/latest/)