# Export data for ML training

Export Pixeltable data as a PyTorch dataset that plugs directly into `DataLoader` for model training.

## Problem

You have prepared training data (images with labels, text with embeddings, or other multimodal data) and need to export it for PyTorch model training.

| Data type | Use case |
|-----------|----------|
| Images + labels | Image classification |
| Text + embeddings | Fine-tuning embeddings |
| Audio + transcripts | Speech model training |

## Solution

**What's in this recipe:**
- Convert query results to a PyTorch dataset
- Use the dataset with `DataLoader` for batch training
- Export to Parquet for external tools

Call `to_pytorch_dataset()` on a query to create an iterable dataset that is compatible with PyTorch's `DataLoader`.

### Setup

In [None]:
%pip install -qU pixeltable torch torchvision

In [None]:
import pixeltable as pxt
import torch
from torch.utils.data import DataLoader

In [None]:
# Create a fresh directory
pxt.drop_dir('pytorch_demo', force=True)
pxt.create_dir('pytorch_demo')

### Create sample training data

In [None]:
# Create table with images and labels
training_data = pxt.create_table(
    'pytorch_demo.training_data',
    {'image': pxt.Image, 'label': pxt.Int}
)

In [None]:
# Insert sample images with labels
base_url = 'https://raw.githubusercontent.com/pixeltable/pixeltable/main/docs/resources/images'
samples = [
    {'image': f'{base_url}/000000000036.jpg', 'label': 0},  # cat
    {'image': f'{base_url}/000000000090.jpg', 'label': 1},  # other
    {'image': f'{base_url}/000000000139.jpg', 'label': 1},  # other
]
training_data.insert(samples)
print(f"Inserted {len(samples)} training samples")

### Export to PyTorch dataset

In [None]:
# Add a resize step to ensure all images have the same size
training_data.add_computed_column(
    image_resized=training_data.image.resize((224, 224))
)

# Convert to PyTorch dataset
# 'pt' format returns images as CxHxW tensors with values in [0,1]
pytorch_dataset = training_data.select(
    training_data.image_resized,
    training_data.label
).to_pytorch_dataset(image_format='pt')

print(f"Dataset type: {type(pytorch_dataset)}")

In [None]:
# Use with PyTorch DataLoader
dataloader = DataLoader(pytorch_dataset, batch_size=2)

# Iterate through batches
for batch in dataloader:
    print(f"Image batch shape: {batch['image_resized'].shape}")
    print(f"Label batch: {batch['label']}")
    break  # Just show first batch
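
The batches produced above are dictionaries keyed by column name, so they can feed a training loop directly. The following is a minimal sketch of such a loop, not a recommended training setup. `SyntheticRows` is a hypothetical stand-in that yields random tensors shaped like the rows in this recipe so the example runs on its own, and the linear model, learning rate, and loss are illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, IterableDataset


class SyntheticRows(IterableDataset):
    """Hypothetical stand-in yielding dicts shaped like the rows in this recipe."""

    def __init__(self, num_rows: int = 4):
        self.num_rows = num_rows

    def __iter__(self):
        for i in range(self.num_rows):
            yield {
                'image_resized': torch.rand(3, 224, 224),  # CxHxW floats in [0, 1)
                'label': i % 2,                             # integer class label
            }


torch.manual_seed(0)
loader = DataLoader(SyntheticRows(), batch_size=2)

# Tiny linear classifier, used only to show the loop mechanics
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

losses = []
for batch in loader:
    images = batch['image_resized']  # shape: (batch, 3, 224, 224)
    labels = batch['label']          # shape: (batch,), collated to int64
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"Ran {len(losses)} training steps")
```

To try this with real data, replace `SyntheticRows()` with the `pytorch_dataset` created earlier; the dictionary keys match the selected column names.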

### Export to Parquet for external tools

In [None]:
import tempfile
from pathlib import Path

# Export to Parquet for use with other ML tools
export_path = Path(tempfile.mkdtemp()) / 'training_data'

pxt.io.export_parquet(
    training_data.select(training_data.label),  # Label column only; image columns are left out of this export
    parquet_path=export_path
)
print(f"Exported to: {export_path}")

## Explanation

**Export methods:**

| Method | Output | Use case |
|--------|--------|----------|
| `to_pytorch_dataset()` | PyTorch IterableDataset | Direct training |
| `export_parquet()` | Parquet files | External tools, sharing |
| `export_lancedb()` | LanceDB | Vector search apps |
| `to_coco_dataset()` | COCO JSON | Object detection |

**Image format options:**

| Format | Shape | Values | Use |
|--------|-------|--------|-----|
| `'pt'` | CxHxW | [0, 1] float32 | PyTorch models |
| `'np'` | HxWxC | [0, 255] uint8 | NumPy processing |
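
The two layouts in the table differ only in axis order and value scaling. The sketch below shows that relationship with plain NumPy on a hypothetical random image; it illustrates the layouts only and is not a description of how Pixeltable performs the conversion internally.

```python
import numpy as np

# Hypothetical 'np'-style image: height 4, width 6, 3 channels, uint8 in [0, 255]
rng = np.random.default_rng(0)
hwc_uint8 = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)

# Move channels first (HxWxC -> CxHxW) and rescale to float32 in [0, 1]
chw_float = hwc_uint8.transpose(2, 0, 1).astype(np.float32) / 255.0

print(chw_float.shape, chw_float.dtype)
```

Channel-first float tensors are what most PyTorch vision models expect, while the channel-last `uint8` layout is convenient for NumPy-based image processing.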

**DataLoader tips:**
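- `to_pytorch_dataset()` caches the exported data on disk, so repeated loading does not re-run the query
- Set `num_workers` above 0 in `DataLoader` to load data in parallel
- Filter and transform data in the query before exporting to reduce the exported size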

## See also

- [Sample data for training](https://docs.pixeltable.com/howto/cookbooks/data/data-sampling) - Stratified sampling
- [Import Parquet files](https://docs.pixeltable.com/howto/cookbooks/data/data-import-parquet) - Parquet import/export