# FFCV Dataset Support

This notebook demonstrates slipstream's support for FFCV `.beton`/`.ffcv` files.

**Supported formats:**
- Local `.ffcv` / `.beton` files
- Remote S3 paths (auto-download with s5cmd or fsspec)
- Automatic format detection via `SlipstreamDataset`
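
As a rough illustration of the detection and download choices listed above, the sketch below classifies a path by its extension and picks a download tool. The helper names (`is_ffcv_path`, `is_remote`, `pick_downloader`) are hypothetical and not part of slipstream, whose actual detection logic may differ.

```python
import shutil
from pathlib import PurePosixPath

FFCV_SUFFIXES = {".ffcv", ".beton"}  # the two extensions named in this notebook


def is_ffcv_path(path: str) -> bool:
    """Return True if the path ends in an FFCV extension (hypothetical helper)."""
    return PurePosixPath(path).suffix.lower() in FFCV_SUFFIXES


def is_remote(path: str) -> bool:
    """Return True for S3 URLs, which would need a download step first."""
    return path.startswith("s3://")


def pick_downloader() -> str:
    """Prefer the s5cmd binary when it is on PATH, otherwise fall back to fsspec."""
    return "s5cmd" if shutil.which("s5cmd") else "fsspec"


print(is_ffcv_path("s3://bucket/data.ffcv"), is_remote("s3://bucket/data.ffcv"))
```

Checking the suffix only is the simplest possible rule; a more careful implementation might also inspect the file header.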

**What's happening under the hood:**
- `FFCVFileReader` reads the binary FFCV format directly (no FFCV dependency needed)
- Image bytes, labels, and metadata are extracted from the memory-mapped file
- `SlipstreamLoader` builds an optimized cache for fast iteration
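
To make the memory-mapping idea concrete, here is a minimal sketch that writes a toy binary file and reads one record back through `mmap` without loading the whole file. The layout (a record count, then fixed-size label/offset/length entries, then a byte blob) is invented for illustration and is not the real FFCV format. `write_toy_file` and `read_sample` are hypothetical names.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<I")    # toy header: number of records
ENTRY = struct.Struct("<IQQ")   # toy entry: label, byte offset, byte length


def write_toy_file(path, samples):
    """Write (label, payload) pairs in the toy layout described above."""
    data_start = HEADER.size + ENTRY.size * len(samples)
    entries, blob = [], b""
    for label, payload in samples:
        entries.append(ENTRY.pack(label, data_start + len(blob), len(payload)))
        blob += payload
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(samples)))
        f.write(b"".join(entries))
        f.write(blob)


def read_sample(mm, index):
    """Look up one record in the mapped file and copy out only its bytes."""
    (count,) = HEADER.unpack_from(mm, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    label, offset, length = ENTRY.unpack_from(mm, HEADER.size + index * ENTRY.size)
    return {"label": label, "image": bytes(mm[offset:offset + length])}


path = os.path.join(tempfile.mkdtemp(), "toy.bin")
write_toy_file(path, [(7, b"first-image"), (3, b"second")])

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    record = read_sample(mm, 1)

print(record)  # {'label': 3, 'image': b'second'}
```

Because the index entries have a fixed size, finding record `i` is a single offset computation, and only the bytes of the requested sample are copied out of the mapping.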

## 1. Dataset Creation via SlipstreamDataset

`SlipstreamDataset` auto-detects `.ffcv`/`.beton` files regardless of which path argument receives them (see section 3 for the equivalent forms).

In [None]:
from slipstream import SlipstreamDataset

FFCV_VAL_PATH = "s3://visionlab-datasets/imagenet1k/pre-processed/s256-l512-jpgbytes-q100-ffcv/imagenet1k-s256-l512-jpg-q100-cs100-val-7ac6386e.ffcv"

# Just pass the path — auto-detected as FFCV
dataset = SlipstreamDataset(FFCV_VAL_PATH)

print(f"Dataset type: {type(dataset).__name__}")
print(f"Reader type: {type(dataset._reader).__name__}")
print(f"Number of samples: {len(dataset):,}")
print(f"Field types: {dataset.field_types}")
print(f"Image fields: {dataset.image_fields}")

In [None]:
# repr shows FFCV-specific info
dataset

## 2. Inspect Samples

In [None]:
from slipstream import decode_image

sample = dataset[0]
print(f"Sample keys: {list(sample.keys())}")
print(f"Label: {sample['label']}")
print(f"Image bytes (first 20): {sample['image'][:20]}...")
print(f"Image size: {len(sample['image']):,} bytes")
print(f"Path: {sample['path']}")
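
The `image` field holds encoded bytes rather than pixels. Assuming the images are stored as JPEG, as the `jpgbytes` part of the dataset path suggests, a quick sanity check is to look for the JPEG start-of-image signature. `looks_like_jpeg` below is a hypothetical helper, not a slipstream function.

```python
JPEG_SIGNATURE = b"\xff\xd8\xff"  # start-of-image marker plus the lead byte of the next marker


def looks_like_jpeg(data) -> bool:
    """Return True if the buffer begins with the JPEG signature."""
    return bytes(data[:3]) == JPEG_SIGNATURE


print(looks_like_jpeg(b"\xff\xd8\xff\xe0\x00\x10JFIF"))  # True
print(looks_like_jpeg(b"\x89PNG\r\n\x1a\n"))             # False
```

In the notebook this could be called as `looks_like_jpeg(sample['image'])` before decoding.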

In [None]:
# Decode and display
img = decode_image(sample['image'], to_pil=True)
print(f"Decoded size: {img.size}")
img

In [None]:
# Or use decode_images=True for automatic decoding
dataset_decoded = SlipstreamDataset(FFCV_VAL_PATH, decode_images=True)
sample = dataset_decoded[0]
print(f"Image type: {type(sample['image']).__name__}")
sample['image']

## 3. Alternative Creation Methods

In [None]:
# All of these are equivalent:
dataset1 = SlipstreamDataset(FFCV_VAL_PATH)                    # positional (remote_dir)
dataset2 = SlipstreamDataset(remote_dir=FFCV_VAL_PATH)         # explicit remote_dir
dataset3 = SlipstreamDataset(input_dir=FFCV_VAL_PATH)          # input_dir also works

print(f"All have same length: {len(dataset1) == len(dataset2) == len(dataset3)}")
print(f"All use FFCVFileReader: {all(type(d._reader).__name__ == 'FFCVFileReader' for d in [dataset1, dataset2, dataset3])}")

In [None]:
from slipstream.readers import FFCVFileReader

# Or create the reader directly (power user)
reader = FFCVFileReader(FFCV_VAL_PATH)
print(reader)

## 4. SlipstreamLoader with Pipelines

In [None]:
from slipstream import SlipstreamLoader, DecodeCenterCrop

dataset = SlipstreamDataset(FFCV_VAL_PATH)

loader = SlipstreamLoader(
    dataset,
    batch_size=32,
    pipelines={
        "image": [
            DecodeCenterCrop(size=224),
        ]
    },
    force_rebuild=True,
)

batch = next(iter(loader))
print(f"Batch keys: {list(batch.keys())}")
print(f"Image shape: {batch['image'].shape}")
print(f"Image dtype: {batch['image'].dtype}")
print(f"Labels: {batch['label'][:8].tolist()}...")
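
For intuition about which pixels a 224-pixel center crop selects, the sketch below computes the crop box for a given image size. It assumes a plain center crop with no resizing, which may not match what `DecodeCenterCrop` does internally. `center_crop_box` is a hypothetical helper.

```python
def center_crop_box(width: int, height: int, size: int) -> tuple:
    """Return (left, top, right, bottom) of a size x size box centered in the image."""
    if size > width or size > height:
        raise ValueError("crop size exceeds image dimensions")
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


print(center_crop_box(341, 256, 224))  # (58, 16, 282, 240)
```

The returned tuple uses the same (left, upper, right, lower) order that Pillow's `Image.crop` expects.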

### Using Pipeline Presets

In [None]:
from slipstream import SlipstreamDataset, SlipstreamLoader
from slipstream.pipelines import supervised_train, supervised_val

FFCV_VAL_PATH = "s3://visionlab-datasets/imagenet1k/pre-processed/s256-l512-jpgbytes-q100-ffcv/imagenet1k-s256-l512-jpg-q100-cs100-val-7ac6386e.ffcv"

# Pass the path directly; the format is auto-detected as FFCV
dataset = SlipstreamDataset(FFCV_VAL_PATH)

# Validation preset
val_loader = SlipstreamLoader(
    dataset,
    batch_size=32,
    shuffle=False,
    pipelines=supervised_val(size=224, device="cpu"),
    # exclude_fields=['path']
)
batch = next(iter(val_loader))
print(f"Val batch: {batch['image'].shape}, dtype={batch['image'].dtype}")

In [None]:
from slipstream import SlipstreamLoader

# Training preset
train_loader = SlipstreamLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    pipelines=supervised_train(size=224, seed=42, device="cpu"),
    # exclude_fields=['path']
)

batch = next(iter(train_loader))
print(f"Train batch: {batch['image'].shape}, dtype={batch['image'].dtype}")

## 5. Visualize

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 4, figsize=(12, 6))

for i, ax in enumerate(axes.flat):
    sample = dataset[i * 100]
    img = decode_image(sample['image'], to_pil=True)
    ax.imshow(img)
    ax.set_title(f"label={sample['label']}")
    ax.axis('off')

plt.suptitle("FFCV Dataset Samples", fontsize=14)
plt.tight_layout()
plt.show()

## Summary

**Dataset Creation:**
```python
from slipstream import SlipstreamDataset
from slipstream.readers import FFCVFileReader

# Auto-detection (recommended)
dataset = SlipstreamDataset("s3://bucket/data.ffcv")
dataset = SlipstreamDataset(local_dir="/path/to/data.beton")

# Explicit reader
reader = FFCVFileReader("/path/to/data.ffcv")
```

**DataLoader Creation:**
```python
from slipstream import SlipstreamLoader
from slipstream.pipelines import supervised_train

# High-performance training
loader = SlipstreamLoader(
    dataset,
    batch_size=256,
    pipelines=supervised_train(size=224, device="cuda"),
)
```