# SlipstreamDataset Basics

This notebook demonstrates basic usage of `SlipstreamDataset` for loading datasets from LitData streaming directories, FFCV files, and image folders.

## Features

- **Auto-detection**: Automatically detects format (LitData streaming, FFCV, ImageFolder)
- **Composition pattern**: Wraps pluggable readers (StreamingReader, FFCVFileReader, SlipstreamImageFolder)
- **Flexible decoding**: Raw bytes (for training) or decoded images (for exploration)
- **Pipeline support**: Per-field transforms via `SlipstreamLoader`
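
The composition pattern can be pictured with a minimal sketch. The names below (`Reader`, `ListReader`, `WrappedDataset`) are hypothetical and are not part of the slipstream API; they only illustrate a dataset that delegates storage access to a pluggable reader and optionally decodes on top of it.

```python
from typing import Any, Callable, Dict, Optional, Protocol


class Reader(Protocol):
    """Hypothetical reader interface: anything indexable with a length."""

    def __len__(self) -> int: ...

    def __getitem__(self, index: int) -> Dict[str, Any]: ...


class ListReader:
    """Toy in-memory reader standing in for a real storage backend."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self) -> int:
        return len(self._samples)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self._samples[index]


class WrappedDataset:
    """Delegates storage access to a reader; optionally decodes the image field."""

    def __init__(self, reader: Reader, decode: Optional[Callable[[bytes], Any]] = None):
        self.reader = reader
        self.decode = decode

    def __len__(self) -> int:
        return len(self.reader)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        sample = dict(self.reader[index])  # copy so the reader's data is untouched
        if self.decode is not None:
            sample["image"] = self.decode(sample["image"])
        return sample


reader = ListReader([{"image": b"\xff\xd8\xff", "label": 3}])
raw = WrappedDataset(reader)                  # raw bytes pass through
decoded = WrappedDataset(reader, decode=len)  # `len` stands in for a real decoder
```

Swapping the reader changes where samples come from without touching the decoding logic, which is the property the feature list describes.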

In [None]:
# Test dataset paths
LITDATA_VAL_PATH = "s3://visionlab-datasets/imagenet1k/pre-processed/s256-l512-jpgbytes-q100-streaming/val/"
FFCV_VAL_PATH = "s3://visionlab-datasets/imagenet1k/pre-processed/s256-l512-jpgbytes-q100-ffcv/imagenet1k-s256-l512-jpg-q100-cs100-val-7ac6386e.ffcv"

## 1. Basic Usage: Load and Inspect Dataset

Use `decode_images=True` for interactive exploration. The reader type is auto-detected from the path.

In [None]:
from slipstream import SlipstreamDataset

# Create dataset with automatic decoding (for exploration)
dataset = SlipstreamDataset(
    remote_dir=LITDATA_VAL_PATH,
    decode_images=True,
    to_pil=True,
)

# Show dataset info (`_reader` is a private attribute, accessed here only for inspection)
print(f"Reader type: {type(dataset._reader).__name__}")
dataset

In [None]:
# Get a sample
sample = dataset[0]
print(f"Sample keys: {list(sample.keys())}")
print(f"Image type: {type(sample['image'])}")
print(f"Label: {sample['label']}")

In [None]:
# Display the image
sample['image']

## 2. Raw Bytes Mode (for high-performance training)

For training, use `decode_images=False` and let `SlipstreamLoader` handle decoding.

In [None]:
# Create dataset WITHOUT automatic decoding
# This is what you'd use with SlipstreamLoader for training
dataset_raw = SlipstreamDataset(
    remote_dir=LITDATA_VAL_PATH,
    decode_images=False,
)

dataset_raw

In [None]:
# Get raw sample
sample_raw = dataset_raw[0]
print(f"Image type: {type(sample_raw['image'])}")
print(f"Image size: {len(sample_raw['image'])} bytes")
print(f"First 16 bytes (JPEG header): {sample_raw['image'][:16].hex()}")
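
Raw JPEG bytes can be inspected without decoding any pixels. The sketch below checks the start-of-image marker and reads the dimensions from the first start-of-frame segment. The helper names `is_jpeg` and `jpeg_size` are hypothetical and use only the standard library; they are not part of slipstream. The demo buffer is a synthetic header, not a real image.

```python
import struct

# Start-of-frame markers that carry image dimensions (0xC0-0xCF,
# excluding DHT 0xC4, JPG 0xC8, and DAC 0xCC, which are not frame headers).
_SOF_MARKERS = {m for m in range(0xC0, 0xD0) if m not in (0xC4, 0xC8, 0xCC)}


def is_jpeg(data: bytes) -> bool:
    """True if the buffer starts with the JPEG start-of-image marker."""
    return data[:2] == b"\xff\xd8"


def jpeg_size(data: bytes):
    """Return (width, height) from the first SOF segment, or None if not found."""
    if not is_jpeg(data):
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None  # lost sync with the marker structure
        marker = data[pos + 1]
        if marker == 0xFF:
            pos += 1  # fill byte before a marker
            continue
        if marker in (0xD9, 0xDA):
            return None  # end of image or start of scan: no frame header seen
        if marker == 0x01 or 0xD0 <= marker <= 0xD8:
            pos += 2  # standalone markers have no length field
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in _SOF_MARKERS:
            if pos + 9 > len(data):
                return None  # truncated frame header
            # Layout after the length: precision (1 byte), height (2), width (2)
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            return width, height
        pos += 2 + length
    return None


# Synthetic header: SOI, a 16-byte APP0 segment, then an SOF0 segment (256 x 341)
demo = (
    b"\xff\xd8"
    + b"\xff\xe0" + struct.pack(">H", 16) + b"JFIF\x00" + bytes(9)
    + b"\xff\xc0" + struct.pack(">HBHHB", 17, 8, 256, 341, 3) + bytes(9)
)
print(is_jpeg(demo), jpeg_size(demo))
```

In the notebook, the same helpers could be applied to `sample_raw['image']`, assuming that field holds a complete JPEG byte string as the printed header suggests.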

In [None]:
# Manual decoding (the step SlipstreamLoader otherwise performs for you)
from slipstream import decode_image

image_tensor = decode_image(sample_raw['image'], to_pil=False)
print(f"Decoded tensor shape: {image_tensor.shape}")
print(f"Decoded tensor dtype: {image_tensor.dtype}")

## 3. SlipstreamLoader with Pipeline Presets

For training, use `SlipstreamLoader` with a pipeline preset. The loader handles:
- Building an optimized cache for fast I/O
- Batch decoding with random/center crop
- GPU transfers and normalization

In [None]:
from slipstream import SlipstreamDataset, SlipstreamLoader
from slipstream.pipelines import supervised_train, supervised_val

# Training loader with random resized crop + augmentations
# (this demo reuses the validation split as the data source)
dataset = SlipstreamDataset(remote_dir=LITDATA_VAL_PATH, decode_images=False)

train_loader = SlipstreamLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    pipelines=supervised_train(size=224, seed=42, device="cpu"),
)

batch = next(iter(train_loader))
print(f"Batch keys: {list(batch.keys())}")
print(f"Image shape: {batch['image'].shape}")  # [B, C, H, W]
print(f"Image dtype: {batch['image'].dtype}")
print(f"Labels: {batch['label'][:8].tolist()}")
train_loader.shutdown()
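
The training preset is described as a random resized crop plus augmentations. The sketch below shows the widely used torchvision-style sampling of the crop box. The function name `sample_rrc_box` and the default `scale=(0.08, 1.0)` and `ratio=(3/4, 4/3)` ranges are assumptions; slipstream's decoder may sample differently.

```python
import math
import random


def sample_rrc_box(height, width, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                   rng=None, attempts=10):
    """Sample (top, left, crop_h, crop_w) for a random resized crop."""
    rng = rng or random.Random()
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))  # log-uniform aspect ratio
        crop_w = int(round(math.sqrt(target_area * aspect)))
        crop_h = int(round(math.sqrt(target_area / aspect)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            top = rng.randint(0, height - crop_h)
            left = rng.randint(0, width - crop_w)
            return top, left, crop_h, crop_w
    # Fallback: largest centered crop whose aspect ratio stays within bounds
    in_ratio = width / height
    if in_ratio < ratio[0]:
        crop_w, crop_h = width, int(round(width / ratio[0]))
    elif in_ratio > ratio[1]:
        crop_h, crop_w = height, int(round(height * ratio[1]))
    else:
        crop_w, crop_h = width, height
    return (height - crop_h) // 2, (width - crop_w) // 2, crop_h, crop_w


rng = random.Random(42)  # a fixed seed makes the sequence of crops reproducible
boxes = [sample_rrc_box(256, 341, rng=rng) for _ in range(1000)]
```

The sampled box is then resized to the target size (224 in the cell above), so every image in a batch ends up with the same shape regardless of its crop.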

In [None]:
# Validation loader with center crop
val_loader = SlipstreamLoader(
    dataset,
    batch_size=32,
    shuffle=False,
    pipelines=supervised_val(size=224, device="cpu"),
)

batch = next(iter(val_loader))
print(f"Val batch shape: {batch['image'].shape}")
val_loader.shutdown()
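
For validation, a center crop is deterministic. The sketch below computes a centered square covering a fixed fraction of the short side, which is one common convention (for example, 224/256) before resizing to the target size. The helper `center_crop_box` and its `ratio` default are assumptions; this notebook does not state whether slipstream crops by ratio or by absolute size.

```python
def center_crop_box(height, width, ratio=224 / 256):
    """Return (top, left, side) of a centered square covering `ratio` of the short side."""
    side = max(1, int(round(ratio * min(height, width))))
    top = (height - side) // 2
    left = (width - side) // 2
    return top, left, side


box = center_crop_box(256, 341)
print(box)  # (16, 58, 224)
```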

## 4. Custom Pipelines with Decoders

For more control, use decoder stages directly. Decoder stage names all start with `Decode` (for example, `DecodeCenterCrop`).

In [None]:
from slipstream import SlipstreamLoader, DecodeCenterCrop, DecodeRandomResizedCrop

# Custom pipeline: just decode + crop (no normalization)
loader_custom = SlipstreamLoader(
    dataset,
    batch_size=16,
    shuffle=False,
    pipelines={'image': [DecodeCenterCrop(224)]},
    exclude_fields=['path'],
)

batch = next(iter(loader_custom))
print(f"Image shape: {batch['image'].shape}")
print(f"Image dtype: {batch['image'].dtype}")  # uint8 (no normalization)
loader_custom.shutdown()
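
Conceptually, a `pipelines` dictionary maps each field name to an ordered list of stages, and `exclude_fields` drops fields from the batch. The sketch below mimics that behavior on plain Python data. `apply_pipelines` and the toy stages are hypothetical; the real loader's internals are not shown in this notebook.

```python
def apply_pipelines(batch, pipelines, exclude_fields=()):
    """Run each field through its stage list; pass other fields through unchanged."""
    out = {}
    for name, value in batch.items():
        if name in exclude_fields:
            continue  # excluded fields are dropped from the output
        for stage in pipelines.get(name, []):
            value = stage(value)  # stages run in list order
        out[name] = value
    return out


def to_lengths(items):
    """Toy stand-in for a decode stage: bytes -> number."""
    return [len(b) for b in items]


def halve(values):
    """Toy stand-in for a normalization stage."""
    return [v / 2 for v in values]


batch = {"image": [b"abcd", b"ab"], "label": [1, 0], "path": ["a.jpg", "b.jpg"]}
result = apply_pipelines(batch, {"image": [to_lengths, halve]}, exclude_fields=["path"])
print(result)  # {'image': [2.0, 1.0], 'label': [1, 0]}
```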

In [None]:
# Add normalization for training
from slipstream import Normalize, ToTorchImage
from slipstream.transforms import IMAGENET_MEAN, IMAGENET_STD

loader_norm = SlipstreamLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    pipelines={'image': [
        DecodeRandomResizedCrop(224),
        ToTorchImage(device='cpu', dtype='float32'),
        Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
    ]},
    exclude_fields=['path'],
)

batch = next(iter(loader_norm))
print(f"Image shape: {batch['image'].shape}")
print(f"Image dtype: {batch['image'].dtype}")
print(f"Image range: [{batch['image'].min():.3f}, {batch['image'].max():.3f}]")
loader_norm.shutdown()
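
The printed image range can be sanity-checked by hand. The sketch below assumes the commonly used ImageNet statistics on the [0, 1] scale; the values stored in slipstream's `IMAGENET_MEAN` and `IMAGENET_STD` may be on a different scale (for example, 0-255), so treat these numbers as illustrative.

```python
# Commonly used ImageNet channel statistics on the [0, 1] scale (assumed here)
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def normalize_pixel(value_uint8, channel):
    """Scale a uint8 value to [0, 1], then standardize with the channel statistics."""
    return (value_uint8 / 255.0 - MEAN[channel]) / STD[channel]


lows = [normalize_pixel(0, c) for c in range(3)]
highs = [normalize_pixel(255, c) for c in range(3)]
print(f"min possible: {min(lows):.3f}")   # -2.118
print(f"max possible: {max(highs):.3f}")  # 2.640
```

Under these assumptions, a normalized batch should stay within roughly [-2.118, 2.640]; values far outside that interval would suggest a scale mismatch between the image tensor and the statistics.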

## 5. FFCV File Format

`SlipstreamDataset` auto-detects `.ffcv`/`.beton` files and uses `FFCVFileReader` internally.
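
Path-based detection of this kind can be sketched in a few lines. Only the `.ffcv`/`.beton` rule comes from the text above. The `guess_format` helper, the `listing` argument, and the rule that a LitData directory is recognized by an `index.json` file are assumptions for illustration, not slipstream's actual logic.

```python
import posixpath


def guess_format(path, listing=()):
    """Guess a dataset format from its path and, for directories, its file names."""
    suffix = posixpath.splitext(path.rstrip("/"))[1].lower()
    if suffix in (".ffcv", ".beton"):
        return "ffcv"
    if "index.json" in listing:  # assumed marker of a LitData streaming directory
        return "litdata"
    return "imagefolder"  # assumed fallback: class-named subdirectories


print(guess_format("s3://bucket/val.ffcv"))
print(guess_format("s3://bucket/val/", ["index.json", "chunk-0-0.bin"]))
print(guess_format("/data/imagenet/val", ["n01440764"]))
```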

In [None]:
from slipstream import SlipstreamDataset, decode_image

# Just pass the FFCV path - auto-detected
dataset_ffcv = SlipstreamDataset(FFCV_VAL_PATH)
print(f"Reader type: {type(dataset_ffcv._reader).__name__}")
dataset_ffcv

In [None]:
sample = dataset_ffcv[0]
print(f"Sample keys: {list(sample.keys())}")
print(f"Label: {sample['label']}")
print(f"Path: {sample['path']}")

In [None]:
# Decode and display
decode_image(sample['image'], to_pil=True)

In [None]:
# Use with SlipstreamLoader (builds optimized cache on first use)
from slipstream import SlipstreamLoader, DecodeRandomResizedCrop

loader_ffcv = SlipstreamLoader(
    dataset_ffcv,
    batch_size=256,
    pipelines={'image': [DecodeRandomResizedCrop(224)]},
    exclude_fields=['path'],
)

loader_ffcv

In [None]:
batch = next(iter(loader_ffcv))
print(f"Batch keys: {list(batch.keys())}")
print(f"Image shape: {batch['image'].shape}")
print(f"Labels: {batch['label'][:5].tolist()}")
loader_ffcv.shutdown()
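
Each cell above calls `shutdown()` after pulling a batch. If iteration raises, that call is skipped. A small context manager can guarantee the cleanup. `shutting_down` and `DummyLoader` below are hypothetical helpers, not part of slipstream, and this notebook does not show whether `SlipstreamLoader` supports `with` directly.

```python
from contextlib import contextmanager


@contextmanager
def shutting_down(loader):
    """Yield the loader and call its shutdown() on exit, even after an exception."""
    try:
        yield loader
    finally:
        loader.shutdown()


class DummyLoader:
    """Stand-in for a loader that owns background resources (hypothetical)."""

    def __init__(self):
        self.closed = False

    def __iter__(self):
        return iter([{"label": [0, 1]}])

    def shutdown(self):
        self.closed = True


dummy = DummyLoader()
with shutting_down(dummy) as loader:
    first = next(iter(loader))
print(first, dummy.closed)  # {'label': [0, 1]} True
```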

## Summary

**SlipstreamDataset** provides:
- **Auto-detection**: Detects LitData, FFCV, or ImageFolder from the path
- **Composition pattern**: Wraps readers (StreamingReader, FFCVFileReader, SlipstreamImageFolder)
- **Flexible decoding**: `decode_images=True` for exploration, `False` for training

**SlipstreamLoader** provides:
- **Optimized cache**: Builds a slip cache for fast batch I/O
- **Pipeline presets**: `supervised_train()`, `supervised_val()`, `simclr()`, `lejepa()`, etc.
- **Custom pipelines**: Combine `DecodeCenterCrop`, `DecodeRandomResizedCrop`, `Normalize`, etc.

**Key patterns**:
- `decode_images=True` for interactive exploration (PIL/tensor output)
- `decode_images=False` + `SlipstreamLoader(pipelines=...)` for training
- Decoder stages use `Decode` prefix: `DecodeCenterCrop`, `DecodeRandomResizedCrop`

**Next**: See `02_field_indexes.ipynb` for class-based subsetting.